SERIES ON SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING
Series Editor-in-Chief: S. K. Chang (University of Pittsburgh, USA)
Vol. 1
Knowledge-Based Software Development for Real-Time Distributed Systems Jeffrey J. -P. Tsai and Thomas J. Weigert (Univ. Illinois at Chicago)
Vol. 2
Advances in Software Engineering and Knowledge Engineering edited by Vincenzo Ambriola (Univ. Pisa) and Genoveffa Tortora (Univ. Salerno)
Vol. 3
The Impact of CASE Technology on Software Processes edited by Daniel E. Cooke (Univ. Texas)
Vol. 4
Software Engineering and Knowledge Engineering: Trends for the Next Decade edited by W. D. Hurley (Univ. Pittsburgh)
Vol. 5
Intelligent Image Database Systems edited by S. K. Chang (Univ. Pittsburgh), E. Jungert (Swedish Defence Res. Establishment) and G. Tortora (Univ. Salerno)
Vol. 6
Object-Oriented Software: Design and Maintenance edited by Luiz F. Capretz and Miriam A. M. Capretz (Univ. Aizu, Japan)
Vol. 7
Software Visualisation edited by P. Eades (Univ. Newcastle) and K. Zhang (Macquarie Univ.)
Vol. 8
Image Databases and Multi-Media Search edited by Arnold W. M. Smeulders (Univ. Amsterdam) and Ramesh Jain (Univ. California)
Vol. 9
Advances in Distributed Multimedia Systems edited by S. K. Chang, T. F. Znati (Univ. Pittsburgh) and S. T. Vuong (Univ. British Columbia)
Vol. 10
Hybrid Parallel Execution Model for Logic-Based Specification Languages Jeffrey J.-P. Tsai and Bing Li (Univ. Illinois at Chicago)
Vol. 11
Graph Drawing and Applications for Software and Knowledge Engineers Kozo Sugiyama (Japan Adv. Inst. Science and Technology)
Vol. 12
Lecture Notes on Empirical Software Engineering edited by N. Juristo and A. M. Moreno (Universidad Politécnica de Madrid, Spain)
Vol. 13
Data Structures and Algorithms edited by S. K. Chang (Univ. Pittsburgh, USA)
Vol. 14
Acquisition of Software Engineering Knowledge. SWEEP: An Automatic Programming System Based on Genetic Programming and Cultural Algorithms edited by George S. Cowan and Robert G. Reynolds (Wayne State Univ.)
Vol. 15
Image: E-Learning, Understanding, Information Retrieval and Medical. Proceedings of the First International Workshop edited by S. Vitulano (Universita di Cagliari, Italy)
Vol. 16
Machine Learning Applications in Software Engineering edited by Du Zhang (California State Univ.) and Jeffrey J.-P. Tsai (Univ. Illinois at Chicago)
Machine Learning Applications in Software Engineering

editors

Du Zhang (California State University, USA)
Jeffrey J. P. Tsai (University of Illinois, Chicago, USA)

World Scientific
New Jersey • London • Singapore • Beijing • Shanghai • Hong Kong • Taipei • Chennai
Published by World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
The author and publisher would like to thank the following publishers of the various journals and books for their assistance and permission to include the selected reprints found in this volume: IEEE Computer Society (Trans. on Software Engineering, Trans. on Reliability); Elsevier Science Publishers (Information and Software Technology); Kluwer Academic Publishers (Annals of Software Engineering, Automated Software Engineering, Machine Learning)
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-256-094-7
Cover photo: Meiliu Lu
Printed in Singapore by World Scientific Printers (S) Pte Ltd
DEDICATIONS

DZ: To Jocelyn, Bryan, and Mei
JT: To Jean, Ed, and Christina
ACKNOWLEDGMENT

The authors acknowledge the contribution of Meiliu Lu for the cover photo and the support from the National Science Council under Grant NSC 92-2213-E-468-001, R.O.C. We also thank Kim Tan, Tjan Kwang Wei, and other staff at World Scientific for helping with the preparation of the book.
TABLE OF CONTENTS

Chapter 1  Introduction to Machine Learning and Software Engineering  1
  1.1  The Challenge  1
  1.2  Overview of Machine Learning  3
  1.3  Learning Approaches  9
  1.4  SE Tasks for ML Applications  13
  1.5  State-of-the-Practice in ML&SE  15
  1.6  Status  25
  1.7  Applying ML Algorithms to SE Tasks  35
  1.8  Organization of the Book  36

Chapter 2  ML Applications in Prediction and Estimation  37
  2.1  Bayesian Analysis of Empirical Software Engineering Cost Models (with S. Chulani, B. Boehm and B. Steece), IEEE Transactions on Software Engineering, Vol. 25, No. 4, July 1999, pp. 573-583.  41
  2.2  Machine Learning Approaches to Estimating Software Development Effort (with K. Srinivasan and D. Fisher), IEEE Transactions on Software Engineering, Vol. 21, No. 2, February 1995, pp. 126-137.  52
  2.3  Estimating Software Project Effort Using Analogies (with M. Shepperd and C. Schofield), IEEE Transactions on Software Engineering, Vol. 23, No. 12, November 1997, pp. 736-743.  64
  2.4  A Critique of Software Defect Prediction Models (with N.E. Fenton and M. Neil), IEEE Transactions on Software Engineering, Vol. 25, No. 5, September 1999, pp. 675-689.  72
  2.5  Using Regression Trees to Classify Fault-Prone Software Modules (with T.M. Khoshgoftaar, E.B. Allen and J. Deng), IEEE Transactions on Reliability, Vol. 51, No. 4, 2002, pp. 455-462.  87
  2.6  Can Genetic Programming Improve Software Effort Estimation? A Comparative Evaluation (with C.J. Burgess and M. Lefley), Information and Software Technology, Vol. 43, No. 14, 2001, pp. 863-873.  95
  2.7  Optimal Software Release Scheduling Based on Artificial Neural Networks (with T. Dohi, Y. Nishio, and S. Osaki), Annals of Software Engineering, Vol. 8, No. 1, 1999, pp. 167-185.  106

Chapter 3  ML Applications in Property and Model Discovery  125
  3.1  Identifying Objects in Procedural Programs Using Clustering Neural Networks (with S.K. Abd-El-Hafiz), Automated Software Engineering, Vol. 7, No. 3, 2000, pp. 239-261.  127
  3.2  Bayesian-Learning Based Guidelines to Determine Equivalent Mutants (with A.M.R. Vincenzi, et al.), International Journal of Software Engineering and Knowledge Engineering, Vol. 12, No. 6, 2002, pp. 675-689.  150

Chapter 4  ML Applications in Transformation  165
  4.1  Using Neural Networks to Modularize Software (with R. Schwanke and S.J. Hanson), Machine Learning, Vol. 15, No. 2, 1994, pp. 137-168.  167

Chapter 5  ML Applications in Generation and Synthesis  199
  5.1  Generating Software Test Data by Evolution (with C.C. Michael, G. McGraw and M.A. Schatz), IEEE Transactions on Software Engineering, Vol. 27, No. 12, December 2001, pp. 1085-1110.  201

Chapter 6  ML Applications in Reuse  227
  6.1  On the Reuse of Software: A Case-Based Approach Employing a Repository (with P. Katalagarianos and Y. Vassiliou), Automated Software Engineering, Vol. 2, No. 1, 1995, pp. 55-86.  229

Chapter 7  ML Applications in Requirement Acquisition  261
  7.1  Inductive Specification Recovery: Understanding Software by Learning From Example Behaviors (with W.W. Cohen), Automated Software Engineering, Vol. 2, No. 2, 1995, pp. 107-129.  263
  7.2  Explanation-Based Scenario Generation for Reactive System Models (with R.J. Hall), Automated Software Engineering, Vol. 7, 2000, pp. 157-177.  286

Chapter 8  ML Applications in Management of Development Knowledge  307
  8.1  Case-Based Knowledge Management Tools for Software Development (with S. Henninger), Automated Software Engineering, Vol. 4, No. 3, 1997, pp. 319-340.  309

Chapter 9  Guidelines and Conclusion  331

References  345
Chapter 1

Introduction to Machine Learning and Software Engineering

1.1. The Challenge

The challenge of developing and maintaining large software systems in a changing environment has been eloquently spelled out in Brooks' classic paper, No Silver Bullet [20]. The following essential difficulties inherent in developing large software still hold true today:

> Complexity: "Software entities are more complex for their size than perhaps any other human construct." "Many of the classical problems of developing software products derive from this essential complexity and its nonlinear increases with size."
> Conformity: Software must conform to the many different human institutions and systems it comes to interface with.
> Changeability: "The software product is embedded in a cultural matrix of applications, users, laws, and machine vehicles. These all change continually, and their changes inexorably force change upon the software product."
> Invisibility: "The reality of software is not inherently embedded in space." "As soon as we attempt to diagram software structure, we find it to constitute not one, but several, general directed graphs, superimposed one upon another." [20]

However, in his "No Silver Bullet" Refired paper [21], Brooks uses the following quote from Glass to summarize his view in 1995:

So what, in retrospect, have Parnas and Brooks said to us? That software development is a conceptually tough business. That magic solutions are not just around the corner. That it is time for the practitioner to examine evolutionary improvements rather than to wait, or hope, for revolutionary ones [56].

Many evolutionary or incremental improvements have been made or proposed, each attempting to address certain aspects of the essential difficulties [13, 47, 57, 96, 110]. For instance, to address changeability and conformity, an approach called transformational programming allows software to be developed, modified, and maintained at the specification level, and then automatically transformed into production-quality software through automatic program synthesis [57]. This software development paradigm will enable software engineering to become the discipline of capturing and automating currently undocumented domain and design knowledge [96]. Software engineers will deliver knowledge-based application generators rather than unmodifiable application programs. A system called LaSSIE was developed to address the complexity and invisibility issues [36]. The multi-view modeling framework proposed in [22] can be considered an attempt to address the invisibility issue.

The application of artificial intelligence techniques to software engineering (AI&SE) has produced some encouraging results [11, 94, 96, 108, 112, 122, 138, 139, 145]. Some of the successful AI techniques include: knowledge-based approaches, automated reasoning, expert systems, heuristic search strategies, temporal logic, planning, and pattern recognition. AI techniques can play an important role in ultimately overcoming the essential difficulties. As a subfield of AI, machine learning (ML) deals with the issue of how to build computer programs that improve their performance at some task through experience [105]. It is dedicated to creating and compiling verifiable knowledge related to the design and construction of artifacts [116]. ML
algorithms have been utilized in many different problem domains. Some typical applications are: data mining problems where large databases contain valuable implicit regularities that can be discovered automatically, poorly understood domains where there is a lack of knowledge needed to develop effective algorithms, and domains where programs must dynamically adapt to changing conditions [105]. Not surprisingly, the field of software engineering turns out to be a fertile ground where many software development and maintenance tasks can be formulated as learning problems and approached in terms of learning algorithms. The past two decades have witnessed many ML applications in software development and maintenance. ML algorithms offer a viable alternative and complement to existing approaches to many SE issues. In his keynote speech at the 1992 annual conference of the American Association for Artificial Intelligence, Selfridge advocated the application of ML to SE (ML&SE):

We all know that software is more updating, revising, and modifying than rigid design. Software systems must be built for change; our dream of a perfect, consistent, provably correct set of specifications will always be a nightmare, and impossible too. We must therefore begin to describe change, to write software so that (1) changes are easy to make, (2) their effects are easy to measure and compare, and (3) the local changes contribute to overall improvements in the software. For systems of the future, we need to think in terms of shifting the burden of evolution from the programmers to the systems themselves... [we need to] explore what it might mean to build systems that can take some responsibility for their own evolution [130].

Though many results in ML&SE have been published in the past two decades, efforts to summarize the state-of-the-practice and to discuss issues and guidelines in applying ML to SE have been few and far between [147-149]. A recent paper [100] focuses its attention on applying decision tree based learning methods to SE issues. Another survey is offered from the perspective of data mining techniques applied to software processes and products [99]. The AI&SE summaries published so far paint with too broad a brush to give an adequate account of ML&SE. There is also a related and emerging area of research under the umbrella of computational intelligence in software engineering (CI&SE) [80, 81, 91, 92, 113]. Research in this area utilizes fuzzy sets, neural networks, genetic algorithms, genetic programming and rough sets (or combinations of those individual technologies) to tackle software development issues. ML&SE and CI&SE have two things in common: the targeted software development problems, and some of the techniques. However, ML offers many additional mature techniques and approaches that can be brought to bear on solving SE problems.

The scope of this book, as depicted in the shaded area in Figure 1, is to attempt to fill this void by studying various issues pertaining to ML&SE (the applications of other AI techniques in SE are beyond the scope of this book). We think this is an important and helpful step if we want to make any headway in ML&SE. In this book, we address various issues in ML&SE by trying to answer the following questions:

> What types of learning methods are available at our disposal?
> What are the characteristics and underpinnings of different learning algorithms?
> How do we determine which learning method is appropriate for what type of software development or maintenance task?
> Which learning methods can be used to make headway on what aspects of the essential difficulties in software development?
> When we attempt to use some learning method to help with an SE task, what are the general guidelines and how can we avoid common pitfalls?
> What is the state-of-the-practice in ML&SE?
> Where is further effort needed to produce fruitful results?
Figure 1. Scope of this book.

1.2. Overview of Machine Learning
The field of ML includes supervised learning, unsupervised learning and reinforcement learning. Supervised learning deals with learning a target function from training examples of its inputs and outputs. Unsupervised learning attempts to learn patterns in the input for which no output values are available. Reinforcement learning is concerned with learning a control policy through reinforcement from an environment. ML algorithms have been utilized in many different problem domains. Some typical applications are: data mining problems where large databases contain valuable implicit regularities that can be discovered automatically, poorly understood domains where there is a lack of knowledge needed to develop effective algorithms, and domains where programs must dynamically adapt to changing conditions [105]. The following list of publications and web sites offers a good starting point for the interested reader to become acquainted with the state-of-the-practice in ML applications [2, 3, 9, 15, 32, 37-39, 87, 99, 100, 103, 105-107, 117-119, 127, 137].

ML is not a panacea for all SE problems. To better use ML methods as tools to solve real world SE problems, we need a clear understanding of both the problems and the tools and methodologies utilized. It is imperative that we know (1) the available ML methods at our disposal, (2) the characteristics of those methods, (3) the circumstances under which the methods can be most effectively applied, and (4) their theoretical underpinnings. Since many SE development or maintenance tasks rely on some function (or functions, mappings, or models) to predict, estimate, classify, diagnose, discover, acquire, understand,
generate, or transform certain qualitative or quantitative aspects of a software artifact or a software process, the application of ML to SE boils down to how to find, through the learning process, such a target function (or functions, mappings, or models) that can be utilized to carry out the SE tasks. Learning involves a number of components: (1) How is the unknown (target or true) function represented and specified? (2) Where can the function be found (the search space)? (3) How can we find the function (heuristics in search, learning algorithms)? (4) Is there any prior knowledge (background knowledge, domain theory) available for the learning process? (5) What properties do the training data have? And (6) what are the theoretical underpinnings and practical issues in the learning process?

1.2.1. Target functions

Depending on the learning method utilized, a target function can be represented in different hypothesis language formalisms (e.g., decision trees, conjunctions of attribute constraints, bit strings, or rules). When a target function is not explicitly defined, but the learner can generate its values for given input queries (as is the case in instance-based learning), the function is said to be implicitly defined. A learned target function may be easy for a human expert to understand and interpret (e.g., first order rules), or it may be hard or impossible for people to comprehend (e.g., weights for a neural network).

Figure 2. Issues in target functions.

Based on its output, a target function can be utilized for SE tasks that fall into the categories of binary classification, multi-value classification and regression. When learning a target function from a given set of training data, its generalization can be either eager (at the learning stage) or lazy
(at the classification stage). Eager learning may produce a single target function from the entire training data, while lazy learning may adopt a different (implicit) function for each query. Evaluating a target function hinges on many considerations: predictive accuracy, interpretability, statistical significance, information content, and the tradeoff between its complexity and degree of fit to the data. Quinlan in [119] states:

Learned models should not blindly maximize accuracy on the training data, but should balance resubstitution accuracy against generality, simplicity, interpretability, and search parsimony.

1.2.2. Hypothesis space

Candidates to be considered for a target function belong to a set called a hypothesis space H. Let f be a true function to be learned. Given a set D of examples (training data) of f, inductive learning amounts to finding a function h ∈ H that is consistent with D; h is said to approximate f. How an H is specified and what structure H has ultimately determine the outcome and efficiency of the learning. The learning becomes unrealizable [124] when f ∉ H. Since f is unknown, we may have to resort to background or prior knowledge to generate an H in which f must exist. How prior knowledge is utilized to specify an appropriate H in which the learning problem is realizable (f ∈ H) is a very important issue. There is also a tradeoff between the expressiveness of an H and the computational complexity of finding a simple and consistent h that approximates f [124]. Through some strong restrictions, the expressiveness of the hypothesis language can be reduced, thus yielding a smaller H. This in turn may lead to a more efficient learning process, but at the risk of being unrealizable.

Figure 3. Issues in hypothesis space H.
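To make the search view of inductive learning concrete, the following minimal Python sketch (ours, added for illustration; the threshold rules and metric values are invented) enumerates a small finite H and keeps the hypotheses consistent with D. An empty result would mean the learning problem is unrealizable for this H.

```python
def consistent_hypotheses(H, D):
    """Return hypotheses h in H that agree with every labeled
    example in D; an empty result means learning is unrealizable."""
    return [h for h in H if all(h(x) == y for x, y in D)]

# H: threshold rules "fault-prone iff metric > t" for a few values of t
H = [lambda x, t=t: x > t for t in (10, 20, 30, 40)]
D = [(15, False), (25, True), (35, True)]  # (metric value, label)

survivors = consistent_hypotheses(H, D)
print(len(survivors))  # 1: only the t=20 rule is consistent with all of D
```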
1.2.3. Search and bias

How can we find a simple and consistent h ∈ H that approximates f? This question essentially boils down to a search problem. Heuristics (inductive or declarative bias) play a pivotal role in the search process. Depending on how examples in D are defined, learning can be supervised or unsupervised. Different learning methods may adopt different types of bias, different search strategies, and different guiding factors in the search. For an f, its approximation can be obtained either locally, with regard to a subset of examples in D, or globally, with regard to all examples in D. Learning can result in either knowledge augmentation or knowledge (re)compilation. Depending on the interaction between a learner and its environment, there are query learning and reinforcement learning.

There are stable and unstable learning algorithms, depending on their sensitivity to changes in the training data [37]. For unstable algorithms (e.g., decision tree, neural network, or rule-learning algorithms), small changes in the training data will result in the algorithms generating significantly different output functions. On the other hand, stable algorithms (e.g., linear regression and nearest-neighbor) are immune to (do not easily succumb to) small changes in the data [37].

Instead of using just a single learned hypothesis for the classification of unseen cases, an ensemble of hypotheses can be deployed whose individual decisions are combined to accomplish the task of new case classification. There are two major issues in ensemble learning: how to construct ensembles, and how to combine the individual decisions of the hypotheses in an ensemble [37].

Figure 4. Issues in search of hypothesis.
Another issue during the search process is the need for interaction with an oracle. If a learner needs an oracle to ascertain the validity of the target function generalization, the search is interactive; otherwise, it is non-interactive [89]. The search is flexible if it can start either from scratch or from an initial hypothesis.

1.2.4. Prior knowledge

Prior (or background) knowledge about the problem domain in which an f is to be learned plays a key role in many learning methods. Prior knowledge can be represented in different ways. It helps learning by eliminating otherwise consistent h and by filling in the explanation of examples, which results in faster learning from fewer examples. It also helps define different learning techniques based on its logical roles, and helps identify relevant attributes, thus yielding a reduced H and speeding up the learning process [124]. There are two issues here. First, for some problem domains, the prior knowledge may be sketchy, inaccurate or simply not available. Second, not all learning approaches are able to accommodate such prior knowledge or domain theories. A common drawback of some general learning algorithms such as decision trees or neural networks is that it is difficult to incorporate prior knowledge from problem domains into the learning algorithms [37]. A major motivation and advantage of stochastic learning (e.g., naive Bayesian learning) and inductive logic programming is their ability to utilize background knowledge from problem domains in the learning algorithm. For those learning methods for which prior knowledge or a domain theory is indispensable, one issue to keep in mind is that the quality of the knowledge (correctness, completeness) has a direct impact on the outcome of the learning.

Figure 5. Issues in prior knowledge.
1.2.5. Training data

Training data gathered for the learning process can vary in terms of (1) the number of examples, (2) the number of features (attributes), and (3) the number of output classes. Data can be noisy or accurate in terms of random errors, can be redundant, can be of different types, and can have different valuations. The quality and quantity of training data have a direct impact on the learning process, as different learning methods have different criteria regarding training data, with some methods requiring large amounts of data, others being very sensitive to the quality of data, and still others needing both training data and a domain theory. Training data may be used just once, or multiple times, by the learner.

Scaling up is another issue. Real world problems can have millions of training cases, thousands of features and hundreds of classes [37]. Not all learning algorithms are known to scale up well with large problems along those three dimensions. When a target function is not easily learned from the data in the input space, a need arises to transform the data into a possibly high-dimensional feature space F and learn the target function in F. Feature selection in F becomes an important issue, as both the computational cost and the generalization performance of the target function can degrade as the number of features grows [32]. Finally, based on the way in which training data are generated and provided to a learner, there are batch learning (all data are available at the outset of learning) and on-line learning (data are available to the learner one example at a time).

Figure 6. Issues in training data.
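The batch/on-line distinction can be made concrete with a minimal sketch (ours; the fault-rate figures are invented). Both learners estimate the same quantity, but the on-line learner folds in one example at a time and never needs to store the data set:

```python
def batch_mean(data):
    """Batch learning: the whole training set is available at once."""
    return sum(data) / len(data)

class OnlineMean:
    """On-line learning: examples arrive one at a time; the current
    estimate is updated incrementally and old examples can be discarded."""
    def __init__(self):
        self.n, self.estimate = 0, 0.0
    def update(self, x):
        self.n += 1
        self.estimate += (x - self.estimate) / self.n
        return self.estimate

fault_rates = [0.10, 0.30, 0.20, 0.40]
learner = OnlineMean()
for x in fault_rates:          # could equally be an unbounded stream
    learner.update(x)
assert abs(learner.estimate - batch_mean(fault_rates)) < 1e-12
```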
1.2.6. Theoretical underpinnings and practical considerations

Underpinning the various learning methods are different justifications: statistical, probabilistic, or logical. What are the frameworks for analyzing learning algorithms? How can we evaluate the performance of a generated function, and determine convergence? What types of practical problems do we have to come to grips with? These questions must be answered if we are to succeed in real world SE applications.
Figure 7. Theoretical and practical issues.

1.3. Learning Approaches

There are many different types of learning methods, each having its own characteristics and lending itself to certain learning problems. In this book, we organize the major types of supervised and reinforcement learning methods into the following groups: concept learning (CL), decision tree learning (DT), neural networks (NN), Bayesian learning (BL), reinforcement learning (RL), genetic algorithms (GA) and genetic programming (GP), instance-based learning (IBL, of which case-based reasoning, or CBR, is a popular method), inductive logic programming (ILP), analytical learning (AL, of which explanation-based learning, or EBL, is a method), combined inductive and analytical learning (IAL), ensemble learning (EL) and support vector machines (SVM). The organization of the different learning methods is largely influenced by [105]. In some of the literature [37, 124], stochastic (statistical) learning is used to refer to learning methods such as BL.
1.3.1. Concept learning

In CL, a target function is represented as a conjunction of constraints on attributes. The hypothesis space H consists of a lattice of possible conjunctions of attribute constraints for a given problem domain. A least-commitment search strategy is adopted to eliminate hypotheses in H that are not consistent with the training set D. This results in a structure called the version space, the subset of hypotheses that are consistent with the training data. The algorithm, called candidate elimination, utilizes generalization and specialization operations to produce the version space with regard to H and D. It relies on a language (or restriction) bias which states that the target function is contained in H. CL is an eager and supervised learning method. It is not robust to noise in the data and offers no support for accommodating prior knowledge.

1.3.2. Decision trees

In DT, a target function is defined as a decision tree. Search in DT is often guided by an entropy-based information gain measure that indicates how much information a test on an attribute yields. Learning algorithms in DT often have a bias for small trees. DT is an eager, supervised, and unstable learning method, and is susceptible to noisy data, a cause of overfitting. It cannot accommodate prior knowledge during the learning process. However, it scales up well to large data sets in several different ways [37]. A popular DT tool is C4.5 [118].
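As an illustration of the entropy-based criterion, here is a minimal Python sketch (ours, not taken from C4.5 or any referenced system; the attribute and label values are hypothetical) that computes the information gain of splitting a labeled sample on one attribute:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(examples, attribute):
    """Entropy reduction obtained by splitting `examples`
    (a list of dicts with a 'label' key) on `attribute`."""
    base = entropy([e["label"] for e in examples])
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e["label"] for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return base - remainder

# Toy module data: does high coupling predict fault-proneness?
data = [
    {"coupling": "high", "label": "fault-prone"},
    {"coupling": "high", "label": "fault-prone"},
    {"coupling": "low",  "label": "not-fault-prone"},
    {"coupling": "low",  "label": "fault-prone"},
]
print(information_gain(data, "coupling"))  # about 0.31 bits
```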
1.3.3. Neural networks

Given a fixed network structure, learning a target function in NN amounts to finding weights for the network such that the network outputs are the same as (or within an acceptable range of) the expected outcomes specified in the training data. A vector of weights in essence defines a target function, which makes the target function very difficult for humans to read and interpret. NN is an eager, supervised, and unstable learning approach and cannot accommodate prior knowledge. A popular algorithm for feed-forward networks is Backpropagation, which adopts a gradient descent search and sanctions an inductive bias of smooth interpolation between data points [105].
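The gradient descent search can be illustrated on the simplest possible network, a single sigmoid unit. The sketch below is ours and is not full Backpropagation (which additionally propagates error terms backward through hidden layers); the metric data are invented:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_unit(data, epochs=1000, lr=0.5):
    """Gradient descent on squared error for one sigmoid unit.
    `data` is a list of (inputs, target) pairs with target in [0, 1]."""
    n = len(data[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]  # w[-1] is the bias
    for _ in range(epochs):
        for x, t in data:
            o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1])
            delta = (o - t) * o * (1 - o)      # derivative of squared error
            for i in range(n):
                w[i] -= lr * delta * x[i]      # descend along the gradient
            w[-1] -= lr * delta                # bias update
    return w

# Hypothetical metric vectors (size, complexity) -> fault-proneness
data = [([0.9, 0.8], 1.0), ([0.2, 0.1], 0.0),
        ([0.8, 0.9], 1.0), ([0.1, 0.3], 0.0)]
w = train_unit(data)  # the learned weight vector IS the target function
```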
1.3.4. Bayesian learning

BL offers a probabilistic approach to inference, based on the assumption that the quantities of interest are governed by probability distributions, and that optimal decisions or classifications can be reached by reasoning about these probabilities together with observed data [105]. BL methods can be divided into two groups based on the outcome of the learner: those that produce the most probable hypothesis given the training data, and those that produce the most probable classification of a new instance given the training data. A target function is thus explicitly represented in the first group, but implicitly defined in the second. One of the main advantages of BL is that it accommodates prior knowledge (in the form of Bayesian belief networks, prior probabilities for candidate hypotheses, or a probability distribution over observed data for a possible hypothesis). The classification of an unseen case is obtained through the combined predictions of multiple hypotheses. BL also scales up well to large data sets. It is an eager and supervised learning method and requires no search during the learning process. Though it has no problem with noisy data, BL has difficulty with small data sets. BL adopts a bias based on the minimum description length principle, which prefers a hypothesis h that minimizes the description length of h plus the description length of the data given h [105]. There are several popular algorithms: MAP (maximum a posteriori), Bayes optimal classifier, naive Bayes classifier, Gibbs, and EM [37, 105].
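A minimal naive Bayes classifier, the second (implicit) style of BL above, can be sketched as follows (ours; the project records and the simple add-one smoothing scheme are illustrative assumptions):

```python
from collections import Counter, defaultdict

def train_nb(examples):
    """Estimate class counts and per-class attribute value counts from
    `examples`, a list of (attribute_dict, class_label) pairs."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)   # (attribute, label) -> value counts
    for attrs, label in examples:
        for a, v in attrs.items():
            value_counts[(a, label)][v] += 1
    return class_counts, value_counts, len(examples)

def classify_nb(attrs, model):
    """Return the class maximizing P(class) * prod P(value | class),
    with crude add-one smoothing over the values seen per class."""
    class_counts, value_counts, n = model
    best, best_score = None, -1.0
    for label, c in class_counts.items():
        score = c / n
        for a, v in attrs.items():
            counts = value_counts[(a, label)]
            score *= (counts[v] + 1) / (c + len(counts) + 1)
        if score > best_score:
            best, best_score = label, score
    return best

# Hypothetical project records: attributes -> effort class
examples = [({"team": "small", "reuse": "high"}, "low-effort"),
            ({"team": "large", "reuse": "low"},  "high-effort"),
            ({"team": "small", "reuse": "low"},  "low-effort"),
            ({"team": "large", "reuse": "high"}, "high-effort")]
model = train_nb(examples)
print(classify_nb({"team": "small", "reuse": "high"}, model))  # low-effort
```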
1.3.5. Genetic algorithms and genetic programming

GA and GP are both biologically inspired learning methods. A target function is represented as a bit string in GA, or as a program in GP. The search process starts with a population of initial hypotheses. Through the crossover and mutation operations, members of the current population give rise to the next generation of the population. During each step of the iteration, hypotheses in the current population are evaluated with regard to a given measure of fitness, with the fittest hypotheses being selected as members of the next generation. The search process terminates when some hypothesis h has a fitness value above a given threshold. Thus, the learning process is essentially embodied in a generate-and-test beam search [105]. The bias is fitness-driven. There are generational and steady-state algorithms.

1.3.6. Instance-based learning

IBL is a typical lazy learning approach in the sense that generalizing beyond the training data is deferred until an unseen case needs to be classified. In addition, a target function is not explicitly defined; instead, the learner returns a target function value when classifying a given unseen case. The target function value is generated based on a subset of the training data that is considered to be local to the unseen example, rather than on the entire training data. This amounts to approximating a different target function for each distinct unseen example, a significant departure from the eager learning methods, where a single target function is obtained as a result of the learner generalizing from the entire training data. The search process is based on statistical reasoning, and consists in identifying training data that are close to the given unseen case and producing the target function value based on its neighbors. Popular algorithms include: k-nearest neighbors, CBR and locally weighted regression.
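A minimal k-nearest-neighbor sketch (ours; the module metrics are invented) shows the lazy style: no function is built at training time, and each query is answered from its local neighborhood:

```python
import math
from collections import Counter

def knn_classify(query, training_data, k=3):
    """Classify `query` by majority vote among its k nearest neighbors;
    `training_data` is a list of (feature_vector, label) pairs."""
    neighbors = sorted(training_data,
                       key=lambda pair: math.dist(query, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical module metrics (normalized size, normalized complexity)
training_data = [
    ((0.9, 0.7), "fault-prone"), ((0.8, 0.9), "fault-prone"),
    ((0.2, 0.1), "ok"), ((0.3, 0.2), "ok"), ((0.1, 0.3), "ok"),
]
print(knn_classify((0.85, 0.8), training_data))  # fault-prone
```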
1.3.7. Inductive logic programming

Because a target function in ILP is defined by a set of (propositional or first-order) rules, it is highly amenable to human readability and interpretability. ILP lends itself to the incorporation of background knowledge during the learning process, and is an eager and supervised learning method. The bias sanctioned by ILP includes rule accuracy, FOIL-gain, or a preference for shorter clauses. There are a number of algorithms: SCA, FOIL, PROGOL, and inverted resolution.

1.3.8. Analytical learning

AL allows a target function, represented in terms of Horn clauses, to be generalized from scarce data. However, it is indispensable that the training data D be augmented with a domain theory B (prior knowledge about the problem domain). The learned h is consistent with both D and B, and good for human readability and interpretability. AL is an eager and supervised learning method, and search is performed in the form of deductive reasoning. The search bias in EBL, a major AL method, is B plus a preference for a small set of Horn clauses (for the learned h). One important perspective on EBL is that learning can be construed as recompiling or reformulating the knowledge in B so as to make it operationally more efficient when classifying unseen cases. EBL algorithms include Prolog-EBG.

1.3.9. Inductive and analytical learning

Both inductive learning and analytical (deductive) learning have their pros and cons. The former requires plentiful data (and is thus vulnerable to data quality and quantity problems), while the latter relies on a domain theory (and is hence susceptible to domain theory quality and quantity problems). IAL is meant to provide a framework in which the benefits of both approaches can be strengthened and the impact of their drawbacks minimized. IAL usually encompasses an inductive learning component and an analytical learning component, e.g., NN+EBL (EBNN), or ILP+EBL (FOCL) [105]. It requires both D and B, and can be an eager and supervised learning method. The issues of target function representation, search, and bias are largely determined by the underlying learning components involved.

1.3.10. Reinforcement learning

RL is the most general form of learning. It tackles the issue of how to learn a sequence of actions, called a control strategy, from indirect and delayed reward information (reinforcement). It is an eager and unsupervised learning method. Its search is carried out through training episodes. Two main approaches exist for reinforcement learning: model-based and model-free approaches [39]. The best-known model-free algorithm is Q-learning, in which actions with the maximum Q value are preferred.
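A tabular Q-learning loop can be sketched in a few lines (ours; the corridor environment, the parameter values, and the epsilon-greedy exploration scheme are illustrative assumptions, not from the chapter):

```python
import random
from collections import defaultdict

def q_learning(states, actions, step, episodes=500,
               alpha=0.5, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning. `step(s, a)` returns (next_state, reward, done);
    the learned policy prefers actions with the maximum Q value."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = random.choice(states), False
        while not done:
            if random.random() < epsilon:                    # explore
                a = random.choice(actions)
            else:                                            # exploit
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = step(s, a)
            best_next = max(Q[(s2, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

# Toy corridor: states 0..4; reaching state 4 yields a delayed reward of 1.
def step(s, a):
    s2 = min(s + 1, 4) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == 4 else 0.0), s2 == 4

Q = q_learning(list(range(4)), ["left", "right"], step)
```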
1.3.11. Ensemble learning

In EL, a target function is essentially the result of combining, through weighted or unweighted voting, a set of component or base-level functions called an ensemble. An ensemble can have a better predictive accuracy than its component functions if (1) the individual functions disagree with each other, (2) the individual functions have a predictive accuracy that is slightly better than random classification (e.g., error rates below 0.5 for binary classification), and (3) the individual functions' errors are at least somewhat uncorrelated [37]. EL can be seen as a learning strategy that addresses inadequacies in the training data (insufficient information in the training data to help select a single best h ∈ H), in the search algorithm (deploying multiple hypotheses compensates for less than perfect search algorithms), and in the representation of H (a weighted combination of individual functions makes it possible to represent a true function f ∉ H). Ultimately, an ensemble is less likely to misclassify than just a single component function. Two main issues exist in EL: ensemble construction, and classification combination. There are bagging, cross-validation and boosting methods for constructing ensembles, and weighted vote and unweighted vote for combining classifications [37]. The AdaBoost algorithm is one of the best methods for constructing ensembles of decision trees [37]. There are two approaches to ensemble construction. One is to combine component functions that are homogeneous (derived using the same learning algorithm and defined in the same representation formalism, e.g., an ensemble of functions derived by DT) and weak (slightly better than random guessing). The other is to combine component functions that are heterogeneous (derived by different learning algorithms and represented in different formalisms, e.g., an ensemble of functions derived by DT, IBL, BL, and NN) and strong (each component function performs relatively well in its own right) [44].
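The arithmetic behind conditions (2) and (3) can be made concrete. If component errors were fully independent, the probability that a majority of the voters errs would fall rapidly with ensemble size; real components are correlated, so the sketch below (ours) is an idealized best case:

```python
from math import comb

def ensemble_error(n, p):
    """Probability that a majority of n independent classifiers,
    each with error rate p, are wrong (odd n, unweighted vote)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 5, 21):
    print(n, round(ensemble_error(n, 0.3), 4))
# 1 -> 0.3, 5 -> 0.1631, 21 -> 0.0264: the vote improves steadily,
# provided each component's error rate stays below 0.5
```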
1.3.12. Support vector machines

Instead of learning a non-linear target function from data in the input space directly, SVM uses a kernel function (defined in the form of inner products of training data) to transform the training data from the input space into a high dimensional feature space F first, and then learns the optimal linear separator (a hyperplane) in F. A decision function, defined based on the linear separator, can be used to classify unseen cases. Kernel functions play a pivotal role in SVM. A kernel function relies only on a subset of the training data called the support vectors.
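The form of the learned classifier can be shown without the lengthier training step. In this sketch (ours), the support vectors, coefficients and bias are assumed to have been produced already by an SVM trainer such as SMO; only the kernelized decision function is illustrated, with an RBF kernel chosen for the example:

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel: an inner product in an implicit
    high-dimensional feature space F."""
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

def svm_decide(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """Sign of f(x) = sum_i alpha_i * y_i * K(s_i, x) + b.
    Only the support vectors (alpha_i > 0) enter the sum."""
    s = sum(a * y * kernel(sv, x)
            for sv, a, y in zip(support_vectors, alphas, labels))
    return 1 if s + b >= 0 else -1

# Hypothetical trained parameters (would come from an SMO-style trainer)
svs    = [(0.9, 0.8), (0.1, 0.2)]
alphas = [0.7, 0.7]
labels = [+1, -1]
print(svm_decide((0.8, 0.9), svs, alphas, labels, b=0.0))  # +1
```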
Table 1 is a summary of the aforementioned learning methods.
1.4. SE Tasks for ML Applications
In software engineering, there are three categories of entities: processes (collections of software related activities, such as constructing a specification, detailed design, or testing), products (artifacts, deliverables, or documents that result from a process activity, such as a specification document, a design document, or a segment of code), and resources (entities required by a process activity, such as personnel, software tools, or hardware) [49]. There are internal and external attributes for entities of the aforementioned categories. Internal attributes describe an entity itself, whereas external attributes characterize the behavior of an entity (how the entity relates to its environment). SE tasks that lend themselves to ML applications include, but are certainly not limited to:

1. Predicting or estimating measurements for either internal or external attributes of processes, products, or resources.
2. Discovering either internal or external properties of processes, products, or resources.
3. Transforming products to accomplish some desirable or improved external attributes.
4. Synthesizing various products.
5. Reusing products or processes.
6. Enhancing processes (such as recovering a specification from software).
7. Managing ad hoc products (such as design and development knowledge).

In the next section, we take a look at applications that fall into these areas.
Table 1. Major learning methods.

Type | Target function representation | Target function generation | Search | Inductive bias | Sample algorithm
AL (EBL) | Horn clauses | Eager, D + B, supervised | Deductive reasoning | B + small set of Horn clauses | Prolog-EBG
BL | Probability tables, Bayesian networks | Eager, supervised, D (global), explicit or implicit | Probabilistic, no explicit search | Minimum description length | MAP, BOC, Gibbs, NBC
CL | Conjunction of attribute constraints | Eager, supervised, D (global) | Version space (VS) guided | c ∈ H | Candidate elimination
DT | Decision trees | Eager, D (global), supervised | Information gain (entropy) | Preference for small trees | ID3, C4.5, Assistant
EL | Indirectly defined through ensemble of component functions | Eager, D (global), supervised | Ensemble construction, classification combination | Determined by ensemble members | AdaBoost (for ensembles of DT)
GA/GP | Bit strings, program trees | Eager, no D, unsupervised | Hill climbing (simulated evolution) | Fitness-driven | Prototypical GA/GP algorithms
IBL | Not explicitly defined | Lazy, D (local), supervised | Statistical reasoning | Similarity to nearest neighbors | K-NN, LWR, CBR
ILP | If-then rules | Eager, supervised, D (global) | Statistical, general-to-specific | Rule accuracy, FOIL-gain, shorter clauses | SCA, FOIL, PROGOL, inverted resolution
NN | Weights for neural networks | Eager, supervised, D (global) | Gradient descent guided | Smooth interpolation between data points | Backpropagation
IAL | Determined by underlying learning methods | Eager, D + B, supervised | Determined by underlying learning methods | Determined by underlying learning methods | KBANN, EBNN, FOCL
RL | Control strategy π* | Eager, no D, unsupervised | Through training episodes | Actions with max. Q value | Q, TD
SVM | Decision function in inner product form | Eager, supervised, D (local: the support vectors) | Kernel mapping | Maximal margin separator | SMO
1.5. State-of-the-Practice in ML&SE
A number of areas in software development have already witnessed machine learning applications. In this section, we take a brief look at reported results and offer a summary of the existing work. The list of applications included in this section, though not complete and exhaustive, should serve to represent a balanced view of the current status. The trend indicates that people have realized the potential of ML techniques and begun to reap the benefits of applying them in software development and maintenance. In the organization below, we use the areas discussed in Section 1.4 as the guideline for grouping ML applications in SE tasks. Tables 2 through 8 summarize the targeted SE objectives and the ML approaches used.

1.5.1. Prediction and estimation
In this group, ML methods are used to predict or estimate measurements for either internal or external attributes of processes, products, or resources. These include: software quality, software size, software development cost, project or software effort, maintenance task effort, software resources, correction cost, software reliability, software defects, reusability, software release timing, productivity, execution times, and the testability of program modules.

1.5.1.1. Software quality prediction

GP is used in [48] to generate software quality models that take as input software metrics collected earlier in development, and predict for each module the number of faults that will be discovered later in development or during operations. These predictions then become the basis for ranking modules, enabling a manager to select as many modules from the top of the list as resources allow for reliability enhancement. A comparative study is done in [88] to evaluate several modeling techniques for predicting the quality of software components, among them an NN model. Another NN based software quality prediction work, reported in [66], is language specific: design metrics for SDL (Specification and Description Language) are first defined, and then used in building prediction models for identifying fault-prone components. In [71, 72], NN based models are used to predict faults and software quality measures.

CBR is the learning method used in the software quality prediction efforts of [45, 54, 74, 77, 78]. The focus of [45] is on comparing the performance of different CBR classifiers, resulting in a recommendation of a simple CBR classifier with Euclidean distance, z-score standardization, no weighting scheme, and selection of the single nearest neighbor for prediction. In [54], CBR is applied to software quality modeling of a family of full-scale industrial software systems, and its accuracy is considered better than that of a corresponding multiple linear regression model in predicting the number of design faults. Two practical classification rules (majority voting and data clustering) are proposed in [77] for software quality estimation of high-assurance systems. [78] discusses an attribute selection procedure that can help identify pertinent software quality metrics to be utilized in CBR-based quality prediction. In [74], a CBR approach is used to calibrate software quality classification models; data from several embedded systems are collected to validate the results.
Table 2. Measurement prediction and estimation.¹

SE Task | ML Method
Software quality (high-risk, or fault-prone component identification) | {GP [48], NN [66, 71, 72, 88], CBR [45, 54, 74, 77, 78], DT [18, 75, 76, 115, 121], GP+DT [79], ILP [30], CL [35]}

¹ An explanation of the notation: {...} indicates that multiple ML methods are each independently applied to the same SE task, and "...+..." indicates that multiple ML methods are collectively applied to an SE task. This applies to Tables 2 through 8.
In [115], a DT based approach is used to generate measurement-based models of high-risk components. The proposed method relies on historical data (metrics from previous releases or projects) for identifying components with fault-prone properties. Another DT based approach is used to build models for predicting high-risk Ada components [18]. Regression trees are used in [75] to classify fault-prone software modules; the approach allows one to strike a preferred balance between Type I and Type II misclassification rates. The SPRINT DT algorithm is used in [76] to build classification trees as quality estimation models that predict the class of software modules (fault-prone or not fault-prone). A set of computational intelligence techniques, of which DT is one, is proposed in [121] for software quality analysis. A hybrid approach, GP-based DT, is proposed in [79] for software quality prediction. Compared with DT alone, the GP-based DT approach is more flexible and allows the optimization of performance objectives other than accuracy. Another comparative study result is reported in [30] on using ILP methods for software fault prediction in C++ programs. Both natural and artificial data are used in evaluating the performance of two ILP methods, and some extensions are proposed to one of them. Software quality prediction is formulated as a CL problem in [35]. It is noted in the study that there are activities (such as data acquisition, feature extraction and example labeling) prior to the actual learning process, and that these activities have an impact on the quality of the outcome. The proposed approach is applied to a set of COBOL programs.

1.5.1.2. Software size estimation

NN and GP are used in [41] to validate the component-based method for software size estimation. In addition to producing results that corroborate the component-based approach to software sizing, the study notes that NN works well with the data, recognizing some nonlinear relationships that the multiple linear regression method fails to detect. The equations evolved by GP provide similar or better values than those produced by the regression equations, and are intelligible, providing confidence in the results.

1.5.1.3. Software cost prediction

A general approach called optimized set reduction, based on DT, is described in [17] for analyzing software engineering data, and is demonstrated to be an effective technique for software cost estimation. A comparative study is done in [19] which includes a CBR technique for software cost prediction. The result reported in [28] indicates that improved predictive performance of software cost models can be obtained through the use of Bayesian analysis, which offers a framework in which both prior expert knowledge and sample data can be accommodated to obtain predictions. A GP based approach is proposed in [42] for searching the space of possible software cost functions.

1.5.1.4. Software (project) development effort prediction

IBL techniques are used in [131] for predicting the software project effort for new projects. The empirical results obtained (from nine different industrial data sets totaling 275 projects) indicate that CBR offers a viable complement to the existing prediction and estimation techniques. Another CBR application in software effort estimation is reported in [140]. The work in [82] focuses on search heuristics to help identify the optimal feature set in a CBR system for predicting software project effort. A comparison is done in [142] of several CBR estimation methods, and the results indicate that estimates obtained through analogues selected by humans are more accurate than estimates obtained through analogues selected by tools, and more accurate than estimates from a simple regression model. DT and NN are used in [135] to help predict software development effort. The results were competitive with conventional methods such as COCOMO and function points. The main advantage of DT and NN based estimation systems is that they are adaptable and nonparametric.
NN is the method used in [63, 146] for software development effort prediction, and the results are encouraging in terms of accuracy. Additional research on ML based software effort prediction includes a genetically trained NN (GA+NN) predictor [133] and a GP based approach [93]. The conclusion in [93] epitomizes the dichotomy in applying an ML method: "GP performs consistently well for the given data, but is harder to configure and produces more complex models", and "the complexity of the GP must be weighed against the small increases in accuracy to decide whether to use it as part of any effort prediction estimation". In addition, in-house data are more significant than public data sets for estimates. Several comparative studies of software effort estimation have been reported in [25, 51, 97], where [51] deals with NN and CBR, [97] with CBR, NN and DT, and [25] with CBR, GP and NN.

1.5.1.5. Maintenance task effort prediction

Models are generated in terms of NN and DT methods, and regression methods, for software maintenance task effort prediction in [68]. The study measures and compares the prediction accuracy of each model, and concludes that DT-based and multiple regression-based models have better accuracy results. It is recommended that prediction models be used as instruments to support expert estimates and to analyze the impact of the maintenance variables on the process and product of maintenance.

1.5.1.6. Software resource analysis

In [129], DT is utilized in software resource data analysis to identify classes of software modules that have high development effort or faults (the concept of "high" is defined with regard to the uppermost quartile relative to past data). Sixteen software systems are used in the study. The decision trees correctly identify 79.3 percent of the software modules that had high development effort or faults.

1.5.1.7. Correction cost estimation

An empirical study is done in [34] where DT and ILP are used to generate models for estimating correction costs in software maintenance. The generated models prove valuable in helping to optimize resource allocation in corrective maintenance activities, and in making decisions about when to restructure or reengineer a component so as to make it more maintainable. A comparison leads to the observation that the ILP-based results perform better than the DT-based results.

1.5.1.8. Software reliability prediction

Software reliability growth models can be used to characterize how software reliability varies with time and other factors. The models offer mechanisms for estimating current reliability measures and for predicting their future values. The work in [69] reports the use of NN for software reliability growth prediction. An empirical comparison is conducted between NN-based models and five well-known software reliability growth models using actual data sets from a number of different software projects. The results indicate that NN-based models adapt well across different data sets and have better prediction accuracy.
1.5.1.9. Defect prediction

BL is used in [50] to predict software defects. Though the system reported is only a prototype, it shows the potential that Bayesian belief networks (BBNs) have for incorporating multiple perspectives on defect prediction into a single, unified model. Variables in the prototype BBN system [50] are chosen to represent the life-cycle processes of specification, design and implementation, and testing (Problem-Complexity, Design-Effort, Design-Size, Defects-Introduced, Testing-Effort, Defects-Detected, Defects-Density-At-Testing, Residual-Defect-Count, and Residual-Defect-Density). The proper causal relationships among those software life-cycle processes are then captured and reflected as arcs connecting the variables. A tool is then used with the BBN model in the following manner: given facts about Design-Effort and Design-Size as input, the tool uses Bayesian inference to derive the probability distributions for Defects-Introduced, Defects-Detected and Defect-Density.

1.5.1.10. Reusability prediction

Predictive models are built with DT in [98] to verify the impact of some internal properties of object-oriented applications on reusability. Effort is focused on establishing a correlation between component reusability and three software attributes (inheritance, coupling and complexity). The experimental results show that some software metrics can be used to predict, with a high level of accuracy, the potentially reusable classes.

1.5.1.11. Software release timing

How to determine the software release schedule is an issue that has an impact on the software product developer, the user, and the market. A method based on NN is proposed in [40] for estimating the optimal software release timing. The method adopts a cost minimization criterion and translates it into a time series forecasting problem. NN is then used to estimate future fault-detection times.

1.5.1.12. Testability prediction

The work reported in [73] describes a case study in which NN is used to predict the testability of software modules from static measurements of the source code. The objective in the study is to predict a quantity between zero and one whose distribution is highly skewed toward zero, which proves to be difficult for standard statistical techniques. The results echo the salient feature of the NN-based predictive models discussed so far: their ability to model nonlinear relationships.

1.5.1.13. Productivity

A BL based approach is described in [136] for estimating the productivity of software projects. A demonstrative BBN is defined to capture the causal relationships among components in the COCOMO81 model, along with probability tables for the nodes. The results obtained are still preliminary.

1.5.1.14. Execution time

The temporal behavior of real-time software is pivotal to overall system correctness. Testing whether a real-time system violates its specified timing constraints for certain inputs thus becomes a critical issue. A GA based approach is described in [143] to produce inputs with the longest or shortest execution times, which can be used to check whether they cause a temporal error or a violation of the system's timing constraints.
1.5.2. Property and model discovery

ML methods are used to identify or discover useful information about software entities. Work in [16] explores using ILP to discover loop invariants. The approach is based on collecting execution traces of a program to be proven correct and using them as learning examples for an ILP system. The states of the program variables at a given point in the execution represent positive examples for the condition associated with that point in the program. A controlled closed-world assumption is utilized to generate negative examples.

Table 3. Property discovery.

SE Task                            ML Method
Program invariants                 ILP [16]
Identifying objects in programs    NN [1]
Boundary of normal operations      SVM [95]
Equivalent mutants                 BL [141]
Process models                     NN [31], EBL [55]
In [1], NN is used to identify objects in procedural programs as an effort to facilitate many maintenance activities (reuse, understanding). The approach is based on cluster analysis and is capable of identifying abstract data types and groups of routines that reference a common set of data.

A data analysis technique called process discovery is proposed in [31] and implemented in terms of NN. The approach is based on first capturing data describing process events from an ongoing process and then generating a formal model of the behavior of that process. Another application involves the use of EBL to synthesize models of programming activities or software processes [55]. It generates a process fragment (a group of primitive actions which achieves a certain goal given some preconditions) from a recorded process history.

Despite its effectiveness at detecting faults, mutation testing requires a large number of mutants to be compiled, executed, and analyzed for possible equivalence to the original program being tested. To reduce the number of mutants to be considered, BL is used in [141] to provide probabilistic information for determining the equivalent mutants.

A detection method based on SVM is described in [95] as an approach for validating adaptive control systems. A case study is done with an intelligent flight control system, and the results indicate that the proposed approach is effective at discovering the boundary of the safe region for the learned domain, thus being able to separate faulty behaviors from normal events.
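For the boundary-of-normal-operations idea, a minimal one-class SVM sketch is shown below. The telemetry data, kernel, and parameters are illustrative assumptions, not the configuration used in [95]: the model is fit only on nominal observations and then flags points that fall outside the learned region.

# Minimal sketch of SVM-based detection of the boundary of normal
# operation (in the spirit of [95], not its actual implementation).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Hypothetical telemetry from nominal runs of a control system:
# two sensor readings per sample.
nominal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# Learn a boundary enclosing the nominal region.
detector = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(nominal)

# Classify new observations: +1 = inside normal region, -1 = anomalous.
new_obs = np.array([[0.2, -0.1], [4.0, 4.5]])
print(detector.predict(new_obs))   # e.g., [ 1 -1 ]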
1.5.3. Transformation

The work in [125, 126] describes a GP system that can transform serial programs into functionally identical parallel programs. Functional equivalence between the input and the output of the transformation can be proven, which greatly enhances the system's prospects for use in commercial environments.

Table 4. Transformation.

SE Task                                                             ML Method
Transform serial programs to parallel ones                          GP [125, 126]
Improve software modularity                                         CBR+NN [128], GA [62]
Mapping OO applications to heterogeneous distributed environments   GA [27]
A module architecture assistant is developed in [128] to help software architects improve the modularity of large programs. A model for modularization is established in terms of nearest-neighbor clustering and classification, and is used to make recommendations for rearranging module membership in order to improve modularity. The tool learns similarity judgments that match those of the software architect by performing back propagation on a specialized neural network. Another work on software modularization [62] introduces a new representation (aimed at reducing the size of the search space) and a new crossover operator (designed to promote the formation and retention of building blocks) for a GA-based approach. GA is used in [27] for experimenting with and evaluating a partitioning and allocation model for mapping object-oriented applications to heterogeneous distributed environments. By effectively distributing the software components of an object-oriented system in a distributed environment, it is hoped to achieve performance goals such as load balancing, maximizing concurrency and minimizing communication costs.
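The following toy GA sketches the search-based view of modularization taken in work such as [62] and [27]. Everything here is an illustrative assumption rather than a published encoding: the dependency graph, the fitness function (reward intra-cluster edges, penalize inter-cluster edges), the representation (one cluster label per module), and the GA parameters.

# Toy GA for software modularization (illustrative; not the encoding of [62]).
import random
random.seed(1)

N_MODULES, N_CLUSTERS = 8, 3
# Hypothetical dependency edges between modules.
EDGES = [(0, 1), (0, 2), (1, 2), (3, 4), (4, 5), (3, 5), (6, 7), (2, 3)]

def fitness(assign):
    # Reward edges kept inside a cluster, penalize edges that cross clusters.
    intra = sum(1 for a, b in EDGES if assign[a] == assign[b])
    return intra - (len(EDGES) - intra)

def crossover(p, q):
    cut = random.randrange(1, N_MODULES)
    return p[:cut] + q[cut:]

def mutate(ind):
    ind[random.randrange(N_MODULES)] = random.randrange(N_CLUSTERS)

pop = [[random.randrange(N_CLUSTERS) for _ in range(N_MODULES)]
       for _ in range(30)]
for _ in range(100):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]                      # elitist selection
    children = []
    while len(children) < 20:
        child = crossover(random.choice(parents), random.choice(parents))
        if random.random() < 0.3:
            mutate(child)
        children.append(child)
    pop = parents + children

best = max(pop, key=fitness)
print("module -> cluster:", best, "fitness:", fitness(best))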
1.5.4. Generation and synthesis
In [10], a test case generation method is proposed that is based on ILP. An adequate test set is generated as a result of inductive learning of programs from finite sets of input-output examples. The method scales up well when the size or the complexity of the program to be tested grows. It stops being practical if the number of alternatives (or possible errors) becomes too large. A GP based approach is described in [46] to select and evaluate test data. A tool is reported in [101, 102] that uses, among other things, GA to generate dynamic test data for C/C++ programs. The tool is fully automatic and supports all C/C++ language constructs. Test results have been obtained for programs containing up to 2000 lines of source code with complex, nested conditionals. Three separate works on test data generation are also based on GA
[14, 24, 144]. In [14], the issue of how to instrument programs with flag variables is considered. GA is used in [24] to help generate test data for program paths, whereas the work in [144] focuses on test data generation for structural test methods.

Table 5. Generation and synthesis.

SE Task                        ML Method
Test cases/data                ILP [10], GA [14, 24, 101, 102, 144], GP [46]
Test resource                  GA [33]
Project management rules       {GA, DT} [5]
Software agents                GP [120]
Design repair knowledge        CBR + EBL [6]
Design schemas                 IBL [61]
Data structures                GP [86]
Programs/scripts               IBL [12], {CL, AL} [104]
Project management schedule    GA [26]
The testing resource allocation problem is considered in [33], where a GA-based approach is described. The results are based on consideration of both system reliability and testing cost.

In [5], DT and GA are utilized to learn software project management rules. The objective is to provide decision rules that can help project managers make decisions at any stage of the development process.

Synthesizing Unix shell scripts from a high-level specification is made possible through IBL in [12]. The tool has a retrieval mechanism that allows an appropriate source analog to be automatically retrieved given a description of a target problem. Several domain-specific retrieval heuristics are utilized to estimate the closeness of two problems at the implementation level based on their perceived closeness at the specification level. Though the prototype system demonstrates the viability of the approach, its scalability remains to be seen.

A prototype of a software engineering environment is described in [6] that combines CBR and EBL to synthesize design repair rules for software design. Though the preliminary results are promising, the generality of the learning mechanism and the scaling-up issue remain open questions, as cautioned by the authors. In [61], IBL provides the impetus for a system that acquires software design schemas from design cases of existing applications.
GP is used in [120] to automatically generate agent programs that communicate and interact to solve problems; however, the work reported so far is on a two-agent scenario. Another GP-based approach is geared toward generating abstract data types, i.e., data structures and the operations to be performed on them [86]. In [104], CL and AL are used to synthesize search programs for a Lisp code generator in the domain of combinatorial integer constraint satisfaction problems. GA is behind the effort to generate project management schedules in [26]: using a programmable goal function, the technique can generate a near-optimal allocation of resources and a schedule that satisfies a given task structure and resource pool.

1.5.5. Reuse library construction and maintenance
This area presents itself as a fertile ground for CBR applications. In [109], CBR is the cornerstone of a reuse library system. A component in the library is represented as a set of feature/term pairs. Similarity between a target and a candidate is defined by a distance measure, computed through comparator functions based on the subsumption, closeness and package relations. Components in a software reuse library have an added advantage in that they can be executed on a computer, yielding stronger results than could be expected from generic CBR. The work reported in [52] takes advantage of this property by first retrieving software modules from the library, adapting them to new problems, and then subjecting those new cases to execution on system-generated test sets in order to evaluate the results of CBR. CBR can be augmented with additional mechanisms to address other issues in reuse libraries. Such is the case in [70], where CBR is adopted in conjunction with a specificity-genericity hierarchy to locate and adapt software components to given specifications. The proposed method focuses its attention on the evolving nature of the software repository. A rough sketch of this style of distance-based retrieval follows the table below.

Table 6. Reuse.

SE Task                                        ML Method
Similarity computing                           CBR [109]
Active browsing                                IBL [43]
Cost of rework                                 DT [8]
Knowledge representation                       CBR [52]
Locate and adapt software to specifications    CBR [70]
Generalizing program abstractions              EBL [65]
Clustering of components                       GA [90]
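As a rough illustration of feature-based similarity retrieval for a reuse library (the comparator functions of [109] are more elaborate; the components, features, and similarity measure here are all illustrative assumptions):

# Rough sketch of feature/term-pair retrieval for a reuse library
# (illustrative; not the comparator functions of [109]).
LIBRARY = {  # hypothetical components and their feature/term pairs
    "stack_lib": {"structure": "stack", "lang": "C", "thread_safe": "no"},
    "queue_lib": {"structure": "queue", "lang": "C", "thread_safe": "yes"},
    "ts_stack":  {"structure": "stack", "lang": "C", "thread_safe": "yes"},
}

def similarity(query, candidate):
    # Fraction of shared features whose terms match the query.
    shared = [f for f in query if f in candidate]
    if not shared:
        return 0.0
    return sum(query[f] == candidate[f] for f in shared) / len(shared)

def retrieve(query, k=2):
    ranked = sorted(LIBRARY.items(),
                    key=lambda item: similarity(query, item[1]),
                    reverse=True)
    return ranked[:k]

target = {"structure": "stack", "lang": "C", "thread_safe": "yes"}
for name, feats in retrieve(target):
    print(name, round(similarity(target, feats), 2))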
How to find a better way of organizing reusable components so as to facilitate efficient retrieval is another area where ML finds application. GA is used in [90] to optimize the multiway clustering of software components in a reusable class library. The approach takes into consideration the number of clusters, the similarity within a cluster, and the similarity among clusters.

In [8], DT is used to model and predict the cost of rework in a library of reusable software components. Prescriptive coding rules can be generated from the model and used by programmers as guidelines to reduce the cost of rework in the future. The objective of the work is to use DT to help manage the maintenance of reusable components, and to improve the way the components are produced so as to reduce maintenance costs in the library.

A technique called active browsing is incorporated into a tool that assists the browsing of a reusable library for desired components [43]. An active browser infers its similarity measure from a designer's normal browsing actions without any special input. It then recommends to the designer components it estimates to be close to the target of the search, accomplished through a learning process similar to IBL.

EBL is used as the basis for capturing and generalizing program abstractions developed in practice, to increase their potential for reuse [65]. The approach is motivated by the explicit domain knowledge embodied in data type specifications and the mechanisms for reasoning about such knowledge used in validating software.

1.5.6. Requirement acquisition

CL is used to support scenario-based requirements engineering in the work reported in [85]. The paper describes a formal method for supporting the process of inferring specifications of system goals and requirements inductively from interaction scenarios provided by stakeholders. The method is based on a learning algorithm that takes scenarios as examples and counter-examples (positive and negative scenarios) and generates goal specifications as temporal rules.

Table 7. Process enhancement.

SE Task                                                             ML Method
Derivation of specifications of system goals and requirements      CL [85]
Extract specifications from software                               ILP [29]
Acquire knowledge for specification refinement and augmentation    {DT, NN} [111]
Acquire and maintain specifications consistent with scenarios      EBL [58, 59]
Another work [58] presents a scenario-based elicitation and validation assistant that helps requirements engineers acquire and maintain a specification consistent with the scenarios provided.
The system relies on EBL to generalize scenarios in order to state and prove validation lemmas. A scenario generation tool is built in [59] that adopts a heuristic approach based on piecing together partially satisfying scenarios from the requirements library and using EBL to abstract them so that they can be co-instantiated.

A technique is developed in [29] to extract specifications from software using ILP. It allows instrumented code to be run on a number of representative cases to generate examples of the code's behavior. ILP is then used to generalize these examples into a general description of some aspect of the system's behavior.

Software specifications are imperfect reflections of reality and are prone to errors, inconsistencies and incompleteness. Because the quality of a software system hinges directly on the accuracy and reliability of its specification, there is a dire need for tools and methodologies for specification enhancement. In [111], DT and NN are used to extract and acquire knowledge from sample problem data for specification refinement and augmentation.
1.5.7. Capture development knowledge
How to capture and manage software development knowledge is the theme of this application group where both papers report work utilizing CBR as the tool. In [64], a CBR based infrastructure is proposed that supports evolving knowledge and domain analysis methods that capture emerging knowledge and synthesize it into generally applicable forms. Software process knowledge management is the focus in [4]. A hybrid approach including CBR is proposed to support the customization of software processes. The purpose of CBR is to facilitate reuse of past experiences.
Table 8. Management.

SE Task                                              ML Method
Collect and manage software development knowledge    CBR [64]
Software process knowledge                           CBR [4]

1.6. Status
In this section, we offer a summary of the state-of-the-practice in this niche area. The application patterns of ML methods in the body of existing work are summarized in Table 9.
Table 9. Application patterns of ML methods.

Pattern       Description
Convergent    Different ML methods each being applied to the same SE task
Divergent     A single ML method being applied to different SE tasks
Compound      Several ML methods being combined together for a single SE task
Figure 8 captures a glimpse of the types of software engineering issues, across the seven application areas, that people have been interested in applying ML techniques to. Figure 9 summarizes the publication counts in those areas. For instance, of the eighty-six publications included in Subsection 1.5 above, forty-five (52%) deal with the issue of how to build models to predict or estimate certain properties of the software development process or its artifacts. Figure 10, on the other hand, provides some indication of which ML techniques people are most comfortable using. Based on the classification, IBL/CBR, NN, and DT are the top three most popular techniques, in that order, accounting for fifty-seven percent of the ML applications in our study.
Figure 8. Number of different SE tasks in each application area.
Figure 9. Number of publications in each application area.
Figure 10. State-of-the-practice from the perspective of ML algorithms.
Table 10 depicts the distribution of ML algorithms in the seven SE application areas. The trend information captured in Figure 11, though only based on the published work we have been able to collect, should be indicative of the increased interest in ML&SE.
Table 10. ML methods in SE application areas.

Application area    ML methods
Prediction          NN, IBL/CBR, DT, GA, ILP, GP, CL, BL
Discovery           NN, ILP, EBL, BL, SVM
Transformation      NN, IBL/CBR, GA, GP
Generation          IBL/CBR, DT, GA, ILP, GP, EBL, CL, AL
Reuse               IBL/CBR, DT, GA, EBL
Acquisition         NN, DT, ILP, EBL, CL
Management          CBR
Figure 11. Publications on applying ML algorithms in SE.

Tables 11-21 summarize the applications of individual ML methods.
Table 11. IBL/CBR applications.

Category          Application
Prediction        Development effort
Transformation    Modularity
Generation        Design repair knowledge, Design schemas, Programs/scripts
Reuse             Similarity computing, Active browsing, Knowledge representation, Locate/adapt software to specifications
Management        Software development knowledge, Software process knowledge
Table 12. NN applications.

Category          Application
Prediction        Quality, Size, Development effort, Maintenance effort, Reliability, Release time, Testability
Discovery         Identifying objects, Process models
Transformation    Modularity
Acquisition       Specification refinement
Table 13. DT applications.

Category          Application
Prediction        Quality, Development cost, Development effort, Maintenance effort, Resource analysis, Correction cost, Reusability
Generation        Project management rules
Reuse             Cost of rework
Acquisition       Specification refinement

Table 14. GA applications.

Category          Application
Prediction        Development effort, Execution time
Transformation    Modularity, Object-oriented application
Generation        Test data, Test resource allocation, Project management rules, Project management schedule
Reuse             Clustering of components
Table 15. GP applications.

Category          Application
Prediction        Quality, Size, Development effort, Software cost
Transformation    Parallel programs
Generation        Test data, Software agents, Data structures

Table 16. ILP applications.

Category          Application
Prediction        Quality, Correction cost
Discovery         Program invariants
Generation        Test data
Acquisition       Extract specifications from software

Table 17. EBL applications.

Category          Application
Discovery         Process models
Generation        Design repair knowledge
Reuse             Generalizing program abstractions
Acquisition       Acquiring specifications from scenarios
Table 18. BL applications.

Category          Application
Prediction        Development cost, Defects, Productivity
Discovery         Mutants

Table 19. CL applications.

Category          Application
Prediction        Quality
Generation        Programs/scripts
Acquisition       Derivation of specifications

Table 20. AL applications.

Category          Application
Generation        Programs/scripts

Table 21. SVM application.

Category          Application
Discovery         Operation boundary
The body of existing work we have been able to glean represents the efforts underway to take advantage of the unique perspective ML affords for SE tasks. Here we point out some general issues in ML&SE:

> Applicability and justification. When adopting an ML method for an SE task, we need a good understanding of the dimensions of the learning method and the characteristics of the SE task, and must find the best match between them. Such a justification offers a necessary condition for successfully applying an ML method to an SE task.

> Issue of scaling up. Whether a learning method can be effectively scaled up to handle real-world SE projects is an issue to be reckoned with. What seems to be an effective method for a scaled-down problem may hit a snag when subjected to a full-scale version of the problem. General guidelines regarding this issue are highly desirable.

> Performance evaluation. Given an SE task, some ML-based approaches may outperform their conventional (non-ML) counterparts; others may not offer any performance boost but simply provide a complement or alternative to the available tools; yet another group may fill a void in the existing repertoire of SE tools. In addition, we are interested in finding out whether there are significant performance differences among applicable ML methods for an SE task. To sort out those different scenarios, we need a systematic way of evaluating the performance of a tool. Let S be a set of SE tasks, and let T_C and T_L contain a set of conventional (non-ML) SE tools and a set of ML-based tools, respectively. Figure 12 describes some possible scenarios between S and T_C/T_L, where T_C(s) ⊆ T_C and T_L(s) ⊆ T_L denote the subsets of tools applicable to an SE task s. If P is defined to be some performance measure (e.g., prediction accuracy), then we can use P(t, s) to denote the performance of t for s ∈ S, where t ∈ (T_C(s) ∪ T_L(s)). Let Δ ::= < | = | >. Given an s ∈ S, the performance of two applicable tools can be compared in terms of the following relationships:

    P(t_i, s) Δ P(t_j, s), where t_i ∈ T_C(s) and t_j ∈ T_L(s),
    P(t_k, s) Δ P(t_l, s), where t_k, t_l ∈ T_L(s) and |T_L(s)| > 1.
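A tiny sketch of these comparison relations, with made-up PRED-style scores standing in for the performance measure P (tool names and numbers are hypothetical):

# Illustrative comparison of applicable tools on a task via P(t, s).
P = {  # P[tool][task], made-up numbers
    "regression": {"effort": 0.52},   # a conventional tool in T_C(effort)
    "nn_model":   {"effort": 0.68},   # ML-based tools in T_L(effort)
    "cbr_model":  {"effort": 0.61},
}

def compare(t1, t2, task):
    a, b = P[t1][task], P[t2][task]
    return "<" if a < b else ("=" if a == b else ">")

print("P(regression) vs P(nn_model):", compare("regression", "nn_model", "effort"))
print("P(nn_model) vs P(cbr_model): ", compare("nn_model", "cbr_model", "effort"))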
Figure 12. Relationships between S and Tc, and between S and TL.
> Integration. How an ML-based tool can be seamlessly integrated into the SE development environment or tool suite is another issue that deserves attention. If it takes a heroic effort to integrate a tool, that effort may ultimately limit its applicability.

1.7. Applying ML Algorithms to SE Tasks
In applying machine learning to any real-world problem, there is usually a course of actions to follow. What we propose is a guideline with the following steps; a small sketch illustrating the later steps appears at the end of this section.

Problem formulation. The first step is to formulate a given problem so that it conforms to the framework of the particular learning method chosen for the task. Different learning methods have different inductive biases, adopt different search strategies based on various guiding factors, have different requirements regarding domain theory (presence or absence) and training data (valuation and properties), and are based on different justifications of reasoning (refer to Figures 2-7). All these issues must be taken into consideration during the problem formulation stage. This step is of pivotal importance to the applicability of the learning method. Strategies such as divide-and-conquer may be needed to decompose the original problem into a set of subproblems more amenable to the chosen learning method. Sometimes the best formulation of a problem may not be the one most intuitive to a machine learning researcher [87].

Problem representation. The next step is to select an appropriate representation for both the training data and the knowledge to be learned. As can be seen in Figure 2, different learning methods have different representational formalisms. Thus, the representation of the attributes and features in the learning task is often problem-specific and formalism-dependent.

Data collection. The third step is to collect the data needed for the learning process. The quality and quantity of the data needed depend on the selected learning method. Data may need to be preprocessed before they can be used in the learning process.

Domain theory preparation. Certain learning methods (e.g., EBL) rely on the availability of a domain theory for the given problem. How to acquire and prepare a domain theory (or background knowledge), and what the quality of a domain theory is (correctness, completeness), therefore become important issues that affect the outcome of the learning process.

Performing the learning process. Once the data and a domain theory (if needed) are ready, the learning process can be carried out. The data will be divided into a training set and a test set. If some learning tool or environment is utilized, the training data and the test data may need to be organized according to the tool's requirements. Knowledge induced from the training set is validated on the test set. Because of different splits between the training set and the test set, the learning process itself is an iterative one.

Analyzing and evaluating learned knowledge. Analysis and evaluation of learned knowledge is an integral part of the learning process. The interestingness and the performance of the acquired knowledge are scrutinized during this step, often with help from human experts, which hopefully leads to knowledge refinement. If learned knowledge is deemed insignificant, uninteresting, irrelevant, or deviating, this may indicate the need for revisions at earlier stages such as problem formulation and representation. There are known practical problems in many learning methods, such as overfitting, local minima, or the curse of dimensionality, that are due
to data inadequacy, noise or irrelevant attributes in the data, the nature of a search strategy, or an incorrect domain theory.

Fielding the knowledge base. What this step entails is that the learned knowledge be put to use [87]. The knowledge could be embedded in a software development system or a software product, or used without embedding it in a computer system. As observed in [87], the power of machine learning methods comes not from a particular induction method, but from proper formulation of the problems and from crafting the representation to make learning tractable.
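The later steps of this guideline can be made concrete with a small sketch: hypothetical project data is split into training and test sets, a DT-based effort estimator is learned, and the induced model is evaluated on the held-out data. All feature names and numbers below are illustrative assumptions.

# Sketch of "performing the learning process" and "analyzing and
# evaluating learned knowledge" with a DT-based effort estimator.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Hypothetical project data: [size_kloc, team_experience_years].
X = rng.uniform([1, 1], [100, 15], size=(60, 2))
effort = 3.0 * X[:, 0] / np.sqrt(X[:, 1]) + rng.normal(0, 5, 60)

# Divide data into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, effort, test_size=0.25, random_state=0)
model = DecisionTreeRegressor(max_depth=4).fit(X_train, y_train)

# Validate knowledge induced from the training set on the test set.
rel_err = np.abs(model.predict(X_test) - y_test) / y_test
print(f"mean magnitude of relative error on test set: {rel_err.mean():.2f}")

In practice this split-train-evaluate cycle would be repeated over different splits, matching the iterative character of the learning process described above.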
1.8. Organization of the Book
The rest of the book is organized as follows. Chapters 2 through 8 cover ML applications in the seven categories of SE tasks, respectively.

Chapter 2 deals with ML applications in software measurement and attribute prediction and estimation. This is the most concentrated category, including forty-five publications in our study. In this chapter, a collection of seven papers is selected as representative of activities in this category. These seven papers include ML applications in predicting or estimating software quality, software development cost, project effort, software defects, and software release timing. The applications involve the ML methods of BL, DT, NN, CBR, and GP.

In Chapter 3, two papers are included to address the use of ML methods for discovering software properties and models, one dealing with using NN to identify objects in procedural programs, and the other tackling the issue of detecting equivalent mutants in mutation testing using BL.

The main theme in Chapter 4 is software transformation. ML methods are utilized to transform software into versions with desirable properties (e.g., from serial programs to parallel programs, from a less modularized program to a more modularized one, or mapping object-oriented applications to heterogeneous distributed environments). In this chapter, we include one paper that deals with the issue of transforming software systems for better modularity using nearest-neighbor clustering and a special-purpose NN.

Chapter 5 describes ML applications where software artifacts are generated or synthesized. The chapter contains one paper that describes a GA-based approach to test data generation. The proposed approach is based on dynamic test data generation and is geared toward generating condition-decision adequate test sets.

Chapter 6 takes a look at how ML methods are utilized to improve the process of constructing and maintaining reuse libraries. Software reuse library construction and maintenance has been a fertile ground for ML applications. The paper included in this chapter describes a CBR-based approach to locating and adapting reusable components to particular specifications.

In Chapter 7, software specification is the target issue. Two papers are selected for the chapter. The first describes an ILP-based approach to extracting specifications from software. The second discusses an EBL-based approach to scenario generation, which is an integral part of specification modeling.

Chapter 8 is concerned with how ML methods are used to capture and manage software development or process knowledge. The one paper in the chapter discusses a CBR-based method for collecting and managing software development knowledge as it evolves in an organizational context.

Finally, Chapter 9 offers some guidelines on how to select ML methods for SE tasks and how to formulate an SE task as a learning problem, and concludes the book with remarks on where future effort will be needed in this niche area.
Chapter 2

ML Applications in Prediction and Estimation

As evidenced in Chapter One, the majority of the ML applications (52%) deal with the issue of how to build models to predict or estimate certain properties of the software development process or its artifacts. The subject of the prediction or estimation involves a range of properties: quality, size, cost, effort, reliability, reusability, productivity, and testability. In this chapter, we include a set of seven papers where ML methods are used to predict or estimate measurements for either internal or external attributes of processes, products, or resources in software engineering. These include: software quality, software cost, project or software development effort, software defects, and software release timing. Table 22 summarizes the current state-of-the-practice in this application area.

Table 22. ML methods used in prediction and estimation.

Subject               ML methods
Quality               NN, DT, GP, ILP, CL
Size                  NN, GP
Development cost      DT, BL
Development effort    NN, IBL/CBR, DT, GA, GP
Maintenance effort    NN, DT
Resource analysis     DT
Correction cost       DT, ILP
Defects               BL
Reusability           DT
Release time          NN
Productivity          BL
Execution time        GA
Testability           NN
Software cost         GP
Reliability           NN
A primary concern in prediction or estimation models and methods is accuracy. There are some general issues regarding prediction accuracy. The first is measurement, namely, how accuracy is to be measured. There are several accuracy measures, and the choice of which one to use may depend on the objectives one has when using the predictor. The second issue is sensitivity, that is, how sensitive a prediction method's accuracy is to changes in data and over time. Different approaches may have different levels of sensitivity.
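As a concrete illustration of the measurement issue, the sketch below computes two accuracy measures that are common in this literature: MMRE (mean magnitude of relative error) and PRED(l), the fraction of estimates falling within 100·l percent of the actuals; the phrase "within 30 percent of the actuals N percent of the time" used in this chapter is PRED(.30). The effort numbers below are made up.

# Two common accuracy measures for effort predictors (illustrative data).
import numpy as np

actual    = np.array([120.0, 80.0, 45.0, 200.0, 60.0])
predicted = np.array([100.0, 95.0, 40.0, 270.0, 58.0])

mre = np.abs(predicted - actual) / actual      # magnitude of relative error

def pred(level, mre_values):
    """Fraction of estimates whose relative error is within `level`."""
    return np.mean(mre_values <= level)

print(f"MMRE      = {mre.mean():.2f}")         # mean magnitude of relative error
print(f"PRED(.30) = {pred(0.30, mre):.2f}")    # here 0.80: 80% within 30%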
The paper by Chulani, Boehm and Steece [28] describes a BL approach to software development cost prediction. A salient feature of BL is that it accommodates prior knowledge and allows both data and prior knowledge to be utilized in making inferences. This proves especially helpful in circumstances where data are scarce and incomplete. The results obtained by the authors indicate that the BL approach has a predictive performance (within 30 percent of the actual values 75 percent of the time) that is significantly better than that of the previous multiple regression approach (within 30 percent of the actual values only 52 percent of the time) on their latest sample of 161 project datapoints.

The paper by Srinivasan and Fisher [135] deals with the issue of estimating software development effort. This is an important task in the software development process, as either underestimation or overestimation of the development effort would have adverse effects. Their work describes the use of two ML methods, DT and NN, for building software development effort estimators from historical data. The experimental results indicate that the performance of DT- and NN-based estimators is competitive with traditional estimators. Though just as sensitive to various aspects of data selection and representation as the traditional models, ML-based estimators have the major benefit of being adaptable and nonparametric.

The paper by Shepperd and Schofield [131] adopts a CBR approach to software project effort estimation. In their approach, projects are characterized in terms of a feature set that ranges from as few as one to as many as 29 features, including the number of interfaces, the development method, and the size of the functional requirements document. Cases for completed projects are stored along with their features and actual values of development effort. Similarity among cases is defined based on project features. Predicting the development effort of a new project amounts to retrieving its nearest neighbors in the case base and using their known effort values as the basis for estimation. The sensitivity analysis indicates that estimation by analogy may be highly unreliable if the size of the case base is below 10 known projects, and that this approach can be susceptible to outlying projects, though the influence of a rogue project can be ameliorated as the size of the dataset increases.

The paper by Fenton and Neil [50] offers a critical analysis of the existing defect prediction models and proposes an alternative approach to defect prediction using Bayesian belief networks (BBN), part of the BL method. Software defect prediction is a very useful and important tool for gauging the likely delivered quality and maintenance effort before software systems are deployed. Predicting defects requires a holistic model rather than a single-issue model that hinges on size, complexity, testing metrics, or process quality data alone. It is argued in [50] that all these factors must be taken into consideration for defect prediction to be successful. BBN proves to be a very useful approach to the software defect prediction problem. A BBN represents the joint probability distribution for a set of variables.
This is accomplished by specifying (a) a directed acyclic graph (DAG) in which nodes represent variables and arcs correspond to conditional independence assumptions (causal knowledge about the problem domain), and (b) a set of local conditional probability tables, one for each variable [67, 105]. A BBN can be used to infer the probability distribution for a target variable (e.g., "Defects Detected"), which specifies the probability that the variable will take on each of its possible values given the observed values of the other variables. In general, a BBN can be used to compute the probability distribution for any subset of variables given the values or distributions for any subset of the remaining variables. In [50], variables in the BBN model are chosen to represent the life-cycle processes of specification, design and implementation, and testing. The proper causal relationships among those software life-cycle processes are then captured and reflected as arcs connecting the variables. A tool is then used with the BBN model in the following manner: given facts about Design-Effort and Design-Size as input, the tool uses Bayesian inference to derive the probability distributions for Defects-Introduced, Defects-Detected and Defect-Density.
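The following toy sketch shows this style of inference by enumeration on a three-variable fragment. The structure loosely mirrors the idea in [50] (design effort influences defects introduced, which in turn influence defects detected), but every probability below is an illustrative assumption, not a value from [50].

# Toy Bayesian-network inference by enumeration (illustrative values).
# P(DesignEffort): 'low' or 'high'
P_DE = {"low": 0.4, "high": 0.6}
# P(DefectsIntroduced | DesignEffort): 'few' or 'many'
P_DI = {"low":  {"few": 0.3, "many": 0.7},
        "high": {"few": 0.8, "many": 0.2}}
# P(DefectsDetected | DefectsIntroduced): 'few' or 'many'
P_DD = {"few":  {"few": 0.9, "many": 0.1},
        "many": {"few": 0.4, "many": 0.6}}

def joint(de, di, dd):
    # Chain-rule factorization implied by the DAG DE -> DI -> DD.
    return P_DE[de] * P_DI[de][di] * P_DD[di][dd]

# Infer P(DefectsIntroduced | DesignEffort='high', DefectsDetected='many').
posterior = {di: joint("high", di, "many") for di in ("few", "many")}
total = sum(posterior.values())
for di, p in posterior.items():
    print(f"P(DefectsIntroduced={di} | evidence) = {p / total:.3f}")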
The paper by Khoshgoftaar, Allen and Deng [75] discusses using a DT approach to classify fault-prone software modules. The objective is to predict which modules are fault-prone early enough in the development life cycle. In the regression tree to be learned, the dependent variable is a real-valued response variable, the independent variables are the predictors on which the internal nodes of the tree are defined, and the leaf nodes are labeled with a real quantity for the response variable. A large legacy telecommunication system is used in the case study, where four consecutive releases of the software form the basis for the training and test data sets (release 1 is used as the training data set; releases 2-4 are used as test data sets). A classification rule is proposed that gives the developer latitude to balance the two types of misclassification rates. The case study results indicate satisfactory prediction accuracy and robustness.

The paper by Burgess and Lefley [25] conducts a comparative study of software effort estimation in terms of three ML methods: GP, NN and CBR. A well-known data set of 81 projects from the late 1980s is used for the study. The input variables are restricted to those available at the specification stage. The comparisons are based on the accuracy of the results, the ease of configuration and the transparency of the solutions. The results indicate that the explanatory nature of estimation by analogy gives CBR an advantage in its interaction with the end user, and that GP can lead to accurate estimates and has the potential to be a valid addition to the suite of tools for software effort estimation.

The paper by Dohi, Nishio and Osaki [40] proposes an NN-based approach to estimating the optimal software release timing that minimizes the relevant cost criterion. Because the essential problem behind software release timing is to estimate future fault-detection time intervals, the authors adopt two typical NNs (a feed-forward NN and a recurrent NN) for time series forecasting. Six data sets of real software fault-detection times are used in the case study. The results indicate that the predictive accuracy of the NN models outperforms that of approaches based on software reliability growth models. Of the two NN models, the recurrent NN yields better results than the feed-forward NN.

The following papers are included here:

S. Chulani, B. Boehm and B. Steece, "Bayesian analysis of empirical software engineering cost models," IEEE Trans. on Software Engineering, Vol. 25, No. 4, July 1999, pp. 573-583.

K. Srinivasan and D. Fisher, "Machine learning approaches to estimating software development effort," IEEE Trans. on Software Engineering, Vol. 21, No. 2, Feb. 1995, pp. 126-137.

M. Shepperd and C. Schofield, "Estimating software project effort using analogies," IEEE Trans. on Software Engineering, Vol. 23, No. 12, Nov. 1997, pp. 736-743.
N. Fenton and M. Neil, "A critique of software defect prediction models," IEEE Trans. on Software Engineering, Vol. 25, No. 5, Sept. 1999, pp. 675-689.

T. Khoshgoftaar, E. B. Allen and J. Deng, "Using regression trees to classify fault-prone software modules," IEEE Transactions on Reliability, Vol. 51, No. 4, 2002, pp. 455-462.

C. J. Burgess and M. Lefley, "Can genetic programming improve software effort estimation? A comparative evaluation," Information and Software Technology, Vol. 43, No. 14, 2001, pp. 863-873.

T. Dohi, Y. Nishio and S. Osaki, "Optimal software release scheduling based on artificial neural networks," Annals of Software Engineering, Vol. 8, No. 1, 1999, pp. 167-185.
Bayesian Analysis of Empirical Software Engineering Cost Models

Sunita Chulani, Member, IEEE, Barry Boehm, Fellow, IEEE, and Bert Steece

Abstract—To date many software engineering cost models have been developed to predict the cost, schedule, and quality of the software under development. But, the rapidly changing nature of software development has made it extremely difficult to develop empirical models that continue to yield high prediction accuracies. Software development costs continue to increase and practitioners continually express their concerns over their inability to accurately predict the costs involved. Thus, one of the most important objectives of the software engineering community has been to develop useful models that constructively explain the software development life-cycle and accurately predict the cost of developing a software product. To that end, many parametric software estimation models have evolved in the last two decades [25], [17], [26], [15], [28], [1], [2], [33], [7], [10], [22], [23]. Almost all of the above mentioned parametric models have been empirically calibrated to actual data from completed software projects. The most commonly used technique for empirical calibration has been the popular classical multiple regression approach. As discussed in this paper, the multiple regression approach imposes a few assumptions frequently violated by software engineering datasets. The source data is also generally imprecise in reporting size, effort, and cost-driver ratings, particularly across different organizations. This results in the development of inaccurate empirical models that don't perform very well when used for prediction. This paper illustrates the problems faced by the multiple regression approach during the calibration of one of the popular software engineering cost models, COCOMO II. It describes the use of a pragmatic 10 percent weighted average approach that was used for the first publicly available calibrated version [6]. It then moves on to show how a more sophisticated Bayesian approach can be used to alleviate some of the problems faced by multiple regression. It compares and contrasts the two empirical approaches, and concludes that the Bayesian approach was better and more robust than the multiple regression approach.

Bayesian analysis is a well-defined and rigorous process of inductive reasoning that has been used in many scientific disciplines (the reader can refer to [11], [35], [3] for a broader understanding of the Bayesian Analysis approach). A distinctive feature of the Bayesian approach is that it permits the investigator to use both sample (data) and prior (expert-judgment) information in a logically consistent manner in making inferences. This is done by using Bayes' theorem to produce a 'postdata' or posterior distribution for the model parameters. Using Bayes' theorem, prior (or initial) values are transformed to postdata views. This transformation can be viewed as a learning process. The posterior distribution is determined by the variances of the prior and sample information. If the variance of the prior information is smaller than the variance of the sampling information, then a higher weight is assigned to the prior information. On the other hand, if the variance of the sample information is smaller than the variance of the prior information, then a higher weight is assigned to the sample information, causing the posterior estimate to be closer to the sample information.

The Bayesian approach discussed in this paper enables stronger solutions to one of the biggest problems faced by the software engineering community: the challenge of making good decisions using data that is usually scarce and incomplete. We note that the predictive performance of the Bayesian approach (i.e., within 30 percent of the actuals 75 percent of the time) is significantly better than that of the previous multiple regression approach (i.e., within 30 percent of the actuals only 52 percent of the time) on our latest sample of 161 project datapoints.

Index Terms—Bayesian analysis, multiple regression, software estimation, software engineering cost models, model calibration, prediction accuracy, empirical modeling, COCOMO, measurement, metrics, project management.
1 CLASSICAL MULTIPLE REGRESSION APPROACH

MOST of the existing empirical software engineering cost models are calibrated using the classical multiple regression approach. In Section 1, we focus on the overall description of the multiple regression approach and how it can be used on software engineering data. We also highlight the assumptions imposed by the multiple regression approach and the resulting problems faced by the software engineering community in trying to calibrate empirical models using this approach. The example dataset used to facilitate the illustration is the 1997 COCOMO II dataset, which is composed of data from 83 completed projects collected from commercial, aerospace, government, and nonprofit organizations [30]. It should be noted that, with more than a dozen commercial implementations, COCOMO has been one of the most popular cost estimation models of the '80s and '90s. COCOMO II [2] is a recent update of the popular COCOMO model published in [1].

(S. Chulani is with IBM Research, Center for Software Engineering, 650 Harry Rd., San Jose, CA 95120; this work was performed while doing research at the Center for Software Engineering, University of Southern California, Los Angeles. B. Boehm is with the Center for Software Engineering, University of Southern California, Los Angeles, CA 90089. B. Steece is with the Marshall School of Business, University of Southern California, Los Angeles, CA 90089.)

Multiple Regression expresses the response (e.g., Person Months (PM)) as a linear function of k predictors (e.g., Source Lines of Code, Product Complexity, etc.).
This linear function is estimated from the data using the ordinary least squares approach discussed in numerous books such as [18], [34]. A multiple regression model can be written as

    y_t = β0 + β1·x_t1 + ... + βk·x_tk + ε_t    (1)

where x_t1, ..., x_tk are the values of the predictor (or regressor) variables for the t-th observation, β0, ..., βk are the coefficients to be estimated, ε_t is the usual error term, and y_t is the response variable for the t-th observation. Our example model, COCOMO II, has the following mathematical form:

    Effort = A × [Size]^(1.01 + SF1 + SF2 + ... + SF5) × (EM1 × EM2 × ... × EM17)    (2)

where

    A = multiplicative constant;
    Size = size of the software project measured in terms of KSLOC (thousands of Source Lines of Code) [26] or function points [13] and programming language;
    SF = scale factor;
    EM = effort multiplier (refer to [2] for further explanation of COCOMO II terms).

We can linearize the COCOMO II equation by taking logarithms on both sides of the equation as shown:

    ln(PM) = β0 + β1·1.01·ln(Size) + β2·SF1·ln(Size) + ... + β6·SF5·ln(Size)
             + β7·ln(EM1) + β8·ln(EM2) + ... + β23·ln(EM17)    (3)

Using (3) and the 1997 COCOMO II dataset consisting of 83 completed projects, we employed the multiple regression approach [6]. Because some of the predictor variables had high correlations, we formed new aggregate predictor variables. These included analyst capability and programmer capability, which were aggregated into personnel capability, PERS; and time constraints and storage constraints, which were aggregated into resource constraints, RCON. We used a threshold value of 0.65 for high correlation among predictor variables. Table 1 shows the highly correlated parameters that were aggregated for the 1997 calibration of COCOMO II.

Table 1. COCOMO II.1997 highly correlated parameters (analyst and programmer capability aggregated into PERS; time and storage constraints aggregated into RCON).

The regression estimated the β coefficients associated with the scale factors and effort multipliers, as shown below in the RCode (statistical software developed at the University of Minnesota [8]) run:

    Data set = COCOMOII.1997
    Response = log[PM] - 1.01*log[SIZE]

    Coefficient Estimates
    Label               Estimate       Std. Error    t-value
    Constant_A           0.701883      0.231930       3.026
    PMAT*log[SIZE]       0.000884288   0.0130658      0.068
    PREC*log[SIZE]      -0.00901971    0.0145235     -0.621
    TEAM*log[SIZE]       0.00866128    0.0170206      0.509
    FLEX*log[SIZE]       0.0314220     0.0151538      2.074
    RESL*log[SIZE]      -0.00558590    0.019035      -0.293
    log[PERS]            0.987472      0.230583       4.282
    log[RELY]            0.798808      0.528549       1.511
    log[CPLX]            1.13191       0.434550       2.605
    log[RCON]            1.36588       0.273141       5.001
    log[PEXP]            0.696906      0.527474       1.321
    log[LTEX]           -0.0421480     0.672890      -0.063
    log[DATA]            2.52796       0.723645       3.493
    log[RUSE]           -0.444102      0.486480      -0.913
    log[DOCU]           -1.32818       0.664557      -1.999
    log[PVOL]            0.858302      0.532544       1.612
    log[AEXP]            0.560542      0.609259       0.920
    log[PCON]            0.488392      0.322021       1.517
    log[TOOL]            2.49512       1.11222        2.243
    log[SITE]            1.39701       0.831993       1.679
    log[SCED]            2.84074       0.774020       3.670

As the results indicate, some of the regression estimates had counterintuitive values, i.e., negative coefficients (the PREC, RESL, LTEX, RUSE, and DOCU rows above). As an example, consider the 'Develop for Reuse' (RUSE) effort multiplier. This multiplicative parameter captures the additional effort required to develop components intended for reuse on current or future projects. As shown in Table 2, if the RUSE rating is Extra High (XH), i.e., developing for reuse across multiple product lines, it will cause an increase in effort by a factor of 1.56. On the other hand, if the RUSE rating is Low (L), i.e., developing with no consideration of
future reuse, it will cause effort to decrease by a factor of 0.89. This rationale is consistent with the results of 12 published studies of the relative cost of developing for reuse compiled in [27], and was based on the expert judgment of the researchers of the COCOMO II team. But the regression results produced a negative β coefficient for RUSE. This negative coefficient results in the counterintuitive rating scale shown in Table 3, i.e., an XH rating for RUSE causes a decrease in effort and an L rating causes an increase in effort. Note the opposite trends followed in Table 2 and Table 3.

Table 2. RUSE—expert-determined a priori rating scale, consistent with 12 published studies.

Rating                  Low (L)   Nominal (N)      High (H)        Very High (VH)        Extra High (XH)
Definition              None      Across project   Across program  Across product line   Across multiple product lines
1997 a priori values    0.89      1.00             1.16            1.34                  1.56

Table 3. RUSE—data-determined rating scale, contradicting 12 published studies.

Rating                         Low (L)   Nominal (N)      High (H)        Very High (VH)        Extra High (XH)
Definition                     None      Across project   Across program  Across product line   Across multiple product lines
1997 data-determined values    1.05      1.00             0.94            0.88                  0.82

A possible explanation (discussed in a study by [24] on "Why regression coefficients have the wrong sign") for this contradiction may be the lack of dispersion in the responses associated with RUSE. A possible reason for this lack of dispersion is that RUSE is a relatively new cost factor, and our follow-up indicated that the respondents did not have enough information to report its rating accurately during the data collection process. Additionally, many of the responses "I don't know" and "It does not apply" had to be coded as 1.0 (since this is the only way to code no impact on effort). Note (see Fig. 1) that, with slightly more than 50 of the 83 datapoints for RUSE set at Nominal and with no observations at XH, the data for RUSE does not exhibit enough dispersion along the entire range of possible values. While this is the familiar errors-in-variables problem, our data does not allow us to resolve this difficulty. Thus, the authors were forced to assume that the random variation in the responses for RUSE is small compared to the range of RUSE. The reader should note that other cost models that use the multiple regression approach rarely explicitly state this assumption, even though it is implicitly assumed. Other reasons for the counterintuitive results include the violation of some of the restrictions imposed by multiple regression [4], [5]:
1. The number of datapoints should be large relative to the number of model parameters (i.e., there are many degrees of freedom). Unfortunately, collecting data has been, and continues to be, one of the biggest challenges in the software estimation field, caused primarily by immature processes and management reluctance to release cost-related data.

2. There should be no extreme cases (i.e., outliers). Extreme cases can distort parameter estimates, and such cases frequently occur in software engineering data due to the lack of precision in the data collection process.

3. The predictor variables (cost drivers and scale factors) should not be highly correlated. Unfortunately, because cost data is historically rather than experimentally collected, correlations among the predictor variables are unavoidable.

The above restrictions are violated to some extent by the COCOMO II dataset. The COCOMO II calibration approach determines the coefficients for the five scale factors and the 17 effort multipliers (merged into 15 due to high correlation, as discussed above). Considering the rule of thumb that every parameter being calibrated should have at least five datapoints requires that the COCOMO II dataset have data
on at least 110 (or 100, if we consider that parameters were merged) completed projects. We note that the COCOMO II.1997 dataset has just 83 datapoints.

Fig. 1. Distribution of RUSE.

The second point above indicates that, due to the imprecision in the data collection process, outliers can occur, causing problems in the calibration. For example, if a particular organization had extraordinary documentation requirements imposed by its management, then even a very small project would require a lot of effort expended in trying to meet the excessive documentation needs beyond the match to the life-cycle needs. If the data collected simply used the highest DOCU rating provided in the model, then the huge amount of effort due to the stringent documentation needs would be underrepresented, and the project would have the potential of being an outlier. Outliers in software engineering data, as indicated above, are mostly due to imprecision in the data collection process.

The third restriction requires that no parameters be highly correlated. As described above, in the COCOMO II.1997 calibration a few parameters were aggregated to alleviate this problem.

To resolve some of the counterintuitive results produced by the regression analysis (e.g., the negative coefficient for RUSE explained above), we used a weighted average of the expert-judgment results and the regression results, with only 10 percent of the weight going to the regression results, for all the parameters. We selected the 10 percent weighting factor because models with 40 percent and 25 percent weighting factors produced less accurate predictions. This pragmatic calibrating procedure moved the model parameters in the direction suggested by the sample data but retained the rationale contained within the a priori values. An example of the 10 percent application using the RUSE effort multiplier is given in Fig. 2. As shown in the graph, the trends followed by the a priori and the data-determined curves are opposite. The data-determined curve has a negative slope and, as shown in Table 3, violates expert opinion.

Fig. 2. Example of the 10 percent weighted average approach: RUSE rating scale.

The resulting calibration of the COCOMO II model using the 1997 dataset of 83 projects produced estimates within 30 percent of the actuals 52 percent of the time for effort. The prediction accuracy improved to 64 percent when the data was stratified into sets based on the 18 unique sources of the data (see [19], [20], [14] for further confirmation that local calibration improves accuracy). The constant, A, of the COCOMO II equation was recalibrated for each of these sets, i.e., a different intercept was computed for each set. The constant value ranged from 1.23 to 3.72 for the 18 sets and yielded the prediction accuracies shown in Table 4.

Table 4. Prediction accuracy of COCOMO II.1997.

             Before stratification      After stratification
             by organization            by organization
PRED(.20)    46%                        49%
PRED(.25)    49%                        55%
PRED(.30)    52%                        64%

While the 10 percent weighted average procedure produced a workable initial model, we want to develop a more formal methodology for combining expert judgment and sample information. A Bayesian analysis with an informative prior provides such a framework.

2 THE BAYESIAN APPROACH

2.1 Basic Framework—Terminology and Theory

The Bayesian approach provides a formal process by which a priori expert judgment can be combined with sampling information (data) to produce a robust a posteriori model. Using Bayes' theorem, we can combine our two information sources as follows:

    f(β|Y) = f(Y|β)·f(β) / f(Y)    (4)
where β is the vector of parameters in which we are interested and Y is the vector of sample observations. In (4), f(β|Y) is the posterior density function for β summarizing all the information about β, f(Y|β) is the sample information and is algebraically equivalent to the likelihood function for β, and f(β) is the prior information summarizing the expert-judgment information about β. Equation (4) can be rewritten as:

    f(β|Y) ∝ f(Y|β)·f(β)    (5)

In words, (5) means:

    Posterior ∝ Sample × Prior

In the Bayesian analysis context, the "prior" probabilities are the simple "unconditional" probabilities associated with the sample information, while the "posterior" probabilities are the "conditional" probabilities given sample and prior information. The Bayesian approach makes use of prior information that is not part of the sample data, providing an optimal combination of the two sources of information. As described in many books on Bayesian analysis [21], [3], the posterior mean, b**, and variance, Var(b**), are defined as:

    b** = [(1/s²)·X′X + H*]⁻¹ × [(1/s²)·X′X·b̂ + H*·b*]  and
    Var(b**) = [(1/s²)·X′X + H*]⁻¹    (6)

where X is the matrix of predictor variables, s² is the variance of the residual for the sample data, b̂ is the sample regression estimate, and H* and b* are the precision (inverse of variance) and mean of the prior information, respectively. From (6), it is clear that, in order to determine the Bayesian posterior mean and variance, we need to determine the mean and precision of the prior information and the sampling information. The next two subsections describe the approach taken to determine the prior and sampling information, followed by a section on the Bayesian a posteriori model.

2.2 Prior Information

To determine the prior information for the coefficients (i.e., b* and H*) for our example model, COCOMO II, we conducted a Delphi exercise [12], [1], [29]. Eight experts from the field of software estimation were asked to independently provide their estimates of the numeric values associated with each COCOMO II cost driver. Roughly half of these participating experts had been lead cost experts for large software development organizations, and a few of them were originators of other proprietary cost models. All of the participants had at least 10 years of industrial software cost estimation experience. Based on the credibility of the participants, the authors felt very comfortable using the results of the Delphi rounds as the prior information for the purposes of calibrating COCOMO II.1998. The reader is urged to refer to [32], where a study showed that estimates made by experts were more accurate than model-determined estimates. However, in [16] evidence showing the inefficiencies of expert judgment in other domains is highlighted.

Once the first round of the Delphi was completed, we summarized the results in terms of the means and the ranges of the responses. These summarized results were quite raw, with significant variances caused by misunderstanding of the parameter definitions. In an attempt to improve the accuracy of these results and to attain better consensus among the experts, the authors distributed the results back to the participants. A better explanation of the behavior of the scale factors was provided, since the scale factor responses had the highest variance. Each of the participants got a second opportunity to independently refine his/her response based on the responses of the rest of the participants in round 1. The authors felt that, for the 17 effort multipliers, the summarized results of round 2 were representative of the real-world phenomena and decided to use these as the a priori information. But for the five scale factors, the authors conducted a third round and made sure that the participants had a very good understanding of the exponential behavior of these parameters. The results of the third round were used as a priori information for the five scale factors. Please note that if the prior variance for any parameter is zero (in our case, if all experts responded with the same value), then the Bayesian approach will rely completely on expert opinion. However, this construct is inoperative since, not surprisingly, disagreement and hence variability amongst experts exists in the software field.
TABLE 5
COCOMO II.1998 "A Priori" Rating Scale for Develop for Reuse (RUSE)

Rating            Definition                       1998 A Priori Value
Low (L)           None                             0.89
Nominal (N)       Across project                   1.00
High (H)          Across program                   1.15
Very High (VH)    Across product line              1.33
Extra High (XH)   Across multiple product lines    1.54

Productivity Range (Least Productive Rating / Most Productive Rating): Mean = 1.73, Variance = 0.05
discussed in Section 2, this multiplicative parameter captures the additional effort required to develop components intended for reuse on current or future projects. As shown in Table 5, if the RUSE rating is Extra High (XH), i.e., developing for reuse across multiple product lines, it will cause an increase in effort by a factor of 1.54. On the other hand, if the RUSE rating is Low (L), i.e., developing with no consideration of future reuse, it will cause effort to decrease by a factor of 0.89. The resulting range of productivity for RUSE is 1.73 (= 1.54/0.89), and the variance computed from the second Delphi round is 0.05. Comparing the results of Table 5 with the expert-determined a priori rating scale for the 1997 calibration illustrated in Table 2 validates the strong consensus of the experts in a Productivity Range for RUSE of approximately 1.7.

2.3 Sample Information

The sampling information is the result of a data collection activity initiated in September 1994, soon after the initial publication of the COCOMO II description [2]. Affiliates of the Center for Software Engineering at the University of Southern California provided most of the data [30]. These organizations represent the commercial, aerospace, and federally funded research and development center (FFRDC) sectors of software development.

Data on completed software projects is recorded on a data collection form that asks between 33 and 59 questions, depending on the degree of source code reuse [30]. A question asked very frequently concerns the definition of software size, i.e., what defines a line of source code or a Function Point (FP)? Appendix B in the Model Definition Manual [30] defines a logical line of code using the framework described in [26], and [13] gives details on the counting rules for FPs. In spite of the definitions, the data collected to date exhibits local variations caused by differing interpretations of the counting rules. Another parameter that has different definitions within different organizations is effort, i.e., what is a person-month (PM)? In COCOMO II, we define a PM as 152 person-hours, but this varies from organization to organization. This information is usually derived from time cards maintained by employees. But uncompensated overtime hours are illegal to report on time cards and hence do not get accounted for in the PM count. This leads to variations in the data reported, and the authors took as much caution as possible while collecting the data. Variations also occur in the understanding of the subjective rating scales of the scale factors and effort multipliers; [9] developed a system to alleviate this problem and help users apply cost driver definitions consistently for the PRICE S model. For example, a Very High rating for analyst capability in one organization could be equivalent to a Nominal rating in another organization. All these variations suggest that any organization using a parametric cost model should locally calibrate the model to produce better estimates. Please refer to the local calibration results discussed in Table 4.

The sampling information includes data on the response variable, effort in person-months (PM), where 1 PM = 152 hours, and predictor variables such as the actual size of the software in KSLOC (thousands of Source Lines of Code, adjusted for breakage and reuse). The database has grown from 83
datapoints in 1997 to 161 datapoints in 1998. The distributions of effort and size for the 1998 database of 161 datapoints are shown in Fig. 3. As can be noted, both histograms are positively skewed, with the bulk of the projects in the database having effort less than 500 PM and size less than 150 KSLOC. Since the multiple regression approach based on least squares estimation assumes that the response variable is normally distributed, the positively skewed histogram for effort indicates the need for a transformation. We also want the relationships between the response variable and the predictor variables to be linear. The histograms for size in Fig. 3 and Fig. 4 and the scatter plot in Fig. 5 show that a log transformation is appropriate for size. Furthermore, the log transformations on effort and size are consistent with (2) and (3) above. The regression analysis done in RCode (statistical software developed at the University of Minnesota [8]) on the log-transformed COCOMO II parameters using the dataset of 161 datapoints yielded the following results:

Dataset = COCOMO II.1998
Response = log[PM]

Coefficient Estimates

Label             Estimate    Std. Error   t-value
Constant_A        0.961552    0.103346     9.304
log[SIZE]         0.921827    0.0460578    20.015
PMAT*log[SIZE]    0.684836    0.481078     1.424
PREC*log[SIZE]    1.10203
TEAM*log[SIZE]    0.323318
FLEX*log[SIZE]    0.354658
RESL*log[SIZE]    1.32890
log[PCAP]         1.20332
log[RELY]         0.641228
log[CPLX]         1.03515
log[TIME]         1.58101
log[STOR]         0.784218
log[ACAP]         0.926205
log[PEXP]         0.755345
log[LTEX]         0.171569
log[DATA]         0.783232
log[RUSE]         -0.339964
log[DOCU]         2.05772
log[PVOL]         0.867162
log[AEXP]         0.137859
log[PCON]         0.488392
log[TOOL]         0.551063
log[SITE]         0.674702
log[SCED]         1.11858
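To make the mechanics of this step concrete, the following is a minimal sketch, using synthetic data and hypothetical variable names, of how ordinary least squares estimates and their t-values are obtained from log-transformed effort data. It is an illustration only, not the authors' RCode analysis, and the two predictors stand in for the full set of COCOMO II parameters.

import numpy as np

rng = np.random.default_rng(0)
n = 161                                      # projects, as in the 1998 dataset
log_size = np.log(rng.uniform(2, 150, n))    # log[SIZE], size in KSLOC
log_rely = np.log(rng.uniform(0.8, 1.3, n))  # log of one cost-driver multiplier
log_pm = 1.0 + 0.92 * log_size + 1.0 * log_rely + rng.normal(0, 0.3, n)

X = np.column_stack([np.ones(n), log_size, log_rely])   # design matrix
beta, *_ = np.linalg.lstsq(X, log_pm, rcond=None)       # OLS estimates

resid = log_pm - X @ beta
s2 = resid @ resid / (n - X.shape[1])                   # residual variance
cov = s2 * np.linalg.inv(X.T @ X)                       # covariance of estimates
t_values = beta / np.sqrt(np.diag(cov))                 # signal-to-noise ratios

for label, b, t in zip(["Constant_A", "log[SIZE]", "log[RELY]"], beta, t_values):
    print(f"{label:12s} estimate={b: .4f}  t-value={t: .2f}")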
The above results provide the estimates for the β coefficients associated with each of the predictor variables (see (3)). The t-value (the ratio between the estimate and the corresponding standard error, where the standard error is the square root of the variance) may be interpreted as the signal-to-noise ratio associated with the corresponding predictor variable. Hence, the higher the t-value, the stronger the signal (i.e., statistical significance) being sent by the predictor variable. These coefficients can be used to
Fig. 3. Distribution of effort and size: 1998 dataset of 161 observations.
Fig. 4. Distribution of log transformed effort and size: 1998 dataset of 161 observations.
adjust the a priori Productivity Ranges (PRs) to determine the data-determined PRs for each of the 22 parameters. For example, the data-determined PR for RUSE = (1.73)^−0.34, where 1.73 is the a priori PR as shown in Table 5. While the regression provides intuitively reasonable estimates for most of the predictor variables, the negative coefficient estimate for RUSE (as discussed earlier) and the magnitudes of the coefficients on Applications Experience (AEXP), Language and Tool Experience (LTEX), Development Flexibility (FLEX), and Team Cohesion (TEAM) violate our prior opinion about the impact of these parameters on effort (i.e., PM). The quality of the data probably explains some of the conflicts between the prior information and the sample data. However, when compared to the results reported in Section 2, these regression results (using 161 datapoints) produced better estimates: only RUSE has a
Fig. 5. Correlation between log[effort] and log[size].
negative coefficient associated with it, compared to PREC, RESL, LTEX, DOCU, and RUSE in the regression results using only 83 datapoints. Thus, adding more datapoints (which results in an increase in the degrees of freedom) reduced the problems of counterintuitive results.

2.4 Combining Prior and Sampling Information: Posterior Bayesian Update
As a means of resolving the above conflicts, we will now use the Bayesian paradigm as a means of formally combining prior expert judgment with our sample data. Equation (6) reveals that if the precision of the a priori information (H*) is bigger (or the variance of the a priori information is smaller) than the precision (1/s²)X′X (or the variance) of the sampling information, the posterior values will be closer to the a priori values. This situation can arise when the gathered data is noisy, as depicted in Fig. 6 for an example cost factor, Develop for Reuse. Fig. 6 illustrates that the degree of belief in the prior information is higher than the degree of belief in the sample data. As a consequence, a stronger weight is assigned to the prior information, causing the posterior mean to be closer to the prior mean. On the other hand (not illustrated), if the precision of the sampling information (1/s²)X′X is larger than the precision of the prior information (H*), then a higher weight is assigned to the sampling information, causing the posterior mean to be closer to the mean of the sampling data. The resulting posterior precision will always be higher than the a priori precision and the sample data precision.
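As an illustration of the update in (6), the sketch below implements the posterior mean and variance for a normal linear model. The data and prior values are hypothetical, chosen so that a tight expert prior overrides a counterintuitive sample slope, in the spirit of the RUSE example.

import numpy as np

def bayesian_update(X, y, s2, b_star, H_star):
    # Posterior mean/variance per (6): a precision-weighted blend of the
    # prior (b_star, H_star) and the sample information.
    XtX = X.T @ X
    b_sample = np.linalg.solve(XtX, X.T @ y)       # OLS mean of the sample
    post_precision = XtX / s2 + H_star             # posterior precision
    post_var = np.linalg.inv(post_precision)
    post_mean = post_var @ (XtX @ b_sample / s2 + H_star @ b_star)
    return post_mean, post_var

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([1.0, -0.3]) + rng.normal(0, 0.5, 100)   # noisy sample slope -0.3
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
s2 = resid @ resid / (100 - 2)

b_star = np.array([1.0, 0.8])    # hypothetical expert prior: slope +0.8
H_star = np.diag([1.0, 100.0])   # precision = 1/variance; tight on the slope
post_mean, _ = bayesian_update(X, y, s2, b_star, H_star)
# With the tight prior, post_mean[1] lands near +0.8 rather than near -0.3,
# mirroring how a counterintuitive coefficient is pulled toward expert opinion.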
Fig. 6. A posteriori Bayesian update in the presence of noisy data (develop for reuse, RUSE).
Fig. 7. Bayesian a posteriori productivity ranges.
Note that if the prior variance of any parameter is zero, then the parameter will be completely determined by the prior information. Although this is a restriction imposed by the Bayesian approach, it is of little concern, as the situation of complete consensus very rarely arises in the software engineering domain.

The complete Bayesian analysis on COCOMO II yields the Productivity Ranges (the ratio between the least productive parameter rating, i.e., the highest rating, and the most productive parameter rating, i.e., the lowest rating) illustrated in Fig. 7. Fig. 7 gives an overall perspective of the relative software Productivity Ranges (PRs) provided by the COCOMO II.1998 parameters. The PRs provide insight for identifying the high-payoff areas to focus on in a software productivity improvement activity. For example, Product Complexity (CPLX) is the highest payoff parameter and Development Flexibility (FLEX) is the lowest payoff parameter. The variance associated with each parameter is indicated along each bar. This indicates that even though
the two parameters Multisite Development (SITE) and Documentation Match to Life Cycle Needs (DOCU) have the same PR, the PR of SITE (variance of 0.007) is predicted with more than five times the certainty of the PR of DOCU (variance of 0.037).

The resulting COCOMO II.1998 model, calibrated to 161 datapoints, produces estimates within 30 percent of the actuals 75 percent of the time for effort. If the model's multiplicative coefficient is calibrated to each of the 18 major sources of project data, the resulting model (with the coefficient ranging from 1.5 to 4.1) produces estimates within 30 percent of the actuals 80 percent of the time. It is therefore recommended that organizations using the model calibrate it with their own data to increase model accuracy and produce a local optimum estimate for similar types of projects.
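The PRED accuracy measure quoted here and in the tables can be computed as in the following sketch; the effort values below are hypothetical.

def pred(actuals, estimates, level=0.30):
    # Fraction of projects whose estimate falls within 100*level percent
    # of the actual effort, e.g., PRED(.30).
    within = sum(abs(est - act) / act <= level
                 for act, est in zip(actuals, estimates))
    return within / len(actuals)

actuals = [100.0, 250.0, 40.0, 600.0]      # person-months (hypothetical)
estimates = [120.0, 230.0, 70.0, 580.0]
print(pred(actuals, estimates, 0.30))      # 0.75: within 30 percent for 3 of 4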
TABLE 6
Prediction Accuracies of COCOMO II.1997, A Priori COCOMO II.1998, and Bayesian A Posteriori COCOMO II.1998, Before and After Stratification

                COCOMO II.1997    COCOMO II.1997    A Priori COCOMO II.1998           Bayesian A Posteriori COCOMO II.1998
Prediction      (83 datapoints)   (161 datapoints)  (Delphi results, 161 datapoints)  (161 datapoints)
Accuracy        Before   After    Before   After    Before   After                    Before   After
PRED(.20)       46%      49%      54%      57%      48%      54%                      63%      70%
PRED(.25)       49%      55%      59%      65%      55%      63%                      68%      76%
PRED(.30)       52%      64%      63%      67%      61%      65%                      75%      80%
From Table 6, it is clear that the prediction accuracy of the COCOMO II.1998 model calibrated using the Bayesian approach is better than the prediction accuracy of the COCOMO II.1997 model (used on the 1997 dataset of 83 datapoints as well as the 1998 dataset of 161 datapoints) and of the a priori COCOMO II.1998 model, which is based on the expert opinion gathered via the Delphi exercise. The full set of model parameters for the Bayesian a posteriori COCOMO II.1998 model is given in Appendix A.

2.5 Cross-Validation of the Bayesian Calibrated Model

The COCOMO II.1998 Bayesian calibration discussed above uses the complete dataset of 161 datapoints, and the prediction accuracies of COCOMO II.1998 (depicted in Table 6) are based on that same dataset. That is, the calibration and validation datasets are the same. A natural question that arises in this context is how well the model will predict new software development projects. To address this issue, we randomly selected 121 observations for our calibration dataset, with the remaining 40 assigned to the validation dataset (i.e., "new" data). We repeated this process 15 times, creating 15 calibration and 15 validation datasets. We used the resulting a posteriori models to predict the development effort of the 40 "new" projects in each validation dataset. This validation approach, known as out-of-sample validation, provides a truer measure of the model's predictive abilities. The out-of-sample test yielded an average PRED(.30) of 69 percent, indicating that, on average, the out-of-sample validation produced estimates within 30 percent of the actuals 69 percent of the time. Hence, we conclude that our Bayesian model has reasonably good predictive qualities.
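The out-of-sample protocol just described can be sketched as follows; here `calibrate` and `estimate` are hypothetical stand-ins for the Bayesian calibration and the fitted effort model, and each project record is assumed to carry its actual effort under the key "pm".

import numpy as np

def out_of_sample_pred(projects, calibrate, estimate, trials=15, n_train=121):
    # Repeat random 121/40 splits, calibrating on each training split and
    # scoring PRED(.30) on the held-out "new" projects.
    rng = np.random.default_rng(42)
    scores = []
    for _ in range(trials):
        idx = rng.permutation(len(projects))
        train = [projects[i] for i in idx[:n_train]]
        test = [projects[i] for i in idx[n_train:]]
        model = calibrate(train)                 # a posteriori model
        within = [abs(estimate(model, p) - p["pm"]) / p["pm"] <= 0.30
                  for p in test]
        scores.append(sum(within) / len(within))
    return sum(scores) / len(scores)             # average PRED(.30)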
2.6 Reduced Model

When calibrating COCOMO II, the three main problems we faced in our data were: 1) lack of degrees of freedom, 2) some highly correlated predictor variables, and 3) measurement error for a few predictor variables. These limitations led to some of the regression results being counterintuitive. The posterior Bayesian update discussed above alleviated these problems by incorporating expert-judgment-derived prior information into the calibration process. But such prior information may not always be available. So what must one do in the absence of good prior information? One way to address this problem is to reduce overfitting by developing a more parsimonious model. This alleviates the first two problems listed above. Unfortunately, our data doesn't lend itself to alleviating the third problem of measurement error, as discussed in Section 2.

Consider a reduced model developed by using a backward elimination technique:

Dataset = COCOMO II.1998
Response = log[PM]

Coefficient Estimates

Label        Estimate   Std. Error   t-value
log[SITE]    1.44428    0.437796     3.299
log[SCED]    1.06009    0.286442     3.701

The above results have no counterintuitive estimates for the coefficients associated with the predictor variables. The high t-ratio associated with each of these variables indicates a significant impact by each of the predictor variables. The highest correlation between any two predictor variables is 0.5, between RELY and CPLX. Overall, the above results are statistically acceptable. This COCOMO II reduced model gives the accuracy results shown in Table 7.

TABLE 7
Prediction Accuracies of Reduced COCOMO II.1998

Prediction Accuracy    Reduced COCOMO II.1998
PRED(.20)              54%
PRED(.25)              64%
PRED(.30)              73%

These accuracy results are a little worse than the results obtained by the Bayesian a posteriori COCOMO II.1998 model, but the model is more parsimonious. In practice, removing a predictor variable is equivalent to stipulating
that variations in this variable have no effect on project effort. When our experts and our behavioral analyses tell us otherwise, we need extremely strong evidence to drop a variable. The authors believe that dropping variables for an individual organization via local calibration of the Bayesian a posteriori COCOMO II.1998 model is a sounder option.
3 CONCLUSIONS

As shown in Table 6 and Table 7 of this paper, the estimation accuracy of the Bayesian a posteriori COCOMO II.1998 model on the 161-project sample is better than the accuracies of the best version of COCOMO II.1997, the 1998 a priori model, and a version of COCOMO II.1998 with a reduced set of variables obtained by backward elimination. The improvement over the 1997 model provides evidence that the 1998 Bayesian variable-by-variable accommodation of expert prior information is stronger than the 1997 approach of one-factor-fits-all averaging of expert data and regression data.

Overall, the class of Bayesian estimation models presented here provides a formal process for merging expert prior information with software engineering data. In many traditional models, such prior information is informally used to evaluate the "appropriateness" of the results. Having a formal mechanism for incorporating expert prior information, however, gives users of the cost model the flexibility to obtain predictions and calibrations based on a different set of prior information. Such Bayesian estimation models enable the software engineering community to more adequately address the challenge of making good decisions when the data is scarce and incomplete. These models improve predictive performance and resolve problems associated with counterintuitive estimates when other traditional approaches are employed. We are currently using the approach to develop similar models to estimate the delivered defect density of software products and the cost of integrating commercial-off-the-shelf (COTS) software.

APPENDIX A

This appendix gives the acronyms and full forms of the 22 COCOMO II Post-Architecture cost drivers and their associated COCOMO II.1998 rating scales (see Table 8). For a further explanation of these parameters, please refer to [30].

TABLE 8
Acronyms, Full Forms, and Rating Scales of the COCOMO II Parameters

ACKNOWLEDGMENTS

This work was supported, both financially and technically, under AFRL Contract No. F30602-96-C-0274, "KBSA Life Cycle Evaluation," and by the COCOMO II Program Affiliates: Aerospace, Air Force Cost Analysis Agency, Allied Signal, AT&T, Bellcore, EDS, Raytheon E-Systems, GDE Systems, Hughes, IDA, JPL, Litton, Lockheed Martin, Loral, MCC, MDAC, Motorola, Northrop Grumman, Rational, Rockwell, SAIC, SEI, SPC, Sun, TI, TRW, USAF Rome Lab, U.S. Army Research Labs, and Xerox.
REFERENCES

[1] B.W. Boehm, Software Engineering Economics. Englewood Cliffs, N.J.: Prentice-Hall, 1981.
[2] B. Boehm, B. Clark, E. Horowitz, C. Westland, R. Madachy, and R. Selby, "Cost Models for Future Software Life Cycle Processes: COCOMO 2.0," Annals of Software Eng. Special Volume on Software Process and Product Measurement, J.D. Arthur and S.M. Henry, eds., vol. 1, Amsterdam, The Netherlands: J.C. Baltzer AG, Science Publishers, 1995.
[3] G. Box and G. Tiao, Bayesian Inference in Statistical Analysis. Addison-Wesley, 1973.
[4] L.C. Briand, V.R. Basili, and W.M. Thomas, "A Pattern Recognition Approach for Software Engineering Data Analysis," IEEE Trans. Software Eng., vol. 18, no. 11, Nov. 1992.
[5] S. Chulani, "Incorporating Bayesian Analysis to Improve the Accuracy of COCOMO II and Its Quality Model Extension," Qualifying Exam Report, Computer Science Dept., USC Center for Software Eng., Feb. 1998.
[6] S. Chulani, B. Clark, and B. Boehm, "Calibration Results of COCOMO II.1997," Proc. Int'l Conf. Software Eng., Apr. 1998.
[7] S.D. Conte, Software Engineering Metrics and Models. Menlo Park, Calif.: Benjamin/Cummings, 1986.
[8] D. Cook and S. Weisberg, An Introduction to Regression Graphics. Wiley Series, 1994.
[9] A.M. Cuelenaere, M. van Genuchten, and F.J. Heemstra, "Calibrating a Software Cost Estimation Model: Why and How," Information and Software Technology, vol. 29, no. 10, pp. 558-567, 1987.
[10] N.E. Fenton, Software Metrics: A Rigorous Approach. London: Chapman & Hall, 1991.
[11] A. Gelman, J. Carlin, H. Stern, and D. Rubin, Bayesian Data Analysis. Chapman & Hall, 1995.
[12] O. Helmer, Social Technology. New York: Basic Books, 1966.
[13] International Function Point Users Group (IFPUG), Function Point Counting Practices Manual, Release 4.0, 1994.
[14] D.R. Jeffery and G.C. Low, "Calibrating Estimation Tools for Software Development," Software Eng. J., vol. 5, no. 4, pp. 215-221, 1990.
[15] R.W. Jensen, "An Improved Macrolevel Software Development Resource Estimation Model," Proc. Fifth ISPA Conf., pp. 88-92, Apr. 1983.
[16] E.J. Johnson, "Expertise and Decision Under Uncertainty: Performance and Process," The Nature of Expertise, Chi, Glaser, and Farr, eds., Lawrence Erlbaum Assoc., 1988.
[17] C. Jones, Applied Software Measurement. McGraw-Hill, 1997.
[18] G.G. Judge, W. Griffiths, and R. Carter Hill, Learning and Practicing Econometrics. Wiley, 1993.
[19] C.F. Kemerer, "An Empirical Validation of Software Cost Models," Comm. ACM, vol. 30, no. 5, pp. 416-429, 1987.
[20] B.A. Kitchenham and N.R. Taylor, "Software Cost Models," ICL Technical J., vol. 1, May 1984.
[21] E.E. Leamer, Specification Searches, Ad Hoc Inference with Nonexperimental Data. Wiley Series, 1978.
[22] T.F. Masters, "An Overview of Software Cost Estimating at the National Security Agency," J. Parametrics, vol. 5, no. 1, pp. 72-84, 1985.
[23] S.N. Mohanty, "Software Cost Estimation: Present and Future," Software Practice and Experience, vol. 11, pp. 103-121, 1981.
[24] G.M. Mullet, "Why Regression Coefficients Have the Wrong Sign," J. Quality Technology, 1976.
[25] L.H. Putnam and W. Myers, Measures for Excellence. Yourdon Press Computing Series, 1992. http://www.qsm.com/slim_estimate.html
[26] R.M. Park et al., "Software Size Measurement: A Framework for Counting Source Statements," CMU-SEI-92-TR-20, Software Eng. Inst., Pittsburgh, Pa., 1992.
[27] J.S. Poulin, Measuring Software Reuse: Principles, Practices and Economic Models. Addison-Wesley, 1997.
[28] H. Rubin, "ESTIMACS," IEEE, 1983.
[29] M. Shepperd and M. Schofield, "Estimating Software Project Effort Using Analogies," IEEE Trans. Software Eng., vol. 23, no. 12, Nov. 1997.
[30] Center for Software Engineering, "COCOMO II Cost Estimation Questionnaire," Computer Science Dept., USC Center for Software Eng., 1997. http://sunset.usc.edu/Cocomo.html
[31] Center for Software Engineering, "COCOMO II Model Definition Manual," Computer Science Dept., USC Center for Software Eng., 1997. http://sunset.usc.edu/Cocomo.html
[32] S. Vicinanza, T. Mukhopadhyay, and M. Prietula, "Software Effort Estimation: An Exploratory Study of Expert Performance," Information Systems Research, vol. 2, no. 4, pp. 243-262, 1991.
[33] F. Walkerden and D. Ross Jeffery, "Software Cost Estimation: A Review of Models, Process and Practices," Advances in Computers, 1997.
[34] S. Weisberg, Applied Linear Regression, second ed. New York: John Wiley & Sons, 1985.
[35] "Applications of Bayesian Analysis and Econometrics," The Statistician, vol. 32, pp. 23-34, 1983.

Sunita Chulani holds the BE degree in computer engineering from Bombay University, India, and the MS and PhD degrees in computer science from the Center for Software Engineering, University of Southern California. Dr. Chulani is a research staff member at IBM Research. She is also a visiting associate at USC, participating in the COCOMO research effort. Her main contributions have included the Bayesian calibration approach in COCOMO II and the development of COQUALMO, a cost/quality model. Her main interests include software process improvement, software reliability modeling, and software metrics and cost modeling. She is a member of the IEEE and the IEEE Computer Society.

Barry Boehm received his BA degree from Harvard in 1957 and his MS and PhD degrees from the University of California at Los Angeles in 1961 and 1964, respectively, all in mathematics. Dr. Boehm was with the DARPA Information Science and Technology Office, TRW, the Rand Corporation, and General Dynamics. He is currently the director of the Center for Software Engineering at USC. His current research interests focus on integrating a software system's process model, product model, property model, and success model via an approach called Model-Based Architecting and Software Engineering (MBASE). His contributions to the field include the Constructive Cost Model (COCOMO), the Spiral Model of the software process, and the Theory W (win-win) approach to software management and requirements determination. He has served on the board of several scientific journals and as a member of the governing board of the IEEE Computer Society. He currently serves as chair of the Board of Visitors for the CMU Software Engineering Institute. He is a fellow of the IEEE, AIAA, and ACM, and a member of the IEEE Computer Society and the National Academy of Engineering.

Bert M. Steece is deputy dean of faculty and professor in the Information and Operations Management Department at the University of Southern California and is a specialist in statistics. His research areas include statistical modeling, time series analysis, and statistical computing. He is on the editorial board of Mathematical Reviews and has served on various committees for the American Statistical Association. Steece has consulted on a variety of subjects, including forecasting, accounting, health care systems, legal cases, and chemical engineering.
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 21, NO. 2, FEBRUARY 1995
Machine Learning Approaches to Estimating Software Development Effort

Krishnamoorthy Srinivasan and Douglas Fisher, Member, IEEE
Abstract—Accurate estimation of software development effort is critical in software engineering. Underestimates lead to time pressures that may compromise full functional development and thorough testing of software. In contrast, overestimates can result in noncompetitive contract bids and/or overallocation of development resources and personnel. As a result, many models for estimating software development effort have been proposed. This article describes two methods of machine learning, which we use to build estimators of software development effort from historical data. Our experiments indicate that these techniques are competitive with traditional estimators on one dataset, but also illustrate that these methods are sensitive to the data on which they are trained. This cautionary note applies to any model-construction strategy that relies on historical data. All such models for software effort estimation should be evaluated by exploring model sensitivity on a variety of historical data.

Index Terms—Software development effort, machine learning, decision trees, regression trees, and neural networks.
Manuscript received October 1992; revised October 1993 and October 1994. Recommended by D. Wile. D. Fisher's work was supported by NASA Ames Grant NAG 2-834. K. Srinivasan is with Personal Computer Consultants, Inc., Washington, D.C. D. Fisher is with the Department of Computer Science, Vanderbilt University, Nashville, Tennessee (e-mail: [email protected]). IEEE Log Number 9408517.

I. INTRODUCTION

ACCURATE estimation of software development effort has major implications for the management of software development. If management's estimate is too low, then the software development team will be under considerable pressure to finish the product quickly, and hence the resulting software may not be fully functional or tested. Thus, the product may contain residual errors that need to be corrected during a later part of the software life cycle, in which the cost of corrective maintenance is greater. On the other hand, if a manager's estimate is too high, then too many resources will be committed to the project. Furthermore, if the company is engaged in contract software development, then too high an estimate may fail to secure a contract.

The importance of software effort estimation has motivated considerable research in recent years. Parametric models such as COCOMO [3], FUNCTION POINTS [2], and SLIM [16] "calibrate" prespecified formulas for estimating development effort from historical data. Inputs to these models may include the experience of the development team, the required reliability of the software, the programming language in which the software is to be written, and an estimate of the final number of delivered source lines of code (SLOC). In contrast, many methods of machine learning make no or minimal assumptions about the form of the function under study (e.g., development effort), but as with other approaches they depend on historical data. In particular, over a known set of training data, the learning algorithm constructs "rules" that fit the data, and which hopefully fit previously unseen data in a reasonable manner as well. This article illustrates machine learning approaches to estimating software development effort using an algorithm for building regression trees [4] and a neural-network learning approach known as BACKPROPAGATION [19]. Our experiments, using established case libraries [3], [11], indicate possible advantages of the approach relative to traditional models, but also point to limitations that motivate continued research.

II. MODELS FOR ESTIMATING SOFTWARE DEVELOPMENT EFFORT

Many models have been developed to estimate software development effort. Many of these models are parametric, in that they predict development effort using a formula of fixed form that is parameterized from historical data. In preparation for later discussion we summarize three such models that were highlighted in a previous study by Kemerer [11].

Putnam [16] developed an early model known as SLIM, which estimates the cost of software by using SLOC as the major input. The underlying assumption of this model is that resource consumption, including personnel, varies with time and can be modeled with some degree of accuracy by the Rayleigh distribution:

Rc = (Kt/k²) e^(−t²/2k²)

where Rc is the instantaneous resource consumption, t is the time into the development effort, and k is the time at which consumption is at its peak. The parameter k and other "management parameters" are estimated from characteristics of a particular software project, notably estimated SLOC. The general relationship between inputs such as SLOC and management parameters can be determined from historical data.

The COnstructive COst MOdel (COCOMO) was developed by Boehm [3] based on a regression analysis of 63 completed projects. COCOMO relates the effort required to develop a software project (in terms of person-months) to Delivered Source Instructions (DSI). Thus, like SLIM, COCOMO assumes SLOC as a major input. If the software project is judged to
be straightforward, then the basic COCOMO model (COCOMO-basic) relates the nominal development effort (N) and DSI as follows:

N = 3.2 × (KDSI)^1.05
where KDSI is the DSI in 1000s. However, the prediction of the basic COCOMO model can be modified using cost drivers. Cost drivers are classified under four major headings relating to attributes of the product (e.g., required software reliability), computer platform (e.g., main memory limitations), personnel (e.g., analyst capability), and the project (e.g., use of modern programming practices). These factors serve to adjust the nominal effort up or down. These cost drivers and other considerations extend the basic model to intermediate and final forms.

The Function Point method was developed by Albrecht [2]. Function points are based on characteristics of the project that are at a higher descriptive level than SLOC, such as the number of input transaction types and the number of reports. A notable advantage of this approach is that it does not rely on SLOC, which facilitates estimation early in the project life cycle (i.e., during requirements definition) and by nontechnical personnel. To count function points requires that one count user functions and then make adjustments for processing complexity. There are five types of user function that are included in the function point calculation: external input types, external output types, logical internal file types, external interface file types, and external inquiry types. In addition, there are 14 processing complexity characteristics, such as transaction rates and online updating. A function point count is calculated based on the number of transactions and the complexity characteristics. The development effort estimate given the function point count, F, is: N = 54 × F − 13390.

Recently, a case-based approach called ESTOR was developed for software effort estimation. This model was developed by Vicinanza et al. [23] by obtaining protocols from a human expert. From a library of cases developed from expert-supplied protocols, an instance called the source is retrieved that is most "similar" to the target problem to be solved. The solution of the most similar problem retrieved from the case library is adapted to account for differences between the source problem and the target problem, using rules inferred from analysis of the human expert's protocols. An example of an adjustment rule is:

IF staff size of Source project is small, AND staff size of Target is large, THEN increase effort estimate of Target by 20%.

Vicinanza et al. have shown that ESTOR performs better than COCOMO and FUNCTION POINTS on restricted samples of problems. In sum, there have been a variety of models developed for estimating development effort. With the exception of ESTOR, these are parametric approaches that assume that an initial estimate can be provided by a formula that has been fit to historical data.
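For illustration, the two fixed-form estimators quoted above can be written directly. This is a sketch only; the 1.05 exponent is the standard basic-COCOMO value for straightforward ("organic"-mode) projects, supplied here as an assumption since only the 3.2 coefficient is legible in the text.

def cocomo_basic_months(kdsi):
    # Nominal development effort N from size in thousands of DSI.
    return 3.2 * kdsi ** 1.05

def function_point_months(f):
    # Development effort estimate from a function-point count F.
    return 54 * f - 13390

print(cocomo_basic_months(50))     # about 195 person-months for 50 KDSI
print(function_point_months(300))  # 2810 person-months for 300 function points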
III. MACHINE LEARNING APPROACHES TO ESTIMATING DEVELOPMENT EFFORT

This section describes two machine learning strategies that we use to estimate software development effort, which we assume is measured in development months (M). In many respects this work stems from a more general methodology for developing expert systems. Traditionally, expert systems have been developed by extracting the rules that experts apparently use via an interview process or protocol analysis (e.g., ESTOR), but an alternate approach is to allow machine learning programs to formulate rulebases from historical data. This methodology requires historical data on which to apply learning strategies.

There are several aspects of software development effort estimation that make it amenable to machine learning analysis. Most important, previous researchers have identified at least some of the attributes relevant to software development effort estimation, and historical databases defined over these relevant attributes have been accumulated. The following sections describe two very different learning algorithms that we use to test the machine learning approach. Other research using machine learning techniques for software resource estimation is found in [5], [14], [15], [22], which we will discuss throughout the paper. In short, our work adds to the collection of machine learning techniques available to software engineers, and our analysis stresses the sensitivity of these approaches to the nature of historical data and other factors.
A. Learning Decision and Regression Trees

Many learning approaches have been developed that construct decision trees for classifying data [4], [17]. Fig. 1 illustrates a partial decision tree over Boehm's original 63 projects from which COCOMO was developed. Each project is described over dimensions such as AKDSI (i.e., adjusted delivered source instructions), TIME (i.e., the required system response time), and STOR (i.e., main memory limitations). The complete set of attributes used to describe these data is given in Appendix A. The mean of actual project development months labels each leaf of the tree. Predicting development effort for a project requires that one descend the decision tree along an appropriate path, and the leaf value along that path gives the estimate of development effort for the new project. The decision tree in Fig. 1 is referred to as a regression tree, because the intent of categorization is to generate a prediction along a continuous dependent dimension (here, software development effort).

Fig. 1. A regression tree over Boehm's 63 software project descriptions. Numbers in square brackets represent the number of projects classified under a node.

There are many automatic methods for constructing decision and regression trees from data, but these techniques are typically variations on one simple strategy. A "top-down" strategy examines the data and selects an attribute that best divides the data into disjoint subpopulations. The most important aspect of decision and regression tree learners is the criterion used to select a "divisive" attribute during tree construction. In one variation, the system selects the attribute with values that maximally reduce the mean squared error (MSE) of the dependent dimension (e.g., software development effort) observed in the training data. The MSE of any set, S, of training examples taking on values yₖ in the continuous dependent dimension is:

MSE(S) = Σₖ (yₖ − ȳ)²

where ȳ is the mean of the yₖ values exhibited in S. The values of each attribute, Aᵢ, partition the entire training data set, T, into subsets, Tᵢⱼ, where every example in Tᵢⱼ takes on the same value, say Vⱼ, for attribute Aᵢ. The attribute, Aᵢ, that maximizes the difference:

ΔMSE = MSE(T) − Σⱼ MSE(Tᵢⱼ)

is selected to divide the tree. Intuitively, the attribute that minimizes the error over the dependent dimension is used. While MSE values are computed over the training data, the inductive assumption is that selected attributes will similarly reduce error over future cases as well.

This basic procedure of attribute selection is easily extended to allow continuously-valued attributes: all ordered 2-partitions of the observed values in the training data are examined. In essence, the dimension is split around each observed value. The effect is to 2-partition the dimension in k − 1 alternate ways (where k is the number of observed values), and the binary "split" that is best according to ΔMSE is considered along with other possible attributes to divide a regression-tree node. Such "splitting" is common in the tree of Fig. 1; see AKDSI, for example. Approaches have also been developed that split a continuous dimension into more than two ranges [9], [15], though we will assume 2-partitions only. Similarly, techniques that 2-partition all attribute domains, for both continuous and nominally-valued (i.e., finite, unordered) attributes, have been explored (e.g., [24]). For continuous attributes this bisection process operates as we have just described, but for a nominally-valued attribute all ways to group the values of the attribute into two disjoint sets are considered. Suffice it to say that treating all attributes as though they had the same number of values (e.g., 2) for purposes of attribute selection mitigates certain biases that are present in some attribute selection measures (e.g., ΔMSE). As we will note again in Section IV, we ensure that all attributes are either continuous or binary-valued at the outset of regression-tree construction.

The basic regression-tree learning algorithm is summarized in Fig. 2:

FUNCTION CARTX(Instances)
  IF termination-condition(Instances)
  THEN RETURN mean among Instances
  ELSE Set Best-Attribute to most informative attribute among the Instances.
       RETURN node testing Best-Attribute, with subtrees
         CARTX({I | I is an Instance with value V1 of Best-Attribute}),
         . . .,
         CARTX({I | I is an Instance with value Vn of Best-Attribute})

Fig. 2. The basic regression-tree construction procedure.

The data set is first tested to see whether tree construction is worthwhile; if all the data are classified identically or some other statistically-based criterion is satisfied, then expansion ceases. In this case, the algorithm simply returns a leaf labeled by the mean value of the dependent dimension found in the training data. If the data are not sufficiently distinguished, then the best divisive attribute according to ΔMSE is selected, the attribute's values are used to partition the data into subsets, and the procedure is recursively called on these subsets to expand the tree. When used to construct predictors along continuous dimensions, this general procedure is referred
to as recursive-partitioning regression. Our experiments use a partial reimplementation of a system known as CART [4]; we refer to our reimplementation as CARTX. Previously, Porter and Selby [14], [15], [22] investigated the use of decision-tree induction for estimating development effort and other resource-related dimensions. Their work assumes that if predictions over a continuous dependent dimension are required, then the continuous dimension is "discretized" by breaking it into mutually-exclusive ranges. More commonly used decision-tree induction algorithms, which assume discrete-valued dependent dimensions, are then applied to the appropriately classified data. In many cases this preprocessing of a continuous dependent dimension may be profitable, though regression-tree induction demonstrates that the general tree-construction approach can be adapted for direct manipulation of a continuous dependent dimension. This is also the case with the learning approach that we describe next.
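The following is a compact sketch of recursive-partitioning regression in the spirit of CARTX, using the ΔMSE criterion described above. It is an illustration under stated assumptions (numeric attribute matrix X, effort vector y), not the authors' implementation.

import numpy as np

def mse(y):
    # Sum of squared deviations from the mean (zero for an empty set).
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def build_tree(X, y, depth=0, max_depth=4, min_leaf=2):
    # Return a leaf once the data is too small or the tree is deep enough.
    if depth >= max_depth or len(y) <= min_leaf:
        return {"leaf": float(y.mean())}
    best = None
    for j in range(X.shape[1]):               # each candidate attribute
        for v in np.unique(X[:, j])[:-1]:     # each ordered 2-partition
            left = X[:, j] <= v
            gain = mse(y) - (mse(y[left]) + mse(y[~left]))   # delta-MSE
            if best is None or gain > best[0]:
                best = (gain, j, v, left)
    if best is None or best[0] <= 0:
        return {"leaf": float(y.mean())}
    _, j, v, left = best
    return {"attr": j, "split": v,
            "lo": build_tree(X[left], y[left], depth + 1, max_depth, min_leaf),
            "hi": build_tree(X[~left], y[~left], depth + 1, max_depth, min_leaf)}

def predict(tree, x):
    # Descend to a leaf; its mean is the effort estimate.
    while "leaf" not in tree:
        tree = tree["lo"] if x[tree["attr"]] <= tree["split"] else tree["hi"]
    return tree["leaf"]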
Fig. 3. A network architecture for software development effort estimation.

Fig. 4. An example of function approximation by a regression tree.
B. A Neural Network Approach to Learning

A learning approach that is very different from that outlined above is BACKPROPAGATION, which operates on a network of simple processing elements as illustrated in Fig. 3. This basic architecture is inspired by biological nerve nets, and is thus called an artificial neural network. Each line between processing elements has a corresponding and distinct weight. Each processing unit in this network computes a nonlinear function of its inputs and passes the resultant value along as its output. The favored function is

o = 1 / (1 + e^(−Σᵢ wᵢIᵢ))

where Σᵢ wᵢIᵢ is a weighted sum of the inputs, Iᵢ, to a processing element [19], [25]. The network generates output by propagating the initial inputs, shown on the left-hand side of Fig. 3, through subsequent layers of processing elements to the final output layer. This net illustrates the kind of mapping that we will use for estimating software development effort, with inputs corresponding to various project attributes and the output line corresponding to the estimated development effort. The inputs and output are restricted to numeric values. For numerically-valued attributes this mapping is natural, but for nominal data such as LANG (implementation language), a numeric representation must be found. In this domain, each value of a nominal attribute is given its own input line. If the value is present in an observation then the input line is set to 1.0, and if the value is absent then it is set to 0.0. Thus, for a given observation the input line corresponding to an observed nominal value (e.g., COB) will be 1.0, and the others (e.g., FTN) will be 0.0. Our application requires only one network output, but other applications may require more than one.

Details of the BACKPROPAGATION learning procedure are beyond the scope of this article, but intuitively the goal of learning is to train the network to generate appropriate output patterns for corresponding input patterns. To accomplish this,
comparisons are made between a network's actual output pattern and an a priori known correct output pattern. The difference or error between each output line and its corresponding correct value is "backpropagated" through the net and guides the modification of weights in a manner that will tend to reduce the collective error between actual and correct outputs on training patterns. The procedure has been shown to converge on accurate mappings between input and output patterns in a variety of domains [21], [25].
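A minimal sketch of the processing-element function and of one gradient step of the kind BACKPROPAGATION performs is shown below for a single sigmoid unit; a full multilayer network also propagates the error back through hidden layers. The weights, inputs, and target are hypothetical.

import math

def unit_output(weights, inputs):
    # o = 1 / (1 + e^-(sum_i w_i * I_i)), the favored nonlinearity.
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-s))

def backprop_step(weights, inputs, target, rate=0.5):
    # Move each weight against the error gradient for one training pattern.
    o = unit_output(weights, inputs)
    delta = (o - target) * o * (1.0 - o)     # error times sigmoid derivative
    return [w - rate * delta * x for w, x in zip(weights, inputs)]

w = [0.1, -0.2, 0.05]
for _ in range(1000):                        # repeated presentations
    w = backprop_step(w, [1.0, 0.5, 0.3], target=0.9)
print(unit_output(w, [1.0, 0.5, 0.3]))       # approaches the target 0.9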
C. Approximating Arbitrary Functions

In trying to approximate an arbitrary function like development effort, regression trees approximate the function with a "staircase" function. Fig. 4 illustrates a function of one continuous, independent variable. A regression tree decomposes this function's domain so that the mean at each leaf reflects the function's range within a local region. The "hidden" processing elements that reside between the input and output layers of a neural network do roughly the same thing, though the approximating function is generally smoothed. The granularity of this partitioning of the function is modulated by the depth of a regression tree or the number of hidden units in a network.

Each learning approach is nonparametric, since it makes no a priori assumptions about the form of the function being approximated. This contrasts with a wide variety of parametric methods for function approximation, such as the regression methods of statistics and the polynomial interpolation methods of numerical analysis [10]. Other nonparametric methods include genetic algorithms and nearest neighbor approaches [1], though we will not elaborate on any of these alternatives here.

D. Sensitivity to Configuration Choices

Both BACKPROPAGATION and CARTX require that the analyst make certain decisions about algorithm implementation. For example, BACKPROPAGATION can be used to train networks with differing numbers of hidden units. Too few hidden units can compromise the ability of the network to approximate a desired function. In contrast, too many hidden units can lead to "overfitting," whereby the learning system fits the "noise"
present in the training data, as well as the meaningful trends that we would like to capture. BACKPROPAGATION is also typically trained by iterating through the training data many times. In general, the greater the number of iterations, the greater the reduction in error over the training sample, though there is no general guarantee of this. Finally, BACKPROPAGATION assumes that the weights in the neural network are initialized to small, random values prior to training. The initial random weight settings can also impact learning success, though in many applications this is not a significant factor. There are other parameters that can affect BACKPROPAGATION's performance, but we will not explore these here. In CARTX, the primary dimension under control by the experimenter is the depth to which the regression tree is allowed to grow. Growth to too great a depth can lead to overfitting, and too little growth can lead to underfitting. Experimental results of Section IV-B illustrate the sensitivity of each learning system to certain configuration choices.
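A sensitivity sweep of the kind just described can be sketched as follows; build_tree and predict are from the earlier regression-tree sketch, and the train/test arrays are assumed given.

import numpy as np

def depth_sensitivity(X_train, y_train, X_test, y_test, depths=(1, 2, 4, 8)):
    # Vary one configuration choice (maximum depth) and record mean MRE.
    results = {}
    for d in depths:
        tree = build_tree(X_train, y_train, max_depth=d)
        errors = [abs(predict(tree, x) - y) / y for x, y in zip(X_test, y_test)]
        results[d] = float(np.mean(errors))
    return results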
IV. OVERVIEW OF EXPERIMENTAL STUDIES

We conducted several experiments with CARTX and BACKPROPAGATION for the task of estimating software development effort. In general, each of our experiments partitions historical data into samples used to train our learning systems, and disjoint samples used to test the accuracy of the trained classifier in predicting development effort.

For purposes of comparison, we refer to previous experimental results by Kemerer [11]. He conducted comparative analyses between SLIM, COCOMO, and FUNCTION POINTS on a database of 15 projects.¹ These projects consist mainly of business applications, with a dominant proportion of them (12/15) written in the COBOL language. In contrast, the COCOMO database includes instances of business, scientific, and system software projects, written in a variety of languages including COBOL, PL1, HMI, and FORTRAN. For comparisons involving COCOMO, Kemerer coded his 15 projects using the same attributes used by Boehm.

One way that Kemerer characterized the fit between the predicted (Mest) and actual (Mact) development person-months was by the magnitude of relative error (MRE):

MRE = |Mest − Mact| / Mact

This measure normalizes the difference between actual and predicted development months, and supplies an analyst with a measure of the reliability of estimates by different models. However, when using a model developed at one site for estimation at another site, there may be local factors that are not modeled, but which nonetheless impact development effort in a systematic way. Thus, following earlier work by Albrecht [2], Kemerer did a linear regression/correlation analysis to calibrate the predictions, with Mest treated as the independent variable and Mact treated as the dependent variable. The R² value indicates the amount of variation in the actual values accounted for by a linear relationship with the estimated values. R² values close to 1.0 suggest a strong linear relationship and those close to 0.0 suggest no such relationship. Our experiments will characterize the abilities of BACKPROPAGATION and CARTX using the same dimensions as Kemerer: MRE and R².

As we noted, each system imposes certain constraints on the representation of data. There are a number of nominally-valued attributes in the project databases, including implementation language. BACKPROPAGATION requires that each value of such an attribute be treated as a binary-valued attribute that is either present (1) or absent (0) in each project. Thus, each value of a nominal attribute corresponded to a unique input to the neural network, as noted in Section III-B. We represent each nominal attribute as a set of binary-valued attributes for CARTX as well. As we noted in Section III-A, this mitigates certain biases in attribute selection measures such as ΔMSE. In contrast, each continuous attribute identified by Boehm corresponded to one input to the neural network. There was one output unit, which reflected a prediction of development effort and was also continuous. Preprocessing for the neural network normalized these values between 0.0 and 1.0. A simple scheme was used where each value was divided by the maximum of the values for that attribute in the data. It has been shown that neural networks empirically converge more quickly if all the values for the attributes lie between zero and one. No such normalization was done for CARTX, since it would have no effect on CARTX's performance.

A. Experiment 1: Comparison with Kemerer's Results

Our first experiment compares the performance of machine learning algorithms with standard models of software development estimation, using Kemerer's data as a test sample. To test CARTX and BACKPROPAGATION, we trained each system on COCOMO's database of 63 projects and tested on Kemerer's 15 projects. For BACKPROPAGATION we initially configured the network with 33 input units, 10 hidden units, and 1 output unit, and required that the training set error reach 0.00001 or continue for a maximum of 12,000 presentations of the training data. Training ceased after 12,000 presentations without converging to the required error criterion. The experiment was done on an AT&T PC 386 under DOS. It required about 6-7 hours for 12,000 presentations of the training patterns. We actually repeated this experiment 10 times, though we only report the results of one run here; we summarize the complete set of experiments in Section IV-B.

In our initial configuration of CARTX, we allowed the regression tree to grow to a maximum depth, where each leaf represented a single software project description from the COCOMO data. We were motivated initially to extend the tree to singleton leaves because the data is very sparse relative to the number of dimensions used to describe each data point; our concern is not so much with overfitting as it is with underfitting the data. Experiments with the regression tree learner were performed on a SUN 3/60 under UNIX, and required about a minute. The predictions obtained from the learning algorithms (after training on the COCOMO data) are shown in Table I with the actual person-months of Kemerer's

¹We thank Professor Chris Kemerer for supplying this dataset.
TABLE I
CARTX and BACKPROPAGATION Estimates on Kemerer's Data

Actual      CARTX      BACKPROP
287.00      1893.30    81.45
82.50       162.03     14.14
1107.31     11400.00   1000.43
86.90       243.00     88.37
336.30      6600.00    540.42
84.00       129.17     13.16
23.20       129.17     45.38
130.30      243.00     78.92
116.00      1272.00    113.18
72.00       129.17     15.72
258.70      243.00     80.87
230.70      243.00     28.65
157.00      243.00     44.29
246.90      243.00     39.17
69.90       129.17     214.71

TABLE II
A Comparison of Learning and Algorithmic Approaches. The Regression Equations Give Mact as a Function of Mest (x)

Model        MRE(%)   R-Square   Regress. Eq.
CARTX        364      0.83       102.5 + 0.075x
BACKPROP     70       0.80       78.13 + 0.88x
FUNC. PTS.   103      0.58       -37 + 0.96x
COCOMO       610      0.70       27.7 + 0.156x
SLIM         772      0.89
B. Experiment 2: Sensitivity of the Learning Algorithms 15 projects. We note that some predictions of CARTX do not correspond to exact person-month values of any COCOMO (training set) project, even though the regression tree was developed to singleton leaves. This stems from the presence of missing values for some attributes in Kemerer's data. If, during classification of a test project, we encounter a decision node that tests an attribute with an unknown value in the test project, both subtrees under the decision node are explored, In such a case, the system's final prediction of development effort is a weighted mean of the predictions stemming from each subtree. The approach is similar to that described in [17]. Table II summarizes the MRE and R2 values resulting from a linear regression of Mest and Mact values for the two learning algorithms, and results obtained by Kemerer with COCOMO-BASIC, FUNCTION POINTS, and SLIM. 2 These results
indicate that CARTX'S and BACKPROPAGATION 'S predictions show a strong linear relationship with the actual development effort values for the 15 test projects.3 On this dimension, the performance of the learning systems is less than SUM'S performance in Kemerer's experiments, but better than the other two models. In terms of mean MRE, BACKPROPAGATION does strikingly well compared to the other approaches, and CARTX'S MRE is approximately one-half that of SLIM and COCOMO. In sum, Expenment 1 rilustrates two points. In an absolute sense, none of the models does particularly well at estimating software development effort, particularly along the MRE dimension, but in a relative sense both learning approaches are competitive with traditional models examined by Kemerer on one dataset. In general, even though MRE is high in the ,
'Results are reported for COCOMO-BASIC (i.e., without cost drivers), which was comparable to the intermediate and detailed models on this data, in addition, Kemerer actually reported R2, which is R? adjusted for degrees of freedom^and which is slightly lower than the unadjusted R2 values that we report. R2 valuesreportedby Kemerer are 0.55,0.68, and 0.88 for FUNCTION POINTS, COCOMO, and SLIM, respectively.
,
3
, ,
,
. ,, .
. -c
„„
Both the slope and R value are significant at the 99% confidence level. The t coefficients for determining the significance of slope are 8.048 and 7.25 for CARTX and BACKPROPAGATION, respectively.
W e have noted m a t each leaming system assumes a number
«grow- regres. ] u d e d i n t h e neural n e t w o r k T h e s e c h o i c e s c a n significantly impact the success o f l e a m i n g . E x p e riment 2 illustrates the sensitivity of our t w o l e a m i n g systems relative t0 different choices along ^ ^ d i m e n s i o n s . I n particular, W e repeated Experiment 1 using BACKPROPAGATION with differing numbers of hidden units a n d u s i n g C A R T X w i t h d i f f e r i n g c o n s t r aints on regression-tree growth T a b l e m i l l u s t r a t e s o u r r e s u l t s w i t h BACKPROPAGATION. E a c h c e l l s u m m a r i z e s reS ults over 10 experimental trials, rather ta one ^ w h i c h w a s lepoTted in Section IVA for p r e s e n t a t i o n p u r p o s e s . Thus, Max, and Min values of important choices such as depth to which t0 sion
of
^ ^
R2
o r ±e
a n d
of hidden units i n c
mmhel
in
M R E
each
cell
of
Table
m
suggest
± e
to initial random weight settingS( w h k h w e r e different in e a c h o f t h e 1 0 e x p e r i m e n t a i t r f a l s T h e e x p e r i m e n t a l r e s u lts of Section IV-A reflect the .< best ,, ^ ^ 10 ^ s u m m a r i z e d in Table Ill's 10m d d e n _ u n i t c o l u m n . I n general, however, for 5, 10, and 15 h i d d e n ^ ^ MRE s c o r e s m s t i n c o m p a r a b i e o r s u p e r i o r to s o m e o f m e o t h e r m o d e l s s u m m a r i z e d i n Table II, and mean R2 s c o r e s s u g g e s t m a t s i g ni f i c ant linear relationships between p r e d i c t e d a n d a c t u a l development months are often found. Poor resuks obtained with n o hidden units indicate ^ i r n p O rt a nce o f m e s e for a c c u r a t e f u n c t i o n a p p r o x i m a t i o n . T h e p e r f o r m a n c e o f C A R T X c a n vary with the depth to w h i c h w e e x t e n d t h e regression tree. The results of Experiment l ^ repeated ^ a n d r e p r e S ent the case where required sensitivity
of
BACKPROPAGATION
accuracy over the training data is 0%—that is, the tree is
, , . , , _ _ . . . decomposed to singleton leaves. However, w e experimented with more conservative tree expansion policies, where CARTX extended the tree only to the point where an error threshold ( r e l a t i v e to the training data) is satisfied. In particular, trees , , .,__,
were grown to leaves where the mean MRE among projects .
,
._ l prespecified threshold that ranged from 0% to 500%. The MRE of each project at a leaf
a t a l e a f w a s less t h a n o r e< ual t 0 a
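A small sketch of this expansion test follows (illustrative only, not the CARTX source); the 0% to 500% thresholds correspond to 0.0 to 5.0 here.

```python
def leaf_mean_mre(efforts):
    """Mean MRE of a leaf's projects against the leaf's prediction,
    i.e., the mean person-months of the projects at that leaf."""
    m_bar = sum(efforts) / len(efforts)
    return sum(abs(m_bar - m) / m for m in efforts) / len(efforts)

def should_expand(efforts, threshold):
    """Keep splitting while the leaf's mean MRE exceeds the threshold
    (a threshold of 0.0 drives expansion toward singleton leaves)."""
    return len(efforts) > 1 and leaf_mean_mre(efforts) > threshold

# Example: a candidate leaf holding three training projects.
print(should_expand([23.2, 69.9, 84.0], threshold=0.25))
```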
TABLE III
BACKPROPAGATION RESULTS WITH VARYING NUMBERS OF HIDDEN NODES

  Hidden Units      0      5     10     15
  Mean R²        0.04   0.52   0.60   0.59
  Max R²         0.04   0.84   0.80   0.85
  Min R²         0.03   0.08   0.10   0.01
  Mean MRE(%)     618    104    133    111
  Max MRE(%)      915    163    254    161
  Min MRE(%)      369     72     70     77

TABLE IV
CARTX RESULTS WITH VARYING TRAINING ERROR THRESHOLDS

  Threshold    0%    25%   50%   100%  200%  300%  400%  500%
  MRE(%)       364   404   461   870   931   931   931   835
  R²           0.83  0.62  0.63  0.60  0.59  0.59  0.59  0.60

TABLE V
SENSITIVITY OVER 20 RANDOMIZED TRIALS ON COMBINED COCOMO AND KEMERER'S DATA

                     Min R²   Mean R²   Max R²
  CARTX               0.00     0.48      0.97
  BACKPROPAGATION     0.00     0.40      0.99

TABLE VI
SENSITIVITY OVER 20 RANDOMIZED TRIALS ON KEMERER'S DATA

                     Min R²   Mean R²   Max R²
  CARTX               0.00     0.26      0.90
  BACKPROPAGATION     0.03     0.39      0.90

Table IV shows CARTX's performance when we vary the required accuracy of the tree over the training data. Table entries correspond to the MRE and R² scores of the learned trees over the Kemerer test data. In general, there is degradation in performance as one tightens the requirement for regression-tree expansion, though there are applications in which this would not be the case. Importantly, other design decisions in decision and regression-tree systems, such as the manner in which continuous attributes are "split" and the criteria used to select divisive attributes, might also influence prediction accuracy. Selby and Porter [22] have evaluated different design choices along a number of dimensions on the success of decision-tree induction systems using NASA software project descriptions as a test-bed. Their evaluation of decision trees, not regression trees, limits the applicability of their findings to the evaluation reported here, but their work sets an excellent example of how sensitivity to various design decisions can be evaluated.

The performance of both systems is sensitive to certain configuration choices, though we have only examined sensitivity relative to one or two dimensions for each system. Thus, it seems important to posit some intuition about how learning systems can be configured to yield good results on new data, given only knowledge of performance on training data. In cases where more training data is available, a holdout method can be used for selecting an appropriate network or regression-tree configuration. The holdout method divides the available data into two sets; one set, generally the larger, is used to build decision/regression trees or train networks under different configurations. The second subset is then classified using each alternative configuration, and the configuration yielding the best results over this second subset is selected as the final configuration. Better yet, a choice of configuration may rest on a form of resampling that exploits many randomized holdout trials. Holdout could have been used in this case by dividing the COCOMO data, but the COCOMO dataset is very small as is. Thus, we have satisfied ourselves with a demonstration of the sensitivity of each learning algorithm to certain configuration decisions. A more complete treatment of resampling and other strategies for making configuration choices can be found in Weiss and Kulikowski [24].
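The holdout selection just described can be sketched as follows; this is a minimal illustration, where train_and_score is a hypothetical stand-in for training either learner under one configuration and scoring it on the held-out projects.

```python
import random

def select_configuration(projects, configurations, train_and_score,
                         holdout_fraction=0.3, seed=0):
    """Holdout selection: evaluate each candidate configuration (e.g.,
    a number of hidden units or a tree-expansion threshold) on a
    held-out subset and keep the one with the lowest error.
    train_and_score(config, train, test) -> error is assumed supplied."""
    rng = random.Random(seed)
    shuffled = projects[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1.0 - holdout_fraction))
    train, test = shuffled[:cut], shuffled[cut:]
    return min(configurations, key=lambda c: train_and_score(c, train, test))
```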
C. Experiment 3: Sensitivity to Training and Test Data

Thus far, our results suggest that using learning algorithms to discover regularities in a historical database can facilitate predictions on new cases. In particular, comparisons between our experimental results and those of Kemerer indicate that, relatively speaking, learning system performance is competitive with some traditional approaches on one common data set. However, Kemerer found that performance of algorithmic approaches was sensitive to the test data. For example, when a selected subset of 9 of the 15 cases was used to test the models, each considerably improved along the R² dimension. By implication, performance on the other 6 projects was likely poorer. We did not repeat this experiment, but we did perform similarly-intended experiments in which the COCOMO and Kemerer data sets were combined into a single dataset of 78 projects; 60 projects were randomly selected for training the learning algorithms and the remaining 18 projects were used for test. Table V summarizes the results over 20 such randomized trials. The low average R² should not mask the fact that many runs yielded strong linear relationships. For example, on 9 of the 20 CARTX runs, R² was above 0.80. We also ran 20 randomized trials in which 10 of Kemerer's cases were used to train each learning algorithm, and 5 were used for test. The results are summarized in Table VI.
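A sketch of the randomized-trial procedure behind Tables V and VI follows (illustrative only; train_and_score is a hypothetical stand-in for training a learner on the training split and computing R² or MRE on the test split).

```python
import random

def randomized_trials(projects, train_and_score, n_trials=20,
                      train_size=60, seed=0):
    """Repeat a random train/test split (e.g., 60 train / 18 test for
    the combined 78-project dataset) and collect one score per trial.
    train_and_score(train, test) -> score is assumed supplied."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        shuffled = projects[:]
        rng.shuffle(shuffled)
        scores.append(train_and_score(shuffled[:train_size],
                                      shuffled[train_size:]))
    return min(scores), sum(scores) / len(scores), max(scores)
```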
a "case library" and the remaining 5 cases were used to test the model's predictions; the particular cases used for test were not reported, but ESTOR outperformed COCOMO and FUNCTION POINTS on this set. We do not know the robustness of ESTOR in the face of the kind of variation experienced in our 20 randomized trials (Table VI), but we might guess that rules inferred from expert problem solving, which ideally stem from human learning over a larger set of historical data, would render ESTOR more robust along this dimension. However, our experiments and those of Kemerer with selected subsets of his 15 cases suggest that care must be taken in evaluating the robustness of any model with such sparse data. In defense of Vicinanza's et al.P methodology, we should note that the creation of a case library depended on an analysis of expert protocols and the derivation of expert-like rules for modifying the predictions of best matching cases, thus increasing the "cost" of model construction to a point that precluded more complete randomized trials. Vicinanza et al. also point out that their study is best viewed as indicating ESTOR's "plausibility" as a good estimator, while broader claims require further study. In addition to experiments with the combined COCOMO and Kemerer data, and the Kemerer data alone, we experimented with the COCOMO data alone for completeness. When experimenting with Kemerer's data alone, our intent was to weakly explore the kind of variation faced by ESTOR. Using the COCOMO data we have no such goal in mind. Thus, this analysis uses an JV-fold cross validation or a "leave-one-out" methodology, which is another form of resampling. In particular, if a data sample is relatively sparse, as ours is, then for each of JV (i.e., 63) projects, we remove it from the sample set, train the learning system with the remaining TV - 1 samples, and then test on the removed project. MRE and R2 are computed over the N tests. CARTX's R2 value was 0.56 (144.48+0.74*, t = 8.82) and MRE was 125.2%. In this experiment we only report results obtained with CARTX, since a fair and comprehensive exploration of BACKPROPAGATION across possible network configurations is computationally expensive and of limited relevance. Suffice it to say that over the COCOMO data alone, which probably reflects a more uniform sample than the mixed COCOMO/Kemerer data, CARTX provides a significant linear fit to the data with markedly smaller MRE than its performance on Kemerer's data.
In sum, our initial results indicating the relative merits of a learning approach to software development effort estimation must be tempered. In fact, a variety of randomized experiments reveal that there is considerable variation in the performance of these systems as the nature of historical training data changes. This variation probably stems from a number of factors. Notably, there are many projects in both the COCOMO and Kemerer datasets that differ greatly in their actual development effort, but are very similar in other respects, including SLOC. Other characteristics, which are currently unmeasured in the COCOMO scheme, are probably responsible for this variation.

V. GENERAL DISCUSSION

Our experimental comparisons of CARTX and BACKPROPAGATION with traditional approaches to development effort estimation suggest the promise of an automated learning approach to the task. Both learning techniques performed well on the R² and MRE dimensions relative to some other approaches on the same data. Beyond this cursory summary, our experimental results and the previous literature suggest several issues that merit discussion.
A. Limitations of Learning from Historical Data

There are well-known limitations of models constructed using historical data. In particular, attributes used to predict software development effort can change over time and/or differ between software development environments. Mohanty [13] makes this point in comparisons between the predictions of a wide variety of models on a single hypothetical software project: he surveyed approximately 15 models and methods for predicting software development effort, used each to predict the effort of a single hypothetical project, and found that the estimated effort on this one project varied significantly over models. Mohanty points out that each model was developed and calibrated with data collected within a unique software environment. The predictions of these models, in part, reflect underlying assumptions that are not explicitly represented in the data. For example, software development sites may use different development tools. These tools are constant within a facility and thus not represented explicitly in data collected at that facility, but this environmental factor is not constant across facilities.

Differing environmental factors not reflected in data are undoubtedly responsible for much of the unexplained variance in our experiments. To some extent, the R² derived from linear regression is intended to provide a better measure of a model's "fit" to arbitrary new data than MRE in cases where the environment from which a model was derived is different from the environment from which new data was drawn. Even so, these environmental differences may not be systematic in a way that is well accounted for by a linear model. In sum, great care must be taken when using a model constructed from data from one environment to make predictions about data from another environment. Even within a site, the environment may evolve over time, thus compromising the benefits of previously-derived models. Machine learning research has recently focussed on the problem of tracking the accuracy of a learned model over time, which triggers relearning when experience with new data suggests that the environment has changed [6]. However, in an application such as software development effort estimation, there are probably explicit indicators that an environmental change is occurring or will occur (e.g., when new development tools or quality control practices are implemented).
&• Engineering the Definition of Data if environmental factors are relatively constant, then there is little need to explicitly represent these in the description of d a t a H o w e v e r j w h e n t h e env ironment exhibits variance along some dimension, it often becomes critical that this variance be codified and included in data description. In this way,
differences across data points can be observed and used in model construction. For example, Mohanty argues that the desired quality of the finished product should be taken into account when estimating development effort. A comprehensive survey by Scacchi [20] of previous software production studies leads to considerable discussion on the pros and cons of many attributes for software project representation.

Thus, one of the major tasks is deciding upon the proper codification of factors judged to be relevant. Consider the dimension of response time requirements (i.e., TIME), which was included by Boehm in project descriptions. This attribute was selected by CARTX during regression-tree construction. However, is TIME an "optimal" codification of some aspect of software projects that impacts development effort? Consider that strict response time requirements may motivate greater coupling of software modules, thereby necessitating greater communication among developers and in general increasing development effort. If predictions of development effort must be made at the time of requirements analysis, then perhaps TIME is a realistic dimension of measurement, but better predictive models might be obtained and used given some measure of software component coupling.

In sum, when building models via machine learning or statistical methods, it is rarely the case that the set of descriptive attributes is static. Rather, in real-world success stories involving machine learning tools the set of descriptive attributes evolves over time as attributes are identified as relevant or irrelevant, the reasons for relevance are analyzed, and additional or replacement attributes are added in response to this analysis [8]. This "model" for using learning systems in the real world is consistent with a long-term goal of Scacchi [20], which is to develop a knowledge-based "corporate memory" of software production practices that is used for both estimating and controlling software development. The machine-learning tools that we have described, and other tools such as ESTOR, might be added to the repertoire of knowledge-acquisition strategies that Scacchi suggests. In fact, Porter and Selby [14] make a similar proposal by outlining the use of decision-tree induction methods as tools for software development.

C. The Limitations of Selected Learning Methods

Despite the promising results on Kemerer's common database, there are some important limitations of CARTX and BACKPROPAGATION. We have touched upon the sensitivity to certain configuration choices. In addition to these practical limitations, there are also some important theoretical limitations, primarily concerning CARTX. Perhaps the most important of these is that CARTX cannot estimate a value along a dimension (e.g., software development effort) that is outside the range of values encountered in the training data. Similar limitations apply to a variety of other techniques as well (e.g., nearest neighbor approaches of machine learning and statistics). In part, this limitation appears responsible for a sizable amount of error on test data. For example, in the experiment illustrating CARTX's sensitivity to training data using 10/5 splits of Kemerer's projects (Section IV-C), CARTX is doomed to being at least a factor of 3 off the mark when estimating the person-month effort required for the project requiring 23.20 M or the project requiring 1107.31 M; the projects closest to each among the remaining 14 projects are 69.90 M and 336.30 M, respectively.

The root of CARTX's difficulties lies in its labeling of each leaf by the mean of development months of projects classified at the leaf. An alternative approach that would enable CARTX to extrapolate beyond the training data would label each leaf by an equation derived through regression—e.g., a linear regression. After classifying a project to a leaf, the regression equation labeling that leaf would then be used to predict development effort given the object's values along the independent variables. In addition, the criterion for selecting divisive attributes would be changed as well. To illustrate, consider only two independent attributes, development team experience and KDSI, and the dependent variable of software development effort. CARTX would undoubtedly select KDSI, since lower (higher) values of KDSI tend to imply lower (higher) means of development effort. In contrast, development team experience might not provide as good a fit using CARTX's error criterion. However, consider a CART-like system that divides data up by an independent variable, finds a best fitting linear equation that predicts development effort given development team experience and KDSI, and assesses error in terms of the differences between predictions using this best fitting equation and actual development months. Using this strategy, development team experience might actually be preferred; even though lesser (greater) experience does not imply lesser (greater) development effort, development team experience does imply subpopulations for which strong linear relationships might exist between independent and dependent variables. For example, teams with lesser experience may not adjust as well to larger projects as do teams with greater experience; that is, as KDSI increases, development effort increases are larger for less experienced teams than more experienced teams. Recently, machine learning systems have been developed that have this flavor [18]. We have not yet experimented with these systems, but the approach appears promising.
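The alternative leaf labeling just described can be sketched as follows, in the spirit of combining instance-based and model-based learning [18]; this is a minimal illustration, not CARTX itself, and the attribute names are hypothetical.

```python
import numpy as np

def fit_leaf(leaf_projects, attrs):
    """Label a leaf with a least-squares linear model of development
    effort over the given independent attributes (plus an intercept)."""
    X = np.array([[p[a] for a in attrs] + [1.0] for p in leaf_projects])
    y = np.array([p["effort"] for p in leaf_projects])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def predict_at_leaf(coef, project, attrs):
    """Unlike a leaf mean, this can extrapolate beyond training efforts."""
    x = np.array([project[a] for a in attrs] + [1.0])
    return float(x @ coef)

# Hypothetical leaf with three training projects:
leaf = [{"kdsi": 10.0, "experience": 3.0, "effort": 24.0},
        {"kdsi": 25.0, "experience": 3.0, "effort": 62.0},
        {"kdsi": 32.0, "experience": 3.0, "effort": 81.0}]
coef = fit_leaf(leaf, ["kdsi", "experience"])
print(predict_at_leaf(coef, {"kdsi": 40.0, "experience": 3.0},
                      ["kdsi", "experience"]))
```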
This "model" for using learning systems in the experience does imply subpopulations for which strong linear real world is consistent with a long-term goal of Scacchi [20], relationships might exist between independent and dependent which is to develop a knowledge-based "corporate memory" of variables. For example, teams with lesser experience may not software production practices that is used for both estimating adjust as well to larger projects as do teams with greater and controlling software development. The machine-learning experience; that is, as KDSI increases, development effort tools that we have described, and other tools such as ESTOR, increases are larger for less experienced teams than more might be added to the repertoire of knowledge-acquisition experienced teams. Recently, machine learning systems have strategies that Scacchi suggests. In fact, Porter and Selby [14] been developed that have this flavor [18]. We have not yet make a similar proposal by outlining the use of decision-tree experimented with these systems, but the approach appears promising. induction methods as tools for software development. The success of CARTX, and decision/regression-tree learners generally, may also be limited by two other processing C. The Limitations of Selected Learning Methods characteristics. First, CARTX uses a greedy attribute selection Despite the promising results on Kemerer's common data- strategy—tree construction assesses the informativeness of a base, there are some important limitations of CARTX and single attribute at a time. This greedy strategy might overlook BACKPROPAGATION. We have touched upon the sensitivity attributes that participate in more accurate regression trees, to certain configuration choices. In addition to these prac- particularly when attributes interact in subtle ways. Second, tical limitations, there are also some important theoretical CARTX builds one classifier over a training set of software limitations, primarily concerning CARTX. Perhaps the most projects. This classifier is static relative to the test projects; important of these is that CARTX cannot estimate a value any subsequent test project description will match exactly one along a dimension (e.g., software development effort) that is conjunctive pattern, which is represented by a path in the outside the range of values encountered in the training data, regression tree. If there is noise in the data (e.g., an error in the Similar limitations apply to a variety of other techniques as recording of an attribute value), then the prediction stemming well (e.g., nearest neighbor approaches of machine learning from the regression-tree path matching a particular test project and statistics). In part, this limitation appears responsible for may be very misleading. It is possible that other conjunctive a sizable amount of error on test data. For example, in the patterns of attribute values matching a particular test project, experiment illustrating CARTX'S sensitivity to training data but which are not represented in the regression tree, could using 10/5 splits of Kemerer's projects (Section IV-C), CARTX ameliorate CARTX'S sensitivity to errorful or otherwise noisy is doomed to being at least a factor of 3 off the mark when project descriptions.
The Optimized Set Reduction (OSR) strategy of Briand,
Basili, and Thomas [5] is related to the CARTX approach in several important ways, but may mitigate problems associated with CARTX—OSR conducts a more extensive search for multiple patterns that match each test observation. In contrast to CARTX's construction of a single classifier that is static relative to the test projects, OSR can be viewed as dynamically building a different classifier for each test project. The specifics of OSR are beyond the scope of this paper, but suffice it to say that OSR looks for multiple patterns that are statistically justified by the training project descriptions and that match a given test project. The predictions stemming from different patterns (say, for software development effort) are then combined into a single, global prediction for the test project. OSR was also evaluated in [5] using Kemerer's data for test, and COCOMO data as a (partial) training sample.⁴ The authors report an average MRE of 94% on Kemerer's data. However, there are important differences in experimental design that make a comparison between results with OSR, BACKPROPAGATION, and CARTX unreliable. In particular, when OSR was used to predict software development effort for a particular Kemerer project, the COCOMO data and the remaining 14 Kemerer projects were used as training examples. In addition, recognizing that Kemerer's projects were selected from the same development environment, OSR was configured to weight evidence stemming from these projects more heavily than those in the COCOMO data set. The sensitivity of results to this "weighting factor" is not described. We should note that the experimental conditions assumed in [5] are quite reasonable from a pragmatic standpoint, particularly the decision to weight projects drawn from the same environment as the test project more heavily. These different training assumptions simply confound comparisons between experimental results, and OSR's robustness across differing training and test sets is not reported. In addition, like the work of Porter and Selby [14], [15], [22], OSR assumes that the dependent dimension of software development effort is nominally-valued for purposes of learning. Thus, this dimension is partitioned into a number of collectively-exhaustive and mutually-exclusive ranges prior to learning. Neither BACKPROPAGATION nor CARTX requires this kind of preprocessing. In any case, OSR appears unique relative to other machine learning systems in that it does not learn a static classifier; rather, it combines predictions from multiple, dynamically-constructed patterns. Whether one is interested in software development effort estimation or not, this latter facility appears to have merits that are worth further exploration.

In sum, CARTX suffers from certain theoretical limitations: it cannot extrapolate beyond the data on which it was trained, it uses a greedy tree expansion strategy, and the resultant classifier generates predictions by matching a project against a single conjunctive pattern of attribute values. However, there appear to be extensions that might mitigate these problems.
"Our choice of using COCOMO data for training and Kemerer's data for test was made independently of [5].
VI. CONCLUDING REMARKS
This article has compared the CARTX and BACKPROPAGATION learning methods to traditional approaches for software effort estimation. We found that the learning approaches were competitive with SLIM, COCOMO, and FUNCTION POINTS as represented in a previous study by Kemerer. Nonetheless, further experiments showed the sensitivity of learning to various aspects of data selection and representation. Mohanty and Kemerer indicate that traditional models are quite sensitive as well.

A primary advantage of learning systems is that they are adaptable and nonparametric; predictive models can be tailored to the data at a particular site. Decision and regression trees are particularly well-suited to this task because they make explicit the attributes (e.g., TIME) that appear relevant to the prediction task. Once implicated, a process that engineers the data definition is often required to explain relevant and irrelevant aspects of the data, and to encode it accordingly. This process is best done locally, within a software shop, where the idiosyncrasies of that environment can be factored in or out. In such a setting, analysts may want to investigate the behavior of systems like BACKPROPAGATION, CART, and related approaches [5], [14], [15], [22] over a range of permissible configurations, thus obtaining performance that is optimal in their environment.
APPENDIX A
DATA DESCRIPTIONS

The attributes defining the COCOMO and Kemerer databases were used to develop the COCOMO model. The following is a brief description of the attributes and some of their suspected influences on development effort. The interested reader is referred to [3] for a detailed exposition of them. These attributes can be classified under four major headings: Product Attributes, Computer Attributes, Personnel Attributes, and Project Attributes.

A. Product Attributes
1) Required Software Reliability (RELY): This attribute measures how reliable the software should be. For example, if serious financial consequences stem from a software fault, then the required reliability should be high.
2) Database Size (DATA): The size of the database to be used by the software may affect development effort. Larger databases generally suggest that more time will be required to develop the software product.
3) Product Complexity (CPLX): The application area has a bearing on the software development effort. For example, communications software will likely have greater complexity than software developed for payroll processing.
4) Adaptation Adjustment Factor (AAF): In many cases software is not developed entirely from scratch. This factor reflects the extent that previous designs are reused in the new project.
B. Computer Attributes
1) Execution Time Constraint (TIME): If there are constraints on processing time, then the development effort will tend to be high.
2) Main Storage Constraint (STOR): If there are memory constraints, then the development effort will tend to be high.
3) Virtual Machine Volatility (VIRT): If the underlying hardware and/or system software change frequently, then development effort will be high.
C. Personnel Attributes
1) Analyst Capability (ACAP): If the analysts working on the software project are highly skilled, then the development effort of the software will be less than projects with less-skilled analysts.
2) Applications Experience (AEXP): The experience of project personnel influences the software development effort.
3) Programmer Capability (PCAP): This is similar to ACAP, but it applies to programmers.
4) Virtual Machine Experience (VEXP): Programmer experience with the underlying hardware and the operating system has a bearing on development effort.
5) Language Experience (LEXP): Experience of the programmers with the implementation language affects the software development effort.
6) Personnel Continuity Turnover (CONT): If the same personnel work on the project from beginning to end, then the development effort will tend to be less than similar projects experiencing greater personnel turnover.

D. Project Attributes
1) Modern Programming Practices (MODP): Modern programming practices like structured software design reduce the development effort.
2) Use of Software Tools (TOOL): Extensive use of software tools like source-line debuggers and syntax-directed editors reduces the software development effort.
3) Required Development Schedule (SCED): If the development schedule of the software project is highly constrained, then the development effort will tend to be high.

Apart from the attributes mentioned above, other attributes that influence the development effort are: the programming language, and the estimated lines of code (unadjusted and adjusted for the use of existing software).

ACKNOWLEDGMENT

The authors would like to thank the three reviewers and the action editor for their many useful comments.
REFERENCES
[1] D. Aha, D. Kibler, and M. Albert, "Instance-based learning algorithms," Machine Learning, vol. 6, pp. 37-66, 1991.
[2] A. Albrecht and J. Gaffney Jr., "Software function, source lines of code, and development effort prediction: A software science validation," IEEE Trans. Software Eng., vol. 9, pp. 639-648, 1983.
[3] B. W. Boehm, Software Engineering Economics. Englewood Cliffs, NJ: Prentice-Hall, 1981.
[4] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Belmont, CA: Wadsworth International, 1984.
[5] L. Briand, V. Basili, and W. Thomas, "A pattern recognition approach for software engineering data analysis," IEEE Trans. Software Eng., vol. 18, pp. 931-942, Nov. 1992.
[6] C. Brodley and E. Rissland, "Measuring concept change," in AAAI Spring Symp. Training Issues in Incremental Learning, 1993, pp. 98-107.
[7] K. De Jong, "Learning with genetic algorithms," Machine Learning, vol. 3, pp. 121-138, 1988.
[8] B. Evans and D. Fisher, "Overcoming process delays with decision tree induction," IEEE Expert, vol. 9, pp. 60-66, Feb. 1994.
[9] U. Fayyad, "On the induction of decision trees for multiple concept learning," doctoral dissertation, EECS Dep., Univ. of Michigan, 1991.
[10] L. Johnson and R. Riess, Numerical Analysis. Reading, MA: Addison-Wesley, 1982.
[11] C. Kemerer, "An empirical validation of software cost estimation models," Commun. ACM, vol. 30, pp. 416-429, May 1987.
[12] A. Lapedes and R. Farber, "Nonlinear signal prediction using neural networks: Prediction and system modeling," Los Alamos National Laboratory, Tech. Rep. LA-UR-87-2662, 1987.
[13] S. Mohanty, "Software cost estimation: Present and future," Software—Practice and Experience, vol. 11, pp. 103-121, 1981.
[14] A. Porter and R. Selby, "Empirically-guided software development using metric-based classification trees," IEEE Software, vol. 7, pp. 46-54, Mar. 1990.
[15] A. Porter and R. Selby, "Evaluating techniques for generating metric-based classification trees," J. Syst. Software, vol. 12, pp. 209-218, July 1990.
[16] L. H. Putnam, "A general empirical solution to the macro software sizing and estimating problem," IEEE Trans. Software Eng., vol. 4, pp. 345-361, 1978.
[17] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[18] J. R. Quinlan, "Combining instance-based and model-based learning," in Proc. 10th Int. Machine Learning Conf., 1993, pp. 236-243.
[19] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing. Cambridge, MA: MIT Press, 1986.
[20] W. Scacchi, "Understanding software productivity: Toward a knowledge-based approach," Int. J. Software Eng. and Knowledge Eng., vol. 1, pp. 293-320, 1991.
[21] T. J. Sejnowski and C. R. Rosenberg, "Parallel networks that learn to pronounce English text," Complex Systems, vol. 1, pp. 145-168, 1987.
[22] R. Selby and A. Porter, "Learning from examples: Generation and evaluation of decision trees for software resource analysis," IEEE Trans. Software Eng., vol. 14, pp. 1743-1757, 1988.
[23] S. Vicinanza, M. J. Prietula, and T. Mukhopadhyay, "Case-based reasoning in software effort estimation," in Proc. 11th Int. Conf. Information Systems, 1990.
[24] S. Weiss and C. Kulikowski, Computer Systems that Learn. San Mateo, CA: Morgan Kaufmann, 1991.
[25] J. Zurada, Introduction to Artificial Neural Systems. St. Paul, MN: West, 1992.
Krishnamoorthy Srinivasan received the M.B.A. in management information systems from the Owen Graduate School of Management, Vanderbilt University, and the M.S. in computer science from Vanderbilt University. He also received the Post Graduate Diploma in industrial engineering from the National Institute for Training in Industrial Engineering, Bombay, India, and the B.E. from the University of Madras, Madras, India. He is currently working as a Principal Software Engineer with Personal Computer Consultants, Inc., Washington, D.C. Before joining PCC, he worked as a Senior Specialist with McKinsey & Company, Inc., Cambridge, MA. His primary research interests are in exploring applications of machine learning techniques to real-world business problems.
Douglas Fisher (M'92) received his Ph.D. in information and computer science from the University of California at Irvine in 1987. He is currently an Associate Professor in computer science at Vanderbilt University. He is an Associate Editor of Machine Learning and IEEE Expert, and serves on the editorial board of the Journal of Artificial Intelligence Research. His research interests include machine learning, cognitive modeling, data analysis, and cluster analysis. An electronic addendum to this article, which reports any subsequent analysis, can be found at http://www.vuse.vanderbilt.edu/~dfisher/dfisher.html. Dr. Fisher is a member of the ACM and AAAI.
Estimating Software Project Effort Using Analogies

Martin Shepperd and Chris Schofield

Abstract—Accurate project effort prediction is an important goal for the software engineering community. To date most work has focused upon building algorithmic models of effort, for example, COCOMO. These can be calibrated to local environments. We describe an alternative approach to estimation based upon the use of analogies. The underlying principle is to characterize projects in terms of features (for example, the number of interfaces, the development method or the size of the functional requirements document). Completed projects are stored and then the problem becomes one of finding the most similar projects to the one for which a prediction is required. Similarity is defined as Euclidean distance in n-dimensional space where n is the number of project features. Each dimension is standardized so all dimensions have equal weight. The known effort values of the nearest neighbors to the new project are then used as the basis for the prediction. The process is automated using a PC-based tool known as ANGEL. The method is validated on nine different industrial datasets (a total of 275 projects) and in all cases analogy outperforms algorithmic models based upon stepwise regression. From this work we argue that estimation by analogy is a viable technique that, at the very least, can be used by project managers to complement current estimation techniques.

Index Terms—Effort prediction, estimation process, empirical investigation, analogy, case-based reasoning.
1 INTRODUCTION
An important aspect of any software development project is to know how much it will cost. In most cases the major cost factor is labor. For this reason estimating development effort is central to the management and control of a software project. A fundamental question that needs to be asked of any estimation method is how accurate are the predictions. Accuracy is usually defined in terms of mean magnitude of relative error (MMRE) [6], which is the mean of absolute percentage errors:

    MMRE = (100/n) × Σ_{i=1..n} |E_i − Ê_i| / E_i    (1)

where there are n projects, E is the actual effort and Ê is the predicted effort. There has been some criticism of this measure, in particular that it is unbalanced and penalizes overestimates more than underestimates. For this reason Miyazaki et al. [19] propose a balanced mean magnitude of relative error measure as follows:

    MMRE_bal = (100/n) × Σ_{i=1..n} |E_i − Ê_i| / min(E_i, Ê_i)    (2)

This approach has been criticized by Hughes [8], among others, as effectively being two distinct measures that should not be combined. Other workers have used the adjusted R squared or coefficient of determination to indicate the percentage of variation in the dependent variable that can be "explained" in terms of the independent variables. Unfortunately, this is not always an adequate indicator of prediction quality where there are outlier or extreme values. Yet another approach is to use Pred(25), which is the percentage of predictions that fall within 25 percent of the actual value. Clearly the choice of accuracy measure to a large extent depends upon the objectives of those using the prediction system. For example, MMRE is fairly conservative with a bias against overestimates, while Pred(25) will identify those prediction systems that are generally accurate but occasionally wildly inaccurate. In this paper we have decided to adopt MMRE and Pred(25) as prediction performance indicators since these are widely used, thereby rendering our results more comparable with those of other workers.
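Both indicators can be computed as in the short sketch below (an illustration assuming numpy, not the authors' code).

```python
import numpy as np

def mmre(actual, predicted):
    """Mean magnitude of relative error as a percentage, per Eq. (1)."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return 100.0 * np.mean(np.abs(actual - predicted) / actual)

def pred(actual, predicted, level=25):
    """Pred(level): percentage of estimates within level% of actual."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    hits = np.abs(actual - predicted) / actual <= level / 100.0
    return 100.0 * np.mean(hits)

print(mmre([100, 200], [150, 190]), pred([100, 200], [150, 190]))
```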
The remainder of this paper reviews work to date in the field of effort prediction (both algorithmic and nonalgorithmic) before going on to describe an alternative approach to effort prediction based upon the use of analogy. Results from this approach are compared with traditional statistical methods using nine datasets. The paper then discusses the results of a sensitivity analysis of the analogy method. An estimation process is then presented. The paper concludes by discussing the strengths and limitations of analogy as a means of predicting software project effort.

2 A BRIEF HISTORY OF EFFORT PREDICTION
Over the past two decades there has been considerable activity in the area of effort prediction with most approaches being typified as being algorithmic in nature. Well known examples include COCOMO [4] and function points [2].¹ Whatever the exact niceties of the model, the general form tends to be:

M. Shepperd and C. Schofield are with the Department of Computing, Bournemouth University, Talbot Campus, Poole, BH12 5BB United Kingdom.
1. We include function points as an algorithmic method since they are dimensionless and therefore need to be calibrated in order to estimate effort.
    E = aS^b    (3)

where E is effort, S is size typically measured as lines of code (LOC) or function points, a is a productivity parameter and b is an economies or diseconomies of scale parameter. COCOMO represents an approach that could be regarded as "off the shelf." Here the estimator hopes that the equations contained in the cost model adequately represent their development environment and that any variations can be satisfactorily accounted for in terms of cost drivers or parameters built into the model. For instance, COCOMO has 15 such drivers. Unfortunately, there is considerable evidence that this off the shelf approach is not always very successful. Kemerer [12] reports average errors (in terms of the difference between predicted and actual project effort) of over 600 percent in his independent study of COCOMO. Other independent studies [14], [18] have also reported high error rates.

Another algorithmic approach is to calibrate a model by estimating values for the parameters (a and b in the case of (3)). However, the most straightforward method is to assume a linear model, that is, set b to unity, and then use regression analysis to estimate the slope (parameter a) and possibly introduce an intercept so the model becomes:

    E = a1 + a2·S    (4)

so that a1 represents fixed development costs (for example, regression testing will consume a fixed amount of effort irrespective of the size of the software) and a2 represents productivity. Kok et al. [15] describe how this approach has been successfully utilized on the Esprit MERMAID Project. Function points [2] are also often calibrated to local environments in order to convert size in function points to predicted effort. Again, as with COCOMO, quite mixed results have been reported [9], [10], [12], [17]. Kitchenham and Kansala [13] also note that better results can be obtained through disaggregating the components of function points and using stepwise regression to reestimate weights and determine the significant components.
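A minimal sketch of the local calibration in (4) follows; the size and effort values are invented purely for illustration and are not drawn from any dataset discussed here.

```python
import numpy as np

# Completed projects: size (e.g., function points) and effort in
# person months; the values below are invented for illustration.
size = np.array([120.0, 340.0, 55.0, 410.0])
effort = np.array([14.0, 41.5, 8.0, 52.0])

a2, a1 = np.polyfit(size, effort, 1)   # slope (productivity), intercept
estimate = lambda s: a1 + a2 * s       # Eq. (4): E = a1 + a2 * S
print(estimate(200.0))
```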
Although most research into project effort estimation has adopted an algorithmic approach, there has been limited exploration of machine learning or nonalgorithmic methods. For example, Karunanithi et al. [11] report the use of neural nets for predicting software reliability, and conclude that both feed forward and Jordan networks with a cascade correlation learning algorithm out perform traditional statistical models. More recently Wittig and Finnie [28] describe their use of back propagation learning algorithms on a multilayer perceptron in order to predict development effort. An overall error rate (MMRE) of 29 percent was obtained, which compares favorably with other methods. However, it must be stressed that the datasets were large (81 and 136 projects, respectively) and that only a very small number of projects were withdrawn for validation purposes. Some outliers also appear to have been removed. This tends to confirm the findings from Serluca [25] that neural nets seem to require large training sets in order to give good predictions.

Another study by Samson et al. [24] uses an Albus multilayer perceptron in order to predict software effort. In this instance they use Boehm's COCOMO dataset. The work compares linear regression with a neural net approach using the COCOMO dataset. Both approaches seem to perform badly, with MMREs of 520.7 and 428.1 percent, respectively.

Srinivasan and Fisher [27] also report on the use of a neural net with a back propagation learning algorithm. They found that the neural net outperformed other techniques and gave results of MMRE = 70 percent. However, it is unclear exactly how the dataset was divided up for training and validation purposes. Unfortunately, they also found that the results were sensitive to the number of hidden units and layers. Results to date suggest that accuracy is sensitive to decisions regarding the topology of the net, the number of learning epochs and the initial random weights of the neurons within the net. In addition, there is little explanation value in a neural net, that is, such models do not help us understand software project development.

There have been a number of attempts to use regression and decision trees to predict aspects of software engineering. Srinivasan and Fisher [27] describe the use of a regression tree to predict effort using the Kemerer dataset [12]. They found that although it outperformed COCOMO and SLIM, the results were less good than using either a statistical model derived from function points or a neural net. Briand et al. [5] obtained rather better results (MMRE = 94 percent) from their tree induction analysis. In this case they used a combination of the Kemerer and COCOMO datasets. Porter and Selby [21], [22] describe the use of decision or classification trees in predicting aspects of the software development process. Results from this approach seem to be quite mixed and, as with the neural net approach, results are sensitive to the characteristics of the training data.

3 ANALOGY

Estimation by analogy is a form of CBR. Cases are defined as abstractions of events that are limited in time and space. For effort estimation, analogy offers some distinct advantages:
• It avoids the problems associated both with knowledge elicitation and with extracting and codifying the knowledge.
• Analogy-based systems only need deal with those problems that actually occur in practice, while generative (i.e., algorithmic) systems must handle all possible problems.
• Analogy-based systems can also handle failed cases (i.e., those cases for which an accurate prediction was not made). This is useful as it enables users to identify potentially high-risk situations.
• Analogy is able to deal with poorly understood domains (such as software projects) since solutions are based upon what has actually happened, as opposed to chains of rules in the case of rule based systems.
• Users may be more willing to accept solutions from analogy based systems since they are derived from a form of reasoning more akin to human problem solving, as opposed to the somewhat arcane chains of rules or neural nets. This final advantage is particularly important if systems are to be not only deployed but also have reliance placed upon them.

The key activities for estimating by analogy are the identification of a problem as a new case, the retrieval of similar cases from a repository, the reuse of knowledge derived from previous cases, and the suggestion of a solution for the new case. This solution may be revised in the light of actual events and the outcome retained to augment the repository of completed cases. This approach to prediction poses two problems. First, how do we characterize cases? Second, how do we retrieve similar cases, indeed how do we measure similarity?

Characterization of cases is largely a pragmatic issue of what information is available. Variables can be continuous (i.e., interval, ratio or absolute scale measures) or categorical (i.e., nominal or ordinal measures). When designing a new CBR system, experts should be consulted to try to establish those features of a case that are believed to be significant in determining similarity, or otherwise, of cases. Rich and Knight [23] describe the problem of choosing insufficiently general features. Again the solution appears to be to use an expert.

Assessing similarity is the other problem. There are a variety of approaches, including a number of preference heuristics proposed by Kolodner [16]:

• Nearest neighbor algorithms. These are the most popular and are either based upon straightforward distance measures or the sum of squares of the differences for each variable. In either case each variable must first be standardized (so that it has an equal influence) and then weighted according to the degree of importance attached to the feature. A common algorithm is given by Aha [1]:

      Similarity(C1, C2, P) = 1 / sqrt( Σ_{j∈P} Feature_dissimilarity(C1j, C2j) )

  where P is the set of n features, C1 and C2 are cases, and Feature_dissimilarity(C1j, C2j) is (C1j − C2j)² where 1) the features are numeric, 0 where 2) the features are categorical and C1j = C2j, or 1 where 3) the features are categorical and C1j ≠ C2j.
• Manually guided induction. Here an expert manually identifies key features, although this reduces some of the advantages of using a CBR system in that an expert is required.
• Template retrieval. This functions in a similar fashion to query by example database interfaces, that is, the user supplies values or ranges, and all cases that match are retrieved.
• Goal directed preference. Select cases that have the same goal as the current case.
• Specificity preference. Select cases that match features exactly over those that match generally.
• Frequency preference. Select cases that are most frequently retrieved.
• Recency preference. Choose recently matched cases over those that have not been matched for a period of time.
• Fuzzy similarity. Where concepts such as at-least-as-similar and just-noticeable-difference are employed.

The similarity measures suffer from a number of disadvantages. First, they tend to be computationally intensive, although Aha [1] has proposed a number of more efficient algorithms that are only marginally less accurate. However, efficiency is not an issue for project effort estimation as typically one is dealing with fewer than 100 cases. Second, the algorithms are intolerant of noise and of irrelevant features. One strategy to overcome this problem is to build in learning so that the algorithm learns the importance of the various features. Essentially, weights are increased for matching features for successful predictions and diminished for unsuccessful predictions. Third, symbolic or categorical features are problematic. Although there are several algorithms that have been proposed to accommodate such features, they are all fairly crude in that they adopt a Boolean approach: features match or fail to match with no middle ground. A fourth criticism of these similarity measures is that they fail to take into account information which can be derived from the structure of the data; thus, they are weak for higher order feature relationships such as one might expect to see exhibited in legal systems.

Our approach has been guided by the twin aims of expediency and simplicity. In essence we take a new project, one for which we wish to predict effort, and attempt to find other similar completed projects. Since these projects are completed, development effort will be known and can be used as a basis for estimating effort for the new project. Similarity is defined in terms of project features, such as number of interfaces, development method, application domain and so forth. Clearly the features used will depend upon what data is available to characterize projects. The number of features is also flexible. We have analyzed datasets with as few as one feature and as many as 29 features.
Features may be either categorical or continuous. Similarity, defined as proximity in n-dimensional space (where each dimension corresponds to a different feature), is most intuitively appealing, hence we use unweighted Euclidean distance. The most similar projects will be closest to each other. Note that each dimension is standardized (between 0 and 1) so that it has the same degree of influence and the method is immune to the choice of units. Moreover, the notion of distance gives an indication of the degree of similarity. Once the analogous projects have been found, the known effort can be used in a variety of ways. We use the weighted or unweighted average of up to three analogies. No one approach is consistently more accurate, so the decision requires a certain amount of experimentation on the part of the estimators. Because of the small datasets, we cope with noise (that is, unhelpful features that do not aid in the process of finding good analogies) by means of an exhaustive search of all possible subsets of the project features so as to obtain the optimum predictions for projects with known effort. The whole method, from storing analogies through eliminating redundant features to finding analogies, is automated by a PC-based software tool known as ANGEL (ANaloGy Estimation tool).² A fuller description is to be found in Shepperd et al. [26].

2. The authors are happy to provide a simple version of ANGEL at no cost.
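The estimation step as just described can be sketched as follows: standardize each dimension to [0, 1], rank completed projects by unweighted Euclidean distance, and average the effort of up to three analogies. This is a minimal illustration of the stated method, not the ANGEL implementation, and the feature vectors are invented.

```python
import math

def standardize(rows):
    """Rescale every numeric feature into [0, 1] so that each
    dimension carries equal weight, as described above."""
    lows = [min(col) for col in zip(*rows)]
    highs = [max(col) for col in zip(*rows)]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)] for row in rows]

def estimate_by_analogy(new_project, completed, efforts, k=3):
    """Unweighted mean effort of the k nearest completed projects,
    using unweighted Euclidean distance over standardized features."""
    rows = standardize(completed + [new_project])
    target, rest = rows[-1], rows[:-1]
    distance = lambda r: math.sqrt(sum((a - b) ** 2
                                       for a, b in zip(r, target)))
    nearest = sorted(range(len(rest)), key=lambda i: distance(rest[i]))[:k]
    return sum(efforts[i] for i in nearest) / len(nearest)

# Invented feature vectors (e.g., interfaces, size of requirements doc):
completed = [[10.0, 120.0], [14.0, 300.0], [6.0, 80.0], [11.0, 150.0]]
efforts = [18.0, 44.0, 9.0, 21.0]
print(estimate_by_analogy([12.0, 140.0], completed, efforts))
```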
4 COMPARING ESTIMATION BY ANALOGY WITH REGRESSION MODELS

Next, we compared the accuracy of software project effort prediction using analogy with an algorithmic approach based upon equations derived through stepwise regression analysis. Table 1 summarizes the datasets that were used for our comparison of analogy-based estimation with stepwise regression. As can be seen from the table, the datasets are quite diverse and are drawn from many different application domains, ranging from telecommunications to commercial information systems. All the data was taken from industrial projects; that is, no academic or student projects are included. The projects range in size from a few person-months to over 1,000 person-months. It is also important to stress that none of the data was collected with estimation by analogy in mind; instead we were able to exploit data that was already available. The final point is that we only utilized information that would be available at the time the prediction would be made, so we avoided project features such as LOC. This is important if we wish to avoid creating a false impression as to the efficacy of different prediction methods.

Table 2 shows the accuracy of the respective methods using the MMRE and Pred(25) values. A jack-knifing procedure was adopted for the analogy-based predictions, since this could be automated in the ANGEL tool; the regression models were generated using the entire dataset. This means the results are likely to be biased in favor of the regression models. Note that we use two slightly different regression analysis techniques. Both regression 1 and 2 use stepwise regression; however, regression 1 restricts the procedure to the three variables most highly correlated with the dependent variable (i.e., effort). Not surprisingly, the results are in general similar; however, occasional differences are due to the fact that the regression procedure attempts to minimize the sum of the squares of the residuals, whereas MMRE is based upon the mean of the sum of the unsquared residuals. Each dataset is treated separately, since each one has different project features available and therefore we are not able to merge all the data into a single all-encompassing dataset. This is appropriate, since it is unlikely that an organization would have access to such large volumes of data and there seems some merit in estimating using smaller, more homogenous datasets, a point we will return to.

From Table 2 we see that for all datasets the MMRE performance of estimating by analogy is better than that of the regression methods. This suggests that analogy is capable of yielding more accurate predictions, at least for these datasets. An interesting problem occurs for the Real-time 1 dataset. Here it was not possible to develop an algorithmic model or to use regression analysis, since the dataset comprises only categorical data, with the exception of actual project effort; indeed, the dataset was very sparse and was made up of only three distinguishing project features. Yet even in these highly unpropitious circumstances the analogy method was able to yield a predictive accuracy of 74 percent. This is indicative of the possibility of being able to use analogy-based estimation at an extremely early stage of a project, when other estimation techniques may not be possible, for the reason that analogy does not require quantitative data. Similarly, an accuracy of 39 percent was obtained for the dataset Telecom 1, despite the fact that only a single distinguishing feature was available. Again, stepwise regression only achieves a result of MMRE = 86 percent by method 1 or 2.

The Pred(25) results from Table 2 are slightly more mixed. Recall that, unlike MMRE, a higher score implies better predictive accuracy. Two datasets (Atkinson and Desharnais) yield a higher Pred(25) score for the regression model. In general, the results are closer than for the MMRE analysis. One explanation lies in the fact that the ANGEL tool explicitly tries to optimize the MMRE result, so that it is not surprising that it performs best in terms of this indicator. A second explanation lies in the fact that MMRE and Pred(25) are assessing slightly different characteristics of a prediction system. MMRE is conservative and looks at the mean absolute percentage error, whereas Pred(25) is optimistic and focuses upon the best predictions (i.e., those within 25 percent of actual) and ignores all other predictions. The choice of indicator to some extent depends upon the objectives of the user. Nevertheless, the overall picture suggests that estimation by analogy tends to be the more accurate prediction method.

2. The authors are happy to provide a simple version of ANGEL at no cost. The zip files may be downloaded from http://xanadu.bournemouth.ac.uk/ComputingResearch/ChrisSchofield/Angel/AngelPage.html
3. Jack-knifing is a validation technique whereby each case is removed from the dataset and the remainder of the cases are used to predict the removed case. The case is then returned to the dataset and the next case removed. This procedure is repeated until all cases have been covered.
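For concreteness, the two accuracy indicators and the jack-knifing loop can be sketched as follows. This is our illustration of the standard definitions, with both indicators expressed as percentages; predict_effort(project, others, features) is an assumed analogy-based estimator (one is sketched later, in Section 6):

def mmre(pairs):
    # Mean magnitude of relative error, as a percentage, over
    # (actual, predicted) effort pairs.
    return 100.0 * sum(abs(a - p) / a for a, p in pairs) / len(pairs)

def pred25(pairs):
    # Percentage of predictions falling within 25 percent of actual.
    hits = sum(abs(a - p) / a <= 0.25 for a, p in pairs)
    return 100.0 * hits / len(pairs)

def jackknifed_mmre(projects, features):
    # Hold out each project in turn and predict it from the rest,
    # using the assumed predict_effort helper.
    pairs = []
    for i, project in enumerate(projects):
        others = projects[:i] + projects[i + 1:]
        pairs.append((project["effort"],
                      predict_effort(project, others, features)))
    return mmre(pairs)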
TABLE 1
DATASETS USED TO COMPARE EFFORT PREDICTION METHODS

Name        | Source                                                                          | n  | Features | Description
Albrecht    | [2]                                                                             | 24 | 5        | IBM DP Services projects
Atkinson    | [3]                                                                             | 21 | 12       | Builds to a large telecommunications product at U.K. company X
Desharnais  | [7]                                                                             | 77 | 9        | Canadian software house—commercial projects
Finnish     | Dataset made available to the ESPRIT Mermaid Project by the TIEKE organization  | 38 | 29       | Data collected by the TIEKE organization from IS projects from nine different Finnish companies
Kemerer     | [12]                                                                            | 15 | 2        | Large business applications
Mermaid     | MM2 dataset, made available to the ESPRIT Mermaid Project anonymously           | 28 | 17       | New and enhancement projects
Real-time 1 | not in public domain                                                            | 21 | 3        | Real-time projects at U.K. company Z
Telecom 1   | Appendix A                                                                      | 18 | 1        | Enhancements to a U.K. telecommunication product
Telecom 2   | not in public domain                                                            | 33 | 13       | Telecommunication product at U.K. company Y
TABLE 2
RELATIVE ACCURACY LEVELS OF EFFORT ESTIMATION FOR ANALOGY AND REGRESSION
(The body of this table, giving per-dataset Analogy and Regression MMRE and Pred(25) values, is not legible in this reproduction; the values discussed in the text are taken from it.)
TABLE 3
RELATIVE ACCURACY LEVELS OF HOMOGENIZED DATASETS

Dataset      | Analogy (MMRE) (%) | Regression 1 (MMRE) (%) | Regression 2 (MMRE) (%) | Analogy (Pred 25) (%) | Regression 1 (Pred 25) (%) | Regression 2 (Pred 25) (%)
Desharnais 1 | 37                 | 41                      | 41                      | 47                    | 45                         | 45
Desharnais 2 | 29                 | 29                      | 29                      | 47                    | 48                         | 48
Desharnais 3 | 26                 | 36                      | 49                      | 70                    | 30                         | 50
Mermaid E    | 53                 | 62                      | 62                      | 39                    | 27                         | 27
Mermaid N    | 60                 | -                       | -                       | 25                    | -                          | -
In general, the best results seem to be achieved where the data is drawn from many builds or enhancements to an existing system, for example the Atkinson, Telecom 1, and Telecom 2 datasets. The poorest results occur when the data is drawn from a wide range of projects from more than one organisation, such as the Mermaid dataset. This tendency appears to be true for both analogy and regression analysis. Table 3 shows the results of dividing the Desharnais and Mermaid datasets into more homogenous subsets. The Desharnais dataset is divided on the basis of differing development environments. The Mermaid data is divided into enhancement (E) and new (N) projects. We observe that this division leads to enhanced accuracy for all estimation methods. Overall, analogy has equal or superior performance to regression-based prediction for seven out of eight comparisons, the only exception being the Desharnais 2
dataset, which reveals fractionally superior performance for regression-based prediction when using the Pred(25) indicator. The Mermaid N dataset is particularly interesting, as it shows a dataset for which no statistically significant relationships could be found between any of the independent variables and effort; hence no statistically significant regression equation can be derived. By contrast, the analogy method is able to produce an overall estimation accuracy of MMRE = 60 percent. Finally, we note that the procedure to search for optimum subsets of features for predicting effort reduced the set of features for every dataset studied excepting, of course, Telecom 1, which only had a single feature in the first place. This procedure has a significant impact upon the levels of accuracy that we were able to obtain.

4. In a previous paper [26] we reported an accuracy level of MMRE = 62 percent. The improvement is due to the use of additional project features with which to find analogies that were not utilized during our earlier work.
5 SENSITIVITY ANALYSIS

An important question to ask about any prediction method is how sensitive it is to any peculiar characteristics of the data and how it will behave over time. All the datasets we studied were historical in the sense that they described completed
projects and we conducted the analysis after the event. This section explores the dynamic behavior of effort prediction by simulating the growth of a dataset over time. This enables us to answer questions such as how many data points are needed for estimation by analogy to be viable, and how stable the results are (in other words, are the accuracy levels vulnerable to the addition of a single rogue project?).

Figs. 1 and 2 show the trends in estimation accuracy as the datasets grow. The Albrecht dataset (Fig. 1) was selected as an example of a dataset for which a comparatively low level of accuracy was achieved; in contrast, the Telecom 2 dataset (Fig. 2) showed the highest level of accuracy. The procedure was to randomly number the projects from 1 to n (where n is the number of projects in the dataset). Projects are added to the dataset one at a time, in the random number order. Thus, the dataset grows until all projects are added. The optimum subset of features was recalculated as each new project was added. This involved, for each partial dataset (starting from two projects), jack-knifing the dataset by holding out each project, one at a time, and using the remaining projects to predict effort. The average absolute prediction error for all projects contained in the partial dataset gives the MMRE of that partial dataset. This procedure was repeated three times for each dataset (hence, A1, A2, and A3 and T1, T2, and T3).

Fig. 1. Estimation accuracy over time (Albrecht dataset). (Figure not reproduced.)
Fig. 2. Estimation accuracy over time (Telecom 2 dataset). (Figure not reproduced.)

Overall, Figs. 1 and 2 show that the MMRE decreases as the size of the dataset grows. There is a tendency for the MMRE to start to stabilize at approximately 10 projects, which suggests that estimation by analogy may be a high
risk technique at below this number of projects. The Telecom 2 dataset shows little improvement beyond 15 projects. On this theme, it is interesting to note that, overall, it is not the largest datasets, such as the Desharnais dataset, that have the lowest MMREs; clearly other factors, over and above size, such as homogeneity, also have an impact.

An interesting feature of Fig. 1 is the sharp rise in the MMRE values that occurs after 10 projects have been added for random sequence A1 and 16 added for random sequence A2. Further investigation reveals that both of these anomalies are linked to the introduction of the same project. The project is third in sequence A3, when predictions are still very poor. This suggests that the results from estimating by analogy, like regression, can be influenced by outlying projects. However, A2 demonstrates that the effect of a rogue project is ameliorated as the size of the dataset increases. Superficially there appears to be a similar effect in Fig. 2 for sequences T1 and T3 and projects 4 and 7, respectively. In this case, however, the peaks are caused by different projects and the most likely explanation is the vulnerability of finding analogous cases from very small datasets.
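The dataset-growth simulation just described can be sketched as follows (illustrative only; jackknifed_mmre is the helper sketched in Section 4, and the per-step recalculation of the optimum feature subset is omitted for brevity):

import random

def growth_curve(projects, features, seed=0):
    # Add projects one at a time in a random order and record the
    # jack-knifed MMRE of each partial dataset (starting from two).
    order = projects[:]
    random.Random(seed).shuffle(order)
    curve = []
    for size in range(2, len(order) + 1):
        partial = order[:size]
        curve.append((size, jackknifed_mmre(partial, features)))
    return curve

# Repeating with seeds 0, 1, 2 gives three sequences per dataset,
# analogous to A1-A3 and T1-T3 in Figs. 1 and 2.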
6 AN ESTIMATION PROCESS
This section considers how estimation by analogy can be introduced into software development organizations. The following are the main stages in setting up an estimation by analogy program:

• identify the data or features to collect
• agree data definitions and collection mechanisms
• populate the case base
• tune the estimation method
• estimate the effort for a new project

The first stage, that of identifying what data to collect, will be very dependent upon the nature of the projects for which estimates are required. Because of these variations, our software tool ANGEL is designed to be very flexible in the data that is used to characterize analogies, and the user is able to define a template describing the data that will be supplied. Factors to be taken into account include beliefs as to what features significantly impact development effort (and are measurable at the time the estimate is required) and what features can easily be collected. There is little sense in identifying huge numbers of variables that cannot be easily or reliably collected in practice. Estimation by analogy can cope with both continuous and categorical data, although categorical data has to be held as binary values. For instance, programming language would be represented as a series of truth-valued variables, e.g., COBOL, 4GL, C++, etc. The reason for this is that the similarity measure treats categorical features as either being the same or different: there are no degrees of difference (a sketch of this encoding appears below).

The second stage is to agree definitions as to what is being collected. Even within an organization there may be no shared understanding of what is meant by effort. Any estimation program will be flawed, possibly fatally, if different projects are measuring the same features in different ways. It is also important to identify who is responsible for the data collection and when they should collect the data.
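A minimal sketch of this binary encoding, using hypothetical feature names rather than the actual ANGEL template format:

def encode_categorical(project, feature, categories):
    # Expand one categorical feature into one truth-valued variable per
    # category, so the similarity measure sees only match/mismatch.
    encoded = dict(project)
    value = encoded.pop(feature)
    for category in categories:
        encoded[f"{feature}_{category}"] = (value == category)
    return encoded

p = {"language": "COBOL", "interfaces": 12}
print(encode_categorical(p, "language", ["COBOL", "4GL", "C++"]))
# {'interfaces': 12, 'language_COBOL': True, 'language_4GL': False,
#  'language_C++': False}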
Sometimes it can be beneficial to have the same person collecting the data across projects in order to increase the level of consistency.

Next, the case base must be populated. Like all estimation methods, other than inspired guesswork, analogy requires some data collection. Our experience suggests that a minimum of 10-12 projects is required in order to provide a stable basis for estimation. In general, more data is preferable although, in most cases, data collection will be an on-going process as projects are completed and their effort data becomes available. However, there appear to exist some trade-offs between the size of the dataset and homogeneity. Again, our experience suggests there is merit in the strategy of dividing highly distinct projects into separate datasets. Often this separation is quite straightforward, using such distinguishing features as application type or development site.

The penultimate stage is to tune the estimation method. The user will also need to experiment with the optimum number of analogies searched for, and whether to use a subset of variables, since some features may not usefully contribute to the process of finding effective analogies. Tuning can make quite a difference to the quality of predictions (typically it can yield a twofold improvement in performance) and for this reason the ANGEL tool provides automated support for this process.

The last stage is to estimate for a new project. It must be possible to characterise the project in terms of the variables that have been identified at the first stage of the estimation process. From these variables, ANGEL can be used to find similar projects, and the user can make a subjective judgment as to the value of the analogies. Where they are believed to be trustworthy, the prediction can be relied on to a greater extent than where they are thought to be doubtful. Here we wish to sound a note of caution. The value of estimation by analogy as an independent source of prediction will be somewhat reduced if the users discount values that are not consistent with their prior beliefs, and for this reason there was no expert intervention or manipulation in any of the foregoing analysis. Another indicator of likely prediction quality is the average MMRE figure obtained through jack-knifing the dataset. Again, a low figure will indicate more confidence than a high figure.
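The estimation step itself can be sketched as follows, reusing the similarity function sketched earlier; both the unweighted and the similarity-weighted average of up to k = 3 analogies are shown because, as noted in Section 3, neither variant is consistently more accurate:

def estimate_effort(new_project, case_base, features, k=3, weighted=False):
    # Rank completed projects by similarity and keep the top k analogies.
    analogies = sorted(case_base,
                       key=lambda c: similarity(new_project, c, features),
                       reverse=True)[:k]
    if not weighted:
        return sum(c["effort"] for c in analogies) / len(analogies)
    weights = [similarity(new_project, c, features) for c in analogies]
    return (sum(w * c["effort"] for w, c in zip(weights, analogies))
            / sum(weights))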
7 CONCLUSIONS

Accurate estimation of software project effort at an early stage in the development process is a significant challenge for the software engineering community. This paper has described a technique based upon the use of analogy, sometimes referred to as case-based reasoning. We have compared the use of analogy with prediction models based upon stepwise regression analysis for nine datasets, a total of 275 projects. A striking pattern emerges in that estimation by analogy produces a superior predictive performance in all cases when measured by MMRE, and in seven out of nine cases for the Pred(25) indicator. Moreover, estimation by analogy is able to operate in circumstances where it is not possible to generate an algorithmic model, such as the dataset Real-time 1, where all the data was categorical in nature, or the Mermaid N dataset, where no statistically significant relationships could be found. We believe this type of situation may be quite common, particularly at a very early stage in a project, for example in response to an invitation to tender. This makes analogy an attractive method for producing very early estimates.

Estimation by analogy also offers an advantage in that it is a very intuitive method. There is some evidence to suggest that practitioners use analogies when making estimates by means of informal methods [8]. Our approach allows users to assess the reasoning process behind a prediction by identifying the most analogous projects, thereby increasing, or reducing, their confidence in the prediction.

Many experts have suggested that it is appropriate to use more than one method when predicting software development effort. We believe that estimation by analogy is a viable technique and can usefully contribute to this process. This is not to suggest that it is without weakness, but on the empirical evidence presented in this paper it is certainly worthy of further consideration.

APPENDIX A

(The appendix tabulates the Telecom 1 dataset: 18 enhancement projects with columns ACT, ACT_DEV, ACT_TST, CHNGS, and FILES. The numeric values are not legible in this reproduction.)

The above data is drawn from the dataset Telecom 1. ACT is actual effort, ACT_DEV and ACT_TST are actual development and testing effort, respectively. CHNGS is the number of changes made as recorded by the configuration management system, and FILES is the number of files changed by the particular enhancement project. Only FILES can be used for predictive purposes, since none of the other information would be available at the time of making the prediction.

ACKNOWLEDGMENTS

The authors are grateful to the Finnish TIEKE organization for granting the authors leave to use the Finnish dataset; to Barbara Kitchenham for supplying the Mermaid dataset; to Bob Hughes for supplying the dataset Telecom 2; and to anonymous staff for the provision of datasets Telecom 1 and Real-time 1. Many improvements have been suggested by Dan Diaper, Pat Dugard, Bob Hughes, Barbara Kitchenham, Steve MacDonell, Austen Rainer, and Bill Samson. This work has been supported by British Telecom, the U.K. Engineering and Physical Sciences Research Council under Grant GR/L37298, and the Defence Research Agency.
REFERENCES

[1] D.W. Aha, "Case-Based Learning Algorithms," Proc. 1991 DARPA Case-Based Reasoning Workshop, Morgan Kaufmann, 1991.
[2] A.J. Albrecht and J.R. Gaffney, "Software Function, Source Lines of Code, and Development Effort Prediction: A Software Science Validation," IEEE Trans. Software Eng., vol. 9, no. 6, pp. 639-648, 1983.
[3] K. Atkinson and M.J. Shepperd, "The Use of Function Points to Find Cost Analogies," Proc. European Software Cost Modelling Meeting, Ivrea, Italy, 1994.
[4] B.W. Boehm, "Software Engineering Economics," IEEE Trans. Software Eng., vol. 10, no. 1, pp. 4-21, 1984.
[5] L.C. Briand, V.R. Basili, and W.M. Thomas, "A Pattern Recognition Approach for Software Engineering Data Analysis," IEEE Trans. Software Eng., vol. 18, no. 11, pp. 931-942, 1992.
[6] S. Conte, H. Dunsmore, and V.Y. Shen, Software Engineering Metrics and Models. Menlo Park, Calif.: Benjamin Cummings, 1986.
[7] J.M. Desharnais, "Analyse statistique de la productivité des projets informatiques à partir de la technique des points de fonction," masters thesis, Univ. of Montreal, 1989.
[8] R.T. Hughes, "Expert Judgement as an Estimating Method," Information and Software Technology, vol. 38, no. 2, pp. 67-75, 1996.
[9] D.R. Jeffery, G.C. Low, and M. Barnes, "A Comparison of Function Point Counting Techniques," IEEE Trans. Software Eng., vol. 19, no. 5, pp. 529-532, 1993.
[10] R. Jeffery and J. Stathis, "Specification Based Software Sizing: An Empirical Investigation of Function Metrics," Proc. NASA Goddard Software Eng. Workshop, Greenbelt, Md., 1993.
[11] N. Karunanithi, D. Whitley, and Y.K. Malaiya, "Using Neural Networks in Reliability Prediction," IEEE Software, vol. 9, no. 4, pp. 53-59, 1992.
[12] C.F. Kemerer, "An Empirical Validation of Software Cost Estimation Models," Comm. ACM, vol. 30, no. 5, pp. 416-429, 1987.
[13] B.A. Kitchenham and K. Kansala, "Inter-Item Correlations among Function Points," Proc. First Int'l Symp. Software Metrics, Baltimore, Md.: IEEE CS Press, 1993.
[14] B.A. Kitchenham and N.R. Taylor, "Software Cost Models," ICL Technology J., vol. 4, no. 3, pp. 73-102, 1984.
[15] P. Kok, B.A. Kitchenham, and J. Kirakowski, "The MERMAID Approach to Software Cost Estimation," Proc. ESPRIT Technical Week, 1990.
[16] J.L. Kolodner, Case-Based Reasoning. Morgan Kaufmann, 1993.
[17] J.E. Matson, B.E. Barrett, and J.M. Mellichamp, "Software Development Cost Estimation Using Function Points," IEEE Trans. Software Eng., vol. 20, no. 4, pp. 275-287, 1994.
[18] Y. Miyazaki and K. Mori, "COCOMO Evaluation and Tailoring," Proc. Eighth Int'l Software Eng. Conf., London: IEEE CS Press, 1985.
[19] Y. Miyazaki et al., "Method to Estimate Parameter Values in Software Prediction Models," Information and Software Technology, vol. 33, no. 3, pp. 239-243, 1991.
[20] T. Mukhopadhyay, S.S. Vicinanza, and M.J. Prietula, "Examining the Feasibility of a Case-Based Reasoning Model for Software Effort Estimation," MIS Quarterly, vol. 16, pp. 155-171, June 1992.
[21] A. Porter and R. Selby, "Empirically Guided Software Development Using Metric-Based Classification Trees," IEEE Software, no. 7, pp. 46-54, 1990.
[22] A. Porter and R. Selby, "Evaluating Techniques for Generating Metric-Based Classification Trees," J. Systems Software, vol. 12, pp. 209-218, 1990.
[23] E. Rich and K. Knight, Artificial Intelligence, second edition. McGraw-Hill, 1995.
[24] B. Samson, D. Ellison, and P. Dugard, "Software Cost Estimation Using an Albus Perceptron (CMAC)," Information and Software Technology, vol. 39, nos. 1/2, 1997.
[25] C. Serluca, "An Investigation into Software Effort Estimation Using a Back Propagation Neural Network," MSc dissertation, Bournemouth Univ., 1995.
[26] M.J. Shepperd, C. Schofield, and B.A. Kitchenham, "Effort Estimation Using Analogy," Proc. 18th Int'l Conf. Software Eng., Berlin: IEEE CS Press, 1996.
[27] K. Srinivasan and D. Fisher, "Machine Learning Approaches to Estimating Development Effort," IEEE Trans. Software Eng., vol. 21, no. 2, pp. 126-137, 1995.
[28] G.E. Wittig and G.R. Finnie, "Using Artificial Neural Networks and Function Points to Estimate 4GL Software Development Effort," Australian J. Information Systems, vol. 1, no. 2, pp. 87-94, 1994.

Martin Shepperd received a BSc degree (honors) in economics from Exeter University, an MSc degree from Aston University, and the PhD degree from the Open University, the latter two in computer science. He has a chair in software engineering at Bournemouth University. Professor Shepperd has written three books and published more than 50 papers in the areas of software metrics and process modeling.

Chris Schofield received a BSc degree (honors) in software engineering management from Bournemouth University, where he is presently studying for his PhD. His research interests include software metrics and cost estimation.
A Critique of Software Defect Prediction Models

Norman E. Fenton, Member, IEEE Computer Society, and Martin Neil, Member, IEEE Computer Society

Abstract—Many organizations want to predict the number of defects (faults) in software systems, before they are deployed, to gauge the likely delivered quality and maintenance effort. To help in this, numerous software metrics and statistical models have been developed, with a correspondingly large literature. We provide a critical review of this literature and the state-of-the-art. Most of the wide range of prediction models use size and complexity metrics to predict defects. Others are based on testing data, the "quality" of the development process, or take a multivariate approach. The authors of the models have often made heroic contributions to a subject otherwise bereft of empirical studies. However, there are a number of serious theoretical and practical problems in many studies. The models are weak because of their inability to cope with the, as yet, unknown relationship between defects and failures. There are fundamental statistical and data quality problems that undermine model validity. More significantly, many prediction models tend to model only part of the underlying problem and seriously misspecify it. To illustrate these points the "Goldilock's Conjecture," that there is an optimum module size, is used to show the considerable problems inherent in current defect prediction approaches. Careful and considered analysis of past and new results shows that the conjecture lacks support and that some models are misleading. We recommend holistic models for software defect prediction, using Bayesian Belief Networks, as alternative approaches to the single-issue models used at present. We also argue for research into a theory of "software decomposition" in order to test hypotheses about defect introduction and help construct a better science of software engineering.

Index Terms—Software faults and failures, defects, complexity metrics, fault-density, Bayesian Belief Networks.
1 INTRODUCTION
Organizations are still asking how they can predict the quality of their software before it is used, despite the substantial research effort spent attempting to find an answer to this question over the last 30 years. There are many papers advocating statistical models and metrics which purport to answer the quality question. Defects, like quality, can be defined in many different ways but are more commonly defined as deviations from specifications or expectations which might lead to failures in operation.

Generally, efforts have tended to concentrate on the following three problem perspectives [1], [2], [3]:

1) predicting the number of defects in the system;
2) estimating the reliability of the system in terms of time to failure;
3) understanding the impact of design and testing processes on defect counts and failure densities.

A wide range of prediction models have been proposed. Complexity and size metrics have been used in an attempt to predict the number of defects a system will reveal in operation or testing. Reliability models have been developed to predict failure rates based on the expected operational usage profile of the system. Information from defect detection and the testing process has been used to predict defects. The maturity of design and testing processes has been advanced as a way of reducing defects. Recently, large complex multivariate statistical models have been produced in an attempt to find a single complexity metric that will account for defects.

This paper provides a critical review of this literature with the purpose of identifying future avenues of research. We cover complexity and size metrics (Section 2), the testing process (Section 3), the design and development process (Section 4), and recent multivariate studies (Section 5). For a comprehensive discussion of reliability models, see [4]. We uncover a number of theoretical and practical problems in these studies in Section 6, in particular the so-called "Goldilock's Conjecture." Despite the many efforts to predict defects, there appears to be little consensus on what the constituent elements of the problem really are. In Section 7, we suggest a way to improve the defect prediction situation by describing a prototype, Bayesian Belief Network (BBN) based, model which we feel can at least partly solve the problems identified. Finally, in Section 8 we record our conclusions.

• N.E. Fenton and M. Neil are with the Centre for Software Reliability, Northampton Square, London EC1V 0HB, England. E-mail: {n.fenton, martin}@csr.city.ac.uk.

Manuscript received 3 Sept. 1997; revised 25 Aug. 1998. Recommended for acceptance by R. Hamlet. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 105579.

2 PREDICTION USING SIZE AND COMPLEXITY METRICS

Most defect prediction studies are based on size and complexity metrics. The earliest such study appears to have been Akiyama's, [5], which was based on a system developed at Fujitsu. It is typical of regression based "data fitting" models which became commonplace in the literature. The study showed that linear models of some simple metrics provide reasonable estimates for the total number of defects D (the dependent variable), which is actually defined as the sum of the defects found during testing and the defects found during two months after release. Akiyama computed four regression equations. Akiyama's first equation (1) predicted defects from lines of code (LOC). From (1) it can be calculated that a 1,000 LOC (i.e., 1 KLOC) module is expected to have approximately 23 defects.
D = 4.86 + 0.018L (1)

Other equations had the following dependent metrics: number of decisions C; number of subroutine calls J; and a composite metric, C + J.

Another early study, by Ferdinand, [6], argued that the expected number of defects increases with the number n of code segments; a code segment is a sequence of executable statements which, once entered, must all be executed. Specifically, the theory asserts that for smaller numbers of segments the number of defects is proportional to a power of n; for larger numbers of segments the number of defects increases as a constant to the power n.

Halstead, [7], proposed a number of size metrics, which have been interpreted as "complexity" metrics, and used these as predictors of program defects. Most notably, Halstead asserted that the number of defects D in a program P is predicted by (2):

D = V / 3,000 (2)

where V is the (language dependent) volume metric (which, like all the Halstead metrics, is defined in terms of the number of unique operators and unique operands in P; for details see [8]). The divisor 3,000 represents the mean number of mental discriminations between decisions made by the programmer. Each such decision possibly results in error and thereby a residual defect. Thus, Halstead's model was, unlike Akiyama's, based on some kind of theory. Interestingly, Halstead himself validated (2) using Akiyama's data. Ottenstein, [9], obtained similar results to Halstead.

Lipow, [10], went much further, because he got round the problem of computing V directly in (2) by using lines of executable code L instead. Specifically, he used the Halstead theory to compute a series of equations of the form:

D/L = A0 + A1 ln L + A2 (ln L)² (3)

where each of the Ai is dependent on the average number of usages of operators and operands per LOC for a particular language. For example, for Fortran A0 = 0.0047, A1 = 0.0023, A2 = 0.000043; for an assembly language A0 = 0.0012, A1 = 0.0001, A2 = 0.000002.

Gaffney, [11], argued that the relationship between D and L was not language dependent. He used Lipow's own data to deduce the prediction (4):

D = 4.2 + 0.0015 L^(4/3) (4)

An interesting ramification of this was that there was an optimal size for individual modules with respect to defect density. For (4) this optimum module size is 877 LOC. Numerous other researchers have since reported on optimal module sizes. For example, Compton and Withrow of UNISYS derived the following polynomial equation, [12]:

D = 0.069 + 0.00156L + 0.00000047 L² (5)

Based on (5) and further analysis, Compton and Withrow concluded that the optimum size for an Ada module, with respect to minimizing error density, is 83 source statements. They dubbed this the "Goldilocks Principle," with the idea that there is an optimum module size that is "not too big nor too small."
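As a quick arithmetic check on the two headline figures above (our own calculation, not code from the paper), the following sketch reproduces Akiyama's roughly 23 defects per KLOC from (1) and recovers Gaffney's optimum module size by minimizing the defect density implied by (4):

# Akiyama's equation (1): defects expected in a 1,000 LOC module.
print(4.86 + 0.018 * 1000)             # 22.86, i.e., approximately 23

# Gaffney's equation (4) implies a defect density of
#   D/L = 4.2/L + 0.0015 * L**(1/3),
# whose derivative vanishes at L = (4.2 / 0.0005)**0.75.
def defect_density(loc):
    return 4.2 / loc + 0.0015 * loc ** (1.0 / 3.0)

optimum = (4.2 / 0.0005) ** 0.75
print(optimum)                          # ~877.5, matching the 877 LOC above

# Density really is higher on either side of the stationary point.
assert defect_density(optimum) < defect_density(optimum / 2)
assert defect_density(optimum) < defect_density(optimum * 2)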
The phenomenon that larger modules can have lower defect densities was confirmed in [13], [14], [15]. Basili and Perricone argued that this may be explained by the fact that there are a large number of interface defects distributed evenly across modules. Moller and Paulish suggested that larger modules tend to be developed more carefully; they discovered that modules consisting of more than 70 lines of code have similar defect densities, while for modules of size less than 70 lines of code the defect density increases significantly. Similar experiences are reported by [16], [17].

Hatton examined a number of data sets, [15], [18], and concluded that there was evidence of "macroscopic behavior" common to all data sets despite the massive internal complexity of each system studied, [19]. This behavior was likened to "molecules" in a gas and used to conjecture an entropy model for defects which also borrowed from ideas in cognitive psychology. Assuming that short-term memory affects the rate of human error, he developed a logarithmic model, made up of two parts, and fitted it to the data sets.¹ The first part modeled the effects of small modules on short-term memory, while the second modeled the effects of large modules. He asserted that, for module sizes above 200-400 lines of code, the human "memory cache" overflows and mistakes are made, leading to defects. For systems decomposed into smaller pieces than this cache limit, the human memory cache is used inefficiently, storing links between modules, thus also leading to more defects. He concluded that larger components are proportionally more reliable than smaller components. Clearly this would, if true, cast serious doubt over the theory of program decomposition, which is so central to software engineering.

1. There is nothing new here since Halstead [3] was one of the first to apply Miller's finding that people can only effectively recall seven plus or minus two items from their short-term memory. Likewise the construction of a partitioned model contrasting "small" module effects on faults and "large" module effects on faults was done by Compton and Withrow in 1990 [7].

The realization that size-based metrics alone are poor general predictors of defect density spurred on much research into more discriminating complexity metrics. McCabe's cyclomatic complexity, [20], has been used in many studies, but it too is essentially a size measure (being equal to the number of decisions plus one in most programs). Kitchenham et al., [21], examined the relationship between the changes experienced by two subsystems and a number of metrics, including McCabe's metric. Two different regression equations resulted, (6) and (7):

C = 0.042MCI - 0.075N + 0.00001HE (6)

C = 0.25MCI - 0.53DI + 0.09VG (7)

For the first subsystem, changes, C, was found to be reasonably dependent on machine code instructions, MCI, operator and operand totals, N, and Halstead's effort metric, HE. For the other subsystem, McCabe's complexity metric, VG, was found to partially explain C, along with machine code instructions, MCI, and data items, DI.

All of the metrics discussed so far are defined on code. There are now a large number of metrics available earlier in the life-cycle, most of which have been claimed by their proponents to have some predictive powers with respect
to residual defect density. For example, there have been numerous attempts to define metrics which can be extracted from design documents, using counts of "between module complexity" such as call statements and data flows; the most well known are the metrics in [22]. Ohlsson and Alberg, [23], reported on a study at Ericsson where metrics derived automatically from design documents were used to predict especially fault-prone modules prior to testing. Recently, there have been several attempts, such as [24], [25], to define metrics on object-oriented designs.

The advent and widespread use of Albrecht Function Points (FPs) raises the possibility of defect density predictions based on a metric which can be extracted at the specification stage. There is widespread belief that FPs are a better (one-dimensional) size metric than LOC; in theory at least they get round the problems of lack of uniformity, and they are also language independent. We already see defect density defined in terms of defects per FP, and empirical studies are emerging that seem likely to be the basis for predictive models. For example, in Table 1, [26] reports the following bench-marking study, reportedly based on large amounts of data from different commercial sources.
TABLE 1
DEFECTS PER FUNCTION POINT BY DEFECT ORIGIN (BENCHMARKING DATA REPORTED IN [26])

Defect Origins | Defects per Function Point
Requirements   | 1.00
Design         | 1.25
Coding         | 1.75
Documentation  | 0.60
Bad fixes      | 0.40
Total          | 5.00
3 PREDICTION USING TESTING METRICS

TABLE 2
DEFECTS FOUND PER TESTING APPROACH

Testing Type        | Defects Found/hr
Regular use         | 0.210
Black box           | 0.282
White box           | 0.322
Reading/inspections | 1.057
"inherent to the p r o g r a m m i n g process itself." Also useful (providing y o u are aware of t h e kind of limitations discussed in [33]) is the kind of data published by [34] in Table 2. O n e class of testing metrics that a p p e a r to be quite promising for predicting defects a r e t h e so called test coverage measures, A structural testing strategy specifies that we
have to select enough test cases so that each of a set of "obSome of the most promising local models for predicting jects" in a program lie on some path (i.e., are "covered") in residual defects involve very careful collection of data a t least on test case. For example, statement coverage is a about defects discovered during early inspection and test- structural testing strategy in which the "objects" are the ing phases. The idea is very simple: you have n predefined statements. For a given strategy and a given set of test cases phases at which you collect data dn (the defect rate. Sup- we can ask what proportion of coverage has been achieved, pose phase n represents the period of the first six months of The resulting metric is defined as the Test Effectiveness Rathe product in the field, so that dn is the rate of defects tio (TER) with respect to that strategy. For example, TER1 is found within that period. To predict dn at phase n - 1 the TER for statement coverage; TER2 is the TER for branch (which might be integration testing) you look at the actual coverage; and TER3 is the TER for linear code sequence and sequence d\, .... dn_i and compare this with profiles of simi- jump coverage. Clearly we might expect the number of dislar, previous products, and use statistical extrapolation covered defects to approach the number of defects actually techniques. With enough data it is possible to get accurate in the program as the values of these TER metrics increases, predictions of dn based on observed du .... dm where m is less Veevers and Marshall, [35], report on some defect and relithan n-l. This method is an important feature of the Japa- ability prediction models using these metrics which give nese software factory approach [27], [28], [29]. Extremely quite promising results. Interestingly Neil, [36], reported accurate predictions are claimed (usually within 95 percent that the modules with high structural complexity metric confidence limits) due to stability of the development and values had a significantly lower TER than smaller modules, testing environment and the extent of data collection. It T h i s supports our intuition that testing larger modules is appears that the IBM NASA Space shuttle team is achieving m o r e difficult and that such modules would appear more similarly accurate predictions based on the same kind of l i k e t y t 0 contain undetected defects. approach [18] Voas and Miller use static analysis of programs to conjecIn the absence of an extensive local database it may be t u r e t h e presence or absence of defects before testing has possible to use published bench-marking data to help with t a k e n P ] a c e ' I37J- T h e i r method relies on a notion of program this kind of prediction. Dyer, [30], and Humphrey, [31], con- testability, which seeks to determine how likely a program tain a lot of this kind of data. Buck and Robbins, [32], report w i U fail assuming it contains defects. Some programs will on some remarkably consistent defect density values during contain defects that may be difficult to discover by testing by different review and testing stages across different types of v i r t u e o f t h e i r structure and organization. Such programs software projects at IBM. For example, for new code devel- h a v e a I o w d e f e c t revealing potential and may, therefore, oped the number of defects per KLOC discovered with Fa- h i d e defects until they show themselves as failures during gan inspections settles to a number between 8 and 12. There operation. 
Voas and Miller use program mutation analysis to is no such consistency for old code. Also the number of man- simulate the conditions that would cause a defect to reveal hours spent on the inspection process per major defect is i t s e l f ^ a f a i l u r e if a defect was indeed present. Essentially if always between three and five. The authors speculate that, program testability could be estimated before testing takes despite being unsubstantiated with data, these values form P l a c e t h e estimates could help predict those programs that "natural numbers of programming," believing that they are w o u l d reveal less defects during testing even if they contained
defects. Bertolino and Strigini, [38], provide an alternative exposition of testability measurement and its relation to testing, debugging, and reliability assessment.
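The phase-based extrapolation described at the start of this section can be sketched as follows; this is our own illustration of the general idea (scaling a similar product's defect-rate profile to the rates observed so far), not a documented algorithm from [27], [28], [29]:

def predict_phase_rate(observed, profile, target_phase):
    # observed: rates d1..dm for the new product (m phases so far).
    # profile: full per-phase rates of a similar completed product.
    # target_phase: 0-based index of the later phase to predict.
    m = len(observed)
    # Least-squares scale factor over the phases both products share.
    scale = (sum(o * p for o, p in zip(observed, profile[:m]))
             / sum(p * p for p in profile[:m]))
    return scale * profile[target_phase]

# Hypothetical per-phase defect rates (defects/KLOC).
past = [9.0, 6.0, 3.5, 1.2, 0.4]
print(predict_phase_rate([10.0, 6.5], past, target_phase=4))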
4 PREDICTION USING PROCESS QUALITY DATA
There are many experts who argue that the "quality" of the development process is the best predictor of product quality (and hence, by default, of residual defect density). This issue, and the problems surrounding it, is discussed extensively in [33]. There is a dearth of empirical evidence linking process quality to product quality. The simplest metric of process quality is the five-level ordinal scale SEI Capability Maturity Model (CMM) ranking. Despite its widespread popularity, there was until recently no evidence to show that level (n + 1) companies generally deliver products with lower residual defect density than level (n) companies. The Diaz and Sligo study, [39], provides the first promising empirical support for this widely held assumption. Clearly the strict 1-5 ranking, as prescribed by the SEI CMM, is too coarse to be used directly for defect prediction, since not all of the processes covered by the CMM will relate to software quality.

The best available evidence relating particular process methods to defect density concerns the Cleanroom method [30]. There is independent validation that, for relatively small projects (less than 30 KLOC), the use of Cleanroom results in approximately three errors per KLOC during statistical testing, compared with traditional development postdelivery defect densities of between five and 10 defects per KLOC. Also, Capers Jones hypothesizes quality targets expressed in "defect potentials" and "delivered defects" for different CMM levels, as shown in Table 3 [40].
TABLE 3
RELATIONSHIP BETWEEN CMM LEVELS AND DELIVERED DEFECTS

SEI CMM Level | Defect Potentials | Removal Efficiency (%) | Delivered Defects
1             | 5                 | 85                     | 0.75
2             | 4                 | 89                     | 0.44
3             | 3                 | 91                     | 0.27
4             | 2                 | 93                     | 0.14
5             | 1                 | 95                     | 0.05
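As a consistency check on Table 3 (our own arithmetic, not from the paper), the delivered-defect column is exactly the defect potential multiplied by the fraction of defects not removed:

potentials = {1: 5, 2: 4, 3: 3, 4: 2, 5: 1}
efficiency = {1: 85, 2: 89, 3: 91, 4: 93, 5: 95}  # percent removed
for level in range(1, 6):
    delivered = potentials[level] * (1 - efficiency[level] / 100)
    print(level, round(delivered, 2))  # 0.75, 0.44, 0.27, 0.14, 0.05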
5 MULTIVARIATE APPROACHES

There have been many attempts to develop multilinear regression models based on multiple metrics. If there is a consensus of sorts about such approaches, it is that the accuracy of the predictions is never significantly worse when the metrics set is reduced to a handful (say 3-6 rather than 30), [41]; a major reason for this is that many of the metrics are colinear; that is, they capture the same underlying attribute (so the reduced set of metrics has the same information content, [42]). Thus, much work has concentrated on how to select those small number of metrics which are somehow the most powerful and/or representative. Principal Component Analysis (see [43]) is used in some of the studies to reduce the dimensionality of many related metrics to a smaller set of "principal components," while retaining most of the variation observed in the original metrics.

For example, [42] discovered that 38 metrics, collected on around 1,000 modules, could be reduced to six orthogonal dimensions that account for 90 percent of the variability. The most important dimensions, size, nesting, and prime, were then used to develop an equation to discriminate between low and high maintainability modules.

Munson and Khoshgoftaar in various papers, [41], [43], [44], use a similar technique, factor analysis, to reduce the dimensionality to a number of "independent" factors. These factors are then labeled so as to represent the "true" underlying dimension being measured, such as control, volume, and modularity. In [43] they used factor analytic variables to help fit regression models to a number of error data sets, including Akiyama's [5]. This helped to get over the inherent regression analysis problems presented by multicolinearity in metrics data. Munson and Khoshgoftaar have advanced the multivariate approach to calculate a "relative complexity metric." This metric is calculated using the magnitude of variability from each of the factor analysis dimensions as the input weights in a weighted sum. In this way a single metric integrates all of the information contained in a large number of metrics. This is seen to offer many advantages over using a univariate decision criterion such as McCabe's metric [44].

6 A CRITIQUE OF CURRENT APPROACHES TO DEFECT PREDICTION

Despite the heroic contributions made by the authors of previous empirical studies, serious flaws remain and have detrimentally influenced our models for defect prediction. Of course, such weaknesses exist in all scientific endeavours, but if we are to improve scientific enquiry in software engineering we must first recognize past mistakes before suggesting ways forward.

The key issues affecting the software engineering community's historical research direction, with respect to defect prediction, are:

• the unknown relationship between defects and failures (Section 6.1);
• problems with the "multivariate" statistical approach (Section 6.2);
• problems of using size and complexity metrics as sole "predictors" of defects (Section 6.3);
• problems in statistical methodology and data quality (Section 6.4);
• false claims about software decomposition and the "Goldilocks Conjecture" (Section 6.5).

6.1 The Unknown Relationship between Defects and Failures

There is considerable disagreement about the definitions of defects, errors, faults, and failures. In different studies defect counts refer to:

• postrelease defects;
• the total of "known" defects;
• the set of defects discovered after some arbitrary fixed point in the software life cycle (e.g., after unit testing).
The terminology differs widely between studies; defect rate, defect density, and failure rate are used almost interchangeably. It can also be difficult to tell whether a model is predicting discovered defects or residual defects. Because of these problems (which are discussed extensively in [45]) we have to be extremely careful about the way we interpret published predictive models.

Apart from these problems of terminology and definition, the most serious weakness of any prediction of residual defects or defect density concerns the weakness of defect count itself as a measure of software reliability. Even if we knew exactly the number of residual defects in our system, we would have to be extremely wary about making definitive statements about how the system will operate in practice. The reasons for this appear to be:

• difficulty of determining in advance the seriousness of a defect; few of the empirical studies attempt to distinguish different classes of defects;
• great variability in the way systems are used by different users, resulting in wide variations of operational profiles. It is thus difficult to predict which defects are likely to lead to failures (or to commonly occurring failures).

The latter point is particularly serious and has been highlighted dramatically by [46]. Adams examined data from nine large software products, each with many thousands of years of logged use world wide. He charted the relationship between detected defects and their manifestation as failures. For example, 33 percent of all defects led to failures with a mean time to failure greater than 5,000 years. In practical terms, this means that such defects will almost never manifest themselves as failures. Conversely, the proportion of defects which led to a mean time to failure of less than 50 years was very small (around 2 percent). However, it is these defects which are the important ones to find, since these are the ones which eventually exhibit themselves as failures to a significant number of users. Thus Adams' data demonstrates the Pareto principle: a very small proportion of the defects in a system will lead to almost all the observed failures in a given period of time; conversely, most defects in a system are benign in the sense that in the same given period of time they will not lead to failures.

It follows that finding (and removing) large numbers of defects may not necessarily lead to improved reliability. It also follows that a very accurate residual defect density prediction may be a very poor predictor of operational reliability, as has been observed in practice [47]. This means we should be very wary of attempts to equate fault densities with failure rates, as proposed for example by Capers Jones (Table 4 [48]). Although highly attractive in principle, such a model does not stand up to empirical validation.

TABLE 4
DEFECT DENSITY (F/KLOC) VS. MTTF

F/KLOC | MTTF
>30    | 1 min
20-30  | 4-5 min
5-10   | 1 hr
2-5    | several hours
0.5-1  | 1 month

Defect counts cannot be used to predict reliability because, despite their usefulness from a system developer's point of view, they do not measure the quality of the system as the user is likely to experience it. The promotion of defect counts as a measure of "general quality" is, therefore, misleading. Reliability prediction should, therefore, be viewed as complementary to defect density prediction.

6.2 Problems with the Multivariate Approach

Applying multivariate techniques, like factor analysis, produces metrics which cannot be easily or directly interpreted in terms of program features. For example, in [43] a factor dimension metric, control, was calculated by the weighted sum (8):

control = a1HNK + a2PRC + a3E + a4VG + a5MMC + a6Error + a7FMP + a8LOC (8)

where the ai are derived from factor analysis. HNK was Henry and Kafura's information flow complexity metric, PRC is a count of the number of procedures, E is Halstead's effort metric, VG is McCabe's complexity metric, MMC is Harrison's complexity metric, and LOC is lines of code. Although this equation might help to avoid multicolinearity, it is hard to see how you might advise a programmer or designer on how to redesign the programs to achieve a better control metric value for a given module. Likewise, the effects of such a change in module control on defects is less than clear.

These problems are compounded in the search for an ultimate or relative complexity metric [43]. The simplicity of a single number is appealing, but the foundations of measurement are based on identifying differing well-defined attributes with single standard measures [45]. Although there is a clear role for data reduction and analysis techniques, such as factor analysis, this should not be confused with, or used instead of, measurement theory. For example, statement count and lines of code are highly correlated because programs with more lines of code typically have a higher number of statements. This does not mean that the true size of programs is some combination of the two metrics. A more suitable explanation would be that both are alternative measures of the same attribute. After all, centigrade and fahrenheit are highly correlated measures of temperature. Meteorologists have agreed a convention to use one of these as a standard in weather forecasts. In the United States temperature is most often quoted as fahrenheit, while in the United Kingdom it is quoted as centigrade. They do not take a weighted sum of both temperature measures. This point lends support to the need to define meaningful and standard measures for specific attributes rather than searching for a single metric using the multivariate approach.
2. Here we use the technical concept of reliability, defined as mean time to failure or probability of failure on demand, in contrast to the "looser" concept of reliability with its emphasis on defects.
6.3 Problems in Using Size and Complexity Metrics to Predict Defects

A discussion of the theoretical and empirical problems with many of the individual metrics discussed above may be found in [45]. There are as many empirical studies (see, for example, [49], [50], [51]) refuting the models based on Halstead and McCabe as there are studies "validating" them. Moreover, some of the latter are seriously flawed. Here we concentrate entirely on their use within models used to predict defects.

The majority of size and complexity models assume a straightforward relationship with defects: defects are a function of size, or defects are caused by program complexity. Despite the reported high correlations between design complexity and defects, the relationship is clearly not a straightforward one. It is clear that it is not entirely causal because, if it were, we couldn't explain the presence of defects introduced when the requirements are defined. It is wrong to mistake correlation for causation. An analogy would be the significant positive correlation between IQ and height in children. It would be dangerous to predict IQ from height because height doesn't cause high IQ; the underlying causal factor is physical and mental maturation.

There are a number of interesting observations about the way complexity metrics are used to predict defect counts:

• the models ignore the causal effects of programmers and designers. After all, it is they who introduce the defects, so any attribution for faulty code must finally rest with individual(s);
• overly complex programs are themselves a consequence of poor design ability or problem difficulty. Difficult problems might demand complex solutions, and novice programmers might produce "spaghetti code";
• defects may be introduced at the design stage because of the overcomplexity of the designs already produced. Clerical errors and mistakes will be committed because the existing design is difficult to comprehend. Defects of this type are "inconsistencies" between design modules and can be thought of as quite distinct from requirements defects.

6.4 Problems in Data Quality and Statistical Methodology

The weight given to knowledge obtained by empirical means rests on the quality of the data collected and the degree of rigor employed in analyzing this data. Problems in either data quality or analysis may be enough to make the resulting conclusions invalid. Unfortunately, some defect prediction studies have suffered from such problems. These problems are caused, in the main, by a lack of attention to the assumptions necessary for successful use of a particular statistical technique. Other serious problems include the lack of distinction made between model fitting and model prediction, and the unjustified removal of data points or misuse of averaged data.

The ability to replicate results is a key component of any empirical discipline. In software development, different findings from diverse experiments could be explained by the fact that different, perhaps uncontrolled, processes were used on different projects. Comparability over case studies might be better achieved if the processes used during development were documented, along with estimates of the extent to which they were actually followed.

6.4.1 Multicolinearity

Multicolinearity is the most common methodological problem encountered in the literature. Multicolinearity is present when a number of predictor variables are highly positively or negatively correlated. Linear regression depends on the assumption of zero correlation between predictor variables [52]. The consequences of multicolinearity are manyfold: it causes unstable coefficients, misleading statistical tests, and unexpected coefficient signs. For example, one of the equations in [21] (9):

  C = 0.042·MCI − 0.075·N + 0.00001·HE   (9)

shows clear signs of multicolinearity. If we examine the equation coefficients we can see that an increase in the operator and operand total, N, should result in a decrease in changes, C, all things being equal. This is clearly counterintuitive. In fact, analysis of the data reveals that machine code instructions, MCI, operand and operator count, N, and Halstead's effort metric, HE, are all highly correlated [42]. This type of problem appears to be common in the software metrics literature, and some recent studies appear to have fallen victim to the multicolinearity problem [12], [53].

Colinearity between variables has also been detected in a number of studies that reported a negative correlation between defect density and module size. Rosenberg reports that, since there must be a negative correlation between X (size) and 1/X, it follows that the correlation between X and Y/X (defects/size) must be negative whenever defects are growing at most linearly with size [54]. Studies which have postulated such a linear relationship are more than likely to have detected negative correlation, and therefore concluded that large modules have smaller defect densities, because of this property of arithmetic.

6.4.2 Factor Analysis vs. Principal Components Analysis

The use of factor analysis and principal components analysis solves the multicolinearity problem by creating new orthogonal factors or principal component dimensions [43]. Unfortunately, the application of factor analysis assumes the errors are Gaussian, whereas [55] notes that most software metrics data is non-Gaussian. Principal components analysis can be used instead of factor analysis because it does not rely on any distributional assumptions, but it will on many occasions produce results broadly in agreement with factor analysis. This makes the distinction a minor one, but one that needs to be considered.

6.4.3 Fitting Models vs. Predicting Data

Regression modeling approaches are typically concerned with fitting models to data rather than predicting data. Regression analysis typically finds the least-squares fit to the data, and the goodness of this fit demonstrates how well the model explains historical data. However, a truly successful model is one which can predict the number of defects discovered in an unknown module. Furthermore, this must be a module not used in the derivation of the model. Unfortunately, perhaps because of the shortage of data, some researchers have tended to use their data to fit the model without being able to test the resultant model out on a new data set. See, for example, [5], [12], [16].
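The multicolinearity problem of Section 6.4.1 is easy to reproduce. The following minimal sketch (not from the paper; the "size" and "statement count" data are synthetic) fits the same linear model to three resamples of near-identical predictors: the individual coefficients swing wildly and flip sign, even though the combined fit is stable.

```python
# A minimal sketch of multicolinearity: two nearly identical predictors
# produce unstable, sign-flipping regression coefficients across
# resamples, even though the fitted predictions barely change.
import numpy as np

rng = np.random.default_rng(1)
for trial in range(3):
    size = rng.normal(100, 20, 200)          # e.g., lines of code
    stmts = size + rng.normal(0, 0.2, 200)   # statement count, r > 0.9999
    defects = 0.05 * size + rng.normal(0, 2, 200)

    X = np.column_stack([np.ones(200), size, stmts])
    coef, *_ = np.linalg.lstsq(X, defects, rcond=None)
    print(f"trial {trial}: intercept={coef[0]:+.2f}, "
          f"b_size={coef[1]:+.2f}, b_stmts={coef[2]:+.2f}")
# The individual coefficients are meaningless here, so advice such as
# "reduce statement count to cut defects" cannot be read off the model,
# even when the overall fit looks good.
```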
6.4.4 Removing Data Points

In standard statistical practice there should normally be strong theoretical or practical justification for removing data points during analysis. Recording and transcription errors are often an acceptable reason. Unfortunately, it is often difficult to tell from published papers whether any data points have been removed before analysis and, if they have, the reasons why. One notable case is Compton and Withrow [12], who reported removing a large number of data points from the analysis because they represented modules that had experienced zero defects. Such action is surprising in view of the conjecture they wished to test: that defects were minimised around an optimum size for Ada. If the majority of smaller modules had zero defects, as it appears, then we cannot accept Compton and Withrow's conclusions about the "Goldilock's Conjecture."

6.4.5 Using "Averaged" Data

We believe that the use of averaged data in analysis, rather than the original data, prejudices many studies. The study in [19] uses graphs, apparently derived from the original NASA-Goddard data, plotting average size in statements against number of defects or defect density. Analysis of averages is one step removed from the original data and raises a number of issues. Using averages reduces the amount of information available to test the conjecture under study, and any conclusions will be correspondingly weaker. The classic study in [13] used average fault density of grouped data in a way that suggested a trend that was not supported by the raw data. The use of averages may be a practical way around the common problem where defect data is collected at a higher level, perhaps at the system or subsystem level, than is ideal: defects are recorded against systems or subsystems rather than against individual modules or procedures. As a consequence, data analysis must match defect data on systems against statement counts automatically collected at the module level. There may be some modules within a subsystem that are over-penalized when others keep the average high, because the other modules in that subsystem have more defects, or vice versa. Thus, we cannot completely trust any defect data collected in this way.

Misuse of averages has occurred in one other form. In Gaffney's paper [11], the rule for optimal module size was derived on the assumption that to calculate the total number of defects in a system we could use the same model as had been derived using module defect counts. The model derived at the module level is shown by (4) and can be extended to count the total defects in a system, D_T, based on the module sizes L_i, as in (10). The total number of modules in the system is denoted by N:

  D_T = Σ_{i=1..N} D_i = 4.2·N + 0.0015 · Σ_{i=1..N} L_i^{4/3}   (10)

Gaffney assumes that the average module size can be used to calculate the total defect count, and also the optimum module size for any system, using (11):

  D_T = 4.2·N + 0.0015·N · ( (Σ_{i=1..N} L_i) / N )^{4/3}   (11)

However, we can see that (10) and (11) are not equivalent. The use of (11) mistakenly assumes that the power of a sum is equal to a sum of powers (see the numeric sketch below).

6.5 The "Goldilock's Conjecture"

The results of inaccurate modeling and inference are perhaps most evident in the debate that surrounds the "Goldilock's Conjecture" discussed in Section 2: the idea that there is an optimum module size that is "not too big nor too small." Hatton [19] claims that there is

  "compelling empirical evidence from disparate sources to suggest that in any software system, larger components are proportionally more reliable than smaller components."

If these results were generally true, the implications for software engineering would be very serious indeed. It would mean that program decomposition as a way of solving problems simply did not work. Virtually all of the work done in software engineering, extending from fundamental concepts, like modularity and information-hiding, to methods, like object-oriented and structured design, would be suspect because all of them rely on some notion of decomposition. If decomposition doesn't work, then there would be no good reason for doing it.

Claims with such serious consequences as these deserve special attention. We must ask whether the data and knowledge exist to support them. These are clear criteria: if the data exist to refute the conjecture that large modules are better, and if we have a sensible explanation for this result, then a claim will stand. Our analysis shows that, using these criteria, these claims cannot currently stand. In the studies that support the conjecture we found the following problems:

• none define "module" in such a way as to make comparison across data sets possible;
• none explicitly compare different approaches to structuring and decomposing designs;
• the data analysis or quality of the data used could not support the results claimed;
• a number of factors exist that could partly explain the results, which these studies have neglected to examine.

Additionally, there are other data sets which do not show any clear relationships between module size and defect density.

If we examine the various results we can divide them into three main classes. The first class contains models, exemplified by Fig. 1a, that show how defect density falls as module size increases. Models such as these have been produced by Akiyama, Gaffney, and Basili and Perricone. The second class of models, exemplified by Fig. 1b, differ from the first because they show the Goldilock's principle at work: here defect density rises as modules get bigger in size. The third class, exemplified by Fig. 1c, shows no discernible pattern whatsoever; the relationship between defect density and module size appears random (no meaningful curvilinear models could be fitted to the data at all).
Fig. 1. Three classes of defect density results. (a) Akiyama (1971), Basili and Perricone (1984), and Gaffney (1984); (b) Moeller and Paulish (1993), Compton and Withrow (1990), and Hatton (1997); (c) Neil (1992) and Fenton and Ohlsson (1997).
"combined measure" of different size measures, such as deci- 7 PREDICTING DEFECTS USING BBNs sion counts. This principal component statistic was then plot^ ft from Qur is jn Section 6 ^ ion ted against the number of changes made to the system mod£ ^ defects be or sizf meas. ules (these were rpredominantlyJ changes made to fix defects). . . r . , J , .. -L, , „ . , , , , ,. , ° ,. . ures alone presents only a skewed picture. The number ofc This defect data was standardized according to normal statis, . ,. ,. , , , »j ; U r.. *• . , . • , .i . c-^. . i defects discovered is clearly related to ithe amount of testing ,. _• .. J / I_- u t. tical practice. A polynomial regression curve was fitted to the A , ,. . . * i • u tu *u • -cperformed, as discussed above. A program which has never data in order to determine whether there was significant f j , c .. , / &... , , . ,. _. „ . , r . , .t T , ,. been tested, or used for that matter, will have a zero defect nonlinear effects of size on defect density. The results were , , . . . , ,. , ., , . . , , , , ,, . p. o count, even though its complexity may be very high. Moreb here in Fig. 2. *u * * « *• c i ypublished and are reproduced y „ r i_ i •i u • over, we can assume the test effectiveness of complex proDespite some parameters of the polynomial curve being . . , , .„_, , , ij u . . „ .„ . . i . u /., . j. r grams is relatively low, 137], and such programs could be statistically significant it is obvious that there is no discerm- ° , . uu*i u r j c * ir ,. 1 . . . I _i c J j i • . expected to exhibit a lower number of defects per line of ble relationship between defect counts and module size in j , • ,. . . *u -U-J » J r » rr . _ , , ,, ii j i • j code during testing because they hide defects more effec. _. ,. , . ,,, .. , , ., . the Tandem data set. Many small modules experienced no , , „ , , ~ . , -i ij u tively.T1 This could explain many of the empirical results that defects at all and the fitted polynomial curve would be use- , J , , , , , i . , .i. T . J c . . ... „. , •: , , , . • ! • • larger modules have lower defect densities. Therefore, cfrom c less for prediction. This data clearly refutes the simplistic , i r ^ u-iu i J i_ i • -r- _• •_ i T~ i jiu J i iu what we know of testability, we could conclude that large ,, *• j j i j r ^ u u assumptions typified by class Fig. la and lb models (these ji ii. i • i- T J J \ . modules contained many residual defects, rather than conmodels couldn t explain the Tandem data) nor accurately ^ rf m o d u l e s w e f e m o r e reljable (and im predict the defect density values of these Tandem modules. A «n ^ s o « w a r e d e c o m i t i o n is w r o n } F S similar analysis and result is presented in [47]. ^ c l e a r aH of ^ o b l e m s d e s c r i b e d i n Sec tion 6 a r e n o t VVe conclude that the relationship between defects and ^ tQ so]ved easil H o w e v e r w e believe that model. module size is too complex in general, to admit to straightfhe c o m l e x i t i e s o f s o f t w a r e development using new forward curve fitting models. These results, therefore, conb a b i l i s t i c t e c h n i q u e s presents a positive way forward, tradict the idea that there is a general law linking defect T h e s e m e t h o d s c a l l e d gayesian Belief Networks (BBNs), density and software component size as suggested by the a l l o w u s t 0 e x p r e s s c o m p l e x interrelations within the model "Goldilock's Conjecture."
Fig. 2. Tandem data: defect counts vs. size "principal component."
These methods, called Bayesian Belief Networks (BBNs), allow us to express complex interrelations within the model at a level of uncertainty commensurate with the problem. In this section, we first provide an overview of BBNs (Section 7.1) and describe the motivation for the particular BBN example used in defects prediction (Section 7.2). In Section 7.3, we describe the actual BBN.

7.1 An Overview of BBNs

Bayesian Belief Networks (also known as Belief Networks, Causal Probabilistic Networks, Causal Nets, Graphical Probability Networks, Probabilistic Cause-Effect Models, and Probabilistic Influence Diagrams) have attracted much recent attention as a possible solution for the problems of decision support under uncertainty. Although the underlying theory (Bayesian probability) has been around for a long time, the possibility of building and executing realistic models has only been made possible because of recent algorithms and software tools that implement them [57]. To date, BBNs have proven useful in practical applications such as medical diagnosis and diagnosis of mechanical failures. Their most celebrated recent use has been by Microsoft, where BBNs underlie the help wizards in Microsoft Office; also the "intelligent" printer fault diagnostic system which you can run when you log onto Microsoft's web site is in fact a BBN which, as a result of the problem symptoms you enter, identifies the most likely fault.

A BBN is a graphical network that represents probabilistic relationships among variables. BBNs enable reasoning under uncertainty and combine the advantages of an intuitive visual representation with a sound mathematical basis in Bayesian probability. With BBNs, it is possible to articulate expert beliefs about the dependencies between different variables and to propagate consistently the impact of evidence on the probabilities of uncertain outcomes, such as "future system reliability." BBNs allow an injection of scientific rigor when the probability distributions associated with individual nodes are simply "expert opinions."

A BBN is a special type of diagram (called a graph) together with an associated set of probability tables. The graph is made up of nodes and arcs, where the nodes represent uncertain variables and the arcs the causal/relevance relationships between the variables. Fig. 3 shows a BBN for an example reliability prediction problem. The nodes represent discrete or continuous variables; for example, the node "use of IEC 1508" (the standard) is discrete, having two values "yes" and "no," whereas the node "reliability" might be continuous (such as the probability of failure). The arcs represent causal/influential relationships between variables. For example, software reliability is defined by the number of (latent) faults and the operational usage (frequency with which faults may be triggered). Hence, we model this relationship by drawing arcs from the nodes "number of latent faults" and "operational usage" to "reliability."

For the node "reliability" the node probability table (NPT) might, therefore, look like that shown in Table 5 (for ultra-simplicity we have made all nodes discrete, so that here reliability takes on just three discrete values: low, medium, and high). The NPTs capture the conditional probabilities of a node given the state of its parent nodes. For nodes without parents (such as "use of IEC 1508" in Fig. 3) the NPTs are simply the marginal probabilities.

There may be several ways of determining the probabilities for the NPTs. One of the benefits of BBNs stems from the fact that we are able to accommodate both subjective probabilities (elicited from domain experts) and probabilities based on objective data. Recent tool developments, notably on the SERENE project [58], mean that it is now possible to build very large BBNs with very large probability tables (including continuous node variables). In three separate industrial applications we have built BBNs with several hundred nodes and several million probability values [59].

There are many advantages of using BBNs, the most important being the ability to represent and manipulate complex models that might never be implemented using conventional methods. Another advantage is that the model can predict events based on partial or uncertain data. Because BBNs have a rigorous, mathematical meaning, there are software tools that can interpret them and perform the complex calculations needed in their use [58].

The benefits of using BBNs include:

• specification of complex relationships using conditional probability statements;
• use of "what-if?" analysis and forecasting of effects of process changes;
• easier understanding of chains of complex and seemingly contradictory reasoning via the graphical format;
• explicit modeling of "ignorance" and uncertainty in estimates;
• use of subjectively or objectively derived probability distributions;
• forecasting with missing data.

7.2 Motivation for BBN Approach

Clearly defects are not directly caused by program complexity alone. In reality the propensity to introduce defects will be influenced by many factors unrelated to code or design complexity. There are a number of causal factors at play when we want to explain the presence of defects in a program:

• difficulty of the problem;
• complexity of the designed solution;
• programmer/analyst skill;
• design methods and procedures used.

Eliciting requirements is a notoriously difficult process and is widely recognized as being error prone. Defects introduced at the requirements stage are claimed to be the most expensive to remedy if they are not discovered early enough. Difficulty depends on the individual trying to understand and describe the nature of the problem as well as the problem itself. A "sorting" problem may appear difficult to a novice programmer but not to an expert. It also seems that the difficulty of the problem is partly influenced by the number of failed attempts at solutions there have been, and whether a "ready made" solution can be reused. Thus, novel problems have the highest potential to be difficult, and "known" problems tend to be simple because known solutions can be identified and reused. Any software development project will have a mix of "simple" and "difficult" problems depending on what intellectual resources are available to tackle them. Good managers know this and attempt to prevent defects by pairing up people and problems: easier problems to novices and difficult problems to experts.
TABLE 5
NODE PROBABILITY TABLE (NPT) FOR THE NODE "RELIABILITY"

  operational usage |       low        |       med        |       high
  faults            | low   med  high  | low   med  high  | low   med  high
  reliability low   | 0.10  0.20 0.33  | 0.20  0.33 0.50  | 0.20  0.33 0.70
  reliability med   | 0.20  0.30 0.33  | 0.30  0.33 0.30  | 0.30  0.33 0.20
  reliability high  | 0.70  0.50 0.33  | 0.50  0.33 0.20  | 0.50  0.33 0.10

When assessing a defect it is useful to determine when it was introduced. Broadly speaking there are two types of defect: those that are introduced in the requirements and those introduced during design (including coding/implementation, which can be treated as design). Useful defect models need to explain why a module has a high or low defect count if we are to learn from their use; otherwise we could never intervene and improve matters. Models using size and complexity metrics are structurally limited to assuming that defects are solely caused by the internal organization of the software design. They cannot explain defects introduced because:

• the "problem" is "hard";
• problem descriptions are inconsistent;
• the wrong "solution" is chosen and does not fulfill the requirements.

We have long recognized in software engineering that program quality can be potentially improved through the use of proper project procedures and good design methods. Basic project procedures like configuration management, incident logging, documentation, and standards should help reduce the likelihood of defects. Such practices may not help the unique genius you need to work on the really difficult problems, but they should raise the standards of the mediocre.

Central to software design methods is the notion that problems and designs can be decomposed into meaningful chunks, where each can be readily understood alone and finally recomposed to form the final system. Loose coupling between design components is supposed to help ensure that defects are localized and that consistency is maintained. What we have lacked as a community is a theory of program composition and decomposition; instead we have fairly ill-defined ideas on coupling, modularity, and cohesiveness. However, despite not having such a theory, everyday experience tells us that these ideas help reduce defects and improve comprehension. It is indeed hard to think of any other scientific or engineering discipline that has not benefited from this approach.

Surprisingly, much of the defect prediction work has been pursued without reference to testing or testability. According to [37], [38], the testability of a program will dictate its propensity to reveal failures under test conditions and use. Also, at a superficial level, the amount of testing performed will determine how many defects will be discovered, assuming there are defects there to discover. Clearly, if no testing is done, then no defects will be found. By extension, we might argue that difficult problems, with complex solutions, might be difficult to test and so might demand more test effort.
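The propagation that the Hugin tool performs in Section 7.3 can be seen in miniature with the NPT of Table 5. The sketch below is a minimal illustration, not a BBN library: the priors for "operational usage" and "faults" are invented for the example (the paper does not give them), and the inference is done by brute-force enumeration.

```python
# A minimal sketch of BBN evidence propagation using the NPT of Table 5.
# Priors for the two parent nodes are assumed values, for illustration.
states = ["low", "med", "high"]

p_usage  = {"low": 0.3, "med": 0.5, "high": 0.2}   # assumed prior
# npt[(usage, faults)][reliability] = Pr{reliability | usage, faults}
npt = {
    ("low", "low"):   {"low": 0.10, "med": 0.20, "high": 0.70},
    ("low", "med"):   {"low": 0.20, "med": 0.30, "high": 0.50},
    ("low", "high"):  {"low": 0.33, "med": 0.33, "high": 0.33},
    ("med", "low"):   {"low": 0.20, "med": 0.30, "high": 0.50},
    ("med", "med"):   {"low": 0.33, "med": 0.33, "high": 0.33},
    ("med", "high"):  {"low": 0.50, "med": 0.30, "high": 0.20},
    ("high", "low"):  {"low": 0.20, "med": 0.30, "high": 0.50},
    ("high", "med"):  {"low": 0.33, "med": 0.33, "high": 0.33},
    ("high", "high"): {"low": 0.70, "med": 0.20, "high": 0.10},
}

# Posterior Pr{reliability | faults = "high"}: since faults is a root
# node, usage keeps its prior and is simply marginalized out.
posterior = {r: 0.0 for r in states}
for u in states:
    for r in states:
        posterior[r] += p_usage[u] * npt[(u, "high")][r]

print({r: round(p, 3) for r, p in posterior.items()})
# {'low': 0.489, 'med': 0.289, 'high': 0.219}; sums to ~1 up to the
# 0.33 rounding in Table 5. Observing many faults shifts belief in
# "reliability" sharply toward "low".
```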
If such testing effort is not forthcoming (as is typical in many commercial projects when deadlines loom), then fewer defects will be discovered, thus giving an overestimate of the quality achieved and a false sense of security. Thus, any model to predict defects must include testing and testability as crucial factors.

7.3 A Prototype BBN

While there is insufficient space here to fully describe the development and execution of a BBN model, we have developed a prototype BBN to show the potential of BBNs and illustrate their useful properties. This prototype does not exhaustively model all of the issues described in Section 7.2, nor does it solve all of the problems described in Section 6. Rather, it shows the possibility of combining the different software engineering schools of thought on defect prediction into a single model. With this model we should be able to show how predictions might be made and explain historical results more clearly.

The majority of the nodes have the following states: "very high," "high," "medium," "low," "very low," except for the design size node and defect count nodes, which have integer values or ranges, and the defect density nodes, which have real values. The probabilities attached to each of these states are fictitious, but are determined from an analysis of the literature or common-sense assumptions about the direction and strength of relations between variables.

The defect prediction BBN can be explained in two stages. The first stage covers the life-cycle processes of specification, design, or coding, and the second stage covers testing. In Fig. 4, problem complexity represents the degree of complexity inherent in the set of problems to be solved by development. We can think of these problems as being discrete functional requirements in the specification. Solving these problems accrues benefits to the user. Any mismatch between the problem complexity and design effort is likely to cause the introduction of defects, defects introduced, and a greater design size. Hence the arrows between design effort, problem complexity, introduced defects, and design size. The testing stage follows the design stage, and in practice the testing effort actually allocated may be much less than that required. The mismatch between testing effort and design size will influence the number of defects detected, which is bounded by the number of defects introduced. The difference between the defects detected and defects introduced is the residual defects count. The defect density at testing is a function of the design size and defects detected (defects/size). Similarly, the residual defect density is residual defects divided by design size.

Fig. 5 shows the execution of the defect density BBN model under the "Goldilock's Conjecture" using the Hugin Explorer tool [58]. Each of the nodes is shown as a window with a histogram of the predictions made based on the facts entered (facts are represented by histogram bars with 100 percent probability). The scenario runs as follows. A very complex problem is represented as a fact set at "very high," and a "high" amount of design effort is allocated, rather than "very high" commensurate with the problem complexity. The design size is between 1.0-2.0 KLOC. The model then propagates these "facts" and predicts the introduced defects, detected defects, and the defect density statistics. The distribution for defects introduced peaks at two with 33 percent probability but, because less testing effort was allocated than required, the distribution of defects detected peaks around zero with probability 62 percent. The distribution for defect density at testing contrasts sharply with the residual defect density distribution, in that the defect density at testing appears very favourable. This is of course misleading, because the residual defect density distribution shows a much higher probability of higher defect density levels.

From the model we can see a credible explanation for observing large "modules" with lower defect densities. Underallocation of design effort for complex problems results in more introduced defects and higher design size. Higher design size requires more testing effort which, if unavailable, leads to fewer defects being discovered than are actually there. Dividing the small detected defect counts by large design size values will result in small defect densities at the testing stage. The model explains the "Goldilock's Conjecture" without ad hoc explanation.

Clearly the ability to use BBNs to predict defects will depend largely on the stability and maturity of the development processes. Organizations that do not collect metrics data, do not follow defined life-cycles, or do not perform any forms of systematic testing will find it hard to build or apply such models. This does not mean to say that less mature organizations cannot build reliable software; rather, it implies that they cannot do so predictably and controllably. Achieving predictability of output, for any process, demands a degree of stability rare in software development organizations. Similarly, replication of experimental results can only be predicated on software processes that are defined and repeatable. This clearly implies some notion of Statistical Process Control (SPC) for software development.

8 CONCLUSIONS

Much of the published empirical work in the defect prediction area is well in advance of the unfounded rhetoric sadly typical of much of what passes for software engineering research. However, every discipline must learn as much, if not more, from its failures as its successes. In this spirit we have reviewed the literature critically with a view to better understand past failures and outline possible avenues for future success.

Our critical review of the state-of-the-art of models for predicting software defects has shown that many methodological and theoretical mistakes have been made. Many past studies have suffered from a variety of flaws, ranging from model misspecification to use of inappropriate data. The issues and problems surrounding the "Goldilock's Conjecture" illustrate how difficult defect prediction is and how easy it is to commit serious modeling mistakes. Specifically, we conclude that the existing models are incapable of predicting defects accurately using size and complexity metrics alone. Furthermore, these models offer no coherent explanation of how defect introduction and detection variables affect defect counts. Likewise, any conclusions that large modules are more reliable and that software decomposition doesn't work are premature.
Fig. 5. A demonstration of the "Goldilock's Conjecture."
Each of the different "schools of thought" has its own view of the prediction problem, despite the interactions and subtle overlaps between process and product identified here. Furthermore, each of these views models a part of the problem rather than the whole. Perhaps the most critical issue in any scientific endeavor is agreement on the constituent elements or variables of the problem under study. Models are developed to represent the salient features of the problem in a systemic fashion. This is as much the case in the physical sciences as the social sciences. Economists could not predict the behavior of an economy without an integrated, complex, macroeconomic model of all of the known, pertinent variables. Excluding key variables such as savings rate or productivity would make the whole exercise invalid.

By taking the wider view we can construct a more accurate picture and explain supposedly puzzling and contradictory results. Our analysis of the studies surrounding the "Goldilock's Conjecture" shows how empirical results about defect density can make sense if we look for alternative explanations. Collecting data from case studies and subjecting it to isolated analysis is not enough, because statistics on its own does not provide scientific explanations. We need compelling and sophisticated theories that have the power to explain the empirical observations. The isolated pursuit of these single-issue perspectives on the quality prediction problem is, in the longer term, fruitless.

Part of the solution to many of the difficulties presented above is to develop prediction models that unify the key elements from the diverse software quality prediction models. We need models that predict software quality by taking into account information from the development process, problem complexity, defect detection processes, and design complexity. We must understand the cause and effect relations between important variables in order to explain why certain design processes are more successful than others in terms of the products they produce.

It seems that successful engineers already operate in a way that tacitly acknowledges these cause-effect relations. After all, if they didn't, how else could they control and deliver quality products? Project managers make decisions about software quality using best guesses; it seems to us that this will always be the case, and the best that researchers can do is 1) recognize this fact and 2) improve the guessing process. We, therefore, need to model the subjectivity and uncertainty that is pervasive in software development. Likewise, the challenge for researchers is in transforming this uncertain knowledge, which is already evident in elements of the
various quality models already discussed, into a prediction model that other engineers can learn from and apply. We are already working on a number of projects using Bayesian Belief Networks as a method for creating more sophisticated models for prediction [59], [60], [61], and have described one of the prototype BBNs to outline the approach. Ultimately, this research is aiming to produce a method for the statistical process control (SPC) of software production implied by the SEI's Capability Maturity Model.

All of the defect prediction models reviewed in this paper operate without the use of any formal theory of program/problem decomposition. The literature is, however, replete with acknowledgments of cognitive explanations of shortcomings in human information processing. While providing useful explanations of why designers employ decomposition as a design tactic, they do not, and perhaps cannot, allow us to determine objectively the optimum level of decomposition within a system (be it a requirements specification or a program). The literature recognizes the two structural3 aspects of software, "within" component structural complexity and "between" component structural complexity, but we lack a way to integrate these two views so as to say whether one design was more or less structurally complex than another. Such a theory might also allow us to compare different decompositions of the same solution to the same problem requirement, thus explaining why different approaches to problem or design decomposition might have caused a designer to commit more or fewer defects. As things currently stand, without such a theory we cannot compare different decompositions and, therefore, cannot carry out experiments comparing different decomposition tactics. This leaves a gap in any evolving science of software engineering that cannot be bridged using current case study based approaches, despite their empirical flavor.

3. We are careful here to use the term structural complexity when discussing attributes of design artifacts, and cognitive complexity when referring to an individual's understanding of such an artifact. Suffice it to say that structural complexity would influence cognitive complexity.

ACKNOWLEDGMENTS

The work carried out here was partially funded by the ESPRIT projects SERENE and DeVa, the EPSRC project IMPRESS, and the DISPO project funded by Scottish Nuclear. The authors are indebted to Niclas Ohlsson and Peter Popov for comments that influenced this work, and also to the anonymous reviewers for their helpful and incisive contributions.

REFERENCES

[1] N.F. Schneidewind and H.-M. Hoffmann, "An Experiment in Software Error Data Collection and Analysis," IEEE Trans. Software Eng., vol. 5, no. 3, May 1979.
[2] D. Potier, J.L. Albin, R. Ferreol, and A. Bilodeau, "Experiments with Computer Software Complexity and Reliability," Proc. Sixth Int'l Conf. Software Eng., pp. 94-103, 1982.
[3] T. Nakajo and H. Kume, "A Case History Analysis of Software Error Cause-Effect Relationships," IEEE Trans. Software Eng., vol. 17, no. 8, Aug. 1991.
[4] S. Brocklehurst and B. Littlewood, "New Ways to Get Accurate Reliability Measures," IEEE Software, pp. 34-42, July 1992.
[5] F. Akiyama, "An Example of Software System Debugging," Information Processing, vol. 71, pp. 353-379, 1971.
[6] A.E. Ferdinand, "A Theory of System Complexity," Int'l J. General Systems, vol. 1, 1974.
[7] M.H. Halstead, Elements of Software Science. Elsevier North-Holland, 1977.
[9] L.M. Ottenstein, "Quantitative Estimates of Debugging Requirements," IEEE Trans. Software Eng., vol. 5, no. 5, pp. 504-514, 1979.
[10] M. Lipow, "Number of Faults per Line of Code," IEEE Trans. Software Eng., vol. 8, no. 4, pp. 437-439, 1982.
[11] J.R. Gaffney, "Estimating the Number of Faults in Code," IEEE Trans. Software Eng., vol. 10, no. 4, 1984.
[12] T. Compton and C. Withrow, "Prediction and Control of Ada Software Defects," J. Systems and Software, vol. 12, pp. 199-207, 1990.
[13] V.R. Basili and B.T. Perricone, "Software Errors and Complexity: An Empirical Investigation," Comm. ACM, vol. 27, no. 1, pp. 42-52, 1984.
[14] V.Y. Shen, T. Yu, S.M. Thebaut, and L.R. Paulsen, "Identifying Error-Prone Software—An Empirical Study," IEEE Trans. Software Eng., vol. 11, no. 4, pp. 317-323, 1985.
[15] K.H. Moeller and D. Paulish, "An Empirical Investigation of Software Fault Distribution," Proc. First Int'l Software Metrics Symp., pp. 82-90, IEEE CS Press, 1993.
[16] L. Hatton, "The Automation of Software Process and Product Quality," Software Quality Management, M. Ross, C.A. Brebbia, G. Staples, and J. Stapleton, eds., pp. 727-744, Southampton: Computational Mechanics Publications, Elsevier, 1993.
[17] L. Hatton, C and Safety Related Software Development: Standards, Subsets, Testing, Metrics, Legal Issues. McGraw-Hill, 1994.
[18] T. Keller, "Measurements Role in Providing Error-Free Onboard Shuttle Software," Proc. Third Int'l Applications of Software Metrics Conf., La Jolla, Calif., pp. 2.154-2.166, 1992. Proc. available from Software Quality Engineering.
[19] L. Hatton, "Re-examining the Fault Density-Component Size Connection," IEEE Software, vol. 14, no. 2, pp. 89-98, Mar./Apr. 1997.
[20] T.J. McCabe, "A Complexity Measure," IEEE Trans. Software Eng., vol. 2, no. 4, pp. 308-320, 1976.
[21] B.A. Kitchenham, L.M. Pickard, and S.J. Linkman, "An Evaluation of Some Design Metrics," Software Eng. J., vol. 5, no. 1, pp. 50-58, 1990.
[22] S. Henry and D. Kafura, "The Evaluation of Software Systems' Structure Using Quantitative Software Metrics," Software—Practice and Experience, vol. 14, no. 6, pp. 561-573, June 1984.
[23] N. Ohlsson and H. Alberg, "Predicting Error-Prone Software Modules in Telephone Switches," IEEE Trans. Software Eng., vol. 22, no. 12, pp. 886-894, 1996.
[24] V. Basili, L. Briand, and W.L. Melo, "A Validation of Object-Oriented Design Metrics as Quality Indicators," IEEE Trans. Software Eng., 1996.
[25] S.R. Chidamber and C.F. Kemerer, "A Metrics Suite for Object Oriented Design," IEEE Trans. Software Eng., vol. 20, no. 6, pp. 476-498, 1994.
[26] C. Jones, Applied Software Measurement. McGraw-Hill, 1991.
[27] M.A. Cusumano, Japan's Software Factories. Oxford Univ. Press, 1991.
[28] K. Koga, "Software Reliability Design Method in Hitachi," Proc. Third European Conf. Software Quality, Madrid, 1992.
[29] K. Yasuda, "Software Quality Assurance Activities in Japan," Japanese Perspectives in Software Eng., pp. 187-205, Addison-Wesley, 1989.
[30] M. Dyer, The Cleanroom Approach to Quality Software Development. Wiley, 1992.
[31] W.S. Humphrey, Managing the Software Process. Reading, Mass.: Addison-Wesley, 1989.
[32] R.D. Buck and J.H. Robbins, "Application of Software Inspection Methodology in Design and Code," Software Validation, H.-L. Hausen, ed., pp. 41-56, Elsevier Science, 1984.
[33] N.E. Fenton, S. Lawrence Pfleeger, and R. Glass, "Science and Substance: A Challenge to Software Engineers," IEEE Software, pp. 86-95, July 1994.
[34] R.B. Grady, Practical Software Metrics for Project Management and Process Improvement. Prentice Hall, 1992.
[35] A. Veevers and A.C. Marshall, "A Relationship between Software Coverage Metrics and Reliability," J. Software Testing, Verification and Reliability, vol. 4, pp. 3-8, 1994.
[36] M.D. Neil, "Statistical Modelling of Software Metrics," PhD thesis, South Bank Univ. and Strathclyde Univ., 1992.
[37] J.M. Voas and K.W. Miller, "Software Testability: The New Verification," IEEE Software, pp. 17-28, May 1995.
[38] A. Bertolino and L. Strigini, "On the Use of Testability Measures for Dependability Assessment," IEEE Trans. Software Eng., vol. 22, no. 2, pp. 97-108, 1996.
[39] M. Diaz and J. Sligo, "How Software Process Improvement Helped Motorola," IEEE Software, vol. 14, no. 5, pp. 75-81, 1997.
[40] C. Jones, "The Pragmatics of Software Process Improvements," Software Engineering Technical Council Newsletter, Technical Council on Software Eng., IEEE Computer Society, vol. 14, no. 2, Winter 1996.
[41] J.C. Munson and T.M. Khoshgoftaar, "Regression Modelling of Software Quality: An Empirical Investigation," Information and Software Technology, vol. 32, no. 2, pp. 106-114, 1990.
[42] M.D. Neil, "Multivariate Assessment of Software Products," J. Software Testing, Verification and Reliability, vol. 1, no. 4, pp. 17-37, 1992.
[43] T.M. Khoshgoftaar and J.C. Munson, "Predicting Software Development Errors Using Complexity Metrics," IEEE J. Selected Areas in Comm., vol. 8, no. 2, pp. 253-261, 1990.
[44] J.C. Munson and T.M. Khoshgoftaar, "The Detection of Fault-Prone Programs," IEEE Trans. Software Eng., vol. 18, no. 5, pp. 423-433, 1992.
[45] N.E. Fenton and S. Lawrence Pfleeger, Software Metrics: A Rigorous and Practical Approach, second ed., Int'l Thomson Computer Press, 1996.
[46] E. Adams, "Optimizing Preventive Service of Software Products," IBM Research J., vol. 28, no. 1, pp. 2-14, 1984.
[47] N. Fenton and N. Ohlsson, "Quantitative Analysis of Faults and Failures in a Complex Software System," IEEE Trans. Software Eng., 1999, to appear.
[48] T. Stalhane, "Practical Experiences with Safety Assessment of a System for Automatic Train Control," Proc. SAFECOMP'92, Zurich, Switzerland, Oxford, U.K.: Pergamon Press, 1992.
[49] P. Hamer and G. Frewin, "Halstead's Software Science: A Critical Examination," Proc. Sixth Int'l Conf. Software Eng., pp. 197-206, 1982.
[50] V.Y. Shen, S.D. Conte, and H. Dunsmore, "Software Science Revisited: A Critical Analysis of the Theory and Its Empirical Support," IEEE Trans. Software Eng., vol. 9, no. 2, pp. 155-165, 1983.
[51] M.J. Shepperd, "A Critique of Cyclomatic Complexity as a Software Metric," Software Eng. J., vol. 3, no. 2, pp. 30-36, 1988.
[52] B.F. Manly, Multivariate Statistical Methods: A Primer. Chapman & Hall, 1986.
[53] F. Zhou, B. Lowther, P. Oman, and J. Hagemeister, "Constructing and Testing Software Maintainability Assessment Models," Proc. First Int'l Software Metrics Symp., Baltimore, Md., IEEE CS Press, 1993.
[54] J. Rosenberg, "Some Misconceptions About Lines of Code," Proc. Software Metrics Symp., pp. 137-142, IEEE Computer Society, 1997.
[55] B.A. Kitchenham, "An Evaluation of Software Structure Metrics," Proc. COMPSAC'88, Chicago, Ill., 1988.
[56] S. Cherf, "An Investigation of the Maintenance and Support Characteristics of Commercial Software," Proc. Second Oregon Workshop on Software Metrics (AOWSM), Portland, 1991.
[57] S.L. Lauritzen and D.J. Spiegelhalter, "Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems (with discussion)," J. Royal Statistical Soc. Series B, vol. 50, no. 2, pp. 157-224, 1988.
[58] HUGIN Expert Brochure. Hugin Expert A/S, Aalborg, Denmark, 1998.
[59] Agena Ltd., "Bayesian Belief Nets," http://www.agena.co.uk/bbnarticle/bbns.html
[60] M. Neil and N.E. Fenton, "Predicting Software Quality Using Bayesian Belief Networks," Proc. 21st Ann. Software Eng. Workshop, pp. 217-230, NASA Goddard Space Flight Centre, Dec. 1996.
[61] M. Neil, B. Littlewood, and N. Fenton, "Applying Bayesian Belief Networks to Systems Dependability Assessment," Proc. Safety Critical Systems Club Symp., Leeds, Springer-Verlag, Feb. 1996.
Norman E. Fenton is professor of computing science at the Centre for Software Reliability, City University, London, and is also a director at Agena Ltd. His research interests include software metrics, empirical software engineering, safety critical systems, and formal development methods. However, the focus of his current work is on applications of Bayesian nets; these applications include critical systems assessment, vehicle reliability prediction, and software quality assessment. He is a chartered engineer (member of the IEE), a fellow of the IMA, and a member of the IEEE Computer Society.
Martin Neil holds a first degree in mathematics for business analysis from Glasgow Caledonian University and a PhD in statistical analysis of software metrics jointly from South Bank University and Strathclyde University. Currently he is a lecturer in computing at the Centre for Software Reliability, City University, London. Before joining the CSR, he spent three years with Lloyd's Register as a consultant and researcher, and a year at South Bank University. He has also worked with J.P. Morgan as a software quality consultant. His research interests cover software metrics, Bayesian probability, and the software process. Dr. Neil is a director at Agena Ltd., a consulting company specializing in decision support and risk assessment of safety and business critical systems. He is a member of the CSR Council, the IEEE Computer Society, and the ACM.
IEEE TRANSACTIONS ON RELIABILITY, VOL. 51, NO. 4, DECEMBER 2002
Using Regression Trees to Classify Fault-Prone Software Modules Taghi M. Khoshgoftaar, Member, IEEE, Edward B. Allen, Member, IEEE, and Jianyu Deng
Abstract—Software faults are defects in software modules that might cause failures. Software developers tend to focus on faults, because they are closely related to the amount of rework necessary to prevent future operational software failures. The goal of this paper is to predict which modules are fault-prone and to do it early enough in the life cycle to be useful to developers. A regression tree is an algorithm represented by an abstract tree, where the response variable is a real quantity. Software modules are classified as fault-prone or not by comparing the predicted value to a threshold. A classification rule is proposed that allows one to choose a preferred balance between the two types of misclassification rates. A case study of a very large telecommunications system considered software modules to be fault-prone if any faults were discovered by customers. Our research shows that classifying fault-prone modules with regression trees, using the classification rule in this paper, resulted in predictions with satisfactory accuracy and robustness.
ACRONYMS

  EMERALD   Enhanced Measurement for Early Risk Assessment of Latent Defects
  cdf       cumulative distribution function
  fp        fault-prone
  nfp       not fault-prone
  pdf       probability density function

NOTATION

  j            identifier of a predictor
  m            number of predictor variables (Sections II and III show software metrics notation)
  l            node identifier
  n            number of objects (modules)
  x_ij         object #i's value of predictor X_j
  x_i          vector of predictor values for object #i
  faults_i     number of customer-discovered faults in object #i
  y_i          response for object #i
  ŷ_i          predicted y_i
  ȳ(l)         average response for training objects in node #l
  D(l)         s-deviance of node #l
  π_fp, π_nfp  prior probabilities of class membership
  mindev       s-deviance threshold
  minsize      minimum number of objects in a decision node
  L(x_i)       the leaf that object #i falls into
  n_l          number of training objects that fall into leaf #l
  Class_i      actual class of object #i
  Class(x_i)   predicted class of object #i, based on its x_i
  q_l          Pr{an object in leaf #l is fault-prone}
  q̂(L(x_i))    estimated q_l
  c            classification-rule parameter
  Pr{fp|nfp}   Type I misclassification rate, Pr{Class(x_i) = fp | Class_i = nfp}
  Pr{nfp|fp}   Type II misclassification rate, Pr{Class(x_i) = nfp | Class_i = fp}
  φ(·)         pdf of the Gaussian distribution
  Φ(·)         cdf of the Gaussian distribution

I. INTRODUCTION

High software reliability is important for many software systems, especially those that support society's infrastructures, such as telecommunication systems. Reliability is usually measured from the user's viewpoint in terms of time between failures, according to an operational profile [29]. A software fault is defined as a defect in an executable software product.
fault-prone. The exact nature of the software improvement processes that developers could apply to fault-prone modules is not addressed here. In a well-built system, fault-prone modules typically are only a small fraction of the total.

A variety of classification techniques have been used to model software quality, including:
• logistic regression [2], [14];
• discriminant analysis [21], [28];
• discriminant power [34], [35];
• discriminant coordinates [30];
• optimal set reduction [4];
• neural networks [24];
• fuzzy classification [7];
• classification trees [37].

A classification tree is an algorithm represented by an abstract tree of decision rules. The s-dependent variable is the response variable, which is categorical (e.g., fault-prone or not). The s-independent variables are predictors. Each internal node represents a decision that is based on a predictor. Each edge leads to a potential next decision. Each leaf is labeled with a class. An object (e.g., a software module) is classified by traversing a path from the root of the tree to a leaf, according to the values of the object's predictors. Finally, the object's response variable is assigned the leaf's class. A classification tree accommodates nonmonotonic and nonlinear relationships among combinations of variables in a model that is easy to understand and use.

References [31], [37] model software quality using the ID3 algorithm [32] to build trees using an entropy-based criterion. Reference [38] extended the ID3 algorithm by applying Akaike Information Criterion procedures [1] to prune the tree. The authors' research group has classified fault-prone modules with the CART algorithm [3], [17], [22] and the TREEDISC algorithm [23], [33], which is a refinement of the CHAID algorithm [12]. S-Plus also has an algorithm for constructing classification trees [5]. However, this algorithm does not incorporate prior probabilities of membership nor costs of misclassifications [13]. In one case study, this algorithm did not build a tree, because our data had a very small proportion of fault-prone modules. This led the authors to explore the use of regression trees for the purpose of classifying fault-prone modules.

A regression tree is also an algorithm represented by an abstract tree. However, the response variable is a real quantity, instead of a class. Decision nodes are similar to a classification tree's, but each leaf is labeled with a quantity for the response variable. The processing of an object is similar to a classification tree's, but once the object reaches a leaf, the response variable is assigned the appropriate quantity. Reference [25] briefly reports using the Classification and Regression Trees (CART) regression tree algorithm [3] to model software project productivity. Case studies in [9] and [39] used the S-Plus regression tree algorithm to predict the number of faults. As future work, [9] suggests applying a threshold to the predicted quantity to classify modules.

Tree algorithms are often considered "machine learning" techniques, because the structure of a tree is derived from processing a training data set that represents objects of interest; the algorithm "learns" from the structure of the training data set. One should evaluate a tree's accuracy with an s-unbiased method, such as cross-validation, or an evaluation data set that is similar to, but s-independent of, the training data set. Both the training and evaluation data sets must represent historical software modules where actual faults are known. After a tree model has been built and evaluated with historical data, it is ready to make predictions for a similar current development project, where predictors are known, but faults have not yet been discovered.

The accuracy of a classification model is characterized by misclassification rates. When the response variable can be one of two classes, e.g., fault-prone or not, then a model can make two kinds of misclassifications. In the application in this paper, a Type I misclassification is when the model predicts that a module is fault-prone when it is not. Conversely, a Type II misclassification is when the model predicts that a module is not fault-prone when it is.

This paper presents a method for using regression trees to classify software modules as fault-prone or not, allowing one to choose a preferred balance between Type I and Type II misclassification rates. To our knowledge, this is the first time the S-Plus regression tree algorithm has been used for classification of software quality. A case study of a very large telecommunication system illustrates the approach [6]. Future work might include a comparative study of the various tree algorithms. The remainder of this paper explains how S-Plus builds a regression tree, defines the authors' classification rule for choosing a preferred balance between misclassification rates, and presents details of the authors' case study.

II. A CLASSIFICATION RULE FOR REGRESSION TREES

A tree algorithm builds a tree based on a training set of objects, where the response variable and predictor values are known for each object. In this paper, a software module is considered an object. The response variable y_i is encoded for each module as a real number:
• 0.0 for fault-prone;
• 1.0 for not fault-prone.
S-Plus constructs a regression tree that predicts the value of this real response variable. The Appendix gives details on the S-Plus algorithm for building a regression tree. After the regression tree is built, each leaf, l, must be labeled with a class so that the tree can be used for classification. This, in effect, determines a rule for classifying objects. A threshold is applied to the ŷ_i to determine the predicted class. Because of the way y is encoded, the leaf mean μ̂_l is the proportion of not fault-prone training modules that fall into leaf l. This yields an estimate of the probability that module i is fault-prone:

    q(l) = Pr{Class_i = fp | L(x_i) = l} = 1 − μ̂_l    (1)

    ŷ_i = μ̂_{L(x_i)} = 1 − q(L(x_i))    (2)

where L(x_i) denotes the leaf that module i falls into, given its predictor values x_i.
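The rule in (1) and (2) is mechanical enough to state in code. The sketch below is ours, not the paper's S-Plus code; the tree structure, metric name, cutpoints and leaf means are hypothetical, and it simply walks a module down a fitted regression tree and compares the leaf mean with a threshold θ.

```python
# Sketch (ours): classify a module with a fitted regression tree whose leaves
# hold the mean of the 0/1-encoded response (0.0 = fault-prone, 1.0 = not
# fault-prone). By (1)-(2), the leaf mean is 1 - q(l), so comparing it with a
# threshold theta labels the leaf.

class Node:
    def __init__(self, predictor=None, cutpoint=None, left=None, right=None, mu=None):
        self.predictor = predictor   # metric tested at an internal node
        self.cutpoint = cutpoint     # go left when x[predictor] < cutpoint
        self.left, self.right = left, right
        self.mu = mu                 # leaf mean y-hat; None for internal nodes

def classify(node, x, theta):
    """Return 'fp' or 'nfp' for the module with predictor values x."""
    while node.mu is None:           # descend from root to a leaf
        node = node.left if x[node.predictor] < node.cutpoint else node.right
    return 'nfp' if node.mu > theta else 'fp'

# Hypothetical two-leaf tree split on FILINCUQ (distinct include files):
tree = Node(predictor='FILINCUQ', cutpoint=50,
            left=Node(mu=0.97), right=Node(mu=0.55))
print(classify(tree, {'FILINCUQ': 12}, theta=0.95))   # -> nfp
print(classify(tree, {'FILINCUQ': 80}, theta=0.95))   # -> fp
```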
The goal of this paper is to allow appropriate emphasis on each type of misclassification according to the preference of the project. The authors proposed such a classification rule for use with software quality models based on discriminant analysis [15], and this paper adapts it to regression trees. Let a software quality modeling technique produce a likelihood function for each class, f_nfp and f_fp. Equation (3) enables a project to select its preferred balance between the misclassification rates by choosing a parameter C:

    Class(x_i) = fp if f_nfp(x_i)/f_fp(x_i) ≤ C; nfp otherwise    (3)

When applied to a classification tree, the f_nfp and f_fp are probabilities of class membership at the leaves. Thus, the general classification rule for a regression tree is: Class(x_i) = fp if [1 − q(L(x_i))]/q(L(x_i)) ≤ C, and nfp otherwise. An alternative formulation applies a threshold directly:

    Class(x_i) = fp if 1 − q(L(x_i)) ≤ θ; nfp otherwise    (4)

where

    θ = C/(1 + C)    (5)

Software engineering is a complex human activity; consequently, it is impossible for any model to account for all the things that influence human mistakes. When a model predicts a module's classification, it might turn out to be wrong. The goal in this paper is to have correct predictions most of the time. For two classes, there are two kinds of misclassification rates:
Type I: Pr{fp|nfp}, the proportion of not fault-prone modules that are incorrectly classified as fault-prone;
Type II: Pr{nfp|fp}, the proportion of fault-prone modules that are incorrectly classified as not fault-prone.

With various classification techniques, a tradeoff is observed between Type I and Type II misclassification rates as functions of C [15], [19], [22], [41]. As Pr{nfp|fp} goes down, Pr{fp|nfp} goes up, and conversely. This paper chooses a preferred value of C empirically. Given a candidate value of C, estimate the misclassification rates Pr{fp|nfp} and Pr{nfp|fp} by resubstitution of the training data set into the model. If the balance is not satisfactory, select another candidate value of C and estimate again, until the best C is found for the project. This procedure is straightforward in practice, because the misclassification rates are monotonic functions of C. For example, if one chooses C such that Pr{fp|nfp} = Pr{nfp|fp}, then the larger misclassification rate is minimized [36]. In practice, one can achieve only approximate equality due to finite discrete data sets.

Let some software improvement process be applied to each module predicted to be fault-prone, and let C_I and C_II be the costs of Type I and Type II misclassifications, based on improvement costs, effectiveness in finding faults prior to release, and the consequences of uncorrected faults during operations. Equation (3) is a minimum-cost rule [13], [36] when

    C = (C_II/C_I)·(π_fp/π_nfp)    (6)

However, the costs of misclassifications are often difficult to estimate. If π_fp and π_nfp are estimated from the training data set, then a preferred value of C implies a subjective assessment of C_II/C_I under a cost-minimization rule.

III. EMPIRICAL CASE STUDY

The case study in this paper illustrates how a general classification rule can yield useful classification accuracy when applied to regression trees. Case studies have inherent limits on the generalizability of results. Software applications have various product characteristics, and software is developed under a variety of conditions in various organizations. Section III-A describes the subject development organization and its product, so that others can assess its similarity to their own. Sections III-B and III-C present the methodology of the case study and its empirical results.

A. System Description

For an empirical study to be credible, the software engineering community demands that the subject be a system with the following characteristics [40]:
1) developed by a group, rather than an individual;
2) developed by professionals, rather than students;
3) developed in an industrial environment, rather than an artificial setting;
4) large enough to be comparable to real industry projects.
The case study in this paper fulfills all these criteria.

The case study was of a very large legacy telecommunication system with these characteristics:
1) developed by teams in a large organization;
2) developed by professional programmers using the procedural development paradigm and a standardized development process;
3) part of a commercial product, which was an embedded, real-time system with many finite-state machines;
4) consisted of appreciably more than 10^7 lines of code in a high-level language (Protel) similar to Pascal.

Four consecutive releases (labeled 1-4 in this paper) were studied. Release 1 was used as a training data set, and the remaining three releases were used as evaluation data sets. Even though the software was appreciably enhanced from release to release, the project staff considered the software development process to be stable.

A module consisted of one or more functionally related source-code files. A problem-reporting system recorded data at the module level on customer-discovered problems. A fault was attributed to a module when source code was changed due to a customer-discovered problem. Repair of faults in deployed telecommunication systems can be extremely expensive, because a visit to a customer site is often necessary to install a patch. This study considered a module fault-prone if any faults were discovered by customers, and not fault-prone otherwise:

    Class_i = nfp if faults_i = 0; fp if faults_i > 0    (7)

A configuration-management system recorded data on changes to source-code files. Modules were identified that were unchanged from the prior release. More than 99% of the unchanged modules had no faults, i.e., almost all unchanged modules were not fault-prone. Consequently, the scope of the case study was limited to modules that had at least one change to source code since the prior release, including new modules. The set of updated modules had several million lines of code in a few thousand modules in each release.

Fault data were collected from the problem reporting system. Problem reports were tabulated and anomalies were resolved.
TABLE I
DISTRIBUTIONS OF FAULTS (FRACTION OF UPDATED MODULES)

Faults   Release 1   Release 2   Release 3   Release 4
0        93.7%       95.3%       98.7%       97.7%
1        5.1%        3.9%        1.0%        2.1%
2        0.7%        0.7%        0.2%        0.2%
3        0.3%        0.1%        0.1%        0.1%
4        0.1%        *           *           *
≥5       *           *           *           *

Table I summarizes the distribution of faults discovered by customers. The proportion of modules with no faults among the updated modules of the fit (training) data set (Release 1) was π_nfp = 0.937, and the proportion with at least one fault was π_fp = 0.063. Such a small set of modules is often difficult to identify early in development. In this study, due to a lack of detailed data, the conservative assumption was made that customers used the releases a similar amount of time. Comparison of fault-discovery rates across releases is a topic for future research.

This paper advocates a pragmatic approach to collecting software metrics, and does not recommend one set of metrics to the exclusion of others recommended in the literature. A data-mining approach is preferred [8], [18], exploiting available metric data and analyzing a broad set of candidate metrics.

The subject system was supported by the EMERALD (Enhanced Measurement for Early Risk Assessment of Latent Defects) system [10]. EMERALD was developed by Nortel Networks in partnership with Bell Canada [27]. EMERALD provides software designers and managers access to software measurements and software quality models based on those metrics. EMERALD's software metrics analysis tool measured over 50 metrics from source code. Preliminary data analysis selected metrics that were appropriate for modeling purposes. Table II lists the 24 software product metrics used in this study [20]; CAL and VARUSD were not used as predictors because they are redundant with others. The metrics measure attributes of call graphs, control flow graphs, and statements. For example, the span of variables is the number of lines of code between the first and last use of a variable in a procedure; VARSPNSM and VARSPNMX are totals and maximums, respectively.

TABLE II
SOFTWARE PRODUCT METRICS

Symbol      Description

Call-Graph Metrics
CAL         Total number of calls.
CALUNQ      Number of distinct procedure calls to others.
CAL2        Number of second and following calls to others: CAL2 = CAL − CALUNQ.

Control-Flow-Graph Metrics
CNDNOT      Number of arcs that are not conditional arcs.
CNDSPNSM    Total span of branches of conditional arcs. The unit of measure is arcs.
CNDSPNMX    Maximum span of branches of conditional arcs.
CTRNSTMX    Maximum control-structure nesting.
FILINCUQ    Number of distinct include files.
IFTH        Number of non-loop conditional arcs: if-then constructs.
KNT         Number of knots. A "knot" in a control flow graph is where arcs cross due to a violation of structured-programming principles.
LGPATH      log2(number of s-independent paths).
LOP         Number of loop constructs.
NDSENT      Number of entry nodes: the number of procedures.
NDSEXT      Number of exit nodes.
NDSINT      Number of internal nodes: not an entry, exit, or pending node.
NDSPND      Number of pending nodes: dead-code segments.

Statement Metrics
STMDEC      Number of declarative statements.
STMEXE      Number of executable statements.
VARSPNMX    Maximum span of variables.
VARSPNSM    Total span of variables.
VARUSD      Total number of variable uses.
VARUSDUQ    Number of distinct variables used.
VARUSD2     Number of second and following uses of variables: VARUSD2 = VARUSD − VARUSDUQ.

Table III lists the four execution metrics used in this study. The proportion of installations that had each module, USAGE, was approximated by data from a prior release [11]. The project considered usage across releases to be similar, because the customer base was stable. Execution times were measured in a laboratory under three workloads. For example, RESCPU is the amount of execution time of a module under the workload of a system serving consumers. Refinement of execution metrics is a topic for future research.

TABLE III
SOFTWARE EXECUTION METRICS

Symbol   Description
BUSCPU   Execution time (microseconds) of an average transaction on a system serving businesses.
RESCPU   Execution time (microseconds) of an average transaction on a system serving consumers.
TANCPU   Execution time (microseconds) of an average transaction on a tandem system.
USAGE    Deployment fraction of the module.

B. Methodology

The case study in this paper consists of these steps:
1) Collect data on historical releases, and perform preliminary data analysis.
2) Select a response variable and a broad set of candidate predictors.
3) Prepare training and evaluation data sets.
4) Build a regression tree based on the training data set using S-Plus.
5) Choose the preferred value of the classification rule's parameter, θ, based on training data-set results and project-specific criteria.
6) Classify each module in the evaluation data sets and calculate misclassification rates.
7) Evaluate model accuracy and interpret the structure of the tree.
TABLE IV
MISCLASSIFICATION RATES

 Training              Evaluation
 Release 1             Release 2             Release 3             Release 4
 Type I   Type II      Type I   Type II      Type I   Type II      Type I   Type II
  1.5%    77.7%         1.9%    83.6%         2.4%    87.2%         6.1%    70.7%
  1.5%    77.7%         1.9%    83.6%         2.4%    87.2%         6.1%    70.7%
  7.7%    45.0%         9.0%    60.9%         9.9%    59.6%        12.9%    51.0%
  9.6%    39.7%        10.7%    57.7%        11.9%    53.2%        15.0%    45.7%
 15.0%    31.0%        15.8%    42.9%        17.9%    38.3%        21.4%    39.1%
 20.4%    23.1%        20.5%    38.6%        23.7%    31.9%        30.2%    27.2%
 20.4%    23.1%        20.5%    38.6%        23.7%    31.9%        30.2%    27.2%
 20.4%    23.1%        20.5%    38.6%        23.7%    31.9%        30.2%    27.2%
 25.6%    18.8%        26.1%    26.5%        26.7%    21.3%        32.2%    21.7%
 25.6%    18.8%        25.1%    26.5%        26.7%    21.3%        32.2%    21.7%
 37.2%    12.2%        36.8%    16.4%        35.4%    17.0%        38.5%    20.7%
 54.2%     5.2%        52.7%     7.9%        54.0%    10.6%        63.0%     9.8%
 72.4%     1.8%        74.3%     3.7%        79.9%     4.3%        92.9%     3.3%

Each row corresponds to one candidate value of the threshold θ, increasing from top to bottom; the preferred θ = 0.95 yields the training rates of 25.6% (Type I) and 18.8% (Type II).
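Rates of the kind tabulated above follow from a short resubstitution loop. The sketch below is ours, not the authors' code; the inputs (a list of leaf means `y_hat` and the true classes `actual`) and the threshold grid are hypothetical.

```python
# Sketch (ours): Type I / Type II rates by resubstitution for candidate
# thresholds. Assumes both classes occur in `actual`.

def rates(y_hat, actual, theta):
    pred = ['nfp' if y > theta else 'fp' for y in y_hat]
    nfp = [p for p, a in zip(pred, actual) if a == 'nfp']
    fp = [p for p, a in zip(pred, actual) if a == 'fp']
    type1 = nfp.count('fp') / len(nfp)    # Pr{fp | nfp}
    type2 = fp.count('nfp') / len(fp)     # Pr{nfp | fp}
    return type1, type2

# Inspect the tradeoff over a grid and pick a preferred balance, e.g. Type II
# somewhat below Type I, as the case study did in choosing theta = 0.95:
def sweep(y_hat, actual, grid=(0.5, 0.7, 0.9, 0.95, 0.99)):
    return {t: rates(y_hat, actual, t) for t in grid}
```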
The model is then ready for application to a current release or a similar project.

C. Empirical Results

The case study used data on four consecutive releases of a large legacy telecommunication system. The response variable was whether a module was fault-prone or not. The candidate predictors were the 24 product metrics in Table II and the 4 execution metrics in Table III. The training data set consisted of data on updated modules from Release 1, and the evaluation data sets consisted of data on updated modules from Releases 2-4.

The S-Plus regression-tree algorithm built a tree based on the training data set. The minimum s-deviance parameter was mindev = 0.10, and the minimum node size was minsize = 40; these parameters were chosen empirically [6]. The resulting tree had 41 nodes, 21 leaves, and 11 important predictors. The important predictors are: FILINCUQ, LGPATH, NDSENT, CNDNOT, LOP, STMDEC, VARSPNMX, VARSPNSM, NDSPND, USAGE, RESCPU.

The number of distinct files included, FILINCUQ, is the predictor at Nodes 1 and 2. Because programmers typically put externally defined function prototypes in included files ("header" files), this variable indicates the variety of interfaces among files. Interfaces can easily be misunderstood by developers. The logarithm of the number of paths in the control flow graph, LGPATH, which indicates the size and complexity of the logic, was somewhat s-correlated with FILINCUQ. NDSENT is equivalent to the number of procedures in the module, because every procedure had only one entry point; this was strongly s-correlated with the size of a module. CNDNOT, LOP, and STMDEC were also s-correlated with the overall size of the module. VARSPNMX and VARSPNSM were s-correlated with each other, and indicate the locality of references to variables. Small locality of reference can improve awareness of all uses of each variable. NDSPND ("dead code": pending nodes in the control flow graph) can indicate incomplete maintenance.
USAGE is a surrogate measure for the extent that customers used a module, and thus roughly gauges opportunities to discover faults. The execution time of a module, e.g., RESCPU, also indicates opportunities for faults to manifest as failures.

Table IV shows how misclassification rates vary as a function of θ. The Type II misclassification rate was preferred to be less than the Type I rate for the training data set, and the misclassification rates were preferred to be approximately balanced. Thus θ = 0.95 was chosen (its row is marked in Table IV). Another project might prefer a different criterion for choosing θ. For example, due to resource constraints, one might prefer to limit the total fraction predicted to be fault-prone.

Fig. 1 depicts the tree. The cutpoint for each decision node is marked on its left edge. For example, at the root node (Node 1), if, as marked on the left edge, FILINCUQ < 50, then the algorithm represented by the tree proceeds to the left (Node 2), and otherwise to the right (Node 25). Each leaf shows its mean response, μ̂_l, and the preferred class for θ = 0.95. For example, upon arriving at Leaf 5 for module i, the algorithm assigns ŷ_i = μ̂_5 = 0.995. By (2) and (5), 1 − q(L(x_i)) = ŷ_i > θ = 0.95, and thus the predicted class of module i is not fault-prone: Class(x_i) = nfp.

If all the leaves descending from a decision node have the same classification, then one can draw an equivalent simplified tree. For example, both of the leaves descending from Node 39 are labeled fault-prone, because μ̂_l ≤ θ for both. Consequently, the decision at Node 39 does not affect the classification. One could redraw the tree, replacing Node 39 and its child nodes with a leaf labeled fault-prone.

Even though the S-Plus regression tree algorithm allows only binary splits, some combinations of nodes are equivalent to multiway splits. For example:
• if FILINCUQ < 35, then a module probably has low risk, as determined by the subtree at Node 3;
• if 35 ≤ FILINCUQ < 50, then a module's class is predicted by the subtree at Node 16;
• otherwise (50 ≤ FILINCUQ), a module probably has high risk, as determined by the subtree at Node 25.
In other words, Nodes 1 and 2 together form a 3-way
split. Similarly, consecutive nodes elsewhere that represent a single concept can be viewed as a multiway split. For example:
• Nodes 26 and 27 indicate a 3-way split on USAGE;
• Nodes 18, 19, and 20 indicate a 4-way split on the concept of span of variables.

Fig. 1. Regression tree with classifications.

Consider the benefits of using the preferred model to target software improvement efforts [16]. For example, let a current release be similar to the last release in the study, Release 4, having π_fp = 2.3% fault-prone modules and π_nfp = 97.7% not fault-prone (see Table I). Recall that the preferred θ = 0.95, i.e., C = θ/(1 − θ) = 19. For Release 4, the model correctly predicted the class of 78.3% = 1 − 0.217 of the fault-prone modules, and incorrectly predicted that 32.2% of the not fault-prone modules were fault-prone. For the hypothetical current release, the model
would predict that 33.3% = 0.783·π_fp + 0.322·π_nfp of the modules are fault-prone, and thus are candidates for improvement. Of these candidates, one would anticipate 5.4% = 0.783·(π_fp/0.333) to be fault-prone. For comparison, if one randomly chose a set of modules for improvement, only 2.3% = π_fp would actually be fault-prone.

Let the:
• cost of improving any module be C_I = 1 unit;
• value of improving a not fault-prone module be negligible;
• cost-avoidance (benefit) of improving a fault-prone module be C_II = 807.1 = C·(π_nfp/π_fp), by (6) under a minimum-cost classification rule.

In light of the high cost of fixing faults in telecommunication software after release, and the very small proportion of fault-
prone modules, this subjective estimate of the cost ratio appears to be plausible. The cost of improving n modules predicted to be fault-prone by the model would be n units. The value of improving those candidates that were fault-prone would be 43.66·n = 0.054·C_II·n. Thus, the profit for using the model's predictions would be

    Profit = 42.66·n = 43.66·n − n

and the return on investment would be

    ROI = 4266% = 42.66·n / n

For comparison, improving n randomly selected modules would result in a profit of

    Profit = 17.56·n = π_fp·C_II·n − n

and a return on investment of ROI = 1756%.

That is, using predictions from this model would more than double the profit of improving a random selection of modules. Thus, under a plausible assessment of C_II/C_I, the level of accuracy of this preferred classification model in Table IV could be very useful to a software development project.
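As a check on the arithmetic above, the quantities can be recomputed directly. The script below is ours; it re-derives the Release 4 figures quoted in the text to within rounding.

```python
# Our re-derivation of the Release-4 cost-benefit arithmetic.
pi_fp, pi_nfp = 0.023, 0.977             # Table I, Release 4
theta = 0.95
C = theta / (1 - theta)                  # = 19, from (5)
C_II = C * (pi_nfp / pi_fp)              # = 807.1, from (6) with C_I = 1
hit_fp, false_alarm = 0.783, 0.322       # 1 - Type II, and Type I, at theta

flagged = hit_fp * pi_fp + false_alarm * pi_nfp   # 0.333 flagged as fault-prone
precision = hit_fp * pi_fp / flagged              # 0.054 of those truly fault-prone
profit_per_flagged = precision * C_II - 1         # roughly 42.7 units per module
print(C, round(C_II, 1), round(flagged, 3),
      round(precision, 3), round(profit_per_flagged, 2))
```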
APPENDIX
BUILDING A REGRESSION TREE WITH S-PLUS

S-Plus requires that each predictor be an ordinal measure; only ranks of quantitative predictors are considered. In this application, all the predictors are software metrics, modeled as ordinal measures, where x_ij is the value of predictor j for module i.

In the course of the tree-building algorithm, modules in the training data set are assigned to nodes, and thus are "modules in a node." The algorithm initially assigns all the modules in the training data set to the root node. The algorithm then recursively partitions each node's modules into two subsets that are assigned to child nodes, until a stopping criterion halts further partitioning.

For the purpose of regression, this algorithm assumes that the response variable is s-normally distributed [5]:

    y_i ∼ N(μ_i, σ²)    (8)

μ_i is estimated by the mean value of y over all training modules that fall in the same leaf as module i. The variance, σ², is assumed to be constant for all modules. For classification, violation of these assumptions by the response variable was not a practical problem.

The s-deviance of module i is minus twice the log-likelihood, scaled by σ², which reduces to [5]:

    D(μ_i; y_i) = (y_i − μ_i)²    (9)

The s-deviance of a node l is the sum of the s-deviances of all the training modules in the node [5]:

    D(μ_l; y) = Σ_{i∈l} (y_i − μ_i)²    (10)

If all modules in a node have the same value of y, then each is equal to the mean, and thus the s-deviance is zero.

Outline of the S-Plus regression-tree algorithm [5]:
1) Initialize the current node.
2) If the current node is not null, then:
   a) For each predictor, partition the current node's set of objects into two subsets, choosing a cutpoint for the current predictor that minimizes the sum of the s-deviances of the left and right prospective child nodes:

        D(μ_left, μ_right; y) = Σ_{i∈left} D(μ_left; y_i) + Σ_{i∈right} D(μ_right; y_i)    (11)

   b) Choose the predictor whose best split maximizes the change in s-deviance between the s-deviance of the current node and the sum of the s-deviances of the prospective child nodes:

        ΔD = D(μ_l; y) − D(μ_left, μ_right; y)    (12)

   c) If one of the following conditions, (13) or (14), is true for the current node, then do not split the current node:
      • The node s-deviance is less than a small fraction of the root-node s-deviance:

            D(μ_l; y) / D(μ_root; y) < mindev    (13)

      • The number of modules in the current node, n_l, is less than a threshold:

            n_l < minsize    (14)

      else:
      A) Recursively call the algorithm to process the left child node;
      B) Recursively call the algorithm to process the right child node.
3) Return the tree.

Once a tree is built, it can be used to predict each y_i of a set of current modules. The predicted value of the response variable for module i is the mean of the training modules in the leaf it falls into:

    ŷ_i = μ̂_{L(x_i)}    (15)

The parameters mindev and minsize are tools for controlling overfitting. Future work will evaluate these parameters and pruning a large overfitted tree to control overfitting.
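A compact rendering of this outline follows. It is our sketch, not S-Plus source; the actual implementation differs in data structures and in its handling of ties and ranks.

```python
# Sketch (ours) of the outline above: grow a binary regression tree, choosing
# at each node the predictor and cutpoint that minimise the summed s-deviance
# (11), and stopping via mindev (13) and minsize (14).

def deviance(ys):                                  # eq. (10)
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def grow(X, y, root_dev, mindev=0.10, minsize=40):
    node = {'mu': sum(y) / len(y)}                 # leaf mean, used by eq. (15)
    if len(y) < minsize or deviance(y) < mindev * root_dev:
        return node                                # stopping criteria (13)-(14)
    best = None
    for j in range(len(X[0])):                     # every predictor ...
        for c in sorted(set(row[j] for row in X))[1:]:   # ... every cutpoint
            L = [i for i, row in enumerate(X) if row[j] < c]
            R = [i for i, row in enumerate(X) if row[j] >= c]
            d = deviance([y[i] for i in L]) + deviance([y[i] for i in R])
            if best is None or d < best[0]:        # maximises (12)
                best = (d, j, c, L, R)
    if best is None:                               # no admissible split exists
        return node
    _, j, c, L, R = best
    node['var'], node['cut'] = j, c
    node['left'] = grow([X[i] for i in L], [y[i] for i in L], root_dev, mindev, minsize)
    node['right'] = grow([X[i] for i in R], [y[i] for i in R], root_dev, mindev, minsize)
    return node

# Usage: tree = grow(X, y, root_dev=deviance(y)) for a 0/1-encoded response y.
```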
ACKNOWLEDGMENT

The authors are pleased to thank W. D. Jones and the EMERALD team for collecting the case-study data, J. P. Hudepohl for his encouragement and support, and the anonymous reviewers for their thoughtful comments.

REFERENCES
[1] H. Akaike, "Factor analysis and AIC," Psychometrika, vol. 52, no. 3, pp. 317-332, 1987.
[2] V. R. Basili, L. C. Briand, and W. Melo, "A validation of object-oriented design metrics as quality indicators," IEEE Trans. Software Engineering, vol. 22, no. 10, pp. 751-761, Oct. 1996.
[3] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Chapman & Hall, 1984.
[4] L. C. Briand, V. R. Basili, and C. J. Hetmanski, "Developing interpretable models with optimized set reduction for identifying high-risk software components," IEEE Trans. Software Engineering, vol. 19, no. 11, pp. 1028-1044, Nov. 1993.
[5] L. A. Clark and D. Pregibon, "Tree-based models," in Statistical Models in S, J. M. Chambers and T. J. Hastie, Eds. Wadsworth, 1992, pp. 377-419.
[6] J. Deng, "Classification of software quality using tree modeling with the S-Plus algorithm," Master's thesis (advised by Taghi M. Khoshgoftaar), Florida Atlantic University, Dec. 1999.
[7] C. Ebert, "Classification techniques for metric-based software development," Software Quality J., vol. 5, no. 4, pp. 255-272, Dec. 1996.
[8] U. M. Fayyad, "Data mining and knowledge discovery: Making sense out of data," IEEE Expert, vol. 11, no. 4, pp. 20-25, Oct. 1996.
[9] S. S. Gokhale and M. R. Lyu, "Regression tree modeling for the prediction of software quality," in Proc. Third ISSAT Int. Conf. Reliability and Quality in Design, 1997, pp. 31-36.
[10] J. P. Hudepohl, S. J. Aud, and T. M. Khoshgoftaar et al., "EMERALD: Software metrics and models on the desktop," IEEE Software, vol. 13, no. 5, pp. 56-60, Sept. 1996.
[11] W. D. Jones, J. P. Hudepohl, T. M. Khoshgoftaar, and E. B. Allen, "Application of a usage profile in software quality models," in Proc. Third European Conf. Software Maintenance and Reengineering. IEEE Computer Soc., 1999, pp. 148-157.
[12] G. V. Kass, "An exploratory technique for investigating large quantities of categorical data," Appl. Statistics, vol. 29, pp. 119-127, 1980.
[13] T. M. Khoshgoftaar and E. B. Allen, "Classification of fault-prone software modules: Prior probabilities, costs, and model evaluation," Empirical Software Engineering: An International Journal, vol. 3, no. 3, pp. 275-298, Sept. 1998.
[14] T. M. Khoshgoftaar and E. B. Allen, "Logistic regression modeling of software quality," Int. J. Reliability, Quality, and Safety Engineering, vol. 6, no. 4, pp. 303-317, Dec. 1999.
[15] T. M. Khoshgoftaar and E. B. Allen, "A practical classification-rule for software quality models," IEEE Trans. Reliability, vol. 49, no. 2, pp. 209-216, June 2000.
[16] T. M. Khoshgoftaar, E. B. Allen, W. D. Jones, and J. P. Hudepohl, "Return on investment of software quality models," in Proc. 1998 IEEE Workshop on Application-Specific Software Engineering and Technology, pp. 145-150.
[17] T. M. Khoshgoftaar, E. B. Allen, W. D. Jones, and J. P. Hudepohl, "Classification tree models of software quality over multiple releases," in Proc. Tenth Int. Symp. Software Reliability Engineering, 1999, pp. 116-125.
[18] T. M. Khoshgoftaar, E. B. Allen, W. D. Jones, and J. P. Hudepohl, "Data mining for predictors of software quality," Int. J. Software Engineering and Knowledge Engineering, vol. 9, no. 5, pp. 547-563, 1999.
[19] T. M. Khoshgoftaar, E. B. Allen, W. D. Jones, and J. P. Hudepohl, "Which software modules have faults that will be discovered by customers?," J. Software Maintenance: Research and Practice, vol. 11, no. 1, pp. 1-18, Jan. 1999.
[20] T. M. Khoshgoftaar, E. B. Allen, W. D. Jones, and J. P. Hudepohl, "Accuracy of software quality models over multiple releases," Annals of Software Engineering, vol. 6, 2000.
[21] T. M. Khoshgoftaar, E. B. Allen, K. S. Kalaichelvan, and N. Goel, "Early quality prediction: A case study in telecommunications," IEEE Software, vol. 13, no. 1, pp. 65-71, Jan. 1996.
[22] T. M. Khoshgoftaar, E. B. Allen, and A. Naik et al., "Using classification trees for software quality models: Lessons learned," Int. J. Software Engineering and Knowledge Engineering, vol. 9, no. 2, pp. 217-231, 1999.
[23] T. M. Khoshgoftaar, E. B. Allen, and X. Yuan et al., "Preparing measurements of legacy software for predicting operational faults," in Proc. Int. Conf. Software Maintenance, 1999, pp. 359-368.
[24] T. M. Khoshgoftaar and D. L. Lanning, "A neural network approach for early detection of program modules having high risk in the maintenance phase," J. Systems and Software, vol. 29, no. 1, pp. 85-91, Apr. 1995.
[25] B. A. Kitchenham, "A procedure for analyzing unbalanced datasets," IEEE Trans. Software Engineering, vol. 24, no. 4, pp. 278-301, Apr. 1998.
[26] M. R. Lyu, "Introduction," in Handbook of Software Reliability Engineering. McGraw-Hill, 1996, ch. 1, pp. 3-25.
[27] J. Mayrand and F. Coallier, "System acquisition based on software product assessment," in Proc. 18th Int. Conf. Software Engineering, 1996, pp. 210-219.
[28] J. C. Munson and T. M. Khoshgoftaar, "The detection of fault-prone programs," IEEE Trans. Software Engineering, vol. 18, no. 5, pp. 423-433, May 1992.
[29] J. D. Musa, "Operational profiles in software reliability engineering," IEEE Software, vol. 10, no. 2, pp. 14-32, Mar. 1993.
[30] N. Ohlsson, M. Zhao, and M. Helander, "Application of multivariate analysis for software fault prediction," Software Quality J., vol. 7, pp. 51-66, 1998.
[31] A. A. Porter and R. W. Selby, "Empirically guided software development using metric-based classification trees," IEEE Software, vol. 7, no. 2, pp. 46-54, Mar. 1990.
[32] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, pp. 81-106, 1986.
[33] SAS Institute staff, "TREEDISC macro (beta version)," Technical report, SAS Institute, 1995. Documentation with macros.
[34] N. F. Schneidewind, "Software metrics validation: Space Shuttle flight software example," Annals of Software Engineering, vol. 1, pp. 287-309, 1995.
[35] N. F. Schneidewind, "Software metrics model for integrating quality control and prediction," in Proc. 8th Int. Symp. Software Reliability Engineering, 1997, pp. 402-415.
[36] G. A. F. Seber, Multivariate Observations. John Wiley & Sons, 1984.
[37] R. W. Selby and A. A. Porter, "Learning from examples: Generation and evaluation of decision trees for software resource analysis," IEEE Trans. Software Engineering, vol. 14, no. 12, pp. 1743-1756, Dec. 1988.
[38] R. Takahashi, Y. Muraoka, and Y. Nakamura, "Building software quality classification trees: Approach, experimentation, evaluation," in Proc. 8th Int. Symp. Software Reliability Engineering, 1997, pp. 222-233.
[39] J. Troster and J. Tian, "Measurement and defect modeling for a legacy software system," Annals of Software Engineering, vol. 1, pp. 95-118, 1995.
[40] L. G. Votta and A. A. Porter, "Experimental software engineering: A report on the state of the art," in Proc. 17th Int. Conf. Software Engineering, 1995, pp. 277-279.
[41] X. Yuan, "Modeling software quality with TREEDISC," Master's thesis (advised by Taghi M. Khoshgoftaar), Florida Atlantic University, Dec. 1999.
Taghi M. Khoshgoftaar is a Professor in the Department of Computer Science and Engineering, Florida Atlantic University, and is also Director of the Empirical Software Engineering Laboratory. His research interests are in software engineering, software measurements, software reliability and quality engineering, computational intelligence, computer performance evaluation, and multimedia systems.
Edward B. Allen received the B.S. in 1971 in engineering from Brown University, Rhode Island; the M.S. in 1973 in systems engineering from the University of Pennsylvania, Philadelphia; and the Ph.D. in 1995 in computer science from Florida Atlantic University, Boca Raton; his work for this paper was performed while he was at this university. He is an assistant professor in the Department of Computer Science at Mississippi State University. He began his career as a programmer with the U.S. Army. From 1974 to 1983, he performed system engineering and software engineering on military systems, first for Planning Research Corp. and then for Sperry Corp. From 1983 to 1992, he developed corporate data processing systems for Glenbeigh, Inc., a specialty health care company. From 1995 to 2000, he performed research in software engineering at Florida Atlantic University. His research interests include software measurement, software process, software quality, and computer performance modeling. He has more than 60 refereed publications in these areas. He is a member of the IEEE Computer Society and the Association for Computing Machinery.
Jianyu Deng received the M.S. in 1999 in computer science from Florida Atlantic University, Boca Raton. She is a software engineer with Motorola. Her research interests include software engineering and software quality.
Information and Software Technology 43 (2001) 863-873
www.elsevier.com/locate/infsof
Can genetic programming improve software effort estimation? A comparative evaluation

Colin J. Burgess (a,*), Martin Lefley (b)
(a) Department of Computer Science, University of Bristol, Merchant Venturers Building, Woodland Road, Bristol BS8 1UB, UK
(b) School of Design Engineering and Computing, University of Bournemouth, Talbot Campus, Poole BH12 5BB, UK
1. Introduction

Most organisations must decide how to allocate valuable resources based on predictions of the unknown future. The example studied here is software effort estimation, where volume and costs are not directly proportionally related [12]. Any improvement in the accuracy of prediction of development effort can significantly reduce the costs from inaccurate estimation, misleading tendering bids and disabling the monitoring of progress. Accurate modelling can also assist in scheduling resources and evaluating risk factors.

In this paper, we evaluate the potential benefits from applying the tool of genetic programming (GP) to improve the accuracy of software effort estimation. In order to do this, we compare it with other methods of prediction, in terms of accuracy and other qualitative factors deemed important to potential users of such a tool. All the parameters used for the estimation process are restricted to those available at the specification stage. For those readers who are already familiar with the concepts of GP, Sections 6.1 and 6.2 can be skipped.

Learning allows humans to solve hugely complex problems at speeds which outperform even the fastest computers of the present day [36]. Machine learning (ML) techniques have been used successfully in solving many difficult problems, such as speech recognition from text [37], adaptive control [34,16], integrated circuit synthesis [31] and mark-up estimation in the construction industry [16]. Due to their minimal assumptions and broad and deep coverage of the solution space, ML approaches deserve exploration to evaluate the contribution they can make to the prediction of software effort.

One of the first methods used to estimate software effort automatically was COCOMO [5], where effort is expressed as a function of anticipated size. The general form of the model tends to be:

    E = a·S^b    (1)

where E is the effort required, S is the anticipated size, and a and b are domain-specific constants. Others have developed local models, using statistical techniques such as stepwise regression, e.g. Kok et al. [23].
Linear approaches to the problem cover only a small

Table 1
Summary of the data set features

Independent variables:
- Numeric identifier
- Team experience in years
- Project manager's experience in years
- Number of transactions processed
- Number of entities
- Unadjusted function points
- Adjusted function points
- Development environment
- Year of completion
- Complex measure derived from other factors, defining the environment

Dependent variable:
- Effort, measured in person-hours

approaches. Emphasis is placed on assisting the software engineering manager to realistically consider the use of GP as an estimator of effort. For example, parameters that cannot be measured at the outset of a project, such as lines of code, are not used. Evaluation does not merely focus on prediction accuracy but also on other important factors necessary for good design of an estimation methodology. We attempt to assess GP against alternative techniques with minimum bias, to give a fair comparison. We have tried to avoid putting a lot of tuning effort into one paradigm with the misguided aim of proving that it is better than other methods that have been comparatively sparsely explored. Some researchers obtain significant benefits by removing alleged outliers from the data sets, but we have avoided this, since we want to use as much of the data as possible. Also, the determination of outliers can incorporate bias, complicate comparison and could be based on any number of heuristics; thus, this is not a simple task for an estimator with only, in general, a limited data set available. Another danger is to generate many diverse solutions so that a best model, based on query performance, can be selected that will always fit very closely to the evaluation data. Ultimately, we test the hypothesis that a GP based solution can learn the nuances of a data set and make its own decisions for issues such as variable weighting and pre-processing, dealing with outliers and the selection of feature subsets.
domains such as software development environments. For example, Kemerer [21] and Conte et al. [6] frequently found errors considerably in excess of 100% even after model calibration.

A variety of ML methods have been used to predict software development effort. Artificial neural networks (ANNs) [20,42], case-based reasoning (CBR) [15] and rule induction (RI) [19,22,40] offer good examples. Hybrids are also possible [22]; e.g. Shukla [39] reports better results from using an evolving ANN compared to a standard back-propagation ANN. Dolado [9,10] and Fernandez and Dolado [13] analyse many aspects of the software effort estimation problem. Recent research by Dolado [11] shows promising results for a GP based estimation system on a single input variable. This research extends this idea into richer models requiring larger populations and much longer learning lifetimes. This paper investigates the potential for the use of GP methods to
3. Data set used for comparisons and Weibull distribution modelling

In order to explore and compare the potential of the three ML techniques for building effort prediction models, we selected an existing project effort data set. The data set used comprises 81 software projects derived from a Canadian software house in the late 1980s by Jean-Marc Desharnais [8]. Despite the fact that this data is now over 10 years old, it is one of the larger, publicly available, data sets and
Table 2
The project number and effort of the 18 projects selected for measuring prediction capability

Project number    Effort (person-hours)
2                 5635
8                 3913
12                9051
15                4977
19                4494
22                651
35                5922
38                2352
42                14987
50                8232
56                2926
61                2520
63                2275
72                1386
can be used to assess the comparative performance of new approaches. The data set comprised 10 features, one dependent and nine independent, as summarised by Table 1. A second dependent feature, namely length of code, is ignored, as it is far less important to predict than total effort. Four of the 81 projects contained missing values, so these were replaced by random samples from the other projects. In order to compare many different prediction paradigms, the data was divided into two sets, which are called a training set and a query set. The query set chosen was 18 projects, which were randomly selected from the 81 projects, i.e. approximately 22% of the total (Table 2). The remaining 63 projects were used as the learning set. In order to eliminate all possible distortion and to simulate more
realistically the practical situation, the query set was not used in
any of the further analysis or tuning, apart from measuring the accuracy of the final prediction of effort.

The training output values have been found by the authors to be fitted well by Weibull distributions, with a highly significant Chi-Square and smaller error than other candidate distributions. The Weibull parameters were found by piecewise approximation and are provided in Table 3. One useful measure of how much the input or independent variables influence the output or dependent variables is to calculate the correlation coefficients between input and output. Since variables may have functional relationships, a number of functions of input variables (such as logarithmic and exponential) were used to search for the highest correlations. None of the functions was found to have increased the raw correlations by enough to justify their continued use. An illustration of the output data's distribution is given in Fig. 1, along with the fitted distributions.

Table 3
Training data output statistics

Effort
Mean             4908.556
SD               4563.736
Minimum          546
Maximum          23940
Weibull Alpha    1.41
Weibull Beta     4830

Fig. 1. Distribution of effort along with fitted Weibull distribution.

4. How to evaluate the techniques?

Software effort prediction is typified by comparatively small data sets with prediction errors having a significant effect on costs. An important question that needs to be asked of any estimation method is 'How accurate are the predictions?'. A number of summary statistics were used, none of which can be seen as significantly better than any of the others. Ultimately the decision should be made by the user, based on estimates of the costs of under- or over-estimating project effort. All are based on the calculated error, i.e. the difference between the predicted and observed output values for the projects. The methods used in this paper were:

(a) Correlation coefficient.
(b) Adjusted mean square error (AMSE).
(c) Predictions within 25% (Pred(25)).
(d) Percentage of predictions within 25% (Pred(25)%).
(e) Mean magnitude of relative error (MMRE).
(f) Balanced mean magnitude of relative error (BMMRE).

Most commonly, accuracy is defined in terms of the mean magnitude of relative error (MMRE) [10], which is the mean of absolute percentage errors:

    MMRE = (100/n) Σ_{i=1..n} |E_i − Ê_i| / E_i    (2)

where E_i is the actual effort and Ê_i is the predicted effort of project i, and there are n projects. There has been some criticism of this measure, in particular the fact that it is unbalanced and penalises over-estimates more than under-estimates. For this reason, Miyazaki et al. [33] proposed a balanced MMRE measure as follows:

    BMMRE = (100/n) Σ_{i=1..n} |E_i − Ê_i| / min(E_i, Ê_i)    (3)

This approach has been criticised by Hughes [18], amongst others, as effectively being two distinct measures that should not be combined. Another approach is to use Pred(25)%, which is the percentage of predictions that fall within 25% of the actual value.

It is clear that the choice of accuracy measure depends to a large extent upon the objectives of those using the prediction system. For example, MMRE is fairly conservative with a bias against over-estimates, whilst Pred(25) supports those prediction systems that are generally accurate but occasionally wildly inaccurate. Other workers have used the adjusted R squared, or coefficient of determination, to indicate the percentage of variation in the dependent variable that can be 'explained' in terms of the independent variables. In this paper, we replace this with the simpler correlation coefficient. The simplest raw measure is the mean square error; however, this depends on the mean of the data sets and it is thus difficult to interpret or make comparisons. Instead, we use AMSE, the adjusted mean square error. This is the sum of the squared errors, divided by the product of the means of the predicted and observed outputs:

    AMSE = Σ_{i=1..n} (E_i − Ê_i)² / (Ē · Ê̄)    (4)
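These measures are straightforward to compute. The sketch below is ours and implements (2)-(4) and Pred(25)% for paired lists of actual and predicted efforts; it assumes strictly positive values throughout.

```python
# Sketch (ours) of the accuracy measures: MMRE (2), BMMRE (3), Pred(25)% and
# AMSE (4), for actual efforts E and predictions E_hat (both positive).

def mmre(E, E_hat):
    return 100 / len(E) * sum(abs(e - p) / e for e, p in zip(E, E_hat))

def bmmre(E, E_hat):
    return 100 / len(E) * sum(abs(e - p) / min(e, p) for e, p in zip(E, E_hat))

def pred25(E, E_hat):
    return 100 * sum(abs(e - p) / e <= 0.25 for e, p in zip(E, E_hat)) / len(E)

def amse(E, E_hat):
    mean_e = sum(E) / len(E)
    mean_p = sum(E_hat) / len(E_hat)
    return sum((e - p) ** 2 for e, p in zip(E, E_hat)) / (mean_e * mean_p)
```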
4.1. Qualitative measures

Whatever measures are being used, it is clear that although accuracy is an important consideration, it is not sufficient to consider the accuracy of prediction systems in isolation. Hence, in assessing the utility of these techniques, we have considered three factors, adapted from Mair et al. [30]: accuracy, explanatory value and ease of configuration.

Accuracy has been the primary concern of researchers, and clearly it is of considerable importance; a prediction system that fails to meet some minimum threshold of accuracy will not be acceptable. However, we believe that accuracy, by itself, is not a sufficient condition for acceptance. The quality of information provided for the estimator is of great importance. Empirical research has indicated that end-users coupled with prediction systems can outperform either prediction systems or end-users alone [41]. The more explanations given for how a prediction was obtained, the greater the power given to the estimator. If predictions can be explained, estimators may experiment with 'what if' scenarios, and meaningful explanations can increase confidence in the prediction.

Apart from accuracy, other evaluation factors for a prediction system, which could be important to non-expert practitioners, are:

(a) Resources required:
    (i) Time and memory needed to train.
    (ii) Time and memory needed to query.
(b) Ease of set up.
(c) Transparency of solution or decision.
(d) Generality.
(e) Robustness.
(f) Likelihood of convergence.
(g) Prediction beyond the learning data set space.

5. Previous related work using machine learning

Any method of transforming data may be used for software estimation, for example, least squares regression. This section considers ML techniques previously recommended for comparative effort estimation, viz. CBR and ANNs. These techniques have been selected on the grounds that there exists adequate previous research to promote their efficacy and that, because of their significantly differing approaches, they form a good basis for comparisons.

5.1. Artificial neural networks

ANNs are learning systems inspired by organic neural systems, which are known to be very successful at learning to solve problems. They comprise a network of simple interconnected units called neurons. The connections between the neurons are weighted, and the values of these weights determine the function of the network. Input values are multiplied by the weights, summed, passed through a step function and then on to other neurons and finally to the output neurons. The weights are selected to optimise the output vector produced from a given input vector. Typically, this selection is carried out by adjusting the weights systematically, based on a given data set, a process of training. The most common method of training is by back propagation. Here, the weights begin with small random values, and then a proportion of the difference between desired and current output, called the error, is used to adjust the weights back through the net, proportionally to their contribution to the error. Thus, the outputs are moved towards the desired values and the net learns the required behaviour.

Recent studies concerned with the use of ANNs to predict software development effort have focused on comparative accuracy with algorithmic models, rather than on the suitability of the approach for building software effort prediction systems. An example is the investigation by Wittig and Finnie [46]. They explore the use of a back propagation neural network on the Desharnais and ASMA (Australian Software Metrics Association) data sets. For the Desharnais data set, they randomly split the projects three times between 10 test and 71 training sets, which is very similar to the procedure we follow in our paper. The results from three validation sets are aggregated and yield a high level of accuracy. The values cited are Desharnais set MMRE = 27% and ASMA set MMRE = 17%, although some outlier values are excluded. Mair et al. [30] show neural networks offer accurate effort prediction, though not as good as Wittig and Finnie, but conclude that they are difficult to configure and interpret.
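A minimal sketch of the training loop just described follows. It is ours, not any cited study's code; the network size, learning rate and random data are illustrative only.

```python
# Minimal back-propagation sketch (ours): one hidden layer trained by
# gradient descent on illustrative random data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((63, 9))            # 63 projects, 9 features (illustrative)
y = rng.random((63, 1))            # normalised effort (illustrative)

W1 = rng.normal(0, 0.1, (9, 8))    # input -> hidden weights
W2 = rng.normal(0, 0.1, (8, 1))    # hidden -> output weights
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for epoch in range(2000):
    h = sigmoid(X @ W1)            # hidden activations
    out = h @ W2                   # linear output neuron
    err = out - y                  # error to propagate back
    W2 -= 0.01 * h.T @ err / len(X)
    W1 -= 0.01 * X.T @ ((err @ W2.T) * h * (1 - h)) / len(X)
```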
5.2. Case-based reasoning

CBR is based on the psychological concepts of analogical reasoning, dynamic memory and the role of previous situations in learning and problem solving [36]. Cases are abstractions of events, solved or unsolved problems. New problems are solved by a weighted concatenation of the solutions from the set of similar cases. Aarmodt and Plaza [1] describe CBR as being cyclic and composed of four steps:
1. Retrieval of similar cases.
2. Reuse of the retrieved cases to find a solution to the problem.
3. Revision of the proposed solution if necessary.
4. Retention of the solution to form a new case.

Consequently, issues concerning case characterisation [35], similarity [2,45] and solution revision [28] must be addressed prior to CBR system deployment.

Software project estimation has been tackled by a number of researchers using CBR. Three documented examples are Estor [43], finding analogies for cost estimation (FACE) [4] and ANGEL [38]. A summary of each of these works
follows.

Estor uses the estimator's protocols, and infers rules from these. An analogy searching approach is used to produce estimates which the developers claim were comparable, in terms of R-squared values, to the expert's, and superior to those obtained using function points, regression based techniques, or COCOMO.

FACE was developed by Bisio and Malabocchia, who assessed it using the COCOMO data set. It allocates a normalised similarity score θ, with values between 0 and 100, for each candidate analogy from the case repository. A user threshold, typically θ = 70, is used to decide which cases are used to form the estimate. If no cases score above the threshold, then reliable estimation is not deemed possible. The research shows that FACE performs very favourably against algorithmic techniques such as regression.

Shepperd and Schofield report on their tool ANGEL, an estimation tool based upon analogical reasoning. Here projects, or cases, are represented in a Euclidean hyperspace where a modified nearest neighbour algorithm identifies the best analogies. They report results, derived from a number of data sets, of superior performance to LSR models.

Other approaches include that of Debuse and Rayward-Smith [7], who apply simulated annealing algorithms to the problem of feature subset selection, which could then be used with other tools. There are also those who consider supplementary tools to enhance another method, for example an evolutionary tool. Though such approaches could be used in conjunction with the tools here, assessing the complete tools available is the goal of this work. Whatever overall methodology is used, the aim is to find the most accurate predictors, within the measurement constraints already outlined in Section 4.
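The core of such analogy-based estimation fits in a few lines. The sketch below is ours, not ANGEL's code; the function and variable names are hypothetical, and it predicts effort as the mean of the k nearest cases in a normalised feature space.

```python
# Sketch (ours) of nearest-neighbour analogy estimation: predict a new
# project's effort as the mean effort of its k nearest neighbours in
# Euclidean feature space. Features are assumed already scaled to [0, 1].

def analogy_estimate(cases, efforts, query, k=3):
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    ranked = sorted(range(len(cases)), key=lambda i: dist(cases[i], query))
    nearest = ranked[:k]
    return sum(efforts[i] for i in nearest) / k
```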
6. Background to genetic programming

6.1. Introduction to genetic algorithms

Genetic algorithms (GA) were developed as an alternative technique for tackling general optimisation problems with large search spaces. They have the advantage that they do not need any prior knowledge, expertise or logic related to the particular problem being solved. Occasionally, they can produce the optimum solution, but for most problems with a large search space, a good approximation to the optimum is a more likely outcome.

The basic ideas used are based on the Darwinian theory of evolution, which in essence says that genetic operations between chromosomes eventually lead to fitter individuals, which are more likely to survive. Thus, over a long period of time, the population of the species as a whole improves. However, not all of the operations used in the computer analogy of this process necessarily have a biological equivalent.

In the computer implementation of these ideas, a solution, but not necessarily a very good solution, to the problem being solved is typically represented by a fixed length binary string, which is termed a chromosome by analogy to the biological equivalent. The fitness measure of any individual is a measure of how near it is to the optimal solution. One example, which is relevant to the problem tackled here, is that a fitter solution minimises the error between the predicted values and the true values. The basic steps of the algorithm are as follows, although a number of variations are possible:

1. Generate at random a population of solutions, i.e. a family of chromosomes.
2. Create a new population from the previous one by applying genetic operators to the fittest chromosomes, or pairs of fittest chromosomes, of the previous population.
3. Repeat step (2) until either the fitness of the best solution has converged or a specified number of generations have been produced.
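A minimal sketch of this loop follows. It is ours; the bit-string representation matches the description above, but the operator choices and rates (elitism fraction, mutation probability, parent pool) are illustrative only.

```python
# Sketch (ours) of the basic GA loop: fixed-length bit strings, elitist
# reproduction, one-point crossover and rare bit-flip mutation.
import random

def ga(fitness, n_bits=32, pop_size=100, generations=200, elite=0.05, p_mut=0.01):
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)                 # fittest first
        new = [row[:] for row in pop[:max(1, int(elite * pop_size))]]  # elitism
        while len(new) < pop_size:
            a, b = random.sample(pop[:pop_size // 2], 2)    # two fit parents
            cut = random.randrange(1, n_bits)               # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < p_mut:                     # occasional mutation
                child[random.randrange(n_bits)] ^= 1
            new.append(child)
        pop = new
    return max(pop, key=fitness)

best = ga(fitness=sum)    # toy fitness: maximise the number of 1-bits
```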
The best solution in the final generation is taken as the approximation to the optimum for that problem that can be attained in that run. The whole process is normally run a number of times, using different seeds to the pseudo-random number generator. The key parameters that have to be determined for any given problem are:
(a) The best way of representing a solution as a fixed length binary string.
(b) The best combination of genetic operators. For GA,
reproduction, crossover and mutation are the most common.
(c) Choosing the best fitness function, to measure the fitness of a solution.
(d) Trying to keep enough diversity in the solutions in a population to allow the process to converge to the global optimum, but not converge prematurely to a local optimum.

Fig. 2. (a) Illustration of the crossover operator before operation. The double line illustrates where the trees are cut. (b) Illustration of the crossover operator after operation. The double line illustrates where the sub-trees have been swapped.
More information related to GA can be found in Goldberg [14] or Mitchell [32].

6.2. Introduction to genetic programming
GP is an extension of GA, which removes the restriction that the chromosome representing the individual has to be a fixed length binary string. In general, in GP, the chromosome is some type of program, which is then executed to obtain the required results. One of the simplest forms of program, which is sufficient for this application, is a binary tree containing operators and operands. This means that each solution is an algebraic expression, which can be evaluated. Koza "offers a large amount of empirical evidence to support the counter-intuitive and surprising conclusion that GP can be used to solve a large number of seemingly different problems from many different fields" [24]. He goes on to state that GP offers solutions in representations of computer programs. These offer the flexibility to:

(a) Perform operations in a hierarchical way.
(b) Perform alternative computations conditioned on the outcome of intermediate calculations.
(c) Perform iterations and recursions.
(d) Perform computations on variables of different types.
(e) Define intermediate values and sub-programs so that they can be subsequently reused.

The determination of a program to model a problem offers many advantages. Inspection of the genetic program solutions potentially offers understanding of the forces of behaviour behind the population. For example, Langley et al. [27] describe BACON, a system that successfully rediscovered scientific laws (Ohm's law, Coulomb's law, Boyle's law and Kepler's law) from given finite samples of data. As so many of the phenomena in our universe may be represented by programs, there is encouraging evidence for the need to explore a general program space for solutions to problems from that universe.

The initial preparation for a GP system has several steps.
Table 4
Main parameters used for the GP system

Parameter                            Value
Size of population                   1000
Number of generations                500
Number of runs                       10
Maximum initial full tree depth      5
Maximum number of nodes in a tree    64
Percentage of elitism                5

First, it is necessary to choose a suitable alphabet of operands and operators. The operands are normally the independent input variables to the system and normally include an ephemeral random constant (ERC), which is a random variable from within a suitable range, e.g. 0-1.0. The operators should be rich enough to cover the type of functionality expected in solutions, e.g. trigonometric functions if solutions are expected to be periodic, but having too many operators can hamper convergence. Secondly, it is necessary to construct an initial population. In this paper, we use a set of randomly constructed trees from the specified alphabet, although this is dependent on the type of problem being solved and the representation chosen for the programs. For an initial population of trees, a good and common approach is called Ramped Half and Half. This means that half the trees are constructed as full trees, i.e. operands only occur at the maximum depth of the tree, and half are trees with a random shape. Within each half, an equal number of trees are constructed for each depth, between some minimum and maximum depths. This is found to give a good selection of trees in the original population.

The main genetic operations used are reproduction and crossover. Mutation is rarely used, and other more specialised operators are sometimes used, but not for the problem tackled in this paper. Reproduction is the copying of one individual from the previous generation into the next generation unchanged. This often takes the form of elitism, where the top n% of the solutions, as measured by fitness, is copied straight into the next generation, where n can be any value but is typically 1-10. The crossover operator chooses a node at random in the first chromosome, called crossover point 1, and the branch to that node is cut. Then it chooses a node at random in the second chromosome, called crossover point 2, and the branch to that node is cut. The two sub-trees produced below the cuts are then swapped. The method of performing crossover can be illustrated using an example, see Fig. 2a and b. Although this example includes a variety of operations, for simplicity in this application only the set {+, -, *} was made available. More complex operators can of course be developed by combining these simple operations. Simple operators eliminate the bounding problems associated with more complex operations such as XOR, which is not defined for negative or non-integer values. The multiply is a protected multiply which prevents the return of any values greater than 10^20. This is to minimise the likelihood of real overflow occurring during the evaluation of the solutions. On average, 10% of the operands used were random constants in the range 0-1.0.

6.2.1. Illustration of the crossover operator

The parent expressions before the crossover, and the two new children produced by the crossover operation, are shown in Fig. 2a and b. Thus two new individuals are created, whose fitness can be evaluated.

Fig. 2. (a) Illustration of the crossover operator before operation. The double line illustrates where the trees are cut. (b) Illustration of the crossover operator after operation. The double line illustrates where the sub-trees have been swapped.
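A minimal sketch of the subtree crossover just illustrated, storing each expression tree as a nested tuple; the uniform random choice of crossover points and the 64-node limit follow the description above, while the representation itself is an illustrative assumption.

    import random

    # A tree is a variable name (str), a constant (float), or (op, left, right).

    def nodes(tree, path=()):
        """Yield the index path to every node in the tree."""
        yield path
        if isinstance(tree, tuple):
            for i, sub in enumerate(tree[1:], start=1):
                yield from nodes(sub, path + (i,))

    def get(tree, path):
        for i in path:
            tree = tree[i]
        return tree

    def put(tree, path, sub):
        if not path:
            return sub
        t = list(tree)
        t[path[0]] = put(tree[path[0]], path[1:], sub)
        return tuple(t)

    def crossover(mum, dad, max_nodes=64):
        """Swap a random subtree of each parent; children that grow too large are discarded."""
        p1 = random.choice(list(nodes(mum)))
        p2 = random.choice(list(nodes(dad)))
        children = [put(mum, p1, get(dad, p2)), put(dad, p2, get(mum, p1))]
        return [c for c in children if sum(1 for _ in nodes(c)) <= max_nodes]

For example, crossover(('+', 'X', 3.0), ('*', ('-', 'X', 'Y'), 'X')) returns two new expression trees whose fitness can then be evaluated.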
6.2.2. Controlling the GP algorithm

Since trees tend to grow as the algorithm progresses, a maximum depth (or maximum number of nodes) is normally imposed in order that the trees do not get too large. This is often a lot larger than the maximum size allowed in the initial population. Any trees produced by crossover that are too large are discarded. The reason for imposing the size limit is to save on both the storage space required and the execution time needed to evaluate the trees. There is also no evidence that allowing very large trees will necessarily lead to improved results. The basic algorithm is the same as for GA, as given in Section 6.1. Key parameters that have to be determined for any given problem are:

(a) The best way of choosing the alphabet, and representing the solution as a tree (or other structure).
(b) The best genetic operators.
(c) Choosing the best fitness function, to measure the fitness of a solution.
(d) Trying to keep enough diversity in the solutions in a population to allow the process to converge to the global optimum but not converge prematurely to a local optimum.
(e) Choosing sensible values for the population size, maximum tree size, number of generations etc. in order to get good solutions without using too much time or space.

More information related to GP can be found in Banzhaf et al. [3] and in Koza [25,26].
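Returning to the Ramped Half and Half initialisation described earlier in this section, a minimal sketch using the same nested-tuple trees as the crossover example; the terminal set shown (two input variables plus an ephemeral random constant) and the early-termination probability are illustrative assumptions.

    import random

    OPS = ['+', '-', '*']            # the operator set used in this application
    TERMS = ['X1', 'X2', 'ERC']      # illustrative operands; 'ERC' draws a constant in 0-1.0

    def random_tree(depth, full):
        """Grow one tree; `full` forces operands to occur only at the maximum depth."""
        if depth == 0 or (not full and random.random() < 0.3):
            t = random.choice(TERMS)
            return random.uniform(0.0, 1.0) if t == 'ERC' else t
        return (random.choice(OPS), random_tree(depth - 1, full), random_tree(depth - 1, full))

    def ramped_half_and_half(pop_size, min_depth=2, max_depth=5):
        """Half full trees, half randomly shaped, spread evenly over the depth range."""
        depths = list(range(min_depth, max_depth + 1))
        return [random_tree(depths[i % len(depths)], full=(i % 2 == 0))
                for i in range(pop_size)]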
7. Applying GP to software effort estimation

The software effort estimation problem is an example of a
Table 5
Comparing the best prediction systems within each paradigm. Best performing estimators are highlighted in bold type. The figures for ANNs and GP are averages over ten executions of the systems. For Pred(25), AMSE and MMRE, * denotes significance at the 5% level and ** at the 1% level.

                Random     Linear LSR  2 nearest    5 nearest    Artificial       Genetic
                                       neighbours   neighbours   neural network   programming
Correlation     -0.16589   0.557       0.550        0.586        0.635            0.752
AMSE            9.749167   *7.378      **6.432      **5.733      **5.477          11.13
Pred(25)        3          10          8            8            10               4.2
Pred(25)%       16.67      **55.56     **44.44      **44.44      **55.56          23.5
MMRE            181.72     **46.18     162.30       168.30       **60.63          44.55
BMMRE           191        59          66           70           69               74.57
symbolic regression problem, which means: given a number of sets of data values for a set of input parameters and one output parameter, construct an expression of the input parameters which best predicts the value of the output parameter for any set of values of the input parameters. In this paper, GP is applied to the Desharnais [8] data set already described in Section 3, with the same 63 projects in the Learning Set and the remaining 18 projects in the Test Set as used for the other methods. The parameters chosen for the GP system, after a certain amount of experimentation, are shown in Table 4. The results obtained depend on the fitness function used. In order to allow comparison with the other methods, the fitness function was designed to minimise the MMRE measure as applied to the Learning Set of data. The values of MMRE quoted in the results are the result of applying the solution obtained to the Test Set of data. The GP system is written in C and runs on a shared Sun 4 × 170 MHz Ultra Sparc, which means that timings are approximately equivalent to a 300 MHz Pentium. However, the run-time core size is approximately 4 Mbytes, to store the trees required in any one generation. One run of 500 generations takes 10 min of c.p.u. time.
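For reference, the two accuracy measures most relied upon here can be computed as follows; this is a sketch of the standard definitions of MMRE and Pred(25) (reported below as a percentage and a count, respectively), while the AMSE and balanced MMRE (BMMRE) variants used in the tables are not reproduced.

    def mmre(actual, predicted):
        """Mean magnitude of relative error, as a percentage."""
        return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

    def pred(actual, predicted, level=0.25):
        """Number of projects predicted within `level` of the actual effort, e.g. Pred(25)."""
        return sum(abs(a - p) / a <= level for a, p in zip(actual, predicted))

Used as a GP fitness function, one would minimise mmre on the learning set and then report the value obtained on the test set.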
7.1. Comparing the GP results with other methods

The hypothesis to be tested is that GP can produce a significantly better estimate of effort than the other techniques. In the comparison, those techniques that contain a stochastic element are tested over a population of solutions, independently generated. The ANN uses random starting values, and GP uses random selection of the initial population and of the crossover points. The range of solutions obtained has implications for both the accuracy of the solutions and the ease of configuration. The software effort estimator must consider all available tools for improving the accuracy of estimates. The aim of this paper is to provide the estimator with the information to make an informed decision on the use of GP for this task. Since the data has been shown to be modelled by a Weibull distribution (Section 3), it is possible to produce random sets of data by computer. This means that statistical tests can be performed to compare the predictors with random data, to isolate statistically significant accuracy. Table 5 tests the predictors at 1 and 5% levels of significance. If the results between estimators show universal increases in accuracy, then comparative statistical tests may also be applied.

Ease of configuration depends mainly on the number of parameters required to set up the learner. For example, a nearest neighbour system needs parameters to decide the weight of the variables, and the method for determining closeness and combining neighbours to form a solution. Many learners need an evaluation function to determine error, to decide on the future search direction. The choice of method to make this evaluation is crucial for accurate modelling. Koza [24] lists the number of control parameters for GP as being 20, compared to 10 for neural networks, but it is not always easy to count what constitutes a control parameter. However, it is not so much the number of the parameters as the sensitivity of the accuracy of the solutions to their variation. It is possible to search using different parameters, but this will depend on their range and the granularity of the search. In order to determine the ease of configuration for a genetic program, we test empirically whether the parameter values suggested by Koza offer sufficient guidance to locate suitably accurate estimators. Similarly, all of our solutions use either limited experimentation or commonly used values for control parameters.
8. Results of the comparison

This paper has evaluated ML techniques, including a GP tool, for making software project effort predictions. We believe that the best way to assess the practical utility of these techniques would be to consider them within the context of their interaction with an example of an intended user, viz. a software project manager. However, since the data set we have used is one used by many other workers, and based on the late 1980s, this is not possible. Thus we will perform the comparisons based on the accuracy of the results, the ease of configuration and the transparency of the solutions.

8.1. Accuracy of the predictions

Table 5 lists the main results used for the comparison.
Table 6
Population behaviour for the best and worst (chosen on the basis of MMRE) of the population of solutions from the ANN and GP systems

Estimated effort   Artificial neural network      Genetic programming
                   Worst    Average   Best        Worst    Average   Best
Correlation        0.588    0.635     0.650       0.612    0.752     0.824
AMSE               6.278    5.477     5.209       14.58    11.13     7.77
Pred(25)           10       10        10          2        4.2       5
Pred(25)%          56       56        56          11.2     23.5      28
MMRE               65.45    60.63     59.23       52.12    44.55     37.95
BMMRE              74       69        66          92.47    74.57     59.23
The various different types of regression equations all gave insignificantly differing results from each other, and so only the linear LSR results are quoted. Results for ANNs are less accurate than those reported by Wittig and Finnie [46], although this is for a different data set, and this may be, in part, due to the impact of their removing outlier projects in some of the validation sets. For the GP and ANN solutions, we have generated 10 solutions to assess the reliability of accuracy. The population behaviour is summarised in Table 6. The neural network seems to converge fairly consistently to a reasonable solution, with a difference of 6% in MMRE. In contrast, the GP system is consistently more accurate for MMRE but does not converge as consistently as the ANN to a good solution, with a 14% variation in MMRE. This suggests that more work needs to be done to try to prevent premature convergence in the GP system. For AMSE and Pred(25), the ANN is more accurate and consistent. Early results from using GP suggest that optimising one particular measure, in this case MMRE, has the effect of degrading a lot of the other measures, and that a fitness function that is not specifically tied to one particular measure may give more acceptable overall results. This is illustrated in Tables 5 and 6, which give superior results for MMRE and correlation, but rather poor results for Pred(25), AMSE and BMMRE. This suggests that more research is required not only on GP, but also on the problem itself, as to which of the many measures or combination of measures is the most appropriate in practice.

8.2. Transparency of the solution

One of the benefits of LSR is that it makes explicit the weight and contribution of each input used by the prediction system. This can lead to insights about the problem, for example to direct efficiencies. CBR, or estimation by analogy, also has potential explanatory value since projects are ordered by degree of similarity to the target project. Indeed, it is instructive that this technique demonstrates the effectiveness of user-involvement in performing better, when the user is able to manipulate the data and modify predicted outputs. However, although this suggests an understanding of the data by the user, it gives little indication of the contribution of specific variables. Changing parameters could allow some exploration of this space, but complex interactions between variables and small data sets make exploration limited.

The neural nets used within this study do not allow the user to see the rules generated by the prediction system. If a particular prediction is surprising, it is hard to establish any rationale for the value generated. It is difficult to understand an ANN merely by studying the net topology and individual node weights. To extract rules from the best ANN, a method of pruning was used as suggested by Lefley [29]. This reduced the net to the following nodes and links that were making a significant contribution to the error:

Node 10 = -0.35*X7 - 0.81*X8
Node 11 = 1.60 + 0.39*X1 - 0.69*X4 - 0.67*X6 - 1.075981*X7 - 1.40*X8
Node 12 = 0.42*X5 + 0.61*X1
Node 13 = -0.28*X6 - 0.74*X1 - 3.24*X8
Node 14 = 0.32*X6 + 0.33*X1 + 0.52*X8
Output node 15 = 1.59 - 2.27*N10 - 4.22*N11 + 1.12*N12 - 2.41*N13 + 1.09*N14
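Because the pruned network is so small, it can be evaluated as plain arithmetic; the sketch below transcribes the node expressions above, on the assumption that the pruned read-out is purely linear (any activation functions in the full network are omitted, as they are not shown here).

    def pruned_ann(x):
        """Evaluate the pruned net; x[1]..x[8] are the (scaled) inputs X1..X8.
        Assumes a linear read-out: activation functions are omitted."""
        n10 = -0.35 * x[7] - 0.81 * x[8]
        n11 = 1.60 + 0.39 * x[1] - 0.69 * x[4] - 0.67 * x[6] - 1.075981 * x[7] - 1.40 * x[8]
        n12 = 0.42 * x[5] + 0.61 * x[1]
        n13 = -0.28 * x[6] - 0.74 * x[1] - 3.24 * x[8]
        n14 = 0.32 * x[6] + 0.33 * x[1] + 0.52 * x[8]
        return 1.59 - 2.27 * n10 - 4.22 * n11 + 1.12 * n12 - 2.41 * n13 + 1.09 * n14

    # Purely illustrative call with all eight scaled inputs set to 0.5:
    # pruned_ann([None] + [0.5] * 8)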
Though this does not paint a very clear picture of how a decision is made, it indicates the importance of variables X6, X7 and X8 (see Table 1), and careful examination might yield a little more information. However, even with such an analysis, ANNs do remain hard to interpret, certainly harder than, say, LSR, for small gains in terms of accuracy.

Potentially, GP can produce very transparent solutions, in the sense that the solution is an algebraic expression. At present, the good solutions for this problem have the maximum number of nodes, i.e. 64 operators or operands, and the expressions do not simplify a great deal. Further work may find solutions with fewer nodes, which may make the results more understandable analytically. In contrast, increasing the number of nodes may provide better estimates, but the results are likely to be less transparent and the computation time will increase significantly.
8.3. Ease of configuration of the system
The third factor in comparing prediction systems is what we term ease of configuration; in other words, how much effort is required to build the prediction system in order to generate useful results. Regression analysis is a well-established technique with good tool support. Even allowing for analysis of residuals and so forth, little effort needs to be
expended in building a satisfactory regression model. CBR needs relatively little work, though more might be gained by relative weighting of the inputs. The number of neighbours did not seem to have a significant effect. By contrast, we found it took some effort to configure the neural net and it required a fair degree of expertise. Generally, different architectures, algorithms and parameters made little difference to results, but some expertise is required to choose reasonable values. Although heuristics have been published on this topic [17,44], we found the process largely to be one of trial and error guided by experience. ANN techniques are not easy to use for project estimation by end-users, as some expertise is required.

GP has many parameters too: the choice of functions, crossover and reproduction rates, percentage of elitism, dealing with out of bounds conditions, creation strategy and setting maximum tree sizes and depths, i.e. code lengths, to name some of the more significant. We found that by following the decisions suggested by Koza [24] we obtained good results, but again, at present, some expertise is still needed.

9. Conclusions and future work

In this paper, we have compared techniques for predicting software project effort. These techniques have been compared in terms of accuracy, transparency and ease of configuration. Despite finding that there are differences in prediction accuracy levels, we argue that it may be other characteristics of these techniques that will have an equal, if not greater, impact upon their adoption. We note that the explanatory value of both estimation by analogy (CBR) and rule induction gives them an advantage when considering their interaction with end-users. Problems of configuring neural nets tend to counteract their superior performance in terms of accuracy. We are encouraged that this early investigation of GP shows it can provide accurate estimates. The results highlight that measuring accuracy is not simple and researchers should consider a range of measures. Further, we highlight that where there is a stochastic element, learners must also show acceptable, consistent accuracy for a population of solutions, which complicates simple comparisons. We believe these results show that this approach warrants further investigation, particularly to explore the effects of various parameters on the models in terms of improving robustness and accuracy. It also offers the potential to provide more transparent solutions, but this aspect requires further research. Perhaps a useful conclusion is that the more elaborate estimation techniques such as ANNs and GPs can provide better accuracy but require more effort in setting up and training. A trade off between accuracy from complexity and ease of interpretation also seems inevitable.

Acknowledgements

The authors are grateful to Jean-Marc Desharnais for making his data set available. We also thank the journal reviewers for their helpful comments, which have significantly improved the quality of this report.

References

[1] A. Aamodt, E. Plaza, Case-based reasoning: foundational issues, methodical variations and system approaches, AI Commun. 7 (1994) 39-59.
[2] D.W. Aha, Case-Based Learning Algorithms, 1991 DARPA Case-Based Reasoning Workshop, Morgan Kaufmann, Los Altos, CA, 1991.
[3] W. Banzhaf, P. Nordin, R.E. Keller, F.D. Francone, Genetic Programming: An Introduction, Morgan Kaufmann, Los Altos, CA, 1998.
[4] R. Bisio, F. Malabocchia, Cost estimation of software projects through case based reasoning, International Conference on Case Based Reasoning, Sesimbra, Portugal, 1995.
[5] B.W. Boehm, Software Engineering Economics, Prentice Hall, New Jersey, 1981.
[6] S. Conte, H.E. Dunsmore, V.Y. Shen, Software Engineering Metrics and Models, Benjamin/Cummings, Menlo Park, CA, 1986.
[7] J.C.W. Debuse, V.J. Rayward-Smith, Feature subset selection within a simulated annealing data mining algorithm, J. Intell. Inf. Syst. 9 (1997) 57-81.
[8] J.M. Desharnais, Analyse statistique de la productivité des projets informatiques à partir de la technique des points de fonction, Unpublished Masters Thesis, University of Montreal, 1989.
[9] J.J. Dolado, A study of the relationships among Albrecht and Mark II Function Points, lines of code 4GL and effort, J. Syst. Soft. 37 (1997) 161-172.
[10] J.J. Dolado, A validation of the component-based method for software size estimation, IEEE Trans. Soft. Engng 26 (2000) 1006-1021.
[11] J.J. Dolado, On the problem of the software cost function, Inf. Soft. Tech. 43 (2001) 61-72.
[12] S. Drummond, Measuring applications development performance, Datamation 31 (1985) 102-108.
[13] L. Fernandez, J.J. Dolado, Measurement and prediction of the verification cost of the design in a formalized methodology, Inf. Soft. Tech. 41 (1999) 421-434.
[14] D.E. Goldberg, Genetic Algorithms in Search, Optimisation and Machine Learning, Addison Wesley, Reading, MA, 1989.
[15] A.R. Gray, S.G. MacDonell, A comparison of techniques for developing predictive models of software metrics, Inf. Soft. Tech. 39 (1997) 425-437.
[16] T. Hegazy, O. Moselhi, Analogy based solution to markup estimation problem, J. Comput. Civ. Engng 8 (1994) 72-87.
[17] S. Huang, Y. Huang, Bounds on the number of hidden neurons, IEEE Trans. Neural Networks 2 (1991) 47-55.
[18] R.T. Hughes, Expert judgement as an estimating method, Inf. Soft. Tech. 38 (1996) 67-75.
[19] M. Jorgensen, Experience with the accuracy of software maintenance task effort prediction models, IEEE Trans. Soft. Engng 21 (1995) 674-681.
[20] N. Karunanithi, D. Whitley, Y.K. Malaiya, Using neural networks in reliability prediction, IEEE Soft. 9 (1992) 53-59.
[21] C.F. Kemerer, An empirical validation of cost estimation models, CACM 30 (1987) 416-429.
[22] H.C. Kennedy, C. Chinniah, P. Bradbeer, L. Morss, Construction and evaluation of decision trees: a comparison of evolutionary and concept learning methods, in: D. Corne, J.L. Shapiro (Eds.), Evolutionary Computing, Springer, Berlin, 1997.
[23] P. Kok, B.A. Kitchenham, J. Kirakowski, The MERMAID approach to software cost estimation, Proceedings of Esprit Technical Week, 1990.
[24] J.R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, Cambridge, 1993.
[25] J.R. Koza, Genetic Programming II: Automatic Discovery of Reusable Programs, MIT Press, Cambridge, 1994.
[26] J.R. Koza, F.H. Bennett, D. Andre, M.A. Keane, Genetic Programming III: Darwinian Invention and Problem Solving, Morgan Kaufmann, Los Altos, CA, 1999.
[27] P. Langley, H.A. Simon, G.L. Bradshaw, J.M. Zytkow, Scientific Discovery: Computational Explorations of the Creative Process, MIT Press, Cambridge, 1987.
[28] D. Leake, Case-Based Reasoning: Experiences, Lessons and Future Directions, AAAI Press, Menlo Park, 1996.
[29] M. Lefley, T. Kinsella, Investigating neural network efficiency and structure by weight investigation, Proceedings of the European Symposium on Intelligent Technologies, Germany, 2000.
[30] C. Mair, G. Kadoda, M. Lefley, K. Phalp, C. Schofield, M. Shepperd, S. Webster, An investigation of machine learning based prediction systems, J. Syst. Soft. 53 (1) (2000).
[31] R. San Martin, J.P. Knight, Genetic algorithms for optimization of integrated circuit synthesis, Proceedings of the Fifth International Conference on Genetic Algorithms and their Applications, Morgan Kaufmann, San Mateo, CA, 1993, pp. 432-438.
[32] M. Mitchell, An Introduction to Genetic Algorithms, MIT Press, Cambridge, 1996.
[33] Y. Miyazaki, A. Takanou, H. Nozaki, N. Nakagawa, K. Okada, Method to estimate parameter values in software prediction models, Inf. Soft. Tech. 33 (1991) 239-243.
[34] K.S. Narendra, K. Parthasarathy, Identification and control of dynamical systems using neural networks, IEEE Trans. Neural Networks 1 (1990) 4-27.
[35] E. Rich, K. Knight, Artificial Intelligence, McGraw-Hill, New York, 1995.
[36] R. Schank, Dynamic Memory: A Theory of Reminding and Learning in Computers and People, Cambridge University Press, 1982.
[37] T.J. Sejnowski, C.R. Rosenberg, Parallel networks that learn to pronounce English text, Complex Syst. 1 (1987) 145-168.
[38] M.J. Shepperd, C. Schofield, Estimating software project effort using analogies, IEEE Trans. Soft. Engng 23 (1997) 736-743.
[39] K.K. Shukla, Neuro-genetic prediction of software development effort, Inf. Soft. Tech. 42 (2000) 701-713.
[40] K. Srinivasan, D. Fisher, Machine learning approaches to estimating software development effort, IEEE Trans. Soft. Engng 21 (1995) 126-136.
[41] E. Stensrud, I. Myrtveit, Human performance estimating with analogy and regression models: an empirical validation, Proceedings of the Fifth International Metrics Symposium, IEEE Computer Society, Bethesda, MD, 1998.
[42] A.R. Venkatachalam, Software cost estimation using artificial neural networks, International Joint Conference on Neural Networks, IEEE, Nagoya, 1993.
[43] S. Vicinanza, M.J. Prietula, Case-based reasoning in software effort estimation, Proceedings of the 11th International Conference on Information Systems, 1990.
[44] S. Walczak, N. Cerpa, Heuristic principles for the design of artificial neural networks, Inf. Soft. Tech. 41 (1999) 107-117.
[45] I. Watson, F. Marir, Case-based reasoning: a review, The Knowledge Engng Rev. 9 (1994) 327-354.
[46] G. Wittig, G. Finnie, Estimating software development effort with connectionist models, Inf. Soft. Tech. 39 (1997) 469-476.
Optimal software release scheduling based on artificial neural networks Tadashi Dohi, Yasuhiko Nishio and Shunji Osaki Department of Industrial and Systems Engineering, Faculty of Engineering, Hiroshima University, 4-1 Kagamiyama 1 Chome, Higashi-Hiroshima 739-8527, Japan E-mail: [email protected]
The determination of the optimal software release schedule plays an important role in supplying sufficiently reliable software products to the actual market or users. In existing methods, the optimal software release schedule is determined by assuming a stochastic and/or statistical model called a software reliability growth model. In this paper, we propose a new method to estimate the optimal software release timing which minimizes the relevant cost criterion via artificial neural networks. Recently, artificial neural networks have been actively studied, with many practical applications, and have been applied to assess software product reliability. First, we interpret the underlying cost minimization problem as a graphical one and show that it can be reduced to a simple time series forecasting problem. Secondly, artificial neural networks are used to estimate future fault-detection times. In numerical examples with actual field data, we compare the new method based on neural networks with existing parametric methods using some software reliability growth models, and illustrate its benefit in terms of predictive performance. A comprehensive bibliography on the software release problem is presented.
These SRGMs have several advantages in supporting decision making in the development management of software products. For instance, it is useful for the software project manager to determine the optimal time to stop software testing and to deliver the system to users. This problem, called the optimal software release problem, has also been discussed in many papers. Since the seminal contribution by Okumoto and Goel [1980], many authors have formulated optimal software release problems based on several SRGMs. A comprehensive bibliography on the software release problem is presented in the following section. On the other hand, it is pointed out that most of the SRGMs explored have not been applied to practical software development projects. The reason for this is that none of these SRGMs give sufficiently accurate estimates of reliability. Thus, it should naturally be recognized that optimal software release schedules based on unrealistic SRGMs will lose their validity. Since, in general, most SRGMs in the literature have treated software as a black box and have sometimes assumed specific behaviors of the fault-detection process in the sense of expectation, they cannot be expected to function well for the assessment of software in a different situation where such a scenario can never be assumed. Moreover, the black-box approach does not always function, since we have to capture the very complex reliability behavior of real software systems based upon the limited and incomplete information observed in the testing phase. When software fault-occurrence events are observed, the fault-detection time data may be treated as time series data. Then it will be useful to predict future fault-detection times by applying an appropriate time series forecasting technique instead of the existing SRGMs. Karunanithi et al. [1991, 1992] and Karunanithi and Malaiya [1992, 1996] applied some kinds of neural networks to evaluate software reliability. Independently, Khoshgoftaar and Szabo [1994] and Shinohara et al. [1996] compared the neural network approaches with existing parametric SRGMs in terms of predictive ability. Considering the fact that artificial neural networks are actively applied in many practical applications, they should be utilized for other software engineering applications as well as for the assessment of software product reliability. For example, Srinivasan and Fisher [1995] and Venkatachalam [1993] used neural networks to estimate software development effort or cost instead of the theoretical models proposed by Boehm [1981, 1984]. In this direction, artificial neural networks with high information processing abilities have often been applied to evaluate the complex structure in software. The purpose of the present paper is to develop a new method to estimate the optimal software release time which minimizes the relevant cost criterion, by applying artificial neural networks. More precisely, it is shown that the underlying cost minimization problem can be reduced to a graphical one and is essentially equal to a time series forecasting problem. This fact shows that the earlier analytical approaches to the software release problem can be interpreted from a different perspective, and enables us to determine the optimal schedule via different statistical inference devices. This paper is organized as follows. In the following section, we review the optimal software release problems with reference to earlier works. In section 3, we formulate a
cost-based optimal software release problem and provide a graphical method. In section 4, statistical methods to estimate the optimal software release time from empirical fault-detection time data are presented. After introducing two parametric methods, the statistical estimator of the optimal software release time is defined. Section 5 introduces two typical neural network models with three layers: the multi-layer perceptron neural (MLPN) network and the recurrent neural (RN) network. The MLPN network is a supervised network, by which is meant that the data used for training and testing the network have a required response of the network, known as the target. The RN network, on the other hand, is found in the majority of successful applications, from both research and industrial viewpoints, and is highly appropriate for forecasting time series. As the learning algorithm for both networks, the back-propagation (BP) algorithm is used [Rumelhart and McClelland 1986]. The numerical examples in section 6 are devoted to comparing the neural network approaches with those based on some SRGMs in terms of predictive performance and to evaluating the proposed method quantitatively. Finally, the paper is concluded with some remarks in section 7.
2. Background and literature survey
The optimal software release problem was formulated first by Okumoto and Goel [1980]. They assumed that the temporal variation of the number of cumulative software faults detected in the testing phase followed an exponential SRGM [Goel and Okumoto 1979] based on the NHPP, and attempted to derive analytically the optimal software release time which minimizes the expected cost incurred in both testing and operational phases. Though it might not be easy to estimate cost parameters in an actual software development process, their approach can provide a cost-effective testing schedule as well as a unified criterion to represent several requirements on the software delivery schedule. Koch and Kubat [1983] analyzed a similar but somewhat different problem under the assumption that the number of cumulative software faults can be described as a Markov process, the so-called Jelinski and Moranda model [Jelinski and Moranda 1972]. Shanthikumar and Tafekci [1983] also considered a similar problem under the binomial type of SRGM. Yamada et al. [1984], Yamada and Osaki [1985, 1986], Bai and Yun [1988], Yun and Bai [1990], Ohtera and Yamada [1990], Kapur and Garg [1989, 1990, 1991], Yamada et al. [1993], Shinohara et al. [1997] and Dohi et al. [1997] took account of reliability requirements and/or formulated modified optimal software release problems for different SRGMs. Recently, Hou et al. [1996, 1997] considered the optimal software release policies for the hypergeometric distribution SRGM proposed by Tohma et al. [1989, 1991]. The literature mentioned above assumed some SRGMs to describe the fault-detection process with respect to time. Such a simple approach will be tractable if an adequate SRGM can be selected in the testing phase. We can classify these modeling approaches as the parametric problem. However, the choice of SRGM usually fluctuates as data observation progresses. In other words, especially in the initial testing phase, it may be difficult
to choose the best SRGM from a small sample of fault-detection time data. Also, the statistical hypothesis that a given SRGM is correct may be rejected after obtaining additional data. In such an unreliable situation, a Bayesian adaptive method will be useful to estimate the optimal software release time. Forman and Singpurwalla [1977, 1979], Singpurwalla [1991] and Ross [1985] applied Bayesian statistical inference techniques to estimate the optimal software release time with some reliability criteria. Musa and Ackerman [1989] evaluated the empirical problem of when to stop testing. Masuda et al. [1989] considered the similar problem of determining the release time of a software system with modular structure. The most (mathematically) sophisticated techniques to estimate the optimal software release time under the expected cost criterion were proposed by Dalal and Mallows [1988, 1990, 1992]. They treated the underlying problem as an optimal stopping problem and derived closed-form stopping rules. Recently, Yang and Chao [1995] compared two stopping rules by simulation. It is noticed that the earlier works above also depend on a parametric model structure; that is, they assume that the software inter-failure time obeys an exponential distribution with a random rate. In the framework of Bayesian adaptive inference, such a treatment seems to be valid, but the problem of model identification remains.
3. Cost-based software release problem
It is convenient to unify the optimization criterion for designing the software release plan from the viewpoint of economic justification. In this paper, we concentrate on the problem of minimizing the expected cost anticipated to occur during both the testing and operational phases. Following Okumoto and Goel [1980] and Yamada et al. [1984], suppose that the software test is started at time 0 and terminated at time T (0 ≤ T < T_LC), where T_LC (> 0) denotes the software life cycle or the warranty period. If T_LC is the software life cycle, it may be a non-negative random variable with sufficiently large finite mean. Otherwise, since the user is guaranteed free service to repair software failures during the warranty period [T, T_LC], the upper limit T_LC may be a constant which should be pre-determined (ordinarily, half a year or one year for commercial software). As seen later, although we need not specify T_LC at the moment, assume it to be constant, since it is irrelevant to deriving the optimal software release time T*. Consider the following cost components:

• c1 (> 0): the cost to remove a software fault in the testing phase,
• c2 (> 0): the cost to remove a software fault in the operational phase,
• c3 (> 0): the testing cost per unit time incurred in the testing phase,

where c2 > c1. As assumed in many works [Dalal and Mallows 1992; Koch and Kubat 1983; Okumoto and Goel 1980], it is natural to assume that fixed costs are incurred for debugging software faults. On the other hand, since the running cost for the testing phase is much larger than the holding cost for the software service team during the warranty period, only the testing cost is considered here.
Suppose that the cumulative number of faults detected up to time t ∈ [0, T_LC] is a non-negative stochastic counting process {N(t), t ≥ 0}. Then it is appropriate to adopt the expected total software cost incurred until T_LC, V(T). Define

V(T) = c1 M(T) + c2 {M(T_LC) − M(T)} + c3 T,   (1)

where M(T) = E[N(T)] is the expected cumulative number of faults detected up to time T. If the function M(T) is known completely, the problem is formulated as

min_{0 ≤ T ≤ T_LC} V(T),   (2)

which is a simple algebraic one. For instance, if the function M(T) is continuous and well defined, and further is a non-decreasing, strictly concave and bounded function of T (see, e.g., [Goel and Okumoto 1979; Jelinski and Moranda 1972]), the minimization problem in equation (2) has a unique solution T*, since d²V(T)/dT² > 0 under the assumption c2 > c1. For instance, if {N(t), t ≥ 0} satisfies the following properties:

(i) N(0) = 0,
(ii) {N(t), t ≥ 0} has independent increments,
(iii) Pr{N(t + h) − N(t) = 1} = λ(t; θ)h + o(h),
(iv) Pr{N(t + h) − N(t) ≥ 2} = o(h),

then the stochastic process is the NHPP with

Pr{N(t) = n | N(0) = 0} = {M(t)}^n exp{−M(t)} / n!,   (3)

where M(t) = M(t; θ) is the mean value function of the NHPP, representing the mean number of faults detected, and θ ∈ R^n is an arbitrary parameter vector. Goel and Okumoto [1979] proposed the following exponential SRGM:

M(t; θ) = N{1 − exp(−bt)},   (4)

where the parameters θ = (N, b) ∈ R² (0 < N < ∞, b > 0) denote the initial fault content before testing and the fault-detection rate per remaining fault, respectively. In typical software release problems, one should identify the model parameters under the assumption that the form of the mean value function is known from expert knowledge. That is, suppose that n (> 1) data of the fault-detection time intervals, 0 = x_0, x_1, ..., x_n, and the cumulative time, T_n = Σ_{i=0}^{n} x_i, are available at the current point of time in the testing phase. Then, the maximum likelihood estimation (MLE) method will be useful to estimate the model parameters and to derive the parametric
estimator M̂(T) = M(T; θ̂). More precisely, since the logarithmic likelihood function becomes

log L(θ) = Σ_{i=1}^{n} log λ(T_i; θ) − M(T_n; θ),   (5)

where λ(t; θ) = dM(t; θ)/dt, we obtain the simultaneous likelihood equations to be solved as follows:

∂ log L(θ) / ∂θ = 0.   (6)

Substituting the estimator M̂(T) into equation (2), an estimator of the optimal software release time can be calculated analytically. Yamada et al. [1984] proved the following proposition for the exponential SRGM.

Proposition 3.1. For the exponential SRGM in equation (4), if the MLEs N̂ and b̂ of the model parameters satisfy N̂b̂ > c3/(c2 − c1), then there exists a finite and unique solution

T° = (1/b̂) log{N̂b̂ (c2 − c1) / c3}   (7)

to the first-order condition of optimality dV(T)/dT = 0, and the optimal software release time is T* = min{T°, T_LC}. Otherwise, the optimal software release time becomes T* = 0 and it is optimal to release the product without software testing.

Next, we attempt to reconsider the underlying problem graphically. After some algebraic manipulations, we have:

Proposition 3.2. The minimization problem in equation (2) is equivalent to

max_{0 ≤ T ≤ T_LC} [ M(T) − c3 T / (c2 − c1) ].   (8)

The configuration of the geometrical solution method is depicted in figure 1. The result above implies that the underlying software release problem can be reduced to a graphical one: seek the point T* that maximizes the vertical distance from the straight line c3 T/(c2 − c1) to the curve M(T) in the (T, M(T)) ∈ [0, T_LC] × [0, ∞) plane, if the function M(T) is known. Hence the software release problem is independent of T_LC and is essentially the estimation problem of the curve M(T). In the existing optimal software release formulations mentioned above, it is noticed that the function M(T) was assumed in advance. In other words, if the future behavior of M(T) can be estimated at any time point s ∈ [0, T), the optimal software release time can be determined based on the estimator M̂(T). The following result based on proposition 3.2 gives a dual relationship for proposition 3.1.
Figure 1. Configuration of geometrical solution method.
Proposition 3.3. Suppose that M(T; θ) is bounded, nondecreasing with respect to T and strictly concave.

(i) If the tangent slope of the curve y = M(T; θ) at the origin is strictly larger than c3/(c2 − c1), and if the tangent slope of y = M(T; θ) at x = T_LC is strictly less than c3/(c2 − c1) in the plane (x, y) = (T, M(T)), then there exists a finite and unique optimal software release time T* (0 < T* < T_LC) which minimizes the expected total software cost.

(ii) If the tangent slope of y = M(T; θ) at x = T_LC is larger than or equal to c3/(c2 − c1), then the optimal software release time becomes T* = T_LC and it is optimal to continue the software test.

(iii) If the tangent slope of y = M(T; θ) at the origin is less than or equal to c3/(c2 − c1), then the optimal software release time becomes T* = 0 and it is optimal not to test the software product.

Consequently, if complete information on the function M(T; θ) can be obtained in advance, we can derive the optimal software release time on the graph. This method is applicable to the case where the cumulative number of faults detected is unknown. In the following section, we describe the statistical estimation procedure for the optimal software release time on the graph.
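As a concrete illustration of the graphical rule, the sketch below maximizes the vertical-distance criterion of proposition 3.2 on a grid for the exponential SRGM of equation (4), and cross-checks the result against the closed form of equation (7); the parameter values a caller would supply are illustrative, and N and b stand for the (estimated) model parameters.

    import math

    def optimal_release_grid(N, b, c1, c2, c3, t_lc, steps=100000):
        """Maximize M(T) - c3*T/(c2 - c1) over [0, T_LC] for M(T) = N*(1 - exp(-b*T))."""
        gap = lambda t: N * (1.0 - math.exp(-b * t)) - c3 * t / (c2 - c1)
        return max((t_lc * i / steps for i in range(steps + 1)), key=gap)

    def optimal_release_closed_form(N, b, c1, c2, c3, t_lc):
        """Equation (7); finite and positive only when N*b > c3/(c2 - c1)."""
        if N * b <= c3 / (c2 - c1):
            return 0.0                       # release immediately (proposition 3.1)
        return min(math.log(N * b * (c2 - c1) / c3) / b, t_lc)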
4. Statistical estimation procedure
Consider statistical methods to estimate the optimal software release time for the problem in equation (8). Suppose that the time to detect the ith fault is T_i (i = 1, 2, ..., n) and that n fault-detection time data are available, that is, 0 = T_0 < T_1 < ··· < T_n.
In the parametric methods, the unknown parameters involved in M(T) have to be estimated from the n data, and the optimal software release time (a constant) T* must be calculated algebraically using a suitable parametric estimator M̂(T). Ordinarily, the method of MLE will be applied to estimate the unknown parameters of the SRGMs and the optimal software release time. In the remainder of this section, somewhat different methods are introduced (see [Hishitani et al. 1991; Yamada 1989]).

The first method is to estimate the jth (j = n + 1, ..., N) fault-detection time T_j according to

T̂_j = M^{-1}(j; θ̂),   (9)

if the inverse function M^{-1}(·) exists. The second method focuses on the inter-failure time distribution. It is well known that the NHPP with bounded mean value function, i.e., M(t) < ∞ for all t, has the inter-failure time distribution F_k(t) = Pr{T_{k+1} − T_k ≤ t | T_0 = 0}, k = 0, 1, 2, ..., where F_k(∞) < 1. For instance, the exponential SRGM and the other models based on the NHPP possess this property (see [Musa et al. 1987]). In order to approximate the fault-detection time interval, Hishitani et al. [1991] proposed the following normalized distribution:

G_k(t) = F_k(t) / F_k(∞),   (10)

and regarded the following MTBF (mean time between faults) as an estimate:

x̂_j = E[T_j] − E[T_{j−1}],  j = n + 1, ..., N,   (11)

where the expectation in equation (11) is taken with respect to G_k(t). The two methods can be applied to estimate M(T) without using the method of MLE. If n fault-detection time interval data x_1 = T_1 − T_0, x_2 = T_2 − T_1, ..., x_n = T_n − T_{n−1} are available, and if the estimates T̂_j and x̂_j = T̂_j − T̂_{j−1} (j = n + 1, ..., m_n) are given, where

m_n = max{ m : m integer, Σ_{i=1}^{n} x_i + Σ_{j=n+1}^{m} x̂_j ≤ T_LC },   (12)

then an estimator of the mean value function is given by the step function

M̂_n(T) = 0 for 0 ≤ T < T_1,
M̂_n(T) = i for T_i ≤ T < T_{i+1},
M̂_n(T) = n for T_n ≤ T < T̂_{n+1},
M̂_n(T) = j for T̂_j ≤ T < T̂_{j+1},   (13)

for i = 1, 2, ..., n − 1 and j = n + 1, n + 2, ..., m_n − 1. The configuration of the estimator is shown in figure 2. Note that the estimator in equation (13) is not consistent, but it is very reasonable. In addition, it is pointed out that the MLE of the
Figure 2. Empirical total number of software faults.
NHPP is not always the best estimator. This motivates the idea that the two estimators based on equations (9) and (11) might fit our graphical method. From equation (13), define an estimator of the expected total software cost as follows:

V̂(T) = c1 M̂_n(T) + c2 {M̂_n(T_LC) − M̂_n(T)} + c3 T.   (14)

From proposition 3.2, the optimal software release problem can be formulated as

max_{0 ≤ T ≤ T_LC} [ M̂_n(T) − c3 T / (c2 − c1) ].   (15)

Then we have the following useful result to restrict the search space of the optimal solution.

Proposition 4.1. For an estimator M̂_n(T), the candidate of the optimal software release time for the problem in equation (15) necessarily exists among {T_n, T̂_{n+1}, ..., T̂_{m_n}}.

The proof is obvious from the fact that the function M̂_n(T) is right-continuous. From proposition 4.1, we can search for the optimal solution over a finite region consisting of m_n − n + 1 elements. Hence, if the estimator in equation (9) or (11) is used, the underlying optimal software release problem can be reduced to a time series forecasting one. This suggests that a more beneficial forecasting method should be applied to estimate the future fault-detection times. In the following section, we introduce the statistical methods to estimate x̂_j (j = n + 1, ..., m_n) using artificial neural networks.
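Proposition 4.1 makes the search easy to implement: only the jump points of the step function need to be examined. The following is a minimal sketch, assuming the forecast inter-failure times x̂_j (however obtained) are supplied by the caller:

    def release_from_times(x_obs, x_pred, c1, c2, c3):
        """Maximize M_n(T) - c3*T/(c2 - c1) over the jump points T_n, ..., T_{m_n}
        (proposition 4.1). Returns (T*, number of faults detected by T*)."""
        times, t = [], 0.0
        for x in list(x_obs) + list(x_pred):
            t += x
            times.append(t)              # cumulative fault-detection times T_1, T_2, ...
        slope = c3 / (c2 - c1)
        n = len(x_obs)
        # at a jump point T_i the step function equals i (1-based index)
        candidates = [(i + 1 - slope * times[i], times[i], i + 1)
                      for i in range(n - 1, len(times))]
        _, t_star, faults = max(candidates)
        return t_star, faults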
5. Estimation via artificial neural networks
In this section, we explain two neural network models: the MLPN and RN networks. First, let us consider the MLPN network (see figure 3). The MLPN used in this paper is constructed of three layers of cells, with interconnections between all combinations of cell layers. The input layer receives input data. The middle layer, called the hidden layer, contains processing nodes termed artificial neurons, and the output layer gives an output as an estimated value of the next software fault-detection time interval. Each processing element in the hidden layer or the output layer has an activation level, which must be computed as a function of the activation levels of the cells connected to it and the associated interconnection weights. This function is called the activation function and, commonly, the following sigmoidal function is used:

f(u) = 1 / (1 + exp{−(u − β)}),   (16)

where β is a nodal threshold (constant). As a training method, we use the well-known BP algorithm [Rumelhart and McClelland 1986]. This algorithm modifies nodal thresholds and interconnection weights to reduce the error between the output and the target, where two kinds of learning rates are needed: the parameter α is used to determine the training cycle and the parameter η is used to adjust the convergence speed of the algorithm. The target is given by the jth fault-detection time interval when the input data are those previous to the jth fault. Thus, if the accumulated error over all input data decreases below a tolerance level, then the neural network outputs the next fault-detection time. For the MLPN used in this paper, the numbers of cells in the input, the hidden and the output layers are fixed as 10, 10 and 1, respectively, where these numbers were determined through numerical experiments in advance.

Next, we explain the RN network (see figure 4). The network in figure 4 is called an Elman network and has a multi-layered structure similar to that of MLPNs. In the RN network, in addition to an ordinary hidden layer, there is another special hidden layer called the context or state layer. This layer receives feedback signals from the ordinary hidden layer, and the outputs of neurons in the context layer are fed forward to the hidden layer. If only the forward connections are to be adapted and the feedback connections are fixed at constant values, the network can be considered an ordinary feedforward network and the BP algorithm used to train it. It is well known that the RN network is better than the MLPN for forecasting time series
Figure 3. Illustration of MLPN network.
Figure 4. Illustration of RN network.
data, and it will be useful to estimate the fault-detection time intervals of the software product in the testing phase. In using the MLPN, after one training of the network with a fixed number of training data, all estimates of the future fault-detection time intervals are given under an identical network architecture. On the other hand, the RN network can estimate the time intervals sequentially, changing its internal parameters. See Karunanithi and Malaiya [1992, 1996] for more details. The estimation algorithm for the optimal software release time is the following:

Step 1. Given n fault-detection time interval data x_1, x_2, ..., x_n, train the neural networks and estimate the future fault-detection time intervals x̂_j (j = n + 1, ..., m_n), where T_LC and m_n are determined in advance.

Step 2. Obtain the estimate M̂_n(T) by plotting the points {A_1, A_2, ..., A_{m_n}} = {(1, x_1), ..., (n, x_n), (n + 1, x̂_{n+1}), ..., (m_n, x̂_{m_n})} and constructing a step function in the two-dimensional plane.

Step 3. Calculate M̂_n(T_LC) for a given T_LC, and search for the point A_{j*} maximizing the vertical distance from the straight line to the points A_i (i = 1, ..., m_n).

Step 4. Calculate the minimum total software cost.

In Step 1, by replacing the estimates produced by the neural networks with those produced by the NHPP in equations (9) and (11), it can be seen that the determination based on the SRGMs is also possible. Therefore, the graphical method proposed in section 3 is a unified approach and is applicable to every statistical inference technique. In the following section, we estimate the optimal software release time and the corresponding total software cost from actual field data, and examine the predictive performance of the proposed method based on artificial neural networks.
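Putting steps 1-4 together, the compressed sketch below uses a deliberately simple stand-in forecaster (a moving-window average rather than a trained MLPN or RN), since any time series predictor can be slotted into step 1; it reuses release_from_times from the sketch in section 4.

    def forecast_intervals(history, horizon, window=10):
        """Step 1 stand-in: forecast the next `horizon` inter-failure times."""
        xs = list(history)
        for _ in range(horizon):
            w = xs[-window:]
            xs.append(sum(w) / len(w))       # window average as a toy predictor
        return xs[len(history):]

    def estimate_release(x_obs, c1, c2, c3, t_lc, window=10):
        """Steps 2-4: extend forecasts up to T_LC (the m_n of equation (12)),
        search the jump points, and cost the result via equation (14)."""
        x_pred, total = [], sum(x_obs)
        while True:
            nxt = forecast_intervals(list(x_obs) + x_pred, 1, window)[0]
            if total + nxt > t_lc:
                break
            x_pred.append(nxt)
            total += nxt
        t_star, faults = release_from_times(x_obs, x_pred, c1, c2, c3)
        m_n = len(x_obs) + len(x_pred)
        cost = c1 * faults + c2 * (m_n - faults) + c3 * t_star   # equation (14)
        return t_star, cost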
6. Numerical examples
We analyze 6 data sets cited in Lyu [1996] (see table 1). In data set #1, which consists of 136 fault-detection time data observed in the testing phase, suppose that 95 fault-detection time interval data x_i (i = 1, 2, ..., 95) are available and that we wish to
Table 1
Test data: data sets #1-#6.
estimate the optimal software release time sequentially from the time T_95 as the software testing proceeds. The models used are G&O, Y&O, G&O-1, G&O-2, MLPN and RN. G&O and Y&O assume the SRGMs by Goel and Okumoto [1979] and Yamada and Osaki [1985], respectively, and calculate the optimal software release times and the minimum expected costs using the MLEs of the respective mean value functions M(t; θ). When the next data point (e.g., the 96th) is observed, the most recent 95 data are used to estimate the mean value function, and this cycle is repeated until the termination of the test, T_LC. In G&O-1 and G&O-2, the fault-detection time is estimated by the inverse function in equation (9) and by the MTBF based on the normalized inter-failure time distribution in equation (11), respectively. MLPN and RN denote the neural network approaches using the multi-layer perceptron neural network and the recurrent neural network, respectively. The most recent 10 data are input to the neural networks and the other data are used for training. That is, in the RN, 9 fault-detection time data and one feedback signal are used as the input. Since the software life cycle T_LC is independent of the minimization and can be sufficiently large, we can regard it as the upper limit of the computation time. The respective values of T_LC used in the experiments are given in table 1. It is important to design the neural network architecture for the assessment of software reliability. As mentioned in section 5, the network size is fixed as (10, 10, 1) units. Also, from preliminary experiments, the maximum number of training cycles is 10,000 and the parameters are (α, η) = (0.2, 0.3) for the MLPN and (α, η) = (0.9, 0.8) for the RN. The initial values of nodal thresholds and interconnection weights are determined by uniform random numbers. The model parameters included in the total software cost are assumed as follows: c1 = 1 [dollar/fault], c2 = 2 [dollar/fault] and c3 = 0.0004 [dollar/unit testing time (CPU time)]. Figure 5 shows the behavior of the estimates of the optimal software release time for data set #1, where the central straight line in the figure denotes the real optimum and is calculated so as to maximize the vertical distance between the step function representing the number of faults and the straight line c3 T/(c2 − c1) after obtaining all 136 data. It is no problem to regard this as the real optimum. From this figure, it is found that both MLPN and RN fluctuate in the neighborhood of the real optimum and that the other prediction models tend to increase mildly as the number of data
Figure 5. Behavior of estimates of the optimal software release time (data set #1).
increases. This result tells us that the SRGMs G&O, Y&O, G&O-1 and G&O-2 underestimate the optimal software release time in the initial testing phase and give rather pessimistic predictions. On the other hand, note that the MLPN and the RN do not change much as time elapses once they have output the real optimum as the release time. This powerful property of the neural network approaches depends on the geometrical structure proposed in section 3. To this end, if the neural networks output a value close to the real optimum at some point of time, the estimates after that time never move dramatically. The behavior of the estimates of the corresponding minimum software cost (data set #1) is presented in figure 6. It can be seen that the estimates show tendencies similar to figure 5. In tables 2 and 3, we calculate the mean squared error of the estimates of the optimal software release time and the minimum software cost for all 6 data sets,
Table 2
Comparison of the optimal software release time (mean squared error): data sets #1-#6.
where "mean" is taken on all observation points (£96, £97,. • •) and the squared error is normalized as
[
(estimate of the optimal release time) — (real value)! (real value) J
(17)
From these results, we recognize that the neural network approaches give better predictive performance than the existing SRGMs. In particular, it is clear that the RN provides the lowest squared errors, less than 1% in almost all cases except data set #6. This is because the RN, with its feedback structure, is superior to the MLPN in terms of time series forecasting ability. On the other hand, we find that the estimation results of the parametric methods based on the SRGMs are very poor and that no remarkable differences between G&O, G&O-1 and G&O-2 can be recognized. In other words, if we assume parametric models specifying the mean value function, we cannot conclude that the MLE method is always better than the graphical estimation method proposed in this paper. Finally, we summarize the experimental results in this paper as follows.

(i) The neural network, especially the recurrent neural network, can provide better estimates of the optimal software release time minimizing the total software cost, as well as of future fault-detection times, than existing SRGMs. This tendency becomes more remarkable as the number of observed fault-detection time data increases.
Figure 6. Behavior of estimates of the total software cost (data set #1).
(ii) In the graphical approach proposed in this paper (MLPN, RN, G&O-1, G&O-2), one can select the optimal software release time from several estimates of the fault-detection time. This is more realistic than the existing SRGMs, which determine the optimal solution as a constant point of time. The constant time derived by the MLE method may not always be feasible (i.e., we might still be testing at that time, with some test cases not yet digested).

(iii) A drawback of the neural network approach is that the network has to be adjusted in advance. However, once the network architecture is determined, no scenario for the debugging process is needed, in contrast to the existing SRGMs. Moreover, if the network architecture can be optimized, estimation with higher accuracy can be expected.
Thus, it is concluded that the method based on artificial neural networks is very useful in actual software development process planning. The graphical idea proposed in this paper is also convenient for estimating the optimal software release schedule by applying existing time series forecasting techniques.

7. Concluding remarks
In this paper, we have developed a method to estimate the optimal software release time by applying artificial neural networks. It has been shown that the underlying cost-minimization problem can be reduced to a graphical one: minimizing the vertical distance from a straight line to a curve representing the number of faults. Since the essential factor behind the software release problem is to estimate future fault-detection time intervals, we have used two typical artificial neural networks for time series forecasting. In numerical examples with real software fault-detection time data, it has been found that the predictive performance for the optimal software release time is better with neural networks than with the existing parametric SRGMs. Though this paper has assumed two classical neural networks, an effort to improve the forecasting ability will be needed in the future. For instance, if other environmental data for software testing can be observed, e.g., structural factors such as the number of lines of code, functions and modules, or testing effort and cost factors such as the number of programmers and the execution time, the neural networks may carry out more realistic information processing taking account of those factors. On the other hand, the graphical method proposed in this paper will motivate the use of other statistical time series forecasting techniques instead of neural networks. Dohi et al. [1998] applied five statistical autoregressive models to estimate the number of software faults and the optimal software release time. However, in order to go beyond the predictive results of the neural networks in this paper, further improvement of the statistical autoregressive processes has to be made.

Acknowledgement

This work was partially supported by a Grant-in-Aid for Scientific Research from the Ministry of Education, Science, Sports and Culture of Japan under Grant No. 09780411 and No. 09680426.

References

Bai, D.S. and W.Y. Yun (1988), "Optimum Number of Errors Corrected Before Releasing a Software System," IEEE Transactions on Reliability R-37, 41-44.
Boehm, B.W. (1981), Software Engineering Economics, Prentice-Hall, Englewood Cliffs, NJ.
Boehm, B.W. (1984), "Software Engineering Economics," IEEE Transactions on Software Engineering SE-10, 4-21.
Dalal, S.R. and C.L. Mallows (1988), "When Should One Stop Testing Software?" Journal of the American Statistical Association 83, 872-879.
Dalal, S.R. and C.L. Mallows (1990), "Some Graphical Aids for Deciding When to Stop Testing Software," IEEE Journal of Selected Areas in Communications 8, 169-175.
Dalal, S.R. and C.L. Mallows (1992), "When to Stop Testing Software - Some Exact and Asymptotic Results," In Bayesian Analysis in Statistics and Econometrics, Lecture Notes in Statistics, Vol. 75, Springer, New York, pp. 267-276.
Dalal, S.R. and C.L. Mallows (1992), "Buying With Exact Confidence," Annals of Applied Probability 2, 752-765.
Dohi, T., N. Kaio and S. Osaki (1997), "Optimal Software Release Policies With Debugging Time Lag," International Journal of Reliability, Quality and Safety Engineering 4, 241-255.
Dohi, T., H. Morishita and S. Osaki (1998), "A Statistical Estimation Method of Optimal Software Release Timing Applying Autoregressive Models," In Proceedings of First Euro-Japanese Workshop on Stochastic Risk Modelling for Finance, Insurance, Production and Reliability, Vol. 2.
Forman, E.H. and N.D. Singpurwalla (1977), "An Empirical Rule for Debugging and Testing Computer Software," Journal of the American Statistical Association 72, 750-757.
Forman, E.H. and N.D. Singpurwalla (1979), "Optimal Time Intervals for Testing Hypotheses on Computer Software Errors," IEEE Transactions on Reliability R-28, 250-253.
Friedman, M.A. and J.M. Voas (1995), Software Assessment, Reliability, Safety, Testability, Wiley, New York.
Goel, A.L. and K. Okumoto (1979), "Time-Dependent Error-Detection Rate Model for Software Reliability and Other Performance Measures," IEEE Transactions on Reliability R-28, 206-211.
Hishitani, J., S. Yamada and S. Osaki (1991), "Reliability Assessment Measures Based on Software Reliability Growth Model With Normalized Method," J. Information Processing Society of Japan 14, 178-183.
Hou, R.H., S.Y. Kuo and Y.P. Chang (1996), "Optimal Release Policy for Hyper-Geometric Distribution Software-Reliability Growth Model," IEEE Transactions on Reliability R-45, 646-651.
Hou, R.H., S.Y. Kuo and Y.P. Chang (1997), "Optimal Release Times for Software Systems With Scheduled Delivery Time Based on the HGDM," IEEE Transactions on Computers C-46, 216-221.
Jelinski, Z. and P.B. Moranda (1972), "Software Reliability Research," In Statistical Computer Performance Evaluation, W. Freiberger, Ed., Academic Press, New York, pp. 465-484.
Kapur, P.K. and R.B. Garg (1989), "Cost-Reliability Optimum Release Policies for a Software System Under Penalty Cost," International Journal of Systems Science 20, 2547-2562.
Kapur, P.K. and R.B. Garg (1990), "Optimal Software Release Policies for Software Reliability Growth Models Under Imperfect Debugging," Revue Francaise d'Automatique, Informatique et Recherche Operationnelle (Recherche Operationnelle/Operations Research) 24, 295-305.
Kapur, P.K. and R.B. Garg (1990), "Cost-Reliability Optimum Release Policies for a Software System With Testing Effort," Opsearch 27, 109-116.
Kapur, P.K. and R.B. Garg (1991), "Optimal Release Policies for Software Systems With Testing Effort," International Journal of Systems Science 22, 1563-1571.
Kapur, P.K. and R.B. Garg (1991), "Optimum Release Policy for Inflection S-Shaped Software Reliability Growth Model," Microelectronics and Reliability 31, 39-42.
Karunanithi, N., Y.K. Malaiya and D. Whitley (1991), "Prediction of Software Reliability Using Neural Networks," In Proceedings of 2nd International Symposium on Software Reliability Engineering, IEEE Computer Society Press, Los Alamitos, CA, pp. 124-130.
Karunanithi, N. and Y.K. Malaiya (1992), "The Scaling Problem in Neural Networks for Software Reliability Prediction," In Proceedings of Third International Symposium on Software Reliability Engineering, IEEE Computer Society Press, Los Alamitos, CA, pp. 76-82.
Karunanithi, N. and Y.K. Malaiya (1996), "Neural Networks for Software Reliability Engineering," In Handbook of Software Reliability Engineering, M.R. Lyu, Ed., McGraw-Hill, New York, pp. 699-728.
Karunanithi, N., D. Whitley and Y.K. Malaiya (1992), "Using Neural Networks in Reliability Prediction," IEEE Software 9, 53-59.
Karunanithi, N., D. Whitley and Y.K. Malaiya (1992), "Prediction of Software Reliability Using Connectionist Models," IEEE Transactions on Software Engineering SE-18, 563-574.
Khoshgoftaar, T.M. and R.M. Szabo (1994), "Predicting Software Quality, During Testing, Using Neural Network Models: A Comparative Study," International Journal of Reliability, Quality and Safety Engineering 1, 303-319.
Koch, H.S. and P. Kubat (1983), "Optimal Release Time for Computer Software," IEEE Transactions on Software Engineering SE-9, 323-327.
Lyu, M.R., Ed. (1996), Handbook of Software Reliability Engineering, McGraw-Hill, New York.
Masuda, Y., N. Miyawaki, U. Sumita and S. Yokoyama (1989), "A Statistical Approach for Determining Release Time of Software System With Modular Structure," IEEE Transactions on Reliability R-38, 365-372.
Musa, J.D., A. Iannino and K. Okumoto (1987), Software Reliability, Measurement, Prediction, Application, McGraw-Hill, New York.
Musa, J.D. and A.F. Ackerman (1989), "Quantifying Software Validation: When to Stop Testing," IEEE Software 6, 19-27.
Ohtera, H. and S. Yamada (1990), "Optimum Software-Release Time Considering an Error-Detection Phenomenon During Operation," IEEE Transactions on Reliability R-39, 596-599.
Okumoto, K. and A.L. Goel (1980), "Optimum Release Time for Software Systems Based on Reliability and Cost Criteria," Journal of Systems and Software 1, 315-318.
Rumelhart, D.E. and J.L. McClelland (1986), Parallel Distributed Processing, Vol. 1, MIT Press, Cambridge, MA.
Ross, S.M. (1985), "Software Reliability: The Stopping Rule Problem," IEEE Transactions on Software Engineering SE-11, 1472-1476.
Schick, G.J. and R. Wolverton (1978), "An Analysis of Competing Software Reliability Models," IEEE Transactions on Software Engineering SE-4, 104-120.
Shanthikumar, J.G. and S. Tüfekci (1983), "Application of a Software Reliability Model to Decide Software Release Time," Microelectronics and Reliability 23, 41-59.
Shinohara, Y., M. Imanishi, T. Dohi and S. Osaki (1996), "Software Reliability Prediction Using Neural Network Technique," In Proceedings of Second Australia-Japan Workshop on Stochastic Models in Engineering, Technology and Management, R.J. Wilson, D.N.P. Murthy and S. Osaki, Eds., Technology Centre, The University of Queensland, Brisbane, pp. 564-571.
Shinohara, Y., T. Dohi and S. Osaki (1997), "Comparisons of Optimal Release Policies for Software Systems," Computers and Industrial Engineering 33, 813-816.
Singpurwalla, N.D. (1991), "Determining an Optimal Time Interval for Testing and Debugging Software," IEEE Transactions on Software Engineering SE-17, 313-319.
Srinivasan, K. and D. Fisher (1995), "Machine Learning Approaches to Estimating Software Development Effort," IEEE Transactions on Software Engineering SE-21, 126-137.
Tohma, Y., K. Tokunaga, S. Nagase and Y. Murata (1989), "Structural Approach to the Estimation of the Number of Residual Software Faults Based on the Hyper-Geometric Distribution," IEEE Transactions on Software Engineering SE-15, 345-355.
Tohma, Y., H. Yamano, M. Ohba and R. Jacoby (1991), "The Estimation of Parameters of the Hypergeometric Distribution and Its Application to the Software Reliability Growth Model," IEEE Transactions on Software Engineering SE-17, 483-489.
Venkatachalam, A.R. (1993), "Software Cost Estimation Using Artificial Neural Networks," In Proceedings of 1993 International Joint Conference on Neural Networks, pp. 987-989.
Xie, M. (1991), Software Reliability Modelling, World Scientific, Singapore.
Yamada, S., H. Narihisa and S. Osaki (1984), "Optimal Software Release Policies With a Scheduled Software Delivery Time," International Journal of Systems Science 15, 904-915.
Yamada, S. and S. Osaki (1985), "Cost-Reliability Optimal Release Policies for Software Systems," IEEE Transactions on Reliability R-34, 422-424.
Yamada, S. and S. Osaki (1985), "Optimal Software Release Policies With Simultaneous Cost and Reliability Requirement," European Journal of Operational Research 31, 46-51.
Yamada, S. and S. Osaki (1985), "Software Reliability Growth Modeling: Models and Applications," IEEE Transactions on Software Engineering SE-11, 1431-1437.
Yamada, S. and S. Osaki (1986), "Optimal Software Release Policies for a Nonhomogeneous Software Error Detection Rate Model," Microelectronics and Reliability 26, 691-702.
Yamada, S. (1989), Software Reliability Assessment Technology, HBJ Japan, Tokyo (in Japanese).
Yamada, S., J. Hishitani and S. Osaki (1993), "Software-Reliability Growth With a Weibull Test-Effort: A Model & Application," IEEE Transactions on Reliability R-42, 100-106.
Yang, M.C.K. and A. Chao (1995), "Reliability-Estimation & Stopping-Rules for Software Testing, Based on Repeated Appearances of Bugs," IEEE Transactions on Reliability R-44, 315-321.
Yun, W.Y. and D.S. Bai (1990), "Optimum Software Release Policy With Random Life Cycle," IEEE Transactions on Reliability R-39, 167-170.
Chapter 3

ML Applications in Property and Model Discovery

Given a software system, ML methods can be used to identify or discover certain properties of the system. Such discovered properties can be indispensable in many SE tasks: to facilitate various development and maintenance activities, to understand the relationships among software components for program understanding, to identify reusable components for reuse repository construction, and to re-engineer an existing system into one that has desirable properties, to name a few. Table 23 summarizes the status of the ML methods being utilized in this application category.

Table 23. ML methods used in discovery. The discovery tasks considered are program invariants, object identification, operation boundary, mutants, and process models; the candidate ML methods are NN, IBL/CBR, DT, GA, GP/ILP, EBL, CL/BL, IAL, RL, EL, SVM, and AL.
In this chapter, we include two papers, one dealing with using NN to identify objects in procedural programs, and the other tackling the issue of detecting equivalent mutants in mutation testing using BL. The paper by Abd-El-Hafiz [1] describes a general approach to identifying objects in procedural programs using clustering NN. Currently, there are three main approaches to the identification of objects in software systems: concept analysis (based on lattice theory), the knowledge-based approach, and cluster analysis. These approaches suffer from two limitations: the inability to identify a coherent set of objects as a result of the presence of undesired connections among functions of a system under analysis, and the need for human intervention to obtain satisfactory identification results. Within cluster analysis, there are three types of algorithms: hierarchical, graph-theoretic, and optimization. The proposed approach, which is based on cluster analysis and belongs to the optimization category, addresses these limitations through two clustering NN algorithms, ART (Adaptive Resonance Theory) and SOM (Self-Organizing Maps). The ART and SOM algorithms are used to carry out the clustering process, in which the routines of a software system are partitioned so as to minimize intra-cluster distances and maximize inter-cluster distances. The weights associated with each output node of the neural network correspond to a cluster centroid (cluster prototype). The distance measure is either the Manhattan distance or the Euclidean distance. The process of weight modification and routine partitioning is iterated until stability is reached. Using a prototype tool implementing the ART and SOM algorithms, the author performed a number of experiments involving programs written in C and Pascal of up to 19,000 lines of code. The tool takes as its input a routine-attribute matrix obtained through
simple static analysis of the code, chooses one of the NN algorithms and its corresponding parameter based on the user's decision, and produces the clustering results. The analysis indicates that the clustering results of the SOM algorithm are either comparable to or better than those obtained through the ART algorithm, though at the cost of longer execution time.

The paper by Vincenzi et al. [141] deals with the issue of identifying equivalent mutants in mutation testing using BL. Mutation testing holds great promise for improving software quality. It not only leads to the detection of faults in the tested program, but also offers a criterion for the test adequacy issue. However, a major hindrance to the application of this testing technique is its computational cost. For instance, there is a large number of mutants that need to be analyzed and identified for possible equivalence with regard to the original program under test, which is computationally very expensive. The proposed approach in [141] is based on the brute-force maximum a posteriori (MAP) learning algorithm to identify the most promising group of mutants that should be analyzed. The most probable hypothesis is learned from the following settings:

1. Five C programs (each about 100 lines of code) are used to generate the training data.
2. A tool is used that supports mutation testing at the unit level and implements 71 mutant operators for C programs. These mutant operators can be divided into four categories based on the syntactic units over which the mutation is applicable: constants, operators, statements and variables.
3. A test set of 500 cases is constructed for each program. To observe the variation in the number of equivalent and non-equivalent mutants as the number of test cases increases, 19 scenarios of test cases are defined, ranging over 0, 10, 20, 30, ..., 300, 350, 400, 450 and 500 test cases. Probabilistic information on the equivalent and non-equivalent mutants generated by a given mutant operator is then collected for each scenario.
4. From the 19 scenarios, the probability of each mutant operator generating equivalent and non-equivalent mutants can be calculated.

A case study is performed on a sort program of 624 lines of code. The results indicate that for certain mutant operators, the error rate is within a reasonable range of 10%. The following papers are included here:

S. K. Abd-El-Hafiz, "Identifying objects in procedural programs using clustering neural networks", Automated Software Engineering, Vol. 7, No. 3, 2000, pp. 239-261.
A. M. R. Vincenzi et al., "Bayesian-learning based guidelines to determine equivalent mutants", International Journal of Software Engineering and Knowledge Engineering, Vol. 12, No. 6, 2002, pp. 675-689.
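As a rough sketch of the reasoning behind the Bayesian approach summarized above (our own illustration; the operator names are real C mutant operators, but the counts are invented and the actual procedure in [141] differs in detail), the most probable hypothesis for a surviving mutant can be chosen per operator as follows:

# Sketch of per-operator MAP classification of surviving mutants.
# All counts here are hypothetical and only illustrate the idea.
training = {
    # operator: (equivalent mutants, non-equivalent mutants) observed
    "CRCR": (40, 10),   # required constant replacement
    "OAAN": (5, 45),    # arithmetic operator replacement
}

def map_hypothesis(operator):
    """Most probable class for a surviving mutant of a given operator."""
    eq, ne = training[operator]
    p_eq = eq / (eq + ne)        # P(equivalent | operator)
    return ("equivalent", p_eq) if p_eq >= 0.5 else ("non-equivalent", 1 - p_eq)

for op in training:
    print(op, map_hypothesis(op))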
Identifying Objects in Procedural Programs Using Clustering Neural Networks

SALWA K. ABD-EL-HAFIZ
[email protected]
Engineering Mathematics Department, Faculty of Engineering, Cairo University, Giza, Egypt
Abstract. This paper presents a general approach for the identification of objects in procedural programs. The approach is based on neural architectures that perform an unsupervised learning of clusters. We describe two such neural architectures, explain how to use them in identifying objects in software systems, and briefly describe a prototype tool which implements the clustering algorithms. With the aid of several examples, we explain how our approach can identify abstract data types as well as groups of routines which reference a common set of data. The clustering results are compared to the results of many other object identification techniques. Finally, several case studies were performed on existing programs to evaluate the object identification approach. Results concerning two representative programs and their generated clusters are discussed.

Keywords: clustering, objects, abstract data types, neural networks

1. Introduction
The identification of objects within existing procedural programs can facilitate many maintenance activities. By extracting reusable components from existing software systems, the population of a reuse repository can be cost effective (Abd-El-Hafiz et al., 1991; Canfora et al., 1993b). That is why the extraction of reusable abstract data types has been the focus of many research activities (Canfora et al., 1996, 1993a, b; Dunn and Knight, 1993). In addition, the automatic identification of objects assists in understanding the relationships among the components of a software system. It thus facilitates the recognition and comprehension of the abstractions existing in a given system without getting distracted by implementation details (Abd-El-Hafiz, 1997; Abd-El-Hafiz and Basili, 1996; Canfora et al., 1996). Furthermore, object identification enables the re-engineering of procedural programs into functionally equivalent object-oriented ones that are easier to maintain (Newcomb and Kotik, 1995; McFall and Sleith, 1993; Siff and Reps, 1997). Some object identification approaches apply mathematical concept analysis to the object identification problem (Lindig and Snelting, 1997; Sahraoui et al., 1997; Siff and Reps, 1997; Snelting, 1996). Concept analysis is a branch of lattice theory that can be used to identify similarities among a set of objects based on their attributes (Siff and Reps, 1997). Others adopt a knowledge-based approach which uses a library of known abstract data types and rules to recognize their implementations (Dekker and Ververs, 1994). However, considerable research on object identification and software modularization, in general, is based on cluster analysis, where clustering is understood to be the grouping of similar objects and the separation of dissimilar ones. Refer for example to (Achee and Carver, 1994; Canfora et al., 1996; Hutchens and Basili, 1985; Ibba et al., 1993; Kunz, 1996; Mancoridis
et al., 1998; Schwanke, 1991; Yeh et al., 1995). Similarity is measured, either explicitly or implicitly, by using relationships among the objects or by using scores of the objects on a number of attributes. An overview of these similarity measures and the different clustering algorithms can be found in (Wiggerts, 1997).

This paper reviews some of the approaches proposed in the literature for the identification of objects in existing software systems. The limitations of these approaches are highlighted, with the emphasis being on two limitations. The first one is the inability to identify a coherent set of objects due to the existence of undesired connections among routines of the analyzed system. The second one is the reliance on human intervention in order to produce good results. Techniques which rely on intervention from an expert who understands the system are of little help to someone who is not familiar with the software. To overcome these limitations, the paper presents an object identification approach that is based on cluster analysis. More specifically, we use clustering neural networks to identify objects in procedural programs. The identification technique proposes a flexible set of attributes upon which the clustering can be based. Using two neural clustering algorithms, a hierarchy of clustering options is presented. These clustering results, in general, enable the precise identification of objects with no human intervention. Section 2 of this paper reviews related work. Section 3 describes two neural architectures which perform an unsupervised learning of clusters. Section 4 demonstrates how these clustering neural networks can be used to identify objects in software systems. This section also gives the clustering results of many examples and compares the presented approach with related approaches. Section 5 briefly describes an implemented prototype tool and evaluates our approach by applying it on several existing procedural programs. Results concerning two representative programs and their generated clusters are discussed. Finally, Section 6 summarizes the strengths and limitations of the presented approach and gives future research directions.

2. Related work
Some of the existing object identification approaches use mathematical concept analysis, which provides a method to identify sensible groupings of objects that have common attributes. Lindig and Snelting (1997) and Sahraoui et al. (1997) use concept analysis by relating each routine of a program to the global variables it accesses. Their results are not encouraging because they could not identify any useful way to decompose a program into objects. Siff and Reps (1997) obtain better results by relating the routines of the program to different attributes which include the fields of user-defined structure types that are accessed by these routines. In order to identify objects in the presence of undesired connections between the routines, attributes that reflect 'negative' information must be invented by the user of the approach (e.g., attributes of the form 'routine R does not use fields of structure T'). Thus, human expertise and effort are required to formulate the additional complementary attributes which correctly recognize objects in the code. The knowledge-based identification approach of Dekker and Ververs (1994) stores a library of common abstract data types in order to search a program for matches against them. This approach has the disadvantage of being only able to recognize abstract data types whose description is stored in the library.
Many other object identification approaches are based on cluster analysis. The clustering algorithms reviewed in this section are classified into three different categories: hierarchical, graph-theoretic, and optimization algorithms (Wiggerts, 1997). Hierarchical clustering algorithms build a hierarchy of clusters such that each level contains the same clusters as the first lower level except for two clusters which are joined to form one cluster. Graph-theoretic clustering algorithms are based on defining a model of the subject system as a graph on which notable sub-graphs and/or patterns are identified. Optimization clustering takes an initial partitioning of the system and tries to improve it by iterative adaptations according to some heuristic.

Hutchens and Basili (1985) use data bindings and a hierarchical clustering algorithm to determine how two procedures are related. When the systems under consideration exhibit strong encapsulation (i.e., hide their data), the number of publicly accessible variables becomes limited. Hence, the approach fails to identify objects in such cases. Examples of the graph-theoretic clustering algorithms can be found in (Canfora et al., 1993b, 1996; Cimitile and Visaggio, 1995; Dunn and Knight, 1993; Liu and Wilde, 1990; Livadas and Johnson, 1994; Muller et al., 1993; Yeh et al., 1995). The method used by Yeh et al. (1995) is based on constructing two kinds of graphs for C programs. The nodes of the first kind are the procedures and structure types, and the edges are the references by the procedures to the internal fields of the structures. The nodes of the second kind of graphs are the procedures and the external variables, and the edges are the references by the procedures to the variables. The candidate abstract data types and object instances are the sets of connected components in these graphs. Although their basic identification method is automatic, it can introduce some recognition pitfalls. To handle these problems, they introduce some semi-automatic techniques. Dunn and Knight (1993) have also presented an algorithm that exploits the representation of the subject program as a bipartite graph where nodes are either routines or global variables. Edges are directed from routines to global variables to specify the 'uses' relation. The algorithm performs a depth-first traversal of the graph looking for strongly connected components. Each of the resulting components is regarded as a candidate object. Equivalently, Liu and Wilde (1990) define, for each global variable x, the set P(x) of routines which directly reference it. A graph is then constructed by considering each P(x) as a node. An edge between two nodes P(x1) and P(x2) denotes that the intersection of the two sets P(x1) and P(x2) is not empty. A candidate object is identified by finding the strongly connected subgraphs. The use of first-order logic to express both the graph representing the system and the types of sub-graphs and patterns to be identified within it has also been proposed in the literature (Canfora et al., 1993a). This method has the advantage of being easy to prototype using a logic programming language. However, it produces results which are essentially equivalent to those offered by other graph-theoretic algorithms. In general, the objects identified by the above graph-theoretic approaches tend to contain spurious routines that are slightly related to the objects and require human effort and understanding to unravel. Thus, Canfora et al. (1996) proposed an improved identification method which uses a variable-reference graph that is essentially the same as the graph adopted by Dunn and Knight (1993). By exploiting simple statistical techniques, they enable a more precise identification of objects with less human intervention. Although the approach performs automatic identification of
the routines which introduce undesired connections, it does not identify the types of these connections. Furthermore, the clustering of such routines into objects is not supported. The prototype tool offers assistance only when it is desired to slice some of these routines. Graph-theoretic cluster analysis is also used by Cimitile and Visaggio (1995) in a different context. In order to identify functional abstractions in procedural code, they transform the system's call graph into a dominance graph and then interpret the dominance relationships of this graph as functional dependency relationships. They also propose a set of candidature criteria for the aggregation of the procedures into candidate modules.

It should also be mentioned that cluster analysis is used to solve the general problem of decomposing a software system into subsystems, in order to re-modularize it. A unified framework for expressing these re-modularization approaches is presented in Lakhotia (1997). Muller et al. (1993) present a semi-automatic graph-theoretic approach which uses partite graphs to construct subsystems based on their structural aspects. Schwanke (1991) uses hierarchical clustering and measures similarity using an information-sharing heuristic to identify groups of related procedures. To identify individual procedures which appear to be in the wrong module, 'maverick analysis' is used. Although this analysis helps in improving the modularization results, it requires tedious and demanding tuning of the similarity measure by an architect. That is why neural networks are used to semi-automate this tuning process. In order to produce good results, the aforementioned two approaches rely on intervention from an architect who understands the system. Hence, Mancoridis et al. (1998) present another modularization approach which overcomes this drawback. The approach makes use of traditional optimization algorithms such as hill climbing and genetic algorithms. It also shows how graph visualization tools (North and Koutsofios, 1994) can be used to visualize the clustering results. Although the clustering algorithms make use of the different relationships that exist among the modules of a software system, they do not take into account the different strengths of these relations. Other informal modularization work is based on using common sub-strings in file names (Anquetil and Lethbridge, 1998) and on using the concepts referred to in the comments and function names (Merlo et al., 1993). While Anquetil and Lethbridge (1998) use a combination of iterative and statistical algorithms, Merlo et al. (1993) use neural networks.

The object identification approach presented in this paper uses neural networks to perform clustering. The weights associated with each output node of the neural network correspond to a 'cluster centroid' or 'cluster prototype'. The objective of the clustering algorithms is to partition the routines of the software system so as to minimize intra-cluster distances while maximizing inter-cluster distances. This is performed by repeatedly modifying the weights of the network as well as the partitioning of the routines until stability is reached. Thus, the clustering algorithms can be classified as optimization algorithms. The distance measures used in the presented algorithms are either the Manhattan distance or the Euclidean distance (Mehrotra et al., 1997; Wiggerts, 1997).

3. Clustering neural networks
Clustering neural networks learn clusters in the input data without the need to be taught. That is, they perform unsupervised learning of clusters based on data correlations (i.e., similarity measures). The same input pattern is presented to the network several times, and a pattern may move from one cluster to another until the network stabilizes. In this paper, we consider Adaptive Resonance Theory (ART) networks and Kohonen's Self-Organizing Maps (SOM) (Jain et al., 1996; Mehrotra et al., 1997; Zurada, 1992). In order to demonstrate how the two neural architectures perform an unsupervised learning of clusters, we use the data shown in Table 1. This data is for a group of nine animals, each described by its own set of attributes (Knight, 1990). The group breaks down naturally into three clusters: mammals, reptiles, and birds.

Table 1. Animals data for unsupervised learning of clusters.

                has hair  has scales  has feathers  flies  lays eggs
  1. Dog            1          0            0          0        0
  2. Cat            1          0            0          0        0
  3. Bat            1          0            0          1        0
  4. Canary         0          0            1          1        1
  5. Robin          0          0            1          1        1
  6. Pigeon         0          0            1          1        1
  7. Snake          0          1            0          0        1
  8. Lizard         0          1            0          0        1
  9. Alligator      0          1            0          0        1

3.1. Adaptive resonance theory
Adaptive Resonance Theory (ART) models are neural networks that perform clustering. They allow the number of clusters to vary with problem size. The main feature of ART models is that they permit the user to control the degree of similarity between members of the same cluster by means of a user-defined constant called the vigilance parameter, P. The ART1 network only accepts binary (0/1) input vectors. As shown in figure 1, it uses two layers of neurons (or nodes) with feedforward connections (from input to output nodes) as well as feedback connections (from output to input nodes). The input layer contains as many nodes, n, as the size of the input vector. The output layer has a variable number of nodes, m, representing the number of clusters. The connections from the input layer to the output layer carry 'bottom-up' weights, B_{m x n}, and the connections from the output layer to the input layer carry 'top-down' weights, T_{n x m}.

Figure 1. An ART1 network with 4 input and 3 output neurons.

A high-level description of the ART1 clustering algorithm, which is adapted from (Mehrotra et al., 1997), is as follows:

ART1 clustering algorithm
Initialize the number of output nodes: m = 1;
Initialize the weights: b_{j,k} = 1/(1+n) and t_{k,j} = 1, for k = 1, ..., n; j = 1, ..., m;
while the network has not stabilized, do
  1. Select an input pattern x = (x_1, x_2, ..., x_n);
  2. Let the active set A contain all output nodes;
  3. Calculate y_j = Σ_k b_{j,k} x_k for each node j ∈ A;
  4. repeat
     - Let j* be a node in A such that y_{j*} = max(y_1, ..., y_m);
     - Compute s* = (s*_1, ..., s*_n), where s*_k = t_{k,j*} x_k;
     - if (Σ_k s*_k)/(Σ_k x_k) > P then associate x with node j* and update the weights; else remove j* from set A;
     until A is empty or x has been associated with some node j*;
  5. If A is empty, create a new output node whose weight vector coincides with the current input pattern x;
end-while.

In the above algorithm, the ART1 network is first initialized. Then, the input vectors are repeatedly presented to the network until it stabilizes. In a stable network, the weights no longer change and each input vector belongs to the same cluster in successive presentations. When a new input vector x is presented to the network, it is communicated to the output layer via the upward connections. At the output layer, y_j is calculated as in step 3. The output y_j represents the similarity between the weight vector b_j = (b_{j,1}, ..., b_{j,n}) and the input vector x, which is measured using the Manhattan (or Hamming) distance. In step 4, a competitive activation process occurs among nodes that are in the current "active" list A. The node with the highest y_j, namely j*, wins and the corresponding cluster is declared to match x best. Since the best match may not even be close enough to satisfy an externally chosen threshold, the final decision depends on a vigilance test. Using the top-down weights, the j*th node in the second layer produces the n-dimensional vector s*. The similarity between x and s* is compared with the given vigilance parameter P. If the proportion of 'on' bits in x that are also in s* exceeds the threshold P, with 0 < P < 1, then the match with
the j*th node is judged acceptable, and the weights (t_{1,j*}, ..., t_{n,j*}) and (b_{j*,1}, ..., b_{j*,n}) are modified to make s* resemble x to a greater extent, and computation proceeds with the next input pattern. It should be noted that large P values indicate a more strict similarity requirement than small P values. If the vigilance test fails, the j*th cluster is removed from the active set of output nodes, A, and does not participate in the remainder of the process of assigning x to an existing cluster. The process of determining the best match from the active set A is repeated until A is empty or until a best match has been found that satisfies the vigilance criterion. Step 5 states that if A is empty and no satisfactory match has been found, a new output node (cluster prototype) is created with a weight vector that matches the current input pattern x (Mehrotra et al., 1997). In ART models, 'resonance' refers to the process used to match a new input vector to one of the cluster prototypes stored in the network. The system is 'adaptive' in allowing for the addition of new cluster prototypes to the network. For more details on the clustering algorithm, refer to (Mehrotra et al., 1997; Zurada, 1992).

Figure 2 shows the results of applying the ART1 clustering algorithm to the animals data of Table 1. The input vectors correspond to the rows of Table 1. Starting with small P values (0.1 < P < 0.4) resulted in the successful identification of the mammals cluster. However, the network could not differentiate between reptiles and birds. Increasing the P values (0.4 < P < 0.9) resulted in identifying three clusters: mammals, reptiles, and birds. Figure 2 highlights these three clusters by using bold rectangles. High P values (0.6 < P < 0.9) further decomposed the mammals cluster based on whether they fly or not. Theoretically, there is an infinite number of possibilities for the P values and, correspondingly, for the clustering results. In practice, however, the possible clustering alternatives are limited because a range of P values can give the same clustering results.

Figure 2. Clustering results for the animals example.
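A compact Python rendering of the algorithm above may help; this is our sketch, with the fast-learning weight updates filled in from the standard ART1 formulation (the text above describes the update step only informally), run on the animals data of Table 1:

# ART1 sketch; the AND-based fast-learning update is an assumption taken
# from the standard ART1 literature, not spelled out in the text above.
def art1(patterns, vigilance, passes=5):
    prototypes = []                        # one binary prototype per cluster
    assignment = []
    for _ in range(passes):                # fixed passes stand in for the stability test
        assignment = []
        for x in patterns:
            active = list(range(len(prototypes)))
            chosen = None
            while active and chosen is None:
                # bottom-up competition: |prototype AND x| / (0.5 + |prototype|)
                j = max(active, key=lambda j: sum(a & b for a, b in zip(prototypes[j], x))
                                              / (0.5 + sum(prototypes[j])))
                s = [a & b for a, b in zip(prototypes[j], x)]
                if sum(s) / sum(x) > vigilance:      # vigilance test against P
                    prototypes[j] = s                # make the prototype resemble x
                    chosen = j
                else:
                    active.remove(j)
            if chosen is None:                       # no acceptable match: new cluster
                prototypes.append(list(x))
                chosen = len(prototypes) - 1
            assignment.append(chosen)
    return assignment

# Rows of Table 1: dog, cat, bat, canary, robin, pigeon, snake, lizard, alligator.
animals = [(1,0,0,0,0), (1,0,0,0,0), (1,0,0,1,0), (0,0,1,1,1), (0,0,1,1,1),
           (0,0,1,1,1), (0,1,0,0,1), (0,1,0,0,1), (0,1,0,0,1)]
print(art1(animals, vigilance=0.4))   # mammals, birds, and reptiles separate

Exactly which splits appear depends on the vigilance value and on the presentation order, mirroring the behavior reported for figure 2.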
3.2. Self-organizing maps
Kohonen's Self-Organizing Maps (SOM) have the property of topology preservation. In a topology-preserving mapping, nearby input patterns should activate nearby output units
on the map. Figure 3 shows the basic network architecture of Kohonen's SOM. It consists of a two-dimensional array of connected neurons. Each neuron is also connected to all n input nodes. The n-dimensional weight vector associated with the neuron at location j of the two-dimensional array is denoted w_j = (w_{j,1}, ..., w_{j,n}). SOM also defines a spatial neighborhood for each output node. The shape of this neighborhood can be square, rectangular or circular. During competitive learning, all weights associated with the winner and its neighboring nodes are updated (Jain et al., 1996; Mehrotra et al., 1997; Zurada, 1992).

Figure 3. An SOM network with a rectangular array of neurons.

The SOM clustering algorithm, which is adapted from (Mehrotra et al., 1997), is as follows:

SOM clustering algorithm
Select the network topology to determine which nodes are adjacent to which others;
Initialize weights to small random values;
Initialize the current neighborhood distance D(0) to a positive integer;
while the network has not stabilized, do
  1. Select an input pattern x = (x_1, x_2, ..., x_n);
  2. Calculate y_j = Σ_k (x_k - w_{j,k})^2 for each output node j;
  3. Select the output node, j*, with the minimum y_j value;
  4. Update the weights of all nodes within a topological distance of D(t) from j*;
  5. Increment t;
end-while.

Initially, the network weights are assigned random values. When an input vector x is presented to the network, the square of the Euclidean distance of x from the weight vector w_j associated with each output node is computed in step 2. In step 3, the output node, j*, with the minimum Euclidean distance is chosen as the winner of the competition. The weights of all nodes within a topological distance of D(t) from j*, where D(t) decreases
with time, are updated in step 4. D(t) refers to the length of the path connecting two nodes for the prespecified topology chosen for the network. During the learning process, the values of the weights change such that each weight vector moves towards the centroid of some subset of input patterns (Jain et al., 1996; Mehrotra et al., 1997). The design parameters include the dimensionality of the neuron array, the number of neurons in each dimension, the shape of the neighborhood, the learning rate, and the criterion used to determine whether the network has stabilized or not. With respect to the first parameter, we experimented with the two commonly used values: one- and two-dimensional arrays. In all the programs we analyzed, the two values gave comparable results. Thus, we focus on using one-dimensional arrays because they are more intuitive than two-dimensional ones. The second parameter, K, which denotes the number of nodes in the linear array, is varied to control the granularity of the resulting clusters. Small K values generate a small number of coarse-grained clusters and vice versa. The value of K should be larger than the maximum number of possible clusters for the problem but smaller than the number of input vectors. In order to simplify the analysis, we choose commonly reported values for the third and fourth parameters (Zurada, 1992). The shape of the neighborhood we use is circular and the learning rate is exponentially decaying. Finally, the total adjustments made to all neuron weights, during a complete presentation of the input vectors, are used to determine whether stability is reached or not. If these total adjustments are below a certain limit, the network is deemed stable. Although the reduction of this limit can improve the accuracy of the results, it slows the convergence. The clustering results reported in this paper are all obtained with a unity adjustments limit.

The results of applying the SOM clustering algorithm to the animals data of Table 1 are also shown in figure 2. In this small example, varying K yields the same clustering results as those of the ART1 network. It should be noted that K represents an upper limit on the resulting number of clusters. Making K greater than or equal to four in our example yields the same results because at most four clusters could be identified in the data of Table 1.
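In the same spirit, a one-dimensional SOM can be sketched in a few lines of Python (our illustration; the exponentially decaying learning rate and shrinking neighborhood are common textbook choices rather than the paper's exact settings):

# One-dimensional SOM sketch with K nodes in a linear array.
import math, random

def som_1d(patterns, k, epochs=200, seed=1):
    rng = random.Random(seed)
    n = len(patterns[0])
    w = [[0.1 * rng.random() for _ in range(n)] for _ in range(k)]  # small random weights

    def winner(x):
        # node with the minimum squared Euclidean distance to x (steps 2 and 3)
        return min(range(k), key=lambda j: sum((xi - wi) ** 2 for xi, wi in zip(x, w[j])))

    for t in range(epochs):
        alpha = 0.5 * math.exp(-3.0 * t / epochs)        # decaying learning rate
        radius = round((k // 2) * (1.0 - t / epochs))    # shrinking D(t)
        for x in patterns:
            j_star = winner(x)
            for j in range(k):
                if abs(j - j_star) <= radius:            # update the neighborhood of j*
                    w[j] = [wi + alpha * (xi - wi) for xi, wi in zip(x, w[j])]
    return [winner(x) for x in patterns]

# Rows of Table 1: dog, cat, bat, canary, robin, pigeon, snake, lizard, alligator.
animals = [(1,0,0,0,0), (1,0,0,0,0), (1,0,0,1,0), (0,0,1,1,1), (0,0,1,1,1),
           (0,0,1,1,1), (0,1,0,0,1), (0,1,0,0,1), (0,1,0,0,1)]
print(som_1d(animals, k=4))   # K only bounds the number of clusters from above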
4. Object identification via clustering neural networks
In this section, it is demonstrated how clustering neural networks can be used to identify objects in procedural programs. In general, a routine-attribute matrix similar to the one given in the animals example of Table 1 is formed. In this matrix, the rows correspond to the routines included in the system under consideration, while the columns correspond to the attributes of these routines. The entries of the matrix are either 1 or 0, depending on whether the routine has a given attribute or not. Each row of the matrix represents a single input to the neural network. By varying the parameters K and P, the neural networks' output gives multiple clustering possibilities. The related literature has adopted several strategies for choosing the attributes. Attributes used before include usage of common global variables (Canfora et al., 1993b, 1996; Dunn and Knight, 1993; Lindig and Snelting, 1997; Liu and Wilde, 1990; Sahraoui et al., 1997; Snelting, 1996; Yeh et al., 1995), data flow information (Hutchens and Basili, 1985), usage of user-defined data types in general (Canfora et al., 1993a), and usage of structure (record)
data types in particular (Siff and Reps, 1997; Yeh et al., 1995). Similar to the Siff and Reps (1997) approach, our approach is very flexible and general when it comes to the choice of attributes. Any set of attributes that may be useful in some instances can be used in our approach. In our examples and case studies, using the following attributes, either separately or jointly, yielded good clustering results:

• Usage of global variables. An attribute might be of the form 'uses global variable x'.
• Usage of structure (record) and enumeration data types. An attribute might be of the form 'uses fields of struct stack', 'has argument of type struct stack*', or 'return type is struct stack*'.
• Disjunction of attributes related to similar user-defined types or similar global variables. For instance, if T1 and T2 are two similar data types, the disjunction 'uses fields of T1 or uses fields of T2' can improve the object identification results (Siff and Reps, 1997).
• Usage of data files and/or usage of read/write statements. In some cases, such attributes identify the objects which interact with the user.

Since the neural networks can generate different clustering results at different parameter values, we form a clustering tree, similar to those shown in figures 2 and 5, to facilitate the visualization and analysis of clustering results. In this tree, the root node represents all routines in the program. Whenever the neural network generates partitions of an existing tree node, we create the corresponding sub-nodes which represent the resulting partitions. To further explain our clustering techniques and to facilitate the comparison with related object identification techniques, we use several examples adapted from Canfora et al. (1996) and Siff and Reps (1997). Despite the fact that our techniques apply to any procedural programming language, the examples in this paper are in C.

Figure 4 shows a specific C implementation of stacks and queues (Siff and Reps, 1997). Queues are represented by two stacks; one for the front and one for the back. Information is shifted from the front stack to the back stack when the back stack is empty. The queue functions make indirect use of the stack fields by calling the stack functions.

struct stack {int *base, *sp, size;};
struct queue {struct stack *front, *back;};

/* 1 */ struct stack *initStack(int sz)      {/* uses fields of struct stack */}
/* 2 */ struct queue *initQ( )               {/* uses fields of struct queue */}
/* 3 */ int isEmptyStack(struct stack* s)    {/* uses fields of struct stack */}
/* 4 */ int isEmptyQ(struct queue *q)        {/* uses fields of struct queue */}
/* 5 */ void push(struct stack* s, int i)    {/* uses fields of struct stack */}
/* 6 */ void enq(struct queue *q, int i)     {/* uses fields of struct queue */}
/* 7 */ int pop(struct stack* s)             {/* uses fields of struct stack */}
/* 8 */ int deq(struct queue *q)             {/* uses fields of struct queue */}

Figure 4. A sample C-like code for a stack and a queue (adapted from Siff and Reps, 1997).
Table 2. Attributes for the stack-queue example.

  A1: argument or return type is struct stack*
  A2: argument or return type is struct queue*
  A3: uses fields of struct stack
  A4: uses fields of struct queue

       A1  A2  A3  A4
  1     1   0   1   0
  2     0   1   0   1
  3     1   0   1   0
  4     0   1   0   1
  5     1   0   1   0
  6     0   1   0   1
  7     1   0   1   0
  8     0   1   0   1

Figure 5. The routine-attribute matrix and its clustering results for the stack-queue example. The clustering tree splits the root {1, ..., 8} into the two clusters {1, 3, 5, 7} and {2, 4, 6, 8} for K >= 2.
We would like to identify the two objects representing the two given abstract data types. Using the functions of figure 4 and the attributes of Table 2, we formed the routine-attribute matrix of figure 5. This matrix represents the input to the two clustering neural networks under consideration. We varied P between 0.1 and 0.9 with a step of 0.1 and gave K values that are greater than or equal to 2. ART1 and SOM gave the same clustering tree, which is depicted in figure 5. As shown by the two bold rectangles, the two abstract data types are correctly identified.
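The routine-attribute matrix itself is mechanical to build once the attributes are fixed; here is a small Python sketch of this step for the stack-queue example (our illustration, with the attribute names abbreviated):

# Building the routine-attribute matrix of figure 5 from figure 4.
# A1: arg/ret type is struct stack*, A2: arg/ret type is struct queue*,
# A3: uses fields of struct stack,   A4: uses fields of struct queue.
attributes = ["A1", "A2", "A3", "A4"]
routines = {
    1: {"A1", "A3"},  # initStack
    2: {"A2", "A4"},  # initQ
    3: {"A1", "A3"},  # isEmptyStack
    4: {"A2", "A4"},  # isEmptyQ
    5: {"A1", "A3"},  # push
    6: {"A2", "A4"},  # enq
    7: {"A1", "A3"},  # pop
    8: {"A2", "A4"},  # deq
}
matrix = [[int(a in routines[f]) for a in attributes] for f in sorted(routines)]
for f, row in zip(sorted(routines), matrix):
    print(f, row)
# Feeding these rows to ART1 or SOM separates {1, 3, 5, 7} from {2, 4, 6, 8}.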
4.1. Object identification in the presence of undesired links
The results generated by clustering neural networks, in the examples considered thus far, are similar to the results produced by many other techniques in the literature (see for example Siff and Reps, 1997; Yeh et al., 1995). To demonstrate the full power of clustering neural networks, we now consider their application to real-life systems. In such systems, there can be some routines which cause undesirable clustering. Canfora et al. (1996) describe two different types of undesired links: coincidental links and spurious links. A coincidental link results from a routine that actually includes implementations of several routines, each logically belonging to a different object. Spurious links are created by routines that access the supporting data structures of more than one object in order to implement system specific operations. Many object identification approaches do not yield good results when applied to examples that exhibit undesired links (see for example Dunn and Knight, 1993; Liu and Wilde, 1990; Livadas and Johnson, 1994; Yeh et al., 1995).
       A1  A2  A3  A4
  1     1   0   1   0
  2     0   1   0   1
  3     1   0   1   0
  4     0   1   1   1
  5     1   0   1   0
  6     0   1   1   1
  7     1   0   1   0
  8     0   1   0   1

Figure 6. The routine-attribute matrix and its clustering results after modifying the stack-queue example. The clustering tree splits the root {1, ..., 8} into {1, 3, 5, 7} and {2, 4, 6, 8} for P < 0.7; for 0.7 <= P <= 0.9 or K >= 4, the cluster {2, 4, 6, 8} is further divided into {2, 8} and {4, 6}.
As an example of spurious links, Siff and Reps (1997) consider the following modification of the stack-queue example given in figure 4.

/* 4 */ int isEmptyQ(struct queue *q)       {/* uses fields of struct stack and struct queue */}
/* 6 */ void enq(struct queue *q, int i)    {/* uses fields of struct stack and struct queue */}
Although such a modification may be more efficient, it causes some queue routines to access the supporting data structure of the stack routines. Figure 6 shows the routine-attribute matrix and the clustering tree after performing this modification. It is clear that the two abstract data types are still correctly identified. Because functions 4 and 6 are different from functions 2 and 8 with respect to the set of data they access, the queue abstract data type (functions 2, 4, 6, and 8) is divided into two corresponding partitions. That is, the clustering technique provides additional information about similarities among the functions of a selected cluster. Compared to the concept analysis technique presented by Siff and Reps (1997), we do not have to add a complementary attribute of the form 'does not use fields of struct queue' to correctly identify the two abstract data types.

In order to discuss the effect of both spurious and coincidental links, Canfora et al. (1996) use the example of figure 7. This example gives a sample C-like code which uses a stack, a queue, and a list. The function global_init (function #20) is an example of a coincidental connection, while functions 14-19 exemplify spurious connections. In this example, we use six attributes corresponding to the six global variables defined in the code. Each attribute has the form 'uses global variable x'. For more details on this example and on its routine-attribute matrix, refer to (Canfora et al., 1996). The results of applying ART1 and SOM are shown in figures 8 and 9, respectively. We varied P between 0.1 and 0.9 with a step of 0.1 and gave K values that are greater than or equal to 2. Only P and K values which trigger a partitioning of an existing cluster are shown in these figures. ART1 succeeds in identifying the list (functions 10-13) and stack (functions 1-5) abstract data types. However, it is unsuccessful in separating the queue abstract data type (functions 6-9). On the other hand, SOM successfully identifies all three abstract data types. Additionally, SOM provides the information that functions 14-20 can be grouped into three clusters: (15, 16), (14, 18), and (17, 19, 20). The approach of Canfora et al. (1996) fails to automatically identify such a decomposition. Their tool only uses program slicing (Weiser, 1984) to assist in overcoming the coincidental connection introduced by routine number 20.
ELEM_T stack_struct[MAXDIM]; int stack_point;
ELEM_T queue_struct[MAXDIM]; int queue_head, queue_tail, queue_num_elem;
struct list_struct {ELEM_T node_content; struct list_struct *next_node;} list;

main( ) {/* this program exploits a stack, a queue, and a list of items of type ELEM_T */}

/*  1 */ void stack_push(el)    {/* uses stack_point and stack_struct */}
/*  2 */ ELEM_T stack_pop( )    {/* uses stack_point and stack_struct */}
/*  3 */ ELEM_T stack_top( )    {/* uses stack_point and stack_struct */}
/*  4 */ BOOL stack_empty( )    {/* uses stack_point */}
/*  5 */ BOOL stack_full( )     {/* uses stack_point */}
/*  6 */ - /* 13 */ [the queue and list routines]
/* 14 */   {/* uses stack_point, stack_struct and list */}
/* 15 */   {/* uses stack_point, stack_struct, queue_struct, queue_head and queue_num_elem */}
/* 16 */   {/* uses stack_point, stack_struct, queue_struct, queue_tail and queue_num_elem */}
/* 17 */   {/* uses queue_struct, queue_tail, queue_num_elem and list */}
/* 18 */   {/* uses stack_point, stack_struct and list */}
/* 19 */   {/* uses queue_struct, queue_head, queue_num_elem and list */}
/* 20 */   {/* uses stack_point, queue_head, queue_tail, queue_num_elem and list */}

Figure 7. A sample C-like code for a stack, a queue, and a list (adapted from Canfora et al., 1996).
Figure 8. ART1 clustering results for the stack-queue-list example.
Figure 9. SOM clustering results for the stack-queue-list example.
In summary, Table 3 presents a comparison with the closely related object identification approaches. The criteria we use in the comparison are the choice of attributes on which the identification is based, the identification algorithm, the ability to identify objects in the presence of undesired connections, the need for human intervention to perform such an identification, and the ability to produce a hierarchy of objects.
Table 3. A comparison with related object identification approaches. (The 'Identification' and 'Human effort' columns refer to the case where undesired connections exist.)

  Approach                       Attributes  Algorithm                                               Identification  Human effort  Hierarchical
  The presented neural approach  Flexible    Optimization clustering                                 Yes             No            Yes (tree)
  Canfora et al. (1996)          Fixed       Graph theoretic clustering with statistical adaptation  Yes             Yes           No
  Dunn and Knight (1993)         Fixed       Graph theoretic clustering                              No              -             No
  Liu and Wilde (1990)           Fixed       Graph theoretic clustering                              No              -             No
  Siff and Reps (1997)           Flexible    Concept analysis                                        Yes             Yes           Yes (lattice)
  Yeh et al. (1995)              Fixed       Graph theoretic clustering with informal adaptations    Yes             Yes           No
Table 4. Clustering tool performance.

  Name      Type                       Language  KLOC  Routines  Attributes  ART1 P  ART1 time (sec.)  SOM K  SOM time (sec.)
  ccount    Misc. counts for C files   C          0.8        17          10     0.9              0.00      6             0.00
  schedule  Schedule univ. courses     Pascal     1.5        39          23     0.9              0.00     21             1.98
  vh        Hypertext browser          C          4.5       103          29     0.9              0.00     45             1.05
  gdbm      Data base manager          C          6.8        69          22     0.1              0.00      9             0.93
                                                                                0.9              0.00     45             4.67
  elvis     Editor (a clone of vi/ex)  C         18.7       220          37     0.9              0.11     60            16.04

5. Implementation and evaluation
A prototype tool implementing the clustering algorithms presented in this paper has been developed. The tool accepts as its input a routine-attribute matrix, which is constructed using simple static analysis of the code. The user decides which neural network model to use and its corresponding parameter value. After applying the required clustering algorithm, the clustering results are provided using a simple text-based interface. A graphical display of the clustering results, as shown in this paper, would certainly be very advantageous.

In order to evaluate our approach, several case studies, which involve C and Pascal programs, have been carried out. The size of these programs ranges up to 19,000 lines of code. Table 4 presents the performance measurements of the clustering tool on five of these programs. The counting and scheduling programs are obtained from (Frakes et al., 1991) and (Jalote, 1991), respectively. The remaining three programs are public domain C programs obtained from the directory: ftp://ftp.ms.uky.edu/pub3/gnu. Execution times shown in Table 4 were collected by running the tool on a Pentium II-400 computer with 128 MB of RAM and running the Windows98 operating system. Execution times that are less than 0.01 second are given zero values. It should be noted that the tool execution time does not only depend on the number of routines and attributes in the software system but also depends on the specific relations existing between them. Thus, the binary content of the routine-attribute matrix affects the speed with which the algorithm reaches stability. That is why the execution time of the scheduling program, for instance, is larger than that of the hypertext browser, despite the fact that the routine-attribute matrix of the browser is larger. In all of the considered programs and case studies, the clustering results of the SOM architecture are either comparable to or slightly better than the corresponding ones of the ART1 architecture. However, the execution times of the SOM architecture are larger. Because the clustering algorithms are computationally simple, the execution times are, in general, very short. The largest execution time in Table 4 is only 16.04 seconds for an 18.7 KLOC program with a 220 x 37 routine-attribute matrix. From the computational point of
view, this implies that the clustering tool can handle large real-life systems. Nevertheless, further experimentation is needed in such cases to evaluate the quality of the clustering results. Evaluation of the clustering results in the considered case studies is performed by manual inspection of the code. Because it is difficult to manually inspect the code of large software systems, input from a software architect who is familiar with the system would be needed to evaluate the clustering results of such systems. In the following two subsections we focus on the description of two representative case studies. The first case study applies the approach to a small program, the counting program. This serves to assess the accuracy of the approach and to compare its proposed clustering results with the intended clustering of the program designer. To evaluate the effectiveness of the approach when dealing with medium-size programs, the second case study applies the approach to the database management program. Although the examples presented so far focus on the identification of abstract data types, these case studies demonstrate that our approach is also appropriate for the identification of other groups of routines which reference a common set of data, e.g., object instances (Yeh et al., 1995).
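The paper describes the two network models only at this level of detail. To make the clustering step concrete, the following is a minimal sketch (ours, not the author's implementation) of ART1-style clustering of a binary routine-attribute matrix, assuming the standard ART1 choice function and vigilance test; all names and the toy data are illustrative.

```python
import numpy as np

def art1_cluster(matrix, rho=0.9, beta=1.0):
    """Cluster the rows of a binary routine-attribute matrix, ART1-style.

    matrix : (n_routines, n_attributes) array of 0/1 values.
    rho    : vigilance parameter; higher values demand more similarity
             within a cluster, so more clusters are created.
    Returns a list assigning each routine to a cluster index.
    """
    prototypes = []            # one binary prototype vector per cluster
    assignment = []
    for row in matrix.astype(bool):
        placed = False
        # Rank existing clusters by the choice function |I AND w| / (beta + |w|).
        scores = [((row & w).sum() / (beta + w.sum()), j)
                  for j, w in enumerate(prototypes)]
        for _, j in sorted(scores, reverse=True):
            w = prototypes[j]
            match = (row & w).sum() / max(row.sum(), 1)
            if match >= rho:   # vigilance test passed: accept and learn
                prototypes[j] = row & w
                assignment.append(j)
                placed = True
                break
        if not placed:         # no cluster is close enough: start a new one
            prototypes.append(row.copy())
            assignment.append(len(prototypes) - 1)
    return assignment

# Toy routine-attribute matrix: rows are routines, columns are attributes.
m = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
print(art1_cluster(m, rho=0.9))   # -> [0, 0, 1, 1]
```

Raising ρ toward 1.0 splits the routines into more, tighter clusters, which is how the hierarchy of clustering possibilities described in the conclusions is obtained.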
5.1. First case study: A counting program
This program performs different counts for C source files (Frakes et al., 1991). It provides the number of commentary source lines, the number of non-commentary source lines, and the comment-to-code ratio for C source files. These counts are reported for each function, for lines external to functions, and for the source file as a whole. The program consists of 800 lines of C code. It has 17 functions, which are shown in Table 5. The designer of the program divided the program into 7 clusters. In Table 5, each of these clusters is enclosed between two dashed lines. The attributes we use for this program are given in Table 6. Because there is a small number of global and data type definitions in this program, we use a combination of all possible attribute categories. The attributes include the only structure defined in the program (count_struct). The three defined enumeration types are also included. We consider a disjunction of the two similar enumeration types, char_class and token_type. In addition, usage of two global definitions, data files, and read/write statements is taken into account. Because ART1 and SOM gave similar clustering results for this program, we only show the results of ART1 in figure 10. The designer's view of the program clusters is correctly identified for 5 clusters. These 5 clusters are drawn in bold rectangles. The first two designer-defined clusters were joined together in one cluster (functions 1-4). The reason for joining these four functions together is that they possess none of the considered set of attributes. That is, they represent a collection of a driver and miscellaneous utility routines. Functions 6-9 are the ones that parse lines of code and classify them. Functions 11-16 implement an abstract data type for lists of line counts. Functions 13 and 14 are more similar to each other than to the rest of the abstract data type functions because they do not use fields of the count_struct structure. Functions 5 and 10 are considered similar to this abstract data type because they have argument/return types of struct count_struct*. Function 17 generates the error messages.
Table 5. Functions for the counting program.

| # | Function | # | Function |
|---|---|---|---|
| 1 | main | 10 | report_metrics |
| 2 | check_options | 11 | create_node |
| 3 | clean_command_line | 12 | destroy_node |
| 4 | get_parameters | 13 | is_empty_list |
| 5 | count_lines | 14 | create_list |
| 6 | start_tokenizer | 15 | append_element |
| 7 | classify_line | 16 | delete_element |
| 8 | get_token | 17 | error |
| 9 | find_function_name |  |  |

(The dashed lines marking the designer-defined clusters are not reproduced in this reprint.)

Table 6. Attributes for the counting program.

| # | Attribute |
|---|---|
| A1 | argument/return type is struct count_struct * |
| A2 | uses fields of struct count_struct |
| A3 | argument/return type is token_type or char_class |
| A4 | uses elements of token_type or char_class |
| A5 | argument/return type is error_type |
| A6 | uses elements of error_type |
| A7 | uses max_line |
| A8 | uses max_ident |
| A9 | uses a file data type |
| A10 | uses read/write statements |

5.2. Second case study: A data base management system
This case study uses the GNU data base manager, GDBM, which is free software written by Nelson (1993). The software system consists of 6,760 lines of C code and it has 69 functions. The system is divided by the designer into 48 files: 9 ".h" files and 39 ".c" files. Most of the 39 ".c" files include single C functions. Only 13 files, out of 39, contain groups of C functions. In Table 7, the contents of each of these 13 files are enclosed between two dashed lines. To analyze the GDBM software, all global variables and user-defined structure and enumeration data types were used to form a list of 22 attributes. We only excluded one structure from the attribute list because it was defined to conveniently group pointers to dynamic variables as well as frequently used variables. Since read/write statements are used throughout the whole program, they were not considered in the attribute list.
Figure 10. ART1 clustering results for the counting program.
Because the SOM architecture gave slightly better overall results than the ART1 architecture, we only show the results of SOM in figure 11. Due to the large number of routines, we only form the clustering tree at two K values (9 and 45). The results, which are depicted in this figure, demonstrate that a graphical visualization and manipulation of the clustering results is required when dealing with large software systems. In the remainder of this section, we discuss the SOM results and, when necessary, point out how they differ from the ART1 results.
Table 7. Functions included in 13 files of the GDBM software.

| # | Function | # | Function | # | Function |
|---|---|---|---|---|---|
| 1 | find_stack_direction | 27 | push_avail_block | 51 | my_bcopy |
| 2 | alloca | 28 | get_elem | 52 | exchange |
| 3 | i00afunca | 29 | _gdbm_put_av_elem | 53 | _getopt_internal |
| 4 | i00afuncb | 30 | get_block | 54 | getopt |
| 5 | _gdbm_new_bucket | 31 | adjust_bucket_avail | 55 | main |
| 6 | _gdbm_get_bucket | 33 | _gdbm_read_entry | 57 | first_key |
| 7 | _gdbm_split_bucket | 34 | _gdbm_findkey | 58 | next_key |
| 8 | _gdbm_write_bucket | 40 | gdbm_open | 61 | print_bucket |
| 10 | main | 41 | gdbm_init_cache | 62 | _gdbm_print_avail_list |
| 11 | usage | 42 | rename | 63 | _gdbm_print_bucket_cache |
| 20 | dbm_firstkey | 43 | gdbm_reorganize | 64 | usage |
| 21 | dbm_nextkey | 44 | get_next_key | 65 | main |
| 24 | _gdbm_alloc | 45 | gdbm_firstkey | 67 | write_header |
| 25 | _gdbm_free | 46 | gdbm_nextkey | 68 | _gdbm_end_update |
| 26 | pop_avail_block | 50 | my_index | 69 | _gdbm_fatal |

(The dashed lines marking the 13 designer-defined files are not reproduced in this reprint.)
Figure 11. SOM clustering results for the GDBM software.
To analyze the clustering results, we consider the following two questions:

1. Assuming that the 13 files in Table 7 represent the designer-defined clusters, how many of these clusters are correctly identified by the neural algorithms?
2. Do the designer-defined clusters represent the best way to decompose the system? If not, what kind of improvements are offered by the neural algorithms?

5.2.1. Identification of designer-defined clusters. The SOM architecture only identifies the two clusters (33, 34) and (67, 68). Function 69 is not included in the second cluster because it has no attributes. The ART1 architecture, on the other hand, identifies four designer-defined clusters.

5.2.2. Improvements offered by the clustering algorithms. Figure 11 shows, in double rectangles, one view of how to decompose the program into clusters. This view divides the 69 functions of the GDBM software into 18 clusters instead of the 39 ".c" files defined by the designer. Based on the required level of granularity, there can be several other decompositions. The following three points highlight the kind of improvement offered by the clustering algorithms.

1. Grouping utility and driver routines in a few clusters: Cluster C1 groups all the utility and driver routines of the software system in one cluster. The functions included in this cluster possess no attributes. Similarly, cluster C10 includes all the driver and utility functions which only use fields of the "datum" structure.

2. Identifying new clusters and improving on already existing ones: For example, consider the new clusters C2, C9, and C11. The names of the functions included in these clusters are given in Table 8. By inspecting these names, it becomes clear why they were clustered together in the shown manner. Consider also functions 61-65, which are grouped by the software designer in a program that tests the database routines and helps in debugging them. In our results, each of these functions belongs to a cluster that better matches it from the data manipulation point of view.
Table 8. Content of some SOM clusters for the GDBM software.

| Cluster | Function numbers | Function names |
|---|---|---|
| C2 | 23, 59 | delete, store |
| C6 | 16, 17 | dbm_init, dbm_open |
|  | 45, 46 | gdbm_firstkey, gdbm_nextkey |
|  | 47, 49 | gdbm_setopt, gdbm_sync |
| C9 | 36, 39, 48 | gdbm_delete, gdbm_fetch, gdbm_store |
| C11 | 15, 32 | dbm_fetch, fetch |
| C12 | 9 | dbm_close |
|  | 20, 21 | dbm_firstkey, dbm_nextkey |
|  | 57, 58 | first_key, next_key |
3. Improving the identification of abstract data types in two different ways:

• Separating the routines which define an abstract data type from the routines that use it. Consider, for example, the file space management routines, 24-31, which are grouped by the software designer in one file. Our clustering results divide them into three separate clusters: C3, C14, and C15. These three suggested clusters separate the file space management routines (C14) from the two abstract data types (C3 and C15) they utilize. Cluster C14 includes the three functions (24, 25, and 31) which allocate space, free space, and make sure that the current space is close to half full, respectively. Cluster C15 includes four functions (26, 27, 30, and 62) which define an abstract data type for an available block of data. The four functions pop, push, search for, and print an available block of data, respectively. Cluster C3 defines an abstract data type for an available element within the available block of data. The two functions of the cluster (28 and 29) search for and insert an element in the block, respectively.

• Including all the routines that define an abstract data type. For instance, the software designer defines an abstract data type for a bucket, which is a small hash table, by grouping functions 5-8 in one file. In the SOM clustering results, the four functions are included in two clusters: C17 and C18. The two clusters also include three other functions: 35, 61, and 63. Function 35 frees all memory associated with a bucket cache, function 61 prints the bucket, and function 63 prints the bucket cache. To correctly define the abstract data type for buckets, we think that clusters C17 and C18 should be combined together. That is, this improvement is partially offered by the SOM results. In the ART1 clustering results, the four functions are scattered in different clusters.

Finally, it should be mentioned that the ART1 architecture could not identify clusters C5 and C9. As depicted in Table 8, we also think that each of the clusters C6 and C12 would better be divided into three sub-clusters. This division is clear from the function names, and it is suggested by both the ART1 results and the software designer.
6. Conclusions
An approach for identifying objects in procedural programs has been presented. This approach is based on clustering neural networks. It is very flexible and general when it comes to the choice of the attributes on which the identification is based. It is also capable of identifying objects in the presence of undesired links with no human intervention. By controlling the design parameters of the two considered neural architectures, we automatically obtain a hierarchy of clustering possibilities. The design parameter of ART1, ρ, controls the degree of similarity between elements of the same cluster. The number of clusters necessary to achieve this similarity requirement is automatically determined by the network. On the other hand, the design parameter of SOM, K, represents an upper limit on the required number of clusters. That is, the user identifies the appropriate number of clusters for a specific problem by gradually increasing K. With respect to the clustering results, the two neural architectures were successful in identifying abstract data types as well as groups of routines which reference a common set of data. However, the examples and case studies showed that although the execution times of the SOM architecture are larger than those of the ART1 architecture, the SOM clustering results are slightly better. While SOM succeeded in identifying objects in the presence of undesired links, ART1 was only partially successful. In the second case study, ART1 and SOM gave complementary results. However, the overall identification results of the SOM architecture were better. Because the presented clustering approach assists in the identification of abstract data types and groups of routines which reference a common set of data, it is convenient for re-engineering procedural programs into object-oriented ones. Future work includes experimenting with the object identification approach on software systems that are larger than the ones considered so far. In such cases, a user interface that allows graphical visualization of the analysis results becomes essential. Since some of the visualized graphs may become excessively big for medium/large size systems, it might also be necessary to navigate inside large graphs (see for example Antoniol et al., 1997; North and Koutsofios, 1994).

References

Abd-El-Hafiz, S.K. 1997. Effects of decomposition techniques on knowledge-based program understanding. In Proceedings of the International Conference on Software Maintenance, Bari, Italy, pp. 21-30.
Abd-El-Hafiz, S.K. and Basili, V.R. 1996. A knowledge-based approach to the analysis of loops. IEEE Trans. on Software Engineering, 22(5):339-360.
Abd-El-Hafiz, S.K., Basili, V.R., and Caldiera, G. 1991. Towards automated support for extraction of reusable components. In Proceedings of the Conference on Software Maintenance, Sorrento, Italy, pp. 212-219.
Achee, B.L. and Carver, D.L. 1994. A greedy approach to object identification in imperative code. In Proceedings of the Third Workshop on Program Comprehension, pp. 4-11.
Anquetil, N. and Lethbridge, T. 1998. Extracting concepts from file names: a new file clustering criterion. In Proceedings of the International Conference on Software Engineering, Kyoto, Japan.
Antoniol, G., Fiutem, R., Lutteri, G., and Merlo, E. 1997. Program understanding and maintenance with the CANTO environment. In Proceedings of the International Conference on Software Maintenance, Bari, Italy, pp. 72-81.
Canfora, G., Cimitile, A., and Munro, M. 1993a. A reverse engineering method for identifying reusable abstract data types. In Proceedings of the First Working Conference on Reverse Engineering, Baltimore, Maryland, pp. 73-82.
Canfora, G., Cimitile, A., and Munro, M. 1996. An improved algorithm for identifying objects in code. Software Practice and Experience, 26(1):25-48.
Canfora, G., Cimitile, A., Munro, M., and Taylor, C.J. 1993b. Extracting abstract data types from C programs: A case study. In Proceedings of the International Conference on Software Maintenance, Montreal, Quebec, Canada, pp. 200-209.
Cimitile, A. and Visaggio, G. 1995. Software salvaging and the call dominance tree. The Journal of Systems and Software, 28(2):117-127.
Dekker, R. and Ververs, F. 1994. Abstract data structure recognition. In Proceedings of the Ninth Knowledge-Based Software Engineering Conference, pp. 133-140.
Dunn, M.F. and Knight, J.C. 1993. Automating the detection of reusable parts in existing software. In Proceedings of the 15th International Conference on Software Engineering, Baltimore, Maryland, pp. 381-390.
Frakes, W.B., Fox, C.J., and Nejmeh, B.A. 1991. Software Engineering in the UNIX Environment. Prentice Hall.
Hutchens, D.H. and Basili, V.R. 1985. System structure analysis: Clustering with data binding. IEEE Transactions on Software Engineering, SE-11(8):749-757.
Ibba, R., Natale, D., Benedusi, P., and Naddei, R. 1993. Structure-based clustering of components for software reuse. In Proceedings of the International Conference on Software Maintenance, Montreal, Quebec, Canada, pp. 210-215.
Jain, A.K., Mao, J., and Mohiuddin, K.M. 1996. Artificial neural networks: A tutorial. IEEE Computer, 29(3):31-44.
Jalote, P. 1991. An Integrated Approach to Software Engineering. Springer-Verlag.
Knight, K. 1990. Connectionist ideas and algorithms. Communications of the ACM, 33(11):59-74.
Kunz, T. 1996. Evaluating process clusters to support automatic program understanding. In Proceedings of the Fourth Workshop on Program Comprehension, pp. 198-207.
Lakhotia, A. 1997. A unified framework for expressing software subsystem classification techniques. Journal of Systems and Software, 36:211-231.
Lindig, C. and Snelting, G. 1997. Assessing modular structure of legacy code based on mathematical concept analysis. In Proceedings of the 19th International Conference on Software Engineering, pp. 349-359.
Liu, S. and Wilde, N. 1990. Identifying objects in a conventional procedural language: An example of data design recovery. In Proceedings of the Conference on Software Maintenance, San Diego, California, pp. 266-271.
Livadas, P.E. and Johnson, T. 1994. A new approach to finding objects in programs. Software Maintenance: Research and Practice, 6:249-260.
Mancoridis, S., Mitchell, B.S., Rorres, C., Chen, Y., and Gansner, E.R. 1998. Using automatic clustering to produce high-level system organizations of source code. In Proceedings of the Sixth International Workshop on Program Comprehension, Ischia, Italy.
McFall, D. and Sleith, G. 1993. Reverse engineering structured code to an object oriented representation. In Proceedings of the Fifth International Conference on Software Engineering and Knowledge Engineering, pp. 86-93.
Mehrotra, K., Mohan, C.K., and Ranka, S. 1997. Elements of Artificial Neural Networks. The MIT Press.
Merlo, E., McAdam, I., and De Mori, R. 1993. Source code informal information analysis using connectionist models. In International Joint Conference on Artificial Intelligence, vol. 2, Los Altos, CA, pp. 1339-1344.
Muller, H.A., Orgun, M.A., Tilley, S.R., and Uhl, J.S. 1993. A reverse engineering approach to subsystem structure identification. Software Maintenance: Research and Practice, 5(4):181-204.
Nelson, P.A. 1993. GDBM, the GNU Data Base Manager. Cambridge, MA: Free Software Foundation.
Newcomb, P. and Kotik, G. 1995. Reengineering procedural into object-oriented systems. In Proceedings of the Second Working Conference on Reverse Engineering, Toronto, Ontario, Canada, pp. 237-249.
North, S. and Koutsofios, E. 1994. Applications of graph visualization. In Proceedings of Graphics Interface, Banff, Alberta, pp. 235-245.
Sahraoui, H.A., Melo, W., Lounis, H., and Dumont, F. 1997. Applying concept formation methods to object identification in procedural code. Technical Report CRIM-97/05-77, CRIM.
Schwanke, R.W. 1991. An intelligent tool for re-engineering software modularity. In Proceedings of the Thirteenth IEEE International Conference on Software Engineering, Austin, Texas, pp. 83-92.
Siff, M. and Reps, T. 1997. Identifying modules via concept analysis. In Proceedings of the International Conference on Software Maintenance, Bari, Italy, pp. 170-179.
Snelting, G. 1996. Reengineering of configurations based on mathematical concept analysis. ACM Transactions on Software Engineering and Methodology, 5(2):146-189.
Weiser, M. 1984. Program slicing. IEEE Trans. on Software Engineering, SE-10(4):352-357.
Wiggerts, T.A. 1997. Using clustering algorithms in legacy systems remodularization. In Proceedings of the Working Conference on Reverse Engineering, Amsterdam, Holland, pp. 33-43.
Yeh, A., Harris, D.R., and Reubenstein, H.B. 1995. Recovering abstract data types and object instances from a conventional procedural language. In Proceedings of the Second Working Conference on Reverse Engineering, Toronto, Ontario, Canada, pp. 227-236.
Zurada, J. 1992. Introduction to Artificial Neural Systems. West Publishing Company.
BAYESIAN-LEARNING BASED GUIDELINES TO DETERMINE EQUIVALENT MUTANTS

AURI MARCELO RIZZO VINCENZI*, ELISA YUMI NAKAGAWA*, JOSÉ CARLOS MALDONADO*, MÁRCIO EDUARDO DELAMARO† and ROSELI APARECIDA FRANCELIN ROMERO*

* Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, Av. do Trabalhador Sancarlense, 400, Cx. Postal 668, CEP 13560-970, São Carlos, SP, Brazil
† Faculdade de Informática, Fundação Eurípedes Soares da Rocha, Av. Hygino Muzzy Filho, 529, CEP 17525-901, Marília, SP, Brazil
[email protected], [email protected], [email protected], [email protected], [email protected]

Mutation testing (Mutation Analysis), although powerful in revealing faults, is considered a computationally expensive criterion, due to the high number of mutants created and the effort required to determine the equivalent mutants. Using mutation-based alternative testing criteria it is possible to reduce the number of mutants, but it is still necessary to determine the equivalent ones. In this paper Bayesian Learning (one of the Artificial Intelligence techniques used in machine learning) is investigated to define the Bayesian Learning-Based Equivalent Detection Technique (BaLBEDeT), which provides guidelines to help the tester analyze the live mutants in order to determine the equivalent ones.

Keywords: Mutation testing; program equivalence analysis; Bayesian learning.

* Corresponding author.
1. Introduction

Software testing is one of the most relevant activities used to guarantee the quality and the reliability of the software under development. The success of the testing activity depends on the quality of the test set. Testing criteria have been defined and investigated to help the tester generate and evaluate test sets. In the last two decades, data-flow and mutation based testing criteria have been intensively investigated [11, 5, 20, 7, 2]. The focus of this paper is mutation testing. Mutation testing requires the development of a test set T that reveals the presence of a well-specified set of faults [5]. The faults are modeled by a set of mutant operators which, when applied to a program P under test, generate syntactically correct programs called mutants. The
quality of T is measured by its ability to distinguish the behavior of the mutants from the behavior of the original program. One problem related to mutation testing is the large number of mutants that need to be compiled and executed. In addition, a tester needs to examine many mutants and analyze them for possible equivalence with respect to (w.r.t.) the original program. For these reasons, mutation testing is considered too expensive. Despite the high cost of mutation testing, some empirical studies provided evidence that it is effective at detecting faults [24, 16]. To reduce the number of mutants generated, some approaches have been investigated [19, 16, 13, 2], but even using these alternative approaches the equivalent mutants must be determined to obtain an adequate test set. The automation of equivalent mutant determination has been pursued by many researchers [14, 15, 17, 18, 10, 9]. This work aims at providing guidelines to ease the analysis of the live mutants. Each mutant operator has specific characteristics, i.e., one mutant operator may generate more equivalent mutants than another one. Based on these characteristics and historical information previously collected [2, 22], artificial intelligence techniques can be used to guide the analysis of the live mutants, aiming at reducing the effort to determine the equivalent ones. This paper presents a case study using Bayesian Learning algorithms, which provide probabilistic information used to estimate the number of equivalent mutants. Sec. 2 presents the background on mutation testing and Bayesian Learning. Sec. 3 presents the related work. Sec. 4 describes the experimental methodology, data collection, and the results obtained. Sec. 5 describes a case study illustrating the application of the technique. Sec. 6 presents the conclusions and future work.

2. Background

In this section, the main concepts related to mutation testing [5, 6] and Bayesian Learning [21, 12], necessary for the understanding of this paper, are explained.

2.1. An overview of mutation testing
Mutation testing provides the tester with a systematic way to generate test cases as well as to evaluate how "good" a test set is. The idea is to produce from the program under test a set of other possible implementations, the mutants, containing simple syntactic changes that are modelled by the mutant operators.ᵃ In fact, mutant operators can be seen as the implementation of a fault model that represents the common errors committed during software development.

ᵃ It may happen that on applying a given operator op to a program P no mutant is generated. This occurs when P does not contain any structure in the domain of the syntactic changes imposed by op. For example, consider the SWDD operator for the C language that replaces the while command with the do-while command. If P does not have any while commands, the application of SWDD would not generate any mutant.
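For illustration only (the paper's operators target C, and the example below is ours, not from the paper), here is a relational-operator mutant of a small Python function. It also shows why equivalence is the hard case: this particular mutant cannot be killed by any test.

```python
def max_of(a, b):
    """Original program under test: returns the larger of two numbers."""
    if a > b:
        return a
    return b

def max_of_mutant(a, b):
    """Mutant: the relational operator '>' was replaced by '>='."""
    if a >= b:
        return a
    return b

# A test case "kills" the mutant if the two versions disagree on it.
# Here no input can distinguish them (when a == b both return the same
# value), so this mutant is *equivalent* -- precisely the kind of live
# mutant whose analysis this paper tries to cheapen.
for a, b in [(1, 2), (2, 1), (2, 2)]:
    assert max_of(a, b) == max_of_mutant(a, b)
print("mutant survived all tests")
```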
The goal of mutation testing is to encourage the tester to find test cases that make the mutants behave differently from the original program, thereby distinguishing the mutants. Such distinguished mutants are said to be "dead". The "live" mutants are those that behave as the original program for all the test cases in the test set. This may occur either because the mutant is equivalent to the original program or because the test set is not good enough to distinguish the mutant. In the former case, the mutants can be dismissed. In the latter, the test set should be improved. The mutation score, the ratio of the number of dead mutants to the number of non-equivalent mutants, provides the tester with a mechanism to assess the quality of the testing activity. When the mutation score reaches 1.00, it is said that the test set T is adequate w.r.t. mutation testing (MT-adequate) to test the program P, increasing the confidence in the program under test. Some alternative approaches have been investigated to deal with the cost aspects: Randomly Selected Mutation, Constrained Mutation and Selective Mutation [19, 16, 13, 2]. The goal is to determine a subset of mutations in such a way that if a test set T is obtained which is able to distinguish those mutations, then T would also distinguish the complete set of mutations.

2.2. Bayesian learning

Bayesian reasoning provides a probabilistic approach to inference. It is based on the assumption that quantities of interest are governed by probability distributions and that optimal decisions are taken by reasoning about these probabilities together with observed data. Bayesian reasoning also provides the basis for learning algorithms that directly manipulate probabilities, as well as a framework for analyzing the operations of other algorithms that do not explicitly manipulate probabilities. Bayesian learning algorithms calculate explicit probabilities for hypotheses, and are among the most practical approaches to certain types of learning problems [12].

2.2.1. Bayes theorem

The Bayes theorem is the basis for all Bayesian learning algorithms. It is given by Eq. (1):
P(h|D) = P(D|h)P(h) / P(D)    (1)
where:

• P(h) denotes the initial probability that a hypothesis h holds before we observe the training data. P(h) is often called the prior probability of h and may reflect any background knowledge we have about the chance that h is a correct hypothesis;
• P(D) denotes the prior probability that training data D will be observed;
• P(D|h) denotes the probability of observing data D given some world in which hypothesis h holds. More generally, we write P(x|y) to denote the probability of x given y; and
• P(h|D) is called the posterior probability of h, because it reflects our confidence that h holds after we have seen the training data D.

Note that the term P(D) is a constant independent of h and can be dropped, as shown in Eq. (2):

P(h|D) ∝ P(D|h)P(h)    (2)

In many learning scenarios, the learner considers a set of candidate hypotheses H and is interested in finding the most probable hypothesis h ∈ H, given the observed data D. Any such maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis. The Brute-Force MAP Learning Algorithm, which is used in this paper, is based on the MAP. The two steps of this algorithm are [12]:

1. For each hypothesis h in H, calculate the posterior probability using Eq. (2); and
2. Output the hypothesis h_MAP with the highest posterior probability:

h_MAP = argmax_{h ∈ H} P(h|D)    (3)
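As a concrete illustration of the Brute-Force MAP algorithm, here is a minimal sketch; the priors and likelihoods are invented toy values, not numbers from the paper.

```python
def brute_force_map(hypotheses, prior, likelihood, data):
    """Return the MAP hypothesis: argmax over h of P(D|h) * P(h).

    hypotheses : iterable of candidate hypotheses H
    prior      : function h -> P(h)
    likelihood : function (data, h) -> P(D|h)
    """
    # Unnormalized posteriors; P(D) is constant in h, so it can be dropped.
    posterior = {h: likelihood(data, h) * prior(h) for h in hypotheses}
    return max(posterior, key=posterior.get)

# Toy example: is a live mutant equivalent ('e') or not ('not-e')?
prior = {"e": 0.3, "not-e": 0.7}.get
def likelihood(data, h):
    # data: number of test cases executed without killing the mutant;
    # assume each test survives with prob. 1.0 if equivalent, 0.9 otherwise.
    return 1.0 if h == "e" else 0.9 ** data

print(brute_force_map(["e", "not-e"], prior, likelihood, data=20))  # -> 'e'
```

Note how the MAP choice flips toward "equivalent" as more test cases fail to kill the mutant, which is the intuition behind the guidelines developed below.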
In Sec. 5 the Brute-Force MAP Learning Algorithm is used for predicting the percentage of equivalent mutants for each mutant operator. More information about Bayes theorem and the Brute-Force MAP Learning Algorithm can be found in [12].

3. Related Work

Empirical studies have provided evidence that mutation testing is among the most promising criteria in terms of fault detection [24, 16]. However, mutation testing often imposes unacceptable demands on computing and human resources because of the large number of mutants that need to be executed and that need to be analyzed for possible equivalence with respect to the original program. Randomly Selected Mutation, Constrained Mutation and Selective Mutation [19, 16, 13, 2] are alternatives to reduce the number of mutants generated, but if we want to obtain an adequate test set, equivalent mutants must still be determined. Offutt et al. [14, 6, 15, 17, 18] have considered the problems of test data generation and equivalent mutant detection, using both constraint-based techniques and compiler optimizations. The idea explored by Offutt and Craft [14] was to implement a set of compiler-optimization heuristics and evaluate them. This approach consists of looking at the mutants which, compared to the original program, implement traditional "peep-hole" compiler optimizations [1]. Compiler optimizations
are designed to create faster, but equivalent, programs, so that a mutant which implements a compiler optimization is, by definition, an equivalent mutant. The set of implemented heuristics they used was able to detect about 10% of the equivalent mutants. DeMillo and Offutt [6], using the concept of constraint, developed an automatic way to generate test cases. Their idea is that by solving a set of constraints it is possible to generate a test case that kills any given mutant. Even if the constraints are not completely satisfied, the set of constraints is also useful to determine equivalent mutants. Empirical studies showed that the approach could achieve a detection rate of equivalent mutants of about 50% [17, 18]. Hierons et al. [10] show how amorphous slicing [8] can be used to assist the human analysis of live mutants, rather than as a way of automatically determining the equivalent ones. Another study, developed by Harman et al. [9], shows the relationship between program dependence and mutation testing. The idea of the authors is to combine dependence analysis tools with existing mutation testing tools, supporting the test data generation and the determination of equivalent mutants. The authors also proposed a new mutation testing process which starts and ends with dependence analysis phases. The pre-analysis phase removes a class of equivalent mutants from further analysis, while the post-analysis phase is used to reduce the human effort to study the few mutants that evade the automated phases of the process [9]. This paper aims at reducing the effort needed to analyze the live mutants instead of providing a way to automatically detect the equivalent ones. The idea presented here is to provide guidelines to ease the determination of equivalent mutants and also the identification of non-equivalent ones, which is useful to improve the test set. Based on historical data collected in previous experiments [2, 22], our approach, named Bayesian Learning-Based Equivalent Detection Technique (BaLBEDeT), uses the Brute-Force algorithm to estimate which is the most promising group of mutants that should be analyzed. In the next sections we present the experiment description and the case study that was carried out.

4. Experiment Description

The methodology used in the experiment comprises four steps:
• Program Selection;
• Tool Selection;
• Test Set Generation; and
• Results and Data Analysis.

Details of these steps can be found in Secs. 4.1 to 4.4.
4.1. Program selection

Five UNIX C programs, called the 5-UNIX program suite, are used: Cal, Checkeq, Comm, Look and Uniq. Although they are simple programs (about 100 LOC each), our intention is to evaluate the applicability of Bayesian Learning in this context. We will then investigate the applicability of Bayesian Learning to larger programs and in other domains.

4.2. Tool selection

To support the application of Mutation Analysis, Proteum [3] was used. This tool was developed at the Instituto de Ciências Matemáticas e de Computação da Universidade de São Paulo and at the Departamento de Informática da Universidade Estadual de Maringá, Brazil. Some facilities that ease the carrying out of empirical studies are provided, such as:

• Test case handling: execution, inclusion/exclusion and enabling/disabling of test cases;
• Mutant handling: creation, selection, execution, and analysis of mutants; and
• Adequacy analysis: mutation score and statistical reports.

With these characteristics, different combinations of test sets can be evaluated against different groups of mutants in the same test session. In this way, alternative approaches to apply mutation testing, such as Randomly Selected Mutation, Constrained Mutation and Selective Mutation, are easily applied. Proteum supports the application of mutation testing at the unit level and implements 71 mutant operators to test C programs [3]. These operators are divided into 4 classes according to where the mutation is applied: Constants, Operators, Statements and Variables. A description of the unit operators is presented elsewhere [3]. To illustrate the cost aspect related to the number of mutants, the total number of mutants generated for the Mutation Analysis criterion is provided in Table 1, considering each mutation class. Note that even for small programs, Mutation Analysis can be very expensive, so the investigation of mechanisms to reduce its application cost and to ease the analysis of live mutants is worth pursuing. For example, the Cal and Comm programs both have 119 LOC but generate significantly different numbers of mutants: 4,332 and 1,728, respectively. Considering the Constants class, 1,780 mutants were generated for the Cal program, over 5 times the number generated for the Comm program. For the Operators and Variables classes, 1,409 and 791 mutants were generated for the Cal program, respectively, over 2 times the numbers generated for the Comm program: 642 and 367, respectively. The equivalent mutants for the 5-UNIX programs have already been determined in previous experiments [2, 22]. Table 2 shows the total number of equivalent mutants manually determined for each program, considering the four mutation classes. The percentage w.r.t. the total of equivalent ones per program is also provided.
Table 1. Total and percentage of mutants generated by each mutation class.

| Program | Constants # Mut. | % Total | Operators # Mut. | % Total | Statements # Mut. | % Total | Variables # Mut. | % Total | Total |
|---|---|---|---|---|---|---|---|---|---|
| Cal | 1,780 | 41.1 | 1,409 | 32.5 | 352 | 8.1 | 791 | 18.3 | 4,332 |
| Checkeq | 1,111 | 35.9 | 937 | 30.2 | 268 | 8.6 | 783 | 25.3 | 3,099 |
| Comm | 314 | 18.2 | 642 | 37.2 | 405 | 23.4 | 367 | 21.2 | 1,728 |
| Look | 371 | 18.1 | 720 | 35.0 | 319 | 15.5 | 646 | 31.4 | 2,056 |
| Uniq | 244 | 15.1 | 621 | 38.3 | 348 | 21.5 | 406 | 25.1 | 1,619 |
| Total | 3,820 | 29.8 | 4,329 | 33.7 | 1,692 | 13.2 | 2,993 | 23.3 | 12,834 |
Table 2. Total and percentage of equivalent mutants generated by each mutation class.

| Program | Constants # Mut. | % Total | Operators # Mut. | % Total | Statements # Mut. | % Total | Variables # Mut. | % Total | Total |
|---|---|---|---|---|---|---|---|---|---|
| Cal | 72 | 21.8 | 113 | 34.2 | 12 | 3.7 | 133 | 40.3 | 330 |
| Checkeq | 2 | 0.9 | 105 | 47.1 | 5 | 2.2 | 111 | 49.8 | 223 |
| Comm | 28 | 14.4 | 123 | 63.1 | 9 | 4.6 | 35 | 17.9 | 195 |
| Look | 46 | 17.9 | 111 | 43.2 | 26 | 10.1 | 74 | 28.8 | 257 |
| Uniq | 14 | 8.4 | 119 | 71.2 | 4 | 2.4 | 30 | 18.0 | 167 |
| Total | 162 | 13.8 | 571 | 48.7 | 56 | 4.8 | 383 | 32.7 | 1,172 |
4.3. Test set generation

For each one of the 5-UNIX programs a pool of 500 test cases was constructed. Each pool is composed of:

1. ad hoc test cases based on the program specification; and
2. random test cases taken from Wong's experiments (used to compare Mutation Analysis and Data-Flow based criteria [23]).

4.4. Results and data analysis

For each program, a test session was created using all the mutant operators, and the 500 test cases were executed against all non-equivalent mutants. Note that, although the 71 mutant operators were applied, 15 mutant operators generate no mutants for the 5-UNIX programs and, therefore, there is no statistical information about them.ᵇ Next, using the features available in Proteum to enable/disable test cases, 19 subsets of test cases were evaluated. The idea is to observe the variation in the number of equivalent and non-equivalent mutants as more and more test cases are executed.
ᵇ Operators that produce no mutants, considering the 5-UNIX programs: OBAA, OBBA, OBEA, OBSA, OSAA, OSAN, OSBA, OSBN, OSEA, OSLN, OSRN, OSSN, OSSA, SGLR and Vtrr.
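The adequacy measure behind these test sessions is the mutation score defined in Sec. 2.1. A minimal sketch (ours, using the Cal figures from Tables 1 and 2):

```python
def mutation_score(dead_mutants, total_mutants, equivalent_mutants):
    """ms(P, T): dead mutants divided by non-equivalent mutants."""
    return dead_mutants / (total_mutants - equivalent_mutants)

# Cal program (Tables 1 and 2): 4,332 mutants, 330 of them equivalent.
# A test set killing all 4,002 non-equivalent mutants is MT-adequate:
print(mutation_score(4002, 4332, 330))  # -> 1.0
```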
Table 3. Average data for the 5-UNIX programs: (a) empty test set, (b) test set with 20 elements, and (c) test set with 500 elements. (The body of Table 3 and the statements of Eqs. (7) and (8) are not legible in this reprint.)

Applying Eqs. (7) and (8) for all mutant operators it is possible to estimate the probability of each one being equivalent (e) or non-equivalent (ē) for different subsets of test cases. For instance, considering Table 3 and the operator OLBN we have:

• Table 3(a), empty test set: P_nor(e|OLBN) = 0.06 and P_nor(ē|OLBN) = 0.94
• Table 3(b), test set with 20 elements: P_nor(e|OLBN) = 0.72 and P_nor(ē|OLBN) = 0.28
• Table 3(c), test set with 500 elements: P_nor(e|OLBN) = 0.97 and P_nor(ē|OLBN) = 0.03

Note that after 20 test cases have been executed (Table 3(b)) the probability of a live mutant being equivalent increases. So, the more test cases that have been executed, the higher the confidence that a live mutant is equivalent. For the 19 subsets of test cases we have calculated the probability of each mutant operator being able to produce equivalent and non-equivalent mutants. The tester, while testing another program, may select the probabilities corresponding to the number of test cases that he/she has already used and then estimate the current number of equivalent and non-equivalent mutants. The next section presents an example of this technique.
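To make the estimation step concrete, here is a minimal sketch (ours, not the authors' tool) of how the tabulated probabilities would be applied to a new program's live mutants. The OLBN value is the one quoted above; the Cccr value is the rounded figure printed in Table 5, and OASA is included for contrast.

```python
# P_nor(e|op): probability that a live mutant of operator `op` is
# equivalent, after 20 executed test cases (values as printed/rounded).
p_equiv_20 = {"OLBN": 0.72, "Cccr": 0.40, "OASA": 0.00}

def estimate(live_counts, p_equiv):
    """For each operator, split its live mutants into estimated
    equivalent and estimated non-equivalent (i.e., killable) mutants."""
    report = {}
    for op, live in live_counts.items():
        eq = live * p_equiv[op]
        report[op] = {"est_equivalent": eq, "est_will_die": live - eq}
    return report

# e.g. 905 live Cccr mutants -> about 362 equivalent, 543 killable
# (the paper reports 359.55 and 545.45 using unrounded probabilities)
print(estimate({"Cccr": 905, "OLBN": 50, "OASA": 3}, p_equiv_20))
```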
Table 4. Total and percentage of generated and equivalent mutants for the Sort program.

|  | Constants # Mut. | % Total | Operators # Mut. | % Total | Statements # Mut. | % Total | Variables # Mut. | % Total | Total |
|---|---|---|---|---|---|---|---|---|---|
| Generated | 3,769 | 16.81 | 5,104 | 22.77 | 2,745 | 12.24 | 10,801 | 48.18 | 22,419 |
| Equivalent | 857 | 31.16 | 878 | 31.93 | 124 | 4.51 | 891 | 32.40 | 2,750 |
5. A Case Study: Sort Program

In this section, we consider that we have a program to be tested and would like to estimate the number of equivalent/non-equivalent mutants generated. The probabilistic information previously obtained from the 5-UNIX programs is used to estimate these quantities, and the estimated values are compared with the real ones. The Sort program is used in our example. This program has approximately 624 LOC and is used to classify records in one or more files. Applying the 71 mutant operators to the Sort program, 22,419 mutants were generated (2,750 of them are equivalent). Nine mutant operators (OSAA, OSAN, OSBA, OSBN, OSEA, OSLN, OSRN, OSSN and OSSA) generate no mutants and were not considered. Note that these operators are a subset of the 15 that generate no mutants in the 5-UNIX programs. Table 4 shows more detailed information about the number and percentage of mutants generated (first line) and equivalent ones (second line) for each mutation class. Next, supposing that all of the 22,419 mutants have been executed with 20 and 100 test cases, Table 5 and Table 6 show the real number of live and equivalent mutants that remain and the probability of each mutant operator producing non-equivalent and equivalent mutants. According to Table 5, 905 out of 1,208 mutants of Cccr are alive after executing 20 test cases: 374 out of 905 are non-equivalent and 531 out of 905 are equivalent mutants. With respect to the 5-UNIX programs, considering a test set with 20 elements, the probability of Cccr's live mutants being equivalent is 0.40 against 0.60 for being non-equivalent, i.e., of the 905 Cccr live mutants, 545.45 are estimated to be non-equivalent and 359.55 to be equivalent, which means an error rate of around 18.9%. The error rate means the discrepancy obtained from comparing the original data about equivalent and non-equivalent mutants against the BaLBEDeT technique. Observe that the estimated numbers of non-equivalent and equivalent mutants are sometimes very different from those found in the real data. On average, considering 20 test cases, the error in the estimation is around 19.3% and the standard deviation 17.4%. Considering 100 test cases, the error is around 23.2% and the standard deviation 16.9%. Table 7 shows the error rate obtained in the estimation of non-equivalent and equivalent mutants, considering 20 and 100 test cases. For 20 test cases (Table 7(a)), 17 mutant operators with an error rate between 0%-10% were identified, and 9 of these 17 operators are classified with an error rate lower than 5%. Most of the mutant operators were classified with an error rate higher than 10%.
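The reported error rate can be reproduced directly from a table row; a quick check of our reading of the Cccr row (the computation below is ours, not from the paper):

```python
# Cccr row of Table 5: 905 live mutants, of which 374 actually die and
# 531 are equivalent; the technique estimates 545.45 will die and 359.55
# are equivalent.  The error is the misestimated fraction of live mutants.
live, real_die, est_die = 905, 374, 545.45
error_rate = abs(est_die - real_die) / live
print(f"{error_rate:.1%}")   # -> 18.9%, matching the text
```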
Table 5. Sort program: real data vs. estimated data for 20 test cases.

| Oper | Total | Live | Will die | Equiv. | Prob. non-equiv. | Prob. equiv. | Will die (est.) | Equiv. (est.) | Error (%) |
|---|---|---|---|---|---|---|---|---|---|
| Cccr | 1,208 | 905 | 374 | 531 | 0.60 | 0.40 | 545.45 | 359.55 | 18.9 |
| Ccsr | 1,542 | 1,025 | 817 | 208 | 0.98 | 0.02 | 1,007.75 | 17.25 | 18.6 |
| CRCR | 1,019 | 655 | 537 | 118 | 0.94 | 0.06 | 616.10 | 38.90 | 12.1 |
| OARN | 282 | 141 | 117 | 24 | 0.91 | 0.09 | 128.98 | 12.02 | 8.5 |
| OASA | 8 | 3 | 3 | 0 | 1.00 | 0.00 | 3.00 | 0.00 | 0.0 |
| OCOR | 16 | 16 | 0 | 16 | 0.00 | 1.00 | 0.00 | 16.00 | 0.0 |
| SCRB | 21 | 17 | 8 | 9 | 0.00 | 1.00 | 0.00 | 17.00 | 47.1 |
| SSDL | 552 | 256 | 201 | 55 | 0.70 | 0.30 | 178.75 | 77.25 | 8.7 |
| STRP | 556 | 128 | 120 | 8 | 0.91 | 0.09 | 115.97 | 12.03 | 3.1 |
| (name illegible) | - | 546 | 199 | 347 | 0.14 | 0.86 | 76.34 | 469.66 | - |
| (name illegible) | - | 3,004 | 2,584 | 420 | 0.85 | 0.15 | 2,563.51 | 440.49 | - |
| (name illegible) | - | 384 | 310 | 74 | 0.73 | 0.27 | 278.88 | 105.12 | - |

Totals over all operators: 9,214 will die and 2,750 are equivalent; average error 19.3%, standard deviation 17.4%.
Table 6. Sort program: real data vs. estimated data for 100 test cases.

| Oper | Total | Live | Will die | Equiv. | Prob. non-equiv. | Prob. equiv. | Will die (est.) | Equiv. (est.) | Error (%) |
|---|---|---|---|---|---|---|---|---|---|
| Cccr | 1,208 | 666 | 135 | 531 | 0.25 | 0.75 | 168.59 | 497.41 | 5.0 |
| Ccsr | 1,542 | 459 | 251 | 208 | 0.93 | 0.07 | 426.28 | 32.72 | 38.2 |
| CRCR | 1,019 | 313 | 195 | 118 | 0.76 | 0.24 | 237.73 | 75.27 | 13.7 |
| OARN | 282 | 70 | 46 | 24 | 0.70 | 0.30 | 49.22 | 20.78 | 4.6 |
| OASA | 8 | 2 | 2 | 0 | 1.00 | 0.00 | 2.00 | 0.00 | 0.0 |
| OCOR | 16 | 16 | 0 | 16 | 0.00 | 1.00 | 0.00 | 16.00 | 0.0 |
| SCRB | 21 | 14 | 5 | 9 | 0.00 | 1.00 | 0.00 | 14.00 | 35.7 |
| SSDL | 552 | 135 | 80 | 55 | 0.34 | 0.66 | 46.55 | 88.45 | 24.8 |
| STRP | 556 | 54 | 46 | 8 | 0.61 | 0.39 | 33.06 | 20.94 | 24.0 |
| (name illegible) | - | 438 | 91 | 347 | 0.04 | 0.96 | 17.61 | 420.39 | - |
| (name illegible) | - | 1,165 | 745 | 420 | 0.56 | 0.44 | 647.24 | 517.76 | - |
| (name illegible) | - | 204 | 130 | 74 | 0.41 | 0.59 | 83.59 | 120.41 | - |

Totals over all operators: 2,657 will die and 2,750 are equivalent; average error 23.2%, standard deviation 16.9%.
For 100 test cases (Table 7(b)), 12 operators were classified with an error rate lower than 10%, and more operators were classified with an error rate above 20% than in Table 7(a). In Artificial Intelligence, an error rate around 10% in estimation is considered reasonable [12].

Table 7. Error rate: (a) 20 test cases; and (b) 100 test cases.

(a) 20 test cases:
[0%-10%): OASA (0.0), OCOR (0.0), ORLN (0.5), Vsrr (0.7), OALN (0.7), OABN (2.6), STRP (3.1), STRI (3.8), Vprr (4.4), SMTC (5.2), Oido (5.3), OLNG (6.5), OASN (6.7), VTWD (8.1), OBAN (8.2), OARN (8.5), SSDL (8.7)
[10%-20%): OLLN (10.0), OLRN (10.5), OCNG (10.9), OESA (11.2), OBSN (11.5), CRCR (12.1), ORRN (12.3), ORAN (12.7), ORSN (14.2), SWDD (14.4), Varr (16.4), OBRN (16.9), OIPM (18.2), Ccsr (18.6), Cccr (18.9), ORBN (19.3)
[20%-30%): OAAN (20.5), VDTR (22.5), SMVB (24.1), OLAN (24.9), OLSN (25.2), OLBN (26.4), SSWM (29.0)
[30% and above): OEAA (37.4), OBLN (44.4), SCRB (47.1)

(b) 100 test cases:
[0%-10%): the full list is not legible in this reprint; from Table 6 it includes at least OASA (0.0), OCOR (0.0), OARN (4.6) and Cccr (5.0)
[10%-20%): Vprr (13.1), OALN (14.3), ORRN (16.6), OEBA (19.1)
[20%-30%): OLBN (20.7), ORBN (20.7), OLAN (21.2), VTWD (22.7), OBRN (23.4), STRP (24.0), SSDL (24.8), Oido (24.8), OEAA (25.0), OAAN (27.8), SMVB (29.0)
[30% and above): OLSN (31.4), OABN (32.0), OLLN (33.3), OCNG (34.8), SCRB (35.7), Ccsr (38.2), OBLN (44.4), SMTC (47.4), OLRN (49.2), OAAA (50.0), SSWM (50.0), OBBN (50.2), Varr (55.9), SRSR (60.4), OBNG (62.8)
It should be pointed out that more historical information should be collected in order to obtain a more representative estimate for programs with the characteristics of Sort. The idea is to define probabilistic information according to the characteristics of a set of programs. So, for each set, there would be one instance of the application of the technique that is a better estimator of the number of equivalent mutants. The results obtained here should be considered as a first attempt at providing guidelines to the tester to analyze live mutants. For example, considering Table 5, the OASA operator generates 8 mutants. After the run with 20 test cases, 3 are kept alive. The probabilistic information about this operator tells us that all OASA live mutants should die, since the probability of an OASA live mutant being equivalent is zero. Therefore, if the tester wishes to evaluate the live mutants by trying to improve the test set, he/she should first analyze the kinds of operators that are supposed to produce non-equivalent mutants. On the other hand, considering the OCOR operator, which generates 16 mutants, according to the probabilistic information all OCOR live mutants should be equivalent. So, if the tester wishes to determine the equivalent ones, he/she should first analyze this kind of operator. Thus, there are operators for which live mutants are more or less likely to be equivalent. For example, the probability of an STRP live mutant being equivalent is 0.09.

6. Conclusion and Future Work

In this paper the use of Bayesian Learning, an Artificial Intelligence technique, was investigated to provide guidelines to help the tasks of determining equivalent and non-equivalent mutants. The main contribution of this paper is the proposition of the BaLBEDeT technique (Bayesian Learning-Based Equivalent Detection Technique). Further studies are being planned to investigate the scalability of these results to larger programs. We are also interested in expanding the selection of programs to different application domains to replicate this study, in order to increase the validity of these results. A database with information about equivalent mutants in a different set of programs in different domains would lead to more precise predictions (closer to real predictions), and less effort would be needed to analyze the live mutants. A further refinement of this work would consider the frequency of execution. Proteum/IM 2.0 [4] provides this kind of information. The frequency of execution of a given mutant indicates how many test cases of the test set reach the mutated code.

Acknowledgments

The authors would like to thank the Brazilian funding agencies CNPq, FAPESP and CAPES, and Telcordia Technologies (USA), for their partial support of this research, and the anonymous referees for their valuable comments. We would also like to thank Eric Wong, who provided part of the test cases used in this experiment, and Rodrigo Funabashi Jorge, who provided information on equivalent mutants.

References

1. A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques and Tools, Addison-Wesley, 1986.
2. E. F. Barbosa, J. C. Maldonado, and A. M. R. Vincenzi, "Towards the determination of sufficient mutant operators for C", Software Testing, Verification and Reliability 11(2) (2001) 113-136.
3. M. E. Delamaro and J. C. Maldonado, "Proteum — a tool for the assessment of test adequacy for C programs", in Conference on Performability in Computing Systems (PCS'96), Brunswick, NJ, July 1996, pp. 79-95.
4. M. E. Delamaro, J. C. Maldonado, and A. M. R. Vincenzi, "Proteum/IM 2.0: An integrated mutation testing environment", in Mutation 2000 Symposium, San Jose, CA, Oct. 2000, pp. 124-134.
5. R. A. DeMillo, R. J. Lipton, and F. G. Sayward, "Hints on test data selection: Help for the practicing programmer", IEEE Computer 11(4) (1978) 34-43.
6. R. A. DeMillo and A. J. Offutt, "Constraint based automatic test data generation", IEEE Trans. on Software Engineering 17(9) (1991) 900-910.
7. R. G. Hamlet, "Testing programs with the aid of a compiler", IEEE Trans. on Software Engineering 3(4) (1977) 279-290.
8. M. Harman and S. Danicic, "Amorphous program slicing", in 5th IEEE Int. Workshop on Program Comprehension (IWPC'97), IEEE Computer Society Press, Dearborn, Michigan, May 1997, pp. 70-79.
9. M. Harman, R. Hierons, and S. Danicic, "The relationship between program dependence and mutation testing", in Mutation 2000 Symposium, San Jose, CA, Oct. 2000, pp. 15-23.
10. R. M. Hierons, M. Harman, and S. Danicic, "Using program slicing to assist in the detection of equivalent mutants", Software Testing, Verification and Reliability 9(4) (1999) 233-262.
11. W. E. Howden, "Reliability of the path analysis testing strategy", IEEE Trans. on Software Engineering 2(3) (1976) 208-214.
12. T. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
13. E. Mresa and L. Bottaci, "Efficiency of mutation operators and selective mutation strategies: An empirical study", The Journal of Software Testing, Verification and Reliability 9(4) (1999) 205-232.
14. A. J. Offutt and W. M. Craft, "Using compiler optimization techniques to detect equivalent mutants", Software Testing, Verification and Reliability 4 (1994) 131-154.
15. A. J. Offutt, Z. Jin, and J. Pan, "The dynamic domain reduction approach to test data generation", Software Practice and Experience 29(2) (1999) 167-193.
16. A. J. Offutt, A. Lee, G. Rothermel, R. H. Untch, and C. Zapf, "An experimental determination of sufficient mutant operators", ACM Transactions on Software Engineering Methodology 5(2) (1996) 99-118.
17. A. J. Offutt and J. Pan, "Detecting equivalent mutants and the feasible path problem", in COMPASS'96 — Annual Conference on Computer Assurance, IEEE Computer Society Press, Gaithersburg, MD, June 1996, pp. 224-236.
18. A. J. Offutt and J. Pan, "Automatically detecting equivalent mutants and infeasible paths", Software Testing, Verification and Reliability 7(3) (1997) 165-192.
19. A. J. Offutt, G. Rothermel, and C. Zapf, "An experimental evaluation of selective mutation", in 15th Int. Conf. on Software Engineering, Baltimore, MD, May 1993, pp. 100-107.
20. S. Rapps and E. J. Weyuker, "Selecting software test data using data flow information", IEEE Transactions on Software Engineering 11(4) (1985) 367-375.
21. S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Prentice-Hall, New Jersey, 1995.
22. A. M. R. Vincenzi, J. C. Maldonado, E. F. Barbosa, and M. E. Delamaro, "Unit and integration testing strategies for C programs using mutation-based criteria", Software Testing, Verification and Reliability 11(4) (2001) 249-268.
23. W. E. Wong, On Mutation and Data Flow, Ph.D. thesis, Department of Computer Science, Purdue University, W. Lafayette, IN, Dec. 1993.
24. W. E. Wong, A. P. Mathur, and J. C. Maldonado, "Mutation versus all-uses: An empirical evaluation of cost, strength, and effectiveness", in Int. Conf. on Software Quality and Productivity, Hong Kong, Dec. 1994, pp. 258-265.
Chapter 4
ML Applications in Transformation

One of the essential challenges in SE, as eloquently explicated by Brooks, is changeability: "The software product is embedded in a cultural matrix of applications, users, laws, and machine vehicles. These all change continually, and their changes inexorably force change upon the software product." Changes can be made to a software system through transformations. A transformation to a software product is a mapping from one model to another that aims at improving certain aspects of the transformed software product (e.g., improved modularity, desirable parallelism, improved run-time performance) while preserving all of its other properties (e.g., its functionality) [23]. A transformation is usually localized, affects a small number of classes, attributes, and operations, and is carried out in a series of small steps. In this chapter, we focus on ML applications in software product transformation. Table 24 offers a state-of-the-practice in this area.

Table 24. ML methods used in transformation. (Columns: NN, IBL, DT, GA, GP, ILP, EBL, CL, BL, AL, IAL, RL, EL, SVM, CBR; rows: Parallel Programs, Modularity, Object-oriented Applications. The check marks indicating which methods apply to each row are not legible in this reprint.)
165
The following paper will be included here: R. Schwanke and S. Hanson, "Using neural networks to modularize software", Machine Learning, Vol.15, No.2, 1994, pp.137-168.
Using Neural Networks to Modularize Software

ROBERT W. SCHWANKE AND STEPHEN JOSE HANSON
Siemens Corporate Research, Princeton, NJ 08540

Editor: Alex Waibel

Abstract. This article describes our experience with designing and using a module architecture assistant, an intelligent tool to help human software architects improve the modularity of large programs. The tool models modularization as nearest-neighbor clustering and classification, and uses the model to make recommendations for improving modularity by rearranging module membership. The tool learns similarity judgments that match those of the human architect by performing back propagation on a specialized neural network. The tool's classifier outperformed other classifiers, both in learning and generalization, on a modest but realistic data set. The architecture assistant significantly improved its performance during a field trial on a larger data set, through a combination of learning and knowledge acquisition.

Keywords: neural networks, software modularization, similarity classification.
1. Introduction

1.1. The cognitive task of programming

Software engineers today face a formidable cognitive challenge: understanding the interactions among thousands of procedures, variables, data types, macros, and files. Most software engineers work on large, long-lived programs. Consequently, they spend more of their time modifying existing code than they do creating new code. The engineer frequently must read and understand parts of the program that he did not write, or that he wrote months or years ago and no longer recognizes. Any documentation he might have available is almost certainly obsolete. The original designers of the system have probably moved on to new projects, or even new employers. Thus, he is left with only the code itself to give him the information he needs. Most significant commercial software systems comprise more than 100,000 lines of code (1600 pages, thicker than a James Michener novel). Fortunately, the code is typically organized into modules,¹ so that the programmer can deal with it in larger chunks. Even so, a large system is likely to comprise more than 10,000 procedures, variables, types, macros, etc. (hereafter called software units, or units), in more than 100 modules, and is likely to involve five or more programmers. Furthermore, the system is likely to be changing rapidly, with new major releases coming out every year, each with one quarter or more of the code different from the previous release. With rapid change comes architectural drift, as each change moves the structure of the system away from its original design. To compound his woes, the programmer may be responsible for working on several different system versions simultaneously, so he must remember how the interactions among components differ from one version to another.
The goal of the current research is to help rescue engineers from the nightmare of incomprehensible code by providing them with intelligent tools for analyzing the system, reorganizing it, documenting the new structure, and monitoring compliance with it, so that significant structural changes can be detected and evaluated early, before they become irreversible.

1.2. Why software is in modules

Recent developments in programming environments have raised questions about whether modules are as important as they once were. Cross-reference aids, "smart recompilation," and hypertext facilities, for example, treat procedures, macros, and other software units individually, practically ignoring traditional file and module boundaries. However, when programming is considered as a human cognitive activity, the importance of modules becomes clear. Reviewing this activity will also motivate the heuristic analysis and reorganization methods we are proposing.

• Modules are the building blocks of a software system's technical design. One of the goals of design is to select a set of conceptual entities that have relatively few interactions between them, so that the designers can reason about the system as a whole without much reference to the details inside individual modules.
• Modules are often used to assign technical responsibility. Each programmer on a large project becomes a specialist in certain parts of the system. Limiting the interactions between modules reduces the amount of communication needed between programmers.
• Good modularity can also limit the impact of program changes. A single conceptual change generally requires changes to several software units. For example, if every module in a system contains code that directly accesses a sorted list, changing it to a hash table will be extremely difficult. However, if other modules can access the list only by calling the "insert," "retrieve," and "remove" routines of the "SortedList" module, then only these three routines will need to be rewritten.
• Modules are the basic units of system integration and testing. Good planning depends heavily on having well-defined modules with limited dependencies on other modules, and on making sure that the dependencies do not change much between writing the plan and starting integration.

A recent study reveals that at least half of the cost of a software system occurs after the software is first delivered to the customer (cf. Chapin, 1988). The largest component of this cost is modifications to the software, including fixing bugs, adding new functions and services, and porting the software to new computers, new operating systems, and new user interface systems. Furthermore, the programmers who make these modifications spend most of their time, not in making the changes, but in understanding the code that is related to the changes.

In summary, the choice of modules for organizing a large software system affects understandability, division of labor, modifiability, integratability, and testability. Of these, understandability has the largest impact on the success of a software project. Therefore,
modules are important for their role as conceptually coherent chunks of software, and improving coherence through machine-assisted reorganization is an appropriate goal.

1.3. Overview of the article

The current research is intended to form the basis for a heuristic module architecture advisor, which recommends organizational changes that would improve the information-hiding quality of the modules in a software system. This article models modularization as a categorization activity requiring similarity judgment. Similarity between software units is computed by a function of their common and distinctive features, which is fitted to training data by a neural network. Categorization is accomplished by a nearest-neighbor classifier.

The model is examined by embedding it in a software classification tool and several interactive clustering tools, which make reclassification and clustering recommendations, respectively. The tools incorporate a learning component, which responds to rejected recommendations by using a neural network to adjust feature weights as necessary to make the classifier agree with the category assignments given explicitly by the user. The learner transforms the user's category assignments into more-similar-than judgments, "S is more similar to G than S is to B," selecting triples <S, G, B> such that a similarity function whose values minimize errors on those judgments also maximizes the classifier's accuracy on the given category assignments. The tool then learns an ordinal similarity function that optimally fits the more-similar-than judgments.

Learning is carried out through a special-purpose back-propagation neural network. The network directly compares the value of the similarity function computed on two pairs of inputs (<S, G> and <S, B>), and back-propagates error to increase similarity on the first pair while decreasing it on the second. The features of <S, G> and <S, B> are preprocessed and presented to the network as common and distinctive features. The similarity function computed by the network is constrained to compute a ratio of common to distinctive features, in keeping with accepted models of human similarity judgment.

The classifier-with-learner thus constructed compares favorably to more traditional category learning methods. It has also been installed in a module architecture advisor, and used successfully on several real software reorganization tasks. We conclude from these experiences that modeling software modularization as nearest-neighbor classification, with a similarity function based on accepted models of human similarity judgment, is a viable basis for the design of a module architecture advisor. The learning method used would be useful for a wide range of applications involving nearest-neighbor classifiers. The module architecture advisor illustrates a promising approach for designing "intelligent assistants" for expert tasks.
2. The information-hiding principle

One of the earliest and most influential writers on the subject of modularity is David L. Parnas. In 1971, he wrote of the information distribution aspects of software design (emphasis his),
The connections between modules are the assumptions which the modules make about each other. In most systems we find that these connections are much more extensive than the calling sequences and control block formats usually shown in system structure descriptions (Parnas, 1972).

The same year he formulated the information-hiding criterion, advocating that a module should be

...characterized by a design decision which it hides from all others. Its interface or definition [is] chosen to reveal as little as possible about its inner workings (Parnas, 1971).

According to Parnas, the design decisions to hide are those that are most likely to change later on. Good examples are

• data formats,
• user interface details,
• hardware (processor, peripheral devices), and
• operating system.
In practice, the information-hiding principle works in the following way. First, the designers identify the role or service that the module will provide to the rest of the system. At the same time, they identify the design decisions that will be hidden inside the module. For example, the module might provide an associative memory for use by higher-level modules and conceal whether the memory is unsorted or sorted, whether it is all in fast memory or partly on disk, and whether it uses assembly code to achieve extra-fast key hashing. The module description is then refined into a set of procedure and data types that other modules may use when interacting with the memory. For example, the memory module might provide operations to insert, retrieve, modify, and remove records. These four operations would need parameters specifying records and keys, and some way to determine when the memory is full. The module would declare and make public the data types "Key" and "Record," and the procedures "Insert," "Retrieve," "Modify," and "Remove."

Next, the associative memory module is implemented as a set of procedures, types, variables, and macros that together make, for example, a large in-core hash table. The implementation can involve additional procedures and types beyond the ones specified in the interface; only the units belonging to that module are permitted to use these "private" units. Thus, the information that the memory is implemented as a hash table is concealed from other modules. They cannot, for example, determine which order the records are stored in, because they cannot use the name of the table of records in their procedures. Later, if the implementor should decide to replace the hashing algorithm, or even to use a sorted tree, all the code that he would need to change would be in the associative memory module.

This example shows that many design decisions are represented by software unit declarations, such as

HashRecord array HashTable[TableSize]
which embodies the decision to store hash records in a fixed-size table rather than, say, a linked list or tree. In most cases, procedures that depend on the design decision will use the name of the corresponding software unit, such as

procedure Retrieve(KeyWanted: Key)
    Index = Hash(KeyWanted)
    if HashTable[Index].Key equals KeyWanted
        return HashTable[Index].Record
    else
        return FAILURE
This correspondence implies that

If two units use several of the same unit-names, they are likely to be sharing significant design information, and are good candidates for placing in the same module.

A unique aspect of our research is that we measure design coupling, rather than data or control coupling. A simple example will illustrate the difference. The diagram below illustrates four procedures (A, B, C, and D) and table T. Procedure A calls procedure B to write information into table T. Procedure D reads information from the table. Procedure C also writes information into table T. Procedures A and B have a control link between them, because A calls B. Procedures B and D have a data link between them, because data pass from B to D through the table. Likewise, A and B are data-linked through parameters, and C and D are data-linked through T. However, B and C are not data-linked, because both of them put data into T, but neither one takes data out. Finally, B, C, and D have a design link among them, because all three share assumptions about the format and interpretation of table T. If one of the procedures ever needs to be rewritten in a way that affects the table T, the other two should be examined to see if they require analogous changes.
[Diagram: procedure A calls B; B and C both write into table T; D reads from T.]
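Returning to the associative-memory example above, the idea can be made concrete in code. The sketch below is our illustration, not code from the paper; the dictionary standing in for the hash table is a hidden design decision that callers cannot observe.

# A minimal sketch (ours) of an information-hiding module in Python.
class AssociativeMemory:
    """Public operations: insert, retrieve, modify, remove.
    Whether storage is a hash table or a tree is concealed."""

    FAILURE = object()           # echoes the FAILURE result in the pseudocode

    def __init__(self):
        self._table = {}         # hidden design decision: a hash table

    def insert(self, key, record):
        self._table[key] = record

    def retrieve(self, key):
        return self._table.get(key, self.FAILURE)

    def modify(self, key, record):
        if key in self._table:
            self._table[key] = record

    def remove(self, key):
        self._table.pop(key, None)

Swapping the dictionary for a sorted structure would change only this class and leave every caller untouched, which is exactly the property Parnas advocates.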
Before Parnas's work, it was commonplace to group units into modules based on control links, leaving large numbers of design dependencies between modules. Nowadays, programmers generally agree that it is more important to group together procedures that share data and type information than to group procedures that call one another. It would be nice if the clear, simple concepts contained in a system's original design could be directly mapped into an appropriate set of implementation modules, and the mapping preserved throughout the system's lifetime. However, the implementation process always uncovers technical problems that were not apparent during the early design process, leading to changes in the design. Furthermore, design decisions are almost never so clearly separable that they can be neatly divided into subsystems and sub-subsystems. Each decision interlocks
with other decisions, so that inevitably there are some design decisions that cannot be concealed within modules, even though they are likely to change. Conversely, a module may span several loosely related decisions. In addition, there are often managerial and other non-technical influences on how a system is modularized. In the final analysis, good modularity is highly subjective.

3. A model for human software classification

We observe that programmers modularize software in much the same way that humans generally classify objects. Specifically, modules are used analogously to categories. The software units contained in a module are instances, or exemplars, of the category. The unit names appearing in an instance are its boolean-valued features. Two units can be compared by looking at their shared and distinctive features. Programmers often decide whether two units belong in the same module by such comparisons. For example, when writing a new procedure, a programmer will normally place it in the same module as other procedures that use some of the same data types and data structures in the same way. (Unlike some domains, such as thyroid assay interpretation (cf. Horn et al., 1985), we make a sharp distinction between shared "true" features and shared "false" features.)

Modules must be described by exemplars because there are many cases in which a well-designed, useful module contains two units (instances) that do not share any features. Therefore, there can be no necessary-and-sufficient feature list to describe the category. Feature diversity is intrinsic in the problem, because the information-hiding principle implies that there should be very few widely used unit names, and therefore very few features that are common to large numbers of instances. Nonetheless, a module is often designed to surround several related data structures and other private software units. Some procedures access only one or two of these data structures, while others may access all of them. Consequently, many modules contain no single "typical" member, although in some cases two or three procedures together represent the principal types of procedures contained in the category.

These observations are consistent with the literature of human classification. Humans classify things according to a few simple heuristics. First, people tend not to behave as if categories are defined by necessary and sufficient conditions; rather, they treat them as probabilistic (cf. Smith & Medin, 1981), or, more generally, as if they possess a feature "polymorphy" (cf. Hanson & Bauer, 1989). Much of the natural world promotes this view: cups, chairs, birds and so forth are labeled as such because they possess smaller feature variance within each category than between categories. Cups are cups because they possess more "cupness" than, say, "bowlness." Consequently, categorization of like objects arises partly as a contrast between clusters of objects. Another important heuristic about human classification is that not all exemplars within a category have equal status—categories are not equivalence classes. Some members of a category are better representatives of the whole category; some are more typical or more central to the "definition" of the category (Posner & Keele, 1968; Homa, 1978). And finally, humans tend to use multiple strategies when classifying, depending on the frequency of the candidate and its closeness to the most typical case.
Categories can be extended either by comparisons to an aggregate pattern, prototype, or "average" or by nearest match to an exemplar (Medin & Schaffer, 1978; Homa, 1978).
To turn this qualitative model of modularization into an operational one, we require

• a way to describe software units as sets of features,
• a way to measure similarity in terms of those features, and
• a classification rule based on that similarity measure.

3.1. Software implementation features

The information-hiding principle led us to the observation that the names used in a program unit are good clues about the design assumptions on which its implementation depends, and therefore are good indicators of which module it belongs in. For similar reasons not elaborated here, the names of other units in which a unit is used are also good clues about where it belongs, although the correlation is not as strong. Therefore, the names a unit uses, and the names of places where a unit is used, are appropriate features representing the design characteristics of that unit.

This information can be extracted from the code itself. First, a conventional cross-reference extractor analyzes the code to produce a relation, NamesUsed ⊆ Unit × Unit, where (x, y) ∈ NamesUsed if and only if the name of unit y is used in unit x. Then, UserNames = {(y, x̄)} such that (x, y) ∈ NamesUsed, and HasFeatures = NamesUsed ∪ UserNames. Notice that UserNames introduces the notation x̄, denoting a synthetic name derived from x. This represents the difference between a name used in a unit and the name of a place where the unit is used. The distinction is made so that, when HasFeatures is computed, (y, x) and (y, x̄) are distinct tuples.

Experience has shown the importance of obtaining cross-reference information that is as fine-grained as possible, e.g., the individual field names of structures and the individual literals of enumeration types. Such details are what distinguish code that implements an abstract data type from code that merely uses it. Although cross-reference analysis generates a rich set of implementation features, some important design decisions do not correspond to any particular identifier. Therefore, the relation HasFeatures may be expanded with tuples supplied by the human architect.
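The two relations translate directly into code. The following sketch is our illustration (names such as names_used are ours): it derives HasFeatures from a cross-reference relation, tagging "used-by" features so that (y, x) and (y, x̄) remain distinct.

# Sketch (ours) of the feature-extraction step.
from collections import defaultdict

def has_features(names_used):
    """Union of NamesUsed and UserNames, per unit."""
    feats = defaultdict(set)
    for x, y in names_used:
        feats[x].add(("uses", y))       # the name of y occurs in x
        feats[y].add(("used-by", x))    # the synthetic name x-bar
    return feats

# Hypothetical cross-references: (x, y) means unit x uses the name of y.
xref = [("Retrieve", "HashTable"), ("Insert", "HashTable"),
        ("Retrieve", "Hash"), ("Insert", "Hash")]
features = has_features(xref)
# features["HashTable"] == {("used-by", "Retrieve"), ("used-by", "Insert")}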
3.2. Measuring similarity

In order to design an appropriate similarity function, we first describe several important properties that the function should satisfy, and then introduce a function that satisfies them.

3.2.1. Matching and monotonicity

When a programmer judges the similarity of two procedures, she looks both at the features they have in common and the features that are distinctive to one or the other procedure. Adding a common feature increases similarity; adding a distinctive feature decreases it. She also judges the relative importance of different features. A feature representing a localized, volatile design decision deserves a greater weight than a feature representing a widely used, stable design decision. One type of feature she does not look at is one that is absent from both procedures. Identifiers that do not occur in either procedure have no impact on their similarity; they are simply irrelevant.

These properties of software similarity judgment correspond to general models of human similarity judgment, such as proposed by Tversky (1977). Tversky's model treats object descriptions as sets of features, and similarity functions as functions of common and distinctive features, defined as follows. Let A, B, C, ... be objects described by sets of features a, b, c, ..., respectively. When comparing two objects, the following computed feature sets are significant:

a ∩ b — the set of features that are common to A and B.
a − b, b − a — the sets of features that are distinctive to A or B, respectively.

The matching property restricts similarity functions to those that are functions of the common and distinctive features, and that are independent of the features that neither object has. (The property is apparently so named because it matches up features in the two sets.) A similarity function, SIM, has the matching property if there exists a function F such that

SIM(X, Y) = F(x ∩ y, x − y, y − x)

The monotonicity property embodies the idea that similarity should increase in proportion to common features and decrease in proportion to distinctive features. A similarity function, SIM, has the monotonicity property if SIM(A, B) ≥ SIM(A, C) whenever

a ∩ b ⊇ a ∩ c
a − c ⊇ a − b
c − a ⊇ c − b

and, furthermore, the inequality is strict whenever at least one of the set inclusions is proper.

3.2.2. Other desirable properties

The software domain suggests some additional characteristics that a similarity function should have:

• No maximum value. Two identical procedures with many features should be more similar than two identical procedures with few features.
• A minimum value, obtained when there are no common features. Two unrelated procedures should be just as dissimilar as two other unrelated procedures.
• The function should be defined when there are no common and no distinctive features. This surprising requirement arises because real-world software sometimes contains "stub" procedures that have no bodies and no callers, and hence no features.

3.2.3. A ratio model of similarity

We have designed a matching, monotonic similarity function with the above properties, called Ratio, similar to one proposed by Tversky, of the same name. We define

Ratio(s, n) = Weight(s ∩ n) / (1 + c · Weight(s ∩ n) + d · (Weight(s − n) + Weight(n − s)))

This function satisfies the matching and monotonicity properties as long as c and d are positive, and Weight increases monotonically with the set membership of its argument. This requirement is satisfied by defining

Weight(X) = Σ_{x ∈ X} w_x, where w_x > 0.

Weight computes the combined significance of a set of features. Although Tversky defined the function to be linear, we admit the possibility that it might be non-linear, representing correlations among features.

Ratio satisfies the requirements of software similarity measurement described above. It is also symmetric, which is not necessary to model human similarity judgment, but makes it possible to use the function in standard clustering algorithms.

There still remains the problem of how to assign weights to the features, and values to c and d. Giving all features the same weight causes high-frequency features to dominate clustering performance at the expense of rare features. Intuitively, feature weight should vary inversely with its frequency, since rarely occurring features have a better chance of being encapsulated within a module. Therefore, we have been estimating the significance of a feature by its Shannon information content, w_f = −log P(f), where P(f) is the probability that a unit in the system being studied has feature f. In a later section we will describe how to learn better estimates of these weights.
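Read as code, the Ratio function and the information-content weighting look roughly as follows. This is our sketch, with illustrative defaults for c and d; units maps unit names to their feature sets.

import math

def feature_weights(units):
    """w_f = -log2 P(f), where P(f) is the fraction of units having f."""
    n = len(units)
    counts = {}
    for feats in units.values():
        for f in feats:
            counts[f] = counts.get(f, 0) + 1
    return {f: -math.log2(k / n) for f, k in counts.items()}

def ratio_similarity(s, n, w, c=1.0, d=1.0):
    """Tversky-style ratio of common to distinctive feature weight."""
    weight = lambda fs: sum(w.get(f, 0.0) for f in fs)
    common = weight(s & n)
    distinctive = weight(s - n) + weight(n - s)
    return common / (1.0 + c * common + d * distinctive)

Note that the function is defined even for two featureless "stub" units, for which it returns the minimum value, zero.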
3.3. A nearest-neighbor classification rule

Because programmers assign procedures to modules by finding a small group of highly similar procedures, modularization can be modeled by a nearest-neighbor classification rule. To describe the rule, we use the following definitions:

Subject. A unit being considered for inclusion in a category.
Good Neighbor. A unit belonging to the category for which the subject is being considered.
Bad Neighbor. A unit belonging to any category other than the one for which the subject is being considered.
k. A parameter of the classification rule, denoting the minimum group size to which a subject might be added.

Since more than one module may be a good candidate to receive a given subject, the classification rule below incorporates a confidence measure for each of the possible categories:

Confidence. Subject S fits in category X with confidence C if and only if assigning S to X would result in S having exactly C bad neighbors more similar to it than its k-th good neighbor.

Note that C is zero when S's k nearest neighbors are all members of category X. Greater values of C imply that the immediate neighborhood of S is "polluted" with units from other categories. With the confidence measure so defined, the classification rule is straightforward:

Classification Rule. A software unit belongs in the category for which its confidence rating is best (closest to zero).

Confidence ratings also provide a sensitive way to measure how well a classifier implementing this rule conforms to a given set of classification data. If performance were measured only in terms of classification errors, the classifier might compute the correct categories, but with marginal confidence. Therefore, it is useful to measure a classifier's performance in relation to its confidence that the subjects are correctly classified as labeled in the data. However, such a measure would still be sensitive to the cluster size parameter, k. Experience has shown that setting k equal to 1 causes problems when some of the data are mislabeled. If just one unit is mislabeled, any unit for which it is the nearest neighbor will appear to be misclassified. Similarly, any particular value for k may be inappropriate for a specific unit, because highly cohesive software clusters occur in many sizes. Therefore, performance is actually measured over a range of values of k, as follows:

Classifier Performance Measure. A nearest-neighbor classifier conforms to a data set D with rating R,

R = Σ_{(S,C) ∈ D} Σ_{k=1..K} Confidence(S, C, k)

where (S, C) denotes unit S assigned to cluster C. Typical values of K range from 2 to 5.
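A direct reading of the confidence measure and classification rule, sketched by us in Python; assign maps each unit to its current category, and sim is any unit-level similarity function such as the Ratio sketch above, lifted to unit names.

def confidence(subject, category, assign, sim, k=2):
    """Bad neighbors more similar to subject than its k-th good neighbor."""
    others = [u for u in assign if u != subject]
    good = sorted((u for u in others if assign[u] == category),
                  key=lambda u: sim(subject, u), reverse=True)
    if len(good) < k:
        return float("inf")        # too few good neighbors to judge
    kth_sim = sim(subject, good[k - 1])
    return sum(1 for u in others
               if assign[u] != category and sim(subject, u) > kth_sim)

def classify(subject, assign, sim, k=2):
    """Pick the category with the best (lowest) confidence rating."""
    cats = set(assign.values())
    return min(cats, key=lambda c: confidence(subject, c, assign, sim, k))

The performance measure R is then just the sum of confidence(S, C, k) over the labeled data for k from 1 to K.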
4. Learning architectural judgment

The classification rule described in the previous section has been installed in a module architecture advisor and used profitably to help reorganize real software systems (including
the tool itself). These experiences will be discussed in section 5. However, the classifier's accuracy has been limited by the arbitrary way that values are assigned to the constants and feature weights in the similarity function. "Hand tuning" those values has been tedious and unenlightening, although doing so does improve performance. The goal of the present research is to replace hand tuning with an automatic tuning process, in which the tool learns from its mistakes. To achieve this goal, the learning task is divided into the following steps:

1. Advise the architect until the architect disagrees with the advice (indicating that learning is needed).
2. Identify a subset of the units that the architect believes are correctly classified, for use as a training set.
3. Construct a set of more-similar-than judgments that, if correctly modeled by the similarity function, would produce perfect classifier performance and confidence on the training data.
4. Train the similarity function, by back propagation, to maximally fit the more-similar-than judgments.
5. Ask the architect for additional features to explain any category assignments that the classifier has not learned.

The rest of this section assumes that the advisor can extract a training set from the current tool state. It describes how more-similar-than judgments are constructed, and describes the back propagation network used to train the similarity function. Training data selection and feature acquisition from the architect are deferred to section 5.

4.1. Constructing more-similar-than judgments

From the definition of the classifier performance measure, one can see that only similarity judgments between a subject and its near neighbors are relevant to classifier performance. In particular, Confidence(S, C, k) is proportional to the number of cases in which S is more similar to one of its bad neighbors than to its k-th nearest good neighbor. Aggregating over all values of S and k, classifier performance is equal to the number of cases, (S, G, B), for which a subject, S, is more similar to a bad neighbor, B, than to one of its K nearest good neighbors, G. Therefore, only a subject's K nearest good neighbors are relevant to it. The relevant bad neighbors are those more similar to the subject than its K-th nearest good neighbor. Therefore, the more-similar-than judgments that should be learned are all possible combinations of a subject and its relevant good and bad neighbors. Optimizing these judgments optimizes classifier performance according to the specified measure.

The optimization process brings in one additional problem: initialization. Since the goal is to learn a similarity function, and training data are selected using that same function, one must assume that, initially, the estimated similarity function is a poor predictor of actual similarity. Therefore, limiting consideration to the "K-th good neighbor"-hood might arbitrarily screen out units that are actually highly similar to the subject.
This problem is solved in the present research by setting K very large at the beginning of training, while the similarity function is poorly estimated, and gradually decreasing it as learning progresses. Initially, all weights and constants are drawn from random distributions, and K is set greater than the size of the largest module. After each training epoch, K is reduced, by exponential decay, toward a predetermined asymptotic value and the triples (S, G, B) are reconstructed.

4.2. Backpropagating similarity judgment errors

Similarity judgment is learned by computing more-similar-than judgments in a feed-forward neural network, and backpropagating errors. However, rather than being a general hidden-layer network, the network is designed to mirror the model of similarity judgment discussed above. It is described here mathematically first, and then as a network.

4.2.1. Inputs

For each triple (S, G, B), the inputs to the network are the corresponding feature sets s, g, and b.

4.2.2. Error function

The network computes the sigmoid function of the difference of two similarities:

Error(s, g, b) = (σ(Sim(s, g) − Sim(s, b)) − threshold)²

where the threshold is typically 0.95, and

σ(x) = 1 / (1 + e^(−x))

The value of Error will be near zero whenever S is much more similar to G than to B, and near threshold² when the opposite is true.

4.2.3. Similarity function

The similarity function is defined as described in section 3.2.3:

Sim(s, n) = Weight(s ∩ n) / (1 + c · Weight(s ∩ n) + d · (Weight(s − n) + Weight(n − s)))

where Weight(X) = Σ_{x ∈ X} w_x, and w_x > 0.
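Stripped of the network machinery, one training epoch amounts to gradient descent on the comparison error. The sketch below is our plain-Python approximation, not the authors' implementation: triples is the list of (s, g, b) feature-set triples built as in section 4.1, w maps every feature to a positive weight, and c and d are held fixed for brevity (the paper trains them too). It includes the positive-weight clamping discussed in section 4.2.5 below.

import math

def sim_and_grads(s, n, w, c, d):
    """Ratio similarity plus dSim/dw_f for the features present."""
    C = sum(w[f] for f in s & n)
    D = sum(w[f] for f in (s - n) | (n - s))
    denom = 1.0 + c * C + d * D
    sim = C / denom
    dC = (1.0 + d * D) / denom ** 2      # dSim/dC
    dD = -d * C / denom ** 2             # dSim/dD
    grads = {f: dC for f in s & n}
    grads.update({f: dD for f in (s - n) | (n - s)})
    return sim, grads                    # absent features get no gradient

def train_epoch(triples, w, c=1.0, d=1.0, lr=0.1, thr=0.95, eps=1e-4):
    for s, g, b in triples:              # "S is more similar to G than to B"
        sim_g, grad_g = sim_and_grads(s, g, w, c, d)
        sim_b, grad_b = sim_and_grads(s, b, w, c, d)
        sig = 1.0 / (1.0 + math.exp(-(sim_g - sim_b)))
        dloss = 2.0 * (sig - thr) * sig * (1.0 - sig)
        for f in set(grad_g) | set(grad_b):
            w[f] -= lr * dloss * (grad_g.get(f, 0.0) - grad_b.get(f, 0.0))
            w[f] = max(w[f], eps)        # keep weights positive -> monotonic

Because both similarities are computed with the same weights, a wrongly ordered triple pushes up the weights of features shared by S and G while pushing down those shared by S and B, which is the comparison paradigm in miniature.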
4.2.4. Implementation as a network

Figure 1 shows the topology of the network. The Focuser uses the current estimate of the similarity function to select a set of input triples for one training epoch. Each triple is presented to the network as a triple of feature vectors. A preprocessing stage computes the needed sets of common and distinctive features. The next stage computes Weight for all six sets, using the same feature weights each time. These six Weights are fed into two identical sub-networks, which compute Ratio on (s, g) and (s, b), using the same values for c and d each time. Finally, the comparison stage of the network computes the sigmoid function of the difference of the two similarities. Back propagation is carried out with a simple delta rule.

4.2.5. Novel aspects of the network

The most significant aspect of the network design is that it learns a similarity function by comparing the value computed by the function on two related pairs of inputs.
Figure 1. Comparison network for learning relative similarity judgments.
This works because the absolute value computed by the function is unimportant; only the relative order of the values it computes matters. Therefore, instead of training the network to compute a specified value, it must compute two values with a specified relative order. The implementation method was suggested by Tesauro, who had invented it to teach his Neurogammon system (Tesauro & Sejnowski, 1989) an ordinal move evaluation function by training it on pairs of alternative moves from the same board position and dice roll, one of which was known to be an expert's choice. He calls the approach the comparison paradigm.

To implement the comparison paradigm, the network uses the same set of link weights for both computations of Sim. When error is propagated back, each weight receives error assignments for its roles in both computations. The order of the inputs does not need to be randomized, and no training signal is needed, because the symmetric network design prevents the weights from learning a bias toward the "left" or "right" neighbor. The "training signal" is always the same: S should be more similar to G than to B.

By implementing the ratio formula directly, the network restricts the class of similarity functions to a subset of the monotonic, matching functions. This restriction was motivated by the problem domain and has proven to be acceptable in practice. However, further research could examine other monotonic, matching functions. The network also specifies a linear Weight function. Nonlinear, but monotonic, functions could be substituted, if necessary to model feature correlations, but experience has shown the linear function to be satisfactory.

Two other implementation details are worth mentioning. The weights are bounded greater than zero, so that the similarity function remains monotonic. When the weights are updated during backpropagation, any weights that drop below a small positive threshold are reset to that threshold. The other detail is that the common and distinctive feature sets are represented as feature vectors, with 1's for set members and 0's for non-members. Zero has the special property that back propagation will assign no portion of the error to a link from a zero input, meaning that error can only be assigned to weights for features that were present in at least one of the input units. This reinforces our working hypothesis that similarity is unrelated to absent features.
4.3. Learning and generalization performance

To obtain a preliminary estimate of the tool's capabilities, we applied it to a small but realistic data set. The sample data come from a real software system, which is actually an early version of our batch-clustering tool. It comprises 64 procedures, grouped into seven modules. Membership in the modules is distributed as described in table 1. The software is written in the C programming language.

To create the sample data set, we applied a cross-reference analysis tool to it, collecting every occurrence of a non-local identifier, including procedures, variables, macros, typedefs, and the individual field names of structured data types. Each such name was given a unique identification number, so that there would be no confusion when, for example, two different record types have a field with the same name.
Table 1. Module sizes.

Module      Members
attr        10
hac         12
massage     5
node        7
objects     4
outputmgt   12
simwgts     14
Each distinct non-local name occurring within a procedure was then recorded as a feature of that procedure. In those cases where one procedure called another, two features were recorded: the callee's name became a feature of the caller, and the caller's name became a feature of the callee, but marked to distinguish "called by X" from "calls X." This process produced 152 distinct feature names. However, many of these features occurred in only one procedure each, and were therefore greatly increasing the size of the problem without contributing to the similarity of two procedures. Therefore, we eliminated all such singly occurring features, leaving 95.

We expected the given data to contain classification errors, because the software had not been carefully modularized. However, we wanted to measure generalization performance on "clean" data, so that generalization errors would not be confused with training data errors. Therefore, we established two criteria for eliminating units from the data set, both of which had to be met before the unit was eliminated:

1. The classifier must fail to classify the unit correctly, even after learning.
2. The feature data must show evidence that the learning failure was due to a modularization error.

Twelve procedures met both criteria and were removed, leaving 52. The network was able to learn to classify all 52 correctly.

To test the network's generalization ability, we conducted a jackknife test, in which the 52 procedures were divided into a training set and a test set, to determine how often the similarity measure, used in a nearest-neighbor classifier, could correctly classify procedures that were not in the training data. The test consisted of 13 experiments, each using 48 procedures for training and 4 for testing, such that each procedure was used for testing exactly once. Each experiment consisted of training on triples constructed from the training set, and then using the learned similarity function to identify the nearest neighbor of each procedure in the test set. K had a final value of 3. The test procedures were classified with k equal to 1. The results of the jackknife test are shown in table 2. Each row gives the number of procedures that were in that module and how many of them were classified into each module during the jackknife test. Only one unit out of 52 was misclassified.
Table 2. Classifier performance on unseen units.

                            Classified as
Actual Module   No.   outputmgt  simwgts  attr  hac  node  massage  objects
outputmgt       11       11
simwgts         11                  10      1
attr             9                          9
hac              8                                8
node             7                                      7
massage          4                                             4
objects          2                                                      2
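The jackknife protocol just described is easy to restate in code. This sketch is ours; train and nearest_neighbor stand for the triple-based training step and the 1-nearest-neighbor classifier, respectively.

def jackknife(units, labels, train, nearest_neighbor, fold_size=4):
    """Hold out fold_size units at a time; classify each exactly once."""
    names = sorted(units)
    errors = 0
    for i in range(0, len(names), fold_size):
        held_out = names[i:i + fold_size]
        training = [u for u in names if u not in held_out]
        sim = train(training, labels)      # learn feature weights
        for u in held_out:
            guess = nearest_neighbor(u, training, labels, sim)  # k = 1
            errors += guess != labels[u]
    return errors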
From these results we conclude that the given data, the classification and similarity models, and the learning method were very well matched to one another. Naturally, it could be that the data were highly redundant, so that a jackknife test that removed only four procedures was not really withholding any knowledge. Or it could be that the process of removing errors from the data biased the results. The experiment was simply too small to tell.

4.4. Comparisons to simpler classifiers

To study the properties of software categories, we first tried clustering them in feature space. Similarity in feature space is measured by distance metrics, with small distance corresponding to great similarity. Consequently, units possessing common features will be more similar than those that do not. Furthermore, those features that appear in one category but not in the other will augment similarity of units within a group. Typical kinds of similarity measures include Euclidean, Hamming, and inner product (cosine). Such measures used with agglomerative hierarchical clustering algorithms can only produce cluster groups that were originally at least linearly separable, i.e., clusters that could be separated by a line in 2-space or a hyperplane in four or higher dimensions.

We studied the category properties in feature space by attempting to cluster the example data set in feature space, using a hierarchical, agglomerative clustering algorithm under several different similarity measures. Shown in figure 2 is a Euclidean, centroid clustering of the software units by their cross-references (see above). At the end of each leaf is a label indicating which module of a possible seven the unit is defined in. Note that separation of the category modules does not result; less than half of the group members are assigned to their proper modules. Other measures fare no better. Figure 3 and figure 4 show cluster diagrams for other similarity measures.

4.5. Comparison to neural network classifier

Neural networks represent a class of function approximating methods that can create similarity as a function of the data to which they are exposed. Within any network is a
Figure 2. Euclidean, centroid clustering.
Figure 3. Hamming, centroid clustering.
Figure 4. Hierarchical clustering with cosine correlation.
basis set of functions (see Hanson & Burr, 1990) that allow arbitrary similarity functions to result as the network learns to correctly label data points. Such supervision is also critical for category problems in which the initial feature space is required by the category labels to transform nonlinearly. Consequently, although the similarity measure may cause units with shared and distinctive features to be closer in similarity space, category labels as determined by membership in modules may require transformation of similarity by moving units that are initially far apart closer together in similarity space.

Nonetheless, having an effective similarity measure and supervision from labels still involves a complex induction problem. Methods like neural nets share with other learning approaches the problem of generalizing correctly from a limited sample of example cases. In theoretical learning work, this is currently an intense area of research with many unanswered questions (cf. Sompolinsky & Tishby, 1990; Baum & Haussler, 1988; Rivest, Haussler, & Warmuth, 1989). Consequently, it is possible to learn perfectly all examples from the domain and still incorrectly classify new examples as they appear.

We found that this phenomenon was indeed present in our sample data. We used a simple feed-forward, back-propagation network with one hidden layer of sigmoidal activation units, and trained it to classify the given objects. We found that, with just four hidden units, this network was able to classify the training data perfectly. However, this perfect learning prevented it from classifying any novel units correctly. These results are shown in figure 5.

Figure 5. Training and transfer performance by number of hidden units.

We believe that one of the reasons neural network generalization performance is so poor is that the categories are sparsely populated. There are generally more relevant features than units to be classified. Each unit has typically only seven features, which it shares
with only a handful of other units. The majority of instances have no exact duplicates in the training set. Therefore, the network is not able to make generalizations about salient features, but can only memorize the category memberships.

5. Advising architects

Since preliminary experiments on small data sets showed the classification and learning methods to be promising, we proceeded to incorporate them in a heuristic architecture advisor (Schwanke & Platoff, 1989) and to try them out on real reorganization tasks. This section describes

• the working styles supported by the tool,
• how the tool uses the classification model to give advice,
• a case study showing that the advice is useful, and
• a case study showing how the quality of the advice improves with learning.
5.1. Working styles

The tool supports three different (although overlapping) styles of work:
• Incremental change: the software is already organized into high-quality modules. The engineer wishes to identify individual weak points in the architecture, and repair them by making small changes.
• Moderate reorganization: although the software is already organized into modules, their quality is suspect. The engineer wishes to reorganize the code into new modules, but with an eye to preserving whatever is still good from the old modularity.
• Radical reorganization: either the software has never been modularized, or the existing modules are useless. The engineer wishes to organize the software without reference to any previous organization.

All these styles can be applied to a whole system or to an individual subsystem or module. The tool supports these activities with two kinds of service: clustering and maverick analysis.

5.2. Clustering

This service organizes software units into a tree of categories. It is actually a group of clustering services, each of which interacts with the user in a different way:

Batch clustering supports radical reorganization. It uses a hierarchical, agglomerative clustering (HAC) algorithm to form a category tree. The given similarity measure is used to derive a group similarity measure. The algorithm starts by placing each unit in a group by itself. It then repeatedly combines the two most similar groups. Some variations of it heuristically eliminate useless interior nodes in the category tree, so that the tree has varying arity appropriate to the data.

Incremental clustering supports incremental to moderate reorganization. It is based on the same HAC algorithm, but allows the user to apply it to any node of his category tree at any time, and to review each clustering action before it is carried out. The user selects a node in the category tree, and asks for either the nearest sibling or the two most similar children of that node. Based on the answer, he may decide to combine the indicated groups, or to make some other change in the organization. Thus clustering is carried out manually, but with advice whenever requested.

Interactive reclustering supports moderate reorganization. It starts with a given set of original categories, but tries to build a fresh classification tree out of the units in them. It uses the original category labels to decide which clustering steps should be reviewed by the user, and which can be carried out automatically, without review. It uses the hierarchical, agglomerative clustering algorithm, but before combining two groups, it checks to see whether all members of both groups were in the same original category. If so, it combines the two groups automatically. If not, it pauses and asks the user whether to combine them. To help the user decide, it presents several other relevant groups. For each of the two groups that it is recommending combining, it also presents the second-nearest neighbor as well as the nearest neighbor whose members all belonged to the same group originally. The user can then choose to combine the recommended pair, to combine some other subset of the presented groups, or to make any other organizational change he wishes. The interactive clustering algorithm then resumes its work using whatever clusters the user has formed.
Neighborhood clustering can be used as a prelude to incremental clustering. Given a set of units, this service forms the smallest clusters for which each unit is in the same cluster as its k nearest neighbors.

The batch-clustering algorithm appears not to be useful, because a mistake early in the clustering process often makes all subsequent clustering decisions wrong. Also, an architect is not likely to accept a new set of categories that is radically different from the set of modules with which he started. The interactive clustering methods are much more useful, because in each of them the architect has opportunities to override the tool's recommendations, and the tool then continues its work based on the architect's decisions. Also, the interactive tools present the architect with several good alternatives, rather than just one best choice, thus giving the architect powerful guidance without preempting his options. The neighborhood clustering algorithm is quite powerful: even specifying a neighborhood size of 1 often creates clusters with an average of four members. This means that three quarters of the clustering decisions (each decision combines two clusters) have already been made, leaving only one quarter to be made by other methods. However, all the reclustering methods suffer the same weakness: they cannot yet learn from their mistakes.
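For reference, the hierarchical, agglomerative core shared by these services can be sketched as follows. This is our illustration; the group similarity shown (average pairwise similarity) is one possible way to lift the unit similarity to groups, and construction of the full category tree is elided.

def hac(units, sim, target=1):
    """Repeatedly merge the two most similar groups (agglomerative)."""
    def group_sim(g1, g2):
        # illustrative choice: average pairwise similarity
        return sum(sim(a, b) for a in g1 for b in g2) / (len(g1) * len(g2))
    groups = [frozenset([u]) for u in units]
    while len(groups) > target:
        i, j = max(((i, j) for i in range(len(groups))
                    for j in range(i + 1, len(groups))),
                   key=lambda ij: group_sim(groups[ij[0]], groups[ij[1]]))
        merged = groups[i] | groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append(merged)
    return groups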
5.3. Maverick analysis

This service heuristically identifies software units that appear to violate the information-hiding principle. A unit is deemed to be in violation if it appears to share more implementation characteristics with units in other modules than with units in its own module. This violation is detected by using the classifier to compute the "correct" module assignment. If the computed assignment is different from the module in which it resides, the unit is listed as a maverick. (Like a stray calf on the western plains, the unit must be returned to the proper herd.) Such a misplaced unit usually indicates a conceptual weakness in the architecture. The correct repair is usually not to simply move the unit. More often, it indicates that one or more units are mixing design decisions from different modules, and that the units should be rewritten.

The maverick analyzer evaluates the category assignment of each unit. It uses the classifier both to identify the best category for the unit and to give confidence ratings for both the present and recommended category assignments. A unit is considered a maverick if it fits into another category with a better confidence rating than its rating for the category to which it is currently assigned. Because the heuristic nature of the analysis leads to a substantial number of false positives, the mavericks are presented to the architect "worst first." They are sorted in order of confidence in the recommended category (strongest first), and, among mavericks with equal reclassification confidence, in order of confidence in the current classification (weakest first).

The maverick analyzer has been used to review the organization of a modest-sized industrial-strength software system (Lange and Schwanke, 1991). The system studied contained 300 procedures, organized into 27 modules, and 900 distinct cross-reference feature names. At the time of the analysis, the programmer maintaining the code was already planning to clean up its structure, but had not yet done so.
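In code, maverick detection reduces to comparing confidence ratings. The sketch below is ours, reusing the confidence function from the nearest-neighbor sketch in section 3.3; recall that lower confidence values are better.

def mavericks(assign, sim, k=2):
    """Units that fit some other module better, sorted worst-first."""
    cats = set(assign.values())
    found = []
    for unit, current in assign.items():
        best = min(cats, key=lambda c: confidence(unit, c, assign, sim, k))
        if (best != current and
                confidence(unit, best, assign, sim, k)
                < confidence(unit, current, assign, sim, k)):
            found.append((unit, best))
    # worst first: strongest confidence in the recommended category,
    # then weakest confidence in the current one
    found.sort(key=lambda uc: (confidence(uc[0], uc[1], assign, sim, k),
                               -confidence(uc[0], assign[uc[0]], assign, sim, k)))
    return found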
Maverick analysis yielded 51 procedures that were apparently misclassified. An analyst unfamiliar with the code used the maverick list to uncover 24 specific ways in which the code modularity could be improved. Recommended improvements included the following:

• Move a procedure
• Repartition a set of modules
• Add methods to an abstract data type, and use them instead of accessing the representation
• Introduce an interface layer to separate low-level from high-level functionality
• Split a procedure that is straddling two modules
• Replace an erroneous variable reference with the correct one
• Remove dead code
The original maintainer reviewed each of these 24 recommendations and responded in one of four ways:

• Correct (5 cases)
• Helpful (6 cases): the problem was correctly identified, but a more appropriate repair was found
• Redundant (7 cases): the identified problem was due to one of the previous 11 cases
• Incorrect (6 cases)

To assess the potential benefit of learning from mistakes, the analyst then hand tuned the maverick analyzer to improve its performance, by adding five user-defined features, adding four syntactic features that were not derived from cross-references, changing eight feature weights, and marking two procedures as unquestionably belonging to the modules in which they resided. This reduced the maverick list to 23 procedures, including the 11 correct and helpful cases.

These results are promising, but not completely satisfactory. Although maverick analysis provided real benefit to a real project, 80% of the mavericks were useless or redundant. Although hand tuning reduced this number to 50%, the tuning process was difficult and unenlightening. Therefore, it seemed worthwhile to try an automatic tuning process, which, in combination with focused knowledge acquisition from the user, might produce a more useful tool.
5.4. Maverick analysis with learning

Although presenting the mavericks worst-first mitigates the problem of false positives, the architect must eventually review all the mavericks on the list or worry that he has overlooked a problem. What is worse, if he makes changes to the code because of the real mavericks he finds, he may have to re-review the whole maverick list to see if any of the old false positives have become real mavericks. Adding a learning capability to the tool could overcome these problems in two ways:
1. Translating the human's classification decisions into similarity judgments will make them applicable to other potential mavericks after the code has been changed or reorganized. Instead of re-reviewing all mavericks, the human would only have to review those that were not screened out by his previous judgments.
2. By using the architect's judgments as "relevance feedback" during an analysis session, the tool could reorder the maverick list to bring the real mavericks closer to the top of the list.

These two hypotheses have been investigated in a case study using the learning method to improve the quality of maverick analysis for a real architect reorganizing a real system. This section first describes the case study itself, then reports subsequent analysis of data taken from the study.

5.4.1. A case study

The Arch tool was applied to the code of a real software system called TSL (Balcer, Hasling, & Ostrand, 1989). This system is in production use in Siemens operating divisions, and is undergoing active maintenance. It comprises 470 procedures in 33 modules, ranging in size from 1 to 29 procedures. Three modules with fewer than four procedures were excluded from the experiment. The architect was asked to perform four kinds of actions:

1. Move those mavericks which were actually in the wrong module.
2. Identify false-positive mavericks.
3. Identify those units for which nearest-neighbor classification was not the right model to explain his modularization decisions.
4. Supply user-defined features, as necessary, to identify shared design properties that were not represented by cross-reference features.

Time did not permit us to have the programmer actually rewrite any code. The experimental procedure went as follows:

1. Initialize the tool's classifier with feature weights based on the information content of features, as described in section 3.2.3.
2. Have the architect review each of the 10 worst mavericks, either specifying the best module assignment explicitly or removing it from further analysis.
3. Move the mavericks to their new modules as specified.
4. Review the common and distinctive feature data for explicitly classified mavericks, checking for learnability.
5. Have the architect supply features to explain "unlearnable" cases.
6. Extract training data from the user session log and transmit it to the learning component.
7. Train the neural network, as described in sections 4.1 through 4.2, until classification performance on the training data stops improving.
8. Transmit the learned weights back to the maverick analyzer.
9. Generate a revised maverick list.
10. If the list is non-empty, go back to step 2.
Notes:

• Step 4. Experience has shown that, for a module assignment to be learnable, there usually must be at least one feature that the unit shares with some good neighbors that it does not share with its nearest bad neighbors.
• Step 5. In principle, we could have waited until the tool failed to learn the correct classification for the unit before asking for a user-defined feature, but time constraints prevented us from waiting for this verification. From the user's point of view, supplying the feature provides useful documentation anyway, so it is not an unreasonable procedure. This step is actually a carefully focused knowledge acquisition procedure. The architect is asked to focus her attention on just the difficult situations, and to supply just enough information to explain them. This strategy is far more desirable for the architect than supplying large amounts of knowledge a priori, because she can see the immediate benefit of supplying the knowledge: it corrects the tool's mistake.
• Step 6. To extract training data without burdening the architect unduly, most of the training set should be identified automatically. Therefore, the training set was constructed by collecting all the units that appeared to be correctly classified already, with strong confidence, before the first user session began, and adding to them all the mavericks that the architect had reviewed and explicitly classified. By this procedure we hoped to avoid putting "false negatives" in the training data, but this assumption was never directly verified.

The architect required 10 cycles through the experimental procedure before all the mavericks had been examined. His actions on each cycle are summarized in table 3.

Table 3. Architect's actions during maverick analysis.
Session   Mavericks   False   Move   Remove   Repeat   User Features   Checked
1         125         5       5                                        10
2         92          6       4                                        10
3         82          5+6     3      1        1        2               16
4         71          6+3     1+2    1+1      1        1               15
5         65          6+5     3      1        1        1               16
6         56          7+9            1        1        2               18
7         45          6+5            1                 5               12
8         27          5+1            5+3               6               14
9         12          6              4                 4               10
10        3           1                                3               1
Totals    125         53+29   16+2   15+4     4        24              114
Session: During each session, the architect set out to check the ten worst mavericks. The actual count was sometimes more and sometimes less.
Mavericks: The number of mavericks on the maverick list for each session.
False: The number of mavericks that the architect indicated were false positives.
+digit: Sometimes the architect explicitly labeled procedures other than the top ten mavericks, such as when he was providing user-defined features for the maverick and the other members of its cluster. The number of non-top-ten mavericks so labeled is shown as "+digit."
Move: The number of mavericks that the architect reassigned to a new module.
Remove: The number of mavericks that the architect indicated were not appropriate for maverick analysis.
Repeat: The number of mavericks that showed up in the top ten after having been checked previously.
User features: The number of user-defined features added to explain "unlearnable" cases.
Checked: Total number of False, Moved, Removed, and Repeated mavericks.
Several observations about the study are worth mentioning:
• All of the Moved mavericks were found in the first five sessions. The last five sessions served only to teach the tool not to report the false positives. A real user, after finding no more real mavericks in the top ten, might well choose to stop analysis, satisfied that he has found nearly all of the problem procedures.
• In the first five sessions, all of the top-ten mavericks had a reclassification confidence of 2 or less. All of the Moved mavericks had a reclassification confidence of 0 or 1.
• User-defined features were not needed until session 3.
• Nineteen of the mavericks were judged to be procedures for which the nearest-neighbor classification heuristic seemed inappropriate. For example, some of the cases involved one-of-a-kind procedures. There were another 19 procedures that were excluded by very simple heuristics, such as those belonging to modules with three or fewer members, belonging to an imported module, or with fewer than three good or bad neighbors having at least one common feature.
• Eight of the 24 mavericks requiring user-defined features were type-destructor procedures that shared more implementation information with one another than they did with other procedures defined on their own type. Possibly these eight should have been excluded rather than kept and given an extra feature.
• One hundred and fourteen of the 125 mavericks had to be checked explicitly before the tool learned to completely agree with the architect. This seems to indicate that very little generalization was taking place.
• Despite these concerns, the tool did eventually learn to incorporate all the architect's classification judgments.
5.4.2. Analyzing generalization performance

The fact that 114 mavericks had to be checked explicitly before perfect learning was achieved seemed to be due to a very liberal definition of maverick. In order to minimize the risk of false negatives, a large number of false-positive mavericks were occurring. Therefore, to measure learning performance, we decided to treat the maverick list like the result of an information retrieval (IR) query, measuring its performance on a precision/recall graph.

When an IR system produces a set of documents that approximately match a query, some of the retrieved documents are likely to be irrelevant, as judged by the end user. Since retrieval is based on the similarity between the query and each of the documents in the collection, the retrieved list is typically sorted in order of decreasing similarity, since the most-similar documents are presumed to be most likely to be relevant. Depending on the stamina of the end user, she may look at only the first few entries, or half the list, or the whole list.

Information scientists have defined two standard concepts, precision and recall, to measure the quality of a retrieved set of documents. Precision is the fraction of documents in the retrieved set that are relevant. Recall is the fraction of relevant documents that are in the retrieved set. When the retrieved set of documents is ordered by their estimated likelihood of relevance, that ordering can be evaluated by measuring precision and recall for successively longer prefixes of the list. Specifically, data are collected for each prefix of the list that ends with a relevant document. Precision and recall are plotted for each such prefix in a precision/recall graph. Perfect performance, where all the relevant documents were recalled and bunched at the top of the list, would produce a horizontal line at Y = 1.0. Random performance, where all relevant documents were recalled but uniformly distributed throughout the list, would produce a horizontal line at Y = relevant/(relevant + nonrelevant).

Two methods of estimating relevance can be compared by comparing their precision/recall graphs. However, precision and recall depend strongly on both the document collection and the specific query, so comparisons must use the same collections and queries to evaluate both methods. Normally, precision and recall are averaged over a large number of queries. However, it is also valid to compare the performance of two methods on a single query, as long as one does not generalize too much from a single example.

Precision/recall measurement can be used to measure the quality of maverick analysis by treating the maverick list as a sorted list of retrieved documents, sorted by estimated relevance. Relevance is estimated by reclassification confidence and lack of confidence in the current classification. We will compare two sets of parameters for the similarity function by comparing the maverick lists they generate for the same set of data. Various small procedural changes during the course of the case study prevented us from performing meaningful analysis directly on the experimental protocol. However, the protocol did produce a complete list of relevant documents, allowing us to analyze precision and recall with and without learning.
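The prefix-based measurement just described is straightforward to compute. The following sketch is our illustration, not part of the original study's tooling; it emits one (recall, precision) point per relevant entry in a ranked maverick list:

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Compute (recall, precision) at each prefix of a ranked list that ends
    // with a relevant item. rankedRelevance[i] is true if the architect
    // judged the i-th list entry to be a real maverick.
    std::vector<std::pair<double, double>>
    precisionRecallCurve(const std::vector<bool>& rankedRelevance, int totalRelevant) {
        std::vector<std::pair<double, double>> points;
        int relevantSeen = 0;
        for (std::size_t i = 0; i < rankedRelevance.size(); ++i) {
            if (rankedRelevance[i]) {
                ++relevantSeen;
                points.emplace_back(double(relevantSeen) / totalRelevant,   // recall
                                    double(relevantSeen) / double(i + 1));  // precision
            }
        }
        return points;
    }

A perfect ranking yields precision 1.0 at every point, while a uniformly scattered ranking hovers near relevant/(relevant + nonrelevant) throughout, matching the two baselines described above.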
First, figure 6 shows why learning was needed. Out of 125 mavericks, only 16 were actually relevant. Although precision was relatively high near the top of the list, beyond the first five real mavericks, precision dropped off rapidly.

Figure 6. Retrieval performance of the initial, untrained maverick list.

In the first iteration of the experiment, the architect's analyses of the 10 worst mavericks were used to learn new feature weights. To compare performance with and without learning, we performed the specified module reassignments, removed the original top 10 from further consideration as mavericks, and compared the maverick lists constructed with and without the learned feature weights. The results are plotted in figure 7. Here we see that, without learning, precision is low at all recall levels, not much better than "random." Precision with learning is better in all cases, and 3 to 5 times better for recall levels up to 0.63.

Figure 7. Retrieval performance of revised maverick list.

To compare the contribution of feature weights versus gross coefficient values in the similarity function, we tried varying the gross coefficients, both with and without learned feature weights. We found the variation in performance due to the gross coefficients to be negligible compared to the variation due to learning feature weights.

6. Discussion

6.1. Modeling modularization

The case studies reported here support the following hypotheses:
• Nearest-neighbor classification, with similarity measured based on common and distinctive features, is an effective model for a large fraction of the modularization decisions in software systems.
• The accuracy of the model is highly sensitive to individual feature weights.
• The accuracy of the model is relatively insensitive to the gross coefficients of the similarity function. Model performance can be substantially improved by learning from its biggest mistakes, without need for additional features.
• With a modest number of user-defined features, the learning component can adapt perfectly to the architect's judgment.

Future research combining the nearest-neighbor classification heuristic with other kinds of heuristic modularization advice may produce a practical module architecture advisor.

6.2. Related work on software similarity and modularity

Other software engineering research related to the present work falls into the following categories:
• Software similarity: work by Maarek and Kaiser (1987), for example, uses similarity and clustering to group software units in a reuse library. They rejected implementation features as irrelevant to reuse.
• Module and subsystem synthesis: Belady and Evangelisti (1982), Hutchens and Basili (1985), and Chen et al. (1990) have investigated clustering units into modules based on data bindings and data flow connection strength. None of these papers reports validation of the clusters by real maintainers. Maarek and Kaiser (1988) looked at clustering for the purpose of identifying software units that are likely to be modified at the same time. They proposed measuring affinity between software units by a combination of connection strength and how often in the past the units have been changed as part of the same task. Choi and Scacchi (1990) propose synthesizing subsystems based on articulation points in the cross-reference graph.
• Module quality analysis: Selby and Basili (1988) have measured the maintainability of a module by measuring its internal cohesion and external coupling to other modules. Porter and Selby (1990) successfully use this information to help predict the module's error proneness and cost to repair. They automatically construct decision trees from large volumes of real project data. This work is closest in spirit to the present work, in that it applied machine learning techniques to real-world data and measures success by real-world standards. However, the methods of these authors do not identify specific flaws in module quality, nor do they make reorganization recommendations.
6.3. Learning similarity vs. learning classification

The classifier and similarity learning method worked significantly better in the given problem domain than simpler methods that learn categories directly. We attribute the failure of simpler methods to the feature sparsity and category diversity inherent in the problem domain, and to the small number of examples in the training data. The information-hiding principle predicts that very few features will occur widely; most will be rare, indicating that exemplar-style category models would be more appropriate than probabilistic models. The small number of examples of each category also suggested that the actual category members be used as exemplars, rather than a small set of synthesized prototypes representing a larger set of category members. Explicitly representing and learning the similarity function also permitted us to force it to be monotonic and matching, which probably also contributed to its success.
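The "monotonic and matching" shape constraint can be illustrated with a small sketch. This is our illustration, not the paper's exact similarity function; here features are integer ids indexing the learned weight vector w, and k is a gross coefficient:

    #include <unordered_set>
    #include <vector>

    // Hedged sketch: the result is "matching" (only shared features raise it,
    // unshared features lower it) and monotonic in the per-feature weights.
    double similarity(const std::unordered_set<int>& a,
                      const std::unordered_set<int>& b,
                      const std::vector<double>& w, double k) {
        double common = 0.0, distinctive = 0.0;
        for (int f : a) {
            if (b.count(f)) common += w[f];
            else            distinctive += w[f];
        }
        for (int f : b)
            if (!a.count(f)) distinctive += w[f];
        double denom = common + k * distinctive;
        return denom > 0.0 ? common / denom : 0.0;   // in [0, 1]
    }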
6.4. Heuristic design advisors

Several characteristics of the architecture advisor may be useful in other heuristic design advisors:
• The advisor embodies a model of the judgment-based human reasoning process it is advising, rather than merely checking a mechanistic design rule. While a practical tool would incorporate as many such design rules as are useful, adding judgment-based advice extends the usefulness of the tool considerably.
• By providing advice in the form of an ordered list of good alternatives, the tool was actually more useful than if it had given a single recommendation. By examining the alternatives, the architect could better understand why one was recommended over another, even if she chose one of the "lesser" alternatives.
• By providing interactive analysis on the architect's work in progress, the tool played the role of a subservient assistant rather than a demanding master. Whether the architect acted in accordance with or in opposition to the tool's advice, the tool could analyze the new situation, sometimes revise its judgment based on the architect's actions, and identify good alternatives for the next step.
• By acquiring most of its feature data from the design artifacts themselves, the tool was able to provide a useful level of service even before it acquired additional user-defined features and learned from its mistakes.
• Further knowledge acquisition was mistake-driven. The user could know much better what information to supply when she knew what mistake needed correcting. The alternative would have been to expect the user to supply much more information, not knowing which parts of it were important.
• By using its mistakes as "relevance feedback" to reorder the priorities of its other recommendations, the tool significantly improved the quality of its advice.

7. Future work

Clearly, more case studies are needed to confirm or contradict what was found in the TSL study. Such studies are somewhat expensive because they require an engineer knowledgeable about the code to comment on all mavericks, both real and false-positive. We find that an effective combination is to team up the knowledgeable engineer with an experimenter well versed in the information-hiding principle. The experimenter pre-screens the mavericks to point out the ones that are obviously true or false, thereby reducing the engineer's effort considerably.
Next, a method is needed for incorporating learning into the interactive clustering tools. The main difficulty is to automatically extract training data from the user session. We believe that learning similarity rather than learning classification makes the knowledge learned more transferable to new problems. The similarity judgment learned during maverick analysis could be used to
• cluster the members of a module into smaller sub-modules,
• cluster modules into subsystems, and
• reanalyze the structure of the system after adding major enhancements.

Use in reanalysis would require a few extensions to the present work. Suppose that an architect did maverick analysis on a system, including several rounds of learning from relevance feedback, and saved the learned weights. After a major revision to the system, the architect's specific relevance judgments could not be reused without review, because the restructuring might have invalidated some of them. However, the learned weights represent his judgment of the relative significance of different implementation features, and that judgment would not be likely to change radically. Therefore, the learned weights could be used to compute the initial maverick list after reorganization, and then could be readjusted according to further relevance feedback. Some procedure would be needed for calibrating the a priori weight estimates for new features introduced by the enhancements, so that as a group they would be neither more nor less significant than the learned weights carried over from before the enhancements.
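One plausible calibration, offered as our assumption rather than a procedure from the paper, is to rescale the new features' a priori weights so that their group average matches the average learned weight:

    #include <numeric>
    #include <vector>

    // Scale the a priori estimates for features introduced by an enhancement
    // so that, as a group, they are neither more nor less significant than
    // the weights carried over from before the enhancement. Illustrative only.
    void calibrateNewFeatureWeights(const std::vector<double>& learned,
                                    std::vector<double>& estimates) {
        if (learned.empty() || estimates.empty()) return;
        double learnedMean = std::accumulate(learned.begin(), learned.end(), 0.0)
                             / learned.size();
        double estimateMean = std::accumulate(estimates.begin(), estimates.end(), 0.0)
                              / estimates.size();
        if (estimateMean == 0.0) return;            // nothing sensible to scale
        double scale = learnedMean / estimateMean;
        for (double& w : estimates) w *= scale;
    }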
Substantial additional research is needed to create a practical architecture re-engineering tool. Advisory heuristics are needed that apply to variables, types, and macros. Representations are needed for the role of each module in a system, and for the conceptual relationships among modules. Heuristics are needed that relate individual units to the roles of modules, and explanation techniques are needed to present the analyses to the architect.

8. Conclusions

We conclude from the work described here that it is effective to model software modularization as a nearest-neighbor clustering and classification activity. The model supported effective heuristic assistance with that activity, and effective performance improvement by learning and knowledge acquisition. Naturally, similarity-based clustering is not the only principle used in software modularization. However, research on other heuristics can now be based on identifying those cases where the nearest-neighbor heuristic does not apply.

We also conclude that nearest-neighbor classification can be learned effectively by converting classification data to more-similar-than judgments and training the similarity function by back-propagation using the comparison paradigm. We did not do any efficiency studies, whether in time, space, or features needed. However, we believe that feature sparsity is an important characteristic of the problem domain and that nearest-neighbor classification captures more information about the sparse features than is collected in statistical classifiers.

Finally, we hope that the ideas about modeling judgment, giving advice, and acquiring knowledge will prove useful in the creation of other intelligent design assistants.
Notes

1. When the programming language does not have an explicit module construct, the programmer typically uses files to represent his modules. However, our use of the term module specifically does not cover systems where every module or file contains only one procedure.
2. The conference proceedings were not actually published until the following year.
3. The astute programmer will be worrying about the problem of duplicate variable names in different scopes, such as i, j, and k, which are declared many times in large systems, but with unrelated meanings. We restrict our features to non-local names, so that private variables are not considered, and give each distinct software unit a unique name system-wide, so that there is no problem with duplicate names.
4. Since greater values of C imply poorer confidence, one might more intuitively call this a measure of doubt.
References

Balcer, M.J., Hasling, W.M., & Ostrand, T.J. (1989). Automatic generation of test scripts from formal test specifications. Proceedings of the ACM SIGSOFT 1989 Third Symposium on Software Testing, Analysis, and Verification. Key West, FL: ACM Press.
Baum, E., & Haussler, D. (1988). What size net gives valid generalization? In D. Touretzky (Ed.), Advances in neural information processing systems (Vol. 1). Morgan-Kaufmann.
Belady, L.A., & Evangelisti, C.J. (1982). System partitioning and its measure. Journal of Systems and Software, 2(2).
Chapin, N. (1988). Software maintenance life cycle. Proceedings of the Conference on Software Maintenance—1988. IEEE Computer Society Press.
Chen, Y.-F., Nishimoto, M., & Ramamoorthy, C.V. (1990). The C information abstraction system. IEEE Transactions on Software Engineering, 16(3).
Choi, S.C., & Scacchi, W. (1990). Extracting and restructuring the design of large software systems. IEEE Software, 7(1), 66-73.
Hanson, S.J., & Bauer, M. (1989). Conceptual clustering, categorization, and polymorphy. Machine Learning, 3, 343-372.
Hanson, S.J., & Burr, D.J. (1990). What connectionist models learn: Learning and representation in connectionist networks. Behavioral and Brain Sciences, 13, 471-518.
Homa, D. (1978). Abstraction of ill-defined form. Journal of Experimental Psychology: Human Learning and Memory, 4, 407-416.
Horn, K.A., Compton, P., Lazarus, L., & Quinlan, J.R. (1985). An expert system for the interpretation of thyroid assays in a clinical laboratory. Australian Computer Journal, 17(1), 7-11.
Hutchens, D.H., & Basili, V.R. (1985). System structure analysis: Clustering with data bindings. IEEE Transactions on Software Engineering, 11(8).
Lange, R., & Schwanke, R.W. (1991). Software architecture analysis: A case study. Third International Workshop on Software Configuration Management. ACM Press.
Maarek, Y.S., & Kaiser, G.E. (1987). Using conceptual clustering for classifying reusable Ada code. Using Ada: ACM SIGAda International Conference. Special issue of Ada Letters. ACM Press.
Maarek, Y.S., & Kaiser, G.E. (1988). Change management for very large software systems. Seventh Annual International Phoenix Conference on Computers and Communications. Scottsdale, AZ.
Medin, D., & Schaffer, M.M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.
Parnas, D.L. (1972). Information distribution aspects of design methodology. Information Processing 71. Amsterdam: North-Holland.
Parnas, D.L. (1971). On the criteria to be used in decomposing systems into modules (Technical Report No. CMU-CS-71-101). Computer Science Department, Carnegie-Mellon University, Pittsburgh, PA.
Porter, A.A., & Selby, R.W. (1990). Empirically guided software development using metric-based classification trees. IEEE Software, 7(3).
Posner, M.I., & Keele, S.W. (1968). On the genesis of abstract ideas. Journal of Experimental Psychology, 77, 353-363.
Rivest, R., Haussler, D., & Warmuth, M.K. (1989). Proceedings of the Second Annual Workshop on Computational Learning Theory. Morgan-Kaufmann.
Schwanke, R.W., & Platoff, M.A. (1989). Cross references are features. Second International Workshop on Software Configuration Management. Software Engineering Notes, 17(7), ACM Press.
Selby, R.W., & Basili, V.R. (1988). Error localization during software maintenance: Generating hierarchical system descriptions from the source code alone. Conference on Software Maintenance—1988. IEEE Computer Society Press.
Smith, E.E., & Medin, D.L. (1981). Categories and concepts. Cambridge, MA: Harvard University Press.
Sompolinsky, H., & Tishby, N. (1990). Learning from examples in large neural networks. Siemens Computational Learning Theory and Natural Learning Systems Workshop, September 5 & 6, Princeton, NJ.
Tesauro, G., & Sejnowski, T.J. (1989). A parallel network that learns to play Backgammon. Artificial Intelligence, 39, 357-390.

Received September 24, 1990
Accepted July 24, 1992
Final Manuscript December 15, 1992
Chapter 5

ML Applications in Generation and Synthesis

Depending on the domain from which the training data are collected and the nature of the target function to be learned, ML methods can be used to generate or synthesize various types of software products or artifacts. This chapter pertains to the applications of ML methods for software artifact generation or synthesis. Examples include: test data generation, synthesis of test-resource allocation, generation of project management rules or schedules, and synthesis of agent programs, data structures, scripts, or design schemas. Table 25 provides a glimpse of the ML methods used in this application area.

Table 25. ML methods used in generation and synthesis.
[The table maps each artifact type (test cases/data, test resources, project management rules, design repair knowledge, design schemas, data structures, project management schedules, software agents, and programs/scripts) to the ML methods applied to it (NN, IBL, DT, GA, GP, ILP, EBL, CL, BL, AL, IAL, RL, EL, SVM, CBR); the checkmark grid did not survive scanning.]
In this chapter, we include one paper, by Michael, McGraw and Schatz [102]. The paper describes a GA-based approach to test data generation. The proposed approach is based on dynamic test data generation and is geared toward generating condition-decision adequate test sets. Two genetic algorithms, a standard algorithm and a differential algorithm, are utilized in their study. Experimental results are obtained for three approaches (the GA-based generator, the gradient descent generator, and the random test generator) on programs ranging from simple examples to a real-world autopilot control program. The comparison among these approaches indicates some salient features of the GA-based approach (such as global optimization and the serendipitous coverage induced by evolutionary pressure) and its limitations (an expensive search process and more executions of the target program). The authors also point out some other possible test data generation applications for which the GA approach is considered appropriate, as well as open research issues in automatic test data generation systems.
The following paper will be included here: C. Michael, G. McGraw and M. Schatz, "Generating software test data by evolution", IEEE Trans. SE, Vol. 27, No. 12, December 2001, pp. 1085-1110.
Generating Software Test Data by Evolution

Christoph C. Michael, Member, IEEE, Gary McGraw, Member, IEEE, and Michael A. Schatz

Abstract—This paper discusses the use of genetic algorithms (GAs) for automatic software test data generation. This research extends previous work on dynamic test data generation where the problem of test data generation is reduced to one of minimizing a function [1], [2]. In our work, the function is minimized by using one of two genetic algorithms in place of the local minimization techniques used in earlier research. We describe the implementation of our GA-based system and examine the effectiveness of this approach on a number of programs, one of which is significantly larger than those for which results have previously been reported in the literature. We also examine the effect of program complexity on the test data generation problem by executing our system on a number of synthetic programs that have varying complexities.

Index Terms—Software testing, automatic test case generation, code coverage, genetic algorithms, combinatorial optimization.
[The authors are with Cigital Corporation, Suite 400, 21351 Ridgetop Circle, Dulles, VA 20166. Manuscript received 17 Dec. 1997; revised 5 Feb. 1999; accepted 24 Oct. 2000. Recommended for acceptance by D. Rosenblum.]

1 INTRODUCTION

An important aspect of software testing involves judging how well a series of test inputs tests a piece of code. Usually, the goal is to uncover as many faults as possible with a potent set of tests, since a test series that has the potential to uncover many faults is obviously better than one that can only uncover a few. Unfortunately, it is almost impossible to predict how many faults will be uncovered by a given test set. This is not only because of the diversity of the faults themselves, but because the very concept of a fault is only vaguely defined (c.f., [3]). Still, it is useful to have some standard of test adequacy, to help in deciding when a program has been tested thoroughly enough. This leads to the establishment of test adequacy criteria.

Once a test adequacy criterion has been selected, the question that arises next is how to go about creating a test set that is good with respect to that criterion. Since this can be difficult to do by hand, there is a need for automatic test data generation.

Unfortunately, test data generation leads to an undecidable problem for many types of adequacy criteria. Insofar as the adequacy criteria require the program to perform a specific action, such as reaching a certain statement, the halting problem can be reduced to a problem of test data generation. To circumvent this dilemma, test data generation algorithms use heuristics, meaning that they do not always succeed in finding an adequate test input. Comparisons of different test data generation schemes are usually aimed at determining which method can provide the most benefit with limited resources.

In this paper, we introduce GADGET (the Genetic Algorithm Data GEneration Tool), which uses a test data generation paradigm commonly known as dynamic test data generation. Dynamic test data generation was originally proposed by [1] and then investigated further by [2], [4], and [5]. During dynamic test generation, the source code of a program is instrumented to collect information about the program as it executes. The resulting information, collected during each test execution of the program, is used to heuristically determine how close the test came to satisfying a specified test requirement. This allows the test generator to modify the program's input parameters gradually, nudging them ever closer to values that actually do satisfy the requirement. In essence, the problem of generating test data reduces to the well-understood problem of function minimization.

The approach usually proposed for performing this minimization is gradient descent, but gradient descent suffers from some well-known weaknesses. Thus, it is appealing to use more sophisticated techniques for function minimization, such as genetic search [6], simulated annealing [7], or tabu search [8]. In this paper, we investigate the use of genetic search to generate test cases by function minimization.

In the past, automatic test data generation schemes have usually been applied to small programs (e.g., mathematical functions) using simplistic test adequacy criteria (e.g., branch coverage). Random test generation performs adequately on many such problems. Nevertheless, it seems unlikely that a random approach could also perform well on realistic test-generation problems, which often require an intensive manual effort. Indeed, our results suggest that random test generation performs poorly on realistic programs. The broader implication is that, due to their simplicity, toy programs fail to expose the limitations of some test-data generation techniques. Therefore, such programs provide limited utility when comparing different test generation methods. Because GADGET was designed to work on large programs written in C and C++, it is possible for us to examine the effects of program complexity on the difficulty of test data generation.

We examine a feature of the dynamic test generation problem that does not have an analog in most other function minimization problems. If we are trying to satisfy m test requirements for the same software, we have to perform many function minimizations, but the functions being minimized are sometimes quite similar. That makes it
possible to solve one problem by coincidence while trying to solve another. In other words, the test generator can find inputs that satisfy one requirement even though it is searching for inputs to satisfy a different one. On the larger programs we tested, coincidental discovery of test inputs satisfying new requirements was much more common than their deliberate detection (the GAs often satisfied one test requirement while they were trying to satisfy a different one). In fact, the ability of test data generators to satisfy coverage requirements coincidentally seems to play an important role in determining their effectiveness.

Random test generation did not perform well in our experiments. Moreover, our empirical results show an increasing performance gap between random test generation and the more sophisticated test generation methods when tests are generated for increasingly complex programs. This suggests that the additional effort of implementing more sophisticated test generation techniques is ultimately justified.

This paper begins with an overview of automatic test data generation methods (Section 2), followed by an introduction to genetic algorithms in the context of test-data generation (Section 3). In Section 4, we describe our own test-data generation system, GADGET. Finally, in Section 5, we empirically examine the performance of our system on a number of test programs.

2 TEST ADEQUACY CRITERIA AND TEST DATA GENERATION

Some test paradigms call for inputs to be selected on the basis of test adequacy criteria, which are used to ensure that certain features of the source code are exercised (in testing terminology, these features are to be covered by the test inputs). Some studies, such as [9], [10], [11], have concluded that test adequacy does, in fact, improve the ability of a test suite to reveal faults, though [12], [13], [14], [15], among others, describe situations where this is not true. Whether or not test adequacy criteria really measure the quality of a test suite, they are an objective way to measure the thoroughness of testing.

These benefits cannot be realized unless adequate test data (i.e., test data that satisfy the adequacy criteria) can be found. Manual generation of such tests can be quite time-consuming, so it would be appealing to have algorithms that can examine a program's structure and generate adequate tests automatically. It is desirable to have test data generation algorithms that are more powerful in the sense of being more capable of finding adequate tests. Our research addresses this need.

2.1 Code Coverage and Test Adequacy Criteria

Many test adequacy criteria require certain features of a program's source code to be exercised. A simple example is a criterion that says, "Each statement in the program should be executed at least once when the program is tested." Test methodologies that use such requirements are usually called coverage analyses because certain features of the source code are to be covered by the tests. A test adequacy criterion generally leads to a set of test requirements specifically stating the conditions that the tests must fulfill. For example, each statement in a program might be associated with a requirement asking that the statement in question be executed during testing. The example given above describes statement coverage.

A slightly more refined approach is branch coverage. This criterion requires every conditional branch in the program to be taken at least once. For example, supposing we want to obtain branch coverage of the following code fragment:

    if (a >= b) { do one thing }
    else { do something else }

we must satisfy two test requirements: There must be one program input that causes the value of the variable a to be greater than or equal to the value of b, and there must be one that causes the value of a to be less than that of b. One effect of these requirements is to ensure that both the "do one thing" and "do something else" sections of the program are executed.

There is a hierarchy of increasingly complex coverage criteria having to do with the conditional statements in a program. We shall refer to this hierarchy as defining levels of coverage. At the top of the hierarchy is multiple condition coverage, which requires the tester to ensure that every permutation of values for the Boolean variables in every condition occurs at least once. At the bottom of the hierarchy is function coverage, which requires only that every function be called once during testing (saying nothing about the code inside each function). Somewhere between these extremes is condition-decision coverage, which is the criterion we use in our test-data generation experiments.

A condition is an expression that evaluates to TRUE or FALSE, but does not contain any other TRUE/FALSE-valued expressions, while a decision is an expression that influences the program's flow of control. To obtain condition-decision coverage, a test set must make each condition evaluate to TRUE for at least one of the tests and make it evaluate to FALSE for at least one of the tests. Furthermore, the TRUE and FALSE branches of each decision must be exercised. Put another way, condition-decision coverage requires that each branch in the code be taken and that every condition in the code be TRUE at least once, and FALSE at least once.

With any of these coverage criteria, we must ask what to do when an existing test set fails to meet the chosen criterion. In many cases, the next step is to try to find a test set that does satisfy the criterion. Since it can be quite difficult to manually search for test inputs satisfying certain requirements, test data generation algorithms are used to automate this process.
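As a concrete illustration of condition-decision coverage (our example, not one from the paper), consider a decision built from two conditions:

    if (a < 10 && done) { take_action(); }

Condition-decision coverage requires tests in which a < 10 evaluates to TRUE and to FALSE, tests in which done evaluates to TRUE and to FALSE, and tests in which the decision as a whole takes both its TRUE and FALSE branches. Multiple condition coverage would instead demand all four combinations of the two conditions.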
2.2 Previous Work in Test Data Generation

The term "test generation" is commonly applied to a number of diverse techniques. For example, tests may be generated from a specification (in order to exercise features of the specification), or they may be generated from a state model of software operation (in order to exercise various states or combinations of states).

For a program with a complex graphical user interface, test generation may simply consist of finding tests that
exercise all aspects of the interface. On the other hand, many software systems package a diverse collection of services and, for such packages, it is often considered sufficient to ensure that most or all services are used during testing. In such cases, it is often straightforward to find inputs that exercise a given feature and, so, a test generator only has to list the features or combinations of features that are to be tested. In other words, the term "test data generation" sometimes only refers to the process of coming up with concrete test criteria.

Unfortunately, some test criteria are harder to satisfy. If we want to satisfy code coverage criteria or exercise some other precisely defined aspect of program semantics, it may be far from obvious what program inputs satisfy a given criterion. This paper is concerned with this case: We are given a set of test adequacy criteria and the goal is to find test inputs that cause the criteria to be satisfied.

There are many existing paradigms for this type of automatic test data generation. Perhaps the most commonly encountered are random test data generation, symbolic (or path-oriented) test data generation, and dynamic test data generation. In the next three sections, we will describe each of these techniques in turn. The GADGET system we describe in this paper is a dynamic test generator. In our experiments, we use random data generation as a baseline for comparison.

2.2.1 Random Test Data Generation

Random test data generation simply consists of generating inputs at random until a useful input is found. The problem with this approach is clear: With complex programs or complex adequacy criteria, an adequate test input may have to satisfy very specific requirements. In such a case, the number of adequate inputs may be very small compared to the total number of inputs, so the probability of selecting an adequate input by chance can be low. This intuition is confirmed by empirical results (including those reported in Section 5). For example, [16] found that random test generation was outperformed by other methods, even on small programs where the goal was to obtain statement coverage. More complex programs or more complex coverages are likely to present even greater problems for random test data generators. Nonetheless, random test data generation makes a good baseline for comparison because it is easy to implement and commonly reported in the literature.

2.2.2 Symbolic Test Data Generation

Many test data generation methods use symbolic execution to find inputs that satisfy a test requirement (e.g., [17], [18], [19]). Symbolic execution of a program consists of assigning symbolic values to variables in order to come up with an abstract, mathematical characterization of what the program does. Thus, ideally, test data generation can be reduced to a problem of solving an algebraic expression.

A number of problems are encountered in practice when symbolic execution is used. One such problem arises in indefinite loops, where the number of iterations depends on a nonconstant expression. To obtain a complete picture of what the program does, it may be necessary to characterize what happens if the loop is never entered, if it iterates once, if it iterates twice, and so on ad infinitum. In other words, the symbolic execution of the program may require an infinite amount of time.

Test data generation algorithms solve this problem in a straightforward way: The program is only executed symbolically for one control path at a time. Paths may be selected by the user, by an algorithm, or they may be generated by a search procedure. If one path fails to result in an expression that yields an adequate test input, another path is tried.

Loops are not the only programming constructs that cannot easily be evaluated symbolically; there are other obstacles to a practical test data generation algorithm based on symbolic execution. Problems can arise when data is referenced indirectly, as in the statement:

    a = B[c+d] / 10;

Here, it is unknown which element of the array B is being referred to by B[c+d] because the variables c and d are not bound to specific values.

Pointer references also present a problem because of the potential for aliasing. Consider that the C code fragment:

    *a = 12;
    *b = 13;
    c = *a;

results in c taking the value 12 unless the pointers a and b refer to the same location, in which case, c is assigned the value 13. Since a and b are not bound to numeric values during symbolic execution, the final value in c cannot be determined. Technically, any computable function can be computed without the use of pointers or arrays, but it is not normal practice to avoid these constructs when writing a program. Thus, although array and pointer references are not a theoretical impediment to the use of symbolic execution, they complicate the problem of symbolically executing real programs.

2.2.3 Dynamic Test Data Generation

A third class of test data generation paradigms is dynamic test data generation, introduced in [1] and exemplified by the TESTGEN system of [2], [16], as well as the ADTEST system of [5]. This paradigm is based on the idea that if some desired test requirement is not satisfied, data collected during execution can still be used to determine which tests come closest to satisfying the requirement. With the help of this feedback, test inputs are incrementally modified until one of them satisfies the requirement. For example, suppose that a hypothetical program contains the condition

    if (pos >= 21) ...

on line 324 and that the goal is to ensure that the TRUE branch of this condition is taken. We must find an input that will cause the variable pos to have a value greater than or equal to 21 when line 324 is reached. A simple way to determine the value of pos on line 324 is to execute the program up to line 324 and then record the value of pos.
Let pos324(x) denote the value of pos recorded on line 324 when the program is executed on the input x. Then, the function

    f(x) = 21 - pos324(x),  if pos324(x) < 21;
    f(x) = 0,               otherwise

is minimal when the TRUE branch is taken on line 324. Thus, the problem of test data generation is reduced to one of function minimization: To find the desired input, we must find a value of x that minimizes f(x).

In a sense, the function f (which we will also call an objective function) tells the test generator how close it is to reaching its goal. If x is a program input, then the test generator can evaluate f(x) to determine how close x is to satisfying the test requirement currently being targeted. The idea is that the test generator can now modify x and evaluate the objective function again in order to determine what modifications bring the input closer to meeting the requirement. The test generator makes a series of successive modifications to the test input using the objective function for guidance and, hopefully, this leads to a test that satisfies the requirement (in fact, f can only be said to provide heuristic information, as will become apparent when we discuss the construction of f in Section 4.3).
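In code, such an objective might look like the following sketch, where runInstrumented is a hypothetical instrumentation hook of our own, not part of TESTGEN, ADTEST, or GADGET:

    // Branch-distance objective for "pos >= 21" on line 324 (cf. the
    // formula above). runInstrumented() is assumed to execute the target
    // program on input x and report the value pos held at line 324.
    struct Input { int fields[4]; };        // stand-in for the program's inputs

    int runInstrumented(const Input& x);    // returns pos324(x)

    double objective(const Input& x) {
        int pos = runInstrumented(x);
        return pos < 21 ? 21.0 - pos : 0.0; // zero exactly when pos >= 21
    }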
In the TESTGEN system of [2], the minimization of f(x) begins by establishing an overall goal, which is simply the satisfaction of a certain test requirement. The program is executed on a seed input, and its behavior on this input is used as the basis of a search for a satisfactory input (that is, if the seed input is not satisfactory itself).

The subsequent action depends on whether the execution reaches the section(s) of code where the test requirement is supposed to hold (for example, whether it reaches line 324 in the example above). If it does, then function minimization methods can be used to find a useful input value.

If the code is not reached, a subgoal is created to bring about the conditions necessary for function minimization to work. The subgoal consists of redirecting the flow of control so that the desired section of code will be reached. The algorithm finds a branch that is responsible (wholly or in part) for directing the flow of control away from the desired location and attempts to modify the seed input in a way that will force the control of execution in the desired direction. The new subgoal can be treated in the same way as other test requirements. Thus, the search for an input satisfying a subgoal proceeds in the same way as the search for an input satisfying the overall goal. Likewise, more subgoals may be created to satisfy the first subgoal. This recursive creation of subgoals is called chaining in [2] and [4].

Korel's approach is advantageous when there is more than one path that reaches the desired location in the code. The test data generation algorithm is free to choose whichever path it wants (as long as it can force that path to be executed), and some paths may be better than others. For example, suppose we want to take the TRUE branch of the condition

    if (b > 10) ...

but suppose that b has a default value of 3. It may be that one execution path gives b a new value, while a different path simply leaves the value of b alone. As long as we only take the path that leaves b with its default value, we will never be able to make the condition TRUE; no choice of inputs can make the default value of b be anything other than 3. Therefore, the test generation algorithm must know how to select the execution path that assigns a new value to b. In the TESTGEN system, heuristics are used to select the path that seems most likely to have an impact on the target condition.

In the ADTEST system of [5], an entire path is specified in advance, and the goal of test data generation is to find an input that executes the desired path. Since it is known which branch must be taken for each condition on the path, all of these conditions can be combined in a single function whose minimization leads to an adequate test input. For example, if the desired path requires taking the TRUE branch of the condition

    if (b >= 10) ...

on line 11 and taking the FALSE branch of the condition

    if (c > 8) ...

on line 13, then one can find an adequate test input by minimizing the function f1(x) + f2(x), where

    f1(x) = 10 - b11, if b11 < 10 on line 11;  0, otherwise;
    f2(x) = c13 - 8,  if c13 > 8 on line 13;   0, otherwise.

(Here, c13 and b11 are actually functions of the input value x.) Unfortunately, this function cannot be evaluated until line 11 and line 13 are both reached. Therefore, the ADTEST system begins by trying to satisfy the first condition on the path, adding the second condition only after the first condition has been satisfied. As more conditions are reached, they are incorporated in the function that the algorithm seeks to minimize.

Another test generation system relevant to our work is the QUEST/Ada system of [20], [21]. This is a hybrid system combining random testing and dynamic testing for Ada code. Once the code is instrumented and ranges and types of input variables have been provided, the system creates test data using rule-based heuristics. For example, values of parameters are adjusted according to one such rule to increase or decrease by a fixed constant percentage. The test adequacy criterion chosen by Chang et al. is branch coverage. The system creates a coverage table for all branches and marks those that have been successfully covered. Table 1 provides an example of such a branch table. The table is consulted during analysis to determine which branches to target for testing. Partially covered branches are always chosen over completely noncovered branches.

Although QUEST/Ada does not use the dynamic test generation paradigm we have been describing, the coverage table of [21] is relevant to our research because it provides a strategy for dealing with the situation where a desired condition is not reached. Instead of picking a particular
condition, as TESTGEN does, or picking a particular path, like ADTEST, this strategy is opportunistic and seeks to cover whatever conditions it can reach. Although this is inefficient when one only wants to exercise a certain feature of the code under test, it can save quite a bit of unnecessary work if one wants to obtain complete coverage according to some criterion. We developed the coverage-table strategy independently and use it in our test-generation system.

TABLE 1. A Sample Coverage Table after Chang. [The grid itself did not survive scanning; it lists each branch with TRUE and FALSE decision columns, marking covered outcomes with an X.] This table is generated from a program flow chart. The table provides information regarding the covered branches and directs future test case generation. We adapt this approach for our generator as well.

In [22], [23], [24], simulated annealing is used in conjunction with dynamic test generation in much the same way that we use genetic algorithms. These papers only report results for small programs, but they show how dynamic test generation can be applied to numerous test generation problems other than the satisfaction of structural coverage criteria.

The GADGET test generation system, which we discuss in this paper, is a dynamic test generation system like TESTGEN and ADTEST, but it uses genetic search to perform optimization, instead of the gradient descent techniques used by TESTGEN and ADTEST; the advantages of this will be discussed in Section 3.1. In [25], we presented some preliminary results on the performance of an early prototype and, in [26], we examined the performance of the GADGET system using a number of different optimization techniques, including genetic search and simulated annealing.

2.3 Contributions of This Paper

The research described in this paper addresses two limitations commonly found in dynamic test-data generation systems. First, many systems make it difficult to generate tests for large programs because they only work on simplified programming languages. Second, many systems use gradient descent techniques to perform function minimization and, therefore, they can stall when they encounter local minima (this problem is described below in greater detail).

Limited program complexity is a drawback of TESTGEN. It can only be used with programs written in a subset of the Pascal language. Aside from problems of practicality, the problem with such limitations is that they prevent one from studying how the complexity of a program affects the difficulty of generating test data. The unchallenging demands of simple programs can make simple schemes like random test generation appear to work better in comparison to other methods than they actually do.

Both TESTGEN and the QUEST/Ada system use gradient descent to minimize the objective function. This technique is limited by its inability to recognize local minima in the objective function (see Section 3.1). Our main goal is to overcome this limitation.

In this paper, we report on GADGET, a test generation system designed for programs written in C or C++. GADGET automatically generates test data for arbitrary C/C++ programs, with no limitations on the permissible language constructs and no requirement for hand-instrumentation. (However, GADGET does not generate meaningful character strings unless those strings represent numbers, and it does not generate meaningful values for compound data-types, even though such values can be regarded as inputs if they are read from an external file.) We report test results for a program containing over 2,000 lines of source code, excluding comments. To our knowledge, this is the largest program for which results have been reported. (Although [5] reported that their system had been run on programs as large as 60,000 lines of source code, no results were presented.) The ability to generate tests for programs using all C/C++ constructs has the added benefit of allowing us to study the effects of program complexity on the difficulty of test data generation. Some experimental results were also presented in [26], but the current paper also presents a self-contained explanation of GADGET's underlying test-generation paradigm.

The GADGET system uses genetic algorithms to perform the function minimization needed during dynamic test data generation. In this respect, it differs from the TESTGEN and ADTEST systems, which use gradient descent. The advantage of using genetic algorithms is that they are less susceptible to local minima, which can cause a test-generation algorithm to halt without finding an adequate input.

Genetic algorithms were used in a different way by [27]. That system judges test inputs according to how "interesting" they are, according to user-defined criteria for what is interesting. Thus, although that system is used for generating software tests, it is a test generator in a different sense than our system. It does not strive to satisfy specific test requirements and, thus, it does not need to use semantic information about the target program (in contrast, semantic information is crucial for the other test generation techniques we have discussed).

The most frequently cited advantage of genetic algorithms, when they are compared to gradient descent methods, is that genetic algorithms are less likely to stall in a local minimum, that is, a portion of the input space where f(x) appears to be minimal but is not. There is also a second advantage when several paths to the desired location are available. Unlike gradient descent methods, which must concentrate on a single path, the implicit parallelism of genetic algorithms allows them to examine many paths at once. This presents a partial solution to the path-selection problem described in Section 2.2.3.

Certain limitations are common to all dynamic test generation systems, including our own. Existing systems are limited to programs whose inputs are scalar types
205
1090
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 27, NO. 12, DECEMBER 2001
3.1 Function Minimization Recall that, in dynamic test generation , the target progra m is repeatedly executed on a series of test inputs. This is done to evaluate an objective function SJ, which heuristicall y tells the test generator how close each input has come to satisfying the test requirement that is currently targeted , This lets the test generator make successive modifications to the test input, bringing it ever closer to satisfying the requirement. This processis equivalent to conducting a minimization of the function 3 . Thus, the ability to numerically minimize a function value is the key to dynamic test data generation . The standar d minimization technique, used by [1], [2], [5],is gradien t descent. It essentially works by making successive changes to a test input in such a way that each change decreases the value of the objective function. Insofar as the objective function tells us how close the input is to satisfying the selected test requirement, each modification b r i n g s m e i n p u t c l o s e r t o m eeting that requirement, T h e minimization method suggested in [16] is a form of g r a d i e n t d e s c e n t . Small changes in one input value ar e made ^ . ^ tQ d e t e r m i n e a d d i r e c t i o n f or m a k i , J• m o v es w h e n ^ d j r e c t i o n {g fouJ . . , , . ., ..,.,. .. .., increasingly larg e steps ar e taken in that direction until no <• , . . . . . . . ... , , ^ r t h e r l m P r o v e m e n t is obtained (in which case, the search be lns S ***" w l t h s m a 1 1 l n P u t modifications), or until the input no longer reaches the desired location (in which case, t h e m o ve is t r i e d a a i n w i t h a s m a l l e r ste s i z e S P )' W h en n o riher ^ progress can be made, a different input value is modified, and the process terminates when no more progress can be made for any input value, Reference [5] also uses gradien t descent; specifically,the system described there uses a quasi-Newton technique (see [28]). Gradien t descent can fail if a local minimum is encountered. A local minimum occurs when none of the changes of input values that ar e being considered lead to a decrease in the function value and, yet, the value is not globally minimized. The problem arises because it is only possible to consider a limited number of input values (i.e., a small section of the search space) due to resourc e limitations. The input values that ar e considered may suggest that any change of values will cause the function's value to increase, even when the curren t value is not truly minimal. This situation is illustrate d in Fig. 1. In other area s where optimization is used, the problem of GENERATION 3 GENETIC ALGORITHMS FOR TEST DATA i oca i minima has led to the development of function GENERATION minimization methods that do not blindly pursue the s mulated In dynamic test data generation, the problem of finding s t e e P e , s t S r a d i e n t " N o t a b l f am ,°? 8 ^ e s e a r e } a, M. •. • J A i. n • iui \A \ annealin g [7], tabu search [81, [291,and genetic algorithms 6 6 software tests is reduced to an optimization problem. Most r _ . „ & ' J ' . T , . l " l " . . . . . . . , ., .. . . i, .., L[6] , l [30], [31]). In this r paper , we apply two a genetic rY } existing techniques solve the optimization rproblem with ." .., ' \ " ,, /. ' , . .. . . . ~ i , . . . . algorithms to the problem of test data generation . 
Gradient descent can fail if a local minimum is encountered. A local minimum occurs when none of the changes of input values that are being considered lead to a decrease in the function value and, yet, the value is not globally minimized. The problem arises because it is only possible to consider a limited number of input values (i.e., a small section of the search space) due to resource limitations. The input values that are considered may suggest that any change of values will cause the function's value to increase, even when the current value is not truly minimal. This situation is illustrated in Fig. 1.

In other areas where optimization is used, the problem of local minima has led to the development of function minimization methods that do not blindly pursue the steepest gradient. Notable among these are simulated annealing [7], tabu search [8], [29], and genetic algorithms ([6], [30], [31]). In this paper, we apply two genetic algorithms to the problem of test data generation.

3.2 Genetic Algorithms

A genetic algorithm (GA) is a randomized parallel search method based on evolution. GAs have been applied to a variety of problems and are an important tool in machine learning and function optimization. References [30] and [31] give thorough introductions to GAs and provide lists of possible application areas. The motivation behind genetic algorithms is to model the robustness and flexibility of natural selection.
1. The exceptions are GADGET itself, for which preliminary results were published in [25], and the system described in [22], which was published after the original preparation of this paper.
Fig. 1. Illustration of a local minimum. An algorithm tries to find a value of x that will decrease f(x), but there are no such values in the immediate neighborhood of the current x. The algorithm may falsely conclude that it has found a global minimum of f.
In a classical GA, each of a problem's parameters is represented as a binary string. Borrowing from biology, an encoded parameter can be thought of as a gene, where the parameter's values are the gene's alleles. The string produced by the concatenation of all the encoded parameters forms a genotype. Each genotype specifies an individual, which is in turn a member of a population. The GA starts by creating an initial population of individuals, each represented by a randomly generated genotype. The fitness of individuals is evaluated in some problem-dependent way, and the GA tries to evolve highly fit individuals from the initial population. In our case, individuals are more fit if they seem closer to satisfying a test requirement; for example, if the goal is to make the value of the variable pos greater than or equal to 21 on line 324, then an input that results in pos having the value 20 on line 324 is considered more fit than an input that gives it the value -67.

The genetic search process is iterative: evaluating, selecting, and recombining strings in the population during each iteration (generation) until some termination condition is reached. (In our case, a success leads to termination of the search, as does a protracted failure to make any forward progress. This is a relatively common arrangement.) The basic algorithm, where P(t) is the population of strings at generation t, is:

    initialize P(t)
    evaluate P(t)
    while (termination condition not satisfied) do
        select P(t+1) from P(t)
        recombine P(t+1)
        evaluate P(t+1)
        t = t + 1
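A minimal C++ rendering of this loop is sketched below. Because evaluation, selection, and recombination are problem-dependent, they appear here as caller-supplied functions; every name is ours, not GADGET's:

    #include <functional>
    #include <utility>
    #include <vector>

    using Genotype = std::vector<bool>;  // concatenated encoded parameters
    struct Individual { Genotype genotype; double fitness = 0.0; };
    using Population = std::vector<Individual>;

    struct GenerationalGA {
        std::function<Population()> initialize;
        std::function<void(Population&)> evaluate;               // assign fitness
        std::function<Population(const Population&)> select;     // pick parents
        std::function<void(Population&)> recombine;              // crossover + mutation
        std::function<bool(const Population&, int)> terminated;  // success or stall

        Population run() const {
            int t = 0;
            Population p = initialize();      // initialize P(t)
            evaluate(p);                      // evaluate P(t)
            while (!terminated(p, t)) {
                Population next = select(p);  // select P(t+1) from P(t)
                recombine(next);              // recombine P(t+1)
                evaluate(next);               // evaluate P(t+1)
                p = std::move(next);
                t = t + 1;
            }
            return p;
        }
    };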
In the first step, evaluation, the fitness of each individual is determined. Evaluation of each string (individual) is based on a fitness function that is problem-dependent. Determining fitness corresponds to the environmental determination of survivability in natural selection, and, in our case, it is determined by the fitness function described in Section 2.2.3. The next step, selection, is used to find two individuals that will be mated to contribute to the next generation. Selection of a string depends on its fitness relative to that of other strings in the population. Most often, the two individuals are selected at random, but each individual's probability of being chosen is proportional to its fitness. This is known as roulette-wheel selection. Thus, selection is done on the basis of relative fitness. It probabilistically culls from the population individuals having relatively low fitness. The third step is crossover (or recombination), which fills the role played by sexual reproduction in nature. One type of simple crossover is implemented by choosing a random point in a selected pair of strings (encoding a pair of solutions) and exchanging the substrings defined by that point, as shown in Fig. 2.
Fig. 2. Single-point crossover of the two parents A and B produces the two children C and D. Each child consists of parts from both parents leading to information exchange.
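Fitness-proportional selection and single-point crossover can also be sketched concretely. This is a minimal illustration under our own assumptions (an 8-bit genotype and std::mt19937 as the random source), not GADGET's implementation:

    #include <random>
    #include <utility>
    #include <vector>

    struct Individual { unsigned genotype; double fitness; };

    // Roulette-wheel selection: each individual is chosen with probability
    // proportional to its (nonnegative) fitness; std::discrete_distribution
    // normalizes the weights for us.
    const Individual& rouletteSelect(const std::vector<Individual>& pop,
                                     std::mt19937& rng) {
        std::vector<double> weights;
        for (const Individual& ind : pop) weights.push_back(ind.fitness);
        std::discrete_distribution<std::size_t> wheel(weights.begin(),
                                                      weights.end());
        return pop[wheel(rng)];
    }

    // Single-point crossover of two 8-bit genotypes as in Fig. 2: the bits
    // above the crossover point are exchanged between the two parents.
    std::pair<unsigned, unsigned> crossover(unsigned a, unsigned b, int point) {
        unsigned high = (0xFFu << point) & 0xFFu;  // exchanged part
        unsigned low = ~high & 0xFFu;              // retained part
        return { (a & low) | (b & high), (b & low) | (a & high) };
    }

For example, crossover(0x5B, 0x60, 5) exchanges the top three bits of 01011011 and 01100000 and yields 123 and 64, matching the offspring computed in the worked example of Section 3.3.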
In addition to evaluation, selection, and recombination, genetic algorithms use mutation to guard against the permanent loss of gene combinations. Mutation simply results in the flipping of bits within a genome, but this flipping of bits only occurs infrequently.

The individuals in the population act as a primitive memory for the GA. Genetic operators manipulate the population, usually leading the GA away from unpromising areas of the search space and toward promising ones, without the GA having to explicitly remember its trail through the search space [32].

It is easiest to understand GAs in terms of function optimization. In such cases, the mapping from genotype (string) to phenotype (point in search space) is usually trivial. For example, in order to optimize the function f(x) = x, individuals can be represented as binary numbers encoded in normal fashion. In this case, fitness values would be assigned by decoding the binary numbers. As crossover and mutation manipulate the strings in the population, thereby exploring the space, selection probabilistically filters out strings with low fitness, exploiting the areas defined by strings with high fitness. Since the search for individuals with higher fitness is not restricted to a localized region of the objective function, this search technique is not subject to the problems associated with local minima, which were described above.

3.2.1 Differential Genetic Search

A second genetic algorithm that we have also used for generating software tests is the differential GA described in [33]. Here, an initial population is constructed as above. Recombination is accomplished by iterating over the inputs in the population. For each such input I, three mates, A, B, and C, are selected at random. A new input I' is created according to the following method, where we let Ai denote the value of the ith parameter in the input A and, likewise, for the other inputs: for each parameter value of the new input I', we let the ith value be Ai with probability p, where p is a parameter to the genetic algorithm. With probability 1 - p, we let it be Ai + a(Bi - Ci), where a is a second parameter of the GA. If I' results in a better objective function value than I, then I' replaces I; otherwise, I is kept.

This procedure can be thought of as an operation on l-dimensional vectors. First, we generate a new vector by adding a weighted difference of B and C to A. Then, we perform k-point crossover between A and the newly generated vector, obtaining our result. This is illustrated in Fig. 3. Here, the random selection has come out in such a way that the first, third, and fourth elements of A, namely, A1, A3, and A4, are copied directly to the result; perhaps p was about 2/3. But the second element of the result has the value A2 + 0.4(B2 - C2) and the fifth element has the value A5 + 0.4(B5 - C5).

Fig. 3. An illustration of how individuals are combined by a differential GA. Three individuals are chosen as parents and each element of the resulting individual is either copied from the first parent or combined from elements of all three parents.
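A sketch of this recombination rule follows, under the reconstruction above (copy the element of the first mate A with probability p, otherwise combine elements of all three mates). The function name, the use of std::mt19937, and the vector-of-doubles input type are our assumptions:

    #include <random>
    #include <vector>

    using Input = std::vector<double>;

    Input differentialRecombine(const Input& I, const Input& A, const Input& B,
                                const Input& C, double p, double alpha,
                                std::mt19937& rng) {
        std::bernoulli_distribution copyFromA(p);
        Input trial(I.size());
        for (std::size_t i = 0; i < I.size(); ++i) {
            // With probability p, copy the element of the first parent
            // directly; otherwise combine elements of all three parents.
            trial[i] = copyFromA(rng) ? A[i] : A[i] + alpha * (B[i] - C[i]);
        }
        return trial;  // the caller keeps trial only if it yields a better
                       // objective-function value than I
    }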
3.3 An Example of Genetic Search in Dynamic Test Data Generation
Our example of test generation by genetic search is based on the following simple program:

    main(int argc, char **argv)
    {
        int a = atoi(argv[1]);
        if (a > 100)
            puts("hello world\n");
    }
The program input is read in by the statement int a = atoi(argv[1]), which takes the first argument from the command line and assigns it to the variable a. (Although the test generator only generates scalar input values, those values are represented as text when the program needs them in that format.) Suppose the test requirement is simply to exercise the puts statement, causing "hello world" to be printed. The objective function will be based on the value of the variable a in the if statement, since this variable determines whether or not the puts statement is reached. The source code is instrumented in a way that causes the value 100 - a to be sent to the test generator when the if statement is executed. This is the value of the objective function F.

When the genetic algorithm is invoked, it begins by generating an initial population of test inputs; each input will be treated as one individual by the genetic algorithm. For this example, we assume four individuals are generated with the values 94, 91, 49, and -112, respectively. The value returned during the four test executions is used to determine the fitness of each of the four inputs. A smaller number indicates that the test input is closer to satisfying the criterion, but, in genetic search, larger numbers have traditionally been associated with greater fitness, and we will adopt that convention here. Therefore, the fitness of each input is obtained by taking the inverse of the heuristic value returned when the target program is executed. For the inputs -112, 49, 91, and 94, the instrumented program returns 222, 51, 9, and 6, respectively, to the execution manager. Therefore, the respective fitnesses of the four inputs are 1/222, 1/51, 1/9, and 1/6.

Once the fitness of each test input has been evaluated, the reproduction phase of the genetic algorithm begins. Two inputs are selected for reproduction according to their fitness. This is accomplished by normalizing the fitness values and using them as the respective probabilities of choosing each input as a parent during reproduction. The probabilities of selecting -112, 49, 91, and 94 are thus 0.015, 0.065, 0.368, and 0.552, respectively. With only four individuals, there is a significant probability of selecting the same individual twice, but many genetic algorithms have explicit mechanisms that prevent this. We will assume the algorithm in our example does so as well.
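As a quick check of the arithmetic, the selection probabilities quoted above can be reproduced from the objective values returned by the instrumented program. This small program is ours, not part of GADGET:

    #include <cstdio>

    int main() {
        const int inputs[] = {-112, 49, 91, 94};
        const double objective[] = {222.0, 51.0, 9.0, 6.0};  // values from the text
        double fitness[4];
        double total = 0.0;
        for (int i = 0; i < 4; ++i) {
            fitness[i] = 1.0 / objective[i];  // fitness is the inverse objective
            total += fitness[i];
        }
        for (int i = 0; i < 4; ++i)
            std::printf("input %4d: fitness 1/%.0f, selection probability %.3f\n",
                        inputs[i], objective[i], fitness[i] / total);
        // prints probabilities 0.015, 0.065, 0.368, and 0.552, as in the text
    }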
Suppose the two parents selected are 91 and 94, the two inputs with the largest probabilities. Reproduction is accomplished by representing each number in binary form, 01011011 and 01100000, respectively. A crossover point is selected at random (suppose it is 3), and crossover is performed by exchanging the first three bits of the two parents. The resulting offspring are 01000000 and 01111011, or 64 and 123. Since reproduction continues until the number of offspring is the same as the original size of the population, two more inputs are selected as parents. Suppose they are -112 and 91 and the crossover point is 5. If the inputs are converted to bit-strings using 8-bit, two's complement representation, then they are 10010000 and 01011011, respectively. Crossover produces the offspring 10010011 and 01011000, or -109 and 88.

In summary, the second generation consists of the individuals 64, 123, -109, and 88. When the program is executed on these inputs, it is found that one of them satisfies the test requirement, so the test generation process is complete (for that requirement). If none of the tests had met the requirement, the reproduction phase would have begun again using the four newly generated inputs as parents. The cycle would have been repeated until the test requirement was satisfied or until the GA halted due to insufficient progress. (Note that we would have been in trouble if all four original test inputs had had zeros in the first two binary digits. Then, no combination of crossovers could have created a satisfactory test input and we would have had to wait for an unlikely mutation. This is part of the importance of diversity, mentioned in Section 5.1 when we discuss the adjustment of the GA's parameters.)

3.4 Other Optimization Algorithms

Once the underlying framework for dynamic test data generation has been implemented, it is straightforward to add other optimization techniques. One such technique is simple gradient descent. We implemented two gradient descent algorithms: Polak-Ribiere conjugate gradient descent [28] and a reference algorithm which is slow but makes no assumptions about the shape of the objective function.

Conjugate gradient descent is best suited for optimization problems with continuous parameters, and integer parameters lead to a myriad of small plateaus where the algorithm can stall. That is, the algorithm may make such small adjustments to an integer-valued parameter that there is no effect on the program's behavior. (Continuous-valued parameters from the optimization algorithm were converted to integers where necessary by rounding.) To solve this problem, we interpolate the objective function, but our technique seeks to execute the program as infrequently as possible and is somewhat crude in other respects. For the input values x1, x2, ..., xn found by the gradient descent algorithm, we calculate the two values F(y1, y2, ..., yn) and F(z1, z2, ..., zn), where

    yi = xi,          if xi is a continuous parameter;
    yi = rnd(xi),     if xi is an integer parameter,

and

    zi = xi,             if xi is a continuous parameter;
    zi = rnd(xi) + 1,    if xi is an integer parameter and xi > rnd(xi);
    zi = rnd(xi) - 1,    if xi is an integer parameter and xi < rnd(xi),

where rnd denotes rounding to the nearest integer. (Our implementation rounds 0.5 down.) The interpolated value is then ...
... have a higher objective-function value than the seed. In this way, the algorithm implements a technique, also used by conjugate gradient descent, of bracketing a minimum in the objective function and then conducting an increasingly refined search for the minimum within the bracket.
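The construction of the rounded points y and z defined above transcribes directly into code. In this sketch, the Param type is our assumption, and the case where an integer parameter already sits exactly on an integer (which the definition leaves open) defaults to rnd(x) - 1:

    #include <cmath>
    #include <vector>

    struct Param { double value; bool isInteger; };

    // Round to the nearest integer, with 0.5 rounded down as in the text.
    double rnd(double x) { return std::ceil(x - 0.5); }

    // y: each integer parameter replaced by its nearest integer.
    // z: each integer parameter pushed to the next integer on the far side of x.
    void neighborPoints(const std::vector<Param>& x,
                        std::vector<double>& y, std::vector<double>& z) {
        y.clear();
        z.clear();
        for (const Param& p : x) {
            if (!p.isInteger) {
                y.push_back(p.value);
                z.push_back(p.value);
                continue;
            }
            double r = rnd(p.value);
            y.push_back(r);
            z.push_back(p.value > r ? r + 1 : r - 1);  // r - 1 also covers x == rnd(x)
        }
    }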
4 THE GENETIC ALGORITHM APPLIED TO TEST DATA GENERATION

In this section, we describe our approach to test data generation by genetic search. Recall from Section 2.2.3 that our approach is based on the conversion of the test-data generation problem to a function minimization problem, which allows a genetic algorithm to be applied. In the previous section, we described the genetic algorithm itself, and we now describe its application to the problem of automatic test data generation. We begin with an overview of the algorithm, and then we discuss a number of issues in greater detail, including the implementation of the fitness function and the technique for ensuring that the GA can reach the location where the fitness function is evaluated.
4.1 Overview of the Test Data Generation Algorithm
The goal for the GAs is to find a set of tests that satisfy condition-decision coverage (see Section 2.1). This leads to two test requirements for every condition in the code, namely, that the condition be true for at least one input and false for at least one input. Condition-decision coverage also requires that each branch of each decision be taken at least once, but this requirement is satisfied by any test set that meets the requirements above.

Before starting the GA, we execute the program on a seed input. Such a seed input typically satisfies many of the test requirements (see Section 5.2 for a discussion of this). The initial program execution is used to initialize the coverage table. After this, the coverage table is used to select a series of test requirements in turn. For each test requirement, the GA is initialized and attempts to satisfy the given requirement. Due to the limitation on the number of iterations (the GA must make some progress every n iterations for some n), the GA is guaranteed to halt, either because a solution has been found or because the GA has given up.

Whenever the GA generates an input that satisfies a new test requirement, whether or not that test requirement is the one the GA is currently working on, the new test input is recorded for future use (see below), and the coverage table is updated. We also record how many times the target program was executed before the new input was found. The test generation system continues to iterate over the test requirements until no further progress can be made. This happens when one attempt has been made to satisfy every reachable requirement. (A requirement is considered unreachable if no test input executes the code location where the requirement is evaluated. In other words, the test requirement refers to a condition that the GA cannot reach. The coverage table can be used to determine when there are no more reachable requirements; see Section 4.2.)

The performance of each test generator is measured as the percentage of test requirements it has satisfied. To determine this percentage, the inputs that are found by the test generator are evaluated by a commercial coverage analysis tool (DeepCover), together with the original seed input. This lets us plot the percentage of requirements satisfied as a function of the number of times the target program has been executed.

The two details that must be addressed are the same as those described in Section 2.2.3: we must find a way to reach the code location where we want our test requirement to be satisfied, and we must convert that requirement into a function that can be minimized.
4.2 Reaching the Target Condition

Recall that, in dynamic test generation, function minimization cannot be performed unless the flow of control reaches a certain point in the code. For example, if we are seeking an input that exercises the TRUE branch of a condition in line 954 of a program, we need inputs that reach line 954 before we can begin to do function minimization. Our approach is slightly different from that of [2] and [5], which concentrate on finding a specific path to the desired location. Our goal is (among other things) to cover all branches in a program. This means we can simply delay our attempts to satisfy the test requirements associated with a certain condition until we have found tests that reach that condition.

This leads to a test generation approach similar to the one employed by [21]. A table is generated to keep track of the conditional branches already covered by existing test cases, as sketched below. If neither branch of a condition has been taken, then that decision has not been reached, so we are not ready to apply function minimization to that condition. If both branches have been taken, then coverage is satisfied for that condition and we need not examine it further. However, if only one branch of a condition has been exercised, then the condition has been reached, and it is appropriate to apply function minimization in search of an input that will exercise the other branch.
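A minimal sketch of such a table follows, with our own types and method names rather than GADGET's actual data structures:

    #include <map>

    struct ConditionStatus { bool seenTrue = false; bool seenFalse = false; };

    class CoverageTable {
        std::map<int, ConditionStatus> table;  // condition id -> outcomes seen
    public:
        void record(int conditionId, bool outcome) {
            ConditionStatus& s = table[conditionId];
            (outcome ? s.seenTrue : s.seenFalse) = true;
        }
        // Reached but not fully covered: exactly one branch has been taken,
        // so this condition is a candidate for function minimization.
        bool isCandidate(int conditionId) const {
            auto it = table.find(conditionId);
            if (it == table.end()) return false;  // never reached
            return it->second.seenTrue != it->second.seenFalse;
        }
        bool isCovered(int conditionId) const {
            auto it = table.find(conditionId);
            return it != table.end() && it->second.seenTrue && it->second.seenFalse;
        }
    };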
4.3 Calculation of Fitness Functions

Recall from Section 2.2.3 that dynamic test generation involves reducing the test generation problem to one of minimizing a fitness function F. The first step is to define F. Like other dynamic test generation techniques, ours begins by instrumenting the code under test. The purpose of this instrumentation is to allow us to calculate the fitness function by executing the instrumented code.

At each condition in the code, our system adds instrumentation to report F(x) when execution reaches that condition. Table 2 shows how F(x) is calculated for some typical relational operators when we are seeking to take the TRUE branch of a condition (the functions for the FALSE branch are analogous). If the program's execution fails to reach the desired location (it terminates or times out without having executed the statement), then the fitness function takes its worst possible value.

Our experiments seek to generate condition-decision adequate test sets, so conjunctions and disjunctions can be handled by exploiting C/C++ short-circuit evaluation. For example, the second clause of the condition
    if ((c > d) && (c < f)) ...
is not reached unless the first clause evaluates to TRUE, so the requirement that states the first and second clauses must both be TRUE is replaced by a requirement stating that the second clause must be reached and must evaluate to TRUE. If both clauses are reached and both clauses take on both the value TRUE and the value FALSE (as required by condition-decision coverage), then both branches of the conditional branch will necessarily have been taken.

TABLE 2
Computation of the Fitness Function

    decision type      example           fitness function
    inequality         if (c >= d) ...   F(x) = d - c, if d > c; 0, otherwise
    equality           if (c == d) ...   F(x) = |d - c|
    true/false value   if (c) ...        F(x) = 1000, if c = FALSE; 0, otherwise
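The rows of Table 2 transcribe directly into code. This sketch covers the TRUE branch only, with the functions for the FALSE branch obtained analogously; the function names are ours:

    #include <cmath>

    // if (c >= d): zero once the TRUE branch can be taken, positive otherwise.
    double fitnessGreaterEqual(double c, double d) {
        return d > c ? d - c : 0.0;
    }

    // if (c == d): the distance between the two sides.
    double fitnessEqual(double c, double d) {
        return std::fabs(d - c);
    }

    // if (c): a Boolean offers no gradient, so a fixed penalty is used
    // (the constant 1000 is the one appearing in Table 2).
    double fitnessBoolean(bool c) {
        return c ? 0.0 : 1000.0;
    }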
4.4 Execution Control

In the GADGET system, an execution controller is in charge of running the instrumented code, coordinating GA searches, and collecting coverage results and new test cases. It begins by executing all preliminary test cases. (These preliminary cases can be supplied by the user or generated randomly by the execution controller.) After running all initial test cases, the execution controller uses the coverage table to find a condition that can be reached, but has not been covered yet (that is, no input has made the condition TRUE or else no input has made it FALSE). The genetic algorithm is invoked to make this condition take on the value that was not already observed. The GA is seeded with test cases that can successfully reach the condition (though they did not give the condition the desired value, or else the condition would already have been covered). If additional inputs are needed to get the required population size, they are generated randomly.

When the GA terminates, either because it found a successful test case or because it stopped making progress, the execution controller uses the coverage table to find a new condition that has not been covered completely. The GA is called again with the task of finding an input that covers this condition. This process continues until all conditions that have had only one value (either TRUE or FALSE) have been subjected to GA search.

The execution controller keeps track of all GA-generated test inputs that cover new program code, regardless of whether or not they satisfy the test requirement that the GA is currently working on. (In other words, GADGET takes advantage of all serendipitous coverages.) These test cases are stored for later use.
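The controller loop just described can be sketched as follows. The helper functions are assumed to be implemented elsewhere (they stand in for the coverage table, the instrumented execution, and the GA itself), and all of the names are ours:

    #include <optional>
    #include <vector>

    struct TestInput { std::vector<double> values; };

    // Assumed helpers: run the instrumented program and update the coverage
    // table; query the table; look up saved inputs; run one GA search.
    void runOnTarget(const TestInput& t);
    std::optional<int> uncoveredReachableCondition();
    bool missingOutcome(int condition);  // the TRUE/FALSE value not yet seen
    std::vector<TestInput> seedInputsReaching(int condition);
    void runGA(int condition, bool target, std::vector<TestInput> seeds);

    void executionController(const std::vector<TestInput>& preliminary) {
        for (const TestInput& t : preliminary)
            runOnTarget(t);  // preliminary cases initialize the coverage table
        // Repeatedly pick a condition that has been reached with only one
        // outcome and ask the GA for the missing outcome, seeding it with
        // inputs that already reach the condition.
        while (std::optional<int> cond = uncoveredReachableCondition())
            runGA(*cond, missingOutcome(*cond), seedInputsReaching(*cond));
    }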
4.5 Computational Costs

Since dynamic test generation is a heuristic process, we cannot give a universal characterization of its computational cost. To be specific, the optimization algorithm makes iterative improvements to a test input while trying to make that input satisfy the chosen requirements, and we cannot predict exactly how many iterations there will be.

However, there are a number of factors that influence the computational cost of each iteration. The most significant cost is that of executing the target program; recall that this program has to be executed in order to evaluate the objective function. Each new generation created by the genetic algorithm contains a number of new test inputs, and the objective function has to be evaluated for each one. Thus, the cost of dynamic test generation depends intimately on the cost of executing the target program. In addition, larger programs typically have more conditional statements, so, if we try to obtain complete coverage of a program, as we did in the experiments described in Section 5, the number of different optimizations the GA has to perform depends on the program size as well.

The second factor that determines the efficiency of test generation is the optimization algorithm itself. For example, the genetic algorithm's ability to escape local minima comes at the cost of additional resource expenditures, compared to gradient descent. In other words, the genetic algorithm makes more iterations than gradient descent would make, and it executes the target program more often.

The experiments we report on in Sections 5.3 and 5.4 required anywhere from 30 minutes to several hours on a Sun Sparc-10 workstation, though the simple programs in Section 5.2 were less expensive. Based on results reported in [4] and elsewhere, this genetic search may be considerably more expensive than gradient descent would be (but see below). Ultimately, such expense would be justified by the difficulty of generating test data by hand, which makes any automated technique desirable, especially for programs with a complex structure.

It is interesting to note that computational overhead seems to play a considerable role in the time needed to perform dynamic test generation. When running the target program on a given test input, the test generator has to invoke the program (using the Unix vfork and exec system calls in the case of GADGET) and wait for the results (GADGET uses wait). For all of our programs, even the largest, computation time was significantly affected by the expense of the Unix vfork/exec/wait calls needed to execute the target program. We empirically estimated that our operating system is capable of about 3,200 vfork/exec/wait operations per minute. In contrast, [4] reports being able to perform 200,000 to 700,000 tests in five minutes (using random test data generation). Further computational overhead comes from the fact that GADGET is written in object-oriented C++ and was not optimized for these experiments. This (together with the timing results of [4]) suggests that it is possible to obtain much more efficient operation.
Apparently, platform-dependent computational overhead greatly affects the performance of test-generation algorithms. Therefore, it seems that the number of executions of the target program is a better measure of a test generator's resource requirements than wall-clock or system time. Reference [4] does not report on the number of executions used by their nonrandom techniques and, although it appears that their gradient descent technique was much cheaper than our genetic search, their reduced overhead costs call this conclusion into question, at least if computational expense is measured in terms of the number of executions of the target program.

In the next section, we do, indeed, measure the expense of test generation by the number of target-program executions. For reasons discussed above, this seems to be the most logical measure of efficiency for a dynamic test generator: It abstracts away the cost of actually running the target program itself, as well as the overhead of calling the target program and communicating with it.

4.6 Tuning the GA

When a test is performed using one of the two genetic algorithms described in Section 3.2, the GA is first tuned by adjusting the number of individuals in the population and the number of iterations that must elapse with no progress before the GA gives up. In the standard GA, mutation is controlled by adjusting the probability that any single bit will be flipped during reproduction. The goal of this fine-tuning was to maximize the percentage of conditions covered while keeping the execution time low. A second goal was to ensure that the differential and standard GAs executed the target program about the same number of times in order to get a reasonable comparison between the two techniques.

Such tuning adjustments control the breadth and the thoroughness of the GA's search. If there are more individuals, then there are more distinct inputs that can be created by the crossover operation; in a sense, more genetic diversity is available. On the other hand, if we make the GA continue for more iterations before giving up, then it is less likely to give up simply because progress is slow. In some cases, the GA is visibly on the path to a successful input, but it gives up because an unlucky series of crossovers fails to improve the fitness of the population. The likelihood of this is reduced if more iterations are permitted.

Unfortunately, both of these improvements have a cost. Allowing more iterations means the GA will waste more time trying to cover conditions that it cannot cover. Adding more individuals means that the target program is executed more often during each iteration because a target-program execution is required every time we evaluate the fitness of an individual.
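Gathered into one place, the knobs discussed in this subsection look roughly like the following; the structure and field names are illustrative, not GADGET's actual configuration format:

    struct GATuning {
        int populationSize;             // breadth: more distinct crossover results
        int maxStagnantGenerations;     // patience before the GA gives up
        double bitMutationProbability;  // chance of flipping any single bit
    };

For example, the standard GA runs on the simple programs of Section 5.2 used 100 individuals, 15 stagnant generations, and a mutation probability of 0.001.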
5 EXPERIMENTAL RESULTS

In this section, we report on four sets of test data generation experiments. The first set of experiments involves programs that calculate simple numeric functions. The second experiment investigates how the GAs and random test generation perform on increasingly complex synthetic programs. Finally, we present results obtained by analyzing a real-world autopilot control program called b737.

5.1 Design of the Experiments

In setting up our experiments, we began by selecting a program on which to try out the test generation system; we refer to such a program as a target program. By means of source-code instrumentation, the target program is connected to an execution manager and configured to run in tandem with it; the execution manager monitors the program's behavior. The target program is instrumented by augmenting each condition with additional code. This code reports the objective-function value for each condition to the execution manager. (The objective functions are calculated as described in Section 4.3.) The program that performs this instrumentation also collects the information needed to construct the coverage table described in Section 4.2.

Several parameters of the GA were the same throughout all of our experiments. For the differential GA, the parameter p (see the end of Section 3.2) was always 0.6. The parameter a was always 0.5. The probability of mutation, number of individuals, and number of iterations permitted without progress were different for different test programs. We report the results for each of the programs we tested and give the actual values we used below. (See Section 4.6 for a discussion of how these parameters affect the performance of the GA.)

An additional parameter of the test generator is the random number seed used to create pseudorandom numbers by means of a linear congruential generator (cf. [34]). The following describes the utilization of pseudorandom numbers and, hence, the impact of the choice of a random number seed in each of the test generators; a sketch of such a generator follows the list.

• Random test generation. All inputs are generated pseudorandomly. In all of our experiments, the parameters were numeric, and inputs were generated by selecting a value for each parameter uniformly at random from the range specified in a program-specific configuration file.

• Gradient descent. The initial inputs are randomly generated using the same technique as the random test data generator. Since our implementation supports it readily, we randomly select two starting points and perform the two descents in tandem with one another. The step size has a pseudorandom element as well, as outlined earlier at the end of Section 3.4.

• Standard genetic search. The seed inputs are created pseudorandomly, as are new population members when these are needed. Each input is represented as a string of bits, and inputs are generated by setting each bit to 1 with probability 0.5. Crossover points are selected uniformly at random from the potential crossover points in the bit-string, and the decision of whether or not to mutate a bit is based on a randomly generated number as well.

• Differential genetic search. Pseudorandom numbers are used to create the seed input and to create new population members as above and, in addition, the random number generator drives the random choices used in reproduction (see Section 3.2).
We executed the test generation system described in Section 4 using each of the genetic algorithms described in Section 3.2. A single run of the test generation system involves the following steps:

1. Selection of a target program for which tests will be generated.
2. Specification of a test adequacy criterion. In these experiments, the test adequacy criterion is always condition-decision coverage.
3. Enumeration of the test requirements that the adequacy criterion induces on the target program.
4. An attempt by the test generator to satisfy all of the test requirements for the program.

We performed several complete test-generation runs with each test generation system and for each of the several test programs. (That is, each test generation system made several attempts to satisfy the test requirements associated with each program.) The exact number of runs varies between test programs, and it is given below when we describe our results for each of the programs. During each run of the test generation system, the system saves the original seed input, and it also saves any test input that satisfied a requirement not already satisfied by an earlier input.

After each run of the test generator, information collected during the run is used to assess the test generator's performance. Performance is measured in terms of the percentage of test requirements that were satisfied. To do this, the seed input is fed to a commercial coverage analysis tool (DeepCover), along with the inputs that were saved by the test generator because they satisfied new coverage requirements. This tells us how many requirements were satisfied by these inputs.

Each time the test generator saves a test input, it records how many times the target program was executed before that input was found. This lets us plot the test generator's performance as a function of the number of target-program executions, which is what we do in most of our experiments.

In addition to running the two GA-based test generators, we also apply the random test generator to each target program. We run the random test generator as many times as the two other algorithms. For each run of the random test generator, we execute the program on randomly generated input values and record the inputs that satisfy new requirements as above. The number of such executions is equal to the largest number of times the program was executed by any of the GAs. We plot the performance of the random test generator in the same way as the performance of the GAs.

5.1.1 Input Representation in the GAs

In the standard GA, test inputs are represented as a contiguous string of bits, and crossover is accomplished by selecting a random position within the bit-string. (In our experiments, the binary representation was obtained by using the machine encoding for floating-point parameters and Gray-coding others.) Representing the input as a series of bits does not work for the differential GA (see Section 3.2) because its method of reproduction is not well-suited to variables with only two values. Therefore, the differential GA sees each input as a series of variables corresponding to the input parameters of the target program. The description of the input parameters (ranges, types, etc.) is provided to the test generation system by the user, and that description is used both by the standard GA and the differential GA when test inputs are formatted for use by the target program.

5.2 Simple Programs

We began our experiments on a set of simple functions much like those reported in the literature [35], [21], [4]. The programs analyzed are as follows:

• binary search,
• bubble sort,
• number of days between two dates,
• euclidean greatest common denominator,
• insertion sort,
• computing the median,
• quadratic formula,
• Warshall's algorithm, and
• triangle classification.

These programs are roughly of the same complexity, averaging 30 lines of code and all having relatively simple decisions.

When we tested GADGET on these programs, we used 30 individuals for the differential GA and allowed 10 generations to elapse with no progress before allowing the GA to give up. For the standard GA, we used 100 individuals, allowed 15 generations to elapse before stopping the GA, and made the probability of mutation 0.001.

For these programs, random test data generation never outperforms genetic search, though sometimes both approaches have the same effectiveness. These results resemble those reported in [21] and [4]; in those papers, random test generation also performed nearly as well as more sophisticated techniques on simple programs. Random test case generation has the upper hand in these experiments because it involves significantly less computation. However, in every case, one of the GAs performs the best overall. Table 3 shows the results. The numbers reported in the table represent the highest percentage of test requirements satisfied by any single run of the test generator among a series of five such runs.

It is useful to analyze a sample program to understand what the GAs are doing. The code for triangle classification is shown below. (The comments on the right margin will be used later to show which inputs satisfied which conditions.) Note that many decisions only contain a single condition.
TABLE 3
A Comparison of Four Test Data Generation Approaches on Simple Programs

    Program                                  random
    Binary search                            80
    Bubble sort 1                            100
    Bubble sort 2                            100
    Number of days between two dates         87.0
    Euclidean greatest common denominator    100
    Insertion sort                           100
    Computing the median                     100
    Quadratic formula                        75
    Warshall's algorithm                     91.7
    Triangle classification                  48.6
The standard GA exhibits the best performance overall, but not by a significant amount. These data represent the highest percentage obtained by each method during a series of five attempts to obtain complete coverage of the program.
    #include <stdio.h>

    int triang (int i, int j, int k) {
        // returns one of the following:
        // 1 if triangle is scalene
        // 2 if triangle is isosceles
        // 3 if triangle is equilateral
        // 4 if not a triangle
        if ((i <= 0) || (j <= 0) || (k <= 0)) return 4;       // acd
        int tri = 0;
        if (i == j) tri += 1;                                 // g
        if (i == k) tri += 2;                                 // h
        if (j == k) tri += 3;                                 // i
        if (tri == 0) {                                       // bef
            if ((i+j <= k) || (j+k <= i) || (i+k <= j))       // be
                tri = 4;
            else
                tri = 1;                                      // f
            return tri;
        }
        if (tri > 3) tri = 3;
        else if ((tri == 1) && (i+j > k)) tri = 2;
        else if ((tri == 2) && (i+k > j)) tri = 2;            // h
        else if ((tri == 3) && (j+k > i)) tri = 2;
        else tri = 4;
        return tri;
    }

    void main () {
        printf("enter 3 integers for sides of triangles\n");
        int a, b, c;
        scanf("%d %d %d", &a, &b, &c);
        int t = triang(a, b, c);
        if (t == 1) printf("triangle is scalene\n");          // f
        else if (t == 2) printf("triangle is isosceles\n");   // h
        else if (t == 3) printf("triangle is equilateral\n");
        else if (t == 4) printf("this is not a triangle\n");  // abcdegi
    }

Fig. 4 shows how the four different systems (standard GA, differential GA, gradient descent, and random test generation) perform on the triangle program shown above.

5.2.1 Coverage Plots for Test Generation Problems

In Fig. 4, the number of test requirements satisfied by each test generation system is plotted against the number of executions of the target program. Here, as well as in our later results, the standard GA and differential GA did not
execute the program the same number of times. (In general, the standard GA tended to satisfy individual test requirements more quickly.) The random test generator was run last, and the number of program executions allotted to it was the maximum number of program executions needed by either of the other two systems. This creates favorable conditions for the random test generator and helps us determine when the use of a more complex test generation system is actually justified.

The plot in Fig. 4 shows features that appear throughout our test generation experiments. First, there is a large, immediate jump in the percentage of test requirements
Fig. 4. A coverage plot comparing the performance of the two GAs, gradient descent, and random test generation on the triangle program. The graph shows how performance (in terms of the percentage of the 20 test requirements that have been satisfied) improves as a function of the number of times the program is executed. Random test generation hits its peak early, but fails to improve after that. The differential GA has a better performance and executes for a longer amount of time, but the standard GA has the best performance overall, covering about 93 percent of the code on average in about 8,000 target-program executions. Gradient descent performs nearly as well as the standard GA. The curve for each system represents the pointwise mean performance of that system taken over five attempts by that system to achieve complete condition-decision coverage of the program.
TABLE 4
A Table of Sample Input Cases Generated by the Standard GA for triangle

    Key    Integer 1      Execution of target program
    a      1680498885     2
    b      1293470477     3
    c      -120192928     4
    d      841354299      6
    e      1056804119     20
    f      719320455      117
    g      743820356      5311
    h      999699718      10751
    i      799340978      16800
These data can be mapped to conditional expressions in the code shown above using the Key field.
satisfied, so that the interesting part of the vertical axis does not start at zero, but somewhere closer to 50 percent. Second, the percentage of requirements satisfied by the GAs seems to make discontinuous jumps.

The initial jump in coverage is there because the first test input satisfies many conditions. When the program executes the first time, it has to take either the TRUE branch or the FALSE branch on any condition it encounters and, each time a new condition is found to be true or false, the percentage of test requirements satisfied goes up. For example, if a program has no nested decisions and if each decision has exactly one condition, then the first program execution is forced to take either the TRUE branch or the FALSE branch of every condition in the program. For such a program, we would always achieve 50 percent coverage using only the first test.

Many of the discontinuities that appear in Fig. 4 are there for similar reasons. When an input causes the program to take a branch that had not been taken before, it may lead to the execution of other conditional statements that were not executed previously. Once again, each condition must be either TRUE or FALSE, and this leads to an instantaneous increase in the number of conditions satisfied. (Apparent discontinuities can also occur when there are few conditions in the program because then the granularity of the vertical axis is low. However, this is not the cause of the salient jumps in the coverage plots we present here. Readers may visually judge the granularity of the plots by looking at small features, such as the shallow increments that occur in the plot for the standard GA at 43 percent and 78 percent on the vertical axis.)

A further cause of discontinuities is serendipitous coverage. The GAs often satisfied one test requirement by coincidence when they were trying to satisfy a different requirement. In fact, the shorter execution time of the standard GA results largely from this phenomenon; less exertion was required of the GA because so many requirements were met serendipitously. We find this to be a recurring phenomenon in our experiments and have more to say about it in Section 5.5.4.

5.2.2 The GAs' Choices of Test Inputs

A sample of results obtained by the GA test data generation algorithm is shown in Table 4. These data can be mapped to the source code shown above by using the letters shown in the comments.
The tests in Table 4 are probably not like those that a human tester would choose, much less those that would occur in a hypothetical world where this program was used by the general public. Of course, the ability to create bizarre tests can sometimes be an advantage. Coverage criteria are often used to exercise obscure and infrequently utilized features of a program, which testers might otherwise overlook. However, it might be desirable to concentrate on inputs that are more realistic. This leads to an interesting digression on dynamic test generation techniques: Unlike the static methods described in Section 2.2.2, dynamic test generation allows certain inputs to be preferred over others by means of biases in the objective function. For example, bizarre or uncommon input combinations could be penalized by raising the value of the objective function for those inputs. One could even construct an operational profile (cf. [36]), allowing each input to be weighted according to its probability of occurring in the field. Similar biases are used in [22] to create a preference for test inputs close to boundary values.

5.2.3 Performance of Gradient Descent

In view of the above discussion, it is perhaps unsurprising that the performance of gradient descent was generally somewhere between that of the genetic algorithms and that of random test generation. In many cases, gradient descent failed because of flat spots in the objective function where there is no information to guide the algorithm's search. This was the case with the binary search program, for example. However, gradient descent appears to encounter a local minimum in the triangle program. This can be observed in the behavior of the reference algorithm for gradient descent, which empirically estimates the gradient at each step of the optimization process. In the triangle program, the reference algorithm reaches a point where all adjacent points (as defined by the neighborhood function) lead to a worsening of the objective-function value. This leads to a reduction of the mean step size (as described in Section 3.4), but the same phenomenon is encountered again, and this process is repeated until the algorithm gives up. The optimization process is empirical rather than analytic, so this does not prove the existence of a local minimum, but it provides strong evidence. (Of course, this might be prevented by a clever neighborhood function, but finding a neighborhood function that is guaranteed to eliminate local minima is a nontrivial matter.)
The test criterion whose objective function contained the local minimum was satisfied by the standard GA, albeit serendipitously.

5.2.4 Performance of the Random Test Generator

Our results for random test case generation resemble those reported elsewhere for cases where the code being analyzed was relatively simple. For the simple programs in Table 3, the different test generation techniques have roughly the same performance most of the time. In [21], random test data generation was reported to cover 93.4 percent of the conditions on average. Although its worst performance, on a program containing 11 decision points, was 45.5 percent, it outperformed most of the other test generation schemes that were tried. Only symbolic execution had better performance for these programs. Reference [4] reports on 11 programs averaging just over 100 lines of code. Overall, random test data generation was fairly successful, achieving 100 percent statement coverage for five programs and averaging 76 percent coverage on the other six.

It is also interesting to compare our results with those obtained by [16] for three slightly larger programs. Again, simple branch coverage was the goal. Random test generation achieved 67 percent, 68 percent, and 79 percent coverage, respectively, on the three programs analyzed. Symbolic test generation achieved 63 percent, 60 percent, and 90 percent coverage, while dynamic test generation achieved 100 percent, 99 percent, and 100 percent coverage.

These results suggest a common trend: Random test generation has at least an adequate performance on such simple programs but, for larger programs or more demanding coverage criteria, its performance deteriorates.
5.3 The Role of Complexity

In our second set of experiments, we created synthetic programs with conditions and decisions whose characteristics we controlled. The two characteristics we were interested in controlling were: 1) how deeply conditional clauses were nested (we call this the nesting complexity) and 2) the number of Boolean conditions in each decision (which we call the condition complexity). For example, a program with no nested conditional clauses (nesting complexity = 0) would look like the beginning part of the triangle program, in that if statements are not nested inside other if statements. The nature of the conditional expressions in each conditional clause is controlled by the second parameter (the condition complexity). The decision ((i <= 0) || (j <= 0) || (k <= 0)) from the triangle program ranks as a 3 on this scale because it contains three conditions. In this section, we will use these two parameters to characterize the complexity of our synthetic programs. We will use the expression compl(x, y) as shorthand for "nesting complexity x and condition complexity y." In our experiments, programs were generated with all complexities compl(nest, cond), nest ∈ {0, 3, 5}, cond ∈ {1, 2, 3}. In this section, we present the results for compl(0,1), compl(3,2), and compl(5,3), which illustrate the widening gap between the performance levels of the three test generators as program complexity grows. The results for the other six programs are given in the appendix.

Note that, in some cases, changing the nesting complexity has the same effect as changing the condition complexity: changing

    if (cond1 && cond2) ...

to

    if (cond1)
        if (cond2) ...

does not change the semantics of a program in C or C++, but it does change the nesting and condition complexities. However, in our synthetic programs, there is always additional code between two nested conditions, as in the illustrative fragment below. Thus, a high nesting complexity implies a more complicated relationship between the input parameters and the variables that appear in deeply nested conditions. On the other hand, a high condition complexity implies that many conditions are evaluated in the same program location, meaning that the test generator has to find a single set of variable values that satisfies many simple conditions at the same time.
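To make the two parameters concrete, the following fragment (our illustration, not one of the actual synthetic programs) has nesting complexity 2 and condition complexity 2, with intervening computation between the nested decisions:

    int synthetic(int a, int b) {
        if ((a > 10) && (b < 5)) {       // depth 1, two conditions
            int c = a * b - 7;           // additional code between conditions
            int d = c % 13;
            if ((c > 20) && (d != 3)) {  // depth 2, two conditions
                return 1;
            }
        }
        return 0;
    }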
In these tests, the differential GA had a population size of 30 and abandoned a search if no progress was made in 10 generations. The standard GA also gave up after 10 generations made no progress, but the population size was adjusted so that the standard GA executed the target program about the same number of times as the differential GA. This resulted in a population size of 270 for compl(0,1), 320 for compl(3,2), and 340 for compl(5,3). The large population sizes make up for the standard GA's tendency to give up early; this is discussed in Section 5.3.1. The mutation probability for the standard GA was 0.001.

Figs. 5, 6, and 7 show the mean performance over six runs of each test generation technique. (Note that the vertical axis, showing the percentage of test requirements satisfied, has different ranges in different plots. We have focused on the interesting portion of each plot.)

For the simplest program, random test generation and the GAs quickly achieve high coverage. All three methods make most or all progress during the very early stages of test generation. In these early stages, the GAs have essentially random behavior because evolution has not begun yet. The standard GA sometimes satisfied additional requirements closer to the end of the process. The results are shown in Fig. 5, whose horizontal scale is logarithmic to show both of the features just mentioned.
Fig. 5. The random test generator and GAs perform relatively equally on the least complex program. compl(0,1); 100 percent coverage represents coverage of 60 conditions.
Fig. 6 shows results for a program of intermediate complexity. Random test generation does not do as well as before. When we examined the details of each execution, we found that the standard GA performed well because it often managed to satisfy new criteria serendipitously; that is, it often found an input satisfying one criterion while it was searching for an input that satisfied another. It is likely that the differential GA performs a more focused search than the standard GA, and our examination of the detailed results showed that it failed to find as many inputs by coincidence.

Finally, Fig. 7 shows the results for a program with compl(5,3). Here, all test generation methods are less effective than before, and there is more to distinguish among the different techniques. In most cases, the standard GA performed better than the differential GA. The results for conjugate gradient descent and the reference algorithm are shown in Table 5. We do not show the results for gradient descent in the coverage plots because the disparity in the horizontal scale makes them
Fig. 7. The most complex program was hard for all generators, but the GAs outperform random test generation by even more. The standard GA ultimately outperforms the other two methods. compl(5,3); 45 conditions were to be covered.
hard to see; conjugate gradient descent is considerably faster than the other techniques, while the reference algorithm is somewhat slow (in any case, we expect faster performance by conjugate gradient descent a priori, so a comparison of performance over time is not as informative as it is with the two GAs). However, the table shows the total condition-decision coverage obtained by gradient descent for each of the three programs, compl(0,1), compl(3,2), and compl(5,3). For compl(0,1), the gradient descent algorithms perform comparably to the other techniques. For compl(3,2), the performance of gradient descent is somewhere between that of the standard GA and that of the differential GA. For compl(5,3), both GAs outperformed gradient descent.

Like the other optimization algorithms, gradient descent encountered problems because of flat regions in the objective function. This was the most frequent cause of failures to satisfy a test criterion. However, the reference algorithm apparently encountered one local minimum in the compl(0,1) program, one in the compl(3,2) program, and two in the compl(5,3) program. (We concluded this on the basis of the same type of empirical evidence as in Section 5.2.) For the programs with compl(5,3) or compl(3,2), the GAs also failed to satisfy the test criteria for which there were local minima in the objective function.

TABLE 5
Condition-Decision Coverage Achieved by Conjugate Gradient Descent and the Reference Gradient Descent Algorithm for compl(0,1), compl(3,2), and compl(5,3)

            compl(0,1)   compl(3,2)   compl(5,3)
    CGD        95.27        70.65        29.58
    ref.       94.03        75.0         39.3

The curves are not shown since their scale is different from that of the other curves, but conjugate gradient descent was faster than the other algorithms (as expected). The two algorithms have comparable performance. Gradient descent generally performed somewhat below genetic search and comparably to differential genetic search.
Fig. 6. As complexity increases, the GAs begin to do better and the standard GA outperforms the differential GA. compl(3,2); there are a total of 45 conditions to cover.
The test criterion with the local minimum in the compl(0,1) program was satisfied by the GAs some of the time, but only serendipitously. This suggests that the global optimization capability of the GAs is not what makes them perform better than gradient descent. The notion that other factors are more important than global optimization ability in determining the success of genetic search for these problems is also reinforced by our observations on serendipitous coverage.

It is interesting that conjugate gradient descent did not perform as well as the reference algorithm for most of these programs. In most cases, this is apparently because the reference algorithm obtained better serendipitous coverage. In the experiments we will describe next (in Section 5.4), the opposite happened: conjugate gradient descent outperformed the reference algorithm by—according to the logged status information—obtaining better serendipitous coverage. Additional results are shown in the Appendix.

The performance gap among the three techniques seems to grow when the target program becomes more complex, but it remains small when either the condition complexity or the nesting complexity is small. This suggests that dynamic test data generation is worthwhile for complex programs, though it may not be worthwhile for programs whose conditions are simple or are not nested. However, it seems that random test generation levels off after several hundred executions, during which time the GAs show the same level of performance; since evolution has not begun yet, their behavior is essentially random. This is illustrated especially well in Figs. 5 and 14.
5.3.1 The Performance of the Differential GA Compared to the Standard GA

The poor performance of the differential GA (compared to the standard GA) can probably be attributed to its numerical focus, which seems poorly suited to test generation problems. Recall from Section 3.2 that the differential GA makes numerical changes in many variables, while the standard GA, using single-point crossover, leaves most variable values intact during reproduction. The scheme used by the differential GA is desirable in numerical problems, while the standard GA seems more suited to situations where variables represent specific features of a potential solution. The question, therefore, is whether the variables in a test generation problem behave more like the parameters of a numerical function, or whether they are more like "features" describing the behavior of the program.

This somewhat cloudy question becomes more concrete if we focus on the part of the program's behavior that is of interest during test generation. The program's input is a set of parameter values, but, in a test generation problem, we generally ignore the output and are only interested in the control path taken during execution. In this sense, test generation is not what would normally be thought of as a numerical problem. Indeed, a small change in the input parameters often leads to no change at all in the control path, so the fine parameter adjustments made by the differential GA are often fruitless.

On the other hand, there are many programs where each parameter controls specific aspects of a program's behavior. This may make it more appropriate to think of the parameters as features, each one describing a different dimension of the program's behavior. If we regard the parameters as features, we see that the differential GA has a tendency to modify each feature, while the standard GA (with single-point crossover) tends to create different combinations of existing features. This gives the differential GA a disadvantage when it comes to the serendipitous discovery of new inputs. This phenomenon, which we will discuss in the next section, requires that existing control paths (that is, existing features) be preserved, and the differential GA may not be good at doing this.

In our experiments, the disadvantage of the differential GA was sometimes amplified since we insisted on having different optimization methods use roughly the same number of program executions. To make the standard GA execute the program as many times as the differential GA did during its fine-grained and fruitless searches, we increased the standard GA's population size. The standard GA tended to outperform the differential GA even when their populations were the same, and increasing the population size allowed it to widen its search while the differential GA remained tightly focused.

To make matters worse for the differential GA, small differences in performance are magnified over time. The failure to find an input that causes one branch of a decision to be executed means that no conditions within that branch can be covered in the future. Thus, the failure to satisfy a single condition can be a greater handicap than it seems.
5.4 b737: Real-World Control Software

In this study, we used the GADGET system on b737, a C program which is part of an autopilot system. This code has 69 decision points, 75 conditions, and 2,046 source lines of code (excluding comments). It was generated by a CASE tool.

We generated tests using the standard GA, the differential GA, conjugate gradient descent, the reference gradient descent algorithm, and random test data generation. For each method, 10 attempts were made to achieve complete coverage of the program. For the two genetic algorithms, we made some attempt to tune performance by adjusting the number of individuals in the population and the number of generations that had to elapse without any improvement before the GAs would give up.

For the standard genetic algorithm, we used populations of 100 individuals each and allowed 15 generations to elapse when no improvement in fitness was seen. The probability of mutation was 0.000001. For the differential GA, we used populations of 24 individuals each and allowed 20 generations to elapse when there was no improvement in fitness (the smaller number of individuals allows more generations to be evaluated with a given number of program executions, and we found that the differential GA typically needed more generations because it converges more slowly than the standard GA for these problems). As before, we attempted to generate test cases that satisfy condition-decision coverage.
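For reference, the sketch below collects the search settings just described in one place. The struct and its field names are our own illustration for exposition; the paper does not show GADGET's actual configuration interface.

    // Search settings as reported in the text (illustrative only).
    struct GAConfig {
        int    population;     // individuals per generation
        int    patience;       // generations allowed without fitness improvement
        double mutation_rate;  // mutation probability (standard GA only)
    };

    const GAConfig standard_ga     = {100, 15, 0.000001};
    const GAConfig differential_ga = {24,  20, 0.0};  // mutation rate not reported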
Fig. 8. Coverage plots comparing performance of four systems on the b737 code. The four curves represent the pointwise mean over 10 attempts by each system to achieve complete coverage. The upper and lower edges of the dark gray region represent the best and worst performances of the standard GA, while the edges of the light gray region are the best and worst performances for the differential GA. The GAs have comparable performance and both show much better performance than random test data generation. The performance of gradient descent was slightly lower; conjugate gradient descent achieved just over 90 percent condition-decision coverage. The vertical axis shows the percentage of the 75 conditions that have been covered. (Note that the two plots do not have the same horizontal or vertical scale.)
First, we tried to achieve condition-decision coverage with the two GAs. Next, we applied random test data generation to the same program. Here, we permitted the same number of program executions as was used by the genetic searches. This amounts to thousands of random tests, one for each time the fitness function was evaluated during genetic search. Note, however, that random test generation stops making progress quickly.

Fig. 8 shows the coverage plots comparing genetic and random test data generation. The graphs show the best, worst, and pointwise mean performance over 10 separate attempts by each system to achieve complete condition-decision coverage.

The best performance is that of the differential GA; eventually, all runs converged to just over 93 percent condition-decision coverage. The best runs of the standard GA also reached this level of performance, but the differential GA was faster. Random test generation only achieved 55 percent coverage. The two gradient descent algorithms were somewhere in the middle, with conjugate gradient descent achieving about 90 percent condition-decision coverage, while the reference algorithm only reached about 85 percent.

Though we did not tune the GAs extensively for performance (as we have said), we did try to adjust the population size of both GAs in order to obtain comparable resource requirements for both. The reason for doing this was to ensure that better or worse performance by one algorithm was not merely the result of a greater or smaller number of program executions. But, during this process, we found that changing the population size only had a small effect on performance. The reason was that the GAs wasted many program executions on fruitless searches, and changes in the population size simply changed the resources wasted in this way. Therefore, it is quite possible that the standard GA could have been faster if the population size had been smaller, without a significant decrease in the condition-decision coverage it achieved.

5.5 Detailed Analysis of Experiments

For any given execution of the test generator, we can divide the 75 conditions of b737 into four classes. Some were never covered, some were covered serendipitously while the GA was trying to cover a different condition, some were covered while the GA was actually working on them, and some were covered by the randomly selected inputs we used to seed the GA initially. The last class is reflected in Fig. 8 by the fact that many conditions were already covered almost immediately. For example, the standard GA covered about 76 percent of the conditions right away. Here, as in our other experiments, most inputs were discovered serendipitously. (Note that, when a test requirement is satisfied serendipitously, it often happens before the genetic algorithm makes a concerted attempt to satisfy that requirement. Therefore, the fact that many requirements were satisfied by chance does not imply that the GA would have failed to satisfy them otherwise.)

The fact that most inputs were discovered by luck means that most requirements not satisfied by chance were not satisfied at all. In this respect, the b737 experiments shed some light on the true behavior of the two different GA implementations. A quick look at parts of the source code elucidates this behavior.
We will first explain a typical individual run of the standard GA and then discuss the differential GA.
The execution of the standard GA we examine as a model was selected because its coverage results are close to the mean value of all coverage results produced by the standard GA on b737. In its 11,409 executions of b737, this run sought to satisfy test requirements on 12 different conditions in the code. Of these 12 attempts, only one was successful. The remaining 11 attempts showed little forward progress during 10 generations of evolution. While making these attempts, however, the GA coincidentally discovered 14 tests that satisfied requirements other than the ones it was working on at the time. The high degree of coverage that was finally attained was mostly due to those 14 inputs. Indeed, the most successful executions have the shortest runtimes precisely because so many inputs were found serendipitously—many conditions had already been covered by chance before the GA was ready to begin working on them.

Next, we consider a typical execution of the differential GA. During this execution, the 15,982 executions of b737 involve 25 different attempts to satisfy specific test requirements. The number of attempts depends on which requirements are satisfied and in what order. Therefore, it is not the same for all executions of the test data generator. Again, only one objective is obtained through evolution, though 10 additional input cases, not necessarily prime objectives of the GA, are found.

5.5.1 Where the GAs Failed

It is interesting to consider the conditions that the GAs never successfully covered. The standard GA failed to cover the following eight conditions:

    for (index = begin; index <= end && !termination; index++)
    if (((T <= 0.0) || (0.0 == Gain)))
    if (((o_5 > 0.0) && (o < 0.0)))
    if (FLARE)
    if (FLARE)
    if (DECRB)

The differential GA failed to cover these conditions:

    for (index = begin; index <= end && !termination; index++)
    if (o)
    if (o)
    if ((((OP-D2) < 0.0) && ((OP-D1) > 0.0)))
    if (o_3)
    if (((o_5 > 0.0) && (o < 0.0)))
    if (FLARE)
    if (((!LOCE) || ONCRS))
    if (RESET)

Most of the decisions not covered contain only a single Boolean variable, signifying a condition that can be either TRUE or FALSE. The technique we use to define our fitness function seems inadequate when the condition contains Boolean variables or enumerated types. For example, if we are trying to exercise the TRUE branch of the condition

    if (windy) ...

we simply make the objective function g(x) equal to the absolute value of windy. This makes g(x) zero when the condition is FALSE and positive otherwise. But, if windy only takes on two values (say 0 and 1), then the fitness function can only have two values as well. Any two-valued fitness function does not allow the genetic algorithm to distinguish between different inputs that fail to satisfy the test requirement. Genetic search relies on the ability to prefer some inputs over others, so two-valued variables cause problems when they appear within conditions. Our experimental results suggest that this problem is real. With an improved strategy for dealing with such conditionals, GA behavior should improve.
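To see why such two-valued conditions defeat the search, compare the objective available for a relational condition with the one just described for a Boolean flag. The sketch below is our own illustration of the general idea, using a Korel-style branch distance for the relational case; it is not GADGET's actual implementation:

    #include <cmath>

    // For a relational condition such as "T <= 0.0", a branch-distance
    // objective can take many values: it shrinks as T approaches the
    // satisfying region, so the search can rank failing inputs.
    double objective_relational(double T) {
        return (T <= 0.0) ? 0.0 : T;
    }

    // For "if (windy)" with windy restricted to 0 or 1, the objective
    // |windy| takes only two values, so all inputs on the wrong side of
    // the condition look alike and the search receives no guidance.
    double objective_boolean(int windy) {
        return std::fabs((double)windy);
    }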
The GAs also failed to cover several conditions not containing Boolean variables, in spite of the fact that such conditions provide the GAs with useful fitness functions. The conditions not covered by the GAs all occurred within decisions containing more than one condition, and this may account for the GAs' difficulties. However, it is also important to bear in mind that these conditions do not tell the whole story, since the variables appearing in the condition may be complicated functions of the input parameters.

5.5.2 The Performance of the Random Test Generator

A second question raised by our experiments is why the random test generator performed so poorly. In all of our experiments, the random test generator quickly satisfied a certain percentage of the test requirements and then leveled off, failing to satisfy any further requirements, even though, in some cases, there were thousands of additional program executions.

In many programs, it is not counterintuitive that random test generation performs poorly. The b737 program presents an intuitively striking illustration of this. This program has 186 floating-point input parameters, meaning that there are 2^11904 possible inputs if a floating-point number is represented with 64 bits. Even a program like triangle, with three floating-point parameters, has 2^192 possible inputs. With a search space of this size, nothing even approaching an exhaustive search is possible. It is clear that, for any probability density governing the random selection of inputs, one can write conditions that only have a minute probability of being satisfied by those inputs. Thus, even seemingly straightforward test requirements can be essentially impossible to satisfy using a random test generator. Some parts of the triangle program are only executed when all three parameters are the same, but a random test generator only has one chance in 2^128 of coming up with such an input if inputs are selected uniformly and independently (e.g., the second and third input parameters have to have the same value as the first, whatever that value happens to be). Since the random test generator creates tests independently at random, it is straightforward to determine that the probability of generating an equilateral triangle one or more times during 15,000 tests is about 4 x 10^-35. (See [12] for a discussion of the probability of exercising a program feature during testing that was not exercised previously.)

In some cases, the performance of the random test generator might be improved by nonuniform sampling, as discussed in Section 5.2. If the test generator were only allowed to choose integer inputs between 1 and 100—the tester would already have to know that restricting the
inputs to small integers does not preclude finding the desired test—then there would be one chance in 10,000 of finding a satisfactory input. This would give the random test generator a manageable chance of finding an input whose three parameters are all the same. Still, the same trick would also improve the performance of the GA-driven test generator. The triangle example clearly illustrates the advantages of a directed search for a satisfactory input.

5.5.3 The Performance of Gradient Descent
The coverage plots for conjugate gradient descent and the reference gradient descent algorithm are shown on the right side in Fig. 8, with a different horizontal and vertical scale than the coverage plot for the GAs. On average, conjugate gradient descent achieved 90.53 percent condition-decision coverage, just slightly less than the genetic algorithms. The reference algorithm achieved 85.51 percent. In fact, conjugate gradient descent was more expensive than genetic search in these experiments. The reason for this, according to the status information logged during execution, was that conjugate gradient descent had trouble noticing when it was stuck on a plateau. It continued searching there when it should have stopped. This could easily be fixed, but tuning gradient descent for efficiency is not our goal in this paper. Needless to say, the reference algorithm was more expensive than any of the others.

The poor performance of the reference algorithm is somewhat surprising. We would expect it to perform at least as well as conjugate gradient descent if the objective function had no local minima or plateaus. The status information logged by the reference algorithm does not show that it encountered local minima, but it did encounter plateaus. According to the logged status information, this difference in performance was, once again, the result of serendipitous coverage. Below, we will discuss the issue of serendipitous coverage in more detail.
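The fix alluded to above is indeed simple. A minimal plateau-detection rule might look like the following sketch; the class name, tolerance, and patience values are our own assumptions, not part of the implementation described in the paper:

    #include <cmath>
    #include <cstddef>

    // Abandon a search when the objective has not changed measurably
    // for `patience` consecutive evaluations.
    class PlateauDetector {
        double best;
        std::size_t unchanged;
        double tolerance;
        std::size_t patience;
    public:
        PlateauDetector(double tol = 1e-12, std::size_t pat = 50)
            : best(0.0), unchanged(0), tolerance(tol), patience(pat) {}
        // Feed each new objective value; returns true when the search
        // should give up on the current test requirement.
        bool stuck(double objective) {
            if (unchanged > 0 && std::fabs(objective - best) < tolerance)
                ++unchanged;
            else
                unchanged = 1;
            best = objective;
            return unchanged > patience;
        }
    };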
5.5.4 Serendipitous Coverage

Recall that, in the b737 code, 14 conditions were covered while the GA was trying to satisfy a different condition, while only one condition was covered while the GA was actually trying to cover it. For the differential GA, the ratio was 10 coincidental coverages to one deliberate coverage. This phenomenon was seen throughout our experiments and it is graphically illustrated in the coverage plots. There are sudden, seemingly discontinuous increases in the percentage of conditions covered. Instantaneous jumps come about because many new conditions are encountered and satisfied immediately when the program takes a new branch in the execution path (this was discussed in Section 5.2). But, sometimes, the sudden increases in coverage are rapid without being instantaneous. In our experiments, this happened when the execution of a new branch provided new opportunities for serendipitous coverage.

The most interesting question raised by this experiment is the following: If the two GAs had so much success with inputs they happened on by chance, then why didn't random test generation perform equally well? We believe that the evolutionary pressures driving the GA to satisfy
Fig. 9. The flow of control for a hypothetical program. The nodes represent decisions and the goal is to find an input that takes the TRUE branch of decision c.
even one test requirement are strong enough to force the system as a whole to delve deeper into the structure of the code. This means that, though the GA is not necessarily following the optimal algorithm of grinding through each conditional, one after the other, to meet its objectives in lockstep fashion, it is, in the end, finding good input cases.

This argument is illustrated in the diagram in Fig. 9, which represents the flow of control in a hypothetical program. The nodes represent decisions. Suppose that we do not have an input that takes the TRUE branch of the condition labeled c. Because of the coverage-table strategy, GADGET does not attempt to find such an input until decision c can be reached (such an input must take the TRUE branches of conditions a and b). When the GA starts trying to find an input that takes the TRUE branch of c, inputs that reach c are used as seeds. During reproduction, some newly generated inputs will reach c and some will not, but those that do not will have poor fitness values and they will not usually reproduce. Thus, during reproduction, the GA tends to generate inputs that reach c. Until the GA's goal is satisfied, all newly generated inputs that reach c will, by definition, take the FALSE branch and, therefore, they will all reach condition d. Each time a new input is generated that reaches c, there is a possibility it will exercise a new branch of d.

According to this explanation, gradient descent should also benefit from serendipitous coverage and, in fact, we found this to be the case (our experiments with gradient descent were performed after the preceding argument was formulated). In the case of the b737 experiments, this may also explain why conjugate gradient descent performed better than the reference gradient descent algorithm. Conjugate gradient descent tends to take small steps initially, which means that many of those inputs are likely to reach the same branches as the original seed. The reference algorithm, which takes larger steps, may make more exploratory executions that do not reach the same branches as the seed. (Of course, those inputs will lead to poor objective-function values and that direction of search will not be pursued further.)

In contrast to the inputs generated by genetic search and gradient descent, those generated completely at random may be unlikely to reach d, because many will take the FALSE branches of conditions a and b. Therefore, random inputs are less likely to exercise new branches of d.
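The argument can be made concrete with a small program shaped like Fig. 9; the variable names and predicates below are invented purely for illustration:

    // Decisions a, b, c, and d as in Fig. 9. The goal is an input that
    // takes the TRUE branch of decision c.
    int hypothetical(double x, double y) {
        int covered_c = 0, covered_d = 0;
        if (x > 0.0) {                    // decision a
            if (y > 0.0) {                // decision b
                if (x > y + 100.0) {      // decision c: the target branch
                    covered_c = 1;
                } else {
                    // Inputs that reach c but miss the target all fall
                    // through to d, so each new seed-derived input gets
                    // a fresh chance to cover a new branch of d.
                    if (x * y > 50.0) {   // decision d
                        covered_d = 1;
                    }
                }
            }
        }
        return covered_c + 2 * covered_d;
    }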
In the final analysis, both GAs clearly outperform random test data generation for a real program of several thousand lines. This is an encouraging result.

6 OPEN RESEARCH ISSUES

Our experimental results open a considerable number of research issues. These issues all have at their heart the question: How can our test data generation system be further improved? More specifically, how can we make our system find tests that satisfy an even larger proportion of the test requirements? The handful of issues addressed in this section each have the potential to improve system behavior.

• Improved handling of binary-valued variables. The fitness function should deal intelligently with conditions that contain two-valued variables (see Section 5.5.1).

• Improved handling of inputs that fail to reach the target condition. When genetic search generates an input that fails to reach the condition that we are currently trying to satisfy, that input is simply given a low fitness value. However, we already have at least one input that reaches the condition because of the way the algorithm is defined. If we assign higher fitnesses to inputs that are closer to reaching the condition, it might be possible to breed more inputs that actually reach it.

• Special purpose GAs. Much of the GA literature is concerned with investigating special purpose GAs whose parameters and mechanisms are tailored to specific tasks [31]. Results garnered from the comparison we made between the differential GA and the standard GA suggest that investigation into designer GAs would be profitable. This research would focus on GA failure and investigate ways to avoid running out of steam during test data generation. More work should also be done determining exactly why the standard GA seems to outperform the differential GA.

• Path selection. Path selection is the use of heuristics to choose an execution path that simplifies test-data generation (as used by TESTGEN; see Section 2.3). Although path selection is not vital in our test-generation approach, it may still be the case that some execution paths are better than others for satisfying a particular test requirement. If static or dynamic analysis can provide clues about which paths are best, it will not be difficult to bias a genetic search algorithm toward solutions using those paths. In fact, the work of [16], [4] suggests that such an approach can lead to a noticeable improvement in performance.

• Higher levels of coverage. In this paper, we reported on the generation of condition-decision adequate test data. However, higher levels of coverage may further discriminate among different test-generation techniques. It would be interesting to apply our technique to multiple-condition coverage as well as dataflow and mutation-based coverage measures.

6.1 Global Optimization vs. Gradient Descent

In general, gradient descent is faster than global optimization algorithms such as genetic search, so it is preferable when it provides comparable performance. However, the global optimization algorithms almost always provided better performance in our experiments.

Gradient descent can fail in two ways: First, it can encounter local minima and, second, it can encounter plateaus in the objective function. When we analyzed the behavior of the reference algorithm, it appeared that local minima did exist in our test problems for some criteria—at least in the way we defined the shape of the problem space—but plateaus appeared to be much more common. Plateaus can also make a global optimization algorithm fail, and they did so in our experiments. Nonetheless, the global optimization algorithms made up for this failing with serendipitous discoveries of new inputs. In a sense, the extra program executions, which make up the bulk of the extra resources needed for global optimization, were put to good use.

On the basis of our experiments, we are therefore reluctant to conclude that global optimization algorithms performed better because of their avoidance of local minima. Rather, it seems that their absence of assumptions about the shape of the objective function, and the extra exploration they did to make up for this lack of assumptions, paid off in an unexpected way.
Of course, the setting determines whether or not the performance gain is worth the computational effort. For many programs, it may be important to generate tests quickly, and achieving high levels of coverage may be less important. Indeed, it has been argued that coverage for its own sake is not a worthy goal during software testing. However, high test coverage levels are mandated in some settings. We will also argue, in Section 6.2, that achieving structural test coverage is not the only reason for using automatic test generation. In settings where automated test generation replaces manual test generation, the automated approach is clearly desirable, and we hope that this paper will constitute progress toward making it feasible in settings where programs are complex and manual test generation is especially laborious.

6.2 Further Applications of Test-Data Generators

Beyond the problem we have studied, there are a number of interesting potential applications of test data generation that are not related to the satisfaction of test coverage criteria. Often, we would like to know whether a program is capable of performing a certain action, whether or not it was meant to do so. For example, in a safety-critical system, we want to know whether the system can enter an unsafe state. When security is a concern, we would like to know if the program can be made to perform one or more undesirable actions that constitute security breaches. Even in standard software testing, one could conceivably perform a search for inputs that cause a program to fail, instead of simply trying to exercise all features of the program. References [22], [23], and [24] discuss a number of such extensions of dynamic test generation and give more detail than we do here about their implementation.
Fig. 10. compl(0,2); 100 percent coverage represents coverage of 60 conditions.

Fig. 11. compl(0,3); 100 percent coverage represents coverage of 45 conditions.
The genetic search techniques we are developing can be applied in all of these areas, although we expect that each area will present its own challenges and pitfalls. To our knowledge, the only test-data generation systems that can be used on real programs are our own and that of [5]; since both are recent developments, it has not been possible to explore many of the less obvious applications of test data generators. Thus, the ability to automatically satisfy test criteria will open an enormous number of new avenues for investigation.

6.3 Threats to Validity

For the most part, this paper simply reports what we observed in our experiments. By performing each experiment several times with different random number seeds, we tried to ensure that the observations we reported were not extraordinary phenomena, and we can say with a certain confidence that similar results would be obtained if the same experiments were run with different initial seeds.

However, we do not know of any legitimate basis for generalizing about one program after observing the behavior of another, nor for selecting programs at random in a way that permits statistical conclusions to be drawn about a broader class of programs. Therefore, we have not tried to obtain, say, a statistical sample of control programs to use in Section 5.4, nor have we tried to generalize about programs with different nesting complexities or condition complexities by generating many random programs with the same compl(·,·) parameters in Section 5.3. It follows, however, that our results in Section 5.4 do not predict the performance of genetic search in automatic test generation for all control software, nor do the results in Section 5.3 predict its performance for all software with a given nesting complexity and condition complexity.

Our measurements were made in two ways. First, we kept a log of certain information while test generation was in progress. Our logs show the value of the objective function when the program under test is executed, they record when new coverage criteria are satisfied, and, in
some cases, they record other information as well, such as the fact that the reference gradient descent algorithm is reducing its step size. Coverage results are obtained using the commercial coverage tool DeepCover and they are recorded each time a new criterion is satisfied; this allows us to construct convergence plots.

Although most of what we reported is simply what we observed, some conclusions about the shape of the objective function were obtained indirectly via the log files. Thus, when we report that a test generation run hits a plateau in the objective function, we mean that the algorithm was unable to find any inputs that caused the objective function value to change, and not that there were no such inputs. Likewise, when we report that the objective function had no local minima, we mean that none were observed, not that none exist.
7 CONCLUSIONS

In this paper, we have reported on results from four sets of experiments using dynamic test data generation. Test data were generated for programs of various sizes, including some that were large compared to those usually subjected to test data generation. To our knowledge, we present results for the largest program yet reported in the test generation literature. The following are some salient observations of our study:
• In our experiments, the performance of random test generation deteriorates for larger programs. In fact, it deteriorates faster than can be accounted for simply by the increased number of conditions that must be covered. This suggests that satisfying individual test requirements is harder in large programs than in small ones. Moreover, it implies that, as program complexity increases, nonrandom test generation techniques become increasingly desirable, in spite of the greater simplicity of implementing a random test generator.
Fig. 12. compl(3,1); the differential GA outperforms the other two techniques, which occurred rarely in our experiments. 100 percent coverage represents coverage of 60 conditions.
• Although the standard genetic algorithm performed best overall, there were programs for which the differential GA performed better. For most of the programs, a fairly high degree of coverage was achieved by at least one of the techniques. From the standpoint of combinatorial optimization, it is hardly surprising that no single technique excels for all problems, but, from the standpoint of test-data generation, it suggests that comparatively few test requirements are intrinsically hard to satisfy, at least when condition-decision coverage is the goal. Apparently, a requirement that is difficult to cover with one technique may often be easier with another.

• Serendipitous satisfaction of new test requirements can play an important role. In general, we found that the most successful attempts to generate test data did so by satisfying many requirements coincidentally. This coincidental discovery of solutions is facilitated by the fact that a test generator must solve a number of similar problems, and it may lead to considerable differences between dynamic test data generation and other optimization problems.
Fig. 14. compl(5,1); 100 percent coverage represents coverage of 60 conditions. All the requirements that are ever satisfied are satisfied in the early stages of the test generation process. The behavior of the GAs is essentially random because no evolution has taken place yet.
APPENDIX
RESULTS FOR SYNTHETIC PROGRAMS
This appendix shows the results of further experiments described in Section 5.3. Several interesting features, such as the remarkably poor performance of the standard GA for the program with compl(3,1) and the similar performance of the different methods for the compl(0,·) programs, persisted in several repetitions of the experiment using different GA parameters and different seeds for random number generation. This suggests (not surprisingly) that the compl metric does not capture everything needed to predict the performance of the test generators.

As in Section 5.3, the differential GA used 30 individuals and gave up on satisfying a given test requirement if 10 generations elapsed with no progress. The standard GA was also instructed to give up if no progress was made in 10 generations, and the population size was adjusted to give the same number of target-program executions as the differential GA.
Fig. 15. compl(5,2); there are a total of 60 conditions to cover.
TABLE 6
Performance of Conjugate Gradient Descent and the Reference Gradient Descent Algorithm on the Off-Diagonal Synthetic Programs

            compl(0,2)   compl(0,3)   compl(3,1)   compl(3,3)   compl(5,1)   compl(5,2)
    CGD        85.83        80.0         78.11        41.25        72.1         53.79
    ref.       89.25        84.03        81.67        55.28        72.5         62.96

Overall, the techniques are comparable to one another and their performance is somewhat below that of genetic search. A remarkable exception is compl(3,3), where the reference algorithm outperformed all others. The reference algorithm also outperformed conjugate gradient descent on compl(0,3), which may be because of the interpolation technique we used for conjugate gradient descent.
This resulted in the following population sizes for the standard GA (the corresponding results are shown in Figs. 10, 11, 12, 13, 14, and 15): compl(3,3): 280, compl(5,1): 340, and compl(5,2): 240. The performance for each size is reported in Table 6.

ACKNOWLEDGMENTS

The authors would like to thank Berent Eskikaya, Curtis Walton, Greg Kapfhammer, and Deborah Duong for many helpful contributions to this paper. Tom O'Connor and Brian Sohr contributed the GADGET acronym. Bogdan Korel provided the programs analyzed in Section 5.2. This research has been made possible by the US National Science Foundation under award number DMI-9661393 and the US Defense Advanced Research Projects Agency (DARPA) contract N66001-00-C-8056.

REFERENCES

[1] W. Miller and D.L. Spooner, "Automatic Generation of Floating Point Test Data," IEEE Trans. Software Eng., vol. 2, no. 3, pp. 223-226, Sept. 1976.
[2] B. Korel, "Automated Software Test Data Generation," IEEE Trans. Software Eng., vol. 16, no. 8, pp. 870-879, Aug. 1990.
[3] P. Frankl, D. Hamlet, B. Littlewood, and L. Strigini, "Choosing a Testing Method to Deliver Reliability," Proc. 19th Int'l Conf. Software Eng. (ICSE '97), pp. 68-78, May 1997.
[4] R. Ferguson and B. Korel, "The Chaining Approach for Software Test Data Generation," ACM Trans. Software Eng. Methodology, vol. 5, no. 1, pp. 63-86, Jan. 1996.
[5] M.J. Gallagher and V.L. Narasimhan, "Adtest: A Test Data Generation Suite for Ada Software Systems," IEEE Trans. Software Eng., vol. 23, no. 8, pp. 473-484, Aug. 1997.
[6] J.H. Holland, Adaptation in Natural and Artificial Systems. Ann Arbor, Mich.: Univ. of Michigan Press, 1975.
[7] S. Kirkpatrick, C.D. Gelatt Jr., and M.P. Vecchi, "Optimization by Simulated Annealing," Science, vol. 220, no. 4598, pp. 671-680, May 1983.
[8] F. Glover, "Tabu Search, Parts I and II," ORSA J. Computing, vol. 1, no. 3, pp. 190-206, 1989.
[9] J. Horgan, S. London, and M. Lyu, "Achieving Software Quality with Testing Coverage Measures," Computer, vol. 27, no. 9, pp. 60-69, Sept. 1994.
[10] J. Chilenski and S. Miller, "Applicability of Modified Condition/Decision Coverage to Software Testing," Software Eng. J., pp. 193-200, Sept. 1994.
[11] R. DeMillo and A. Mathur, "On the Uses of Software Artifacts to Evaluate the Effectiveness of Mutation Analysis for Detecting Errors in Production Software," Technical Report SERC-TR-92-P, Purdue Univ., 1992.
[12] R. Hamlet and R. Taylor, "Partition Testing Does Not Inspire Confidence," IEEE Trans. Software Eng., vol. 16, no. 12, pp. 1402-1411, Dec. 1990.
[13] E.J. Weyuker and B. Jeng, "Analyzing Partition Testing Strategies," IEEE Trans. Software Eng., vol. 17, no. 7, pp. 703-711, July 1991.
[14] E.J. Weyuker, "Axiomatizing Software Test Adequacy," IEEE Trans. Software Eng., vol. 12, no. 12, pp. 1128-1137, Dec. 1986.
[15] P.G. Frankl and E.J. Weyuker, "A Formal Analysis of the Fault-Detecting Ability of Testing Methods," IEEE Trans. Software Eng., vol. 19, no. 3, pp. 202-213, Mar. 1993.
[16] B. Korel, "Automated Test Data Generation for Programs with Procedures," Proc. Int'l Symp. Software Testing and Analysis, pp. 209-215, 1996.
[17] L.A. Clarke, "A System to Generate Test Data and Symbolically Execute Programs," IEEE Trans. Software Eng., vol. 2, no. 3, pp. 215-222, Sept. 1976.
[18] C.V. Ramamoorthy, S.F. Ho, and W.T. Chen, "On the Automated Generation of Program Test Data," IEEE Trans. Software Eng., vol. 2, no. 4, pp. 293-300, Dec. 1976.
[19] A.J. Offutt, "An Integrated Automatic Test Data Generation System," J. Systems Integration, vol. 1, pp. 391-409, 1991.
[20] W.H. Deason, D.B. Brown, K.H. Chang, and J.H. Cross II, "A Rule-Based Software Test Data Generator," IEEE Trans. Knowledge and Data Eng., vol. 3, no. 1, pp. 108-117, Mar. 1991.
[21] K.H. Chang, J.H. Cross II, W.H. Carlisle, and S.-S. Liao, "A Performance Evaluation of Heuristics-Based Test Case Generation Methods for Software Branch Coverage," Int'l J. Software Eng. and Knowledge Eng., vol. 6, no. 4, pp. 585-608, 1996.
[22] N. Tracey, J. Clark, and K. Mander, "The Way Forward for Unifying Dynamic Test-Case Generation: The Optimisation-Based Approach," Proc. Int'l Workshop Dependable Computing and Its Applications (DCIA), pp. 169-180, Jan. 1998.
[23] N. Tracey, J. Clark, and K. Mander, "Automated Program Flaw Finding Using Simulated Annealing," Proc. Int'l Symp. Software Testing and Analysis, Software Eng. Notes, pp. 73-81, Mar. 1998.
[24] N. Tracey, J. Clark, K. Mander, and J. McDermid, "An Automated Framework for Structural Test-Data Generation," Proc. Automated Software Eng. '98, pp. 285-288, 1998.
[25] C.C. Michael, G.E. McGraw, and M.A. Schatz, "Genetic Algorithms for Dynamic Test Data Generation," Proc. Automated Software Eng. '97, pp. 307-308, 1997.
[26] C.C. Michael, G.E. McGraw, and M.A. Schatz, "Opportunism and Diversity in Automated Software Test Data Generation," Proc. Automated Software Eng. '98, pp. 136-146, 1998.
[27] A.C. Schultz, J.J. Grefenstette, and K.A. De Jong, "Test and Evaluation by Genetic Algorithms," IEEE Expert, pp. 9-14, Oct. 1993.
[28] W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling, Numerical Recipes in C. New York: Cambridge Univ. Press, 1991.
[29] J. Skorin-Kapov, "Tabu Search Applied to the Quadratic Assignment Problem," ORSA J. Computing, vol. 2, no. 1, pp. 33-45, Winter 1990.
[30] D. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, Mass.: Addison-Wesley, 1989.
[31] M. Mitchell, An Introduction to Genetic Algorithms. Cambridge, Mass.: MIT Press, 1996.
[32] Foundations of Genetic Algorithms, G. Rawlins, ed. San Mateo, Calif.: Morgan Kaufmann, 1991.
[33] R. Storn, "On the Usage of Differential Evolution for Function Optimization," Proc. North Am. Fuzzy Information Processing Soc. (NAFIPS '96), pp. 519-523, June 1996.
[34] S.K. Park and K.W. Miller, "Random Number Generators: Good Ones are Hard to Find," Comm. ACM, vol. 31, no. 10, pp. 1192-1201, Oct. 1988.
[35] R.A. DeMillo and A.J. Offutt, "Experimental Results from an Automatic Test Case Generator," ACM Trans. Software Eng. Methodology, vol. 2, no. 1, pp. 215-222, Jan. 1993.
[36] J.D. Musa, "Operational Profiles in Software Engineering," IEEE Software, vol. 10, no. 2, pp. 14-32, 1993.

Christoph C. Michael received the BA degree in physics from Carleton College, Minnesota, in 1984, and the MSc and PhD degrees in computer science from the College of William and Mary in Virginia, in 1993. He is a senior research scientist at Cigital. He has served as principal investigator on software assurance grants from the US National Institute of Standards and Technology's Advanced Technology Program and the US Army Research Labs, as well as software security grants from the US Defense Advanced Research Projects Agency. His current research includes information system intrusion detection, software test data generation, and dynamic software behavior modeling. He is a member of the IEEE.
Gary McGraw is the vice president of corporate technology at Cigital (formerly Reliable Software Technologies), where he pursues research in software security while leading the Software Security Group. He has served as principal investigator on grants from the US Air Force Research Labs, the Defense Advanced Research Projects Agency, the US National Science Foundation, and the US National Institute of Standards and Technology's Advanced Technology Program. He also chairs the National Information Security Science and Technology Council's Malicious Code Study Group. He coauthored Java Security (Wiley, 1996), Software Fault Injection (Wiley, 1997), and Securing Java (Wiley, 1999), and is currently writing a book entitled Building Secure Software (Addison-Wesley, 2001). He is a member of the IEEE.

Michael A. Schatz graduated summa cum laude with a BS degree in mathematics from Case Western Reserve University in 1996 and an MS degree in computer engineering from Case Western Reserve University in 1997. He is a senior research associate at Cigital (formerly known as Reliable Software Technologies). He has worked on numerous projects in both research and development roles. These projects include experimentation with using fault injection to find security vulnerabilities, using genetic algorithms to generate test data for programs, and augmenting the capabilities of Reliable Software Technologies's coverage tool. He has coauthored articles for Dr. Dobb's and a number of research papers.
Chapter 6

ML Applications in Reuse

As the cost of software development becomes the bulk of any computer-based solution, it makes great economic sense to systematically reuse existing solutions. Reuse can take place at many different levels: specifications, domain knowledge, designs, development processes, systems, subsystems, and components. There are a number of technical and managerial benefits of reuse: reduced development time and risk, increased reliability and productivity, and improved standardization.

The ML applications in this chapter pertain to reuse. Issues that have been considered in this area of applications include: how to compute the similarity in a reuse library, tools for browsing software libraries, how to model the cost of rework for reusable components, how to locate and adopt software components to given specifications, how to generalize program abstractions so as to increase their chance for reuse, and how to organize reusable components such that efficient retrievals can be accommodated. The ML methods utilized in this area of applications consist of IBL/CBR, DT, GA, and EBL, as shown in Table 26.

Table 26. ML methods used in reuse. (The table indicates, for each of the tasks just listed: similarity computing, active browsing, cost of rework, knowledge representation, locating and adopting software to specifications, generalizing program abstractions, and clustering of components, which of the candidate methods NN, IBL, DT, GA, GP, ILP, EBL, CL, BL, AL, IAL, RL, EL, SVM, and CBR have been applied.)
This chapter contains one paper by Katalagarianos and Vassiliou [70]. The paper describes a CBR-based approach to locating and adopting reusable components to particular specifications. In their organizational framework, a single representational model is used to accommodate a variety of different artifacts (ranging from designs, domain knowledge, development experience, and programming knowledge, to code) in the repository. Descriptions of those artifacts are made in a component interconnection language called Telos. Components in the repository are decomposed into implementation components, design components, specification components, and dependency components. In the generalization/specialization hierarchy of specification
components, there are three different types of elements: the most general elements, the most specific elements, and the intermediate elements. The dependency components interrelate components of different types and consist of the design dependency and the implementation dependency. A design dependency component defines a dependency relationship between a design component and the corresponding specification component. An implementation dependency component defines a dependency relationship between an implementation component and the corresponding design component.

The proposed approach has two phases: a retrieval and adaptation phase, and a repository evolution phase. When retrieving components for reuse, the proposed system interacts with the application developer and searches for the most appropriate case satisfying the user's need, with possible adaptation. The repository evolution is based on learning and generalization. To demonstrate the viability of the proposed approach, a prototype system was implemented for the reuse of C++ code.

The following paper will be included here:

P. Katalagarianos and Y. Vassiliou, "On the reuse of software: a case-based approach employing a repository", Automated Software Engineering, Vol. 2, No. 1, 1995, pp. 55-86.
On the Reuse of Software: A Case-Based Approach Employing a Repository

PANAGIOTIS KATALAGARIANOS
Department of Computer Science, University of Crete and Institute of Computer Science, Foundation for Research and Technology - Hellas, FORTH

YANNIS VASSILIOU
National Technical University of Athens, Dept. of Electrical and Computer Engineering and Institute of Computer Science, Foundation for Research and Technology - Hellas, FORTH
Abstract. Systematic reuse of software has been proposed as a promising means to address the legendary productivity increase in software development. While object-oriented programming languages are, by nature, well suited for reusability-based development of applications, additional mechanisms to effectively reuse software are necessary. We present a novel language-independent method, which assumes an appropriately organized software repository and employs a simple form of Case-Based Reasoning in conjunction with the specificity-genericity hierarchy to locate and possibly adopt software to particular specifications. The method focuses on code reuse and addresses the evolving nature of the repository. Complexity issues for the main algorithms are presented. Finally, a demonstrator prototype system for reusing object-oriented code (C++) is described.
1. Introduction
Several advances towards systematic software reuse have recently been reported with library systems, classification techniques, the creation and distribution of reusable components, reuse support environments, and corporate reuse programs. Despite these efforts, it is argued that reuse has not yet delivered on its promise to significantly increase productivity and quality (Prieto-Diaz, 1993). Effective reuse of software requires a rich collection of designed-for-reuse software components and knowledge on how to locate them in a repository, adapt them if needed (for instance, by substituting parameters), and even create new ones based on information provided by other components exploiting similar characteristics.

Genericity is the technique that allows a module to be defined with parameterized types. This is a definite aid to reusability because just one generic module is defined, instead of a group of modules that differ only in the types of objects they manipulate (Meyer, 1987). A language supporting type parameterization allows specification of general container types such as list, where the specific type of the elements is left as a parameter. Thus, a parameterized class specifies an unbounded set of related types; for example, list of int. Typical languages that provide genericity are Ada, LPG, and CLU. A general form of parameterized types can also be integrated into object-oriented languages that do not provide a built-in form of genericity; Stroustrup (1988) proposed such a general form for the C++ language. Within this framework, a generic class can be defined
having one parameter of type <T>. For example, the generic class array may be defined as:
    class array {
    protected:
        <T>* element;
        int size;
    public:
        int search();
    };

Example 1
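As a hedged illustration, here is the same class written with the template notation that C++ later standardized, together with two instantiations. The constructor, the body of search, and main are our own minimal stubs, not part of the paper's example:

    template <class T>
    class array {
    protected:
        T*  element;   // storage for the parameterized element type
        int size;
    public:
        explicit array(int n) : element(new T[n]()), size(n) {}
        ~array() { delete[] element; }
        // Linear search; returns the index of key, or -1 if absent.
        int search(const T& key) const {
            for (int i = 0; i < size; ++i)
                if (element[i] == key) return i;
            return -1;
        }
    };

    int main() {
        array<int>    a(10);  // "array of int": one generic module, many types
        array<double> b(5);   // each instantiation is directly usable
        return (a.search(42) == -1 && b.search(3.5) == -1) ? 0 : 1;
    }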
Using this technique, it is possible to maintain the efficiency of object-oriented code (as we make the substitution before compiling), while retaining the benefits of genericity. However, this genericity mechanism by itself is not flexible enough, because it cannot capture the fine grain of commonality between groups of implementations of the same general data abstraction. This is because there are only two levels of modules: a) generic modules, which are parameterized and thus open to variation but not directly usable, and b) fully instantiated modules (specific modules), which are directly usable but not open to refinement.

Our conjecture is that, in order to effectively reuse object-oriented code using genericity, software developers could depend on experiential knowledge gathered and stored in the repository while developing similar software components. Specialized methods can then be incorporated in order to achieve better support from the system for the provision of this knowledge. The application of techniques and methods from artificial intelligence to software engineering is one mechanism through which reusability of software might be achieved. By abstracting and encoding the expertise of experienced software engineers into knowledge bases together with software components, system developers can gain effective access to the artifacts in the software repository as it evolves over time.

Case-Based Reasoning (CBR) (Barletta, 1991) is a method of solving problems based on the transfer of past experience to new problem situations. It has been attractive as a method for building intelligent reasoning systems because it appears relatively simple and natural. CBR as a learning paradigm has several advantages. Firstly, there are several performance enhancements, as it provides shortcuts in reasoning, the capability of avoiding past errors, and the capability of focusing on the most important parts of a problem first. Secondly, learning can be eased, since CBR does not require a causal model or a deep understanding of the domain. Thirdly, individual or generalized cases can also serve as explanations.

This paper presents a novel method, which employs a simple form of Case-Based Reasoning (CBR) in conjunction with the specificity-genericity hierarchy to semi-automatically locate the appropriate code in a software repository, possibly adapting it to particular requirements, while dealing with the evolution of the repository by adding (if needed) new components and performing the proper repository reorganization.
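As a rough sketch of the retrieval step in such a case-based scheme, the following ranks stored component descriptions against a query using a simple feature-overlap similarity. The data structures and the similarity measure are our own simplification for illustration, not the Telos-based machinery developed later in the paper:

    #include <algorithm>
    #include <cstddef>
    #include <iterator>
    #include <set>
    #include <string>
    #include <vector>

    // A stored case: a component description reduced to indexing terms.
    struct ComponentCase {
        std::string name;
        std::set<std::string> features;  // e.g., {"container", "generic", "search"}
    };

    // Jaccard overlap between query features and a stored case.
    double similarity(const std::set<std::string>& q,
                      const std::set<std::string>& c) {
        std::vector<std::string> common;
        std::set_intersection(q.begin(), q.end(), c.begin(), c.end(),
                              std::back_inserter(common));
        double uni = double(q.size() + c.size()) - double(common.size());
        return uni == 0.0 ? 0.0 : double(common.size()) / uni;
    }

    // Retrieval phase: return the most similar case as the candidate
    // for adaptation (adaptation and repository evolution would follow).
    const ComponentCase* retrieve(const std::vector<ComponentCase>& repo,
                                  const std::set<std::string>& query) {
        const ComponentCase* best = 0;
        double bestScore = -1.0;
        for (std::size_t i = 0; i < repo.size(); ++i) {
            double s = similarity(query, repo[i].features);
            if (s > bestScore) { bestScore = s; best = &repo[i]; }
        }
        return best;
    }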
The method presented in this paper has been evaluated through a prototype implementation which addresses the reuse of C++ code. Since the objective is not to describe all the functionalities of the prototype system but to illustrate the way the method is applied, a sample session of system usage is presented in Appendix A. The organizational framework is presented in section 2, while the reuse method itself is described in section 3. The main topic of section 4 is the complexity analysis of the algorithms used for selection, adaptation and repository evolution. Finally, conclusions and extensions are described in section 5.

2. Organizational Framework
Software environments typically use an environment database (repository) to provide support for all activities concerning the software development process. Several factors must be addressed in organizing the repository: a) which artifacts may be reused? b) how are these artifacts transformed into reusable components? c) what is the proper size and form of reusable components? d) how should they be classified? and e) what are proper techniques and languages for describing the components? Regarding which artifacts are candidates for reuse, the software community appears to be reaching a consensus: not only code but also higher-level concepts like designs, domain knowledge, development experience and programming knowledge. The issues of representation and presentation of the reusable artifacts do not admit a simplistic solution and certainly need to be separated. A main objective in the work reported has been the use of a single representation model for hosting all these drastically different artifacts in a repository. Effectively, the repository stores and manages descriptions of the artifacts (in the sequel called components), all expressed in a suitable data description language. An important aspect of this work has been to identify the concepts and relations in the programming domain which can be used to capture programming knowledge. These identified concepts and relations have then been explicated in component descriptions and component interconnections by introducing abstractions and generalizations, under the following assumptions:
• The software components are described and interconnected in an economic, domain-independent way that enables the formation of matching cases.

• The matching cases may be categorized effectively by indices that are easily extracted from the component descriptions (see the sketch after this list).

• The whole organization provides a strong basis for controlled evolution and expansion of the repository with quality components through the application of an evolution method.
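The retrieval machinery itself is described in section 3; purely as an intuition for the second assumption above, index-based matching of cases can be pictured along the following lines (a sketch of ours with hypothetical names, not the prototype's code):

#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical case: a stored component description keyed by simple
// feature indices extracted from it (e.g. ADT and operation names).
struct Case {
    std::string component;          // e.g. "array"
    std::set<std::string> indices;  // e.g. {"table", "search"}
};

// Return the stored case sharing the most indices with the query,
// or null when no case shares any index at all.
const Case* best_match(const std::vector<Case>& repository,
                       const std::set<std::string>& query) {
    const Case* best = nullptr;
    std::size_t best_score = 0;
    for (const Case& c : repository) {
        std::size_t score = 0;
        for (const std::string& index : query)
            if (c.indices.count(index)) ++score;
        if (score > best_score) { best_score = score; best = &c; }
    }
    return best;
}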
A consequence of describing and interconnecting the software components under these assumptions is that the repository is constrained to be methodology-specific, rather than general-purpose. This is a trade-off that experience and experimentation to date show to be unavoidable.
Several languages from various fields like software engineering, databases and knowledge representation provide various means to describe individual features of components and component interconnections. Informally, two components are interconnected whenever there exists at least one resource that belongs to one and is accessed or derived by the other. All of these languages belong to a special class called Component Interconnection Languages (CILs) (Motschnig-Pitrik and Mittermeir, 1989). CILs are not restricted to the implementation level; they use the term component to refer to a segment of software specification independently of the level to which it belongs. Using a CIL to define components and their interconnections makes it possible to cover all of the aspects considered important to support when following a specific life-cycle approach and employing specific languages and techniques. The main feature of a CIL is that it is not just a language for programming, but fundamentally a language for packaging. When properly applied, this principle can provide much functionality to the application developer, without imposing inconvenient constraints or overhead. In the work reported, the language chosen for the description of software components is Telos (Mylopoulos et al., 1990). Telos is an E-R based language, satisfying all the requirements so as to be considered a CIL, designed specifically for information system development applications. It supports a number of structuring mechanisms which have been used by knowledge representation languages as well as semantic data models, allowing the designer of a knowledge base to introduce gradually and in an orderly fashion the detail that needs to be represented. These mechanisms are: classification, aggregation, and generalization. Using Telos, the general metaclasses of components and interconnections are defined as:

IndividualClass Component in M1_Class
end Component

IndividualClass Interconnection in M1_Class with
    necessary, single
        from_component : Component;
        to_component : Component
end Interconnection

The generic attributes associated with Interconnection are listed within the declarations using the syntax <attribute label> : <attribute class>. Thus from_component : Component is one such generic attribute, which allows instances of Interconnection to have attributes whose values are instances of the metaclass Component. Attribute categories in Telos, which are all user-defined, group generic attributes and impose constraints on their instances. In the previous declaration the necessary attribute category is a constraint to be enforced at all times for Interconnection instances, while single constrains its instances not to have multi-valued attributes.
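As a rough intuition for these attribute categories (our analogy only; Telos metaclasses are far richer than this), a C++ reader might picture necessary, single as a mandatory, single-valued field, enforced below with reference members that must be bound at construction time:

#include <string>

struct Component {
    std::string name;  // stand-in for a full component description
};

// "necessary, single": every Interconnection must always carry exactly
// one from_component and one to_component -- never zero, never many.
struct Interconnection {
    const Component& from_component;
    const Component& to_component;
    Interconnection(const Component& from, const Component& to)
        : from_component(from), to_component(to) {}
};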
Since the software environment at hand covers all the stages of the software life cycle, a first partition of the set of components involved in the framework includes: a) implementation components, b) design components, and c) specification components.

2.1. Implementation Components
Object-oriented languages have data abstraction and encapsulation constructs called packages, modules, or classes that enable one to define and enforce the boundaries separating the components of a software system. In this article these abstraction constructs are referred to as classes, or implementation components. Every class provides some resources to other classes in a system, and in turn may require some resources from other classes. These resources are a) the class parts, which denote the data representation of a component, and b) the class methods. Using the Telos syntax, an implementation component description is defined as:

IndividualClass IMPL_Component in M1_Class isA Component with
    necessary, single
        language : Programming_Language
    necessary
        filename : String
    attribute
        class_parts : ClassPart;
        methods : Method
end IMPL_Component

The attribute type ClassPart is defined as:

IndividualClass ClassPart in M1_Class isA IMPL_Resource with
    necessary, single
        part_type : Type
end ClassPart

The attribute type Method is defined in a similar way. These resources need to be further specialized in order to specify their interfaces. For instance, a protected class part is defined as:

IndividualClass ProtectedPart in M1_Class isA ClassPart
end ProtectedPart

Up to this point, the features of implementation components and their resources are modeled using Telos metaclasses. These metaclasses can be instantiated to Telos classes corresponding to the different implementation components that are stored in the repository. For instance, the instantiations corresponding to the implementation component array described in Example 1 are:

IndividualClass array in S_Class, IMPL_Component with
    filename : "array.cc"
    language : "C++"
    class_parts
        : array_size;
        : array_element
    methods
        : array_search
end array

IndividualClass array_size in S_Class, ProtectedPart with
    part_type : int
end array_size

2.2. Design Components
Object-oriented design is viewed as a software (de)composition technique. It may be defined as a technique which, unlike classical (functional) design, bases the modular decomposition of a software system on the classes of objects the system manipulates. The resources distributed among the design components are partitioned into abstract data types and operations. Using the Telos syntax, a design component is defined as:

IndividualClass DES_Component in M1_Class isA Component with
    attribute
        adt : AbstractDataType;
        operations : Operation
end DES_Component

Concerning the running example, the abstract data type table (as the array is an implementation of a table) may be identified, together with a search operation.
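The paper gives no C++ rendering of the design level; as an illustrative reading only, the table abstract data type could be pictured as an interface whose single operation resource is search, with the array class of Example 1 as one possible implementation:

// Illustrative sketch: the design-level ADT "table" and its operation.
template<class T>
class table {
public:
    virtual int search(const T& key) = 0;  // the ADT's operation resource
    virtual ~table() {}
};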
2.3. Specification Components
The ability to structure a specification is vital in any software engineering environment. A specification model in the proposed framework includes both functional and non-functional specifications. Functional specifications provide a description of the functions carried out by the corresponding implementation component, while non-functional specifications impose global constraints on the operation, performance, and efficiency of any proposed solution to the functional specifications model (Chung, 1990). When modeling a specification component, both specification types are considered to be resources, described as: where ATTRIBUTES is defined as the set of attributes of the form: