SERIES ON SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING
Series Editor-in-Chief: S. K. Chang (University of Pittsburgh, USA)
Vol. 1
Knowledge-Based Software Development for Real-Time Distributed Systems Jeffrey J. -P. Tsai and Thomas J. Weigert (Univ. Illinois at Chicago)
Vol. 2
Advances in Software Engineering and Knowledge Engineering edited by Vincenzo Ambriola (Univ. Pisa) and Genoveffa Tortora (Univ. Salerno)
Vol. 3
The Impact of CASE Technology on Software Processes edited by Daniel E. Cooke (Univ. Texas)
Vol. 4
Software Engineering and Knowledge Engineering: Trends for the Next Decade edited by W. D. Hurley (Univ. Pittsburgh)
Vol. 5
Intelligent Image Database Systems edited by S. K. Chang (Univ. Pittsburgh), E. Jungert (Swedish Defence Res. Establishment) and G. Tortora (Univ. Salerno)
Vol. 6
Object-Oriented Software: Design and Maintenance edited by Luiz F. Capretz and Miriam A. M. Capretz (Univ. Aizu, Japan)
Vol. 7
Software Visualisation edited by P. Eades (Univ. Newcastle) and K. Zhang (Macquarie Univ.)
Vol. 8
Image Databases and Multi-Media Search edited by Arnold W. M. Smeulders (Univ. Amsterdam) and Ramesh Jain (Univ. California)
Vol. 9
Advances in Distributed Multimedia Systems edited by S. K. Chang, T. F. Znati (Univ. Pittsburgh) and S. T. Vuong (Univ. British Columbia)
Vol. 10
Hybrid Parallel Execution Model for Logic-Based Specification Languages Jeffrey J.-P. Tsai and Bing Li (Univ. Illinois at Chicago)
Vol. 11
Graph Drawing and Applications for Software and Knowledge Engineers Kozo Sugiyama (Japan Adv. Inst. Science and Technology)
Vol. 12
Lecture Notes on Empirical Software Engineering edited by N. Juristo and A. M. Moreno (Universidad Politécnica de Madrid, Spain)
Vol. 13
Data Structures and Algorithms edited by S. K. Chang (Univ. Pittsburgh, USA)
Vol. 14
Acquisition of Software Engineering Knowledge. SWEEP: An Automatic Programming System Based on Genetic Programming and Cultural Algorithms edited by George S. Cowan and Robert G. Reynolds (Wayne State Univ.)
Vol. 15
Image: E-Learning, Understanding, Information Retrieval and Medical. Proceedings of the First International Workshop edited by S. Vitulano (Universita di Cagliari, Italy)
Vol. 16
Machine Learning Applications in Software Engineering edited by Du Zhang (California State Univ.) and Jeffrey J.-P. Tsai (Univ. Illinois at Chicago)
Machine Learning Applications in Software Engineering

editors

Du Zhang (California State University, USA)
Jeffrey J. P. Tsai (University of Illinois, Chicago, USA)

World Scientific
New Jersey • London • Singapore • Beijing • Shanghai • Hong Kong • Taipei • Chennai
Published by World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
The author and publisher would like to thank the following publishers of the various journals and books for their assistance and permission to include the selected reprints found in this volume: IEEE Computer Society (Trans. on Software Engineering, Trans. on Reliability); Elsevier Science Publishers (Information and Software Technology); Kluwer Academic Publishers (Annals of Software Engineering, Automated Software Engineering, Machine Learning)
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-256-094-7
Cover photo: Meiliu Lu
Printed in Singapore by World Scientific Printers (S) Pte Ltd
DEDICATIONS

DZ: To Jocelyn, Bryan, and Mei
JT: To Jean, Ed, and Christina
ACKNOWLEDGMENT

The authors acknowledge the contribution of Meiliu Lu for the cover photo and the support from the National Science Council under Grant NSC 92-2213-E-468-001, R.O.C. We also thank Kim Tan, Tjan Kwang Wei, and other staff at World Scientific for helping with the preparation of the book.
TABLE OF CONTENTS

Chapter 1  Introduction to Machine Learning and Software Engineering  1
  1.1  The Challenge  1
  1.2  Overview of Machine Learning  3
  1.3  Learning Approaches  9
  1.4  SE Tasks for ML Applications  13
  1.5  State-of-the-Practice in ML&SE  15
  1.6  Status  25
  1.7  Applying ML Algorithms to SE Tasks  35
  1.8  Organization of the Book  36

Chapter 2  ML Applications in Prediction and Estimation  37
  2.1  Bayesian Analysis of Empirical Software Engineering Cost Models (with S. Chulani, B. Boehm and B. Steece), IEEE Transactions on Software Engineering, Vol. 25, No. 4, July 1999, pp. 573-583.  41
  2.2  Machine Learning Approaches to Estimating Software Development Effort (with K. Srinivasan and D. Fisher), IEEE Transactions on Software Engineering, Vol. 21, No. 2, February 1995, pp. 126-137.  52
  2.3  Estimating Software Project Effort Using Analogies (with M. Shepperd and C. Schofield), IEEE Transactions on Software Engineering, Vol. 23, No. 12, November 1997, pp. 736-743.  64
  2.4  A Critique of Software Defect Prediction Models (with N.E. Fenton and M. Neil), IEEE Transactions on Software Engineering, Vol. 25, No. 5, September 1999, pp. 675-689.  72
  2.5  Using Regression Trees to Classify Fault-Prone Software Modules (with T.M. Khoshgoftaar, E.B. Allen and J. Deng), IEEE Transactions on Reliability, Vol. 51, No. 4, 2002, pp. 455-462.  87
  2.6  Can Genetic Programming Improve Software Effort Estimation? A Comparative Evaluation (with C.J. Burgess and M. Lefley), Information and Software Technology, Vol. 43, No. 14, 2001, pp. 863-873.  95
  2.7  Optimal Software Release Scheduling Based on Artificial Neural Networks (with T. Dohi, Y. Nishio, and S. Osaki), Annals of Software Engineering, Vol. 8, No. 1, 1999, pp. 167-185.  106

Chapter 3  ML Applications in Property and Model Discovery  125
  3.1  Identifying Objects in Procedural Programs Using Clustering Neural Networks (with S.K. Abd-El-Hafiz), Automated Software Engineering, Vol. 7, No. 3, 2000, pp. 239-261.  127
  3.2  Bayesian-Learning Based Guidelines to Determine Equivalent Mutants (with A.M.R. Vincenzi, et al.), International Journal of Software Engineering and Knowledge Engineering, Vol. 12, No. 6, 2002, pp. 675-689.  150

Chapter 4  ML Applications in Transformation  165
  4.1  Using Neural Networks to Modularize Software (with R. Schwanke and S.J. Hanson), Machine Learning, Vol. 15, No. 2, 1994, pp. 137-168.  167

Chapter 5  ML Applications in Generation and Synthesis  199
  5.1  Generating Software Test Data by Evolution (with C.C. Michael, G. McGraw and M.A. Schatz), IEEE Transactions on Software Engineering, Vol. 27, No. 12, December 2001, pp. 1085-1110.  201

Chapter 6  ML Applications in Reuse  227
  6.1  On the Reuse of Software: A Case-Based Approach Employing a Repository (with P. Katalagarianos and Y. Vassiliou), Automated Software Engineering, Vol. 2, No. 1, 1995, pp. 55-86.  229

Chapter 7  ML Applications in Requirement Acquisition  261
  7.1  Inductive Specification Recovery: Understanding Software by Learning From Example Behaviors (with W.W. Cohen), Automated Software Engineering, Vol. 2, No. 2, 1995, pp. 107-129.  263
  7.2  Explanation-Based Scenario Generation for Reactive System Models (with R.J. Hall), Automated Software Engineering, Vol. 7, 2000, pp. 157-177.  286

Chapter 8  ML Applications in Management of Development Knowledge  307
  8.1  Case-Based Knowledge Management Tools for Software Development (with S. Henninger), Automated Software Engineering, Vol. 4, No. 3, 1997, pp. 319-340.  309

Chapter 9  Guidelines and Conclusion  331

References  345
Chapter 1

Introduction to Machine Learning and Software Engineering

1.1. The Challenge

The challenge of developing and maintaining large software systems in a changing environment has been eloquently spelled out in Brooks' classic paper, No Silver Bullet [20]. The following essential difficulties inherent in developing large software still hold true today:

> Complexity: "Software entities are more complex for their size than perhaps any other human construct." "Many of the classical problems of developing software products derive from this essential complexity and its nonlinear increases with size."
> Conformity: Software must conform to the many different human institutions and systems it comes to interface with.
> Changeability: "The software product is embedded in a cultural matrix of applications, users, laws, and machine vehicles. These all change continually, and their changes inexorably force change upon the software product."
> Invisibility: "The reality of software is not inherently embedded in space." "As soon as we attempt to diagram software structure, we find it to constitute not one, but several, general directed graphs, superimposed one upon another." [20]

However, in his "No Silver Bullet" Refired paper [21], Brooks uses the following quote from Glass to summarize his view in 1995:

So what, in retrospect, have Parnas and Brooks said to us? That software development is a conceptually tough business. That magic solutions are not just around the corner. That it is time for the practitioner to examine evolutionary improvements rather than to wait, or hope, for revolutionary ones [56].

Many evolutionary or incremental improvements have been made or proposed, each attempting to address certain aspects of the essential difficulties [13, 47, 57, 96, 110]. For instance, to address changeability and conformity, an approach called transformational programming allows software to be developed, modified, and maintained at the specification level, and then automatically transformed into production-quality software through automatic program synthesis [57]. This software development paradigm will enable software engineering to become the discipline of capturing and automating currently undocumented domain and design knowledge [96]. Software engineers will deliver knowledge-based application generators rather than unmodifiable application programs. A system called LaSSIE was developed to address the complexity and invisibility issues [36]. The multi-view modeling framework proposed in [22] can be considered an attempt to address the invisibility issue.

The application of artificial intelligence techniques to software engineering (AI&SE) has produced some encouraging results [11, 94, 96, 108, 112, 122, 138, 139, 145]. Some of the successful AI techniques include: knowledge-based approaches, automated reasoning, expert systems, heuristic search strategies, temporal logic, planning, and pattern recognition. AI techniques can play an important role in ultimately overcoming the essential difficulties. As a subfield of AI, machine learning (ML) deals with the issue of how to build computer programs that improve their performance at some task through experience [105]. It is dedicated to creating and compiling verifiable knowledge related to the design and construction of artifacts [116]. ML
algorithms have been utilized in many different problem domains. Some typical applications are: data mining problems where large databases contain valuable implicit regularities that can be discovered automatically, poorly understood domains where there is a lack of knowledge needed to develop effective algorithms, and domains where programs must dynamically adapt to changing conditions [105]. Not surprisingly, the field of software engineering turns out to be a fertile ground where many software development and maintenance tasks can be formulated as learning problems and approached in terms of learning algorithms. The past two decades have witnessed many ML applications in software development and maintenance. ML algorithms offer a viable alternative and complement to existing approaches to many SE issues. In his keynote speech at the 1992 annual conference of the American Association for Artificial Intelligence, Selfridge advocated the application of ML to SE (ML&SE):

We all know that software is more updating, revising, and modifying than rigid design. Software systems must be built for change; our dream of a perfect, consistent, provably correct set of specifications will always be a nightmare, and impossible too. We must therefore begin to describe change, to write software so that (1) changes are easy to make, (2) their effects are easy to measure and compare, and (3) the local changes contribute to overall improvements in the software. For systems of the future, we need to think in terms of shifting the burden of evolution from the programmers to the systems themselves... [we need to] explore what it might mean to build systems that can take some responsibility for their own evolution [130].

Though many results in ML&SE have been published in the past two decades, efforts to summarize the state-of-the-practice and to discuss issues and guidelines in applying ML to SE have been few and far between [147-149]. A recent paper [100] focuses its attention on applying decision tree based learning methods to SE issues. Another survey is offered from the perspective of data mining techniques applied to software processes and products [99]. The AI&SE summaries published so far paint with too broad a brush to give an adequate account of ML&SE. There is also a related and emerging area of research under the umbrella of computational intelligence in software engineering (CI&SE) [80, 81, 91, 92, 113]. Research in this area utilizes fuzzy sets, neural networks, genetic algorithms, genetic programming and rough sets (or combinations of those individual technologies) to tackle software development issues. ML&SE and CI&SE have two things in common: the targeted software development problems, and some of the techniques. However, ML offers many additional mature techniques and approaches that can be brought to bear on solving SE problems.

The scope of this book, as depicted in the shaded area in Figure 1, is to attempt to fill this void by studying various issues pertaining to ML&SE (the applications of other AI techniques in SE are beyond the scope of this book). We think this is an important and helpful step if we want to make any headway in ML&SE. In this book, we address various issues in ML&SE by trying to answer the following questions:

> What types of learning methods are available at our disposal?
> What are the characteristics and underpinnings of different learning algorithms?
> How do we determine which learning method is appropriate for what type of software development or maintenance task?
> Which learning methods can be used to make headway on what aspects of the essential difficulties in software development?
> When we attempt to use some learning method to help with an SE task, what are the general guidelines and how can we avoid common pitfalls?
> What is the state-of-the-practice in ML&SE?
> Where is further effort needed to produce fruitful results?
Figure 1. Scope of this book.

1.2. Overview of Machine Learning
The field of ML includes supervised learning, unsupervised learning and reinforcement learning. Supervised learning deals with learning a target function from training examples of its inputs and outputs. Unsupervised learning attempts to learn patterns in the input for which no output values are available. Reinforcement learning is concerned with learning a control policy through reinforcement from an environment. ML algorithms have been utilized in many different problem domains. Some typical applications are: data mining problems where large databases contain valuable implicit regularities that can be discovered automatically, poorly understood domains where there is a lack of knowledge needed to develop effective algorithms, and domains where programs must dynamically adapt to changing conditions [105]. The following list of publications and web sites offers a good starting point for the interested reader to become acquainted with the state-of-the-practice in ML applications [2, 3, 9, 15, 32, 37-39, 87, 99, 100, 103, 105-107, 117-119, 127, 137].

ML is not a panacea for all SE problems. To better use ML methods as tools to solve real world SE problems, we need a clear understanding of both the problems and the tools and methodologies utilized. It is imperative that we know (1) the available ML methods at our disposal, (2) the characteristics of those methods, (3) the circumstances under which the methods can be most effectively applied, and (4) their theoretical underpinnings. Since many SE development or maintenance tasks rely on some function (or functions, mappings, or models) to predict, estimate, classify, diagnose, discover, acquire, understand,
generate, or transform certain qualitative or quantitative aspects of a software artifact or a software process, the application of ML to SE boils down to how to find, through the learning process, such a target function (or functions, mappings, or models) that can be utilized to carry out the SE tasks. Learning involves a number of components: (1) How is the unknown (target or true) function represented and specified? (2) Where can the function be found (the search space)? (3) How can we find the function (heuristics in search, learning algorithms)? (4) Is there any prior knowledge (background knowledge, domain theory) available for the learning process? (5) What properties do the training data have? And (6) what are the theoretical underpinnings and practical issues in the learning process?

1.2.1. Target functions

Depending on the learning method utilized, a target function can be represented in different hypothesis language formalisms (e.g., decision trees, conjunctions of attribute constraints, bit strings, or rules). When a target function is not explicitly defined, but the learner can generate its values for given input queries (as is the case in instance-based learning), the function is said to be implicitly defined. A learned target function may be easy for a human expert to understand and interpret (e.g., first order rules), or it may be hard or impossible for people to comprehend (e.g., weights for a neural network).

Figure 2. Issues in target functions.

Based on its output, a target function can be utilized for SE tasks that fall into the categories of binary classification, multi-value classification and regression. When learning a target function from a given set of training data, its generalization can be either eager (at the learning stage) or lazy
(at the classification stage). Eager learning may produce a single target function from the entire training data, while lazy learning may adopt a different (implicit) function for each query. Evaluating a target function hinges on many considerations: predictive accuracy, interpretability, statistical significance, information content, and the tradeoff between its complexity and degree of fit to the data. Quinlan in [119] states:

Learned models should not blindly maximize accuracy on the training data, but should balance resubstitution accuracy against generality, simplicity, interpretability, and search parsimony.

1.2.2. Hypothesis space

Candidates to be considered for a target function belong to a set called a hypothesis space H. Let f be a true function to be learned. Given a set D of examples (training data) of f, inductive learning amounts to finding a function h ∈ H that is consistent with D; h is said to approximate f. How an H is specified and what structure H has ultimately determine the outcome and efficiency of the learning. The learning becomes unrealizable [124] when f ∉ H. Since f is unknown, we may have to resort to background or prior knowledge to generate an H in which f must exist. How prior knowledge is utilized to specify an appropriate H in which the learning problem is realizable (f ∈ H) is a very important issue. There is also a tradeoff between the expressiveness of an H and the computational complexity of finding a simple and consistent h that approximates f [124]. Through some strong restrictions, the expressiveness of the hypothesis language can be reduced, thus yielding a smaller H. This in turn may lead to a more efficient learning process, but at the risk of being unrealizable.

Figure 3. Issues in hypothesis space H.
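To make the search view of inductive learning concrete, the following minimal Python sketch (ours, added for illustration; the threshold rules and metric values are invented) enumerates a small finite H and keeps the hypotheses consistent with D. An empty result would mean the learning problem is unrealizable for this H.

```python
def consistent_hypotheses(H, D):
    """Return hypotheses h in H that agree with every labeled
    example in D; an empty result means learning is unrealizable."""
    return [h for h in H if all(h(x) == y for x, y in D)]

# H: threshold rules "fault-prone iff metric > t" for a few values of t
H = [lambda x, t=t: x > t for t in (10, 20, 30, 40)]
D = [(15, False), (25, True), (35, True)]  # (metric value, label)

survivors = consistent_hypotheses(H, D)
print(len(survivors))  # 1: only the t=20 rule is consistent with all of D
```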
1.2.3. Search and bias

How can we find a simple and consistent h ∈ H that approximates f? This question essentially boils down to a search problem. Heuristics (inductive or declarative bias) play a pivotal role in the search process. Depending on how examples in D are defined, learning can be supervised or unsupervised. Different learning methods may adopt different types of bias, different search strategies, and different guiding factors in the search. For an f, its approximation can be obtained either locally, with regard to a subset of examples in D, or globally, with regard to all examples in D. Learning can result in either knowledge augmentation or knowledge (re)compilation. Depending on the interaction between a learner and its environment, there are query learning and reinforcement learning.

There are stable and unstable learning algorithms, depending on their sensitivity to changes in the training data [37]. For unstable algorithms (e.g., decision tree, neural network, or rule-learning algorithms), small changes in the training data will result in the algorithms generating significantly different output functions. On the other hand, stable algorithms (e.g., linear regression and nearest-neighbor) are immune to (do not easily succumb to) small changes in the data [37].

Instead of using just a single learned hypothesis for the classification of unseen cases, an ensemble of hypotheses can be deployed whose individual decisions are combined to accomplish the task of new case classification. There are two major issues in ensemble learning: how to construct ensembles, and how to combine the individual decisions of the hypotheses in an ensemble [37].

Figure 4. Issues in search of hypothesis.
Another issue during the search process is the need for interaction with an oracle. If a learner needs an oracle to ascertain the validity of the target function generalization, the search is interactive; otherwise, it is non-interactive [89]. The search is flexible if it can start either from scratch or from an initial hypothesis.

1.2.4. Prior knowledge

Prior (or background) knowledge about the problem domain in which an f is to be learned plays a key role in many learning methods. Prior knowledge can be represented in different ways. It helps learning by eliminating otherwise consistent h and by filling in the explanation of examples, which results in faster learning from fewer examples. It also helps define different learning techniques based on its logical roles, and helps identify relevant attributes, thus yielding a reduced H and speeding up the learning process [124]. There are two issues here. First, for some problem domains, the prior knowledge may be sketchy, inaccurate or simply not available. Second, not all learning approaches are able to accommodate such prior knowledge or domain theories. A common drawback of some general learning algorithms such as decision trees or neural networks is that it is difficult to incorporate prior knowledge from problem domains into the learning algorithms [37]. A major motivation and advantage of stochastic learning (e.g., naive Bayesian learning) and inductive logic programming is their ability to utilize background knowledge from problem domains in the learning algorithm. For those learning methods for which prior knowledge or a domain theory is indispensable, one issue to keep in mind is that the quality of the knowledge (correctness, completeness) has a direct impact on the outcome of the learning.

Figure 5. Issues in prior knowledge.
1.2.5. Training data

Training data gathered for the learning process can vary in terms of (1) the number of examples, (2) the number of features (attributes), and (3) the number of output classes. Data can be noisy or accurate in terms of random errors, can be redundant, can be of different types, and can have different valuations. The quality and quantity of training data have a direct impact on the learning process, as different learning methods have different criteria regarding training data, with some methods requiring large amounts of data, others being very sensitive to the quality of data, and still others needing both training data and a domain theory. Training data may be used just once, or multiple times, by the learner.

Scaling up is another issue. Real world problems can have millions of training cases, thousands of features and hundreds of classes [37]. Not all learning algorithms are known to scale up well with large problems along those three dimensions. When a target function is not easily learned from the data in the input space, a need arises to transform the data into a possibly high-dimensional feature space F and learn the target function in F. Feature selection in F becomes an important issue, as both the computational cost and the generalization performance of the target function can degrade as the number of features grows [32]. Finally, based on the way in which training data are generated and provided to a learner, there are batch learning (all data are available at the outset of learning) and on-line learning (data are available to the learner one example at a time).

Figure 6. Issues in training data.
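The batch/on-line distinction can be made concrete with a minimal sketch (ours; the fault-rate figures are invented). Both learners estimate the same quantity, but the on-line learner folds in one example at a time and never needs to store the data set:

```python
def batch_mean(data):
    """Batch learning: the whole training set is available at once."""
    return sum(data) / len(data)

class OnlineMean:
    """On-line learning: examples arrive one at a time; the current
    estimate is updated incrementally and old examples can be discarded."""
    def __init__(self):
        self.n, self.estimate = 0, 0.0
    def update(self, x):
        self.n += 1
        self.estimate += (x - self.estimate) / self.n
        return self.estimate

fault_rates = [0.10, 0.30, 0.20, 0.40]
learner = OnlineMean()
for x in fault_rates:          # could equally be an unbounded stream
    learner.update(x)
assert abs(learner.estimate - batch_mean(fault_rates)) < 1e-12
```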
1.2.6. Theoretical underpinnings and practical considerations

Underpinning the various learning methods are different justifications: statistical, probabilistic, or logical. What are the frameworks for analyzing learning algorithms? How can we evaluate the performance of a generated function, and determine convergence? What types of practical problems do we have to come to grips with? These questions must be answered if we are to succeed in real world SE applications.
Figure 7. Theoretical and practical issues.

1.3. Learning Approaches

There are many different types of learning methods, each having its own characteristics and lending itself to certain learning problems. In this book, we organize the major types of supervised and reinforcement learning methods into the following groups: concept learning (CL), decision tree learning (DT), neural networks (NN), Bayesian learning (BL), reinforcement learning (RL), genetic algorithms (GA) and genetic programming (GP), instance-based learning (IBL, of which case-based reasoning, or CBR, is a popular method), inductive logic programming (ILP), analytical learning (AL, of which explanation-based learning, or EBL, is a method), combined inductive and analytical learning (IAL), ensemble learning (EL) and support vector machines (SVM). The organization of the different learning methods is largely influenced by [105]. In some of the literature [37, 124], stochastic (statistical) learning is used to refer to learning methods such as BL.
1.3.1. Concept learning

In CL, a target function is represented as a conjunction of constraints on attributes. The hypothesis space H consists of a lattice of possible conjunctions of attribute constraints for a given problem domain. A least-commitment search strategy is adopted to eliminate hypotheses in H that are not consistent with the training set D. This results in a structure called the version space, the subset of hypotheses that are consistent with the training data. The algorithm, called candidate elimination, utilizes generalization and specialization operations to produce the version space with regard to H and D. It relies on a language (or restriction) bias which states that the target function is contained in H. CL is an eager and supervised learning method. It is not robust to noise in the data and offers no support for accommodating prior knowledge.

1.3.2. Decision trees

In DT, a target function is defined as a decision tree. Search in DT is often guided by an entropy-based information gain measure that indicates how much information a test on an attribute yields. Learning algorithms in DT often have a bias for small trees. DT is an eager, supervised, and unstable learning method, and is susceptible to noisy data, a cause of overfitting. It cannot accommodate prior knowledge during the learning process. However, it scales up well to large data sets in several different ways [37]. A popular DT tool is C4.5 [118].
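As an illustration of the entropy-based criterion, here is a minimal Python sketch (ours, not taken from C4.5 or any referenced system; the attribute and label values are hypothetical) that computes the information gain of splitting a labeled sample on one attribute:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(examples, attribute):
    """Entropy reduction obtained by splitting `examples`
    (a list of dicts with a 'label' key) on `attribute`."""
    base = entropy([e["label"] for e in examples])
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e["label"] for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return base - remainder

# Toy module data: does high coupling predict fault-proneness?
data = [
    {"coupling": "high", "label": "fault-prone"},
    {"coupling": "high", "label": "fault-prone"},
    {"coupling": "low",  "label": "not-fault-prone"},
    {"coupling": "low",  "label": "fault-prone"},
]
print(information_gain(data, "coupling"))  # about 0.31 bits
```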
1.3.3. Neural networks

Given a fixed network structure, learning a target function in NN amounts to finding weights for the network such that the network outputs are the same as (or within an acceptable range of) the expected outcomes specified in the training data. A vector of weights in essence defines a target function, which makes the target function very difficult for humans to read and interpret. NN is an eager, supervised, and unstable learning approach and cannot accommodate prior knowledge. A popular algorithm for feed-forward networks is Backpropagation, which adopts a gradient descent search and sanctions an inductive bias of smooth interpolation between data points [105].
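The gradient descent search can be illustrated on the simplest possible network, a single sigmoid unit. The sketch below is ours and is not full Backpropagation (which additionally propagates error terms backward through hidden layers); the metric data are invented:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_unit(data, epochs=1000, lr=0.5):
    """Gradient descent on squared error for one sigmoid unit.
    `data` is a list of (inputs, target) pairs with target in [0, 1]."""
    n = len(data[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]  # w[-1] is the bias
    for _ in range(epochs):
        for x, t in data:
            o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1])
            delta = (o - t) * o * (1 - o)      # derivative of squared error
            for i in range(n):
                w[i] -= lr * delta * x[i]      # descend along the gradient
            w[-1] -= lr * delta                # bias update
    return w

# Hypothetical metric vectors (size, complexity) -> fault-proneness
data = [([0.9, 0.8], 1.0), ([0.2, 0.1], 0.0),
        ([0.8, 0.9], 1.0), ([0.1, 0.3], 0.0)]
w = train_unit(data)  # the learned weight vector IS the target function
```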
1.3.4. Bayesian learning

BL offers a probabilistic approach to inference, based on the assumption that the quantities of interest are governed by probability distributions, and that optimal decisions or classifications can be reached by reasoning about these probabilities together with observed data [105]. BL methods can be divided into two groups based on the outcome of the learner: those that produce the most probable hypothesis given the training data, and those that produce the most probable classification of a new instance given the training data. A target function is thus explicitly represented in the first group, but implicitly defined in the second. One of the main advantages of BL is that it accommodates prior knowledge (in the form of Bayesian belief networks, prior probabilities for candidate hypotheses, or a probability distribution over observed data for a possible hypothesis). The classification of an unseen case is obtained through the combined predictions of multiple hypotheses. BL also scales up well to large data sets. It is an eager and supervised learning method and requires no search during the learning process. Though it has no problem with noisy data, BL has difficulty with small data sets. BL adopts a bias based on the minimum description length principle, which prefers a hypothesis h that minimizes the description length of h plus the description length of the data given h [105]. There are several popular algorithms: MAP (maximum a posteriori), Bayes optimal classifier, naive Bayes classifier, Gibbs, and EM [37, 105].
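A minimal naive Bayes classifier, the second (implicit) style of BL above, can be sketched as follows (ours; the project records and the simple add-one smoothing scheme are illustrative assumptions):

```python
from collections import Counter, defaultdict

def train_nb(examples):
    """Estimate class counts and per-class attribute value counts from
    `examples`, a list of (attribute_dict, class_label) pairs."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)   # (attribute, label) -> value counts
    for attrs, label in examples:
        for a, v in attrs.items():
            value_counts[(a, label)][v] += 1
    return class_counts, value_counts, len(examples)

def classify_nb(attrs, model):
    """Return the class maximizing P(class) * prod P(value | class),
    with crude add-one smoothing over the values seen per class."""
    class_counts, value_counts, n = model
    best, best_score = None, -1.0
    for label, c in class_counts.items():
        score = c / n
        for a, v in attrs.items():
            counts = value_counts[(a, label)]
            score *= (counts[v] + 1) / (c + len(counts) + 1)
        if score > best_score:
            best, best_score = label, score
    return best

# Hypothetical project records: attributes -> effort class
examples = [({"team": "small", "reuse": "high"}, "low-effort"),
            ({"team": "large", "reuse": "low"},  "high-effort"),
            ({"team": "small", "reuse": "low"},  "low-effort"),
            ({"team": "large", "reuse": "high"}, "high-effort")]
model = train_nb(examples)
print(classify_nb({"team": "small", "reuse": "high"}, model))  # low-effort
```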
1.3.5. Genetic algorithms and genetic programming

GA and GP are both biologically inspired learning methods. A target function is represented as a bit string in GA, or as a program in GP. The search process starts with a population of initial hypotheses. Through the crossover and mutation operations, members of the current population give rise to the next generation of the population. During each step of the iteration, hypotheses in the current population are evaluated with regard to a given measure of fitness, with the fittest hypotheses being selected as members of the next generation. The search process terminates when some hypothesis h has a fitness value above a given threshold. Thus, the learning process is essentially embodied in a generate-and-test beam search [105]. The bias is fitness-driven. There are generational and steady-state algorithms.

1.3.6. Instance-based learning

IBL is a typical lazy learning approach in the sense that generalizing beyond the training data is deferred until an unseen case needs to be classified. In addition, a target function is not explicitly defined; instead, the learner returns a target function value when classifying a given unseen case. The target function value is generated based on a subset of the training data that is considered to be local to the unseen example, rather than on the entire training data. This amounts to approximating a different target function for each distinct unseen example, a significant departure from the eager learning methods, where a single target function is obtained as a result of the learner generalizing from the entire training data. The search process is based on statistical reasoning, and consists in identifying training data that are close to the given unseen case and producing the target function value based on its neighbors. Popular algorithms include: k-nearest neighbors, CBR and locally weighted regression.
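A minimal k-nearest-neighbor sketch (ours; the module metrics are invented) shows the lazy style: no function is built at training time, and each query is answered from its local neighborhood:

```python
import math
from collections import Counter

def knn_classify(query, training_data, k=3):
    """Classify `query` by majority vote among its k nearest neighbors;
    `training_data` is a list of (feature_vector, label) pairs."""
    neighbors = sorted(training_data,
                       key=lambda pair: math.dist(query, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical module metrics (normalized size, normalized complexity)
training_data = [
    ((0.9, 0.7), "fault-prone"), ((0.8, 0.9), "fault-prone"),
    ((0.2, 0.1), "ok"), ((0.3, 0.2), "ok"), ((0.1, 0.3), "ok"),
]
print(knn_classify((0.85, 0.8), training_data))  # fault-prone
```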
1.3.7. Inductive logic programming

Because a target function in ILP is defined by a set of (propositional or first-order) rules, it is highly amenable to human readability and interpretability. ILP lends itself to the incorporation of background knowledge during the learning process, and is an eager and supervised learning method. The bias sanctioned by ILP includes rule accuracy, FOIL-gain, or a preference for shorter clauses. There are a number of algorithms: SCA, FOIL, PROGOL, and inverted resolution.

1.3.8. Analytical learning

AL allows a target function, represented in terms of Horn clauses, to be generalized from scarce data. However, it is indispensable that the training data D be augmented with a domain theory B (prior knowledge about the problem domain). The learned h is consistent with both D and B, and good for human readability and interpretability. AL is an eager and supervised learning method, and search is performed in the form of deductive reasoning. The search bias in EBL, a major AL method, is B plus a preference for a small set of Horn clauses (for the learned h). One important perspective on EBL is that learning can be construed as recompiling or reformulating the knowledge in B so as to make it operationally more efficient when classifying unseen cases. EBL algorithms include Prolog-EBG.

1.3.9. Inductive and analytical learning

Both inductive learning and analytical (deductive) learning have their pros and cons. The former requires plentiful data (and is thus vulnerable to data quality and quantity problems), while the latter relies on a domain theory (and is hence susceptible to domain theory quality and quantity problems). IAL is meant to provide a framework in which the benefits of both approaches can be strengthened and the impact of their drawbacks minimized. IAL usually encompasses an inductive learning component and an analytical learning component, e.g., NN+EBL (EBNN), or ILP+EBL (FOCL) [105]. It requires both D and B, and can be an eager and supervised learning method. The issues of target function representation, search, and bias are largely determined by the underlying learning components involved.

1.3.10. Reinforcement learning

RL is the most general form of learning. It tackles the issue of how to learn a sequence of actions, called a control strategy, from indirect and delayed reward information (reinforcement). It is an eager and unsupervised learning method. Its search is carried out through training episodes. Two main approaches exist for reinforcement learning: model-based and model-free approaches [39]. The best-known model-free algorithm is Q-learning, in which actions with the maximum Q value are preferred.
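A tabular Q-learning loop can be sketched in a few lines (ours; the corridor environment, the parameter values, and the epsilon-greedy exploration scheme are illustrative assumptions, not from the chapter):

```python
import random
from collections import defaultdict

def q_learning(states, actions, step, episodes=500,
               alpha=0.5, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning. `step(s, a)` returns (next_state, reward, done);
    the learned policy prefers actions with the maximum Q value."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = random.choice(states), False
        while not done:
            if random.random() < epsilon:                    # explore
                a = random.choice(actions)
            else:                                            # exploit
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = step(s, a)
            best_next = max(Q[(s2, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

# Toy corridor: states 0..4; reaching state 4 yields a delayed reward of 1.
def step(s, a):
    s2 = min(s + 1, 4) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == 4 else 0.0), s2 == 4

Q = q_learning(list(range(4)), ["left", "right"], step)
```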
1.3.11. Ensemble learning

In EL, a target function is essentially the result of combining, through weighted or unweighted voting, a set of component or base-level functions called an ensemble. An ensemble can have a better predictive accuracy than its component functions if (1) the individual functions disagree with each other, (2) the individual functions have a predictive accuracy that is slightly better than random classification (e.g., error rates below 0.5 for binary classification), and (3) the individual functions' errors are at least somewhat uncorrelated [37]. EL can be seen as a learning strategy that addresses inadequacies in the training data (insufficient information in the training data to help select a single best h ∈ H), in the search algorithm (deploying multiple hypotheses compensates for less than perfect search algorithms), and in the representation of H (a weighted combination of individual functions makes it possible to represent a true function f ∉ H). Ultimately, an ensemble is less likely to misclassify than just a single component function. Two main issues exist in EL: ensemble construction, and classification combination. There are bagging, cross-validation and boosting methods for constructing ensembles, and weighted vote and unweighted vote for combining classifications [37]. The AdaBoost algorithm is one of the best methods for constructing ensembles of decision trees [37]. There are two approaches to ensemble construction. One is to combine component functions that are homogeneous (derived using the same learning algorithm and defined in the same representation formalism, e.g., an ensemble of functions derived by DT) and weak (slightly better than random guessing). The other is to combine component functions that are heterogeneous (derived by different learning algorithms and represented in different formalisms, e.g., an ensemble of functions derived by DT, IBL, BL, and NN) and strong (each component function performs relatively well in its own right) [44].
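The arithmetic behind conditions (2) and (3) can be made concrete. If component errors were fully independent, the probability that a majority of the voters errs would fall rapidly with ensemble size; real components are correlated, so the sketch below (ours) is an idealized best case:

```python
from math import comb

def ensemble_error(n, p):
    """Probability that a majority of n independent classifiers,
    each with error rate p, are wrong (odd n, unweighted vote)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 5, 21):
    print(n, round(ensemble_error(n, 0.3), 4))
# 1 -> 0.3, 5 -> 0.1631, 21 -> 0.0264: the vote improves steadily,
# provided each component's error rate stays below 0.5
```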
1.3.12. Support vector machines

Instead of learning a non-linear target function from data in the input space directly, SVM uses a kernel function (defined in the form of inner products of training data) to transform the training data from the input space into a high dimensional feature space F first, and then learns the optimal linear separator (a hyperplane) in F. A decision function, defined based on the linear separator, can be used to classify unseen cases. Kernel functions play a pivotal role in SVM. A kernel function relies only on a subset of the training data called the support vectors.
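The form of the learned classifier can be shown without the lengthier training step. In this sketch (ours), the support vectors, coefficients and bias are assumed to have been produced already by an SVM trainer such as SMO; only the kernelized decision function is illustrated, with an RBF kernel chosen for the example:

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel: an inner product in an implicit
    high-dimensional feature space F."""
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

def svm_decide(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """Sign of f(x) = sum_i alpha_i * y_i * K(s_i, x) + b.
    Only the support vectors (alpha_i > 0) enter the sum."""
    s = sum(a * y * kernel(sv, x)
            for sv, a, y in zip(support_vectors, alphas, labels))
    return 1 if s + b >= 0 else -1

# Hypothetical trained parameters (would come from an SMO-style trainer)
svs    = [(0.9, 0.8), (0.1, 0.2)]
alphas = [0.7, 0.7]
labels = [+1, -1]
print(svm_decide((0.8, 0.9), svs, alphas, labels, b=0.0))  # +1
```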
Table 1 is a summary of the aforementioned learning methods.
1.4. SE Tasks for ML Applications
In software engineering, there are three categories of entities: processes (collections of software related activities, such as constructing a specification, detailed design, or testing), products (artifacts, deliverables, or documents that result from a process activity, such as a specification document, a design document, or a segment of code), and resources (entities required by a process activity, such as personnel, software tools, or hardware) [49]. There are internal and external attributes for entities of the aforementioned categories. Internal attributes describe an entity itself, whereas external attributes characterize the behavior of an entity (how the entity relates to its environment). SE tasks that lend themselves to ML applications include, but are certainly not limited to:

1. Predicting or estimating measurements for either internal or external attributes of processes, products, or resources.
2. Discovering either internal or external properties of processes, products, or resources.
3. Transforming products to accomplish some desirable or improved external attributes.
4. Synthesizing various products.
5. Reusing products or processes.
6. Enhancing processes (such as recovering a specification from software).
7. Managing ad hoc products (such as design and development knowledge).

In the next section, we take a look at applications that fall into these areas.
Table 1. Major learning methods.

Type | Target function representation | Target function generation | Search | Inductive bias | Sample algorithm
AL (EBL) | Horn clauses | Eager, D + B, supervised | Deductive reasoning | B + small set of Horn clauses | Prolog-EBG
BL | Probability tables, Bayesian networks | Eager, supervised, D (global), explicit or implicit | Probabilistic, no explicit search | Minimum description length | MAP, BOC, Gibbs, NBC
CL | Conjunction of attribute constraints | Eager, supervised, D (global) | Version space (VS) guided | c ∈ H | Candidate elimination
DT | Decision trees | Eager, D (global), supervised | Information gain (entropy) | Preference for small trees | ID3, C4.5, Assistant
EL | Indirectly defined through ensemble of component functions | Eager, D (global), supervised | Ensemble construction, classification combination | Determined by ensemble members | AdaBoost (for ensembles of DT)
GA/GP | Bit strings, program trees | Eager, no D, unsupervised | Hill climbing (simulated evolution) | Fitness-driven | Prototypical GA/GP algorithms
IBL | Not explicitly defined | Lazy, D (local), supervised | Statistical reasoning | Similarity to nearest neighbors | K-NN, LWR, CBR
ILP | If-then rules | Eager, supervised, D (global) | Statistical, general-to-specific | Rule accuracy, FOIL-gain, shorter clauses | SCA, FOIL, PROGOL, inverted resolution
NN | Weights for neural networks | Eager, supervised, D (global) | Gradient descent guided | Smooth interpolation between data points | Backpropagation
IAL | Determined by underlying learning methods | Eager, D + B, supervised | Determined by underlying learning methods | Determined by underlying learning methods | KBANN, EBNN, FOCL
RL | Control strategy π* | Eager, no D, unsupervised | Through training episodes | Actions with max. Q value | Q, TD
SVM | Decision function in inner product form | Eager, supervised, D (local: the support vectors) | Kernel mapping | Maximal margin separator | SMO
1.5. State-of-the-Practice in ML&SE
A number of areas in software development have already witnessed machine learning applications. In this section, we take a brief look at reported results and offer a summary of the existing work. The list of applications included in this section, though not complete and exhaustive, should serve to represent a balanced view of the current status. The trend indicates that people have realized the potential of ML techniques and begun to reap the benefits of applying them in software development and maintenance. In the organization below, we use the areas discussed in Section 1.4 as the guideline for grouping ML applications in SE tasks. Tables 2 through 8 summarize the targeted SE objectives and the ML approaches used.

1.5.1. Prediction and estimation
In this group, ML methods are used to predict or estimate measurements for either internal or external attributes of processes, products, or resources. These include: software quality, software size, software development cost, project or software effort, maintenance task effort, software resources, correction cost, software reliability, software defects, reusability, software release timing, productivity, execution times, and the testability of program modules.

1.5.1.1. Software quality prediction

GP is used in [48] to generate software quality models that take as input software metrics collected earlier in development, and predict for each module the number of faults that will be discovered later in development or during operations. These predictions then become the basis for ranking modules, enabling a manager to select as many modules from the top of the list as resources allow for reliability enhancement. A comparative study is done in [88] to evaluate several modeling techniques for predicting the quality of software components, among them an NN model. Another NN based software quality prediction work, reported in [66], is language specific: design metrics for SDL (Specification and Description Language) are first defined, and then used in building prediction models for identifying fault-prone components. In [71, 72], NN based models are used to predict faults and software quality measures.

CBR is the learning method used in the software quality prediction efforts of [45, 54, 74, 77, 78]. The focus of [45] is on comparing the performance of different CBR classifiers, resulting in a recommendation of a simple CBR classifier with Euclidean distance, z-score standardization, no weighting scheme, and selection of the single nearest neighbor for prediction. In [54], CBR is applied to software quality modeling of a family of full-scale industrial software systems, and its accuracy is considered better than that of a corresponding multiple linear regression model in predicting the number of design faults. Two practical classification rules (majority voting and data clustering) are proposed in [77] for software quality estimation of high-assurance systems. [78] discusses an attribute selection procedure that can help identify pertinent software quality metrics to be utilized in CBR-based quality prediction. In [74], a CBR approach is used to calibrate software quality classification models; data from several embedded systems are collected to validate the results.
Table 2. Measurement prediction and estimation.¹

SE Task | ML Method
Software quality (high-risk, or fault-prone component identification) | {GP [48], NN [66, 71, 72, 88], CBR [45, 54, 74, 77, 78], DT [18, 75, 76, 115, 121], GP+DT [79], ILP [30], CL [35]}

¹ An explanation of the notation: {...} indicates that multiple ML methods are each independently applied to the same SE task, and "...+..." indicates that multiple ML methods are collectively applied to an SE task. This applies to Tables 2 through 8.
In [115], a DT based approach is used to generate measurement-based models of high-risk components. The proposed method relies on historical data (metrics from previous releases or projects) for identifying components with fault-prone properties. Another DT based approach is used to build models for predicting high-risk Ada components [18]. Regression trees are used in [75] to classify fault-prone software modules; the approach allows one to strike a preferred balance between Type I and Type II misclassification rates. The SPRINT DT algorithm is used in [76] to build classification trees as quality estimation models that predict the class of software modules (fault-prone or not fault-prone). A set of computational intelligence techniques, of which DT is one, is proposed in [121] for software quality analysis. A hybrid approach, GP-based DT, is proposed in [79] for software quality prediction. Compared with DT alone, the GP-based DT approach is more flexible and allows the optimization of performance objectives other than accuracy. Another comparative study result is reported in [30] on using ILP methods for software fault prediction in C++ programs. Both natural and artificial data are used in evaluating the performance of two ILP methods, and some extensions are proposed to one of them. Software quality prediction is formulated as a CL problem in [35]. It is noted in the study that there are activities (such as data acquisition, feature extraction and example labeling) prior to the actual learning process, and that these activities have an impact on the quality of the outcome. The proposed approach is applied to a set of COBOL programs.

1.5.1.2. Software size estimation

NN and GP are used in [41] to validate the component-based method for software size estimation. In addition to producing results that corroborate the component-based approach to software sizing, the study notes that NN works well with the data, recognizing some nonlinear relationships that the multiple linear regression method fails to detect. The equations evolved by GP provide similar or better values than those produced by the regression equations, and are intelligible, providing confidence in the results.

1.5.1.3. Software cost prediction

A general approach called optimized set reduction, based on DT, is described in [17] for analyzing software engineering data, and is demonstrated to be an effective technique for software cost estimation. A comparative study is done in [19] which includes a CBR technique for software cost prediction. The result reported in [28] indicates that improved predictive performance of software cost models can be obtained through the use of Bayesian analysis, which offers a framework in which both prior expert knowledge and sample data can be accommodated to obtain predictions. A GP based approach is proposed in [42] for searching the space of possible software cost functions.

1.5.1.4. Software (project) development effort prediction

IBL techniques are used in [131] for predicting the software project effort for new projects. The empirical results obtained (from nine different industrial data sets totaling 275 projects) indicate that CBR offers a viable complement to the existing prediction and estimation techniques. Another CBR application in software effort estimation is reported in [140]. The work in [82] focuses on search heuristics to help identify the optimal feature set in a CBR system for predicting software project effort. A comparison is done in [142] of several CBR estimation methods, and the results indicate that estimates obtained through analogues selected by humans are more accurate than estimates obtained through analogues selected by tools, and more accurate than estimates from a simple regression model. DT and NN are used in [135] to help predict software development effort. The results were competitive with conventional methods such as COCOMO and function points. The main advantage of DT and NN based estimation systems is that they are adaptable and nonparametric.
NN is the method used in [63, 146] for software development effort prediction, and the results are encouraging in terms of accuracy. Additional research on ML based software effort prediction includes a genetically trained NN (GA+NN) predictor [133] and a GP based approach [93]. The conclusion in [93] epitomizes the dichotomy in applying an ML method: "GP performs consistently well for the given data, but is harder to configure and produces more complex models", and "the complexity of the GP must be weighed against the small increases in accuracy to decide whether to use it as part of any effort prediction estimation". In addition, in-house data are more significant than public data sets for estimates. Several comparative studies of software effort estimation have been reported in [25, 51, 97], where [51] deals with NN and CBR, [97] with CBR, NN and DT, and [25] with CBR, GP and NN.

1.5.1.5. Maintenance task effort prediction

Models are generated in terms of NN and DT methods, and regression methods, for software maintenance task effort prediction in [68]. The study measures and compares the prediction accuracy of each model, and concludes that DT-based and multiple regression-based models have better accuracy results. It is recommended that prediction models be used as instruments to support expert estimates and to analyze the impact of the maintenance variables on the process and product of maintenance.

1.5.1.6. Software resource analysis

In [129], DT is utilized in software resource data analysis to identify classes of software modules that have high development effort or faults (the concept of "high" is defined with regard to the uppermost quartile relative to past data). Sixteen software systems are used in the study. The decision trees correctly identify 79.3 percent of the software modules that had high development effort or faults.

1.5.1.7. Correction cost estimation

An empirical study is done in [34] where DT and ILP are used to generate models for estimating correction costs in software maintenance. The generated models prove valuable in helping to optimize resource allocation in corrective maintenance activities, and in making decisions about when to restructure or reengineer a component so as to make it more maintainable. A comparison leads to the observation that the ILP-based results perform better than the DT-based results.

1.5.1.8. Software reliability prediction

Software reliability growth models can be used to characterize how software reliability varies with time and other factors. The models offer mechanisms for estimating current reliability measures and for predicting their future values. The work in [69] reports the use of NN for software reliability growth prediction. An empirical comparison is conducted between NN-based models and five well-known software reliability growth models using actual data sets from a number of different software projects. The results indicate that NN-based models adapt well across different data sets and have better prediction accuracy.
1.5.1.9. Defect prediction

BL is used in [50] to predict software defects. Though the system reported is only a prototype, it shows the potential that Bayesian belief networks (BBNs) have for incorporating multiple perspectives on defect prediction into a single, unified model. Variables in the prototype BBN system [50] are chosen to represent the life-cycle processes of specification, design and implementation, and testing (Problem-Complexity, Design-Effort, Design-Size, Defects-Introduced, Testing-Effort, Defects-Detected, Defects-Density-At-Testing, Residual-Defect-Count, and Residual-Defect-Density). The proper causal relationships among those software life-cycle processes are then captured and reflected as arcs connecting the variables. A tool is then used with the BBN model in the following manner: given facts about Design-Effort and Design-Size as input, the tool uses Bayesian inference to derive the probability distributions for Defects-Introduced, Defects-Detected and Defect-Density.

1.5.1.10. Reusability prediction

Predictive models are built with DT in [98] to verify the impact of some internal properties of object-oriented applications on reusability. Effort is focused on establishing a correlation between component reusability and three software attributes (inheritance, coupling and complexity). The experimental results show that some software metrics can be used to predict, with a high level of accuracy, the potentially reusable classes.

1.5.1.11. Software release timing

How to determine the software release schedule is an issue that has an impact on the software product developer, the user, and the market. A method based on NN is proposed in [40] for estimating the optimal software release timing. The method adopts a cost minimization criterion and translates it into a time series forecasting problem. NN is then used to estimate future fault-detection times.

1.5.1.12. Testability prediction

The work reported in [73] describes a case study in which NN is used to predict the testability of software modules from static measurements of the source code. The objective in the study is to predict a quantity between zero and one whose distribution is highly skewed toward zero, which proves to be difficult for standard statistical techniques. The results echo the salient feature of the NN-based predictive models discussed so far: their ability to model nonlinear relationships.

1.5.1.13. Productivity

A BL based approach is described in [136] for estimating the productivity of software projects. A demonstrative BBN is defined to capture the causal relationships among components in the COCOMO81 model, along with probability tables for the nodes. The results obtained are still preliminary.

1.5.1.14. Execution time

The temporal behavior of real-time software is pivotal to overall system correctness. Testing whether a real-time system violates its specified timing constraints for certain inputs thus becomes a critical issue. A GA based approach is described in [143] to produce inputs with the longest or shortest execution times, which can be used to check whether they cause a temporal error or a violation of the system's timing constraints.
1.5.2. Property and model discovery

ML methods are used to identify or discover useful information about software entities. Work in [16] explores using ILP to discover loop invariants. The approach is based on collecting execution traces of a program to be proven correct and using them as learning examples for an ILP system. The states of the program variables at a given point in the execution represent positive examples for the condition associated with that point in the program. A controlled closed-world assumption is utilized to generate negative examples.

Table 3. Property discovery.

SE Task                            ML Method
Program invariants                 ILP [16]
Identifying objects in programs    NN [1]
Boundary of normal operations      SVM [95]
Equivalent mutants                 BL [141]
Process models                     NN [31], EBL [55]
In [1], NN is used to identify objects in procedural programs as an effort to facilitate many maintenance activities (reuse, understanding). The approach is based on cluster analysis and is capable of identifying abstract data types and groups of routines that reference a common set of data.

A data analysis technique called process discovery is proposed in [31] and implemented in terms of NN. The approach is based on first capturing data describing process events from an ongoing process and then generating a formal model of the behavior of that process. Another application involves the use of EBL to synthesize models of programming activities or software processes [55]. It generates a process fragment (a group of primitive actions which achieves a certain goal given some preconditions) from a recorded process history.

Despite its effectiveness at detecting faults, mutation testing requires a large number of mutants to be compiled, executed, and analyzed for possible equivalence to the original program being tested. To reduce the number of mutants to be considered, BL is used in [141] to provide probabilistic information for determining the equivalent mutants.

A detection method based on SVM is described in [95] as an approach for validating adaptive control systems. A case study is done with an intelligent flight control system, and the results indicate that the proposed approach is effective at discovering the boundary of the safe region for the learned domain, thus being able to separate faulty behaviors from normal events.
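For the boundary-of-normal-operations idea, a minimal one-class SVM sketch is shown below. The telemetry data, kernel, and parameters are illustrative assumptions, not the configuration used in [95]: the model is fit only on nominal observations and then flags points that fall outside the learned region.

# Minimal sketch of SVM-based detection of the boundary of normal
# operation (in the spirit of [95], not its actual implementation).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Hypothetical telemetry from nominal runs of a control system:
# two sensor readings per sample.
nominal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# Learn a boundary enclosing the nominal region.
detector = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(nominal)

# Classify new observations: +1 = inside normal region, -1 = anomalous.
new_obs = np.array([[0.2, -0.1], [4.0, 4.5]])
print(detector.predict(new_obs))   # e.g., [ 1 -1 ]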
1.5.3. Transformation

The work in [125, 126] describes a GP system that can transform serial programs into functionally identical parallel programs. Functional equivalence between the input and the output of the transformation can be proven, which greatly enhances the system's prospects for use in commercial environments.

Table 4. Transformation.

SE Task                                                             ML Method
Transform serial programs to parallel ones                          GP [125, 126]
Improve software modularity                                         CBR+NN [128], GA [62]
Mapping OO applications to heterogeneous distributed environments   GA [27]
A module architecture assistant is developed in [128] to help software architects improve the modularity of large programs. A model for modularization is established in terms of nearest-neighbor clustering and classification, and is used to make recommendations for rearranging module membership in order to improve modularity. The tool learns similarity judgments that match those of the software architect by performing back propagation on a specialized neural network. Another work on software modularization [62] introduces a new representation (aimed at reducing the size of the search space) and a new crossover operator (designed to promote the formation and retention of building blocks) for a GA-based approach. GA is used in [27] for experimenting with and evaluating a partitioning and allocation model for mapping object-oriented applications to heterogeneous distributed environments. By effectively distributing the software components of an object-oriented system in a distributed environment, it is hoped to achieve performance goals such as load balancing, maximizing concurrency and minimizing communication costs.
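The following toy GA sketches the search-based view of modularization taken in work such as [62] and [27]. Everything here is an illustrative assumption rather than a published encoding: the dependency graph, the fitness function (reward intra-cluster edges, penalize inter-cluster edges), the representation (one cluster label per module), and the GA parameters.

# Toy GA for software modularization (illustrative; not the encoding of [62]).
import random
random.seed(1)

N_MODULES, N_CLUSTERS = 8, 3
# Hypothetical dependency edges between modules.
EDGES = [(0, 1), (0, 2), (1, 2), (3, 4), (4, 5), (3, 5), (6, 7), (2, 3)]

def fitness(assign):
    # Reward edges kept inside a cluster, penalize edges that cross clusters.
    intra = sum(1 for a, b in EDGES if assign[a] == assign[b])
    return intra - (len(EDGES) - intra)

def crossover(p, q):
    cut = random.randrange(1, N_MODULES)
    return p[:cut] + q[cut:]

def mutate(ind):
    ind[random.randrange(N_MODULES)] = random.randrange(N_CLUSTERS)

pop = [[random.randrange(N_CLUSTERS) for _ in range(N_MODULES)]
       for _ in range(30)]
for _ in range(100):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]                      # elitist selection
    children = []
    while len(children) < 20:
        child = crossover(random.choice(parents), random.choice(parents))
        if random.random() < 0.3:
            mutate(child)
        children.append(child)
    pop = parents + children

best = max(pop, key=fitness)
print("module -> cluster:", best, "fitness:", fitness(best))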
1.5.4. Generation and synthesis
In [10], a test case generation method is proposed that is based on ILP. An adequate test set is generated as a result of inductive learning of programs from finite sets of input-output examples. The method scales up well when the size or the complexity of the program to be tested grows. It stops being practical if the number of alternatives (or possible errors) becomes too large. A GP based approach is described in [46] to select and evaluate test data. A tool is reported in [101, 102] that uses, among other things, GA to generate dynamic test data for C/C++ programs. The tool is fully automatic and supports all C/C++ language constructs. Test results have been obtained for programs containing up to 2000 lines of source code with complex, nested conditionals. Three separate works on test data generation are also based on GA
[14, 24, 144]. In [14], the issue of how to instrument programs with flag variables is considered. GA is used in [24] to help generate test data for program paths, whereas the work in [144] focuses on test data generation for structural test methods.

Table 5. Generation and synthesis.

SE Task                        ML Method
Test cases/data                ILP [10], GA [14, 24, 101, 102, 144], GP [46]
Test resource                  GA [33]
Project management rules       {GA, DT} [5]
Software agents                GP [120]
Design repair knowledge        CBR + EBL [6]
Design schemas                 IBL [61]
Data structures                GP [86]
Programs/scripts               IBL [12], {CL, AL} [104]
Project management schedule    GA [26]
The testing resource allocation problem is considered in [33], where a GA-based approach is described. The results are based on consideration of both system reliability and testing cost.

In [5], DT and GA are utilized to learn software project management rules. The objective is to provide decision rules that can help project managers make decisions at any stage of the development process.

Synthesizing Unix shell scripts from a high-level specification is made possible through IBL in [12]. The tool has a retrieval mechanism that allows an appropriate source analog to be automatically retrieved given a description of a target problem. Several domain-specific retrieval heuristics are utilized to estimate the closeness of two problems at the implementation level based on their perceived closeness at the specification level. Though the prototype system demonstrates the viability of the approach, its scalability remains to be seen.

A prototype of a software engineering environment is described in [6] that combines CBR and EBL to synthesize design repair rules for software design. Though the preliminary results are promising, the generality of the learning mechanism and the scaling-up issue remain open questions, as cautioned by the authors. In [61], IBL provides the impetus for a system that acquires software design schemas from design cases of existing applications.
GP is used in [120] to automatically generate agent programs that communicate and interact to solve problems; however, the work reported so far is on a two-agent scenario. Another GP-based approach is geared toward generating abstract data types, i.e., data structures and the operations to be performed on them [86]. In [104], CL and AL are used to synthesize search programs for a Lisp code generator in the domain of combinatorial integer constraint satisfaction problems. GA is behind the effort to generate project management schedules in [26]: using a programmable goal function, the technique can generate a near-optimal allocation of resources and a schedule that satisfies a given task structure and resource pool.

1.5.5. Reuse library construction and maintenance
This area presents itself as a fertile ground for CBR applications. In [109], CBR is the cornerstone of a reuse library system. A component in the library is represented as a set of feature/term pairs. Similarity between a target and a candidate is defined by a distance measure, computed through comparator functions based on the subsumption, closeness and package relations. Components in a software reuse library have an added advantage in that they can be executed on a computer, yielding stronger results than could be expected from generic CBR. The work reported in [52] takes advantage of this property by first retrieving software modules from the library, adapting them to new problems, and then subjecting those new cases to execution on system-generated test sets in order to evaluate the results of CBR. CBR can be augmented with additional mechanisms to address other issues in reuse libraries. Such is the case in [70], where CBR is adopted in conjunction with a specificity-genericity hierarchy to locate and adapt software components to given specifications. The proposed method focuses its attention on the evolving nature of the software repository. A rough sketch of this style of distance-based retrieval follows the table below.

Table 6. Reuse.

SE Task                                        ML Method
Similarity computing                           CBR [109]
Active browsing                                IBL [43]
Cost of rework                                 DT [8]
Knowledge representation                       CBR [52]
Locate and adapt software to specifications    CBR [70]
Generalizing program abstractions              EBL [65]
Clustering of components                       GA [90]
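As a rough illustration of feature-based similarity retrieval for a reuse library (the comparator functions of [109] are more elaborate; the components, features, and similarity measure here are all illustrative assumptions):

# Rough sketch of feature/term-pair retrieval for a reuse library
# (illustrative; not the comparator functions of [109]).
LIBRARY = {  # hypothetical components and their feature/term pairs
    "stack_lib": {"structure": "stack", "lang": "C", "thread_safe": "no"},
    "queue_lib": {"structure": "queue", "lang": "C", "thread_safe": "yes"},
    "ts_stack":  {"structure": "stack", "lang": "C", "thread_safe": "yes"},
}

def similarity(query, candidate):
    # Fraction of shared features whose terms match the query.
    shared = [f for f in query if f in candidate]
    if not shared:
        return 0.0
    return sum(query[f] == candidate[f] for f in shared) / len(shared)

def retrieve(query, k=2):
    ranked = sorted(LIBRARY.items(),
                    key=lambda item: similarity(query, item[1]),
                    reverse=True)
    return ranked[:k]

target = {"structure": "stack", "lang": "C", "thread_safe": "yes"}
for name, feats in retrieve(target):
    print(name, round(similarity(target, feats), 2))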
How to find a better way of organizing reusable components so as to facilitate efficient retrieval is another area where ML finds application. GA is used in [90] to optimize the multiway clustering of software components in a reusable class library. The approach takes into consideration the number of clusters, the similarity within a cluster, and the similarity among clusters.

In [8], DT is used to model and predict the cost of rework in a library of reusable software components. Prescriptive coding rules can be generated from the model and used by programmers as guidelines to reduce the cost of rework in the future. The objective of the work is to use DT to help manage the maintenance of reusable components, and to improve the way the components are produced so as to reduce maintenance costs in the library.

A technique called active browsing is incorporated into a tool that assists the browsing of a reusable library for desired components [43]. An active browser infers its similarity measure from a designer's normal browsing actions without any special input. It then recommends to the designer components it estimates to be close to the target of the search, accomplished through a learning process similar to IBL.

EBL is used as the basis for capturing and generalizing program abstractions developed in practice, to increase their potential for reuse [65]. The approach is motivated by the explicit domain knowledge embodied in data type specifications and the mechanisms for reasoning about such knowledge used in validating software.

1.5.6. Requirement acquisition

CL is used to support scenario-based requirements engineering in the work reported in [85]. The paper describes a formal method for supporting the process of inferring specifications of system goals and requirements inductively from interaction scenarios provided by stakeholders. The method is based on a learning algorithm that takes scenarios as examples and counter-examples (positive and negative scenarios) and generates goal specifications as temporal rules.

Table 7. Process enhancement.

SE Task                                                             ML Method
Derivation of specifications of system goals and requirements      CL [85]
Extract specifications from software                               ILP [29]
Acquire knowledge for specification refinement and augmentation    {DT, NN} [111]
Acquire and maintain specifications consistent with scenarios      EBL [58, 59]
Another work [58] presents a scenario-based elicitation and validation assistant that helps requirements engineers acquire and maintain a specification consistent with the scenarios provided.
The system relies on EBL to generalize scenarios in order to state and prove validation lemmas. A scenario generation tool is built in [59] that adopts a heuristic approach based on piecing together partially satisfying scenarios from the requirements library and using EBL to abstract them so that they can be co-instantiated.

A technique is developed in [29] to extract specifications from software using ILP. It allows instrumented code to be run on a number of representative cases to generate examples of the code's behavior. ILP is then used to generalize these examples into a general description of some aspect of the system's behavior.

Software specifications are imperfect reflections of reality and are prone to errors, inconsistencies and incompleteness. Because the quality of a software system hinges directly on the accuracy and reliability of its specification, there is a dire need for tools and methodologies for specification enhancement. In [111], DT and NN are used to extract and acquire knowledge from sample problem data for specification refinement and augmentation.
1.5.7. Capture development knowledge
How to capture and manage software development knowledge is the theme of this application group where both papers report work utilizing CBR as the tool. In [64], a CBR based infrastructure is proposed that supports evolving knowledge and domain analysis methods that capture emerging knowledge and synthesize it into generally applicable forms. Software process knowledge management is the focus in [4]. A hybrid approach including CBR is proposed to support the customization of software processes. The purpose of CBR is to facilitate reuse of past experiences.
Table 8. Management.

SE Task                                              ML Method
Collect and manage software development knowledge    CBR [64]
Software process knowledge                           CBR [4]

1.6. Status
In this section, we offer a summary of the state-of-the-practice in this niche area. The application patterns of ML methods in the body of existing work are summarized in Table 9.
Table 9. Application patterns of ML methods.

Pattern       Description
Convergent    Different ML methods each being applied to the same SE task
Divergent     A single ML method being applied to different SE tasks
Compound      Several ML methods being combined together for a single SE task
Figure 8 captures a glimpse of the types of software engineering issues, across the seven application areas, that people have been interested in applying ML techniques to. Figure 9 summarizes the publication counts in those areas. For instance, of the eighty-six publications included in Subsection 1.5 above, forty-five (52%) deal with the issue of how to build models to predict or estimate certain properties of the software development process or its artifacts. Figure 10, on the other hand, provides some indication of which ML techniques people are most comfortable using. Based on the classification, IBL/CBR, NN, and DT are the top three most popular techniques, in that order, accounting for fifty-seven percent of the ML applications in our study.
Figure 8. Number of different SE tasks in each application area.
Figure 9. Number of publications in each application area.
Figure 10. State-of-the-practice from the perspective of ML algorithms.
Table 10 depicts the distribution of ML algorithms in the seven SE application areas. The trend information captured in Figure 11, though only based on the published work we have been able to collect, should be indicative of the increased interest in ML&SE.
Table 10. ML methods in SE application areas.

Application area    ML methods
Prediction          NN, IBL/CBR, DT, GA, ILP, GP, CL, BL
Discovery           NN, ILP, EBL, BL, SVM
Transformation      NN, IBL/CBR, GA, GP
Generation          IBL/CBR, DT, GA, ILP, GP, EBL, CL, AL
Reuse               IBL/CBR, DT, GA, EBL
Acquisition         NN, DT, ILP, EBL, CL
Management          CBR
Figure 11. Publications on applying ML algorithms in SE.

Tables 11-21 summarize the applications of individual ML methods.
Table 11. IBL/CBR applications.

Category          Application
Prediction        Development effort
Transformation    Modularity
Generation        Design repair knowledge, Design schemas, Programs/scripts
Reuse             Similarity computing, Active browsing, Knowledge representation, Locate/adapt software to specifications
Management        Software development knowledge, Software process knowledge
Table 12. NN applications.

Category          Application
Prediction        Quality, Size, Development effort, Maintenance effort, Reliability, Release time, Testability
Discovery         Identifying objects, Process models
Transformation    Modularity
Acquisition       Specification refinement
Table 13. DT applications.

Category          Application
Prediction        Quality, Development cost, Development effort, Maintenance effort, Resource analysis, Correction cost, Reusability
Generation        Project management rules
Reuse             Cost of rework
Acquisition       Specification refinement

Table 14. GA applications.

Category          Application
Prediction        Development effort, Execution time
Transformation    Modularity, Object-oriented application
Generation        Test data, Test resource allocation, Project management rules, Project management schedule
Reuse             Clustering of components
Table 15. GP applications.

Category          Application
Prediction        Quality, Size, Development effort, Software cost
Transformation    Parallel programs
Generation        Test data, Software agents, Data structures

Table 16. ILP applications.

Category          Application
Prediction        Quality, Correction cost
Discovery         Program invariants
Generation        Test data
Acquisition       Extract specifications from software

Table 17. EBL applications.

Category          Application
Discovery         Process models
Generation        Design repair knowledge
Reuse             Generalizing program abstractions
Acquisition       Acquiring specifications from scenarios
Table 18. BL applications.

Category          Application
Prediction        Development cost, Defects, Productivity
Discovery         Mutants

Table 19. CL applications.

Category          Application
Prediction        Quality
Generation        Programs/scripts
Acquisition       Derivation of specifications

Table 20. AL applications.

Category          Application
Generation        Programs/scripts

Table 21. SVM application.

Category          Application
Discovery         Operation boundary
The body of existing work we have been able to glean represents the efforts underway to take advantage of the unique perspective ML affords for SE tasks. Here we point out some general issues in ML&SE:

> Applicability and justification. When adopting an ML method for an SE task, we need a good understanding of the dimensions of the learning method and the characteristics of the SE task, and must find the best match between them. Such a justification offers a necessary condition for successfully applying an ML method to an SE task.

> Issue of scaling up. Whether a learning method can be effectively scaled up to handle real-world SE projects is an issue to be reckoned with. What seems to be an effective method for a scaled-down problem may hit a snag when subjected to a full-scale version of the problem. General guidelines regarding this issue are highly desirable.

> Performance evaluation. Given an SE task, some ML-based approaches may outperform their conventional (non-ML) counterparts; others may not offer any performance boost but simply provide a complement or alternative to the available tools; yet another group may fill a void in the existing repertoire of SE tools. In addition, we are interested in finding out whether there are significant performance differences among applicable ML methods for an SE task. To sort out those different scenarios, we need a systematic way of evaluating the performance of a tool. Let S be a set of SE tasks, and let T_C and T_L contain a set of conventional (non-ML) SE tools and a set of ML-based tools, respectively. Figure 12 describes some possible scenarios between S and T_C/T_L, where T_C(s) ⊆ T_C and T_L(s) ⊆ T_L denote the subsets of tools applicable to an SE task s. If P is defined to be some performance measure (e.g., prediction accuracy), then we can use P(t, s) to denote the performance of t for s ∈ S, where t ∈ (T_C(s) ∪ T_L(s)). Let Δ ::= < | = | >. Given an s ∈ S, the performance of two applicable tools can be compared in terms of the following relationships:

    P(t_i, s) Δ P(t_j, s), where t_i ∈ T_C(s) and t_j ∈ T_L(s),
    P(t_k, s) Δ P(t_l, s), where t_k, t_l ∈ T_L(s) and |T_L(s)| > 1.
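A tiny sketch of these comparison relations, with made-up PRED-style scores standing in for the performance measure P (tool names and numbers are hypothetical):

# Illustrative comparison of applicable tools on a task via P(t, s).
P = {  # P[tool][task], made-up numbers
    "regression": {"effort": 0.52},   # a conventional tool in T_C(effort)
    "nn_model":   {"effort": 0.68},   # ML-based tools in T_L(effort)
    "cbr_model":  {"effort": 0.61},
}

def compare(t1, t2, task):
    a, b = P[t1][task], P[t2][task]
    return "<" if a < b else ("=" if a == b else ">")

print("P(regression) vs P(nn_model):", compare("regression", "nn_model", "effort"))
print("P(nn_model) vs P(cbr_model): ", compare("nn_model", "cbr_model", "effort"))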
Figure 12. Relationships between S and Tc, and between S and TL.
> Integration. How an ML-based tool can be seamlessly integrated into the SE development environment or tool suite is another issue that deserves attention. If it takes a heroic effort to integrate a tool, that effort may ultimately limit its applicability.

1.7. Applying ML Algorithms to SE Tasks
In applying machine learning to any real-world problem, there is usually a course of actions to follow. What we propose is a guideline with the following steps; a small sketch illustrating the later steps appears at the end of this section.

Problem formulation. The first step is to formulate a given problem so that it conforms to the framework of the particular learning method chosen for the task. Different learning methods have different inductive biases, adopt different search strategies based on various guiding factors, have different requirements regarding domain theory (presence or absence) and training data (valuation and properties), and are based on different justifications of reasoning (refer to Figures 2-7). All these issues must be taken into consideration during the problem formulation stage. This step is of pivotal importance to the applicability of the learning method. Strategies such as divide-and-conquer may be needed to decompose the original problem into a set of subproblems more amenable to the chosen learning method. Sometimes the best formulation of a problem may not be the one most intuitive to a machine learning researcher [87].

Problem representation. The next step is to select an appropriate representation for both the training data and the knowledge to be learned. As can be seen in Figure 2, different learning methods have different representational formalisms. Thus, the representation of the attributes and features in the learning task is often problem-specific and formalism-dependent.

Data collection. The third step is to collect the data needed for the learning process. The quality and quantity of the data needed depend on the selected learning method. Data may need to be preprocessed before they can be used in the learning process.

Domain theory preparation. Certain learning methods (e.g., EBL) rely on the availability of a domain theory for the given problem. How to acquire and prepare a domain theory (or background knowledge), and what the quality of a domain theory is (correctness, completeness), therefore become important issues that affect the outcome of the learning process.

Performing the learning process. Once the data and a domain theory (if needed) are ready, the learning process can be carried out. The data will be divided into a training set and a test set. If some learning tool or environment is utilized, the training data and the test data may need to be organized according to the tool's requirements. Knowledge induced from the training set is validated on the test set. Because of different splits between the training set and the test set, the learning process itself is an iterative one.

Analyzing and evaluating learned knowledge. Analysis and evaluation of learned knowledge is an integral part of the learning process. The interestingness and the performance of the acquired knowledge are scrutinized during this step, often with help from human experts, which hopefully leads to knowledge refinement. If learned knowledge is deemed insignificant, uninteresting, irrelevant, or deviating, this may indicate the need for revisions at earlier stages such as problem formulation and representation. There are known practical problems in many learning methods, such as overfitting, local minima, or the curse of dimensionality, that are due
to data inadequacy, noise or irrelevant attributes in the data, the nature of a search strategy, or an incorrect domain theory.

Fielding the knowledge base. What this step entails is that the learned knowledge be put to use [87]. The knowledge could be embedded in a software development system or a software product, or used without embedding it in a computer system. As observed in [87], the power of machine learning methods comes not from a particular induction method, but from proper formulation of the problems and from crafting the representation to make learning tractable.
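The later steps of this guideline can be made concrete with a small sketch: hypothetical project data is split into training and test sets, a DT-based effort estimator is learned, and the induced model is evaluated on the held-out data. All feature names and numbers below are illustrative assumptions.

# Sketch of "performing the learning process" and "analyzing and
# evaluating learned knowledge" with a DT-based effort estimator.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Hypothetical project data: [size_kloc, team_experience_years].
X = rng.uniform([1, 1], [100, 15], size=(60, 2))
effort = 3.0 * X[:, 0] / np.sqrt(X[:, 1]) + rng.normal(0, 5, 60)

# Divide data into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, effort, test_size=0.25, random_state=0)
model = DecisionTreeRegressor(max_depth=4).fit(X_train, y_train)

# Validate knowledge induced from the training set on the test set.
rel_err = np.abs(model.predict(X_test) - y_test) / y_test
print(f"mean magnitude of relative error on test set: {rel_err.mean():.2f}")

In practice this split-train-evaluate cycle would be repeated over different splits, matching the iterative character of the learning process described above.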
1.8. Organization of the Book
The rest of the book is organized as follows. Chapters 2 through 8 cover ML applications in the seven categories of SE tasks, respectively.

Chapter 2 deals with ML applications in software measurement and attribute prediction and estimation. This is the most concentrated category, including forty-five publications in our study. In this chapter, a collection of seven papers is selected as representative of activities in this category. These seven papers include ML applications in predicting or estimating software quality, software development cost, project effort, software defects, and software release timing. The applications involve the ML methods of BL, DT, NN, CBR, and GP.

In Chapter 3, two papers are included to address the use of ML methods for discovering software properties and models, one dealing with using NN to identify objects in procedural programs, and the other tackling the issue of detecting equivalent mutants in mutation testing using BL.

The main theme in Chapter 4 is software transformation. ML methods are utilized to transform software into versions with desirable properties (e.g., from serial programs to parallel programs, from a less modularized program to a more modularized one, or mapping object-oriented applications to heterogeneous distributed environments). In this chapter, we include one paper that deals with the issue of transforming software systems for better modularity using nearest-neighbor clustering and a special-purpose NN.

Chapter 5 describes ML applications where software artifacts are generated or synthesized. The chapter contains one paper that describes a GA-based approach to test data generation. The proposed approach is based on dynamic test data generation and is geared toward generating condition-decision adequate test sets.

Chapter 6 takes a look at how ML methods are utilized to improve the process of constructing and maintaining reuse libraries. Software reuse library construction and maintenance has been a fertile ground for ML applications. The paper included in this chapter describes a CBR-based approach to locating and adapting reusable components to particular specifications.

In Chapter 7, software specification is the target issue. Two papers are selected for the chapter. The first describes an ILP-based approach to extracting specifications from software. The second discusses an EBL-based approach to scenario generation, which is an integral part of specification modeling.

Chapter 8 is concerned with how ML methods are used to capture and manage software development or process knowledge. The one paper in the chapter discusses a CBR-based method for collecting and managing software development knowledge as it evolves in an organizational context.

Finally, Chapter 9 offers some guidelines on how to select ML methods for SE tasks and how to formulate an SE task as a learning problem, and concludes the book with remarks on where future effort will be needed in this niche area.
Chapter 2

ML Applications in Prediction and Estimation

As evidenced in Chapter One, the majority of the ML applications (52%) deal with the issue of how to build models to predict or estimate certain properties of the software development process or its artifacts. The subject of the prediction or estimation involves a range of properties: quality, size, cost, effort, reliability, reusability, productivity, and testability. In this chapter, we include a set of seven papers where ML methods are used to predict or estimate measurements for either internal or external attributes of processes, products, or resources in software engineering. These include: software quality, software cost, project or software development effort, software defects, and software release timing. Table 22 summarizes the current state-of-the-practice in this application area.

Table 22. ML methods used in prediction and estimation.

Subject               ML methods
Quality               NN, DT, GP, ILP, CL
Size                  NN, GP
Development cost      DT, BL
Development effort    NN, IBL/CBR, DT, GA, GP
Maintenance effort    NN, DT
Resource analysis     DT
Correction cost       DT, ILP
Defects               BL
Reusability           DT
Release time          NN
Productivity          BL
Execution time        GA
Testability           NN
Software cost         GP
Reliability           NN
A primary concern in prediction or estimation models and methods is accuracy. There are some general issues regarding prediction accuracy. The first is measurement, namely, how accuracy is to be measured. There are several accuracy measures, and the choice of which one to use may depend on the objectives one has when using the predictor. The second issue is sensitivity, that is, how sensitive a prediction method's accuracy is to changes in data and over time. Different approaches may have different levels of sensitivity.
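As a concrete illustration of the measurement issue, the sketch below computes two accuracy measures that are common in this literature: MMRE (mean magnitude of relative error) and PRED(l), the fraction of estimates falling within 100·l percent of the actuals; the phrase "within 30 percent of the actuals N percent of the time" used in this chapter is PRED(.30). The effort numbers below are made up.

# Two common accuracy measures for effort predictors (illustrative data).
import numpy as np

actual    = np.array([120.0, 80.0, 45.0, 200.0, 60.0])
predicted = np.array([100.0, 95.0, 40.0, 270.0, 58.0])

mre = np.abs(predicted - actual) / actual      # magnitude of relative error

def pred(level, mre_values):
    """Fraction of estimates whose relative error is within `level`."""
    return np.mean(mre_values <= level)

print(f"MMRE      = {mre.mean():.2f}")         # mean magnitude of relative error
print(f"PRED(.30) = {pred(0.30, mre):.2f}")    # here 0.80: 80% within 30%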
The paper by Chulani, Boehm and Steece [28] describes a BL approach to software development cost prediction. A salient feature of BL is that it accommodates prior knowledge and allows both data and prior knowledge to be utilized in making inferences. This proves especially helpful in circumstances where data are scarce and incomplete. The results obtained by the authors indicate that the BL approach has a predictive performance (within 30 percent of the actual values 75 percent of the time) that is significantly better than that of the previous multiple regression approach (within 30 percent of the actual values only 52 percent of the time) on their latest sample of 161 project datapoints.

The paper by Srinivasan and Fisher [135] deals with the issue of estimating software development effort. This is an important task in the software development process, as either underestimation or overestimation of the development effort would have adverse effects. Their work describes the use of two ML methods, DT and NN, for building software development effort estimators from historical data. The experimental results indicate that the performance of DT- and NN-based estimators is competitive with traditional estimators. Though just as sensitive to various aspects of data selection and representation as the traditional models, ML-based estimators have the major benefit of being adaptable and nonparametric.

The paper by Shepperd and Schofield [131] adopts a CBR approach to software project effort estimation. In their approach, projects are characterized in terms of a feature set that ranges from as few as one to as many as 29 features, including the number of interfaces, the development method, and the size of the functional requirements document. Cases for completed projects are stored along with their features and actual values of development effort. Similarity among cases is defined based on project features. Predicting the development effort of a new project amounts to retrieving its nearest neighbors in the case base and using their known effort values as the basis for estimation. The sensitivity analysis indicates that estimation by analogy may be highly unreliable if the size of the case base is below 10 known projects, and that this approach can be susceptible to outlying projects, though the influence of a rogue project can be ameliorated as the size of the dataset increases.

The paper by Fenton and Neil [50] offers a critical analysis of the existing defect prediction models and proposes an alternative approach to defect prediction using Bayesian belief networks (BBN), part of the BL method. Software defect prediction is a very useful and important tool for gauging the likely delivered quality and maintenance effort before software systems are deployed. Predicting defects requires a holistic model rather than a single-issue model that hinges on size, complexity, testing metrics, or process quality data alone. It is argued in [50] that all these factors must be taken into consideration for defect prediction to be successful. BBN proves to be a very useful approach to the software defect prediction problem. A BBN represents the joint probability distribution for a set of variables.
This is accomplished by specifying (a) a directed acyclic graph (DAG) in which nodes represent variables and arcs correspond to conditional independence assumptions (causal knowledge about the problem domain), and (b) a set of local conditional probability tables, one for each variable [67, 105]. A BBN can be used to infer the probability distribution for a target variable (e.g., "Defects Detected"), which specifies the probability that the variable will take on each of its possible values given the observed values of the other variables. In general, a BBN can be used to compute the probability distribution for any subset of variables given the values or distributions for any subset of the remaining variables. In [50], variables in the BBN model are chosen to represent the life-cycle processes of specification, design and implementation, and testing. The proper causal relationships among those software life-cycle processes are then captured and reflected as arcs connecting the variables. A tool is then used with the BBN model in the following manner: given facts about Design-Effort and Design-Size as input, the tool uses Bayesian inference to derive the probability distributions for Defects-Introduced, Defects-Detected and Defect-Density.
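The following toy sketch shows this style of inference by enumeration on a three-variable fragment. The structure loosely mirrors the idea in [50] (design effort influences defects introduced, which in turn influence defects detected), but every probability below is an illustrative assumption, not a value from [50].

# Toy Bayesian-network inference by enumeration (illustrative values).
# P(DesignEffort): 'low' or 'high'
P_DE = {"low": 0.4, "high": 0.6}
# P(DefectsIntroduced | DesignEffort): 'few' or 'many'
P_DI = {"low":  {"few": 0.3, "many": 0.7},
        "high": {"few": 0.8, "many": 0.2}}
# P(DefectsDetected | DefectsIntroduced): 'few' or 'many'
P_DD = {"few":  {"few": 0.9, "many": 0.1},
        "many": {"few": 0.4, "many": 0.6}}

def joint(de, di, dd):
    # Chain-rule factorization implied by the DAG DE -> DI -> DD.
    return P_DE[de] * P_DI[de][di] * P_DD[di][dd]

# Infer P(DefectsIntroduced | DesignEffort='high', DefectsDetected='many').
posterior = {di: joint("high", di, "many") for di in ("few", "many")}
total = sum(posterior.values())
for di, p in posterior.items():
    print(f"P(DefectsIntroduced={di} | evidence) = {p / total:.3f}")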
The paper by Khoshgoftaar, Allen and Deng [75] discusses using a DT approach to classify fault-prone software modules. The objective is to predict which modules are fault-prone early enough in the development life cycle. In the regression tree to be learned, the dependent variable is a real-valued response variable, the independent variables are the predictors on which the internal nodes of the tree are defined, and the leaf nodes are labeled with a real quantity for the response variable. A large legacy telecommunication system is used in the case study, where four consecutive releases of the software form the basis for the training and test data sets (release 1 is used as the training data set; releases 2-4 are used as test data sets). A classification rule is proposed that gives the developer latitude to balance the two types of misclassification rates. The case study results indicate satisfactory prediction accuracy and robustness.

The paper by Burgess and Lefley [25] conducts a comparative study of software effort estimation in terms of three ML methods: GP, NN and CBR. A well-known data set of 81 projects from the late 1980s is used for the study. The input variables are restricted to those available at the specification stage. The comparisons are based on the accuracy of the results, the ease of configuration and the transparency of the solutions. The results indicate that the explanatory nature of estimation by analogy gives CBR an advantage in its interaction with the end user, and that GP can lead to accurate estimates and has the potential to be a valid addition to the suite of tools for software effort estimation.

The paper by Dohi, Nishio and Osaki [40] proposes an NN-based approach to estimating the optimal software release timing that minimizes the relevant cost criterion. Because the essential problem behind software release timing is to estimate future fault-detection time intervals, the authors adopt two typical NNs (a feed-forward NN and a recurrent NN) for time series forecasting. Six data sets of real software fault-detection times are used in the case study. The results indicate that the predictive accuracy of the NN models outperforms that of approaches based on software reliability growth models. Of the two NN models, the recurrent NN yields better results than the feed-forward NN.

The following papers are included here:

S. Chulani, B. Boehm and B. Steece, "Bayesian analysis of empirical software engineering cost models," IEEE Trans. on Software Engineering, Vol. 25, No. 4, July 1999, pp. 573-583.

K. Srinivasan and D. Fisher, "Machine learning approaches to estimating software development effort," IEEE Trans. on Software Engineering, Vol. 21, No. 2, Feb. 1995, pp. 126-137.

M. Shepperd and C. Schofield, "Estimating software project effort using analogies," IEEE Trans. on Software Engineering, Vol. 23, No. 12, Nov. 1997, pp. 736-743.
N. Fenton and M. Neil, "A critique of software defect prediction models," IEEE Trans. on Software Engineering, Vol. 25, No. 5, Sept. 1999, pp. 675-689.

T. Khoshgoftaar, E. B. Allen and J. Deng, "Using regression trees to classify fault-prone software modules," IEEE Transactions on Reliability, Vol. 51, No. 4, 2002, pp. 455-462.

C. J. Burgess and M. Lefley, "Can genetic programming improve software effort estimation? A comparative evaluation," Information and Software Technology, Vol. 43, No. 14, 2001, pp. 863-873.

T. Dohi, Y. Nishio and S. Osaki, "Optimal software release scheduling based on artificial neural networks," Annals of Software Engineering, Vol. 8, No. 1, 1999, pp. 167-185.
Bayesian Analysis of Empirical Software Engineering Cost Models

Sunita Chulani, Member, IEEE, Barry Boehm, Fellow, IEEE, and Bert Steece

Abstract—To date many software engineering cost models have been developed to predict the cost, schedule, and quality of the software under development. But, the rapidly changing nature of software development has made it extremely difficult to develop empirical models that continue to yield high prediction accuracies. Software development costs continue to increase and practitioners continually express their concerns over their inability to accurately predict the costs involved. Thus, one of the most important objectives of the software engineering community has been to develop useful models that constructively explain the software development life-cycle and accurately predict the cost of developing a software product. To that end, many parametric software estimation models have evolved in the last two decades [25], [17], [26], [15], [28], [1], [2], [33], [7], [10], [22], [23]. Almost all of the above mentioned parametric models have been empirically calibrated to actual data from completed software projects. The most commonly used technique for empirical calibration has been the popular classical multiple regression approach. As discussed in this paper, the multiple regression approach imposes a few assumptions frequently violated by software engineering datasets. The source data is also generally imprecise in reporting size, effort, and cost-driver ratings, particularly across different organizations. This results in the development of inaccurate empirical models that don't perform very well when used for prediction. This paper illustrates the problems faced by the multiple regression approach during the calibration of one of the popular software engineering cost models, COCOMO II. It describes the use of a pragmatic 10 percent weighted average approach that was used for the first publicly available calibrated version [6]. It then moves on to show how a more sophisticated Bayesian approach can be used to alleviate some of the problems faced by multiple regression. It compares and contrasts the two empirical approaches, and concludes that the Bayesian approach was better and more robust than the multiple regression approach.

Bayesian analysis is a well-defined and rigorous process of inductive reasoning that has been used in many scientific disciplines (the reader can refer to [11], [35], [3] for a broader understanding of the Bayesian Analysis approach). A distinctive feature of the Bayesian approach is that it permits the investigator to use both sample (data) and prior (expert-judgment) information in a logically consistent manner in making inferences. This is done by using Bayes' theorem to produce a 'postdata' or posterior distribution for the model parameters. Using Bayes' theorem, prior (or initial) values are transformed to postdata views. This transformation can be viewed as a learning process. The posterior distribution is determined by the variances of the prior and sample information. If the variance of the prior information is smaller than the variance of the sampling information, then a higher weight is assigned to the prior information. On the other hand, if the variance of the sample information is smaller than the variance of the prior information, then a higher weight is assigned to the sample information, causing the posterior estimate to be closer to the sample information.

The Bayesian approach discussed in this paper enables stronger solutions to one of the biggest problems faced by the software engineering community: the challenge of making good decisions using data that is usually scarce and incomplete. We note that the predictive performance of the Bayesian approach (i.e., within 30 percent of the actuals 75 percent of the time) is significantly better than that of the previous multiple regression approach (i.e., within 30 percent of the actuals only 52 percent of the time) on our latest sample of 161 project datapoints.

Index Terms—Bayesian analysis, multiple regression, software estimation, software engineering cost models, model calibration, prediction accuracy, empirical modeling, COCOMO, measurement, metrics, project management.
1 CLASSICAL MULTIPLE REGRESSION APPROACH

MOST of the existing empirical software engineering cost models are calibrated using the classical multiple regression approach. In Section 1, we focus on the overall description of the multiple regression approach and how it can be used on software engineering data. We also highlight the assumptions imposed by the multiple regression approach and the resulting problems faced by the software engineering community in trying to calibrate empirical models using this approach. The example dataset used to facilitate the illustration is the 1997 COCOMO II dataset, which is composed of data from 83 completed projects collected from commercial, aerospace, government, and nonprofit organizations [30]. It should be noted that, with more than a dozen commercial implementations, COCOMO has been one of the most popular cost estimation models of the '80s and '90s. COCOMO II [2] is a recent update of the popular COCOMO model published in [1].

(S. Chulani is with IBM Research, Center for Software Engineering, 650 Harry Rd., San Jose, CA 95120; this work was performed while doing research at the Center for Software Engineering, University of Southern California, Los Angeles. B. Boehm is with the Center for Software Engineering, University of Southern California, Los Angeles, CA 90089. B. Steece is with the Marshall School of Business, University of Southern California, Los Angeles, CA 90089.)

Multiple Regression expresses the response (e.g., Person Months (PM)) as a linear function of k predictors (e.g., Source Lines of Code, Product Complexity, etc.).
This linear function is estimated from the data using the ordinary least squares approach discussed in numerous books such as [18], [34]. A multiple regression model can be written as

    y_t = β0 + β1·x_t1 + ... + βk·x_tk + ε_t    (1)

where x_t1, ..., x_tk are the values of the predictor (or regressor) variables for the t-th observation, β0, ..., βk are the coefficients to be estimated, ε_t is the usual error term, and y_t is the response variable for the t-th observation. Our example model, COCOMO II, has the following mathematical form:

    Effort = A × [Size]^(1.01 + SF1 + SF2 + ... + SF5) × (EM1 × EM2 × ... × EM17)    (2)

where

    A = multiplicative constant;
    Size = size of the software project measured in terms of KSLOC (thousands of Source Lines of Code) [26] or function points [13] and programming language;
    SF = scale factor;
    EM = effort multiplier (refer to [2] for further explanation of COCOMO II terms).

We can linearize the COCOMO II equation by taking logarithms on both sides of the equation as shown:

    ln(PM) = β0 + β1·1.01·ln(Size) + β2·SF1·ln(Size) + ... + β6·SF5·ln(Size)
             + β7·ln(EM1) + β8·ln(EM2) + ... + β23·ln(EM17)    (3)

Using (3) and the 1997 COCOMO II dataset consisting of 83 completed projects, we employed the multiple regression approach [6]. Because some of the predictor variables had high correlations, we formed new aggregate predictor variables. These included analyst capability and programmer capability, which were aggregated into personnel capability, PERS; and time constraints and storage constraints, which were aggregated into resource constraints, RCON. We used a threshold value of 0.65 for high correlation among predictor variables. Table 1 shows the highly correlated parameters that were aggregated for the 1997 calibration of COCOMO II.

Table 1. COCOMO II.1997 highly correlated parameters (analyst and programmer capability aggregated into PERS; time and storage constraints aggregated into RCON).

The regression estimated the β coefficients associated with the scale factors and effort multipliers, as shown below in the RCode (statistical software developed at the University of Minnesota [8]) run:

    Data set = COCOMOII.1997
    Response = log[PM] - 1.01*log[SIZE]

    Coefficient Estimates
    Label               Estimate       Std. Error    t-value
    Constant_A           0.701883      0.231930       3.026
    PMAT*log[SIZE]       0.000884288   0.0130658      0.068
    PREC*log[SIZE]      -0.00901971    0.0145235     -0.621
    TEAM*log[SIZE]       0.00866128    0.0170206      0.509
    FLEX*log[SIZE]       0.0314220     0.0151538      2.074
    RESL*log[SIZE]      -0.00558590    0.019035      -0.293
    log[PERS]            0.987472      0.230583       4.282
    log[RELY]            0.798808      0.528549       1.511
    log[CPLX]            1.13191       0.434550       2.605
    log[RCON]            1.36588       0.273141       5.001
    log[PEXP]            0.696906      0.527474       1.321
    log[LTEX]           -0.0421480     0.672890      -0.063
    log[DATA]            2.52796       0.723645       3.493
    log[RUSE]           -0.444102      0.486480      -0.913
    log[DOCU]           -1.32818       0.664557      -1.999
    log[PVOL]            0.858302      0.532544       1.612
    log[AEXP]            0.560542      0.609259       0.920
    log[PCON]            0.488392      0.322021       1.517
    log[TOOL]            2.49512       1.11222        2.243
    log[SITE]            1.39701       0.831993       1.679
    log[SCED]            2.84074       0.774020       3.670

As the results indicate, some of the regression estimates had counterintuitive values, i.e., negative coefficients (the PREC, RESL, LTEX, RUSE, and DOCU rows above). As an example, consider the 'Develop for Reuse' (RUSE) effort multiplier. This multiplicative parameter captures the additional effort required to develop components intended for reuse on current or future projects. As shown in Table 2, if the RUSE rating is Extra High (XH), i.e., developing for reuse across multiple product lines, it will cause an increase in effort by a factor of 1.56. On the other hand, if the RUSE rating is Low (L), i.e., developing with no consideration of
future reuse, it will cause effort to decrease by a factor of 0.89. This rationale is consistent with the results of 12 published studies of the relative cost of developing for reuse compiled in [27], and was based on the expert judgment of the researchers of the COCOMO II team. But the regression results produced a negative β coefficient for RUSE. This negative coefficient results in the counterintuitive rating scale shown in Table 3, i.e., an XH rating for RUSE causes a decrease in effort and an L rating causes an increase in effort. Note the opposite trends followed in Table 2 and Table 3.

Table 2. RUSE—expert-determined a priori rating scale, consistent with 12 published studies.

Rating                  Low (L)   Nominal (N)      High (H)        Very High (VH)        Extra High (XH)
Definition              None      Across project   Across program  Across product line   Across multiple product lines
1997 a priori values    0.89      1.00             1.16            1.34                  1.56

Table 3. RUSE—data-determined rating scale, contradicting 12 published studies.

Rating                         Low (L)   Nominal (N)      High (H)        Very High (VH)        Extra High (XH)
Definition                     None      Across project   Across program  Across product line   Across multiple product lines
1997 data-determined values    1.05      1.00             0.94            0.88                  0.82

A possible explanation (discussed in a study by [24] on "Why regression coefficients have the wrong sign") for this contradiction may be the lack of dispersion in the responses associated with RUSE. A possible reason for this lack of dispersion is that RUSE is a relatively new cost factor, and our follow-up indicated that the respondents did not have enough information to report its rating accurately during the data collection process. Additionally, many of the responses "I don't know" and "It does not apply" had to be coded as 1.0 (since this is the only way to code no impact on effort). Note (see Fig. 1) that, with slightly more than 50 of the 83 datapoints for RUSE set at Nominal and with no observations at XH, the data for RUSE does not exhibit enough dispersion along the entire range of possible values. While this is the familiar errors-in-variables problem, our data does not allow us to resolve this difficulty. Thus, the authors were forced to assume that the random variation in the responses for RUSE is small compared to the range of RUSE. The reader should note that other cost models that use the multiple regression approach rarely explicitly state this assumption, even though it is implicitly assumed. Other reasons for the counterintuitive results include the violation of some of the restrictions imposed by multiple regression [4], [5]:
1. The number of datapoints should be large relative to the number of model parameters (i.e., there are many degrees of freedom). Unfortunately, collecting data has been, and continues to be, one of the biggest challenges in the software estimation field, caused primarily by immature processes and management reluctance to release cost-related data.

2. There should be no extreme cases (i.e., outliers). Extreme cases can distort parameter estimates, and such cases frequently occur in software engineering data due to the lack of precision in the data collection process.

3. The predictor variables (cost drivers and scale factors) should not be highly correlated. Unfortunately, because cost data is historically rather than experimentally collected, correlations among the predictor variables are unavoidable.

The above restrictions are violated to some extent by the COCOMO II dataset. The COCOMO II calibration approach determines the coefficients for the five scale factors and the 17 effort multipliers (merged into 15 due to high correlation, as discussed above). Considering the rule of thumb that every parameter being calibrated should have at least five datapoints requires that the COCOMO II dataset have data
on at least 110 (or 100, if we consider that parameters were merged) completed projects. We note that the COCOMO II.1997 dataset has just 83 datapoints.

Fig. 1. Distribution of RUSE.

The second point above indicates that, due to the imprecision in the data collection process, outliers can occur, causing problems in the calibration. For example, if a particular organization had extraordinary documentation requirements imposed by its management, then even a very small project would require a lot of effort expended in trying to meet the excessive documentation needs beyond the match to the life-cycle needs. If the data collected simply used the highest DOCU rating provided in the model, then the huge amount of effort due to the stringent documentation needs would be underrepresented, and the project would have the potential of being an outlier. Outliers in software engineering data, as indicated above, are mostly due to imprecision in the data collection process.

The third restriction requires that no parameters be highly correlated. As described above, in the COCOMO II.1997 calibration a few parameters were aggregated to alleviate this problem.

To resolve some of the counterintuitive results produced by the regression analysis (e.g., the negative coefficient for RUSE explained above), we used a weighted average of the expert-judgment results and the regression results, with only 10 percent of the weight going to the regression results, for all the parameters. We selected the 10 percent weighting factor because models with 40 percent and 25 percent weighting factors produced less accurate predictions. This pragmatic calibrating procedure moved the model parameters in the direction suggested by the sample data but retained the rationale contained within the a priori values. An example of the 10 percent application using the RUSE effort multiplier is given in Fig. 2. As shown in the graph, the trends followed by the a priori and the data-determined curves are opposite. The data-determined curve has a negative slope and, as shown in Table 3, violates expert opinion.

Fig. 2. Example of the 10 percent weighted average approach: RUSE rating scale.

The resulting calibration of the COCOMO II model using the 1997 dataset of 83 projects produced estimates within 30 percent of the actuals 52 percent of the time for effort. The prediction accuracy improved to 64 percent when the data was stratified into sets based on the 18 unique sources of the data (see [19], [20], [14] for further confirmation that local calibration improves accuracy). The constant, A, of the COCOMO II equation was recalibrated for each of these sets, i.e., a different intercept was computed for each set. The constant value ranged from 1.23 to 3.72 for the 18 sets and yielded the prediction accuracies shown in Table 4.

Table 4. Prediction accuracy of COCOMO II.1997.

             Before stratification      After stratification
             by organization            by organization
PRED(.20)    46%                        49%
PRED(.25)    49%                        55%
PRED(.30)    52%                        64%

While the 10 percent weighted average procedure produced a workable initial model, we want to develop a more formal methodology for combining expert judgment and sample information. A Bayesian analysis with an informative prior provides such a framework.

2 THE BAYESIAN APPROACH

2.1 Basic Framework—Terminology and Theory

The Bayesian approach provides a formal process by which a priori expert judgment can be combined with sampling information (data) to produce a robust a posteriori model. Using Bayes' theorem, we can combine our two information sources as follows:

    f(β|Y) = f(Y|β)·f(β) / f(Y)    (4)
where β is the vector of parameters in which we are interested and Y is the vector of sample observations. In (4), f(β|Y) is the posterior density function for β summarizing all the information about β, f(Y|β) is the sample information and is algebraically equivalent to the likelihood function for β, and f(β) is the prior information summarizing the expert-judgment information about β. Equation (4) can be rewritten as:

    f(β|Y) ∝ f(Y|β)·f(β)    (5)

In words, (5) means:

    Posterior ∝ Sample × Prior

In the Bayesian analysis context, the "prior" probabilities are the simple "unconditional" probabilities associated with the sample information, while the "posterior" probabilities are the "conditional" probabilities given sample and prior information. The Bayesian approach makes use of prior information that is not part of the sample data, providing an optimal combination of the two sources of information. As described in many books on Bayesian analysis [21], [3], the posterior mean, b**, and variance, Var(b**), are defined as:

    b** = [(1/s²)·X′X + H*]⁻¹ × [(1/s²)·X′X·b̂ + H*·b*]  and
    Var(b**) = [(1/s²)·X′X + H*]⁻¹    (6)

where X is the matrix of predictor variables, s² is the variance of the residual for the sample data, b̂ is the sample regression estimate, and H* and b* are the precision (inverse of variance) and mean of the prior information, respectively. From (6), it is clear that, in order to determine the Bayesian posterior mean and variance, we need to determine the mean and precision of the prior information and the sampling information. The next two subsections describe the approach taken to determine the prior and sampling information, followed by a section on the Bayesian a posteriori model.

2.2 Prior Information

To determine the prior information for the coefficients (i.e., b* and H*) for our example model, COCOMO II, we conducted a Delphi exercise [12], [1], [29]. Eight experts from the field of software estimation were asked to independently provide their estimates of the numeric values associated with each COCOMO II cost driver. Roughly half of these participating experts had been lead cost experts for large software development organizations, and a few of them were originators of other proprietary cost models. All of the participants had at least 10 years of industrial software cost estimation experience. Based on the credibility of the participants, the authors felt very comfortable using the results of the Delphi rounds as the prior information for the purposes of calibrating COCOMO II.1998. The reader is urged to refer to [32], where a study showed that estimates made by experts were more accurate than model-determined estimates. However, in [16] evidence showing the inefficiencies of expert judgment in other domains is highlighted.

Once the first round of the Delphi was completed, we summarized the results in terms of the means and the ranges of the responses. These summarized results were quite raw, with significant variances caused by misunderstanding of the parameter definitions. In an attempt to improve the accuracy of these results and to attain better consensus among the experts, the authors distributed the results back to the participants. A better explanation of the behavior of the scale factors was provided, since the scale factor responses had the highest variance. Each of the participants got a second opportunity to independently refine his/her response based on the responses of the rest of the participants in round 1. The authors felt that, for the 17 effort multipliers, the summarized results of round 2 were representative of the real-world phenomena and decided to use these as the a priori information. But for the five scale factors, the authors conducted a third round and made sure that the participants had a very good understanding of the exponential behavior of these parameters. The results of the third round were used as a priori information for the five scale factors. Please note that if the prior variance for any parameter is zero (in our case, if all experts responded with the same value), then the Bayesian approach will rely completely on expert opinion. However, this construct is inoperative since, not surprisingly, disagreement and hence variability amongst experts exists in the software field.
TABLE 5
COCOMO II.1998 "A Priori" Rating Scale for Develop for Reuse (RUSE)

Rating            Definition                       1998 A Priori Value
Low (L)           None                             0.89
Nominal (N)       Across project                   1.00
High (H)          Across program                   1.15
Very High (VH)    Across product line              1.33
Extra High (XH)   Across multiple product lines    1.54

Productivity Range (Least Productive Rating / Most Productive Rating): Mean = 1.73, Variance = 0.05
discussed in Section 2, this multiplicative parameter captures the additional effort required to develop components intended for reuse on current or future projects. As shown in Table 5, if the RUSE rating is Extra High (XH), i.e., developing for reuse across multiple product lines, it will cause an increase in effort by a factor of 1.54. On the other hand, if the RUSE rating is Low (L), i.e., developing with no consideration of future reuse, it will cause effort to decrease by a factor of 0.89. The resulting range of productivity for RUSE is 1.73 (= 1.54/0.89), and the variance computed from the second Delphi round is 0.05. Comparing the results of Table 5 with the expert-determined a priori rating scale for the 1997 calibration illustrated in Table 2 validates the strong consensus of the experts in a Productivity Range for RUSE of approximately 1.7.

2.3 Sample Information

The sampling information is the result of a data collection activity initiated in September 1994, soon after the initial publication of the COCOMO II description [2]. Affiliates of the Center for Software Engineering at the University of Southern California provided most of the data [30]. These organizations represent the commercial, aerospace, and federally funded research and development center (FFRDC) sectors of software development.

Data on completed software projects is recorded on a data collection form that asks between 33 and 59 questions, depending on the degree of source code reuse [30]. A question asked very frequently concerns the definition of software size, i.e., what defines a line of source code or a Function Point (FP)? Appendix B in the Model Definition Manual [30] defines a logical line of code using the framework described in [26], and [13] gives details on the counting rules for FPs. In spite of the definitions, the data collected to date exhibits local variations caused by differing interpretations of the counting rules. Another parameter that has different definitions within different organizations is effort, i.e., what is a person-month (PM)? In COCOMO II, we define a PM as 152 person-hours, but this varies from organization to organization. This information is usually derived from time cards maintained by employees. But uncompensated overtime hours are illegal to report on time cards and hence do not get accounted for in the PM count. This leads to variations in the data reported, and the authors took as much caution as possible while collecting the data. Variations also occur in the understanding of the subjective rating scales of the scale factors and effort multipliers; [9] developed a system to alleviate this problem and help users apply cost driver definitions consistently for the PRICE S model. For example, a Very High rating for analyst capability in one organization could be equivalent to a Nominal rating in another organization. All these variations suggest that any organization using a parametric cost model should locally calibrate the model to produce better estimates. Please refer to the local calibration results discussed in Table 4.

The sampling information includes data on the response variable, effort in person-months (PM), where 1 PM = 152 hours, and predictor variables such as the actual size of the software in KSLOC (thousands of Source Lines of Code, adjusted for breakage and reuse). The database has grown from 83
datapoints in 1997 to 161 datapoints in 1998. The distributions of effort and size for the 1998 database of 161 datapoints are shown in Fig. 3. As can be noted, both histograms are positively skewed, with the bulk of the projects in the database having effort less than 500 PM and size less than 150 KSLOC. Since the multiple regression approach based on least squares estimation assumes that the response variable is normally distributed, the positively skewed histogram for effort indicates the need for a transformation. We also want the relationships between the response variable and the predictor variables to be linear. The histograms for size in Fig. 3 and Fig. 4 and the scatter plot in Fig. 5 show that a log transformation is appropriate for size. Furthermore, the log transformations on effort and size are consistent with (2) and (3) above. The regression analysis done in RCode (statistical software developed at the University of Minnesota [8]) on the log-transformed COCOMO II parameters using the dataset of 161 datapoints yielded the following results:

Dataset = COCOMO II.1998
Response = log[PM]

Coefficient Estimates

Label             Estimate    Std. Error   t-value
Constant_A        0.961552    0.103346     9.304
log[SIZE]         0.921827    0.0460578    20.015
PMAT*log[SIZE]    0.684836    0.481078     1.424
PREC*log[SIZE]    1.10203
TEAM*log[SIZE]    0.323318
FLEX*log[SIZE]    0.354658
RESL*log[SIZE]    1.32890
log[PCAP]         1.20332
log[RELY]         0.641228
log[CPLX]         1.03515
log[TIME]         1.58101
log[STOR]         0.784218
log[ACAP]         0.926205
log[PEXP]         0.755345
log[LTEX]         0.171569
log[DATA]         0.783232
log[RUSE]         -0.339964
log[DOCU]         2.05772
log[PVOL]         0.867162
log[AEXP]         0.137859
log[PCON]         0.488392
log[TOOL]         0.551063
log[SITE]         0.674702
log[SCED]         1.11858
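To make the mechanics of this step concrete, the following is a minimal sketch, using synthetic data and hypothetical variable names, of how ordinary least squares estimates and their t-values are obtained from log-transformed effort data. It is an illustration only, not the authors' RCode analysis, and the two predictors stand in for the full set of COCOMO II parameters.

import numpy as np

rng = np.random.default_rng(0)
n = 161                                      # projects, as in the 1998 dataset
log_size = np.log(rng.uniform(2, 150, n))    # log[SIZE], size in KSLOC
log_rely = np.log(rng.uniform(0.8, 1.3, n))  # log of one cost-driver multiplier
log_pm = 1.0 + 0.92 * log_size + 1.0 * log_rely + rng.normal(0, 0.3, n)

X = np.column_stack([np.ones(n), log_size, log_rely])   # design matrix
beta, *_ = np.linalg.lstsq(X, log_pm, rcond=None)       # OLS estimates

resid = log_pm - X @ beta
s2 = resid @ resid / (n - X.shape[1])                   # residual variance
cov = s2 * np.linalg.inv(X.T @ X)                       # covariance of estimates
t_values = beta / np.sqrt(np.diag(cov))                 # signal-to-noise ratios

for label, b, t in zip(["Constant_A", "log[SIZE]", "log[RELY]"], beta, t_values):
    print(f"{label:12s} estimate={b: .4f}  t-value={t: .2f}")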
The above results provide the estimates for the β coefficients associated with each of the predictor variables (see (3)). The t-value (the ratio between the estimate and the corresponding standard error, where the standard error is the square root of the variance) may be interpreted as the signal-to-noise ratio associated with the corresponding predictor variable. Hence, the higher the t-value, the stronger the signal (i.e., statistical significance) being sent by the predictor variable. These coefficients can be used to
Fig. 3. Distribution of effort and size: 1998 dataset of 161 observations.
Fig. 4. Distribution of log transformed effort and size: 1998 dataset of 161 observations.
adjust the a priori Productivity Ranges (PRs) to determine the data-determined PRs for each of the 22 parameters. For example, the data-determined PR for RUSE = (1.73)^−0.34, where 1.73 is the a priori PR as shown in Table 5. While the regression provides intuitively reasonable estimates for most of the predictor variables, the negative coefficient estimate for RUSE (as discussed earlier) and the magnitudes of the coefficients on Applications Experience (AEXP), Language and Tool Experience (LTEX), Development Flexibility (FLEX), and Team Cohesion (TEAM) violate our prior opinion about the impact of these parameters on effort (i.e., PM). The quality of the data probably explains some of the conflicts between the prior information and the sample data. However, when compared to the results reported in Section 2, these regression results (using 161 datapoints) produced better estimates: only RUSE has a
Fig. 5. Correlation between log[effort] and log[size].
negative coefficient associated with it, compared to PREC, RESL, LTEX, DOCU, and RUSE in the regression results using only 83 datapoints. Thus, adding more datapoints (which results in an increase in the degrees of freedom) reduced the problems of counterintuitive results.

2.4 Combining Prior and Sampling Information: Posterior Bayesian Update
As a means of resolving the above conflicts, we will now use the Bayesian paradigm as a means of formally combining prior expert judgment with our sample data. Equation (6) reveals that if the precision of the a priori information (H*) is bigger (or the variance of the a priori information is smaller) than the precision (1/s²)X′X (or the variance) of the sampling information, the posterior values will be closer to the a priori values. This situation can arise when the gathered data is noisy, as depicted in Fig. 6 for an example cost factor, Develop for Reuse. Fig. 6 illustrates that the degree of belief in the prior information is higher than the degree of belief in the sample data. As a consequence, a stronger weight is assigned to the prior information, causing the posterior mean to be closer to the prior mean. On the other hand (not illustrated), if the precision of the sampling information (1/s²)X′X is larger than the precision of the prior information (H*), then a higher weight is assigned to the sampling information, causing the posterior mean to be closer to the mean of the sampling data. The resulting posterior precision will always be higher than the a priori precision and the sample data precision.
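As an illustration of the update in (6), the sketch below implements the posterior mean and variance for a normal linear model. The data and prior values are hypothetical, chosen so that a tight expert prior overrides a counterintuitive sample slope, in the spirit of the RUSE example.

import numpy as np

def bayesian_update(X, y, s2, b_star, H_star):
    # Posterior mean/variance per (6): a precision-weighted blend of the
    # prior (b_star, H_star) and the sample information.
    XtX = X.T @ X
    b_sample = np.linalg.solve(XtX, X.T @ y)       # OLS mean of the sample
    post_precision = XtX / s2 + H_star             # posterior precision
    post_var = np.linalg.inv(post_precision)
    post_mean = post_var @ (XtX @ b_sample / s2 + H_star @ b_star)
    return post_mean, post_var

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([1.0, -0.3]) + rng.normal(0, 0.5, 100)   # noisy sample slope -0.3
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
s2 = resid @ resid / (100 - 2)

b_star = np.array([1.0, 0.8])    # hypothetical expert prior: slope +0.8
H_star = np.diag([1.0, 100.0])   # precision = 1/variance; tight on the slope
post_mean, _ = bayesian_update(X, y, s2, b_star, H_star)
# With the tight prior, post_mean[1] lands near +0.8 rather than near -0.3,
# mirroring how a counterintuitive coefficient is pulled toward expert opinion.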
Fig. 6. A posteriori Bayesian update in the presence of noisy data (develop for reuse, RUSE).
Fig. 7. Bayesian a posteriori productivity ranges.
Note that if the prior variance of any parameter is zero, then the parameter will be completely determined by the prior information. Although this is a restriction imposed by the Bayesian approach, it is of little concern, as the situation of complete consensus very rarely arises in the software engineering domain.

The complete Bayesian analysis on COCOMO II yields the Productivity Ranges (the ratio between the least productive parameter rating, i.e., the highest rating, and the most productive parameter rating, i.e., the lowest rating) illustrated in Fig. 7. Fig. 7 gives an overall perspective of the relative software Productivity Ranges (PRs) provided by the COCOMO II.1998 parameters. The PRs provide insight for identifying the high-payoff areas to focus on in a software productivity improvement activity. For example, Product Complexity (CPLX) is the highest payoff parameter and Development Flexibility (FLEX) is the lowest payoff parameter. The variance associated with each parameter is indicated along each bar. This indicates that even though
the two parameters Multisite Development (SITE) and Documentation Match to Life Cycle Needs (DOCU) have the same PR, the PR of SITE (variance of 0.007) is predicted with more than five times the certainty of the PR of DOCU (variance of 0.037).

The resulting COCOMO II.1998 model, calibrated to 161 datapoints, produces estimates within 30 percent of the actuals 75 percent of the time for effort. If the model's multiplicative coefficient is calibrated to each of the 18 major sources of project data, the resulting model (with the coefficient ranging from 1.5 to 4.1) produces estimates within 30 percent of the actuals 80 percent of the time. It is therefore recommended that organizations using the model calibrate it with their own data to increase model accuracy and produce a local optimum estimate for similar types of projects.
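The PRED accuracy measure quoted here and in the tables can be computed as in the following sketch; the effort values below are hypothetical.

def pred(actuals, estimates, level=0.30):
    # Fraction of projects whose estimate falls within 100*level percent
    # of the actual effort, e.g., PRED(.30).
    within = sum(abs(est - act) / act <= level
                 for act, est in zip(actuals, estimates))
    return within / len(actuals)

actuals = [100.0, 250.0, 40.0, 600.0]      # person-months (hypothetical)
estimates = [120.0, 230.0, 70.0, 580.0]
print(pred(actuals, estimates, 0.30))      # 0.75: within 30 percent for 3 of 4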
TABLE 6
Prediction Accuracies of COCOMO II.1997, A Priori COCOMO II.1998, and Bayesian A Posteriori COCOMO II.1998, Before and After Stratification

                COCOMO II.1997    COCOMO II.1997    A Priori COCOMO II.1998           Bayesian A Posteriori COCOMO II.1998
Prediction      (83 datapoints)   (161 datapoints)  (Delphi results, 161 datapoints)  (161 datapoints)
Accuracy        Before   After    Before   After    Before   After                    Before   After
PRED(.20)       46%      49%      54%      57%      48%      54%                      63%      70%
PRED(.25)       49%      55%      59%      65%      55%      63%                      68%      76%
PRED(.30)       52%      64%      63%      67%      61%      65%                      75%      80%
From Table 6, it is clear that the prediction accuracy of the COCOMO II.1998 model calibrated using the Bayesian approach is better than the prediction accuracy of the COCOMO II.1997 model (used on the 1997 dataset of 83 datapoints as well as the 1998 dataset of 161 datapoints) and of the a priori COCOMO II.1998 model, which is based on the expert opinion gathered via the Delphi exercise. The full set of model parameters for the Bayesian a posteriori COCOMO II.1998 model is given in Appendix A.

2.5 Cross-Validation of the Bayesian Calibrated Model

The COCOMO II.1998 Bayesian calibration discussed above uses the complete dataset of 161 datapoints, and the prediction accuracies of COCOMO II.1998 (depicted in Table 6) are based on that same dataset. That is, the calibration and validation datasets are the same. A natural question that arises in this context is how well the model will predict new software development projects. To address this issue, we randomly selected 121 observations for our calibration dataset, with the remaining 40 assigned to the validation dataset (i.e., "new" data). We repeated this process 15 times, creating 15 calibration and 15 validation datasets. We used the resulting a posteriori models to predict the development effort of the 40 "new" projects in each validation dataset. This validation approach, known as out-of-sample validation, provides a truer measure of the model's predictive abilities. The out-of-sample test yielded an average PRED(.30) of 69 percent, indicating that, on average, the out-of-sample validation produced estimates within 30 percent of the actuals 69 percent of the time. Hence, we conclude that our Bayesian model has reasonably good predictive qualities.
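The out-of-sample protocol just described can be sketched as follows; here `calibrate` and `estimate` are hypothetical stand-ins for the Bayesian calibration and the fitted effort model, and each project record is assumed to carry its actual effort under the key "pm".

import numpy as np

def out_of_sample_pred(projects, calibrate, estimate, trials=15, n_train=121):
    # Repeat random 121/40 splits, calibrating on each training split and
    # scoring PRED(.30) on the held-out "new" projects.
    rng = np.random.default_rng(42)
    scores = []
    for _ in range(trials):
        idx = rng.permutation(len(projects))
        train = [projects[i] for i in idx[:n_train]]
        test = [projects[i] for i in idx[n_train:]]
        model = calibrate(train)                 # a posteriori model
        within = [abs(estimate(model, p) - p["pm"]) / p["pm"] <= 0.30
                  for p in test]
        scores.append(sum(within) / len(within))
    return sum(scores) / len(scores)             # average PRED(.30)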
2.6 Reduced Model

When calibrating COCOMO II, the three main problems we faced in our data were: 1) lack of degrees of freedom, 2) some highly correlated predictor variables, and 3) measurement error for a few predictor variables. These limitations led to some of the regression results being counterintuitive. The posterior Bayesian update discussed above alleviated these problems by incorporating expert-judgment-derived prior information into the calibration process. But such prior information may not always be available. So what must one do in the absence of good prior information? One way to address this problem is to reduce overfitting by developing a more parsimonious model. This alleviates the first two problems listed above. Unfortunately, our data doesn't lend itself to alleviating the third problem of measurement error, as discussed in Section 2.

Consider a reduced model developed by using a backward elimination technique:

Dataset = COCOMO II.1998
Response = log[PM]

Coefficient Estimates

Label        Estimate   Std. Error   t-value
log[SITE]    1.44428    0.437796     3.299
log[SCED]    1.06009    0.286442     3.701

The above results have no counterintuitive estimates for the coefficients associated with the predictor variables. The high t-ratio associated with each of these variables indicates a significant impact by each of the predictor variables. The highest correlation between any two predictor variables is 0.5, between RELY and CPLX. Overall, the above results are statistically acceptable. This COCOMO II reduced model gives the accuracy results shown in Table 7.

TABLE 7
Prediction Accuracies of Reduced COCOMO II.1998

Prediction Accuracy    Reduced COCOMO II.1998
PRED(.20)              54%
PRED(.25)              64%
PRED(.30)              73%

These accuracy results are a little worse than the results obtained by the Bayesian a posteriori COCOMO II.1998 model, but the model is more parsimonious. In practice, removing a predictor variable is equivalent to stipulating
that variations in this variable have no effect on project effort. When our experts and our behavioral analyses tell us otherwise, we need extremely strong evidence to drop a variable. The authors believe that dropping variables for an individual organization via local calibration of the Bayesian a posteriori COCOMO II.1998 model is a sounder option.
3 CONCLUSIONS

As shown in Table 6 and Table 7 of this paper, the estimation accuracy of the Bayesian a posteriori COCOMO II.1998 model on the 161-project sample is better than the accuracies of the best version of COCOMO II.1997, the 1998 a priori model, and a version of COCOMO II.1998 with a reduced set of variables obtained by backward elimination. The improvement over the 1997 model provides evidence that the 1998 Bayesian variable-by-variable accommodation of expert prior information is stronger than the 1997 approach of one-factor-fits-all averaging of expert data and regression data.

Overall, the class of Bayesian estimation models presented here provides a formal process for merging expert prior information with software engineering data. In many traditional models, such prior information is informally used to evaluate the "appropriateness" of the results. Having a formal mechanism for incorporating expert prior information, however, gives users of the cost model the flexibility to obtain predictions and calibrations based on a different set of prior information. Such Bayesian estimation models enable the software engineering community to more adequately address the challenge of making good decisions when the data is scarce and incomplete. These models improve predictive performance and resolve problems associated with counterintuitive estimates when other traditional approaches are employed. We are currently using the approach to develop similar models to estimate the delivered defect density of software products and the cost of integrating commercial-off-the-shelf (COTS) software.

APPENDIX A

This appendix gives the acronyms and full forms of the 22 COCOMO II Post-Architecture cost drivers and their associated COCOMO II.1998 rating scales (see Table 8). For a further explanation of these parameters, please refer to [30].

TABLE 8
Acronyms, Full Forms, and Rating Scales of the COCOMO II Parameters

ACKNOWLEDGMENTS

This work was supported, both financially and technically, under AFRL Contract No. F30602-96-C-0274, "KBSA Life Cycle Evaluation," and by the COCOMO II Program Affiliates: Aerospace, Air Force Cost Analysis Agency, Allied Signal, AT&T, Bellcore, EDS, Raytheon E-Systems, GDE Systems, Hughes, IDA, JPL, Litton, Lockheed Martin, Loral, MCC, MDAC, Motorola, Northrop Grumman, Rational, Rockwell, SAIC, SEI, SPC, Sun, TI, TRW, USAF Rome Lab, U.S. Army Research Labs, and Xerox.
REFERENCES

[1] B.W. Boehm, Software Engineering Economics. Englewood Cliffs, N.J.: Prentice-Hall, 1981.
[2] B. Boehm, B. Clark, E. Horowitz, C. Westland, R. Madachy, and R. Selby, "Cost Models for Future Software Life Cycle Processes: COCOMO 2.0," Annals of Software Eng. Special Volume on Software Process and Product Measurement, J.D. Arthur and S.M. Henry, eds., vol. 1, Amsterdam, The Netherlands: J.C. Baltzer AG, Science Publishers, 1995.
[3] G. Box and G. Tiao, Bayesian Inference in Statistical Analysis. Addison-Wesley, 1973.
[4] L.C. Briand, V.R. Basili, and W.M. Thomas, "A Pattern Recognition Approach for Software Engineering Data Analysis," IEEE Trans. Software Eng., vol. 18, no. 11, Nov. 1992.
[5] S. Chulani, "Incorporating Bayesian Analysis to Improve the Accuracy of COCOMO II and Its Quality Model Extension," Qualifying Exam Report, Computer Science Dept., USC Center for Software Eng., Feb. 1998.
[6] S. Chulani, B. Clark, and B. Boehm, "Calibration Results of COCOMO II.1997," Proc. Int'l Conf. Software Eng., Apr. 1998.
[7] S.D. Conte, Software Engineering Metrics and Models. Menlo Park, Calif.: Benjamin/Cummings, 1986.
[8] D. Cook and S. Weisberg, An Introduction to Regression Graphics. Wiley Series, 1994.
[9] A.M. Cuelenaere, M. van Genuchten, and F.J. Heemstra, "Calibrating a Software Cost Estimation Model: Why and How," Information and Software Technology, vol. 29, no. 10, pp. 558-567, 1987.
[10] N.E. Fenton, Software Metrics: A Rigorous Approach. London: Chapman & Hall, 1991.
[11] A. Gelman, J. Carlin, H. Stern, and D. Rubin, Bayesian Data Analysis. Chapman & Hall, 1995.
[12] O. Helmer, Social Technology. New York: Basic Books, 1966.
[13] International Function Point Users Group (IFPUG), Function Point Counting Practices Manual, Release 4.0, 1994.
[14] D.R. Jeffery and G.C. Low, "Calibrating Estimation Tools for Software Development," Software Eng. J., vol. 5, no. 4, pp. 215-221, 1990.
[15] R.W. Jensen, "An Improved Macrolevel Software Development Resource Estimation Model," Proc. Fifth ISPA Conf., pp. 88-92, Apr. 1983.
[16] E.J. Johnson, "Expertise and Decision Under Uncertainty: Performance and Process," The Nature of Expertise, Chi, Glaser, and Farr, eds., Lawrence Erlbaum Assoc., 1988.
[17] C. Jones, Applied Software Measurement. McGraw-Hill, 1997.
[18] G.G. Judge, W. Griffiths, and R. Carter Hill, Learning and Practicing Econometrics. Wiley, 1993.
[19] C.F. Kemerer, "An Empirical Validation of Software Cost Models," Comm. ACM, vol. 30, no. 5, pp. 416-429, 1987.
[20] B.A. Kitchenham and N.R. Taylor, "Software Cost Models," ICL Technical J., vol. 1, May 1984.
[21] E.E. Leamer, Specification Searches, Ad Hoc Inference with Nonexperimental Data. Wiley Series, 1978.
[22] T.F. Masters, "An Overview of Software Cost Estimating at the National Security Agency," J. Parametrics, vol. 5, no. 1, pp. 72-84, 1985.
[23] S.N. Mohanty, "Software Cost Estimation: Present and Future," Software Practice and Experience, vol. 11, pp. 103-121, 1981.
[24] G.M. Mullet, "Why Regression Coefficients Have the Wrong Sign," J. Quality Technology, 1976.
[25] L.H. Putnam and W. Myers, Measures for Excellence. Yourdon Press Computing Series, 1992. http://www.qsm.com/slim_estimate.html
[26] R.M. Park et al., "Software Size Measurement: A Framework for Counting Source Statements," CMU-SEI-92-TR-20, Software Eng. Inst., Pittsburgh, Pa., 1992.
[27] J.S. Poulin, Measuring Software Reuse: Principles, Practices and Economic Models. Addison-Wesley, 1997.
[28] H. Rubin, "ESTIMACS," IEEE, 1983.
[29] M. Shepperd and M. Schofield, "Estimating Software Project Effort Using Analogies," IEEE Trans. Software Eng., vol. 23, no. 12, Nov. 1997.
[30] Center for Software Engineering, "COCOMO II Cost Estimation Questionnaire," Computer Science Dept., USC Center for Software Eng., 1997. http://sunset.usc.edu/Cocomo.html
[31] Center for Software Engineering, "COCOMO II Model Definition Manual," Computer Science Dept., USC Center for Software Eng., 1997. http://sunset.usc.edu/Cocomo.html
[32] S. Vicinanza, T. Mukhopadhyay, and M. Prietula, "Software Effort Estimation: An Exploratory Study of Expert Performance," Information Systems Research, vol. 2, no. 4, pp. 243-262, 1991.
[33] F. Walkerden and D. Ross Jeffery, "Software Cost Estimation: A Review of Models, Process and Practices," Advances in Computers, 1997.
[34] S. Weisberg, Applied Linear Regression, second ed. New York: John Wiley & Sons, 1985.
[35] "Applications of Bayesian Analysis and Econometrics," The Statistician, vol. 32, pp. 23-34, 1983.

Sunita Chulani holds the BE degree in computer engineering from Bombay University, India, and the MS and PhD degrees in computer science from the Center for Software Engineering, University of Southern California. Dr. Chulani is a research staff member at IBM Research. She is also a visiting associate at USC, participating in the COCOMO research effort. Her main contributions have included the Bayesian calibration approach in COCOMO II and the development of COQUALMO, a cost/quality model. Her main interests include software process improvement, software reliability modeling, and software metrics and cost modeling. She is a member of the IEEE and the IEEE Computer Society.

Barry Boehm received his BA degree from Harvard in 1957 and his MS and PhD degrees from the University of California at Los Angeles in 1961 and 1964, respectively, all in mathematics. Dr. Boehm was with the DARPA Information Science and Technology Office, TRW, the Rand Corporation, and General Dynamics. He is currently the director of the Center for Software Engineering at USC. His current research interests focus on integrating a software system's process model, product model, property model, and success model via an approach called Model-Based Architecting and Software Engineering (MBASE). His contributions to the field include the Constructive Cost Model (COCOMO), the Spiral Model of the software process, and the Theory W (win-win) approach to software management and requirements determination. He has served on the board of several scientific journals and as a member of the governing board of the IEEE Computer Society. He currently serves as chair of the Board of Visitors for the CMU Software Engineering Institute. He is a fellow of the IEEE, AIAA, and ACM, and a member of the IEEE Computer Society and the National Academy of Engineering.

Bert M. Steece is deputy dean of faculty and professor in the Information and Operations Management Department at the University of Southern California and is a specialist in statistics. His research areas include statistical modeling, time series analysis, and statistical computing. He is on the editorial board of Mathematical Reviews and has served on various committees for the American Statistical Association. Steece has consulted on a variety of subjects, including forecasting, accounting, health care systems, legal cases, and chemical engineering.
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 21, NO. 2, FEBRUARY 1995
Machine Learning Approaches to Estimating Software Development Effort

Krishnamoorthy Srinivasan and Douglas Fisher, Member, IEEE
Abstract—Accurate estimation of software development effort is critical in software engineering. Underestimates lead to time pressures that may compromise full functional development and thorough testing of software. In contrast, overestimates can result in noncompetitive contract bids and/or overallocation of development resources and personnel. As a result, many models for estimating software development effort have been proposed. This article describes two methods of machine learning, which we use to build estimators of software development effort from historical data. Our experiments indicate that these techniques are competitive with traditional estimators on one dataset, but also illustrate that these methods are sensitive to the data on which they are trained. This cautionary note applies to any model-construction strategy that relies on historical data. All such models for software effort estimation should be evaluated by exploring model sensitivity on a variety of historical data.

Index Terms—Software development effort, machine learning, decision trees, regression trees, and neural networks.
Manuscript received October 1992; revised October 1993 and October 1994. Recommended by D. Wile. D. Fisher's work was supported by NASA Ames Grant NAG 2-834. K. Srinivasan is with Personal Computer Consultants, Inc., Washington, D.C. D. Fisher is with the Department of Computer Science, Vanderbilt University, Nashville, Tennessee (e-mail: [email protected]). IEEE Log Number 9408517.

I. INTRODUCTION

ACCURATE estimation of software development effort has major implications for the management of software development. If management's estimate is too low, then the software development team will be under considerable pressure to finish the product quickly, and hence the resulting software may not be fully functional or tested. Thus, the product may contain residual errors that need to be corrected during a later part of the software life cycle, in which the cost of corrective maintenance is greater. On the other hand, if a manager's estimate is too high, then too many resources will be committed to the project. Furthermore, if the company is engaged in contract software development, then too high an estimate may fail to secure a contract.

The importance of software effort estimation has motivated considerable research in recent years. Parametric models such as COCOMO [3], FUNCTION POINTS [2], and SLIM [16] "calibrate" prespecified formulas for estimating development effort from historical data. Inputs to these models may include the experience of the development team, the required reliability of the software, the programming language in which the software is to be written, and an estimate of the final number of delivered source lines of code (SLOC). In contrast, many methods of machine learning make no or minimal assumptions about the form of the function under study (e.g., development effort), but as with other approaches they depend on historical data. In particular, over a known set of training data, the learning algorithm constructs "rules" that fit the data, and which hopefully fit previously unseen data in a reasonable manner as well. This article illustrates machine learning approaches to estimating software development effort using an algorithm for building regression trees [4] and a neural-network learning approach known as BACKPROPAGATION [19]. Our experiments, using established case libraries [3], [11], indicate possible advantages of the approach relative to traditional models, but also point to limitations that motivate continued research.

II. MODELS FOR ESTIMATING SOFTWARE DEVELOPMENT EFFORT

Many models have been developed to estimate software development effort. Many of these models are parametric, in that they predict development effort using a formula of fixed form that is parameterized from historical data. In preparation for later discussion we summarize three such models that were highlighted in a previous study by Kemerer [11].

Putnam [16] developed an early model known as SLIM, which estimates the cost of software by using SLOC as the major input. The underlying assumption of this model is that resource consumption, including personnel, varies with time and can be modeled with some degree of accuracy by the Rayleigh distribution:

Rc = (Kt/k²) e^(−t²/2k²)

where Rc is the instantaneous resource consumption, t is the time into the development effort, and k is the time at which consumption is at its peak. The parameter k and other "management parameters" are estimated from characteristics of a particular software project, notably estimated SLOC. The general relationship between inputs such as SLOC and management parameters can be determined from historical data.

The COnstructive COst MOdel (COCOMO) was developed by Boehm [3] based on a regression analysis of 63 completed projects. COCOMO relates the effort required to develop a software project (in terms of person-months) to Delivered Source Instructions (DSI). Thus, like SLIM, COCOMO assumes SLOC as a major input. If the software project is judged to
be straightforward, then the basic COCOMO model (COCOMO-basic) relates the nominal development effort (N) and DSI as follows:

N = 3.2 × (KDSI)^1.05
where KDSI is the DSI in 1000s. However, the prediction of the basic COCOMO model can be modified using cost drivers. Cost drivers are classified under four major headings relating to attributes of the product (e.g., required software reliability), computer platform (e.g., main memory limitations), personnel (e.g., analyst capability), and the project (e.g., use of modern programming practices). These factors serve to adjust the nominal effort up or down. These cost drivers and other considerations extend the basic model to intermediate and final forms.

The Function Point method was developed by Albrecht [2]. Function points are based on characteristics of the project that are at a higher descriptive level than SLOC, such as the number of input transaction types and the number of reports. A notable advantage of this approach is that it does not rely on SLOC, which facilitates estimation early in the project life cycle (i.e., during requirements definition) and by nontechnical personnel. To count function points requires that one count user functions and then make adjustments for processing complexity. There are five types of user function that are included in the function point calculation: external input types, external output types, logical internal file types, external interface file types, and external inquiry types. In addition, there are 14 processing complexity characteristics, such as transaction rates and online updating. A function point count is calculated based on the number of transactions and the complexity characteristics. The development effort estimate given the function point count, F, is: N = 54 × F − 13390.

Recently, a case-based approach called ESTOR was developed for software effort estimation. This model was developed by Vicinanza et al. [23] by obtaining protocols from a human expert. From a library of cases developed from expert-supplied protocols, an instance called the source is retrieved that is most "similar" to the target problem to be solved. The solution of the most similar problem retrieved from the case library is adapted to account for differences between the source problem and the target problem, using rules inferred from analysis of the human expert's protocols. An example of an adjustment rule is:

IF staff size of Source project is small, AND staff size of Target is large, THEN increase effort estimate of Target by 20%.

Vicinanza et al. have shown that ESTOR performs better than COCOMO and FUNCTION POINTS on restricted samples of problems. In sum, there have been a variety of models developed for estimating development effort. With the exception of ESTOR, these are parametric approaches that assume that an initial estimate can be provided by a formula that has been fit to historical data.
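For illustration, the two fixed-form estimators quoted above can be written directly. This is a sketch only; the 1.05 exponent is the standard basic-COCOMO value for straightforward ("organic"-mode) projects, supplied here as an assumption since only the 3.2 coefficient is legible in the text.

def cocomo_basic_months(kdsi):
    # Nominal development effort N from size in thousands of DSI.
    return 3.2 * kdsi ** 1.05

def function_point_months(f):
    # Development effort estimate from a function-point count F.
    return 54 * f - 13390

print(cocomo_basic_months(50))     # about 195 person-months for 50 KDSI
print(function_point_months(300))  # 2810 person-months for 300 function points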
III. MACHINE LEARNING APPROACHES TO ESTIMATING DEVELOPMENT EFFORT

This section describes two machine learning strategies that we use to estimate software development effort, which we assume is measured in development months (M). In many respects this work stems from a more general methodology for developing expert systems. Traditionally, expert systems have been developed by extracting the rules that experts apparently use via an interview process or protocol analysis (e.g., ESTOR), but an alternate approach is to allow machine learning programs to formulate rulebases from historical data. This methodology requires historical data on which to apply learning strategies.

There are several aspects of software development effort estimation that make it amenable to machine learning analysis. Most important, previous researchers have identified at least some of the attributes relevant to software development effort estimation, and historical databases defined over these relevant attributes have been accumulated. The following sections describe two very different learning algorithms that we use to test the machine learning approach. Other research using machine learning techniques for software resource estimation is found in [5], [14], [15], [22], which we will discuss throughout the paper. In short, our work adds to the collection of machine learning techniques available to software engineers, and our analysis stresses the sensitivity of these approaches to the nature of historical data and other factors.
A. Learning Decision and Regression Trees

Many learning approaches have been developed that construct decision trees for classifying data [4], [17]. Fig. 1 illustrates a partial decision tree over Boehm's original 63 projects from which COCOMO was developed. Each project is described over dimensions such as AKDSI (i.e., adjusted delivered source instructions), TIME (i.e., the required system response time), and STOR (i.e., main memory limitations). The complete set of attributes used to describe these data is given in Appendix A. The mean of actual project development months labels each leaf of the tree. Predicting development effort for a project requires that one descend the decision tree along an appropriate path, and the leaf value along that path gives the estimate of development effort for the new project. The decision tree in Fig. 1 is referred to as a regression tree, because the intent of categorization is to generate a prediction along a continuous dependent dimension (here, software development effort).

Fig. 1. A regression tree over Boehm's 63 software project descriptions. Numbers in square brackets represent the number of projects classified under a node.

There are many automatic methods for constructing decision and regression trees from data, but these techniques are typically variations on one simple strategy. A "top-down" strategy examines the data and selects an attribute that best divides the data into disjoint subpopulations. The most important aspect of decision and regression tree learners is the criterion used to select a "divisive" attribute during tree construction. In one variation, the system selects the attribute with values that maximally reduce the mean squared error (MSE) of the dependent dimension (e.g., software development effort) observed in the training data. The MSE of any set, S, of training examples taking on values yₖ in the continuous dependent dimension is:

MSE(S) = Σₖ (yₖ − ȳ)²

where ȳ is the mean of the yₖ values exhibited in S. The values of each attribute, Aᵢ, partition the entire training data set, T, into subsets, Tᵢⱼ, where every example in Tᵢⱼ takes on the same value, say Vⱼ, for attribute Aᵢ. The attribute, Aᵢ, that maximizes the difference:

ΔMSE = MSE(T) − Σⱼ MSE(Tᵢⱼ)

is selected to divide the tree. Intuitively, the attribute that minimizes the error over the dependent dimension is used. While MSE values are computed over the training data, the inductive assumption is that selected attributes will similarly reduce error over future cases as well.

This basic procedure of attribute selection is easily extended to allow continuously-valued attributes: all ordered 2-partitions of the observed values in the training data are examined. In essence, the dimension is split around each observed value. The effect is to 2-partition the dimension in k − 1 alternate ways (where k is the number of observed values), and the binary "split" that is best according to ΔMSE is considered along with other possible attributes to divide a regression-tree node. Such "splitting" is common in the tree of Fig. 1; see AKDSI, for example. Approaches have also been developed that split a continuous dimension into more than two ranges [9], [15], though we will assume 2-partitions only. Similarly, techniques that 2-partition all attribute domains, for both continuous and nominally-valued (i.e., finite, unordered) attributes, have been explored (e.g., [24]). For continuous attributes this bisection process operates as we have just described, but for a nominally-valued attribute all ways to group the values of the attribute into two disjoint sets are considered. Suffice it to say that treating all attributes as though they had the same number of values (e.g., 2) for purposes of attribute selection mitigates certain biases that are present in some attribute selection measures (e.g., ΔMSE). As we will note again in Section IV, we ensure that all attributes are either continuous or binary-valued at the outset of regression-tree construction.

The basic regression-tree learning algorithm is summarized in Fig. 2:

FUNCTION CARTX(Instances)
  IF termination-condition(Instances)
  THEN RETURN mean among Instances
  ELSE Set Best-Attribute to most informative attribute among the Instances.
       RETURN node testing Best-Attribute, with subtrees
         CARTX({I | I is an Instance with value V1 of Best-Attribute}),
         . . .,
         CARTX({I | I is an Instance with value Vn of Best-Attribute})

Fig. 2. The basic regression-tree construction procedure.

The data set is first tested to see whether tree construction is worthwhile; if all the data are classified identically or some other statistically-based criterion is satisfied, then expansion ceases. In this case, the algorithm simply returns a leaf labeled by the mean value of the dependent dimension found in the training data. If the data are not sufficiently distinguished, then the best divisive attribute according to ΔMSE is selected, the attribute's values are used to partition the data into subsets, and the procedure is recursively called on these subsets to expand the tree. When used to construct predictors along continuous dimensions, this general procedure is referred
to as recursive-partitioning regression. Our experiments use a partial reimplementation of a system known as CART [4]; we refer to our reimplementation as CARTX. Previously, Porter and Selby [14], [15], [22] investigated the use of decision-tree induction for estimating development effort and other resource-related dimensions. Their work assumes that if predictions over a continuous dependent dimension are required, then the continuous dimension is "discretized" by breaking it into mutually-exclusive ranges. More commonly used decision-tree induction algorithms, which assume discrete-valued dependent dimensions, are then applied to the appropriately classified data. In many cases this preprocessing of a continuous dependent dimension may be profitable, though regression-tree induction demonstrates that the general tree-construction approach can be adapted for direct manipulation of a continuous dependent dimension. This is also the case with the learning approach that we describe next.
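The following is a compact sketch of recursive-partitioning regression in the spirit of CARTX, using the ΔMSE criterion described above. It is an illustration under stated assumptions (numeric attribute matrix X, effort vector y), not the authors' implementation.

import numpy as np

def mse(y):
    # Sum of squared deviations from the mean (zero for an empty set).
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def build_tree(X, y, depth=0, max_depth=4, min_leaf=2):
    # Return a leaf once the data is too small or the tree is deep enough.
    if depth >= max_depth or len(y) <= min_leaf:
        return {"leaf": float(y.mean())}
    best = None
    for j in range(X.shape[1]):               # each candidate attribute
        for v in np.unique(X[:, j])[:-1]:     # each ordered 2-partition
            left = X[:, j] <= v
            gain = mse(y) - (mse(y[left]) + mse(y[~left]))   # delta-MSE
            if best is None or gain > best[0]:
                best = (gain, j, v, left)
    if best is None or best[0] <= 0:
        return {"leaf": float(y.mean())}
    _, j, v, left = best
    return {"attr": j, "split": v,
            "lo": build_tree(X[left], y[left], depth + 1, max_depth, min_leaf),
            "hi": build_tree(X[~left], y[~left], depth + 1, max_depth, min_leaf)}

def predict(tree, x):
    # Descend to a leaf; its mean is the effort estimate.
    while "leaf" not in tree:
        tree = tree["lo"] if x[tree["attr"]] <= tree["split"] else tree["hi"]
    return tree["leaf"]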
Fig. 3. A network architecture for software development effort estimation.

Fig. 4. An example of function approximation by a regression tree.
B. A Neural Network Approach to Learning

A learning approach that is very different from that outlined above is BACKPROPAGATION, which operates on a network of simple processing elements as illustrated in Fig. 3. This basic architecture is inspired by biological nerve nets, and is thus called an artificial neural network. Each line between processing elements has a corresponding and distinct weight. Each processing unit in this network computes a nonlinear function of its inputs and passes the resultant value along as its output. The favored function is

o = 1 / (1 + e^(−Σᵢ wᵢIᵢ))

where Σᵢ wᵢIᵢ is a weighted sum of the inputs, Iᵢ, to a processing element [19], [25]. The network generates output by propagating the initial inputs, shown on the left-hand side of Fig. 3, through subsequent layers of processing elements to the final output layer. This net illustrates the kind of mapping that we will use for estimating software development effort, with inputs corresponding to various project attributes and the output line corresponding to the estimated development effort. The inputs and output are restricted to numeric values. For numerically-valued attributes this mapping is natural, but for nominal data such as LANG (implementation language), a numeric representation must be found. In this domain, each value of a nominal attribute is given its own input line. If the value is present in an observation then the input line is set to 1.0, and if the value is absent then it is set to 0.0. Thus, for a given observation the input line corresponding to an observed nominal value (e.g., COB) will be 1.0, and the others (e.g., FTN) will be 0.0. Our application requires only one network output, but other applications may require more than one.

Details of the BACKPROPAGATION learning procedure are beyond the scope of this article, but intuitively the goal of learning is to train the network to generate appropriate output patterns for corresponding input patterns. To accomplish this,
comparisons are made between a network's actual output pattern and an a priori known correct output pattern. The difference or error between each output line and its corresponding correct value is "backpropagated" through the net and guides the modification of weights in a manner that will tend to reduce the collective error between actual and correct outputs on training patterns. The procedure has been shown to converge on accurate mappings between input and output patterns in a variety of domains [21], [25].
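A minimal sketch of the processing-element function and of one gradient step of the kind BACKPROPAGATION performs is shown below for a single sigmoid unit; a full multilayer network also propagates the error back through hidden layers. The weights, inputs, and target are hypothetical.

import math

def unit_output(weights, inputs):
    # o = 1 / (1 + e^-(sum_i w_i * I_i)), the favored nonlinearity.
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-s))

def backprop_step(weights, inputs, target, rate=0.5):
    # Move each weight against the error gradient for one training pattern.
    o = unit_output(weights, inputs)
    delta = (o - target) * o * (1.0 - o)     # error times sigmoid derivative
    return [w - rate * delta * x for w, x in zip(weights, inputs)]

w = [0.1, -0.2, 0.05]
for _ in range(1000):                        # repeated presentations
    w = backprop_step(w, [1.0, 0.5, 0.3], target=0.9)
print(unit_output(w, [1.0, 0.5, 0.3]))       # approaches the target 0.9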
C. Approximating Arbitrary Functions

In trying to approximate an arbitrary function like development effort, regression trees approximate the function with a "staircase" function. Fig. 4 illustrates a function of one continuous, independent variable. A regression tree decomposes this function's domain so that the mean at each leaf reflects the function's range within a local region. The "hidden" processing elements that reside between the input and output layers of a neural network do roughly the same thing, though the approximating function is generally smoothed. The granularity of this partitioning of the function is modulated by the depth of a regression tree or the number of hidden units in a network.

Each learning approach is nonparametric, since it makes no a priori assumptions about the form of the function being approximated. This contrasts with a wide variety of parametric methods for function approximation, such as the regression methods of statistics and the polynomial interpolation methods of numerical analysis [10]. Other nonparametric methods include genetic algorithms and nearest neighbor approaches [1], though we will not elaborate on any of these alternatives here.

D. Sensitivity to Configuration Choices

Both BACKPROPAGATION and CARTX require that the analyst make certain decisions about algorithm implementation. For example, BACKPROPAGATION can be used to train networks with differing numbers of hidden units. Too few hidden units can compromise the ability of the network to approximate a desired function. In contrast, too many hidden units can lead to "overfitting," whereby the learning system fits the "noise"
present in the training data, as well as the meaningful trends that we would like to capture. BACKPROPAGATION is also typically trained by iterating through the training data many times. In general, the greater the number of iterations, the greater the reduction in error over the training sample, though there is no general guarantee of this. Finally, BACKPROPAGATION assumes that the weights in the neural network are initialized to small, random values prior to training. The initial random weight settings can also impact learning success, though in many applications this is not a significant factor. There are other parameters that can affect BACKPROPAGATION's performance, but we will not explore these here. In CARTX, the primary dimension under control by the experimenter is the depth to which the regression tree is allowed to grow. Growth to too great a depth can lead to overfitting, and too little growth can lead to underfitting. Experimental results of Section IV-B illustrate the sensitivity of each learning system to certain configuration choices.
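A sensitivity sweep of the kind just described can be sketched as follows; build_tree and predict are from the earlier regression-tree sketch, and the train/test arrays are assumed given.

import numpy as np

def depth_sensitivity(X_train, y_train, X_test, y_test, depths=(1, 2, 4, 8)):
    # Vary one configuration choice (maximum depth) and record mean MRE.
    results = {}
    for d in depths:
        tree = build_tree(X_train, y_train, max_depth=d)
        errors = [abs(predict(tree, x) - y) / y for x, y in zip(X_test, y_test)]
        results[d] = float(np.mean(errors))
    return results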
IV. OVERVIEW OF EXPERIMENTAL STUDIES

We conducted several experiments with CARTX and BACKPROPAGATION for the task of estimating software development effort. In general, each of our experiments partitions historical data into samples used to train our learning systems, and disjoint samples used to test the accuracy of the trained classifier in predicting development effort.

For purposes of comparison, we refer to previous experimental results by Kemerer [11]. He conducted comparative analyses between SLIM, COCOMO, and FUNCTION POINTS on a database of 15 projects.¹ These projects consist mainly of business applications, with a dominant proportion of them (12/15) written in the COBOL language. In contrast, the COCOMO database includes instances of business, scientific, and system software projects, written in a variety of languages including COBOL, PL1, HMI, and FORTRAN. For comparisons involving COCOMO, Kemerer coded his 15 projects using the same attributes used by Boehm.

One way that Kemerer characterized the fit between the predicted (Mest) and actual (Mact) development person-months was by the magnitude of relative error (MRE):

MRE = |Mest − Mact| / Mact

This measure normalizes the difference between actual and predicted development months, and supplies an analyst with a measure of the reliability of estimates by different models. However, when using a model developed at one site for estimation at another site, there may be local factors that are not modeled, but which nonetheless impact development effort in a systematic way. Thus, following earlier work by Albrecht [2], Kemerer did a linear regression/correlation analysis to calibrate the predictions, with Mest treated as the independent variable and Mact treated as the dependent variable. The R² value indicates the amount of variation in the actual values accounted for by a linear relationship with the estimated values. R² values close to 1.0 suggest a strong linear relationship and those close to 0.0 suggest no such relationship. Our experiments will characterize the abilities of BACKPROPAGATION and CARTX using the same dimensions as Kemerer: MRE and R².

As we noted, each system imposes certain constraints on the representation of data. There are a number of nominally-valued attributes in the project databases, including implementation language. BACKPROPAGATION requires that each value of such an attribute be treated as a binary-valued attribute that is either present (1) or absent (0) in each project. Thus, each value of a nominal attribute corresponded to a unique input to the neural network, as noted in Section III-B. We represent each nominal attribute as a set of binary-valued attributes for CARTX as well. As we noted in Section III-A, this mitigates certain biases in attribute selection measures such as ΔMSE. In contrast, each continuous attribute identified by Boehm corresponded to one input to the neural network. There was one output unit, which reflected a prediction of development effort and was also continuous. Preprocessing for the neural network normalized these values between 0.0 and 1.0. A simple scheme was used where each value was divided by the maximum of the values for that attribute in the data. It has been shown that neural networks empirically converge more quickly if all the values for the attributes lie between zero and one. No such normalization was done for CARTX, since it would have no effect on CARTX's performance.

A. Experiment 1: Comparison with Kemerer's Results

Our first experiment compares the performance of machine learning algorithms with standard models of software development estimation, using Kemerer's data as a test sample. To test CARTX and BACKPROPAGATION, we trained each system on COCOMO's database of 63 projects and tested on Kemerer's 15 projects. For BACKPROPAGATION we initially configured the network with 33 input units, 10 hidden units, and 1 output unit, and required that the training set error reach 0.00001 or continue for a maximum of 12,000 presentations of the training data. Training ceased after 12,000 presentations without converging to the required error criterion. The experiment was done on an AT&T PC 386 under DOS. It required about 6-7 hours for 12,000 presentations of the training patterns. We actually repeated this experiment 10 times, though we only report the results of one run here; we summarize the complete set of experiments in Section IV-B.

In our initial configuration of CARTX, we allowed the regression tree to grow to a maximum depth, where each leaf represented a single software project description from the COCOMO data. We were motivated initially to extend the tree to singleton leaves because the data is very sparse relative to the number of dimensions used to describe each data point; our concern is not so much with overfitting as it is with underfitting the data. Experiments with the regression tree learner were performed on a SUN 3/60 under UNIX, and required about a minute. The predictions obtained from the learning algorithms (after training on the COCOMO data) are shown in Table I with the actual person-months of Kemerer's

¹We thank Professor Chris Kemerer for supplying this dataset.
TABLE I
CARTX and BACKPROPAGATION Estimates on Kemerer's Data

Actual      CARTX      BACKPROP
287.00      1893.30    81.45
82.50       162.03     14.14
1107.31     11400.00   1000.43
86.90       243.00     88.37
336.30      6600.00    540.42
84.00       129.17     13.16
23.20       129.17     45.38
130.30      243.00     78.92
116.00      1272.00    113.18
72.00       129.17     15.72
258.70      243.00     80.87
230.70      243.00     28.65
157.00      243.00     44.29
246.90      243.00     39.17
69.90       129.17     214.71

TABLE II
A Comparison of Learning and Algorithmic Approaches. The Regression Equations Give Mact as a Function of Mest (x)

Model        MRE(%)   R-Square   Regress. Eq.
CARTX        364      0.83       102.5 + 0.075x
BACKPROP     70       0.80       78.13 + 0.88x
FUNC. PTS.   103      0.58       -37 + 0.96x
COCOMO       610      0.70       27.7 + 0.156x
SLIM         772      0.89
B. Experiment 2: Sensitivity of the Learning Algorithms 15 projects. We note that some predictions of CARTX do not correspond to exact person-month values of any COCOMO (training set) project, even though the regression tree was developed to singleton leaves. This stems from the presence of missing values for some attributes in Kemerer's data. If, during classification of a test project, we encounter a decision node that tests an attribute with an unknown value in the test project, both subtrees under the decision node are explored, In such a case, the system's final prediction of development effort is a weighted mean of the predictions stemming from each subtree. The approach is similar to that described in [17]. Table II summarizes the MRE and R2 values resulting from a linear regression of Mest and Mact values for the two learning algorithms, and results obtained by Kemerer with COCOMO-BASIC, FUNCTION POINTS, and SLIM. 2 These results
indicate that CARTX'S and BACKPROPAGATION 'S predictions show a strong linear relationship with the actual development effort values for the 15 test projects.3 On this dimension, the performance of the learning systems is less than SUM'S performance in Kemerer's experiments, but better than the other two models. In terms of mean MRE, BACKPROPAGATION does strikingly well compared to the other approaches, and CARTX'S MRE is approximately one-half that of SLIM and COCOMO. In sum, Expenment 1 rilustrates two points. In an absolute sense, none of the models does particularly well at estimating software development effort, particularly along the MRE dimension, but in a relative sense both learning approaches are competitive with traditional models examined by Kemerer on one dataset. In general, even though MRE is high in the ,
'Results are reported for COCOMO-BASIC (i.e., without cost drivers), which was comparable to the intermediate and detailed models on this data, in addition, Kemerer actually reported R2, which is R? adjusted for degrees of freedom^and which is slightly lower than the unadjusted R2 values that we report. R2 valuesreportedby Kemerer are 0.55,0.68, and 0.88 for FUNCTION POINTS, COCOMO, and SLIM, respectively.
,
3
, ,
,
. ,, .
. -c
„„
Both the slope and R value are significant at the 99% confidence level. The t coefficients for determining the significance of slope are 8.048 and 7.25 for CARTX and BACKPROPAGATION, respectively.
W e have noted m a t each leaming system assumes a number
«grow- regres. ] u d e d i n t h e neural n e t w o r k T h e s e c h o i c e s c a n significantly impact the success o f l e a m i n g . E x p e riment 2 illustrates the sensitivity of our t w o l e a m i n g systems relative t0 different choices along ^ ^ d i m e n s i o n s . I n particular, W e repeated Experiment 1 using BACKPROPAGATION with differing numbers of hidden units a n d u s i n g C A R T X w i t h d i f f e r i n g c o n s t r aints on regression-tree growth T a b l e m i l l u s t r a t e s o u r r e s u l t s w i t h BACKPROPAGATION. E a c h c e l l s u m m a r i z e s reS ults over 10 experimental trials, rather ta one ^ w h i c h w a s lepoTted in Section IVA for p r e s e n t a t i o n p u r p o s e s . Thus, Max, and Min values of important choices such as depth to which t0 sion
of
^ ^
R2
o r ±e
a n d
of hidden units i n c
mmhel
in
M R E
each
cell
of
Table
m
suggest
± e
to initial random weight settingS( w h k h w e r e different in e a c h o f t h e 1 0 e x p e r i m e n t a i t r f a l s T h e e x p e r i m e n t a l r e s u lts of Section IV-A reflect the .< best ,, ^ ^ 10 ^ s u m m a r i z e d in Table Ill's 10m d d e n _ u n i t c o l u m n . I n general, however, for 5, 10, and 15 h i d d e n ^ ^ MRE s c o r e s m s t i n c o m p a r a b i e o r s u p e r i o r to s o m e o f m e o t h e r m o d e l s s u m m a r i z e d i n Table II, and mean R2 s c o r e s s u g g e s t m a t s i g ni f i c ant linear relationships between p r e d i c t e d a n d a c t u a l development months are often found. Poor resuks obtained with n o hidden units indicate ^ i r n p O rt a nce o f m e s e for a c c u r a t e f u n c t i o n a p p r o x i m a t i o n . T h e p e r f o r m a n c e o f C A R T X c a n vary with the depth to w h i c h w e e x t e n d t h e regression tree. The results of Experiment l ^ repeated ^ a n d r e p r e S ent the case where required sensitivity
of
BACKPROPAGATION
accuracy over the training data is 0%—that is, the tree is
, , . , , _ _ . . . decomposed to singleton leaves. However, w e experimented with more conservative tree expansion policies, where CARTX extended the tree only to the point where an error threshold ( r e l a t i v e to the training data) is satisfied. In particular, trees , , .,__,
were grown to leaves where the mean MRE among projects .
,
._ l prespecified threshold that ranged from 0% to 500%. The MRE of each project at a leaf
a t a l e a f w a s less t h a n o r e< ual t 0 a
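A small sketch of this expansion test follows (illustrative only, not the CARTX source); the 0% to 500% thresholds correspond to 0.0 to 5.0 here.

```python
def leaf_mean_mre(efforts):
    """Mean MRE of a leaf's projects against the leaf's prediction,
    i.e., the mean person-months of the projects at that leaf."""
    m_bar = sum(efforts) / len(efforts)
    return sum(abs(m_bar - m) / m for m in efforts) / len(efforts)

def should_expand(efforts, threshold):
    """Keep splitting while the leaf's mean MRE exceeds the threshold
    (a threshold of 0.0 drives expansion toward singleton leaves)."""
    return len(efforts) > 1 and leaf_mean_mre(efforts) > threshold

# Example: a candidate leaf holding three training projects.
print(should_expand([23.2, 69.9, 84.0], threshold=0.25))
```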
TABLE III
BACKPROPAGATION RESULTS WITH VARYING NUMBERS OF HIDDEN NODES

  Hidden Units      0      5     10     15
  Mean R²        0.04   0.52   0.60   0.59
  Max R²         0.04   0.84   0.80   0.85
  Min R²         0.03   0.08   0.10   0.01
  Mean MRE(%)     618    104    133    111
  Max MRE(%)      915    163    254    161
  Min MRE(%)      369     72     70     77

TABLE IV
CARTX RESULTS WITH VARYING TRAINING ERROR THRESHOLDS

  Threshold    0%    25%   50%   100%  200%  300%  400%  500%
  MRE(%)       364   404   461   870   931   931   931   835
  R²           0.83  0.62  0.63  0.60  0.59  0.59  0.59  0.60

TABLE V
SENSITIVITY OVER 20 RANDOMIZED TRIALS ON COMBINED COCOMO AND KEMERER'S DATA

                     Min R²   Mean R²   Max R²
  CARTX               0.00     0.48      0.97
  BACKPROPAGATION     0.00     0.40      0.99

TABLE VI
SENSITIVITY OVER 20 RANDOMIZED TRIALS ON KEMERER'S DATA

                     Min R²   Mean R²   Max R²
  CARTX               0.00     0.26      0.90
  BACKPROPAGATION     0.03     0.39      0.90

Table IV shows CARTX's performance when we vary the required accuracy of the tree over the training data. Table entries correspond to the MRE and R² scores of the learned trees over the Kemerer test data. In general, there is degradation in performance as one tightens the requirement for regression-tree expansion, though there are applications in which this would not be the case. Importantly, other design decisions in decision and regression-tree systems, such as the manner in which continuous attributes are "split" and the criteria used to select divisive attributes, might also influence prediction accuracy. Selby and Porter [22] have evaluated different design choices along a number of dimensions on the success of decision-tree induction systems using NASA software project descriptions as a test-bed. Their evaluation of decision trees, not regression trees, limits the applicability of their findings to the evaluation reported here, but their work sets an excellent example of how sensitivity to various design decisions can be evaluated.

The performance of both systems is sensitive to certain configuration choices, though we have only examined sensitivity relative to one or two dimensions for each system. Thus, it seems important to posit some intuition about how learning systems can be configured to yield good results on new data, given only knowledge of performance on training data. In cases where more training data is available, a holdout method can be used for selecting an appropriate network or regression-tree configuration. The holdout method divides the available data into two sets; one set, generally the larger, is used to build decision/regression trees or train networks under different configurations. The second subset is then classified using each alternative configuration, and the configuration yielding the best results over this second subset is selected as the final configuration. Better yet, a choice of configuration may rest on a form of resampling that exploits many randomized holdout trials. Holdout could have been used in this case by dividing the COCOMO data, but the COCOMO dataset is very small as is. Thus, we have satisfied ourselves with a demonstration of the sensitivity of each learning algorithm to certain configuration decisions. A more complete treatment of resampling and other strategies for making configuration choices can be found in Weiss and Kulikowski [24].
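The holdout selection just described can be sketched as follows; this is a minimal illustration, where train_and_score is a hypothetical stand-in for training either learner under one configuration and scoring it on the held-out projects.

```python
import random

def select_configuration(projects, configurations, train_and_score,
                         holdout_fraction=0.3, seed=0):
    """Holdout selection: evaluate each candidate configuration (e.g.,
    a number of hidden units or a tree-expansion threshold) on a
    held-out subset and keep the one with the lowest error.
    train_and_score(config, train, test) -> error is assumed supplied."""
    rng = random.Random(seed)
    shuffled = projects[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1.0 - holdout_fraction))
    train, test = shuffled[:cut], shuffled[cut:]
    return min(configurations, key=lambda c: train_and_score(c, train, test))
```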
C. Experiment 3: Sensitivity to Training and Test Data

Thus far, our results suggest that using learning algorithms to discover regularities in a historical database can facilitate predictions on new cases. In particular, comparisons between our experimental results and those of Kemerer indicate that, relatively speaking, learning system performance is competitive with some traditional approaches on one common data set. However, Kemerer found that performance of algorithmic approaches was sensitive to the test data. For example, when a selected subset of 9 of the 15 cases was used to test the models, each considerably improved along the R² dimension. By implication, performance on the other 6 projects was likely poorer. We did not repeat this experiment, but we did perform similarly-intended experiments in which the COCOMO and Kemerer data sets were combined into a single dataset of 78 projects; 60 projects were randomly selected for training the learning algorithms and the remaining 18 projects were used for test. Table V summarizes the results over 20 such randomized trials. The low average R² should not mask the fact that many runs yielded strong linear relationships. For example, on 9 of the 20 CARTX runs, R² was above 0.80. We also ran 20 randomized trials in which 10 of Kemerer's cases were used to train each learning algorithm, and 5 were used for test. The results are summarized in Table VI.
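A sketch of the randomized-trial procedure behind Tables V and VI follows (illustrative only; train_and_score is a hypothetical stand-in for training a learner on the training split and computing R² or MRE on the test split).

```python
import random

def randomized_trials(projects, train_and_score, n_trials=20,
                      train_size=60, seed=0):
    """Repeat a random train/test split (e.g., 60 train / 18 test for
    the combined 78-project dataset) and collect one score per trial.
    train_and_score(train, test) -> score is assumed supplied."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        shuffled = projects[:]
        rng.shuffle(shuffled)
        scores.append(train_and_score(shuffled[:train_size],
                                      shuffled[train_size:]))
    return min(scores), sum(scores) / len(scores), max(scores)
```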
a "case library" and the remaining 5 cases were used to test the model's predictions; the particular cases used for test were not reported, but ESTOR outperformed COCOMO and FUNCTION POINTS on this set. We do not know the robustness of ESTOR in the face of the kind of variation experienced in our 20 randomized trials (Table VI), but we might guess that rules inferred from expert problem solving, which ideally stem from human learning over a larger set of historical data, would render ESTOR more robust along this dimension. However, our experiments and those of Kemerer with selected subsets of his 15 cases suggest that care must be taken in evaluating the robustness of any model with such sparse data. In defense of Vicinanza's et al.P methodology, we should note that the creation of a case library depended on an analysis of expert protocols and the derivation of expert-like rules for modifying the predictions of best matching cases, thus increasing the "cost" of model construction to a point that precluded more complete randomized trials. Vicinanza et al. also point out that their study is best viewed as indicating ESTOR's "plausibility" as a good estimator, while broader claims require further study. In addition to experiments with the combined COCOMO and Kemerer data, and the Kemerer data alone, we experimented with the COCOMO data alone for completeness. When experimenting with Kemerer's data alone, our intent was to weakly explore the kind of variation faced by ESTOR. Using the COCOMO data we have no such goal in mind. Thus, this analysis uses an JV-fold cross validation or a "leave-one-out" methodology, which is another form of resampling. In particular, if a data sample is relatively sparse, as ours is, then for each of JV (i.e., 63) projects, we remove it from the sample set, train the learning system with the remaining TV - 1 samples, and then test on the removed project. MRE and R2 are computed over the N tests. CARTX's R2 value was 0.56 (144.48+0.74*, t = 8.82) and MRE was 125.2%. In this experiment we only report results obtained with CARTX, since a fair and comprehensive exploration of BACKPROPAGATION across possible network configurations is computationally expensive and of limited relevance. Suffice it to say that over the COCOMO data alone, which probably reflects a more uniform sample than the mixed COCOMO/Kemerer data, CARTX provides a significant linear fit to the data with markedly smaller MRE than its performance on Kemerer's data.
In sum, our initial results indicating the relative merits of a learning approach to software development effort estimation must be tempered. In fact, a variety of randomized experiments reveal that there is considerable variation in the performance of these systems as the nature of historical training data changes. This variation probably stems from a number of factors. Notably, there are many projects in both the COCOMO and Kemerer datasets that differ greatly in their actual development effort, but are very similar in other respects, including SLOC. Other characteristics, which are currently unmeasured in the COCOMO scheme, are probably responsible for this variation.

V. GENERAL DISCUSSION

Our experimental comparisons of CARTX and BACKPROPAGATION with traditional approaches to development effort estimation suggest the promise of an automated learning approach to the task. Both learning techniques performed well on the R² and MRE dimensions relative to some other approaches on the same data. Beyond this cursory summary, our experimental results and the previous literature suggest several issues that merit discussion.
A. Limitations of Learning from Historical Data

There are well-known limitations of models constructed using historical data. In particular, attributes used to predict software development effort can change over time and/or differ between software development environments. Mohanty [13] makes this point in comparisons between the predictions of a wide variety of models on a single hypothetical software project: he surveyed approximately 15 models and methods for predicting software development effort, used each to predict the effort of a single hypothetical project, and found that the estimated effort on this one project varied significantly over models. Mohanty points out that each model was developed and calibrated with data collected within a unique software environment. The predictions of these models, in part, reflect underlying assumptions that are not explicitly represented in the data. For example, software development sites may use different development tools. These tools are constant within a facility and thus not represented explicitly in data collected at that facility, but this environmental factor is not constant across facilities.

Differing environmental factors not reflected in data are undoubtedly responsible for much of the unexplained variance in our experiments. To some extent, the R² derived from linear regression is intended to provide a better measure of a model's "fit" to arbitrary new data than MRE in cases where the environment from which a model was derived is different from the environment from which new data was drawn. Even so, these environmental differences may not be systematic in a way that is well accounted for by a linear model. In sum, great care must be taken when using a model constructed from data from one environment to make predictions about data from another environment. Even within a site, the environment may evolve over time, thus compromising the benefits of previously-derived models. Machine learning research has recently focussed on the problem of tracking the accuracy of a learned model over time, which triggers relearning when experience with new data suggests that the environment has changed [6]. However, in an application such as software development effort estimation, there are probably explicit indicators that an environmental change is occurring or will occur (e.g., when new development tools or quality control practices are implemented).
&• Engineering the Definition of Data if environmental factors are relatively constant, then there is little need to explicitly represent these in the description of d a t a H o w e v e r j w h e n t h e env ironment exhibits variance along some dimension, it often becomes critical that this variance be codified and included in data description. In this way,
differences across data points can be observed and used in model construction. For example, Mohanty argues that the desired quality of the finished product should be taken into account when estimating development effort. A comprehensive survey by Scacchi [20] of previous software production studies leads to considerable discussion on the pros and cons of many attributes for software project representation.

Thus, one of the major tasks is deciding upon the proper codification of factors judged to be relevant. Consider the dimension of response time requirements (i.e., TIME), which was included by Boehm in project descriptions. This attribute was selected by CARTX during regression-tree construction. However, is TIME an "optimal" codification of some aspect of software projects that impacts development effort? Consider that strict response time requirements may motivate greater coupling of software modules, thereby necessitating greater communication among developers and in general increasing development effort. If predictions of development effort must be made at the time of requirements analysis, then perhaps TIME is a realistic dimension of measurement, but better predictive models might be obtained and used given some measure of software component coupling.

In sum, when building models via machine learning or statistical methods, it is rarely the case that the set of descriptive attributes is static. Rather, in real-world success stories involving machine learning tools the set of descriptive attributes evolves over time as attributes are identified as relevant or irrelevant, the reasons for relevance are analyzed, and additional or replacement attributes are added in response to this analysis [8]. This "model" for using learning systems in the real world is consistent with a long-term goal of Scacchi [20], which is to develop a knowledge-based "corporate memory" of software production practices that is used for both estimating and controlling software development. The machine-learning tools that we have described, and other tools such as ESTOR, might be added to the repertoire of knowledge-acquisition strategies that Scacchi suggests. In fact, Porter and Selby [14] make a similar proposal by outlining the use of decision-tree induction methods as tools for software development.

C. The Limitations of Selected Learning Methods

Despite the promising results on Kemerer's common database, there are some important limitations of CARTX and BACKPROPAGATION. We have touched upon the sensitivity to certain configuration choices. In addition to these practical limitations, there are also some important theoretical limitations, primarily concerning CARTX. Perhaps the most important of these is that CARTX cannot estimate a value along a dimension (e.g., software development effort) that is outside the range of values encountered in the training data. Similar limitations apply to a variety of other techniques as well (e.g., nearest neighbor approaches of machine learning and statistics). In part, this limitation appears responsible for a sizable amount of error on test data. For example, in the experiment illustrating CARTX's sensitivity to training data using 10/5 splits of Kemerer's projects (Section IV-C), CARTX is doomed to being at least a factor of 3 off the mark when estimating the person-month effort required for the project requiring 23.20 M or the project requiring 1107.31 M; the projects closest to each among the remaining 14 projects are 69.90 M and 336.30 M, respectively.

The root of CARTX's difficulties lies in its labeling of each leaf by the mean of development months of projects classified at the leaf. An alternative approach that would enable CARTX to extrapolate beyond the training data would label each leaf by an equation derived through regression—e.g., a linear regression. After classifying a project to a leaf, the regression equation labeling that leaf would then be used to predict development effort given the object's values along the independent variables. In addition, the criterion for selecting divisive attributes would be changed as well. To illustrate, consider only two independent attributes, development team experience and KDSI, and the dependent variable of software development effort. CARTX would undoubtedly select KDSI, since lower (higher) values of KDSI tend to imply lower (higher) means of development effort. In contrast, development team experience might not provide as good a fit using CARTX's error criterion. However, consider a CART-like system that divides data up by an independent variable, finds a best fitting linear equation that predicts development effort given development team experience and KDSI, and assesses error in terms of the differences between predictions using this best fitting equation and actual development months. Using this strategy, development team experience might actually be preferred; even though lesser (greater) experience does not imply lesser (greater) development effort, development team experience does imply subpopulations for which strong linear relationships might exist between independent and dependent variables. For example, teams with lesser experience may not adjust as well to larger projects as do teams with greater experience; that is, as KDSI increases, development effort increases are larger for less experienced teams than more experienced teams. Recently, machine learning systems have been developed that have this flavor [18]. We have not yet experimented with these systems, but the approach appears promising.
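The alternative leaf labeling just described can be sketched as follows, in the spirit of combining instance-based and model-based learning [18]; this is a minimal illustration, not CARTX itself, and the attribute names are hypothetical.

```python
import numpy as np

def fit_leaf(leaf_projects, attrs):
    """Label a leaf with a least-squares linear model of development
    effort over the given independent attributes (plus an intercept)."""
    X = np.array([[p[a] for a in attrs] + [1.0] for p in leaf_projects])
    y = np.array([p["effort"] for p in leaf_projects])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def predict_at_leaf(coef, project, attrs):
    """Unlike a leaf mean, this can extrapolate beyond training efforts."""
    x = np.array([project[a] for a in attrs] + [1.0])
    return float(x @ coef)

# Hypothetical leaf with three training projects:
leaf = [{"kdsi": 10.0, "experience": 3.0, "effort": 24.0},
        {"kdsi": 25.0, "experience": 3.0, "effort": 62.0},
        {"kdsi": 32.0, "experience": 3.0, "effort": 81.0}]
coef = fit_leaf(leaf, ["kdsi", "experience"])
print(predict_at_leaf(coef, {"kdsi": 40.0, "experience": 3.0},
                      ["kdsi", "experience"]))
```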
This "model" for using learning systems in the experience does imply subpopulations for which strong linear real world is consistent with a long-term goal of Scacchi [20], relationships might exist between independent and dependent which is to develop a knowledge-based "corporate memory" of variables. For example, teams with lesser experience may not software production practices that is used for both estimating adjust as well to larger projects as do teams with greater and controlling software development. The machine-learning experience; that is, as KDSI increases, development effort tools that we have described, and other tools such as ESTOR, increases are larger for less experienced teams than more might be added to the repertoire of knowledge-acquisition experienced teams. Recently, machine learning systems have strategies that Scacchi suggests. In fact, Porter and Selby [14] been developed that have this flavor [18]. We have not yet make a similar proposal by outlining the use of decision-tree experimented with these systems, but the approach appears promising. induction methods as tools for software development. The success of CARTX, and decision/regression-tree learners generally, may also be limited by two other processing C. The Limitations of Selected Learning Methods characteristics. First, CARTX uses a greedy attribute selection Despite the promising results on Kemerer's common data- strategy—tree construction assesses the informativeness of a base, there are some important limitations of CARTX and single attribute at a time. This greedy strategy might overlook BACKPROPAGATION. We have touched upon the sensitivity attributes that participate in more accurate regression trees, to certain configuration choices. In addition to these prac- particularly when attributes interact in subtle ways. Second, tical limitations, there are also some important theoretical CARTX builds one classifier over a training set of software limitations, primarily concerning CARTX. Perhaps the most projects. This classifier is static relative to the test projects; important of these is that CARTX cannot estimate a value any subsequent test project description will match exactly one along a dimension (e.g., software development effort) that is conjunctive pattern, which is represented by a path in the outside the range of values encountered in the training data, regression tree. If there is noise in the data (e.g., an error in the Similar limitations apply to a variety of other techniques as recording of an attribute value), then the prediction stemming well (e.g., nearest neighbor approaches of machine learning from the regression-tree path matching a particular test project and statistics). In part, this limitation appears responsible for may be very misleading. It is possible that other conjunctive a sizable amount of error on test data. For example, in the patterns of attribute values matching a particular test project, experiment illustrating CARTX'S sensitivity to training data but which are not represented in the regression tree, could using 10/5 splits of Kemerer's projects (Section IV-C), CARTX ameliorate CARTX'S sensitivity to errorful or otherwise noisy is doomed to being at least a factor of 3 off the mark when project descriptions.
The Optimized Set Reduction (OSR) strategy of Briand,
Basili, and Thomas [5] is related to the CARTX approach in several important ways, but may mitigate problems associated with CARTX—OSR conducts a more extensive search for multiple patterns that match each test observation. In contrast to CARTX's construction of a single classifier that is static relative to the test projects, OSR can be viewed as dynamically building a different classifier for each test project. The specifics of OSR are beyond the scope of this paper, but suffice it to say that OSR looks for multiple patterns that are statistically justified by the training project descriptions and that match a given test project. The predictions stemming from different patterns (say, for software development effort) are then combined into a single, global prediction for the test project. OSR was also evaluated in [5] using Kemerer's data for test, and COCOMO data as a (partial) training sample.⁴ The authors report an average MRE of 94% on Kemerer's data. However, there are important differences in experimental design that make a comparison between results with OSR, BACKPROPAGATION, and CARTX unreliable. In particular, when OSR was used to predict software development effort for a particular Kemerer project, the COCOMO data and the remaining 14 Kemerer projects were used as training examples. In addition, recognizing that Kemerer's projects were selected from the same development environment, OSR was configured to weight evidence stemming from these projects more heavily than those in the COCOMO data set. The sensitivity of results to this "weighting factor" is not described. We should note that the experimental conditions assumed in [5] are quite reasonable from a pragmatic standpoint, particularly the decision to weight projects drawn from the same environment as the test project more heavily. These different training assumptions simply confound comparisons between experimental results, and OSR's robustness across differing training and test sets is not reported. In addition, like the work of Porter and Selby [14], [15], [22], OSR assumes that the dependent dimension of software development effort is nominally-valued for purposes of learning. Thus, this dimension is partitioned into a number of collectively-exhaustive and mutually-exclusive ranges prior to learning. Neither BACKPROPAGATION nor CARTX requires this kind of preprocessing. In any case, OSR appears unique relative to other machine learning systems in that it does not learn a static classifier; rather, it combines predictions from multiple, dynamically-constructed patterns. Whether one is interested in software development effort estimation or not, this latter facility appears to have merits that are worth further exploration.

In sum, CARTX suffers from certain theoretical limitations: it cannot extrapolate beyond the data on which it was trained, it uses a greedy tree expansion strategy, and the resultant classifier generates predictions by matching a project against a single conjunctive pattern of attribute values. However, there appear to be extensions that might mitigate these problems.
"Our choice of using COCOMO data for training and Kemerer's data for test was made independently of [5].
VI. CONCLUDING REMARKS
This article has compared the CARTX and BACKPROPAGATION learning methods to traditional approaches for software effort estimation. We found that the learning approaches were competitive with SLIM, COCOMO, and FUNCTION POINTS as represented in a previous study by Kemerer. Nonetheless, further experiments showed the sensitivity of learning to various aspects of data selection and representation. Mohanty and Kemerer indicate that traditional models are quite sensitive as well.

A primary advantage of learning systems is that they are adaptable and nonparametric; predictive models can be tailored to the data at a particular site. Decision and regression trees are particularly well-suited to this task because they make explicit the attributes (e.g., TIME) that appear relevant to the prediction task. Once implicated, a process that engineers the data definition is often required to explain relevant and irrelevant aspects of the data, and to encode it accordingly. This process is best done locally, within a software shop, where the idiosyncrasies of that environment can be factored in or out. In such a setting, analysts may want to investigate the behavior of systems like BACKPROPAGATION, CART, and related approaches [5], [14], [15], [22] over a range of permissible configurations, thus obtaining performance that is optimal in their environment.
APPENDIX A
DATA DESCRIPTIONS

The attributes defining the COCOMO and Kemerer databases were used to develop the COCOMO model. The following is a brief description of the attributes and some of their suspected influences on development effort. The interested reader is referred to [3] for a detailed exposition of them. These attributes can be classified under four major headings: Product Attributes, Computer Attributes, Personnel Attributes, and Project Attributes.

A. Product Attributes
1) Required Software Reliability (RELY): This attribute measures how reliable the software should be. For example, if serious financial consequences stem from a software fault, then the required reliability should be high.
2) Database Size (DATA): The size of the database to be used by the software may affect development effort. Larger databases generally suggest that more time will be required to develop the software product.
3) Product Complexity (CPLX): The application area has a bearing on the software development effort. For example, communications software will likely have greater complexity than software developed for payroll processing.
4) Adaptation Adjustment Factor (AAF): In many cases software is not developed entirely from scratch. This factor reflects the extent that previous designs are reused in the new project.
B. Computer Attributes
1) Execution Time Constraint (TIME): If there are constraints on processing time, then the development effort will tend to be high.
2) Main Storage Constraint (STOR): If there are memory constraints, then the development effort will tend to be high.
3) Virtual Machine Volatility (VIRT): If the underlying hardware and/or system software change frequently, then development effort will be high.
C. Personnel Attributes
1) Analyst Capability (ACAP): If the analysts working on the software project are highly skilled, then the development effort of the software will be less than projects with less-skilled analysts.
2) Applications Experience (AEXP): The experience of project personnel influences the software development effort.
3) Programmer Capability (PCAP): This is similar to ACAP, but it applies to programmers.
4) Virtual Machine Experience (VEXP): Programmer experience with the underlying hardware and the operating system has a bearing on development effort.
5) Language Experience (LEXP): Experience of the programmers with the implementation language affects the software development effort.
6) Personnel Continuity Turnover (CONT): If the same personnel work on the project from beginning to end, then the development effort will tend to be less than similar projects experiencing greater personnel turnover.

D. Project Attributes
1) Modern Programming Practices (MODP): Modern programming practices like structured software design reduce the development effort.
2) Use of Software Tools (TOOL): Extensive use of software tools like source-line debuggers and syntax-directed editors reduces the software development effort.
3) Required Development Schedule (SCED): If the development schedule of the software project is highly constrained, then the development effort will tend to be high.

Apart from the attributes mentioned above, other attributes that influence the development effort are: the programming language, and the estimated lines of code (unadjusted and adjusted for the use of existing software).

ACKNOWLEDGMENT

The authors would like to thank the three reviewers and the action editor for their many useful comments.
REFERENCES
[1] D. Aha, D. Kibler, and M. Albert, "Instance-based learning algorithms," Machine Learning, vol. 6, pp. 37-66, 1991.
[2] A. Albrecht and J. Gaffney Jr., "Software function, source lines of code, and development effort prediction: A software science validation," IEEE Trans. Software Eng., vol. 9, pp. 639-648, 1983.
[3] B. W. Boehm, Software Engineering Economics. Englewood Cliffs, NJ: Prentice-Hall, 1981.
[4] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Belmont, CA: Wadsworth International, 1984.
[5] L. Briand, V. Basili, and W. Thomas, "A pattern recognition approach for software engineering data analysis," IEEE Trans. Software Eng., vol. 18, pp. 931-942, Nov. 1992.
[6] C. Brodley and E. Rissland, "Measuring concept change," in AAAI Spring Symp. Training Issues in Incremental Learning, 1993, pp. 98-107.
[7] K. De Jong, "Learning with genetic algorithms," Machine Learning, vol. 3, pp. 121-138, 1988.
[8] B. Evans and D. Fisher, "Overcoming process delays with decision tree induction," IEEE Expert, vol. 9, pp. 60-66, Feb. 1994.
[9] U. Fayyad, "On the induction of decision trees for multiple concept learning," doctoral dissertation, EECS Dep., Univ. of Michigan, 1991.
[10] L. Johnson and R. Riess, Numerical Analysis. Reading, MA: Addison-Wesley, 1982.
[11] C. Kemerer, "An empirical validation of software cost estimation models," Commun. ACM, vol. 30, pp. 416-429, May 1987.
[12] A. Lapedes and R. Farber, "Nonlinear signal prediction using neural networks: Prediction and system modeling," Los Alamos National Laboratory, Tech. Rep. LA-UR-87-2662, 1987.
[13] S. Mohanty, "Software cost estimation: Present and future," Software—Practice and Experience, vol. 11, pp. 103-121, 1981.
[14] A. Porter and R. Selby, "Empirically-guided software development using metric-based classification trees," IEEE Software, vol. 7, pp. 46-54, Mar. 1990.
[15] A. Porter and R. Selby, "Evaluating techniques for generating metric-based classification trees," J. Syst. Software, vol. 12, pp. 209-218, July 1990.
[16] L. H. Putnam, "A general empirical solution to the macro software sizing and estimating problem," IEEE Trans. Software Eng., vol. 4, pp. 345-361, 1978.
[17] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[18] J. R. Quinlan, "Combining instance-based and model-based learning," in Proc. 10th Int. Machine Learning Conf., 1993, pp. 236-243.
[19] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing. Cambridge, MA: MIT Press, 1986.
[20] W. Scacchi, "Understanding software productivity: Toward a knowledge-based approach," Int. J. Software Eng. and Knowledge Eng., vol. 1, pp. 293-320, 1991.
[21] T. J. Sejnowski and C. R. Rosenberg, "Parallel networks that learn to pronounce English text," Complex Systems, vol. 1, pp. 145-168, 1987.
[22] R. Selby and A. Porter, "Learning from examples: Generation and evaluation of decision trees for software resource analysis," IEEE Trans. Software Eng., vol. 14, pp. 1743-1757, 1988.
[23] S. Vicinanza, M. J. Prietula, and T. Mukhopadhyay, "Case-based reasoning in software effort estimation," in Proc. 11th Int. Conf. Information Systems, 1990.
[24] S. Weiss and C. Kulikowski, Computer Systems that Learn. San Mateo, CA: Morgan Kaufmann, 1991.
[25] J. Zurada, Introduction to Artificial Neural Systems. St. Paul, MN: West, 1992.
Krishnamoorthy Srinivasan received the M.B.A. in management information systems from the Owen Graduate School of Management, Vanderbilt University, and the M.S. in computer science from Vanderbilt University. He also received the Post Graduate Diploma in industrial engineering from the National Institute for Training in Industrial Engineering, Bombay, India, and the B.E. from the University of Madras, Madras, India. He is currently working as a Principal Software Engineer with Personal Computer Consultants, Inc., Washington, D.C. Before joining PCC, he worked as a Senior Specialist with McKinsey & Company, Inc., Cambridge, MA. His primary research interests are in exploring applications of machine learning techniques to real-world business problems.
Douglas Fisher (M'92) received his Ph.D. in information and computer science from the University of California at Irvine in 1987. He is currently an Associate Professor in computer science at Vanderbilt University. He is an Associate Editor of Machine Learning and IEEE Expert, and serves on the editorial board of the Journal of Artificial Intelligence Research. His research interests include machine learning, cognitive modeling, data analysis, and cluster analysis. An electronic addendum to this article, which reports any subsequent analysis, can be found at http://www.vuse.vanderbilt.edu/~dfisher/dfisher.html. Dr. Fisher is a member of the ACM and AAAI.
Estimating Software Project Effort Using Analogies

Martin Shepperd and Chris Schofield

Abstract—Accurate project effort prediction is an important goal for the software engineering community. To date most work has focused upon building algorithmic models of effort, for example, COCOMO. These can be calibrated to local environments. We describe an alternative approach to estimation based upon the use of analogies. The underlying principle is to characterize projects in terms of features (for example, the number of interfaces, the development method or the size of the functional requirements document). Completed projects are stored and then the problem becomes one of finding the most similar projects to the one for which a prediction is required. Similarity is defined as Euclidean distance in n-dimensional space where n is the number of project features. Each dimension is standardized so all dimensions have equal weight. The known effort values of the nearest neighbors to the new project are then used as the basis for the prediction. The process is automated using a PC-based tool known as ANGEL. The method is validated on nine different industrial datasets (a total of 275 projects) and in all cases analogy outperforms algorithmic models based upon stepwise regression. From this work we argue that estimation by analogy is a viable technique that, at the very least, can be used by project managers to complement current estimation techniques.

Index Terms—Effort prediction, estimation process, empirical investigation, analogy, case-based reasoning.
1 INTRODUCTION
An important aspect of any software development project is to know how much it will cost. In most cases the major cost factor is labor. For this reason estimating development effort is central to the management and control of a software project. A fundamental question that needs to be asked of any estimation method is how accurate are the predictions. Accuracy is usually defined in terms of mean magnitude of relative error (MMRE) [6], which is the mean of absolute percentage errors:

    MMRE = (100/n) × Σ_{i=1..n} |E_i − Ê_i| / E_i    (1)

where there are n projects, E is the actual effort and Ê is the predicted effort. There has been some criticism of this measure, in particular that it is unbalanced and penalizes overestimates more than underestimates. For this reason Miyazaki et al. [19] propose a balanced mean magnitude of relative error measure as follows:

    MMRE_bal = (100/n) × Σ_{i=1..n} |E_i − Ê_i| / min(E_i, Ê_i)    (2)

This approach has been criticized by Hughes [8], among others, as effectively being two distinct measures that should not be combined. Other workers have used the adjusted R squared or coefficient of determination to indicate the percentage of variation in the dependent variable that can be "explained" in terms of the independent variables. Unfortunately, this is not always an adequate indicator of prediction quality where there are outlier or extreme values. Yet another approach is to use Pred(25), which is the percentage of predictions that fall within 25 percent of the actual value. Clearly the choice of accuracy measure to a large extent depends upon the objectives of those using the prediction system. For example, MMRE is fairly conservative with a bias against overestimates, while Pred(25) will identify those prediction systems that are generally accurate but occasionally wildly inaccurate. In this paper we have decided to adopt MMRE and Pred(25) as prediction performance indicators since these are widely used, thereby rendering our results more comparable with those of other workers.
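Both indicators can be computed as in the short sketch below (an illustration assuming numpy, not the authors' code).

```python
import numpy as np

def mmre(actual, predicted):
    """Mean magnitude of relative error as a percentage, per Eq. (1)."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return 100.0 * np.mean(np.abs(actual - predicted) / actual)

def pred(actual, predicted, level=25):
    """Pred(level): percentage of estimates within level% of actual."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    hits = np.abs(actual - predicted) / actual <= level / 100.0
    return 100.0 * np.mean(hits)

print(mmre([100, 200], [150, 190]), pred([100, 200], [150, 190]))
```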
The remainder of this paper reviews work to date in the field of effort prediction (both algorithmic and nonalgorithmic) before going on to describe an alternative approach to effort prediction based upon the use of analogy. Results from this approach are compared with traditional statistical methods using nine datasets. The paper then discusses the results of a sensitivity analysis of the analogy method. An estimation process is then presented. The paper concludes by discussing the strengths and limitations of analogy as a means of predicting software project effort.

2 A BRIEF HISTORY OF EFFORT PREDICTION
Over the past two decades there has been considerable activity in the area of effort prediction with most approaches being typified as being algorithmic in nature. Well known examples include COCOMO [4] and function points [2].¹ Whatever the exact niceties of the model, the general form tends to be:

M. Shepperd and C. Schofield are with the Department of Computing, Bournemouth University, Talbot Campus, Poole, BH12 5BB United Kingdom.
1. We include function points as an algorithmic method since they are dimensionless and therefore need to be calibrated in order to estimate effort.
    E = aS^b    (3)

where E is effort, S is size typically measured as lines of code (LOC) or function points, a is a productivity parameter and b is an economies or diseconomies of scale parameter. COCOMO represents an approach that could be regarded as "off the shelf." Here the estimator hopes that the equations contained in the cost model adequately represent their development environment and that any variations can be satisfactorily accounted for in terms of cost drivers or parameters built into the model. For instance, COCOMO has 15 such drivers. Unfortunately, there is considerable evidence that this off the shelf approach is not always very successful. Kemerer [12] reports average errors (in terms of the difference between predicted and actual project effort) of over 600 percent in his independent study of COCOMO. Other independent studies [14], [18] have also reported high error rates.

Another algorithmic approach is to calibrate a model by estimating values for the parameters (a and b in the case of (3)). However, the most straightforward method is to assume a linear model, that is, set b to unity, and then use regression analysis to estimate the slope (parameter a) and possibly introduce an intercept so the model becomes:

    E = a1 + a2·S    (4)

so that a1 represents fixed development costs (for example, regression testing will consume a fixed amount of effort irrespective of the size of the software) and a2 represents productivity. Kok et al. [15] describe how this approach has been successfully utilized on the Esprit MERMAID Project. Function points [2] are also often calibrated to local environments in order to convert size in function points to predicted effort. Again, as with COCOMO, quite mixed results have been reported [9], [10], [12], [17]. Kitchenham and Kansala [13] also note that better results can be obtained through disaggregating the components of function points and using stepwise regression to reestimate weights and determine the significant components.
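A minimal sketch of the local calibration in (4) follows; the size and effort values are invented purely for illustration and are not drawn from any dataset discussed here.

```python
import numpy as np

# Completed projects: size (e.g., function points) and effort in
# person months; the values below are invented for illustration.
size = np.array([120.0, 340.0, 55.0, 410.0])
effort = np.array([14.0, 41.5, 8.0, 52.0])

a2, a1 = np.polyfit(size, effort, 1)   # slope (productivity), intercept
estimate = lambda s: a1 + a2 * s       # Eq. (4): E = a1 + a2 * S
print(estimate(200.0))
```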
Although most research into project effort estimation has adopted an algorithmic approach, there has been limited exploration of machine learning or nonalgorithmic methods. For example, Karunanithi et al. [11] report the use of neural nets for predicting software reliability, and conclude that both feed forward and Jordan networks with a cascade correlation learning algorithm out perform traditional statistical models. More recently Wittig and Finnie [28] describe their use of back propagation learning algorithms on a multilayer perceptron in order to predict development effort. An overall error rate (MMRE) of 29 percent was obtained, which compares favorably with other methods. However, it must be stressed that the datasets were large (81 and 136 projects, respectively) and that only a very small number of projects were withdrawn for validation purposes. Some outliers also appear to have been removed. This tends to confirm the findings from Serluca [25] that neural nets seem to require large training sets in order to give good predictions.

Another study by Samson et al. [24] uses an Albus multilayer perceptron in order to predict software effort. In this instance they use Boehm's COCOMO dataset. The work compares linear regression with a neural net approach using the COCOMO dataset. Both approaches seem to perform badly, with MMREs of 520.7 and 428.1 percent, respectively.

Srinivasan and Fisher [27] also report on the use of a neural net with a back propagation learning algorithm. They found that the neural net outperformed other techniques and gave results of MMRE = 70 percent. However, it is unclear exactly how the dataset was divided up for training and validation purposes. Unfortunately, they also found that the results were sensitive to the number of hidden units and layers. Results to date suggest that accuracy is sensitive to decisions regarding the topology of the net, the number of learning epochs and the initial random weights of the neurons within the net. In addition, there is little explanation value in a neural net, that is, such models do not help us understand software project development.

There have been a number of attempts to use regression and decision trees to predict aspects of software engineering. Srinivasan and Fisher [27] describe the use of a regression tree to predict effort using the Kemerer dataset [12]. They found that although it outperformed COCOMO and SLIM, the results were less good than using either a statistical model derived from function points or a neural net. Briand et al. [5] obtained rather better results (MMRE = 94 percent) from their tree induction analysis. In this case they used a combination of the Kemerer and COCOMO datasets. Porter and Selby [21], [22] describe the use of decision or classification trees in predicting aspects of the software development process. Results from this approach seem to be quite mixed and, as with the neural net approach, results are sensitive to the characteristics of the training data.

3 ANALOGY

Estimation by analogy is a form of CBR. Cases are defined as abstractions of events that are limited in time and space. For effort estimation, analogy offers some distinct advantages:
• It avoids the problems associated both with knowledge elicitation and with extracting and codifying the knowledge.
• Analogy-based systems only need deal with those problems that actually occur in practice, while generative (i.e., algorithmic) systems must handle all possible problems.
• Analogy-based systems can also handle failed cases (i.e., those cases for which an accurate prediction was not made). This is useful as it enables users to identify potentially high-risk situations.
• Analogy is able to deal with poorly understood domains (such as software projects) since solutions are based upon what has actually happened, as opposed to chains of rules in the case of rule based systems.
• Users may be more willing to accept solutions from analogy based systems since they are derived from a form of reasoning more akin to human problem solving, as opposed to the somewhat arcane chains of rules or neural nets. This final advantage is particularly important if systems are to be not only deployed but also have reliance placed upon them.

The key activities for estimating by analogy are the identification of a problem as a new case, the retrieval of similar cases from a repository, the reuse of knowledge derived from previous cases, and the suggestion of a solution for the new case. This solution may be revised in the light of actual events and the outcome retained to augment the repository of completed cases. This approach to prediction poses two problems. First, how do we characterize cases? Second, how do we retrieve similar cases, indeed how do we measure similarity?

Characterization of cases is largely a pragmatic issue of what information is available. Variables can be continuous (i.e., interval, ratio or absolute scale measures) or categorical (i.e., nominal or ordinal measures). When designing a new CBR system, experts should be consulted to try to establish those features of a case that are believed to be significant in determining similarity, or otherwise, of cases. Rich and Knight [23] describe the problem of choosing insufficiently general features. Again the solution appears to be to use an expert.

Assessing similarity is the other problem. There are a variety of approaches, including a number of preference heuristics proposed by Kolodner [16]:

• Nearest neighbor algorithms. These are the most popular and are either based upon straightforward distance measures or the sum of squares of the differences for each variable. In either case each variable must first be standardized (so that it has an equal influence) and then weighted according to the degree of importance attached to the feature. A common algorithm is given by Aha [1]:

      Similarity(C1, C2, P) = 1 / sqrt( Σ_{j∈P} Feature_dissimilarity(C1j, C2j) )

  where P is the set of n features, C1 and C2 are cases, and Feature_dissimilarity(C1j, C2j) is (C1j − C2j)² where 1) the features are numeric, 0 where 2) the features are categorical and C1j = C2j, or 1 where 3) the features are categorical and C1j ≠ C2j.
• Manually guided induction. Here an expert manually identifies key features, although this reduces some of the advantages of using a CBR system in that an expert is required.
• Template retrieval. This functions in a similar fashion to query by example database interfaces, that is, the user supplies values or ranges, and all cases that match are retrieved.
• Goal directed preference. Select cases that have the same goal as the current case.
• Specificity preference. Select cases that match features exactly over those that match generally.
• Frequency preference. Select cases that are most frequently retrieved.
• Recency preference. Choose recently matched cases over those that have not been matched for a period of time.
• Fuzzy similarity. Where concepts such as at-least-as-similar and just-noticeable-difference are employed.

The similarity measures suffer from a number of disadvantages. First, they tend to be computationally intensive, although Aha [1] has proposed a number of more efficient algorithms that are only marginally less accurate. However, efficiency is not an issue for project effort estimation as typically one is dealing with fewer than 100 cases. Second, the algorithms are intolerant of noise and of irrelevant features. One strategy to overcome this problem is to build in learning so that the algorithm learns the importance of the various features. Essentially, weights are increased for matching features for successful predictions and diminished for unsuccessful predictions. Third, symbolic or categorical features are problematic. Although there are several algorithms that have been proposed to accommodate such features, they are all fairly crude in that they adopt a Boolean approach: features match or fail to match with no middle ground. A fourth criticism of these similarity measures is that they fail to take into account information which can be derived from the structure of the data; thus, they are weak for higher order feature relationships such as one might expect to see exhibited in legal systems.

Our approach has been guided by the twin aims of expediency and simplicity. In essence we take a new project, one for which we wish to predict effort, and attempt to find other similar completed projects. Since these projects are completed, development effort will be known and can be used as a basis for estimating effort for the new project. Similarity is defined in terms of project features, such as number of interfaces, development method, application domain and so forth. Clearly the features used will depend upon what data is available to characterize projects. The number of features is also flexible. We have analyzed datasets with as few as one feature and as many as 29 features.
Features may be either categorical or continuous. Similarity, defined as proximity in n-dimensional space (where each dimension corresponds to a different feature), is most intuitively appealing, hence we use unweighted Euclidean distance. The most similar projects will be closest to each other. Note that each dimension is standardized (between 0 and 1) so that it has the same degree of influence and the method is immune to the choice of units. Moreover, the notion of distance gives an indication of the degree of similarity. Once the analogous projects have been found, the known effort can be used in a variety of ways. We use the weighted or unweighted average of up to three analogies. No one approach is consistently more accurate, so the decision requires a certain amount of experimentation on the part of the estimators. Because of the small datasets, we cope with noise (that is, unhelpful features that do not aid in the process of finding good analogies) by means of an exhaustive search of all possible subsets of the project features so as to obtain the optimum predictions for projects with known effort. The whole method, from storing analogies through eliminating redundant features to finding analogies, is automated by a PC-based software tool known as ANGEL (ANaloGy Estimation tool).² A fuller description is to be found in Shepperd et al. [26].

2. The authors are happy to provide a simple version of ANGEL at no cost.
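The estimation step as just described can be sketched as follows: standardize each dimension to [0, 1], rank completed projects by unweighted Euclidean distance, and average the effort of up to three analogies. This is a minimal illustration of the stated method, not the ANGEL implementation, and the feature vectors are invented.

```python
import math

def standardize(rows):
    """Rescale every numeric feature into [0, 1] so that each
    dimension carries equal weight, as described above."""
    lows = [min(col) for col in zip(*rows)]
    highs = [max(col) for col in zip(*rows)]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)] for row in rows]

def estimate_by_analogy(new_project, completed, efforts, k=3):
    """Unweighted mean effort of the k nearest completed projects,
    using unweighted Euclidean distance over standardized features."""
    rows = standardize(completed + [new_project])
    target, rest = rows[-1], rows[:-1]
    distance = lambda r: math.sqrt(sum((a - b) ** 2
                                       for a, b in zip(r, target)))
    nearest = sorted(range(len(rest)), key=lambda i: distance(rest[i]))[:k]
    return sum(efforts[i] for i in nearest) / len(nearest)

# Invented feature vectors (e.g., interfaces, size of requirements doc):
completed = [[10.0, 120.0], [14.0, 300.0], [6.0, 80.0], [11.0, 150.0]]
efforts = [18.0, 44.0, 9.0, 21.0]
print(estimate_by_analogy([12.0, 140.0], completed, efforts))
```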
4 COMPARING ESTIMATION BY ANALOGY WITH REGRESSION MODELS

Next, we compared the accuracy of software project effort prediction using analogy with an algorithmic approach based upon equations derived through stepwise regression analysis. Table 1 summarizes the datasets that were used for our comparison of analogy-based estimation with stepwise regression. As can be seen from the table, the datasets are quite diverse and are drawn from many different application domains, ranging from telecommunications to commercial information systems. All the data was taken from industrial projects; that is, no academic or student projects are included. The projects range in size from a few person-months to over 1,000 person-months. It is also important to stress that none of the data was collected with estimation by analogy in mind; instead we were able to exploit data that was already available. The final point is that we only utilized information that would be available at the time the prediction would be made, so we avoided project features such as LOC. This is important if we wish to avoid creating a false impression as to the efficacy of different prediction methods.

Table 2 shows the accuracy of the respective methods using the MMRE and Pred(25) values. A jack-knifing procedure was adopted for the analogy-based predictions, since this could be automated in the ANGEL tool; the regression models were generated using the entire dataset. This means the results are likely to be biased in favor of the regression models. Note that we use two slightly different regression analysis techniques. Both regression 1 and 2 use stepwise regression; however, regression 1 restricts the procedure to the three variables most highly correlated with the dependent variable (i.e., effort). Not surprisingly, the results are in general similar; however, occasional differences are due to the fact that the regression procedure attempts to minimize the sum of the squares of the residuals, whereas MMRE is based upon the mean of the sum of the unsquared residuals. Each dataset is treated separately, since each one has different project features available and therefore we are not able to merge all the data into a single all-encompassing dataset. This is appropriate, since it is unlikely that an organization would have access to such large volumes of data and there seems some merit in estimating using smaller, more homogenous datasets, a point we will return to.

From Table 2 we see that for all datasets the MMRE performance of estimating by analogy is better than that of the regression methods. This suggests that analogy is capable of yielding more accurate predictions, at least for these datasets. An interesting problem occurs for the Real-time 1 dataset. Here it was not possible to develop an algorithmic model or to use regression analysis, since the dataset comprises only categorical data, with the exception of actual project effort; indeed, the dataset was very sparse and was made up of only three distinguishing project features. Yet even in these highly unpropitious circumstances the analogy method was able to yield a predictive accuracy of 74 percent. This is indicative of the possibility of being able to use analogy-based estimation at an extremely early stage of a project, when other estimation techniques may not be possible, for the reason that analogy does not require quantitative data. Similarly, an accuracy of 39 percent was obtained for the dataset Telecom 1, despite the fact that only a single distinguishing feature was available. Again, stepwise regression only achieves a result of MMRE = 86 percent by method 1 or 2.

The Pred(25) results from Table 2 are slightly more mixed. Recall that, unlike MMRE, a higher score implies better predictive accuracy. Two datasets (Atkinson and Desharnais) yield a higher Pred(25) score for the regression model. In general, the results are closer than for the MMRE analysis. One explanation lies in the fact that the ANGEL tool explicitly tries to optimize the MMRE result, so that it is not surprising that it performs best in terms of this indicator. A second explanation lies in the fact that MMRE and Pred(25) are assessing slightly different characteristics of a prediction system. MMRE is conservative and looks at the mean absolute percentage error, whereas Pred(25) is optimistic and focuses upon the best predictions (i.e., those within 25 percent of actual) and ignores all other predictions. The choice of indicator to some extent depends upon the objectives of the user. Nevertheless, the overall picture suggests that estimation by analogy tends to be the more accurate prediction method.

2. The authors are happy to provide a simple version of ANGEL at no cost. The zip files may be downloaded from http://xanadu.bournemouth.ac.uk/ComputingResearch/ChrisSchofield/Angel/AngelPage.html
3. Jack-knifing is a validation technique whereby each case is removed from the dataset and the remainder of the cases are used to predict the removed case. The case is then returned to the dataset and the next case removed. This procedure is repeated until all cases have been covered.
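For concreteness, the two accuracy indicators and the jack-knifing loop can be sketched as follows. This is our illustration of the standard definitions, with both indicators expressed as percentages; predict_effort(project, others, features) is an assumed analogy-based estimator (one is sketched later, in Section 6):

def mmre(pairs):
    # Mean magnitude of relative error, as a percentage, over
    # (actual, predicted) effort pairs.
    return 100.0 * sum(abs(a - p) / a for a, p in pairs) / len(pairs)

def pred25(pairs):
    # Percentage of predictions falling within 25 percent of actual.
    hits = sum(abs(a - p) / a <= 0.25 for a, p in pairs)
    return 100.0 * hits / len(pairs)

def jackknifed_mmre(projects, features):
    # Hold out each project in turn and predict it from the rest,
    # using the assumed predict_effort helper.
    pairs = []
    for i, project in enumerate(projects):
        others = projects[:i] + projects[i + 1:]
        pairs.append((project["effort"],
                      predict_effort(project, others, features)))
    return mmre(pairs)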
TABLE 1
DATASETS USED TO COMPARE EFFORT PREDICTION METHODS

Name        | Source                                                                          | n  | Features | Description
Albrecht    | [2]                                                                             | 24 | 5        | IBM DP Services projects
Atkinson    | [3]                                                                             | 21 | 12       | Builds to a large telecommunications product at U.K. company X
Desharnais  | [7]                                                                             | 77 | 9        | Canadian software house—commercial projects
Finnish     | Dataset made available to the ESPRIT Mermaid Project by the TIEKE organization  | 38 | 29       | Data collected by the TIEKE organization from IS projects from nine different Finnish companies
Kemerer     | [12]                                                                            | 15 | 2        | Large business applications
Mermaid     | MM2 dataset, made available to the ESPRIT Mermaid Project anonymously           | 28 | 17       | New and enhancement projects
Real-time 1 | not in public domain                                                            | 21 | 3        | Real-time projects at U.K. company Z
Telecom 1   | Appendix A                                                                      | 18 | 1        | Enhancements to a U.K. telecommunication product
Telecom 2   | not in public domain                                                            | 33 | 13       | Telecommunication product at U.K. company Y
TABLE 2
RELATIVE ACCURACY LEVELS OF EFFORT ESTIMATION FOR ANALOGY AND REGRESSION
(The body of this table, giving per-dataset Analogy and Regression MMRE and Pred(25) values, is not legible in this reproduction; the values discussed in the text are taken from it.)
TABLE 3
RELATIVE ACCURACY LEVELS OF HOMOGENIZED DATASETS

Dataset      | Analogy (MMRE) (%) | Regression 1 (MMRE) (%) | Regression 2 (MMRE) (%) | Analogy (Pred 25) (%) | Regression 1 (Pred 25) (%) | Regression 2 (Pred 25) (%)
Desharnais 1 | 37                 | 41                      | 41                      | 47                    | 45                         | 45
Desharnais 2 | 29                 | 29                      | 29                      | 47                    | 48                         | 48
Desharnais 3 | 26                 | 36                      | 49                      | 70                    | 30                         | 50
Mermaid E    | 53                 | 62                      | 62                      | 39                    | 27                         | 27
Mermaid N    | 60                 | -                       | -                       | 25                    | -                          | -
In general, the best results seem to be achieved where the data is drawn from many builds or enhancements to an existing system, for example the Atkinson, Telecom 1, and Telecom 2 datasets. The poorest results occur when the data is drawn from a wide range of projects from more than one organisation, such as the Mermaid dataset. This tendency appears to be true for both analogy and regression analysis. Table 3 shows the results of dividing the Desharnais and Mermaid datasets into more homogenous subsets. The Desharnais dataset is divided on the basis of differing development environments. The Mermaid data is divided into enhancement (E) and new (N) projects. We observe that this division leads to enhanced accuracy for all estimation methods. Overall, analogy has equal or superior performance to regression-based prediction for seven out of eight comparisons, the only exception being the Desharnais 2
dataset, which reveals fractionally superior performance for regression-based prediction when using the Pred(25) indicator. The Mermaid N dataset is particularly interesting, as it shows a dataset for which no statistically significant relationships could be found between any of the independent variables and effort; hence no statistically significant regression equation can be derived. By contrast, the analogy method is able to produce an overall estimation accuracy of MMRE = 60 percent. Finally, we note that the procedure to search for optimum subsets of features for predicting effort reduced the set of features for every dataset studied excepting, of course, Telecom 1, which only had a single feature in the first place. This procedure has a significant impact upon the levels of accuracy that we were able to obtain.

4. In a previous paper [26] we reported an accuracy level of MMRE = 62 percent. The improvement is due to the use of additional project features with which to find analogies that were not utilized during our earlier work.
5 SENSITIVITY ANALYSIS

An important question to ask about any prediction method is how sensitive it is to any peculiar characteristics of the data and how it will behave over time. All the datasets we studied were historical in the sense that they described completed
projects and we conducted the analysis after the event. This section explores the dynamic behavior of effort prediction by simulating the growth of a dataset over time. This enables us to answer questions such as how many data points are needed for estimation by analogy to be viable, and how stable the results are (in other words, are the accuracy levels vulnerable to the addition of a single rogue project?).

Figs. 1 and 2 show the trends in estimation accuracy as the datasets grow. The Albrecht dataset (Fig. 1) was selected as an example of a dataset for which a comparatively low level of accuracy was achieved; in contrast, the Telecom 2 dataset (Fig. 2) showed the highest level of accuracy. The procedure was to randomly number the projects from 1 to n (where n is the number of projects in the dataset). Projects are added to the dataset one at a time, in the random number order. Thus, the dataset grows until all projects are added. The optimum subset of features was recalculated as each new project was added. This involved, for each partial dataset (starting from two projects), jack-knifing the dataset by holding out each project, one at a time, and using the remaining projects to predict effort. The average absolute prediction error for all projects contained in the partial dataset gives the MMRE of that partial dataset. This procedure was repeated three times for each dataset (hence, A1, A2, and A3 and T1, T2, and T3).

Fig. 1. Estimation accuracy over time (Albrecht dataset). (Figure not reproduced.)
Fig. 2. Estimation accuracy over time (Telecom 2 dataset). (Figure not reproduced.)

Overall, Figs. 1 and 2 show that the MMRE decreases as the size of the dataset grows. There is a tendency for the MMRE to start to stabilize at approximately 10 projects, which suggests that estimation by analogy may be a high
risk technique at below this number of projects. The Telecom 2 dataset shows little improvement beyond 15 projects. On this theme, it is interesting to note that, overall, it is not the largest datasets, such as the Desharnais dataset, that have the lowest MMREs; clearly other factors, over and above size, such as homogeneity, also have an impact.

An interesting feature of Fig. 1 is the sharp rise in the MMRE values that occurs after 10 projects have been added for random sequence A1 and 16 added for random sequence A2. Further investigation reveals that both of these anomalies are linked to the introduction of the same project. The project is third in sequence A3, when predictions are still very poor. This suggests that the results from estimating by analogy, like regression, can be influenced by outlying projects. However, A2 demonstrates that the effect of a rogue project is ameliorated as the size of the dataset increases. Superficially there appears to be a similar effect in Fig. 2 for sequences T1 and T3 and projects 4 and 7, respectively. In this case, however, the peaks are caused by different projects and the most likely explanation is the vulnerability of finding analogous cases from very small datasets.
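The dataset-growth simulation just described can be sketched as follows (illustrative only; jackknifed_mmre is the helper sketched in Section 4, and the per-step recalculation of the optimum feature subset is omitted for brevity):

import random

def growth_curve(projects, features, seed=0):
    # Add projects one at a time in a random order and record the
    # jack-knifed MMRE of each partial dataset (starting from two).
    order = projects[:]
    random.Random(seed).shuffle(order)
    curve = []
    for size in range(2, len(order) + 1):
        partial = order[:size]
        curve.append((size, jackknifed_mmre(partial, features)))
    return curve

# Repeating with seeds 0, 1, 2 gives three sequences per dataset,
# analogous to A1-A3 and T1-T3 in Figs. 1 and 2.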
6 AN ESTIMATION PROCESS
This section considers how estimation by analogy can be introduced into software development organizations. The following are the main stages in setting up an estimation by analogy program:

• identify the data or features to collect
• agree data definitions and collection mechanisms
• populate the case base
• tune the estimation method
• estimate the effort for a new project

The first stage, that of identifying what data to collect, will be very dependent upon the nature of the projects for which estimates are required. Because of these variations, our software tool ANGEL is designed to be very flexible in the data that is used to characterize analogies, and the user is able to define a template describing the data that will be supplied. Factors to be taken into account include beliefs as to what features significantly impact development effort (and are measurable at the time the estimate is required) and what features can easily be collected. There is little sense in identifying huge numbers of variables that cannot be easily or reliably collected in practice. Estimation by analogy can cope with both continuous and categorical data, although categorical data has to be held as binary values. For instance, programming language would be represented as a series of truth-valued variables, e.g., COBOL, 4GL, C++, etc. The reason for this is that the similarity measure treats categorical features as either being the same or different: there are no degrees of difference (a sketch of this encoding appears below).

The second stage is to agree definitions as to what is being collected. Even within an organization there may be no shared understanding of what is meant by effort. Any estimation program will be flawed, possibly fatally, if different projects are measuring the same features in different ways. It is also important to identify who is responsible for the data collection and when they should collect the data.
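A minimal sketch of this binary encoding, using hypothetical feature names rather than the actual ANGEL template format:

def encode_categorical(project, feature, categories):
    # Expand one categorical feature into one truth-valued variable per
    # category, so the similarity measure sees only match/mismatch.
    encoded = dict(project)
    value = encoded.pop(feature)
    for category in categories:
        encoded[f"{feature}_{category}"] = (value == category)
    return encoded

p = {"language": "COBOL", "interfaces": 12}
print(encode_categorical(p, "language", ["COBOL", "4GL", "C++"]))
# {'interfaces': 12, 'language_COBOL': True, 'language_4GL': False,
#  'language_C++': False}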
Sometimes it can be beneficial to have the same person collecting the data across projects in order to increase the level of consistency.

Next, the case base must be populated. Like all estimation methods, other than inspired guesswork, analogy requires some data collection. Our experience suggests that a minimum of 10-12 projects is required in order to provide a stable basis for estimation. In general, more data is preferable although, in most cases, data collection will be an on-going process as projects are completed and their effort data becomes available. However, there appear to exist some trade-offs between the size of the dataset and homogeneity. Again, our experience suggests there is merit in the strategy of dividing highly distinct projects into separate datasets. Often this separation is quite straightforward, using such distinguishing features as application type or development site.

The penultimate stage is to tune the estimation method. The user will also need to experiment with the optimum number of analogies searched for, and whether to use a subset of variables, since some features may not usefully contribute to the process of finding effective analogies. Tuning can make quite a difference to the quality of predictions (typically it can yield a twofold improvement in performance) and for this reason the ANGEL tool provides automated support for this process.

The last stage is to estimate for a new project. It must be possible to characterise the project in terms of the variables that have been identified at the first stage of the estimation process. From these variables, ANGEL can be used to find similar projects, and the user can make a subjective judgment as to the value of the analogies. Where they are believed to be trustworthy, the prediction can be relied on to a greater extent than where they are thought to be doubtful. Here we wish to sound a note of caution. The value of estimation by analogy as an independent source of prediction will be somewhat reduced if the users discount values that are not consistent with their prior beliefs, and for this reason there was no expert intervention or manipulation in any of the foregoing analysis. Another indicator of likely prediction quality is the average MMRE figure obtained through jack-knifing the dataset. Again, a low figure will indicate more confidence than a high figure.
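The estimation step itself can be sketched as follows, reusing the similarity function sketched earlier; both the unweighted and the similarity-weighted average of up to k = 3 analogies are shown because, as noted in Section 3, neither variant is consistently more accurate:

def estimate_effort(new_project, case_base, features, k=3, weighted=False):
    # Rank completed projects by similarity and keep the top k analogies.
    analogies = sorted(case_base,
                       key=lambda c: similarity(new_project, c, features),
                       reverse=True)[:k]
    if not weighted:
        return sum(c["effort"] for c in analogies) / len(analogies)
    weights = [similarity(new_project, c, features) for c in analogies]
    return (sum(w * c["effort"] for w, c in zip(weights, analogies))
            / sum(weights))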
7 CONCLUSIONS

Accurate estimation of software project effort at an early stage in the development process is a significant challenge for the software engineering community. This paper has described a technique based upon the use of analogy, sometimes referred to as case-based reasoning. We have compared the use of analogy with prediction models based upon stepwise regression analysis for nine datasets, a total of 275 projects. A striking pattern emerges in that estimation by analogy produces a superior predictive performance in all cases when measured by MMRE, and in seven out of nine cases for the Pred(25) indicator. Moreover, estimation by analogy is able to operate in circumstances where it is not possible to generate an algorithmic model, such as the dataset Real-time 1, where all the data was categorical in nature, or the Mermaid N dataset, where no statistically significant relationships could be found. We believe this type of situation may be quite common, particularly at a very early stage in a project, for example in response to an invitation to tender. This makes analogy an attractive method for producing very early estimates.

Estimation by analogy also offers an advantage in that it is a very intuitive method. There is some evidence to suggest that practitioners use analogies when making estimates by means of informal methods [8]. Our approach allows users to assess the reasoning process behind a prediction by identifying the most analogous projects, thereby increasing, or reducing, their confidence in the prediction.

Many experts have suggested that it is appropriate to use more than one method when predicting software development effort. We believe that estimation by analogy is a viable technique and can usefully contribute to this process. This is not to suggest that it is without weakness, but on the empirical evidence presented in this paper it is certainly worthy of further consideration.

APPENDIX A

(The appendix tabulates the Telecom 1 dataset: 18 enhancement projects with columns ACT, ACT_DEV, ACT_TST, CHNGS, and FILES. The numeric values are not legible in this reproduction.)

The above data is drawn from the dataset Telecom 1. ACT is actual effort, ACT_DEV and ACT_TST are actual development and testing effort, respectively. CHNGS is the number of changes made as recorded by the configuration management system, and FILES is the number of files changed by the particular enhancement project. Only FILES can be used for predictive purposes, since none of the other information would be available at the time of making the prediction.

ACKNOWLEDGMENTS

The authors are grateful to the Finnish TIEKE organization for granting the authors leave to use the Finnish dataset; to Barbara Kitchenham for supplying the Mermaid dataset; to Bob Hughes for supplying the dataset Telecom 2; and to anonymous staff for the provision of datasets Telecom 1 and Real-time 1. Many improvements have been suggested by Dan Diaper, Pat Dugard, Bob Hughes, Barbara Kitchenham, Steve MacDonell, Austen Rainer, and Bill Samson. This work has been supported by British Telecom, the U.K. Engineering and Physical Sciences Research Council under Grant GR/L37298, and the Defence Research Agency.
REFERENCES

[1] D.W. Aha, "Case-Based Learning Algorithms," Proc. 1991 DARPA Case-Based Reasoning Workshop, Morgan Kaufmann, 1991.
[2] A.J. Albrecht and J.R. Gaffney, "Software Function, Source Lines of Code, and Development Effort Prediction: A Software Science Validation," IEEE Trans. Software Eng., vol. 9, no. 6, pp. 639-648, 1983.
[3] K. Atkinson and M.J. Shepperd, "The Use of Function Points to Find Cost Analogies," Proc. European Software Cost Modelling Meeting, Ivrea, Italy, 1994.
[4] B.W. Boehm, "Software Engineering Economics," IEEE Trans. Software Eng., vol. 10, no. 1, pp. 4-21, 1984.
[5] L.C. Briand, V.R. Basili, and W.M. Thomas, "A Pattern Recognition Approach for Software Engineering Data Analysis," IEEE Trans. Software Eng., vol. 18, no. 11, pp. 931-942, 1992.
[6] S. Conte, H. Dunsmore, and V.Y. Shen, Software Engineering Metrics and Models. Menlo Park, Calif.: Benjamin Cummings, 1986.
[7] J.M. Desharnais, "Analyse statistique de la productivité des projets informatiques à partir de la technique des points de fonction," masters thesis, Univ. of Montreal, 1989.
[8] R.T. Hughes, "Expert Judgement as an Estimating Method," Information and Software Technology, vol. 38, no. 2, pp. 67-75, 1996.
[9] D.R. Jeffery, G.C. Low, and M. Barnes, "A Comparison of Function Point Counting Techniques," IEEE Trans. Software Eng., vol. 19, no. 5, pp. 529-532, 1993.
[10] R. Jeffery and J. Stathis, "Specification Based Software Sizing: An Empirical Investigation of Function Metrics," Proc. NASA Goddard Software Eng. Workshop, Greenbelt, Md., 1993.
[11] N. Karunanithi, D. Whitley, and Y.K. Malaiya, "Using Neural Networks in Reliability Prediction," IEEE Software, vol. 9, no. 4, pp. 53-59, 1992.
[12] C.F. Kemerer, "An Empirical Validation of Software Cost Estimation Models," Comm. ACM, vol. 30, no. 5, pp. 416-429, 1987.
[13] B.A. Kitchenham and K. Kansala, "Inter-Item Correlations among Function Points," Proc. First Int'l Symp. Software Metrics, Baltimore, Md.: IEEE CS Press, 1993.
[14] B.A. Kitchenham and N.R. Taylor, "Software Cost Models," ICL Technology J., vol. 4, no. 3, pp. 73-102, 1984.
[15] P. Kok, B.A. Kitchenham, and J. Kirakowski, "The MERMAID Approach to Software Cost Estimation," Proc. ESPRIT Technical Week, 1990.
[16] J.L. Kolodner, Case-Based Reasoning. Morgan Kaufmann, 1993.
[17] J.E. Matson, B.E. Barrett, and J.M. Mellichamp, "Software Development Cost Estimation Using Function Points," IEEE Trans. Software Eng., vol. 20, no. 4, pp. 275-287, 1994.
[18] Y. Miyazaki and K. Mori, "COCOMO Evaluation and Tailoring," Proc. Eighth Int'l Software Eng. Conf., London: IEEE CS Press, 1985.
[19] Y. Miyazaki et al., "Method to Estimate Parameter Values in Software Prediction Models," Information and Software Technology, vol. 33, no. 3, pp. 239-243, 1991.
[20] T. Mukhopadhyay, S.S. Vicinanza, and M.J. Prietula, "Examining the Feasibility of a Case-Based Reasoning Model for Software Effort Estimation," MIS Quarterly, vol. 16, pp. 155-171, June 1992.
[21] A. Porter and R. Selby, "Empirically Guided Software Development Using Metric-Based Classification Trees," IEEE Software, no. 7, pp. 46-54, 1990.
[22] A. Porter and R. Selby, "Evaluating Techniques for Generating Metric-Based Classification Trees," J. Systems Software, vol. 12, pp. 209-218, 1990.
[23] E. Rich and K. Knight, Artificial Intelligence, second edition. McGraw-Hill, 1995.
[24] B. Samson, D. Ellison, and P. Dugard, "Software Cost Estimation Using an Albus Perceptron (CMAC)," Information and Software Technology, vol. 39, nos. 1/2, 1997.
[25] C. Serluca, "An Investigation into Software Effort Estimation Using a Back Propagation Neural Network," MSc dissertation, Bournemouth Univ., 1995.
[26] M.J. Shepperd, C. Schofield, and B.A. Kitchenham, "Effort Estimation Using Analogy," Proc. 18th Int'l Conf. Software Eng., Berlin: IEEE CS Press, 1996.
[27] K. Srinivasan and D. Fisher, "Machine Learning Approaches to Estimating Development Effort," IEEE Trans. Software Eng., vol. 21, no. 2, pp. 126-137, 1995.
[28] G.E. Wittig and G.R. Finnie, "Using Artificial Neural Networks and Function Points to Estimate 4GL Software Development Effort," Australian J. Information Systems, vol. 1, no. 2, pp. 87-94, 1994.

Martin Shepperd received a BSc degree (honors) in economics from Exeter University, an MSc degree from Aston University, and the PhD degree from the Open University, the latter two in computer science. He has a chair in software engineering at Bournemouth University. Professor Shepperd has written three books and published more than 50 papers in the areas of software metrics and process modeling.

Chris Schofield received a BSc degree (honors) in software engineering management from Bournemouth University, where he is presently studying for his PhD. His research interests include software metrics and cost estimation.
A Critique of Software Defect Prediction Models

Norman E. Fenton, Member, IEEE Computer Society, and Martin Neil, Member, IEEE Computer Society

Abstract—Many organizations want to predict the number of defects (faults) in software systems, before they are deployed, to gauge the likely delivered quality and maintenance effort. To help in this, numerous software metrics and statistical models have been developed, with a correspondingly large literature. We provide a critical review of this literature and the state-of-the-art. Most of the wide range of prediction models use size and complexity metrics to predict defects. Others are based on testing data, the "quality" of the development process, or take a multivariate approach. The authors of the models have often made heroic contributions to a subject otherwise bereft of empirical studies. However, there are a number of serious theoretical and practical problems in many studies. The models are weak because of their inability to cope with the, as yet, unknown relationship between defects and failures. There are fundamental statistical and data quality problems that undermine model validity. More significantly, many prediction models tend to model only part of the underlying problem and seriously misspecify it. To illustrate these points the "Goldilock's Conjecture," that there is an optimum module size, is used to show the considerable problems inherent in current defect prediction approaches. Careful and considered analysis of past and new results shows that the conjecture lacks support and that some models are misleading. We recommend holistic models for software defect prediction, using Bayesian Belief Networks, as alternative approaches to the single-issue models used at present. We also argue for research into a theory of "software decomposition" in order to test hypotheses about defect introduction and help construct a better science of software engineering.

Index Terms—Software faults and failures, defects, complexity metrics, fault-density, Bayesian Belief Networks.
1 INTRODUCTION
Organizations are still asking how they can predict the quality of their software before it is used, despite the substantial research effort spent attempting to find an answer to this question over the last 30 years. There are many papers advocating statistical models and metrics which purport to answer the quality question. Defects, like quality, can be defined in many different ways but are more commonly defined as deviations from specifications or expectations which might lead to failures in operation.

Generally, efforts have tended to concentrate on the following three problem perspectives [1], [2], [3]:

1) predicting the number of defects in the system;
2) estimating the reliability of the system in terms of time to failure;
3) understanding the impact of design and testing processes on defect counts and failure densities.

A wide range of prediction models have been proposed. Complexity and size metrics have been used in an attempt to predict the number of defects a system will reveal in operation or testing. Reliability models have been developed to predict failure rates based on the expected operational usage profile of the system. Information from defect detection and the testing process has been used to predict defects. The maturity of design and testing processes has been advanced as a way of reducing defects. Recently, large complex multivariate statistical models have been produced in an attempt to find a single complexity metric that will account for defects.

This paper provides a critical review of this literature with the purpose of identifying future avenues of research. We cover complexity and size metrics (Section 2), the testing process (Section 3), the design and development process (Section 4), and recent multivariate studies (Section 5). For a comprehensive discussion of reliability models, see [4]. We uncover a number of theoretical and practical problems in these studies in Section 6, in particular the so-called "Goldilock's Conjecture." Despite the many efforts to predict defects, there appears to be little consensus on what the constituent elements of the problem really are. In Section 7, we suggest a way to improve the defect prediction situation by describing a prototype, Bayesian Belief Network (BBN) based, model which we feel can at least partly solve the problems identified. Finally, in Section 8 we record our conclusions.

• N.E. Fenton and M. Neil are with the Centre for Software Reliability, Northampton Square, London EC1V 0HB, England. E-mail: {n.fenton, martin}@csr.city.ac.uk.

Manuscript received 3 Sept. 1997; revised 25 Aug. 1998. Recommended for acceptance by R. Hamlet. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 105579.

2 PREDICTION USING SIZE AND COMPLEXITY METRICS

Most defect prediction studies are based on size and complexity metrics. The earliest such study appears to have been Akiyama's, [5], which was based on a system developed at Fujitsu. It is typical of regression based "data fitting" models which became commonplace in the literature. The study showed that linear models of some simple metrics provide reasonable estimates for the total number of defects D (the dependent variable), which is actually defined as the sum of the defects found during testing and the defects found during two months after release. Akiyama computed four regression equations. Akiyama's first equation (1) predicted defects from lines of code (LOC). From (1) it can be calculated that a 1,000 LOC (i.e., 1 KLOC) module is expected to have approximately 23 defects.
D = 4.86 + 0.018L (1)

Other equations had the following dependent metrics: number of decisions C; number of subroutine calls J; and a composite metric, C + J.

Another early study, by Ferdinand, [6], argued that the expected number of defects increases with the number n of code segments; a code segment is a sequence of executable statements which, once entered, must all be executed. Specifically, the theory asserts that for smaller numbers of segments the number of defects is proportional to a power of n; for larger numbers of segments the number of defects increases as a constant to the power n.

Halstead, [7], proposed a number of size metrics, which have been interpreted as "complexity" metrics, and used these as predictors of program defects. Most notably, Halstead asserted that the number of defects D in a program P is predicted by (2):

D = V / 3,000 (2)

where V is the (language dependent) volume metric (which, like all the Halstead metrics, is defined in terms of the number of unique operators and unique operands in P; for details see [8]). The divisor 3,000 represents the mean number of mental discriminations between decisions made by the programmer. Each such decision possibly results in error and thereby a residual defect. Thus, Halstead's model was, unlike Akiyama's, based on some kind of theory. Interestingly, Halstead himself validated (2) using Akiyama's data. Ottenstein, [9], obtained similar results to Halstead.

Lipow, [10], went much further, because he got round the problem of computing V directly in (2) by using lines of executable code L instead. Specifically, he used the Halstead theory to compute a series of equations of the form:

D/L = A0 + A1 ln L + A2 (ln L)² (3)

where each of the Ai is dependent on the average number of usages of operators and operands per LOC for a particular language. For example, for Fortran A0 = 0.0047, A1 = 0.0023, A2 = 0.000043; for an assembly language A0 = 0.0012, A1 = 0.0001, A2 = 0.000002.

Gaffney, [11], argued that the relationship between D and L was not language dependent. He used Lipow's own data to deduce the prediction (4):

D = 4.2 + 0.0015 L^(4/3) (4)

An interesting ramification of this was that there was an optimal size for individual modules with respect to defect density. For (4) this optimum module size is 877 LOC. Numerous other researchers have since reported on optimal module sizes. For example, Compton and Withrow of UNISYS derived the following polynomial equation, [12]:

D = 0.069 + 0.00156L + 0.00000047 L² (5)

Based on (5) and further analysis, Compton and Withrow concluded that the optimum size for an Ada module, with respect to minimizing error density, is 83 source statements. They dubbed this the "Goldilocks Principle," with the idea that there is an optimum module size that is "not too big nor too small."
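As a quick arithmetic check on the two headline figures above (our own calculation, not code from the paper), the following sketch reproduces Akiyama's roughly 23 defects per KLOC from (1) and recovers Gaffney's optimum module size by minimizing the defect density implied by (4):

# Akiyama's equation (1): defects expected in a 1,000 LOC module.
print(4.86 + 0.018 * 1000)             # 22.86, i.e., approximately 23

# Gaffney's equation (4) implies a defect density of
#   D/L = 4.2/L + 0.0015 * L**(1/3),
# whose derivative vanishes at L = (4.2 / 0.0005)**0.75.
def defect_density(loc):
    return 4.2 / loc + 0.0015 * loc ** (1.0 / 3.0)

optimum = (4.2 / 0.0005) ** 0.75
print(optimum)                          # ~877.5, matching the 877 LOC above

# Density really is higher on either side of the stationary point.
assert defect_density(optimum) < defect_density(optimum / 2)
assert defect_density(optimum) < defect_density(optimum * 2)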
The phenomenon that larger modules can have lower defect densities was confirmed in [13], [14], [15]. Basili and Perricone argued that this may be explained by the fact that there are a large number of interface defects distributed evenly across modules. Moller and Paulish suggested that larger modules tend to be developed more carefully; they discovered that modules consisting of more than 70 lines of code have similar defect densities, while for modules of size less than 70 lines of code the defect density increases significantly. Similar experiences are reported by [16], [17].

Hatton examined a number of data sets, [15], [18], and concluded that there was evidence of "macroscopic behavior" common to all data sets despite the massive internal complexity of each system studied, [19]. This behavior was likened to "molecules" in a gas and used to conjecture an entropy model for defects which also borrowed from ideas in cognitive psychology. Assuming that short-term memory affects the rate of human error, he developed a logarithmic model, made up of two parts, and fitted it to the data sets.¹ The first part modeled the effects of small modules on short-term memory, while the second modeled the effects of large modules. He asserted that, for module sizes above 200-400 lines of code, the human "memory cache" overflows and mistakes are made, leading to defects. For systems decomposed into smaller pieces than this cache limit, the human memory cache is used inefficiently, storing links between modules, thus also leading to more defects. He concluded that larger components are proportionally more reliable than smaller components. Clearly this would, if true, cast serious doubt over the theory of program decomposition, which is so central to software engineering.

1. There is nothing new here since Halstead [3] was one of the first to apply Miller's finding that people can only effectively recall seven plus or minus two items from their short-term memory. Likewise the construction of a partitioned model contrasting "small" module effects on faults and "large" module effects on faults was done by Compton and Withrow in 1990 [7].

The realization that size-based metrics alone are poor general predictors of defect density spurred on much research into more discriminating complexity metrics. McCabe's cyclomatic complexity, [20], has been used in many studies, but it too is essentially a size measure (being equal to the number of decisions plus one in most programs). Kitchenham et al., [21], examined the relationship between the changes experienced by two subsystems and a number of metrics, including McCabe's metric. Two different regression equations resulted, (6) and (7):

C = 0.042MCI - 0.075N + 0.00001HE (6)

C = 0.25MCI - 0.53DI + 0.09VG (7)

For the first subsystem, changes, C, was found to be reasonably dependent on machine code instructions, MCI, operator and operand totals, N, and Halstead's effort metric, HE. For the other subsystem, McCabe's complexity metric, VG, was found to partially explain C, along with machine code instructions, MCI, and data items, DI.

All of the metrics discussed so far are defined on code. There are now a large number of metrics available earlier in the life-cycle, most of which have been claimed by their proponents to have some predictive powers with respect
to residual defect density. For example, there have been numerous attempts to define metrics which can be extracted from design documents, using counts of "between module complexity" such as call statements and data flows; the most well known are the metrics in [22]. Ohlsson and Alberg, [23], reported on a study at Ericsson where metrics derived automatically from design documents were used to predict especially fault-prone modules prior to testing. Recently, there have been several attempts, such as [24], [25], to define metrics on object-oriented designs.

The advent and widespread use of Albrecht Function Points (FPs) raises the possibility of defect density predictions based on a metric which can be extracted at the specification stage. There is widespread belief that FPs are a better (one-dimensional) size metric than LOC; in theory at least they get round the problems of lack of uniformity, and they are also language independent. We already see defect density defined in terms of defects per FP, and empirical studies are emerging that seem likely to be the basis for predictive models. For example, in Table 1, [26] reports the following bench-marking study, reportedly based on large amounts of data from different commercial sources.
TABLE 1
DEFECTS PER FUNCTION POINT BY DEFECT ORIGIN (BENCHMARKING DATA REPORTED IN [26])

Defect Origins | Defects per Function Point
Requirements   | 1.00
Design         | 1.25
Coding         | 1.75
Documentation  | 0.60
Bad fixes      | 0.40
Total          | 5.00
3 PREDICTION USING TESTING METRICS

TABLE 2
DEFECTS FOUND PER TESTING APPROACH

Testing Type        | Defects Found/hr
Regular use         | 0.210
Black box           | 0.282
White box           | 0.322
Reading/inspections | 1.057
"inherent to the p r o g r a m m i n g process itself." Also useful (providing y o u are aware of t h e kind of limitations discussed in [33]) is the kind of data published by [34] in Table 2. O n e class of testing metrics that a p p e a r to be quite promising for predicting defects a r e t h e so called test coverage measures, A structural testing strategy specifies that we
have to select enough test cases so that each of a set of "obSome of the most promising local models for predicting jects" in a program lie on some path (i.e., are "covered") in residual defects involve very careful collection of data a t least on test case. For example, statement coverage is a about defects discovered during early inspection and test- structural testing strategy in which the "objects" are the ing phases. The idea is very simple: you have n predefined statements. For a given strategy and a given set of test cases phases at which you collect data dn (the defect rate. Sup- we can ask what proportion of coverage has been achieved, pose phase n represents the period of the first six months of The resulting metric is defined as the Test Effectiveness Rathe product in the field, so that dn is the rate of defects tio (TER) with respect to that strategy. For example, TER1 is found within that period. To predict dn at phase n - 1 the TER for statement coverage; TER2 is the TER for branch (which might be integration testing) you look at the actual coverage; and TER3 is the TER for linear code sequence and sequence d\, .... dn_i and compare this with profiles of simi- jump coverage. Clearly we might expect the number of dislar, previous products, and use statistical extrapolation covered defects to approach the number of defects actually techniques. With enough data it is possible to get accurate in the program as the values of these TER metrics increases, predictions of dn based on observed du .... dm where m is less Veevers and Marshall, [35], report on some defect and relithan n-l. This method is an important feature of the Japa- ability prediction models using these metrics which give nese software factory approach [27], [28], [29]. Extremely quite promising results. Interestingly Neil, [36], reported accurate predictions are claimed (usually within 95 percent that the modules with high structural complexity metric confidence limits) due to stability of the development and values had a significantly lower TER than smaller modules, testing environment and the extent of data collection. It T h i s supports our intuition that testing larger modules is appears that the IBM NASA Space shuttle team is achieving m o r e difficult and that such modules would appear more similarly accurate predictions based on the same kind of l i k e t y t 0 contain undetected defects. approach [18] Voas and Miller use static analysis of programs to conjecIn the absence of an extensive local database it may be t u r e t h e presence or absence of defects before testing has possible to use published bench-marking data to help with t a k e n P ] a c e ' I37J- T h e i r method relies on a notion of program this kind of prediction. Dyer, [30], and Humphrey, [31], con- testability, which seeks to determine how likely a program tain a lot of this kind of data. Buck and Robbins, [32], report w i U fail assuming it contains defects. Some programs will on some remarkably consistent defect density values during contain defects that may be difficult to discover by testing by different review and testing stages across different types of v i r t u e o f t h e i r structure and organization. Such programs software projects at IBM. For example, for new code devel- h a v e a I o w d e f e c t revealing potential and may, therefore, oped the number of defects per KLOC discovered with Fa- h i d e defects until they show themselves as failures during gan inspections settles to a number between 8 and 12. There operation. 
Voas and Miller use program mutation analysis to is no such consistency for old code. Also the number of man- simulate the conditions that would cause a defect to reveal hours spent on the inspection process per major defect is i t s e l f ^ a f a i l u r e if a defect was indeed present. Essentially if always between three and five. The authors speculate that, program testability could be estimated before testing takes despite being unsubstantiated with data, these values form P l a c e t h e estimates could help predict those programs that "natural numbers of programming," believing that they are w o u l d reveal less defects during testing even if they contained
defects. Bertolino and Strigini, [38], provide an alternative exposition of testability measurement and its relation to testing, debugging, and reliability assessment.
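The phase-based extrapolation described at the start of this section can be sketched as follows; this is our own illustration of the general idea (scaling a similar product's defect-rate profile to the rates observed so far), not a documented algorithm from [27], [28], [29]:

def predict_phase_rate(observed, profile, target_phase):
    # observed: rates d1..dm for the new product (m phases so far).
    # profile: full per-phase rates of a similar completed product.
    # target_phase: 0-based index of the later phase to predict.
    m = len(observed)
    # Least-squares scale factor over the phases both products share.
    scale = (sum(o * p for o, p in zip(observed, profile[:m]))
             / sum(p * p for p in profile[:m]))
    return scale * profile[target_phase]

# Hypothetical per-phase defect rates (defects/KLOC).
past = [9.0, 6.0, 3.5, 1.2, 0.4]
print(predict_phase_rate([10.0, 6.5], past, target_phase=4))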
4 PREDICTION USING PROCESS QUALITY DATA
There are many experts who argue that the "quality" of the development process is the best predictor of product quality (and hence, by default, of residual defect density). This issue, and the problems surrounding it, is discussed extensively in [33]. There is a dearth of empirical evidence linking process quality to product quality. The simplest metric of process quality is the five-level ordinal scale SEI Capability Maturity Model (CMM) ranking. Despite its widespread popularity, there was until recently no evidence to show that level (n + 1) companies generally deliver products with lower residual defect density than level (n) companies. The Diaz and Sligo study, [39], provides the first promising empirical support for this widely held assumption. Clearly the strict 1-5 ranking, as prescribed by the SEI CMM, is too coarse to be used directly for defect prediction, since not all of the processes covered by the CMM will relate to software quality.

The best available evidence relating particular process methods to defect density concerns the Cleanroom method [30]. There is independent validation that, for relatively small projects (less than 30 KLOC), the use of Cleanroom results in approximately three errors per KLOC during statistical testing, compared with traditional development postdelivery defect densities of between five and 10 defects per KLOC. Also, Capers Jones hypothesizes quality targets expressed in "defect potentials" and "delivered defects" for different CMM levels, as shown in Table 3 [40].
TABLE 3
RELATIONSHIP BETWEEN CMM LEVELS AND DELIVERED DEFECTS

SEI CMM Level | Defect Potentials | Removal Efficiency (%) | Delivered Defects
1             | 5                 | 85                     | 0.75
2             | 4                 | 89                     | 0.44
3             | 3                 | 91                     | 0.27
4             | 2                 | 93                     | 0.14
5             | 1                 | 95                     | 0.05
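As a consistency check on Table 3 (our own arithmetic, not from the paper), the delivered-defect column is exactly the defect potential multiplied by the fraction of defects not removed:

potentials = {1: 5, 2: 4, 3: 3, 4: 2, 5: 1}
efficiency = {1: 85, 2: 89, 3: 91, 4: 93, 5: 95}  # percent removed
for level in range(1, 6):
    delivered = potentials[level] * (1 - efficiency[level] / 100)
    print(level, round(delivered, 2))  # 0.75, 0.44, 0.27, 0.14, 0.05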
5 MULTIVARIATE APPROACHES

There have been many attempts to develop multilinear regression models based on multiple metrics. If there is a consensus of sorts about such approaches, it is that the accuracy of the predictions is never significantly worse when the metrics set is reduced to a handful (say 3-6 rather than 30), [41]; a major reason for this is that many of the metrics are colinear; that is, they capture the same underlying attribute (so the reduced set of metrics has the same information content, [42]). Thus, much work has concentrated on how to select those small number of metrics which are somehow the most powerful and/or representative. Principal Component Analysis (see [43]) is used in some of the studies to reduce the dimensionality of many related metrics to a smaller set of "principal components," while retaining most of the variation observed in the original metrics.

For example, [42] discovered that 38 metrics, collected on around 1,000 modules, could be reduced to six orthogonal dimensions that account for 90 percent of the variability. The most important dimensions, size, nesting, and prime, were then used to develop an equation to discriminate between low and high maintainability modules.

Munson and Khoshgoftaar in various papers, [41], [43], [44], use a similar technique, factor analysis, to reduce the dimensionality to a number of "independent" factors. These factors are then labeled so as to represent the "true" underlying dimension being measured, such as control, volume, and modularity. In [43] they used factor analytic variables to help fit regression models to a number of error data sets, including Akiyama's [5]. This helped to get over the inherent regression analysis problems presented by multicolinearity in metrics data. Munson and Khoshgoftaar have advanced the multivariate approach to calculate a "relative complexity metric." This metric is calculated using the magnitude of variability from each of the factor analysis dimensions as the input weights in a weighted sum. In this way a single metric integrates all of the information contained in a large number of metrics. This is seen to offer many advantages over using a univariate decision criterion such as McCabe's metric [44].

6 A CRITIQUE OF CURRENT APPROACHES TO DEFECT PREDICTION

Despite the heroic contributions made by the authors of previous empirical studies, serious flaws remain and have detrimentally influenced our models for defect prediction. Of course, such weaknesses exist in all scientific endeavours, but if we are to improve scientific enquiry in software engineering we must first recognize past mistakes before suggesting ways forward.

The key issues affecting the software engineering community's historical research direction, with respect to defect prediction, are:

• the unknown relationship between defects and failures (Section 6.1);
• problems with the "multivariate" statistical approach (Section 6.2);
• problems of using size and complexity metrics as sole "predictors" of defects (Section 6.3);
• problems in statistical methodology and data quality (Section 6.4);
• false claims about software decomposition and the "Goldilocks Conjecture" (Section 6.5).

6.1 The Unknown Relationship between Defects and Failures

There is considerable disagreement about the definitions of defects, errors, faults, and failures. In different studies defect counts refer to:

• postrelease defects;
• the total of "known" defects;
• the set of defects discovered after some arbitrary fixed point in the software life cycle (e.g., after unit testing).
The terminology differs widely between studies; defect rate, defect density, and failure rate are used almost interchangeably. It can also be difficult to tell whether a model is predicting discovered defects or residual defects. Because of these problems (which are discussed extensively in [45]) we have to be extremely careful about the way we interpret published predictive models.

Apart from these problems of terminology and definition, the most serious weakness of any prediction of residual defects or defect density concerns the weakness of defect count itself as a measure of software reliability. Even if we knew exactly the number of residual defects in our system, we would have to be extremely wary about making definitive statements about how the system will operate in practice. The reasons for this appear to be:

• difficulty of determining in advance the seriousness of a defect; few of the empirical studies attempt to distinguish different classes of defects;
• great variability in the way systems are used by different users, resulting in wide variations of operational profiles. It is thus difficult to predict which defects are likely to lead to failures (or to commonly occurring failures).

The latter point is particularly serious and has been highlighted dramatically by [46]. Adams examined data from nine large software products, each with many thousands of years of logged use world wide. He charted the relationship between detected defects and their manifestation as failures. For example, 33 percent of all defects led to failures with a mean time to failure greater than 5,000 years. In practical terms, this means that such defects will almost never manifest themselves as failures. Conversely, the proportion of defects which led to a mean time to failure of less than 50 years was very small (around 2 percent). However, it is these defects which are the important ones to find, since these are the ones which eventually exhibit themselves as failures to a significant number of users. Thus Adams' data demonstrates the Pareto principle: a very small proportion of the defects in a system will lead to almost all the observed failures in a given period of time; conversely, most defects in a system are benign in the sense that in the same given period of time they will not lead to failures.

It follows that finding (and removing) large numbers of defects may not necessarily lead to improved reliability. It also follows that a very accurate residual defect density prediction may be a very poor predictor of operational reliability, as has been observed in practice [47]. This means we should be very wary of attempts to equate fault densities with failure rates, as proposed for example by Capers Jones (Table 4 [48]). Although highly attractive in principle, such a model does not stand up to empirical validation.

TABLE 4
DEFECT DENSITY (F/KLOC) VS. MTTF

F/KLOC | MTTF
>30    | 1 min
20-30  | 4-5 min
5-10   | 1 hr
2-5    | several hours
0.5-1  | 1 month

Defect counts cannot be used to predict reliability because, despite their usefulness from a system developer's point of view, they do not measure the quality of the system as the user is likely to experience it. The promotion of defect counts as a measure of "general quality" is, therefore, misleading. Reliability prediction should, therefore, be viewed as complementary to defect density prediction.

6.2 Problems with the Multivariate Approach

Applying multivariate techniques, like factor analysis, produces metrics which cannot be easily or directly interpreted in terms of program features. For example, in [43] a factor dimension metric, control, was calculated by the weighted sum (8):

control = a1HNK + a2PRC + a3E + a4VG + a5MMC + a6Error + a7FMP + a8LOC (8)

where the ai are derived from factor analysis. HNK was Henry and Kafura's information flow complexity metric, PRC is a count of the number of procedures, E is Halstead's effort metric, VG is McCabe's complexity metric, MMC is Harrison's complexity metric, and LOC is lines of code. Although this equation might help to avoid multicolinearity, it is hard to see how you might advise a programmer or designer on how to redesign the programs to achieve a better control metric value for a given module. Likewise, the effects of such a change in module control on defects is less than clear.

These problems are compounded in the search for an ultimate or relative complexity metric [43]. The simplicity of a single number is appealing, but the foundations of measurement are based on identifying differing well-defined attributes with single standard measures [45]. Although there is a clear role for data reduction and analysis techniques, such as factor analysis, this should not be confused with, or used instead of, measurement theory. For example, statement count and lines of code are highly correlated because programs with more lines of code typically have a higher number of statements. This does not mean that the true size of programs is some combination of the two metrics. A more suitable explanation would be that both are alternative measures of the same attribute. After all, centigrade and fahrenheit are highly correlated measures of temperature. Meteorologists have agreed a convention to use one of these as a standard in weather forecasts. In the United States temperature is most often quoted as fahrenheit, while in the United Kingdom it is quoted as centigrade. They do not take a weighted sum of both temperature measures. This point lends support to the need to define meaningful and standard measures for specific attributes rather than searching for a single metric using the multivariate approach.
2. Here we use the technical concept of reliability, defined as mean time to failure or probability of failure on demand, in contrast to the "looser" concept of reliability with its emphasis on defects.
6.3 Problems in Using Size and Complexity Metrics to Predict Defects

A discussion of the theoretical and empirical problems with many of the individual metrics discussed above may be found in [45]. There are as many empirical studies (see, for example, [49], [50], [51]) refuting the models based on Halstead and McCabe as there are studies "validating" them. Moreover, some of the latter are seriously flawed. Here we concentrate entirely on their use within models used to predict defects.

The majority of size and complexity models assume a straightforward relationship with defects: defects are a function of size, or defects are caused by program complexity. Despite the reported high correlations between design complexity and defects, the relationship is clearly not a straightforward one. It is clear that it is not entirely causal because, if it were, we couldn't explain the presence of defects introduced when the requirements are defined. It is wrong to mistake correlation for causation. An analogy would be the significant positive correlation between IQ and height in children. It would be dangerous to predict IQ from height because height doesn't cause high IQ; the underlying causal factor is physical and mental maturation.

There are a number of interesting observations about the way complexity metrics are used to predict defect counts:

• the models ignore the causal effects of programmers and designers. After all, it is they who introduce the defects, so any attribution for faulty code must finally rest with individual(s);
• overly complex programs are themselves a consequence of poor design ability or problem difficulty. Difficult problems might demand complex solutions, and novice programmers might produce "spaghetti code";
• defects may be introduced at the design stage because of the overcomplexity of the designs already produced. Clerical errors and mistakes will be committed because the existing design is difficult to comprehend. Defects of this type are "inconsistencies" between design modules and can be thought of as quite distinct from requirements defects.

6.4 Problems in Data Quality and Statistical Methodology

The weight given to knowledge obtained by empirical means rests on the quality of the data collected and the degree of rigor employed in analyzing this data. Problems in either data quality or analysis may be enough to make the resulting conclusions invalid. Unfortunately, some defect prediction studies have suffered from such problems. These problems are caused, in the main, by a lack of attention to the assumptions necessary for successful use of a particular statistical technique. Other serious problems include the lack of distinction made between model fitting and model prediction, and the unjustified removal of data points or misuse of averaged data.

The ability to replicate results is a key component of any empirical discipline. In software development, different findings from diverse experiments could be explained by the fact that different, perhaps uncontrolled, processes were used on different projects. Comparability over case studies might be better achieved if the processes used during development were documented, along with estimates of the extent to which they were actually followed.

6.4.1 Multicolinearity

Multicolinearity is the most common methodological problem encountered in the literature. Multicolinearity is present when a number of predictor variables are highly positively or negatively correlated. Linear regression depends on the assumption of zero correlation between predictor variables [52]. The consequences of multicolinearity are manyfold: it causes unstable coefficients, misleading statistical tests, and unexpected coefficient signs. For example, one of the equations in [21] (9):

  C = 0.042·MCI − 0.075·N + 0.00001·HE   (9)

shows clear signs of multicolinearity. If we examine the equation coefficients we can see that an increase in the operator and operand total, N, should result in a decrease in changes, C, all things being equal. This is clearly counterintuitive. In fact, analysis of the data reveals that machine code instructions, MCI, operand and operator count, N, and Halstead's effort metric, HE, are all highly correlated [42]. This type of problem appears to be common in the software metrics literature, and some recent studies appear to have fallen victim to the multicolinearity problem [12], [53].

Colinearity between variables has also been detected in a number of studies that reported a negative correlation between defect density and module size. Rosenberg reports that, since there must be a negative correlation between X (size) and 1/X, it follows that the correlation between X and Y/X (defects/size) must be negative whenever defects are growing at most linearly with size [54]. Studies which have postulated such a linear relationship are more than likely to have detected negative correlation, and therefore concluded that large modules have smaller defect densities, because of this property of arithmetic.

6.4.2 Factor Analysis vs. Principal Components Analysis

The use of factor analysis and principal components analysis solves the multicolinearity problem by creating new orthogonal factors or principal component dimensions [43]. Unfortunately, the application of factor analysis assumes the errors are Gaussian, whereas [55] notes that most software metrics data is non-Gaussian. Principal components analysis can be used instead of factor analysis because it does not rely on any distributional assumptions, but it will on many occasions produce results broadly in agreement with factor analysis. This makes the distinction a minor one, but one that needs to be considered.

6.4.3 Fitting Models vs. Predicting Data

Regression modeling approaches are typically concerned with fitting models to data rather than predicting data. Regression analysis typically finds the least-squares fit to the data, and the goodness of this fit demonstrates how well the model explains historical data. However, a truly successful model is one which can predict the number of defects discovered in an unknown module. Furthermore, this must be a module not used in the derivation of the model. Unfortunately, perhaps because of the shortage of data, some researchers have tended to use their data to fit the model without being able to test the resultant model out on a new data set. See, for example, [5], [12], [16].
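The multicolinearity problem of Section 6.4.1 is easy to reproduce. The following minimal sketch (not from the paper; the "size" and "statement count" data are synthetic) fits the same linear model to three resamples of near-identical predictors: the individual coefficients swing wildly and flip sign, even though the combined fit is stable.

```python
# A minimal sketch of multicolinearity: two nearly identical predictors
# produce unstable, sign-flipping regression coefficients across
# resamples, even though the fitted predictions barely change.
import numpy as np

rng = np.random.default_rng(1)
for trial in range(3):
    size = rng.normal(100, 20, 200)          # e.g., lines of code
    stmts = size + rng.normal(0, 0.2, 200)   # statement count, r > 0.9999
    defects = 0.05 * size + rng.normal(0, 2, 200)

    X = np.column_stack([np.ones(200), size, stmts])
    coef, *_ = np.linalg.lstsq(X, defects, rcond=None)
    print(f"trial {trial}: intercept={coef[0]:+.2f}, "
          f"b_size={coef[1]:+.2f}, b_stmts={coef[2]:+.2f}")
# The individual coefficients are meaningless here, so advice such as
# "reduce statement count to cut defects" cannot be read off the model,
# even when the overall fit looks good.
```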
6.4.4 Removing Data Points

In standard statistical practice there should normally be strong theoretical or practical justification for removing data points during analysis. Recording and transcription errors are often an acceptable reason. Unfortunately, it is often difficult to tell from published papers whether any data points have been removed before analysis and, if they have, the reasons why. One notable case is Compton and Withrow [12], who reported removing a large number of data points from the analysis because they represented modules that had experienced zero defects. Such action is surprising in view of the conjecture they wished to test: that defects were minimised around an optimum size for Ada. If the majority of smaller modules had zero defects, as it appears, then we cannot accept Compton and Withrow's conclusions about the "Goldilock's Conjecture."

6.4.5 Using "Averaged" Data

We believe that the use of averaged data in analysis, rather than the original data, prejudices many studies. The study in [19] uses graphs, apparently derived from the original NASA-Goddard data, plotting average size in statements against number of defects or defect density. Analysis of averages is one step removed from the original data and raises a number of issues. Using averages reduces the amount of information available to test the conjecture under study, and any conclusions will be correspondingly weaker. The classic study in [13] used average fault density of grouped data in a way that suggested a trend that was not supported by the raw data. The use of averages may be a practical way around the common problem where defect data is collected at a higher level, perhaps at the system or subsystem level, than is ideal: defects are recorded against systems or subsystems rather than against individual modules or procedures. As a consequence, data analysis must match defect data on systems against statement counts automatically collected at the module level. There may be some modules within a subsystem that are over-penalized when others keep the average high, because the other modules in that subsystem have more defects, or vice versa. Thus, we cannot completely trust any defect data collected in this way.

Misuse of averages has occurred in one other form. In Gaffney's paper [11], the rule for optimal module size was derived on the assumption that to calculate the total number of defects in a system we could use the same model as had been derived using module defect counts. The model derived at the module level is shown by (4) and can be extended to count the total defects in a system, D_T, based on the module sizes L_i, as in (10). The total number of modules in the system is denoted by N:

  D_T = Σ_{i=1..N} D_i = 4.2·N + 0.0015 · Σ_{i=1..N} L_i^{4/3}   (10)

Gaffney assumes that the average module size can be used to calculate the total defect count, and also the optimum module size for any system, using (11):

  D_T = 4.2·N + 0.0015·N · ( (Σ_{i=1..N} L_i) / N )^{4/3}   (11)

However, we can see that (10) and (11) are not equivalent. The use of (11) mistakenly assumes that the power of a sum is equal to a sum of powers (see the numeric sketch below).

6.5 The "Goldilock's Conjecture"

The results of inaccurate modeling and inference are perhaps most evident in the debate that surrounds the "Goldilock's Conjecture" discussed in Section 2: the idea that there is an optimum module size that is "not too big nor too small." Hatton [19] claims that there is

  "compelling empirical evidence from disparate sources to suggest that in any software system, larger components are proportionally more reliable than smaller components."

If these results were generally true, the implications for software engineering would be very serious indeed. It would mean that program decomposition as a way of solving problems simply did not work. Virtually all of the work done in software engineering, extending from fundamental concepts, like modularity and information-hiding, to methods, like object-oriented and structured design, would be suspect because all of them rely on some notion of decomposition. If decomposition doesn't work, then there would be no good reason for doing it.

Claims with such serious consequences as these deserve special attention. We must ask whether the data and knowledge exist to support them. These are clear criteria: if the data exist to refute the conjecture that large modules are better, and if we have a sensible explanation for this result, then a claim will stand. Our analysis shows that, using these criteria, these claims cannot currently stand. In the studies that support the conjecture we found the following problems:

• none define "module" in such a way as to make comparison across data sets possible;
• none explicitly compare different approaches to structuring and decomposing designs;
• the data analysis or quality of the data used could not support the results claimed;
• a number of factors exist that could partly explain the results, which these studies have neglected to examine.

Additionally, there are other data sets which do not show any clear relationships between module size and defect density.

If we examine the various results we can divide them into three main classes. The first class contains models, exemplified by Fig. 1a, that show how defect density falls as module size increases. Models such as these have been produced by Akiyama, Gaffney, and Basili and Perricone. The second class of models, exemplified by Fig. 1b, differ from the first because they show the Goldilock's principle at work: here defect density rises as modules get bigger in size. The third class, exemplified by Fig. 1c, shows no discernible pattern whatsoever; the relationship between defect density and module size appears random (no meaningful curvilinear models could be fitted to the data at all).
Fig. 1. Three classes of defect density results. (a) Akiyama (1971), Basili and Perricone (1984), and Gaffney (1984); (b) Moeller and Paulish (1993), Compton and Withrow (1990), and Hatton (1997); (c) Neil (1992) and Fenton and Ohlsson (1997).
"combined measure" of different size measures, such as deci- 7 PREDICTING DEFECTS USING BBNs sion counts. This principal component statistic was then plot^ ft from Qur is jn Section 6 ^ ion ted against the number of changes made to the system mod£ ^ defects be or sizf meas. ules (these were rpredominantlyJ changes made to fix defects). . . r . , J , .. -L, , „ . , , , , ,. , ° ,. . ures alone presents only a skewed picture. The number ofc This defect data was standardized according to normal statis, . ,. ,. , , , »j ; U r.. *• . , . • , .i . c-^. . i defects discovered is clearly related to ithe amount of testing ,. _• .. J / I_- u t. tical practice. A polynomial regression curve was fitted to the A , ,. . . * i • u tu *u • -cperformed, as discussed above. A program which has never data in order to determine whether there was significant f j , c .. , / &... , , . ,. _. „ . , r . , .t T , ,. been tested, or used for that matter, will have a zero defect nonlinear effects of size on defect density. The results were , , . . . , ,. , ., , . . , , , , ,, . p. o count, even though its complexity may be very high. Moreb here in Fig. 2. *u * * « *• c i ypublished and are reproduced y „ r i_ i •i u • over, we can assume the test effectiveness of complex proDespite some parameters of the polynomial curve being . . , , .„_, , , ij u . . „ .„ . . i . u /., . j. r grams is relatively low, 137], and such programs could be statistically significant it is obvious that there is no discerm- ° , . uu*i u r j c * ir ,. 1 . . . I _i c J j i • . expected to exhibit a lower number of defects per line of ble relationship between defect counts and module size in j , • ,. . . *u -U-J » J r » rr . _ , , ,, ii j i • j code during testing because they hide defects more effec. _. ,. , . ,,, .. , , ., . the Tandem data set. Many small modules experienced no , , „ , , ~ . , -i ij u tively.T1 This could explain many of the empirical results that defects at all and the fitted polynomial curve would be use- , J , , , , , i . , .i. T . J c . . ... „. , •: , , , . • ! • • larger modules have lower defect densities. Therefore, cfrom c less for prediction. This data clearly refutes the simplistic , i r ^ u-iu i J i_ i • -r- _• •_ i T~ i jiu J i iu what we know of testability, we could conclude that large ,, *• j j i j r ^ u u assumptions typified by class Fig. la and lb models (these ji ii. i • i- T J J \ . modules contained many residual defects, rather than conmodels couldn t explain the Tandem data) nor accurately ^ rf m o d u l e s w e f e m o r e reljable (and im predict the defect density values of these Tandem modules. A «n ^ s o « w a r e d e c o m i t i o n is w r o n } F S similar analysis and result is presented in [47]. ^ c l e a r aH of ^ o b l e m s d e s c r i b e d i n Sec tion 6 a r e n o t VVe conclude that the relationship between defects and ^ tQ so]ved easil H o w e v e r w e believe that model. module size is too complex in general, to admit to straightfhe c o m l e x i t i e s o f s o f t w a r e development using new forward curve fitting models. These results, therefore, conb a b i l i s t i c t e c h n i q u e s presents a positive way forward, tradict the idea that there is a general law linking defect T h e s e m e t h o d s c a l l e d gayesian Belief Networks (BBNs), density and software component size as suggested by the a l l o w u s t 0 e x p r e s s c o m p l e x interrelations within the model "Goldilock's Conjecture."
Fig. 2. Tandem data: defect counts vs. size "principal component."
These methods, called Bayesian Belief Networks (BBNs), allow us to express complex interrelations within the model at a level of uncertainty commensurate with the problem. In this section, we first provide an overview of BBNs (Section 7.1) and describe the motivation for the particular BBN example used in defects prediction (Section 7.2). In Section 7.3, we describe the actual BBN.

7.1 An Overview of BBNs

Bayesian Belief Networks (also known as Belief Networks, Causal Probabilistic Networks, Causal Nets, Graphical Probability Networks, Probabilistic Cause-Effect Models, and Probabilistic Influence Diagrams) have attracted much recent attention as a possible solution for the problems of decision support under uncertainty. Although the underlying theory (Bayesian probability) has been around for a long time, the possibility of building and executing realistic models has only been made possible because of recent algorithms and software tools that implement them [57]. To date, BBNs have proven useful in practical applications such as medical diagnosis and diagnosis of mechanical failures. Their most celebrated recent use has been by Microsoft, where BBNs underlie the help wizards in Microsoft Office; also the "intelligent" printer fault diagnostic system which you can run when you log onto Microsoft's web site is in fact a BBN which, as a result of the problem symptoms you enter, identifies the most likely fault.

A BBN is a graphical network that represents probabilistic relationships among variables. BBNs enable reasoning under uncertainty and combine the advantages of an intuitive visual representation with a sound mathematical basis in Bayesian probability. With BBNs, it is possible to articulate expert beliefs about the dependencies between different variables and to propagate consistently the impact of evidence on the probabilities of uncertain outcomes, such as "future system reliability." BBNs allow an injection of scientific rigor when the probability distributions associated with individual nodes are simply "expert opinions."

A BBN is a special type of diagram (called a graph) together with an associated set of probability tables. The graph is made up of nodes and arcs, where the nodes represent uncertain variables and the arcs the causal/relevance relationships between the variables. Fig. 3 shows a BBN for an example reliability prediction problem. The nodes represent discrete or continuous variables; for example, the node "use of IEC 1508" (the standard) is discrete, having two values "yes" and "no," whereas the node "reliability" might be continuous (such as the probability of failure). The arcs represent causal/influential relationships between variables. For example, software reliability is defined by the number of (latent) faults and the operational usage (frequency with which faults may be triggered). Hence, we model this relationship by drawing arcs from the nodes "number of latent faults" and "operational usage" to "reliability."

For the node "reliability" the node probability table (NPT) might, therefore, look like that shown in Table 5 (for ultra-simplicity we have made all nodes discrete, so that here reliability takes on just three discrete values: low, medium, and high). The NPTs capture the conditional probabilities of a node given the state of its parent nodes. For nodes without parents (such as "use of IEC 1508" in Fig. 3) the NPTs are simply the marginal probabilities.

There may be several ways of determining the probabilities for the NPTs. One of the benefits of BBNs stems from the fact that we are able to accommodate both subjective probabilities (elicited from domain experts) and probabilities based on objective data. Recent tool developments, notably on the SERENE project [58], mean that it is now possible to build very large BBNs with very large probability tables (including continuous node variables). In three separate industrial applications we have built BBNs with several hundred nodes and several million probability values [59].

There are many advantages of using BBNs, the most important being the ability to represent and manipulate complex models that might never be implemented using conventional methods. Another advantage is that the model can predict events based on partial or uncertain data. Because BBNs have a rigorous, mathematical meaning, there are software tools that can interpret them and perform the complex calculations needed in their use [58].

The benefits of using BBNs include:

• specification of complex relationships using conditional probability statements;
• use of "what-if?" analysis and forecasting of effects of process changes;
• easier understanding of chains of complex and seemingly contradictory reasoning via the graphical format;
• explicit modeling of "ignorance" and uncertainty in estimates;
• use of subjectively or objectively derived probability distributions;
• forecasting with missing data.

7.2 Motivation for BBN Approach

Clearly defects are not directly caused by program complexity alone. In reality the propensity to introduce defects will be influenced by many factors unrelated to code or design complexity. There are a number of causal factors at play when we want to explain the presence of defects in a program:

• difficulty of the problem;
• complexity of the designed solution;
• programmer/analyst skill;
• design methods and procedures used.

Eliciting requirements is a notoriously difficult process and is widely recognized as being error prone. Defects introduced at the requirements stage are claimed to be the most expensive to remedy if they are not discovered early enough. Difficulty depends on the individual trying to understand and describe the nature of the problem as well as the problem itself. A "sorting" problem may appear difficult to a novice programmer but not to an expert. It also seems that the difficulty of the problem is partly influenced by the number of failed attempts at solutions there have been, and whether a "ready made" solution can be reused. Thus, novel problems have the highest potential to be difficult, and "known" problems tend to be simple because known solutions can be identified and reused. Any software development project will have a mix of "simple" and "difficult" problems depending on what intellectual resources are available to tackle them. Good managers know this and attempt to prevent defects by pairing up people and problems: easier problems to novices and difficult problems to experts.
TABLE 5
NODE PROBABILITY TABLE (NPT) FOR THE NODE "RELIABILITY"

  operational usage |       low        |       med        |       high
  faults            | low   med  high  | low   med  high  | low   med  high
  reliability low   | 0.10  0.20 0.33  | 0.20  0.33 0.50  | 0.20  0.33 0.70
  reliability med   | 0.20  0.30 0.33  | 0.30  0.33 0.30  | 0.30  0.33 0.20
  reliability high  | 0.70  0.50 0.33  | 0.50  0.33 0.20  | 0.50  0.33 0.10

When assessing a defect it is useful to determine when it was introduced. Broadly speaking there are two types of defect: those that are introduced in the requirements and those introduced during design (including coding/implementation, which can be treated as design). Useful defect models need to explain why a module has a high or low defect count if we are to learn from their use; otherwise we could never intervene and improve matters. Models using size and complexity metrics are structurally limited to assuming that defects are solely caused by the internal organization of the software design. They cannot explain defects introduced because:

• the "problem" is "hard";
• problem descriptions are inconsistent;
• the wrong "solution" is chosen and does not fulfill the requirements.

We have long recognized in software engineering that program quality can be potentially improved through the use of proper project procedures and good design methods. Basic project procedures like configuration management, incident logging, documentation, and standards should help reduce the likelihood of defects. Such practices may not help the unique genius you need to work on the really difficult problems, but they should raise the standards of the mediocre.

Central to software design methods is the notion that problems and designs can be decomposed into meaningful chunks, where each can be readily understood alone and finally recomposed to form the final system. Loose coupling between design components is supposed to help ensure that defects are localized and that consistency is maintained. What we have lacked as a community is a theory of program composition and decomposition; instead we have fairly ill-defined ideas on coupling, modularity, and cohesiveness. However, despite not having such a theory, everyday experience tells us that these ideas help reduce defects and improve comprehension. It is indeed hard to think of any other scientific or engineering discipline that has not benefited from this approach.

Surprisingly, much of the defect prediction work has been pursued without reference to testing or testability. According to [37], [38], the testability of a program will dictate its propensity to reveal failures under test conditions and use. Also, at a superficial level, the amount of testing performed will determine how many defects will be discovered, assuming there are defects there to discover. Clearly, if no testing is done, then no defects will be found. By extension, we might argue that difficult problems, with complex solutions, might be difficult to test and so might demand more test effort.
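The propagation that the Hugin tool performs in Section 7.3 can be seen in miniature with the NPT of Table 5. The sketch below is a minimal illustration, not a BBN library: the priors for "operational usage" and "faults" are invented for the example (the paper does not give them), and the inference is done by brute-force enumeration.

```python
# A minimal sketch of BBN evidence propagation using the NPT of Table 5.
# Priors for the two parent nodes are assumed values, for illustration.
states = ["low", "med", "high"]

p_usage  = {"low": 0.3, "med": 0.5, "high": 0.2}   # assumed prior
# npt[(usage, faults)][reliability] = Pr{reliability | usage, faults}
npt = {
    ("low", "low"):   {"low": 0.10, "med": 0.20, "high": 0.70},
    ("low", "med"):   {"low": 0.20, "med": 0.30, "high": 0.50},
    ("low", "high"):  {"low": 0.33, "med": 0.33, "high": 0.33},
    ("med", "low"):   {"low": 0.20, "med": 0.30, "high": 0.50},
    ("med", "med"):   {"low": 0.33, "med": 0.33, "high": 0.33},
    ("med", "high"):  {"low": 0.50, "med": 0.30, "high": 0.20},
    ("high", "low"):  {"low": 0.20, "med": 0.30, "high": 0.50},
    ("high", "med"):  {"low": 0.33, "med": 0.33, "high": 0.33},
    ("high", "high"): {"low": 0.70, "med": 0.20, "high": 0.10},
}

# Posterior Pr{reliability | faults = "high"}: since faults is a root
# node, usage keeps its prior and is simply marginalized out.
posterior = {r: 0.0 for r in states}
for u in states:
    for r in states:
        posterior[r] += p_usage[u] * npt[(u, "high")][r]

print({r: round(p, 3) for r, p in posterior.items()})
# {'low': 0.489, 'med': 0.289, 'high': 0.219}; sums to ~1 up to the
# 0.33 rounding in Table 5. Observing many faults shifts belief in
# "reliability" sharply toward "low".
```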
If such testing effort is not forthcoming (as is typical in many commercial projects when deadlines loom), then fewer defects will be discovered, thus giving an overestimate of the quality achieved and a false sense of security. Thus, any model to predict defects must include testing and testability as crucial factors.

7.3 A Prototype BBN

While there is insufficient space here to fully describe the development and execution of a BBN model, we have developed a prototype BBN to show the potential of BBNs and illustrate their useful properties. This prototype does not exhaustively model all of the issues described in Section 7.2, nor does it solve all of the problems described in Section 6. Rather, it shows the possibility of combining the different software engineering schools of thought on defect prediction into a single model. With this model we should be able to show how predictions might be made and explain historical results more clearly.

The majority of the nodes have the following states: "very high," "high," "medium," "low," "very low," except for the design size node and defect count nodes, which have integer values or ranges, and the defect density nodes, which have real values. The probabilities attached to each of these states are fictitious, but are determined from an analysis of the literature or common-sense assumptions about the direction and strength of relations between variables.

The defect prediction BBN can be explained in two stages. The first stage covers the life-cycle processes of specification, design, or coding, and the second stage covers testing. In Fig. 4, problem complexity represents the degree of complexity inherent in the set of problems to be solved by development. We can think of these problems as being discrete functional requirements in the specification. Solving these problems accrues benefits to the user. Any mismatch between the problem complexity and design effort is likely to cause the introduction of defects, defects introduced, and a greater design size. Hence the arrows between design effort, problem complexity, introduced defects, and design size. The testing stage follows the design stage, and in practice the testing effort actually allocated may be much less than that required. The mismatch between testing effort and design size will influence the number of defects detected, which is bounded by the number of defects introduced. The difference between the defects detected and defects introduced is the residual defects count. The defect density at testing is a function of the design size and defects detected (defects/size). Similarly, the residual defect density is residual defects divided by design size.

Fig. 5 shows the execution of the defect density BBN model under the "Goldilock's Conjecture" using the Hugin Explorer tool [58]. Each of the nodes is shown as a window with a histogram of the predictions made based on the facts entered (facts are represented by histogram bars with 100 percent probability). The scenario runs as follows. A very complex problem is represented as a fact set at "very high," and a "high" amount of design effort is allocated, rather than "very high" commensurate with the problem complexity. The design size is between 1.0-2.0 KLOC. The model then propagates these "facts" and predicts the introduced defects, detected defects, and the defect density statistics. The distribution for defects introduced peaks at two with 33 percent probability but, because less testing effort was allocated than required, the distribution of defects detected peaks around zero with probability 62 percent. The distribution for defect density at testing contrasts sharply with the residual defect density distribution, in that the defect density at testing appears very favourable. This is of course misleading, because the residual defect density distribution shows a much higher probability of higher defect density levels.

From the model we can see a credible explanation for observing large "modules" with lower defect densities. Underallocation of design effort for complex problems results in more introduced defects and higher design size. Higher design size requires more testing effort which, if unavailable, leads to fewer defects being discovered than are actually there. Dividing the small detected defect counts by large design size values will result in small defect densities at the testing stage. The model explains the "Goldilock's Conjecture" without ad hoc explanation.

Clearly the ability to use BBNs to predict defects will depend largely on the stability and maturity of the development processes. Organizations that do not collect metrics data, do not follow defined life-cycles, or do not perform any forms of systematic testing will find it hard to build or apply such models. This does not mean to say that less mature organizations cannot build reliable software; rather, it implies that they cannot do so predictably and controllably. Achieving predictability of output, for any process, demands a degree of stability rare in software development organizations. Similarly, replication of experimental results can only be predicated on software processes that are defined and repeatable. This clearly implies some notion of Statistical Process Control (SPC) for software development.

8 CONCLUSIONS

Much of the published empirical work in the defect prediction area is well in advance of the unfounded rhetoric sadly typical of much of what passes for software engineering research. However, every discipline must learn as much, if not more, from its failures as its successes. In this spirit we have reviewed the literature critically with a view to better understand past failures and outline possible avenues for future success.

Our critical review of the state-of-the-art of models for predicting software defects has shown that many methodological and theoretical mistakes have been made. Many past studies have suffered from a variety of flaws, ranging from model misspecification to use of inappropriate data. The issues and problems surrounding the "Goldilock's Conjecture" illustrate how difficult defect prediction is and how easy it is to commit serious modeling mistakes. Specifically, we conclude that the existing models are incapable of predicting defects accurately using size and complexity metrics alone. Furthermore, these models offer no coherent explanation of how defect introduction and detection variables affect defect counts. Likewise, any conclusions that large modules are more reliable and that software decomposition doesn't work are premature.
Fig. 5. A demonstration of the "Goldilock's Conjecture."
Each of the different "schools of thought" has its own view of the prediction problem, despite the interactions and subtle overlaps between process and product identified here. Furthermore, each of these views models a part of the problem rather than the whole. Perhaps the most critical issue in any scientific endeavor is agreement on the constituent elements or variables of the problem under study. Models are developed to represent the salient features of the problem in a systemic fashion. This is as much the case in the physical sciences as the social sciences. Economists could not predict the behavior of an economy without an integrated, complex, macroeconomic model of all of the known, pertinent variables. Excluding key variables such as savings rate or productivity would make the whole exercise invalid.

By taking the wider view we can construct a more accurate picture and explain supposedly puzzling and contradictory results. Our analysis of the studies surrounding the "Goldilock's Conjecture" shows how empirical results about defect density can make sense if we look for alternative explanations. Collecting data from case studies and subjecting it to isolated analysis is not enough, because statistics on its own does not provide scientific explanations. We need compelling and sophisticated theories that have the power to explain the empirical observations. The isolated pursuit of these single-issue perspectives on the quality prediction problem is, in the longer term, fruitless.

Part of the solution to many of the difficulties presented above is to develop prediction models that unify the key elements from the diverse software quality prediction models. We need models that predict software quality by taking into account information from the development process, problem complexity, defect detection processes, and design complexity. We must understand the cause and effect relations between important variables in order to explain why certain design processes are more successful than others in terms of the products they produce.

It seems that successful engineers already operate in a way that tacitly acknowledges these cause-effect relations. After all, if they didn't, how else could they control and deliver quality products? Project managers make decisions about software quality using best guesses; it seems to us that this will always be the case, and the best that researchers can do is 1) recognize this fact and 2) improve the guessing process. We, therefore, need to model the subjectivity and uncertainty that is pervasive in software development. Likewise, the challenge for researchers is in transforming this uncertain knowledge, which is already evident in elements of the
various quality models already discussed, into a prediction model that other engineers can learn from and apply. We are already working on a number of projects using Bayesian Belief Networks as a method for creating more sophisticated models for prediction [59], [60], [61], and have described one of the prototype BBNs to outline the approach. Ultimately, this research is aiming to produce a method for the statistical process control (SPC) of software production implied by the SEI's Capability Maturity Model.

All of the defect prediction models reviewed in this paper operate without the use of any formal theory of program/problem decomposition. The literature is, however, replete with acknowledgments of cognitive explanations of shortcomings in human information processing. While providing useful explanations of why designers employ decomposition as a design tactic, they do not, and perhaps cannot, allow us to determine objectively the optimum level of decomposition within a system (be it a requirements specification or a program). The literature recognizes the two structural3 aspects of software, "within" component structural complexity and "between" component structural complexity, but we lack a way to integrate these two views so as to say whether one design was more or less structurally complex than another. Such a theory might also allow us to compare different decompositions of the same solution to the same problem requirement, thus explaining why different approaches to problem or design decomposition might have caused a designer to commit more or fewer defects. As things currently stand, without such a theory we cannot compare different decompositions and, therefore, cannot carry out experiments comparing different decomposition tactics. This leaves a gap in any evolving science of software engineering that cannot be bridged using current case study based approaches, despite their empirical flavor.

3. We are careful here to use the term structural complexity when discussing attributes of design artifacts, and cognitive complexity when referring to an individual's understanding of such an artifact. Suffice it to say that structural complexity would influence cognitive complexity.

ACKNOWLEDGMENTS

The work carried out here was partially funded by the ESPRIT projects SERENE and DeVa, the EPSRC project IMPRESS, and the DISPO project funded by Scottish Nuclear. The authors are indebted to Niclas Ohlsson and Peter Popov for comments that influenced this work, and also to the anonymous reviewers for their helpful and incisive contributions.

REFERENCES

[1] N.F. Schneidewind and H.-M. Hoffmann, "An Experiment in Software Error Data Collection and Analysis," IEEE Trans. Software Eng., vol. 5, no. 3, May 1979.
[2] D. Potier, J.L. Albin, R. Ferreol, and A. Bilodeau, "Experiments with Computer Software Complexity and Reliability," Proc. Sixth Int'l Conf. Software Eng., pp. 94-103, 1982.
[3] T. Nakajo and H. Kume, "A Case History Analysis of Software Error Cause-Effect Relationships," IEEE Trans. Software Eng., vol. 17, no. 8, Aug. 1991.
[4] S. Brocklehurst and B. Littlewood, "New Ways to Get Accurate Reliability Measures," IEEE Software, pp. 34-42, July 1992.
[5] F. Akiyama, "An Example of Software System Debugging," Information Processing, vol. 71, pp. 353-379, 1971.
[6] A.E. Ferdinand, "A Theory of System Complexity," Int'l J. General Systems, vol. 1, 1974.
[7] M.H. Halstead, Elements of Software Science. Elsevier North-Holland, 1977.
[9] L.M. Ottenstein, "Quantitative Estimates of Debugging Requirements," IEEE Trans. Software Eng., vol. 5, no. 5, pp. 504-514, 1979.
[10] M. Lipow, "Number of Faults per Line of Code," IEEE Trans. Software Eng., vol. 8, no. 4, pp. 437-439, 1982.
[11] J.R. Gaffney, "Estimating the Number of Faults in Code," IEEE Trans. Software Eng., vol. 10, no. 4, 1984.
[12] T. Compton and C. Withrow, "Prediction and Control of Ada Software Defects," J. Systems and Software, vol. 12, pp. 199-207, 1990.
[13] V.R. Basili and B.T. Perricone, "Software Errors and Complexity: An Empirical Investigation," Comm. ACM, vol. 27, no. 1, pp. 42-52, 1984.
[14] V.Y. Shen, T. Yu, S.M. Thebaut, and L.R. Paulsen, "Identifying Error-Prone Software—An Empirical Study," IEEE Trans. Software Eng., vol. 11, no. 4, pp. 317-323, 1985.
[15] K.H. Moeller and D. Paulish, "An Empirical Investigation of Software Fault Distribution," Proc. First Int'l Software Metrics Symp., pp. 82-90, IEEE CS Press, 1993.
[16] L. Hatton, "The Automation of Software Process and Product Quality," Software Quality Management, M. Ross, C.A. Brebbia, G. Staples, and J. Stapleton, eds., pp. 727-744, Southampton: Computational Mechanics Publications, Elsevier, 1993.
[17] L. Hatton, C and Safety Related Software Development: Standards, Subsets, Testing, Metrics, Legal Issues. McGraw-Hill, 1994.
[18] T. Keller, "Measurements Role in Providing Error-Free Onboard Shuttle Software," Proc. Third Int'l Applications of Software Metrics Conf., La Jolla, Calif., pp. 2.154-2.166, 1992. Proc. available from Software Quality Engineering.
[19] L. Hatton, "Re-examining the Fault Density-Component Size Connection," IEEE Software, vol. 14, no. 2, pp. 89-98, Mar./Apr. 1997.
[20] T.J. McCabe, "A Complexity Measure," IEEE Trans. Software Eng., vol. 2, no. 4, pp. 308-320, 1976.
[21] B.A. Kitchenham, L.M. Pickard, and S.J. Linkman, "An Evaluation of Some Design Metrics," Software Eng. J., vol. 5, no. 1, pp. 50-58, 1990.
[22] S. Henry and D. Kafura, "The Evaluation of Software Systems' Structure Using Quantitative Software Metrics," Software—Practice and Experience, vol. 14, no. 6, pp. 561-573, June 1984.
[23] N. Ohlsson and H. Alberg, "Predicting Error-Prone Software Modules in Telephone Switches," IEEE Trans. Software Eng., vol. 22, no. 12, pp. 886-894, 1996.
[24] V. Basili, L. Briand, and W.L. Melo, "A Validation of Object-Oriented Design Metrics as Quality Indicators," IEEE Trans. Software Eng., 1996.
[25] S.R. Chidamber and C.F. Kemerer, "A Metrics Suite for Object Oriented Design," IEEE Trans. Software Eng., vol. 20, no. 6, pp. 476-498, 1994.
[26] C. Jones, Applied Software Measurement. McGraw-Hill, 1991.
[27] M.A. Cusumano, Japan's Software Factories. Oxford Univ. Press, 1991.
[28] K. Koga, "Software Reliability Design Method in Hitachi," Proc. Third European Conf. Software Quality, Madrid, 1992.
[29] K. Yasuda, "Software Quality Assurance Activities in Japan," Japanese Perspectives in Software Eng., pp. 187-205, Addison-Wesley, 1989.
[30] M. Dyer, The Cleanroom Approach to Quality Software Development. Wiley, 1992.
[31] W.S. Humphrey, Managing the Software Process. Reading, Mass.: Addison-Wesley, 1989.
[32] R.D. Buck and J.H. Robbins, "Application of Software Inspection Methodology in Design and Code," Software Validation, H.-L. Hausen, ed., pp. 41-56, Elsevier Science, 1984.
[33] N.E. Fenton, S. Lawrence Pfleeger, and R. Glass, "Science and Substance: A Challenge to Software Engineers," IEEE Software, pp. 86-95, July 1994.
[34] R.B. Grady, Practical Software Metrics for Project Management and Process Improvement. Prentice Hall, 1992.
[35] A. Veevers and A.C. Marshall, "A Relationship between Software Coverage Metrics and Reliability," J. Software Testing, Verification and Reliability, vol. 4, pp. 3-8, 1994.
[36] M.D. Neil, "Statistical Modelling of Software Metrics," PhD thesis, South Bank Univ. and Strathclyde Univ., 1992.
[37] J.M. Voas and K.W. Miller, "Software Testability: The New Verification," IEEE Software, pp. 17-28, May 1995.
[38] A. Bertolino and L. Strigini, "On the Use of Testability Measures for Dependability Assessment," IEEE Trans. Software Eng., vol. 22, no. 2, pp. 97-108, 1996.
[39] M. Diaz and J. Sligo, "How Software Process Improvement Helped Motorola," IEEE Software, vol. 14, no. 5, pp. 75-81, 1997.
[40] C. Jones, "The Pragmatics of Software Process Improvements," Software Engineering Technical Council Newsletter, Technical Council on Software Eng., IEEE Computer Society, vol. 14, no. 2, Winter 1996.
[41] J.C. Munson and T.M. Khoshgoftaar, "Regression Modelling of Software Quality: An Empirical Investigation," Information and Software Technology, vol. 32, no. 2, pp. 106-114, 1990.
[42] M.D. Neil, "Multivariate Assessment of Software Products," J. Software Testing, Verification and Reliability, vol. 1, no. 4, pp. 17-37, 1992.
[43] T.M. Khoshgoftaar and J.C. Munson, "Predicting Software Development Errors Using Complexity Metrics," IEEE J. Selected Areas in Comm., vol. 8, no. 2, pp. 253-261, 1990.
[44] J.C. Munson and T.M. Khoshgoftaar, "The Detection of Fault-Prone Programs," IEEE Trans. Software Eng., vol. 18, no. 5, pp. 423-433, 1992.
[45] N.E. Fenton and S. Lawrence Pfleeger, Software Metrics: A Rigorous and Practical Approach, second ed., Int'l Thomson Computer Press, 1996.
[46] E. Adams, "Optimizing Preventive Service of Software Products," IBM Research J., vol. 28, no. 1, pp. 2-14, 1984.
[47] N. Fenton and N. Ohlsson, "Quantitative Analysis of Faults and Failures in a Complex Software System," IEEE Trans. Software Eng., 1999, to appear.
[48] T. Stalhane, "Practical Experiences with Safety Assessment of a System for Automatic Train Control," Proc. SAFECOMP'92, Zurich, Switzerland, Oxford, U.K.: Pergamon Press, 1992.
[49] P. Hamer and G. Frewin, "Halstead's Software Science: A Critical Examination," Proc. Sixth Int'l Conf. Software Eng., pp. 197-206, 1982.
[50] V.Y. Shen, S.D. Conte, and H. Dunsmore, "Software Science Revisited: A Critical Analysis of the Theory and Its Empirical Support," IEEE Trans. Software Eng., vol. 9, no. 2, pp. 155-165, 1983.
[51] M.J. Shepperd, "A Critique of Cyclomatic Complexity as a Software Metric," Software Eng. J., vol. 3, no. 2, pp. 30-36, 1988.
[52] B.F. Manly, Multivariate Statistical Methods: A Primer. Chapman & Hall, 1986.
[53] F. Zhou, B. Lowther, P. Oman, and J. Hagemeister, "Constructing and Testing Software Maintainability Assessment Models," Proc. First Int'l Software Metrics Symp., Baltimore, Md., IEEE CS Press, 1993.
[54] J. Rosenberg, "Some Misconceptions About Lines of Code," Proc. Software Metrics Symp., pp. 137-142, IEEE Computer Society, 1997.
[55] B.A. Kitchenham, "An Evaluation of Software Structure Metrics," Proc. COMPSAC'88, Chicago, Ill., 1988.
[56] S. Cherf, "An Investigation of the Maintenance and Support Characteristics of Commercial Software," Proc. Second Oregon Workshop on Software Metrics (AOWSM), Portland, 1991.
[57] S.L. Lauritzen and D.J. Spiegelhalter, "Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems (with discussion)," J. Royal Statistical Soc. Series B, vol. 50, no. 2, pp. 157-224, 1988.
[58] HUGIN Expert Brochure. Hugin Expert A/S, Aalborg, Denmark, 1998.
[59] Agena Ltd., "Bayesian Belief Nets," http://www.agena.co.uk/bbnarticle/bbns.html
[60] M. Neil and N.E. Fenton, "Predicting Software Quality Using Bayesian Belief Networks," Proc. 21st Ann. Software Eng. Workshop, pp. 217-230, NASA Goddard Space Flight Centre, Dec. 1996.
[61] M. Neil, B. Littlewood, and N. Fenton, "Applying Bayesian Belief Networks to Systems Dependability Assessment," Proc. Safety Critical Systems Club Symp., Leeds, Springer-Verlag, Feb. 1996.
Norman E. Fenton is professor of computing science at the Centre for Software Reliability, City University, London, and is also a director at Agena Ltd. His research interests include software metrics, empirical software engineering, safety critical systems, and formal development methods. However, the focus of his current work is on applications of Bayesian nets; these applications include critical systems assessment, vehicle reliability prediction, and software quality assessment. He is a chartered engineer (member of the IEE), a fellow of the IMA, and a member of the IEEE Computer Society.
Martin Neil holds a first degree in mathematics for business analysis from Glasgow Caledonian University and a PhD in statistical analysis of software metrics jointly from South Bank University and Strathclyde University. Currently he is a lecturer in computing at the Centre for Software Reliability, City University, London. Before joining the CSR, he spent three years with Lloyd's Register as a consultant and researcher, and a year at South Bank University. He has also worked with J.P. Morgan as a software quality consultant. His research interests cover software metrics, Bayesian probability, and the software process. Dr. Neil is a director at Agena Ltd., a consulting company specializing in decision support and risk assessment of safety and business critical systems. He is a member of the CSR Council, the IEEE Computer Society, and the ACM.
IEEE TRANSACTIONS ON RELIABILITY, VOL. 51, NO. 4, DECEMBER 2002
Using Regression Trees to Classify Fault-Prone Software Modules Taghi M. Khoshgoftaar, Member, IEEE, Edward B. Allen, Member, IEEE, and Jianyu Deng
Abstract—Software faults are defects in software modules that might cause failures. Software developers tend to focus on faults, because they are closely related to the amount of rework necessary to prevent future operational software failures. The goal of this paper is to predict which modules are fault-prone and to do it early enough in the life cycle to be useful to developers. A regression tree is an algorithm represented by an abstract tree, where the response variable is a real quantity. Software modules are classified as fault-prone or not by comparing the predicted value to a threshold. A classification rule is proposed that allows one to choose a preferred balance between the two types of misclassification rates. A case study of a very large telecommunications system considered software modules to be fault-prone if any faults were discovered by customers. Our research shows that classifying fault-prone modules with regression trees, using the classification rule in this paper, resulted in predictions with satisfactory accuracy and robustness.
ACRONYMS

  EMERALD   Enhanced Measurement for Early Risk Assessment of Latent Defects
  cdf       cumulative distribution function
  fp        fault-prone
  nfp       not fault-prone
  pdf       probability density function

NOTATION

  j            identifier of a predictor
  m            number of predictor variables (Sections II and III show software metrics notation)
  l            node identifier
  n            number of objects (modules)
  x_ij         object #i's value of predictor X_j
  x_i          vector of predictor values for object #i
  faults_i     number of customer-discovered faults in object #i
  y_i          response for object #i
  ŷ_i          predicted y_i
  ȳ(l)         average response for training objects in node #l
  D(l)         s-deviance of node #l
  π_fp, π_nfp  prior probabilities of class membership
  mindev       s-deviance threshold
  minsize      minimum number of objects in a decision node
  L(x_i)       the leaf that object #i falls into
  n_l          number of training objects that fall into leaf #l
  Class_i      actual class of object #i
  Class(x_i)   predicted class of object #i, based on its x_i
  q_l          Pr{an object in leaf #l is fault-prone}
  q̂(L(x_i))    estimated q_l
  c            classification-rule parameter
  Pr{fp|nfp}   Type I misclassification rate, Pr{Class(x_i) = fp | Class_i = nfp}
  Pr{nfp|fp}   Type II misclassification rate, Pr{Class(x_i) = nfp | Class_i = fp}
  φ(·)         pdf of the Gaussian distribution
  Φ(·)         cdf of the Gaussian distribution

I. INTRODUCTION

High software reliability is important for many software systems, especially those that support society's infrastructures, such as telecommunication systems. Reliability is usually measured from the user's viewpoint in terms of time between failures, according to an operational profile [29]. A software fault is defined as a defect in an executable software product.
fault-prone. The exact nature of the software improvement processes that developers could apply to fault-prone modules is not addressed here. In a well-built system, fault-prone modules typically are only a small fraction of the total.

A variety of classification techniques have been used to model software quality, including:
• logistic regression [2], [14];
• discriminant analysis [21], [28];
• discriminant power [34], [35];
• discriminant coordinates [30];
• optimal set reduction [4];
• neural networks [24];
• fuzzy classification [7];
• classification trees [37].

A classification tree is an algorithm represented by an abstract tree of decision rules. The s-dependent variable is the response variable, which is categorical (e.g., fault-prone or not). The s-independent variables are predictors. Each internal node represents a decision that is based on a predictor. Each edge leads to a potential next decision. Each leaf is labeled with a class. An object (e.g., a software module) is classified by traversing a path from the root of the tree to a leaf, according to the values of the object's predictors. Finally, the object's response variable is assigned the leaf's class. A classification tree accommodates nonmonotonic and nonlinear relationships among combinations of variables in a model that is easy to understand and use.

References [31], [37] model software quality using the ID3 algorithm [32] to build trees using an entropy-based criterion. Reference [38] extended the ID3 algorithm by applying Akaike Information Criterion procedures [1] to prune the tree. The authors' research group has classified fault-prone modules with the CART algorithm [3], [17], [22] and the TREEDISC algorithm [23], [33], which is a refinement of the CHAID algorithm [12]. S-Plus also has an algorithm for constructing classification trees [5]. However, this algorithm does not incorporate prior probabilities of membership nor costs of misclassifications [13]. In one case study, this algorithm did not build a tree, because our data had a very small proportion of fault-prone modules. This led the authors to explore the use of regression trees for the purpose of classifying fault-prone modules.

A regression tree is also an algorithm represented by an abstract tree. However, the response variable is a real quantity, instead of a class. Decision nodes are similar to a classification tree's, but each leaf is labeled with a quantity for the response variable. The processing of an object is similar to a classification tree's, but once the object reaches a leaf, the response variable is assigned the appropriate quantity. Reference [25] briefly reports using the Classification and Regression Trees (CART) regression tree algorithm [3] to model software project productivity. Case studies in [9] and [39] used the S-Plus regression tree algorithm to predict the number of faults. As future work, [9] suggests applying a threshold to the predicted quantity to classify modules.

Tree algorithms are often considered "machine learning" techniques, because the structure of a tree is derived from processing a training data set that represents objects of interest; the algorithm "learns" from the structure of the training data set. One should evaluate a tree's accuracy with an s-unbiased method, such as cross-validation, or an evaluation data set that is similar to, but s-independent of, the training data set. Both the training and evaluation data sets must represent historical software modules where actual faults are known. After a tree model has been built and evaluated with historical data, it is ready to make predictions for a similar current development project, where predictors are known, but faults have not yet been discovered.

The accuracy of a classification model is characterized by misclassification rates. When the response variable can be one of two classes, e.g., fault-prone or not, then a model can make two kinds of misclassifications. In the application in this paper, a Type I misclassification is when the model predicts that a module is fault-prone when it is not. Conversely, a Type II misclassification is when the model predicts that a module is not fault-prone when it is.

This paper presents a method for using regression trees to classify software modules as fault-prone or not, allowing one to choose a preferred balance between Type I and Type II misclassification rates. To our knowledge, this is the first time the S-Plus regression tree algorithm has been used for classification of software quality. A case study of a very large telecommunication system illustrates the approach [6]. Future work might include a comparative study of the various tree algorithms. The remainder of this paper explains how S-Plus builds a regression tree, defines the authors' classification rule for choosing a preferred balance between misclassification rates, and presents details of the authors' case study.

II. A CLASSIFICATION RULE FOR REGRESSION TREES

A tree algorithm builds a tree based on a training set of objects, where the response variable and predictor values are known for each object. In this paper, a software module is considered an object. The response variable y_i is encoded for each module as a real number:
• 0.0 for fault-prone;
• 1.0 for not fault-prone.
S-Plus constructs a regression tree that predicts the value of this real response variable. The Appendix gives details on the S-Plus algorithm for building a regression tree. After the regression tree is built, each leaf, l, must be labeled with a class so that the tree can be used for classification. This, in effect, determines a rule for classifying objects. A threshold is applied to the ŷ_i to determine the predicted class. Because of the way y is encoded, the leaf mean μ̂_l is the proportion of not fault-prone training modules that fall into leaf l. This yields an estimate of the probability that module i is fault-prone:

    q(l) = Pr{Class_i = fp | L(x_i) = l} = 1 − μ̂_l    (1)

    ŷ_i = μ̂_{L(x_i)} = 1 − q(L(x_i))    (2)

where L(x_i) denotes the leaf that module i falls into, given its predictor values x_i.
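The rule in (1) and (2) is mechanical enough to state in code. The sketch below is ours, not the paper's S-Plus code; the tree structure, metric name, cutpoints and leaf means are hypothetical, and it simply walks a module down a fitted regression tree and compares the leaf mean with a threshold θ.

```python
# Sketch (ours): classify a module with a fitted regression tree whose leaves
# hold the mean of the 0/1-encoded response (0.0 = fault-prone, 1.0 = not
# fault-prone). By (1)-(2), the leaf mean is 1 - q(l), so comparing it with a
# threshold theta labels the leaf.

class Node:
    def __init__(self, predictor=None, cutpoint=None, left=None, right=None, mu=None):
        self.predictor = predictor   # metric tested at an internal node
        self.cutpoint = cutpoint     # go left when x[predictor] < cutpoint
        self.left, self.right = left, right
        self.mu = mu                 # leaf mean y-hat; None for internal nodes

def classify(node, x, theta):
    """Return 'fp' or 'nfp' for the module with predictor values x."""
    while node.mu is None:           # descend from root to a leaf
        node = node.left if x[node.predictor] < node.cutpoint else node.right
    return 'nfp' if node.mu > theta else 'fp'

# Hypothetical two-leaf tree split on FILINCUQ (distinct include files):
tree = Node(predictor='FILINCUQ', cutpoint=50,
            left=Node(mu=0.97), right=Node(mu=0.55))
print(classify(tree, {'FILINCUQ': 12}, theta=0.95))   # -> nfp
print(classify(tree, {'FILINCUQ': 80}, theta=0.95))   # -> fp
```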
The goal of this paper is to allow appropriate emphasis on each type of misclassification according to the preference of the project. The authors proposed such a classification rule for use with software quality models based on discriminant analysis [15], and this paper adapts it to regression trees. Let a software quality modeling technique produce a likelihood function for each class, f_nfp and f_fp. Equation (3) enables a project to select its preferred balance between the misclassification rates by choosing a parameter C:

    Class(x_i) = fp if f_nfp(x_i)/f_fp(x_i) ≤ C; nfp otherwise    (3)

When applied to a classification tree, the f_nfp and f_fp are probabilities of class membership at the leaves. Thus, the general classification rule for a regression tree is: Class(x_i) = fp if [1 − q(L(x_i))]/q(L(x_i)) ≤ C, and nfp otherwise. An alternative formulation applies a threshold directly:

    Class(x_i) = fp if 1 − q(L(x_i)) ≤ θ; nfp otherwise    (4)

where

    θ = C/(1 + C)    (5)

Software engineering is a complex human activity; consequently, it is impossible for any model to account for all the things that influence human mistakes. When a model predicts a module's classification, it might turn out to be wrong. The goal in this paper is to have correct predictions most of the time. For two classes, there are two kinds of misclassification rates:
Type I: Pr{fp|nfp}, the proportion of not fault-prone modules that are incorrectly classified as fault-prone;
Type II: Pr{nfp|fp}, the proportion of fault-prone modules that are incorrectly classified as not fault-prone.

With various classification techniques, a tradeoff is observed between Type I and Type II misclassification rates as functions of C [15], [19], [22], [41]. As Pr{nfp|fp} goes down, Pr{fp|nfp} goes up, and conversely. This paper chooses a preferred value of C empirically. Given a candidate value of C, estimate the misclassification rates Pr{fp|nfp} and Pr{nfp|fp} by resubstitution of the training data set into the model. If the balance is not satisfactory, select another candidate value of C and estimate again, until the best C is found for the project. This procedure is straightforward in practice, because the misclassification rates are monotonic functions of C. For example, if one chooses C such that Pr{fp|nfp} = Pr{nfp|fp}, then the larger misclassification rate is minimized [36]. In practice, one can achieve only approximate equality due to finite discrete data sets.

Let some software improvement process be applied to each module predicted to be fault-prone, and let C_I and C_II be the costs of Type I and Type II misclassifications, based on improvement costs, effectiveness in finding faults prior to release, and the consequences of uncorrected faults during operations. Equation (3) is a minimum-cost rule [13], [36] when

    C = (C_II/C_I)·(π_fp/π_nfp)    (6)

However, the costs of misclassifications are often difficult to estimate. If π_fp and π_nfp are estimated from the training data set, then a preferred value of C implies a subjective assessment of C_II/C_I under a cost-minimization rule.

III. EMPIRICAL CASE STUDY

The case study in this paper illustrates how a general classification rule can yield useful classification accuracy when applied to regression trees. Case studies have inherent limits on the generalizability of results. Software applications have various product characteristics, and software is developed under a variety of conditions in various organizations. Section III-A describes the subject development organization and its product, so that others can assess its similarity to their own. Sections III-B and III-C present the methodology of the case study and its empirical results.

A. System Description

For an empirical study to be credible, the software engineering community demands that the subject be a system with the following characteristics [40]:
1) developed by a group, rather than an individual;
2) developed by professionals, rather than students;
3) developed in an industrial environment, rather than an artificial setting;
4) large enough to be comparable to real industry projects.
The case study in this paper fulfills all these criteria.

The case study was of a very large legacy telecommunication system with these characteristics:
1) developed by teams in a large organization;
2) developed by professional programmers using the procedural development paradigm and a standardized development process;
3) part of a commercial product, which was an embedded, real-time system with many finite-state machines;
4) consisted of appreciably more than 10^7 lines of code in a high-level language (Protel) similar to Pascal.

Four consecutive releases (labeled 1-4 in this paper) were studied. Release 1 was used as a training data set, and the remaining three releases were used as evaluation data sets. Even though the software was appreciably enhanced from release to release, the project staff considered the software development process to be stable.

A module consisted of one or more functionally related source-code files. A problem-reporting system recorded data at the module level on customer-discovered problems. A fault was attributed to a module when source code was changed due to a customer-discovered problem. Repair of faults in deployed telecommunication systems can be extremely expensive, because a visit to a customer site is often necessary to install a patch. This study considered a module fault-prone if any faults were discovered by customers, and not fault-prone otherwise:

    Class_i = nfp if faults_i = 0; fp if faults_i > 0    (7)

A configuration-management system recorded data on changes to source-code files. Modules were identified that were unchanged from the prior release. More than 99% of the unchanged modules had no faults, i.e., almost all unchanged modules were not fault-prone. Consequently, the scope of the case study was limited to modules that had at least one change to source code since the prior release, including new modules. The set of updated modules had several million lines of code in a few thousand modules in each release.

Fault data were collected from the problem reporting system. Problem reports were tabulated and anomalies were resolved.
TABLE I
DISTRIBUTIONS OF FAULTS (FRACTION OF UPDATED MODULES)

Faults   Release 1   Release 2   Release 3   Release 4
0        93.7%       95.3%       98.7%       97.7%
1        5.1%        3.9%        1.0%        2.1%
2        0.7%        0.7%        0.2%        0.2%
3        0.3%        0.1%        0.1%        0.1%
4        0.1%        *           *           *
≥5       *           *           *           *

Table I summarizes the distribution of faults discovered by customers. The proportion of modules with no faults among the updated modules of the fit (training) data set (Release 1) was π_nfp = 0.937, and the proportion with at least one fault was π_fp = 0.063. Such a small set of modules is often difficult to identify early in development. In this study, due to a lack of detailed data, the conservative assumption was made that customers used the releases a similar amount of time. Comparison of fault-discovery rates across releases is a topic for future research.

This paper advocates a pragmatic approach to collecting software metrics, and does not recommend one set of metrics to the exclusion of others recommended in the literature. A data-mining approach is preferred [8], [18], exploiting available metric data and analyzing a broad set of candidate metrics.

The subject system was supported by the EMERALD (Enhanced Measurement for Early Risk Assessment of Latent Defects) system [10]. EMERALD was developed by Nortel Networks in partnership with Bell Canada [27]. EMERALD provides software designers and managers access to software measurements and software quality models based on those metrics. EMERALD's software metrics analysis tool measured over 50 metrics from source code. Preliminary data analysis selected metrics that were appropriate for modeling purposes. Table II lists the 24 software product metrics used in this study [20]; CAL and VARUSD were not used as predictors because they are redundant with others. The metrics measure attributes of call graphs, control flow graphs, and statements. For example, the span of variables is the number of lines of code between the first and last use of a variable in a procedure; VARSPNSM and VARSPNMX are totals and maximums, respectively.

TABLE II
SOFTWARE PRODUCT METRICS

Symbol      Description

Call-Graph Metrics
CAL         Total number of calls.
CALUNQ      Number of distinct procedure calls to others.
CAL2        Number of second and following calls to others: CAL2 = CAL − CALUNQ.

Control-Flow-Graph Metrics
CNDNOT      Number of arcs that are not conditional arcs.
CNDSPNSM    Total span of branches of conditional arcs. The unit of measure is arcs.
CNDSPNMX    Maximum span of branches of conditional arcs.
CTRNSTMX    Maximum control-structure nesting.
FILINCUQ    Number of distinct include files.
IFTH        Number of non-loop conditional arcs: if-then constructs.
KNT         Number of knots. A "knot" in a control flow graph is where arcs cross due to a violation of structured-programming principles.
LGPATH      log2(number of s-independent paths).
LOP         Number of loop constructs.
NDSENT      Number of entry nodes: the number of procedures.
NDSEXT      Number of exit nodes.
NDSINT      Number of internal nodes: not an entry, exit, or pending node.
NDSPND      Number of pending nodes: dead-code segments.

Statement Metrics
STMDEC      Number of declarative statements.
STMEXE      Number of executable statements.
VARSPNMX    Maximum span of variables.
VARSPNSM    Total span of variables.
VARUSD      Total number of variable uses.
VARUSDUQ    Number of distinct variables used.
VARUSD2     Number of second and following uses of variables: VARUSD2 = VARUSD − VARUSDUQ.

Table III lists the four execution metrics used in this study. The proportion of installations that had each module, USAGE, was approximated by data from a prior release [11]. The project considered usage across releases to be similar, because the customer base was stable. Execution times were measured in a laboratory under three workloads. For example, RESCPU is the amount of execution time of a module under the workload of a system serving consumers. Refinement of execution metrics is a topic for future research.

TABLE III
SOFTWARE EXECUTION METRICS

Symbol   Description
BUSCPU   Execution time (microseconds) of an average transaction on a system serving businesses.
RESCPU   Execution time (microseconds) of an average transaction on a system serving consumers.
TANCPU   Execution time (microseconds) of an average transaction on a tandem system.
USAGE    Deployment fraction of the module.

B. Methodology

The case study in this paper consists of these steps:
1) Collect data on historical releases, and perform preliminary data analysis.
2) Select a response variable and a broad set of candidate predictors.
3) Prepare training and evaluation data sets.
4) Build a regression tree based on the training data set using S-Plus.
5) Choose the preferred value of the classification rule's parameter, θ, based on training data-set results and project-specific criteria.
6) Classify each module in the evaluation data sets and calculate misclassification rates.
7) Evaluate model accuracy and interpret the structure of the tree.
TABLE IV
MISCLASSIFICATION RATES

 Training              Evaluation
 Release 1             Release 2             Release 3             Release 4
 Type I   Type II      Type I   Type II      Type I   Type II      Type I   Type II
  1.5%    77.7%         1.9%    83.6%         2.4%    87.2%         6.1%    70.7%
  1.5%    77.7%         1.9%    83.6%         2.4%    87.2%         6.1%    70.7%
  7.7%    45.0%         9.0%    60.9%         9.9%    59.6%        12.9%    51.0%
  9.6%    39.7%        10.7%    57.7%        11.9%    53.2%        15.0%    45.7%
 15.0%    31.0%        15.8%    42.9%        17.9%    38.3%        21.4%    39.1%
 20.4%    23.1%        20.5%    38.6%        23.7%    31.9%        30.2%    27.2%
 20.4%    23.1%        20.5%    38.6%        23.7%    31.9%        30.2%    27.2%
 20.4%    23.1%        20.5%    38.6%        23.7%    31.9%        30.2%    27.2%
 25.6%    18.8%        26.1%    26.5%        26.7%    21.3%        32.2%    21.7%
 25.6%    18.8%        25.1%    26.5%        26.7%    21.3%        32.2%    21.7%
 37.2%    12.2%        36.8%    16.4%        35.4%    17.0%        38.5%    20.7%
 54.2%     5.2%        52.7%     7.9%        54.0%    10.6%        63.0%     9.8%
 72.4%     1.8%        74.3%     3.7%        79.9%     4.3%        92.9%     3.3%

Each row corresponds to one candidate value of the threshold θ, increasing from top to bottom; the preferred θ = 0.95 yields the training rates of 25.6% (Type I) and 18.8% (Type II).
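Rates of the kind tabulated above follow from a short resubstitution loop. The sketch below is ours, not the authors' code; the inputs (a list of leaf means `y_hat` and the true classes `actual`) and the threshold grid are hypothetical.

```python
# Sketch (ours): Type I / Type II rates by resubstitution for candidate
# thresholds. Assumes both classes occur in `actual`.

def rates(y_hat, actual, theta):
    pred = ['nfp' if y > theta else 'fp' for y in y_hat]
    nfp = [p for p, a in zip(pred, actual) if a == 'nfp']
    fp = [p for p, a in zip(pred, actual) if a == 'fp']
    type1 = nfp.count('fp') / len(nfp)    # Pr{fp | nfp}
    type2 = fp.count('nfp') / len(fp)     # Pr{nfp | fp}
    return type1, type2

# Inspect the tradeoff over a grid and pick a preferred balance, e.g. Type II
# somewhat below Type I, as the case study did in choosing theta = 0.95:
def sweep(y_hat, actual, grid=(0.5, 0.7, 0.9, 0.95, 0.99)):
    return {t: rates(y_hat, actual, t) for t in grid}
```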
The model is then ready for application to a current release or a similar project.

C. Empirical Results

The case study used data on four consecutive releases of a large legacy telecommunication system. The response variable was whether a module was fault-prone or not. The candidate predictors were the 24 product metrics in Table II and the 4 execution metrics in Table III. The training data set consisted of data on updated modules from Release 1, and the evaluation data sets consisted of data on updated modules from Releases 2-4.

The S-Plus regression-tree algorithm built a tree based on the training data set. The minimum s-deviance parameter was mindev = 0.10, and the minimum node size was minsize = 40; these parameters were chosen empirically [6]. The resulting tree had 41 nodes, 21 leaves, and 11 important predictors. The important predictors are: FILINCUQ, LGPATH, NDSENT, CNDNOT, LOP, STMDEC, VARSPNMX, VARSPNSM, NDSPND, USAGE, RESCPU.

The number of distinct files included, FILINCUQ, is the predictor at Nodes 1 and 2. Because programmers typically put externally defined function prototypes in included files ("header" files), this variable indicates the variety of interfaces among files. Interfaces can easily be misunderstood by developers. The logarithm of the number of paths in the control flow graph, LGPATH, which indicates the size and complexity of the logic, was somewhat s-correlated with FILINCUQ. NDSENT is equivalent to the number of procedures in the module, because every procedure had only one entry point; this was strongly s-correlated with the size of a module. CNDNOT, LOP, and STMDEC were also s-correlated with the overall size of the module. VARSPNMX and VARSPNSM were s-correlated with each other, and indicate the locality of references to variables. Small locality of reference can improve awareness of all uses of each variable. NDSPND ("dead code": pending nodes in the control flow graph) can indicate incomplete maintenance.
USAGE is a surrogate measure for the extent that customers used a module, and thus roughly gauges opportunities to discover faults. The execution time of a module, e.g., RESCPU, also indicates opportunities for faults to manifest as failures.

Table IV shows how misclassification rates vary as a function of θ. The Type II misclassification rate was preferred to be less than the Type I rate for the training data set, and the misclassification rates were preferred to be approximately balanced. Thus θ = 0.95 was chosen (its row is marked in Table IV). Another project might prefer a different criterion for choosing θ. For example, due to resource constraints, one might prefer to limit the total fraction predicted to be fault-prone.

Fig. 1 depicts the tree. The cutpoint for each decision node is marked on its left edge. For example, at the root node (Node 1), if, as marked on the left edge, FILINCUQ < 50, then the algorithm represented by the tree proceeds to the left (Node 2), and otherwise to the right (Node 25). Each leaf shows its mean response, μ̂_l, and the preferred class for θ = 0.95. For example, upon arriving at Leaf 5 for module i, the algorithm assigns ŷ_i = μ̂_5 = 0.995. By (2) and (5), 1 − q(L(x_i)) = ŷ_i > θ = 0.95, and thus the predicted class of module i is not fault-prone: Class(x_i) = nfp.

If all the leaves descending from a decision node have the same classification, then one can draw an equivalent simplified tree. For example, both of the leaves descending from Node 39 are labeled fault-prone, because μ̂_l ≤ θ for both. Consequently, the decision at Node 39 does not affect the classification. One could redraw the tree, replacing Node 39 and its child nodes with a leaf labeled fault-prone.

Even though the S-Plus regression tree algorithm allows only binary splits, some combinations of nodes are equivalent to multiway splits. For example:
• if FILINCUQ < 35, then a module probably has low risk, as determined by the subtree at Node 3;
• if 35 ≤ FILINCUQ < 50, then a module's class is predicted by the subtree at Node 16;
• otherwise (50 ≤ FILINCUQ), a module probably has high risk, as determined by the subtree at Node 25.
In other words, Nodes 1 and 2 together form a 3-way
split. Similarly, consecutive nodes elsewhere that represent a single concept can be viewed as a multiway split. For example:
• Nodes 26 and 27 indicate a 3-way split on USAGE;
• Nodes 18, 19, and 20 indicate a 4-way split on the concept of span of variables.

Fig. 1. Regression tree with classifications.

Consider the benefits of using the preferred model to target software improvement efforts [16]. For example, let a current release be similar to the last release in the study, Release 4, having π_fp = 2.3% fault-prone modules and π_nfp = 97.7% not fault-prone (see Table I). Recall that the preferred θ = 0.95, i.e., C = θ/(1 − θ) = 19. For Release 4, the model correctly predicted the class of 78.3% = 1 − 0.217 of the fault-prone modules, and incorrectly predicted that 32.2% of the not fault-prone modules were fault-prone. For the hypothetical current release, the model
would predict that 33.3% = 0.783·π_fp + 0.322·π_nfp of the modules are fault-prone, and thus are candidates for improvement. Of these candidates, one would anticipate 5.4% = 0.783·(π_fp/0.333) to be fault-prone. For comparison, if one randomly chose a set of modules for improvement, only 2.3% = π_fp would actually be fault-prone.

Let the:
• cost of improving any module be C_I = 1 unit;
• value of improving a not fault-prone module be negligible;
• cost-avoidance (benefit) of improving a fault-prone module be C_II = 807.1 = C·(π_nfp/π_fp), by (6) under a minimum-cost classification rule.

In light of the high cost of fixing faults in telecommunication software after release, and the very small proportion of fault-
prone modules, this subjective estimate of the cost ratio appears to be plausible. The cost of improving n modules predicted to be fault-prone by the model would be n units. The value of improving those candidates that were fault-prone would be 43.66·n = 0.054·C_II·n. Thus, the profit for using the model's predictions would be

    Profit = 42.66·n = 43.66·n − n

and the return on investment would be

    ROI = 4266% = 42.66·n / n

For comparison, improving n randomly selected modules would result in a profit of

    Profit = 17.56·n = π_fp·C_II·n − n

and a return on investment of ROI = 1756%.

That is, using predictions from this model would more than double the profit of improving a random selection of modules. Thus, under a plausible assessment of C_II/C_I, the level of accuracy of this preferred classification model in Table IV could be very useful to a software development project.
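As a check on the arithmetic above, the quantities can be recomputed directly. The script below is ours; it re-derives the Release 4 figures quoted in the text to within rounding.

```python
# Our re-derivation of the Release-4 cost-benefit arithmetic.
pi_fp, pi_nfp = 0.023, 0.977             # Table I, Release 4
theta = 0.95
C = theta / (1 - theta)                  # = 19, from (5)
C_II = C * (pi_nfp / pi_fp)              # = 807.1, from (6) with C_I = 1
hit_fp, false_alarm = 0.783, 0.322       # 1 - Type II, and Type I, at theta

flagged = hit_fp * pi_fp + false_alarm * pi_nfp   # 0.333 flagged as fault-prone
precision = hit_fp * pi_fp / flagged              # 0.054 of those truly fault-prone
profit_per_flagged = precision * C_II - 1         # roughly 42.7 units per module
print(C, round(C_II, 1), round(flagged, 3),
      round(precision, 3), round(profit_per_flagged, 2))
```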
APPENDIX
BUILDING A REGRESSION TREE WITH S-PLUS

S-Plus requires that each predictor be an ordinal measure; only ranks of quantitative predictors are considered. In this application, all the predictors are software metrics, modeled as ordinal measures, where x_ij is the value of predictor j for module i.

In the course of the tree-building algorithm, modules in the training data set are assigned to nodes, and thus are "modules in a node." The algorithm initially assigns all the modules in the training data set to the root node. The algorithm then recursively partitions each node's modules into two subsets that are assigned to child nodes, until a stopping criterion halts further partitioning.

For the purpose of regression, this algorithm assumes that the response variable is s-normally distributed [5]:

    y_i ∼ N(μ_i, σ²)    (8)

μ_i is estimated by the mean value of y over all training modules that fall in the same leaf as module i. The variance, σ², is assumed to be constant for all modules. For classification, violation of these assumptions by the response variable was not a practical problem.

The s-deviance of module i is minus twice the log-likelihood, scaled by σ², which reduces to [5]:

    D(μ_i; y_i) = (y_i − μ_i)²    (9)

The s-deviance of a node l is the sum of the s-deviances of all the training modules in the node [5]:

    D(μ_l; y) = Σ_{i∈l} (y_i − μ_i)²    (10)

If all modules in a node have the same value of y, then each is equal to the mean, and thus the s-deviance is zero.

Outline of the S-Plus regression-tree algorithm [5]:
1) Initialize the current node.
2) If the current node is not null, then:
   a) For each predictor, partition the current node's set of objects into two subsets, choosing a cutpoint for the current predictor that minimizes the sum of the s-deviances of the left and right prospective child nodes:

        D(μ_left, μ_right; y) = Σ_{i∈left} D(μ_left; y_i) + Σ_{i∈right} D(μ_right; y_i)    (11)

   b) Choose the predictor whose best split maximizes the change in s-deviance between the s-deviance of the current node and the sum of the s-deviances of the prospective child nodes:

        ΔD = D(μ_l; y) − D(μ_left, μ_right; y)    (12)

   c) If one of the following conditions, (13) or (14), is true for the current node, then do not split the current node:
      • The node s-deviance is less than a small fraction of the root-node s-deviance:

            D(μ_l; y) / D(μ_root; y) < mindev    (13)

      • The number of modules in the current node, n_l, is less than a threshold:

            n_l < minsize    (14)

      else:
      A) Recursively call the algorithm to process the left child node;
      B) Recursively call the algorithm to process the right child node.
3) Return the tree.

Once a tree is built, it can be used to predict each y_i of a set of current modules. The predicted value of the response variable for module i is the mean of the training modules in the leaf it falls into:

    ŷ_i = μ̂_{L(x_i)}    (15)

The parameters mindev and minsize are tools for controlling overfitting. Future work will evaluate these parameters and pruning a large overfitted tree to control overfitting.
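A compact rendering of this outline follows. It is our sketch, not S-Plus source; the actual implementation differs in data structures and in its handling of ties and ranks.

```python
# Sketch (ours) of the outline above: grow a binary regression tree, choosing
# at each node the predictor and cutpoint that minimise the summed s-deviance
# (11), and stopping via mindev (13) and minsize (14).

def deviance(ys):                                  # eq. (10)
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def grow(X, y, root_dev, mindev=0.10, minsize=40):
    node = {'mu': sum(y) / len(y)}                 # leaf mean, used by eq. (15)
    if len(y) < minsize or deviance(y) < mindev * root_dev:
        return node                                # stopping criteria (13)-(14)
    best = None
    for j in range(len(X[0])):                     # every predictor ...
        for c in sorted(set(row[j] for row in X))[1:]:   # ... every cutpoint
            L = [i for i, row in enumerate(X) if row[j] < c]
            R = [i for i, row in enumerate(X) if row[j] >= c]
            d = deviance([y[i] for i in L]) + deviance([y[i] for i in R])
            if best is None or d < best[0]:        # maximises (12)
                best = (d, j, c, L, R)
    if best is None:                               # no admissible split exists
        return node
    _, j, c, L, R = best
    node['var'], node['cut'] = j, c
    node['left'] = grow([X[i] for i in L], [y[i] for i in L], root_dev, mindev, minsize)
    node['right'] = grow([X[i] for i in R], [y[i] for i in R], root_dev, mindev, minsize)
    return node

# Usage: tree = grow(X, y, root_dev=deviance(y)) for a 0/1-encoded response y.
```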
ACKNOWLEDGMENT

The authors are pleased to thank W. D. Jones and the EMERALD team for collecting the case-study data, J. P. Hudepohl for his encouragement and support, and the anonymous reviewers for their thoughtful comments.

REFERENCES
[1] H. Akaike, "Factor analysis and AIC," Psychometrika, vol. 52, no. 3, pp. 317-332, 1987.
[2] V. R. Basili, L. C. Briand, and W. Melo, "A validation of object-oriented design metrics as quality indicators," IEEE Trans. Software Engineering, vol. 22, no. 10, pp. 751-761, Oct. 1996.
[3] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Chapman & Hall, 1984.
[4] L. C. Briand, V. R. Basili, and C. J. Hetmanski, "Developing interpretable models with optimized set reduction for identifying high-risk software components," IEEE Trans. Software Engineering, vol. 19, no. 11, pp. 1028-1044, Nov. 1993.
[5] L. A. Clark and D. Pregibon, "Tree-based models," in Statistical Models in S, J. M. Chambers and T. J. Hastie, Eds. Wadsworth, 1992, pp. 377-419.
[6] J. Deng, "Classification of software quality using tree modeling with the S-Plus algorithm," Master's thesis (advised by Taghi M. Khoshgoftaar), Florida Atlantic University, Dec. 1999.
[7] C. Ebert, "Classification techniques for metric-based software development," Software Quality J., vol. 5, no. 4, pp. 255-272, Dec. 1996.
[8] U. M. Fayyad, "Data mining and knowledge discovery: Making sense out of data," IEEE Expert, vol. 11, no. 4, pp. 20-25, Oct. 1996.
[9] S. S. Gokhale and M. R. Lyu, "Regression tree modeling for the prediction of software quality," in Proc. Third ISSAT Int. Conf. Reliability and Quality in Design, 1997, pp. 31-36.
[10] J. P. Hudepohl, S. J. Aud, and T. M. Khoshgoftaar et al., "EMERALD: Software metrics and models on the desktop," IEEE Software, vol. 13, no. 5, pp. 56-60, Sept. 1996.
[11] W. D. Jones, J. P. Hudepohl, T. M. Khoshgoftaar, and E. B. Allen, "Application of a usage profile in software quality models," in Proc. Third European Conf. Software Maintenance and Reengineering. IEEE Computer Soc., 1999, pp. 148-157.
[12] G. V. Kass, "An exploratory technique for investigating large quantities of categorical data," Appl. Statistics, vol. 29, pp. 119-127, 1980.
[13] T. M. Khoshgoftaar and E. B. Allen, "Classification of fault-prone software modules: Prior probabilities, costs, and model evaluation," Empirical Software Engineering: An International Journal, vol. 3, no. 3, pp. 275-298, Sept. 1998.
[14] T. M. Khoshgoftaar and E. B. Allen, "Logistic regression modeling of software quality," Int. J. Reliability, Quality, and Safety Engineering, vol. 6, no. 4, pp. 303-317, Dec. 1999.
[15] T. M. Khoshgoftaar and E. B. Allen, "A practical classification-rule for software quality models," IEEE Trans. Reliability, vol. 49, no. 2, pp. 209-216, June 2000.
[16] T. M. Khoshgoftaar, E. B. Allen, W. D. Jones, and J. P. Hudepohl, "Return on investment of software quality models," in Proc. 1998 IEEE Workshop on Application-Specific Software Engineering and Technology, pp. 145-150.
[17] T. M. Khoshgoftaar, E. B. Allen, W. D. Jones, and J. P. Hudepohl, "Classification tree models of software quality over multiple releases," in Proc. Tenth Int. Symp. Software Reliability Engineering, 1999, pp. 116-125.
[18] T. M. Khoshgoftaar, E. B. Allen, W. D. Jones, and J. P. Hudepohl, "Data mining for predictors of software quality," Int. J. Software Engineering and Knowledge Engineering, vol. 9, no. 5, pp. 547-563, 1999.
[19] T. M. Khoshgoftaar, E. B. Allen, W. D. Jones, and J. P. Hudepohl, "Which software modules have faults that will be discovered by customers?," J. Software Maintenance: Research and Practice, vol. 11, no. 1, pp. 1-18, Jan. 1999.
[20] T. M. Khoshgoftaar, E. B. Allen, W. D. Jones, and J. P. Hudepohl, "Accuracy of software quality models over multiple releases," Annals of Software Engineering, vol. 6, 2000.
[21] T. M. Khoshgoftaar, E. B. Allen, K. S. Kalaichelvan, and N. Goel, "Early quality prediction: A case study in telecommunications," IEEE Software, vol. 13, no. 1, pp. 65-71, Jan. 1996.
[22] T. M. Khoshgoftaar, E. B. Allen, and A. Naik et al., "Using classification trees for software quality models: Lessons learned," Int. J. Software Engineering and Knowledge Engineering, vol. 9, no. 2, pp. 217-231, 1999.
[23] T. M. Khoshgoftaar, E. B. Allen, and X. Yuan et al., "Preparing measurements of legacy software for predicting operational faults," in Proc. Int. Conf. Software Maintenance, 1999, pp. 359-368.
[24] T. M. Khoshgoftaar and D. L. Lanning, "A neural network approach for early detection of program modules having high risk in the maintenance phase," J. Systems and Software, vol. 29, no. 1, pp. 85-91, Apr. 1995.
[25] B. A. Kitchenham, "A procedure for analyzing unbalanced datasets," IEEE Trans. Software Engineering, vol. 24, no. 4, pp. 278-301, Apr. 1998.
[26] M. R. Lyu, "Introduction," in Handbook of Software Reliability Engineering. McGraw-Hill, 1996, ch. 1, pp. 3-25.
[27] J. Mayrand and F. Coallier, "System acquisition based on software product assessment," in Proc. 18th Int. Conf. Software Engineering, 1996, pp. 210-219.
[28] J. C. Munson and T. M. Khoshgoftaar, "The detection of fault-prone programs," IEEE Trans. Software Engineering, vol. 18, no. 5, pp. 423-433, May 1992.
[29] J. D. Musa, "Operational profiles in software reliability engineering," IEEE Software, vol. 10, no. 2, pp. 14-32, Mar. 1993.
[30] N. Ohlsson, M. Zhao, and M. Helander, "Application of multivariate analysis for software fault prediction," Software Quality J., vol. 7, pp. 51-66, 1998.
[31] A. A. Porter and R. W. Selby, "Empirically guided software development using metric-based classification trees," IEEE Software, vol. 7, no. 2, pp. 46-54, Mar. 1990.
[32] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, pp. 81-106, 1986.
[33] SAS Institute staff, "TREEDISC macro (beta version)," Technical report, SAS Institute, 1995. Documentation with macros.
[34] N. F. Schneidewind, "Software metrics validation: Space Shuttle flight software example," Annals of Software Engineering, vol. 1, pp. 287-309, 1995.
[35] N. F. Schneidewind, "Software metrics model for integrating quality control and prediction," in Proc. 8th Int. Symp. Software Reliability Engineering, 1997, pp. 402-415.
[36] G. A. F. Seber, Multivariate Observations. John Wiley & Sons, 1984.
[37] R. W. Selby and A. A. Porter, "Learning from examples: Generation and evaluation of decision trees for software resource analysis," IEEE Trans. Software Engineering, vol. 14, no. 12, pp. 1743-1756, Dec. 1988.
[38] R. Takahashi, Y. Muraoka, and Y. Nakamura, "Building software quality classification trees: Approach, experimentation, evaluation," in Proc. 8th Int. Symp. Software Reliability Engineering, 1997, pp. 222-233.
[39] J. Troster and J. Tian, "Measurement and defect modeling for a legacy software system," Annals of Software Engineering, vol. 1, pp. 95-118, 1995.
[40] L. G. Votta and A. A. Porter, "Experimental software engineering: A report on the state of the art," in Proc. 17th Int. Conf. Software Engineering, 1995, pp. 277-279.
[41] X. Yuan, "Modeling software quality with TREEDISC," Master's thesis (advised by Taghi M. Khoshgoftaar), Florida Atlantic University, Dec. 1999.
Taghi M. Khoshgoftaar is a Professor in the Department of Computer Science and Engineering, Florida Atlantic University, and is also Director of the Empirical Software Engineering Laboratory. His research interests are in software engineering, software measurements, software reliability and quality engineering, computational intelligence, computer performance evaluation, and multimedia systems.
Edward B. Allen received the B.S. in 1971 in engineering from Brown University, Rhode Island; the M.S. in 1973 in systems engineering from the University of Pennsylvania, Philadelphia; and the Ph.D. in 1995 in computer science from Florida Atlantic University, Boca Raton; his work for this paper was performed while he was at this university. He is an assistant professor in the Department of Computer Science at Mississippi State University. He began his career as a programmer with the U.S. Army. From 1974 to 1983, he performed system engineering and software engineering on military systems, first for Planning Research Corp. and then for Sperry Corp. From 1983 to 1992, he developed corporate data processing systems for Glenbeigh, Inc., a specialty health care company. From 1995 to 2000, he performed research in software engineering at Florida Atlantic University. His research interests include software measurement, software process, software quality, and computer performance modeling. He has more than 60 refereed publications in these areas. He is a member of the IEEE Computer Society and the Association for Computing Machinery.
Jianyu Deng received the M.S. in 1999 in computer science from Florida Atlantic University, Boca Raton. She is a software engineer with Motorola. Her research interests include software engineering and software quality.
Information and Software Technology 43 (2001) 863-873
www.elsevier.com/locate/infsof
Can genetic programming improve software effort estimation? A comparative evaluation

Colin J. Burgess (a,*), Martin Lefley (b)
(a) Department of Computer Science, University of Bristol, Merchant Venturers Building, Woodland Road, Bristol BS8 1UB, UK
(b) School of Design Engineering and Computing, University of Bournemouth, Talbot Campus, Poole BH12 5BB, UK
1. Introduction

Most organisations must decide how to allocate valuable resources based on predictions of the unknown future. The example studied here is software effort estimation, where volume and costs are not directly proportionally related [12]. Any improvement in the accuracy of prediction of development effort can significantly reduce the costs from inaccurate estimation, misleading tendering bids and disabling the monitoring of progress. Accurate modelling can also assist in scheduling resources and evaluating risk factors.

In this paper, we evaluate the potential benefits from applying the tool of genetic programming (GP) to improve the accuracy of software effort estimation. In order to do this, we compare it with other methods of prediction, in terms of accuracy and other qualitative factors deemed important to potential users of such a tool. All the parameters used for the estimation process are restricted to those available at the specification stage. For those readers who are already familiar with the concepts of GP, Sections 6.1 and 6.2 can be skipped.

Learning allows humans to solve hugely complex problems at speeds which outperform even the fastest computers of the present day [36]. Machine learning (ML) techniques have been used successfully in solving many difficult problems, such as speech recognition from text [37], adaptive control [34,16], integrated circuit synthesis [31] and mark-up estimation in the construction industry [16]. Due to their minimal assumptions and broad and deep coverage of the solution space, ML approaches deserve exploration to evaluate the contribution they can make to the prediction of software effort.

One of the first methods used to estimate software effort automatically was COCOMO [5], where effort is expressed as a function of anticipated size. The general form of the model tends to be:

    E = a·S^b    (1)

where E is the effort required, S is the anticipated size, and a and b are domain-specific constants. Others have developed local models, using statistical techniques such as stepwise regression, e.g. Kok et al. [23].
Linear approaches to the problem cover only a small

Table 1
Summary of the data set features

Independent variables:
- Numeric identifier
- Team experience in years
- Project manager's experience in years
- Number of transactions processed
- Number of entities
- Unadjusted function points
- Adjusted function points
- Development environment
- Year of completion
- Complex measure derived from other factors, defining the environment

Dependent variable:
- Effort, measured in person-hours

approaches. Emphasis is placed on assisting the software engineering manager to realistically consider the use of GP as an estimator of effort. For example, parameters that cannot be measured at the outset of a project, such as lines of code, are not used. Evaluation does not merely focus on prediction accuracy but also on other important factors necessary for good design of an estimation methodology. We attempt to assess GP against alternative techniques with minimum bias, to give a fair comparison. We have tried to avoid putting a lot of tuning effort into one paradigm with the misguided aim of proving that it is better than other methods that have been comparatively sparsely explored. Some researchers obtain significant benefits by removing alleged outliers from the data sets, but we have avoided this, since we want to use as much of the data as possible. Also, the determination of outliers can incorporate bias, complicate comparison and could be based on any number of heuristics; thus, this is not a simple task for an estimator with only, in general, a limited data set available. Another danger is to generate many diverse solutions so that a best model, based on query performance, can be selected that will always fit very closely to the evaluation data. Ultimately, we test the hypothesis that a GP based solution can learn the nuances of a data set and make its own decisions for issues such as variable weighting and pre-processing, dealing with outliers and the selection of feature subsets.
domains such as software development environments. For example, Kemerer [21] and Conte et al. [6] frequently found errors considerably in excess of 100% even after model calibration.

A variety of ML methods have been used to predict software development effort. Artificial neural networks (ANNs) [20,42], case-based reasoning (CBR) [15] and rule induction (RI) [19,22,40] offer good examples. Hybrids are also possible [22]; e.g. Shukla [39] reports better results from using an evolving ANN compared to a standard back-propagation ANN. Dolado [9,10] and Fernandez and Dolado [13] analyse many aspects of the software effort estimation problem. Recent research by Dolado [11] shows promising results for a GP based estimation system on a single input variable. This research extends this idea into richer models requiring larger populations and much longer learning lifetimes. This paper investigates the potential for the use of GP methods to
3. Data set used for comparisons and Weibull distribution modelling

In order to explore and compare the potential of the three ML techniques for building effort prediction models, we selected an existing project effort data set. The data set used comprises 81 software projects derived from a Canadian software house in the late 1980s by Jean-Marc Desharnais [8]. Despite the fact that this data is now over 10 years old, it is one of the larger, publicly available, data sets and
Table 2
The project number and effort of the 18 projects selected for measuring prediction capability

Project number    Effort (person-hours)
2                 5635
8                 3913
12                9051
15                4977
19                4494
22                651
35                5922
38                2352
42                14987
50                8232
56                2926
61                2520
63                2275
72                1386
can be used to assess the comparative performance of new approaches. The data set comprised 10 features, one dependent and nine independent, as summarised by Table 1. A second dependent feature, namely length of code, is ignored, as it is far less important to predict than total effort. Four of the 81 projects contained missing values, so these were replaced by random samples from the other projects. In order to compare many different prediction paradigms, the data was divided into two sets, which are called a training set and a query set. The query set chosen was 18 projects, which were randomly selected from the 81 projects, i.e. approximately 22% of the total (Table 2). The remaining 63 projects were used as the learning set. In order to eliminate all possible distortion and to simulate more
realistically the practical situation, the query set was not used in
any of the further analysis or tuning, apart from measuring the accuracy of the final prediction of effort.

The training output values have been found by the authors to be fitted well by Weibull distributions, with a highly significant Chi-Square and smaller error than other candidate distributions. The Weibull parameters were found by piecewise approximation and are provided in Table 3. One useful measure of how much the input or independent variables influence the output or dependent variables is to calculate the correlation coefficients between input and output. Since variables may have functional relationships, a number of functions of input variables (such as logarithmic and exponential) were used to search for the highest correlations. None of the functions was found to have increased the raw correlations by enough to justify their continued use. An illustration of the output data's distribution is given in Fig. 1, along with the fitted distributions.

Table 3
Training data output statistics

Effort
Mean             4908.556
SD               4563.736
Minimum          546
Maximum          23940
Weibull Alpha    1.41
Weibull Beta     4830

Fig. 1. Distribution of effort along with fitted Weibull distribution.

4. How to evaluate the techniques?

Software effort prediction is typified by comparatively small data sets with prediction errors having a significant effect on costs. An important question that needs to be asked of any estimation method is 'How accurate are the predictions?'. A number of summary statistics were used, none of which can be seen as significantly better than any of the others. Ultimately the decision should be made by the user, based on estimates of the costs of under- or over-estimating project effort. All are based on the calculated error, i.e. the difference between the predicted and observed output values for the projects. The methods used in this paper were:

(a) Correlation coefficient.
(b) Adjusted mean square error (AMSE).
(c) Predictions within 25% (Pred(25)).
(d) Percentage of predictions within 25% (Pred(25)%).
(e) Mean magnitude of relative error (MMRE).
(f) Balanced mean magnitude of relative error (BMMRE).

Most commonly, accuracy is defined in terms of the mean magnitude of relative error (MMRE) [10], which is the mean of absolute percentage errors:

    MMRE = (100/n) Σ_{i=1..n} |E_i − Ê_i| / E_i    (2)

where E_i is the actual effort and Ê_i is the predicted effort of project i, and there are n projects. There has been some criticism of this measure, in particular the fact that it is unbalanced and penalises over-estimates more than under-estimates. For this reason, Miyazaki et al. [33] proposed a balanced MMRE measure as follows:

    BMMRE = (100/n) Σ_{i=1..n} |E_i − Ê_i| / min(E_i, Ê_i)    (3)

This approach has been criticised by Hughes [18], amongst others, as effectively being two distinct measures that should not be combined. Another approach is to use Pred(25)%, which is the percentage of predictions that fall within 25% of the actual value.

It is clear that the choice of accuracy measure depends to a large extent upon the objectives of those using the prediction system. For example, MMRE is fairly conservative with a bias against over-estimates, whilst Pred(25) supports those prediction systems that are generally accurate but occasionally wildly inaccurate. Other workers have used the adjusted R squared, or coefficient of determination, to indicate the percentage of variation in the dependent variable that can be 'explained' in terms of the independent variables. In this paper, we replace this with the simpler correlation coefficient. The simplest raw measure is the mean square error; however, this depends on the mean of the data sets and it is thus difficult to interpret or make comparisons. Instead, we use AMSE, the adjusted mean square error. This is the sum of the squared errors, divided by the product of the means of the predicted and observed outputs:

    AMSE = Σ_{i=1..n} (E_i − Ê_i)² / (Ē · Ê̄)    (4)
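These measures are straightforward to compute. The sketch below is ours and implements (2)-(4) and Pred(25)% for paired lists of actual and predicted efforts; it assumes strictly positive values throughout.

```python
# Sketch (ours) of the accuracy measures: MMRE (2), BMMRE (3), Pred(25)% and
# AMSE (4), for actual efforts E and predictions E_hat (both positive).

def mmre(E, E_hat):
    return 100 / len(E) * sum(abs(e - p) / e for e, p in zip(E, E_hat))

def bmmre(E, E_hat):
    return 100 / len(E) * sum(abs(e - p) / min(e, p) for e, p in zip(E, E_hat))

def pred25(E, E_hat):
    return 100 * sum(abs(e - p) / e <= 0.25 for e, p in zip(E, E_hat)) / len(E)

def amse(E, E_hat):
    mean_e = sum(E) / len(E)
    mean_p = sum(E_hat) / len(E_hat)
    return sum((e - p) ** 2 for e, p in zip(E, E_hat)) / (mean_e * mean_p)
```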
4.1. Qualitative measures

Whatever measures are being used, it is clear that although accuracy is an important consideration, it is not sufficient to consider the accuracy of prediction systems in isolation. Hence, in assessing the utility of these techniques, we have considered three factors, adapted from Mair et al. [30]: accuracy, explanatory value and ease of configuration.

Accuracy has been the primary concern of researchers, and clearly it is of considerable importance; a prediction system that fails to meet some minimum threshold of accuracy will not be acceptable. However, we believe that accuracy, by itself, is not a sufficient condition for acceptance. The quality of information provided for the estimator is of great importance. Empirical research has indicated that end-users coupled with prediction systems can outperform either prediction systems or end-users alone [41]. The more explanations given for how a prediction was obtained, the greater the power given to the estimator. If predictions can be explained, estimators may experiment with 'what if' scenarios, and meaningful explanations can increase confidence in the prediction.

Apart from accuracy, other evaluation factors for a prediction system, which could be important to non-expert practitioners, are:

(a) Resources required:
    (i) Time and memory needed to train.
    (ii) Time and memory needed to query.
(b) Ease of set up.
(c) Transparency of solution or decision.
(d) Generality.
(e) Robustness.
(f) Likelihood of convergence.
(g) Prediction beyond the learning data set space.

5. Previous related work using machine learning

Any method of transforming data may be used for software estimation, for example, least squares regression. This section considers ML techniques previously recommended for comparative effort estimation, viz. CBR and ANNs. These techniques have been selected on the grounds that there exists adequate previous research to promote their efficacy and that, because of their significantly differing approaches, they form a good basis for comparisons.

5.1. Artificial neural networks

ANNs are learning systems inspired by organic neural systems, which are known to be very successful at learning to solve problems. They comprise a network of simple interconnected units called neurons. The connections between the neurons are weighted, and the values of these weights determine the function of the network. Input values are multiplied by the weights, summed, passed through a step function and then on to other neurons and finally to the output neurons. The weights are selected to optimise the output vector produced from a given input vector. Typically, this selection is carried out by adjusting the weights systematically, based on a given data set, a process of training. The most common method of training is by back propagation. Here, the weights begin with small random values, and then a proportion of the difference between desired and current output, called the error, is used to adjust the weights back through the net, proportionally to their contribution to the error. Thus, the outputs are moved towards the desired values and the net learns the required behaviour.

Recent studies concerned with the use of ANNs to predict software development effort have focused on comparative accuracy with algorithmic models, rather than on the suitability of the approach for building software effort prediction systems. An example is the investigation by Wittig and Finnie [46]. They explore the use of a back propagation neural network on the Desharnais and ASMA (Australian Software Metrics Association) data sets. For the Desharnais data set, they randomly split the projects three times between 10 test and 71 training sets, which is very similar to the procedure we follow in our paper. The results from three validation sets are aggregated and yield a high level of accuracy. The values cited are Desharnais set MMRE = 27% and ASMA set MMRE = 17%, although some outlier values are excluded. Mair et al. [30] show neural networks offer accurate effort prediction, though not as good as Wittig and Finnie, but conclude that they are difficult to configure and interpret.
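A minimal sketch of the training loop just described follows. It is ours, not any cited study's code; the network size, learning rate and random data are illustrative only.

```python
# Minimal back-propagation sketch (ours): one hidden layer trained by
# gradient descent on illustrative random data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((63, 9))            # 63 projects, 9 features (illustrative)
y = rng.random((63, 1))            # normalised effort (illustrative)

W1 = rng.normal(0, 0.1, (9, 8))    # input -> hidden weights
W2 = rng.normal(0, 0.1, (8, 1))    # hidden -> output weights
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for epoch in range(2000):
    h = sigmoid(X @ W1)            # hidden activations
    out = h @ W2                   # linear output neuron
    err = out - y                  # error to propagate back
    W2 -= 0.01 * h.T @ err / len(X)
    W1 -= 0.01 * X.T @ ((err @ W2.T) * h * (1 - h)) / len(X)
```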
5.2. Case-based reasoning

CBR is based on the psychological concepts of analogical reasoning, dynamic memory and the role of previous situations in learning and problem solving [36]. Cases are abstractions of events, solved or unsolved problems. New problems are solved by a weighted concatenation of the solutions from the set of similar cases. Aarmodt and Plaza [1] describe CBR as being cyclic and composed of four steps:
1. Retrieval of similar cases.
2. Reuse of the retrieved cases to find a solution to the problem.
3. Revision of the proposed solution if necessary.
4. Retention of the solution to form a new case.

Consequently, issues concerning case characterisation [35], similarity [2,45] and solution revision [28] must be addressed prior to CBR system deployment.

Software project estimation has been tackled by a number of researchers using CBR. Three documented examples are Estor [43], finding analogies for cost estimation (FACE) [4] and ANGEL [38]. A summary of each of these works
follows.

Estor uses the estimator's protocols, and infers rules from these. An analogy searching approach is used to produce estimates which the developers claim were comparable, in terms of R-squared values, to the expert's, and superior to those obtained using function points, regression based techniques, or COCOMO.

FACE was developed by Bisio and Malabocchia, who assessed it using the COCOMO data set. It allocates a normalised similarity score θ, with values between 0 and 100, for each candidate analogy from the case repository. A user threshold, typically θ = 70, is used to decide which cases are used to form the estimate. If no cases score above the threshold, then reliable estimation is not deemed possible. The research shows that FACE performs very favourably against algorithmic techniques such as regression.

Shepperd and Schofield report on their tool ANGEL, an estimation tool based upon analogical reasoning. Here projects, or cases, are represented in a Euclidean hyperspace where a modified nearest neighbour algorithm identifies the best analogies. They report results, derived from a number of data sets, of superior performance to LSR models.

Other approaches include that of Debuse and Rayward-Smith [7], who apply simulated annealing algorithms to the problem of feature subset selection, which could then be used with other tools. There are also those who consider supplementary tools to enhance another method, for example an evolutionary tool. Though such approaches could be used in conjunction with the tools here, assessing the complete tools available is the goal of this work. Whatever overall methodology is used, the aim is to find the most accurate predictors, within the measurement constraints already outlined in Section 4.
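The core of such analogy-based estimation fits in a few lines. The sketch below is ours, not ANGEL's code; the function and variable names are hypothetical, and it predicts effort as the mean of the k nearest cases in a normalised feature space.

```python
# Sketch (ours) of nearest-neighbour analogy estimation: predict a new
# project's effort as the mean effort of its k nearest neighbours in
# Euclidean feature space. Features are assumed already scaled to [0, 1].

def analogy_estimate(cases, efforts, query, k=3):
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    ranked = sorted(range(len(cases)), key=lambda i: dist(cases[i], query))
    nearest = ranked[:k]
    return sum(efforts[i] for i in nearest) / k
```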
6. Background to genetic programming

6.1. Introduction to genetic algorithms

Genetic algorithms (GA) were developed as an alternative technique for tackling general optimisation problems with large search spaces. They have the advantage that they do not need any prior knowledge, expertise or logic related to the particular problem being solved. Occasionally, they can produce the optimum solution, but for most problems with a large search space, a good approximation to the optimum is a more likely outcome.

The basic ideas used are based on the Darwinian theory of evolution, which in essence says that genetic operations between chromosomes eventually lead to fitter individuals, which are more likely to survive. Thus, over a long period of time, the population of the species as a whole improves. However, not all of the operations used in the computer analogy of this process necessarily have a biological equivalent.

In the computer implementation of these ideas, a solution, but not necessarily a very good solution, to the problem being solved is typically represented by a fixed length binary string, which is termed a chromosome by analogy to the biological equivalent. The fitness measure of any individual is a measure of how near it is to the optimal solution. One example, which is relevant to the problem tackled here, is that a fitter solution minimises the error between the predicted values and the true values. The basic steps of the algorithm are as follows, although a number of variations are possible:

1. Generate at random a population of solutions, i.e. a family of chromosomes.
2. Create a new population from the previous one by applying genetic operators to the fittest chromosomes, or pairs of fittest chromosomes, of the previous population.
3. Repeat step (2) until either the fitness of the best solution has converged or a specified number of generations have been produced.
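A minimal sketch of this loop follows. It is ours; the bit-string representation matches the description above, but the operator choices and rates (elitism fraction, mutation probability, parent pool) are illustrative only.

```python
# Sketch (ours) of the basic GA loop: fixed-length bit strings, elitist
# reproduction, one-point crossover and rare bit-flip mutation.
import random

def ga(fitness, n_bits=32, pop_size=100, generations=200, elite=0.05, p_mut=0.01):
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)                 # fittest first
        new = [row[:] for row in pop[:max(1, int(elite * pop_size))]]  # elitism
        while len(new) < pop_size:
            a, b = random.sample(pop[:pop_size // 2], 2)    # two fit parents
            cut = random.randrange(1, n_bits)               # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < p_mut:                     # occasional mutation
                child[random.randrange(n_bits)] ^= 1
            new.append(child)
        pop = new
    return max(pop, key=fitness)

best = ga(fitness=sum)    # toy fitness: maximise the number of 1-bits
```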
The best solution in the final generation is taken as the approximation to the optimum for that problem that can be attained in that run. The whole process is normally run a number of times, using different seeds to the pseudo-random number generator. The key parameters that have to be determined for any given problem are:
(a) The best way of representing a solution as a fixed length binary string.
(b) The best combination of genetic operators. For GA,
reproduction, crossover and mutation are the most common.
(c) Choosing the best fitness function, to measure the fitness of a solution.
(d) Trying to keep enough diversity in the solutions in a population to allow the process to converge to the global optimum, but not converge prematurely to a local optimum.

Fig. 2. (a) Illustration of the crossover operator before operation. The double line illustrates where the trees are cut. (b) Illustration of the crossover operator after operation. The double line illustrates where the sub-trees have been swapped.
More information related to GA can be found in Goldberg [14] or Mitchell [32].

6.2. Introduction to genetic programming
GP is an extension of GA, which removes the restriction that the chromosome representing the individual has to be a fixed length binary string. In general, in GP, the chromosome is some type of program, which is then executed to obtain the required results. One of the simplest forms of program, which is sufficient for this application, is a binary tree containing operators and operands. This means that each solution is an algebraic expression, which can be evaluated. Koza "offers a large amount of empirical evidence to support the counter-intuitive and surprising conclusion that GP can be used to solve a large number of seemingly different problems from many different fields" [24]. He goes on to state that GP offers solutions in representations of computer programs. These offer the flexibility to:

(a) Perform operations in a hierarchical way.
(b) Perform alternative computations conditioned on the outcome of intermediate calculations.
(c) Perform iterations and recursions.
(d) Perform computations on variables of different types.
(e) Define intermediate values and sub-programs so that they can be subsequently reused.

The determination of a program to model a problem offers many advantages. Inspection of the genetic program solutions potentially offers understanding of the forces of behaviour behind the population. For example, Langley et al. [27] describe BACON, a system that successfully rediscovered scientific laws (Ohm's law, Coulomb's law, Boyle's law and Kepler's law) from given finite samples of data. As so many of the phenomena in our universe may be represented by programs, there is encouraging evidence for the need to explore a general program space for solutions to problems from that universe.

The initial preparation for a GP system has several steps.
Table 4
Main parameters used for the GP system

Parameter                            Value
Size of population                   1000
Number of generations                500
Number of runs                       10
Maximum initial full tree depth      5
Maximum number of nodes in a tree    64
Percentage of elitism                5

First, it is necessary to choose a suitable alphabet of operands and operators. The operands are normally the independent input variables to the system and normally include an ephemeral random constant (ERC), which is a random variable from within a suitable range, e.g. 0-1.0. The operators should be rich enough to cover the type of functionality expected in solutions, e.g. trigonometric functions if solutions are expected to be periodic, but having too many operators can hamper convergence. Secondly, it is necessary to construct an initial population. In this paper, we use a set of randomly constructed trees from the specified alphabet, although this is dependent on the type of problem being solved and the representation chosen for the programs. For an initial population of trees, a good and common approach is called Ramped Half and Half. This means that half the trees are constructed as full trees, i.e. operands only occur at the maximum depth of the tree, and half are trees with a random shape. Within each half, an equal number of trees are constructed for each depth, between some minimum and maximum depths. This is found to give a good selection of trees in the original population.

The main genetic operations used are reproduction and crossover. Mutation is rarely used, and other more specialised operators are sometimes used, but not for the problem tackled in this paper. Reproduction is the copying of one individual from the previous generation into the next generation unchanged. This often takes the form of elitism, where the top n% of the solutions, as measured by fitness, is copied straight into the next generation, where n can be any value but is typically 1-10. The crossover operator chooses a node at random in the first chromosome, called crossover point 1, and the branch to that node is cut. Then it chooses a node at random in the second chromosome, called crossover point 2, and the branch to that node is cut. The two sub-trees produced below the cuts are then swapped. The method of performing crossover can be illustrated using an example, see Fig. 2a and b. Although this example includes a variety of operations, for simplicity in this application only the set {+, -, *} was made available. More complex operators can of course be developed by combining these simple operations. Simple operators eliminate the bounding problems associated with more complex operations such as XOR, which is not defined for negative or non-integer values. The multiply is a protected multiply which prevents the return of any values greater than 10^20. This is to minimise the likelihood of real overflow occurring during the evaluation of the solutions. On average, 10% of the operands used were random constants in the range 0-1.0.

6.2.1. Illustration of the crossover operator

The parent expressions before the crossover, and the two new children produced by the crossover operation, are shown in Fig. 2a and b. Thus two new individuals are created, whose fitness can be evaluated.

Fig. 2. (a) Illustration of the crossover operator before operation. The double line illustrates where the trees are cut. (b) Illustration of the crossover operator after operation. The double line illustrates where the sub-trees have been swapped.
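A minimal sketch of the subtree crossover just illustrated, storing each expression tree as a nested tuple; the uniform random choice of crossover points and the 64-node limit follow the description above, while the representation itself is an illustrative assumption.

    import random

    # A tree is a variable name (str), a constant (float), or (op, left, right).

    def nodes(tree, path=()):
        """Yield the index path to every node in the tree."""
        yield path
        if isinstance(tree, tuple):
            for i, sub in enumerate(tree[1:], start=1):
                yield from nodes(sub, path + (i,))

    def get(tree, path):
        for i in path:
            tree = tree[i]
        return tree

    def put(tree, path, sub):
        if not path:
            return sub
        t = list(tree)
        t[path[0]] = put(tree[path[0]], path[1:], sub)
        return tuple(t)

    def crossover(mum, dad, max_nodes=64):
        """Swap a random subtree of each parent; children that grow too large are discarded."""
        p1 = random.choice(list(nodes(mum)))
        p2 = random.choice(list(nodes(dad)))
        children = [put(mum, p1, get(dad, p2)), put(dad, p2, get(mum, p1))]
        return [c for c in children if sum(1 for _ in nodes(c)) <= max_nodes]

For example, crossover(('+', 'X', 3.0), ('*', ('-', 'X', 'Y'), 'X')) returns two new expression trees whose fitness can then be evaluated.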
6.2.2. Controlling the GP algorithm

Since trees tend to grow as the algorithm progresses, a maximum depth (or maximum number of nodes) is normally imposed in order that the trees do not get too large. This is often a lot larger than the maximum size allowed in the initial population. Any trees produced by crossover that are too large are discarded. The reason for imposing the size limit is to save on both the storage space required and the execution time needed to evaluate the trees. There is also no evidence that allowing very large trees will necessarily lead to improved results. The basic algorithm is the same as for GA, as given in Section 6.1. Key parameters that have to be determined for any given problem are:

(a) The best way of choosing the alphabet, and representing the solution as a tree (or other structure).
(b) The best genetic operators.
(c) Choosing the best fitness function, to measure the fitness of a solution.
(d) Trying to keep enough diversity in the solutions in a population to allow the process to converge to the global optimum but not converge prematurely to a local optimum.
(e) Choosing sensible values for the population size, maximum tree size, number of generations etc. in order to get good solutions without using too much time or space.

More information related to GP can be found in Banzhaf et al. [3] and in Koza [25,26].
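Returning to the Ramped Half and Half initialisation described earlier in this section, a minimal sketch using the same nested-tuple trees as the crossover example; the terminal set shown (two input variables plus an ephemeral random constant) and the early-termination probability are illustrative assumptions.

    import random

    OPS = ['+', '-', '*']            # the operator set used in this application
    TERMS = ['X1', 'X2', 'ERC']      # illustrative operands; 'ERC' draws a constant in 0-1.0

    def random_tree(depth, full):
        """Grow one tree; `full` forces operands to occur only at the maximum depth."""
        if depth == 0 or (not full and random.random() < 0.3):
            t = random.choice(TERMS)
            return random.uniform(0.0, 1.0) if t == 'ERC' else t
        return (random.choice(OPS), random_tree(depth - 1, full), random_tree(depth - 1, full))

    def ramped_half_and_half(pop_size, min_depth=2, max_depth=5):
        """Half full trees, half randomly shaped, spread evenly over the depth range."""
        depths = list(range(min_depth, max_depth + 1))
        return [random_tree(depths[i % len(depths)], full=(i % 2 == 0))
                for i in range(pop_size)]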
7. Applying GP to software effort estimation

The software effort estimation problem is an example of a
Table 5
Comparing the best prediction systems within each paradigm. Best performing estimators are highlighted in bold type. The figures for ANNs and GP are averages over ten executions of the systems. For Pred(25), AMSE and MMRE, * denotes significance at the 5% level and ** at the 1% level.

                Random     Linear LSR  2 nearest    5 nearest    Artificial       Genetic
                                       neighbours   neighbours   neural network   programming
Correlation     -0.16589   0.557       0.550        0.586        0.635            0.752
AMSE            9.749167   *7.378      **6.432      **5.733      **5.477          11.13
Pred(25)        3          10          8            8            10               4.2
Pred(25)%       16.67      **55.56     **44.44      **44.44      **55.56          23.5
MMRE            181.72     **46.18     162.30       168.30       **60.63          44.55
BMMRE           191        59          66           70           69               74.57
symbolic regression problem, which means: given a number of sets of data values for a set of input parameters and one output parameter, construct an expression of the input parameters which best predicts the value of the output parameter for any set of values of the input parameters. In this paper, GP is applied to the Desharnais [8] data set already described in Section 3, with the same 63 projects in the Learning Set and the remaining 18 projects in the Test Set as used for the other methods. The parameters chosen for the GP system, after a certain amount of experimentation, are shown in Table 4. The results obtained depend on the fitness function used. In order to allow comparison with the other methods, the fitness function was designed to minimise the MMRE measure as applied to the Learning Set of data. The values of MMRE quoted in the results are the result of applying the solution obtained to the Test Set of data. The GP system is written in C and runs on a shared Sun 4 × 170 MHz Ultra Sparc, which means that timings are approximately equivalent to a 300 MHz Pentium. However, the run-time core size is approximately 4 Mbytes, to store the trees required in any one generation. One run of 500 generations takes 10 min of c.p.u. time.
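For reference, the two accuracy measures most relied upon here can be computed as follows; this is a sketch of the standard definitions of MMRE and Pred(25) (reported below as a percentage and a count, respectively), while the AMSE and balanced MMRE (BMMRE) variants used in the tables are not reproduced.

    def mmre(actual, predicted):
        """Mean magnitude of relative error, as a percentage."""
        return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

    def pred(actual, predicted, level=0.25):
        """Number of projects predicted within `level` of the actual effort, e.g. Pred(25)."""
        return sum(abs(a - p) / a <= level for a, p in zip(actual, predicted))

Used as a GP fitness function, one would minimise mmre on the learning set and then report the value obtained on the test set.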
7.1. Comparing the GP results with other methods

The hypothesis to be tested is that GP can produce a significantly better estimate of effort than the other techniques. In the comparison, those techniques that contain a stochastic element are tested over a population of solutions, independently generated. The ANN uses random starting values, and GP uses random selection of the initial population and of the crossover points. The range of solutions obtained has implications for both the accuracy of the solutions and the ease of configuration. The software effort estimator must consider all available tools for improving the accuracy of estimates. The aim of this paper is to provide the estimator with the information to make an informed decision on the use of GP for this task. Since the data has been shown to be modelled by a Weibull distribution (Section 3), it is possible to produce random sets of data by computer. This means that statistical tests can be performed to compare the predictors with random data, to isolate statistically significant accuracy. Table 5 tests the predictors at 1 and 5% levels of significance. If the results between estimators show universal increases in accuracy, then comparative statistical tests may also be applied.

Ease of configuration depends mainly on the number of parameters required to set up the learner. For example, a nearest neighbour system needs parameters to decide the weight of the variables, and the method for determining closeness and combining neighbours to form a solution. Many learners need an evaluation function to determine error, to decide on the future search direction. The choice of method to make this evaluation is crucial for accurate modelling. Koza [24] lists the number of control parameters for GP as being 20, compared to 10 for neural networks, but it is not always easy to count what constitutes a control parameter. However, it is not so much the number of the parameters as the sensitivity of the accuracy of the solutions to their variation. It is possible to search using different parameters, but this will depend on their range and the granularity of the search. In order to determine the ease of configuration for a genetic program, we test empirically whether the parameter values suggested by Koza offer sufficient guidance to locate suitably accurate estimators. Similarly, all of our solutions use either limited experimentation or commonly used values for control parameters.
8. Results of the comparison

This paper has evaluated ML techniques, including a GP tool, for making software project effort predictions. We believe that the best way to assess the practical utility of these techniques would be to consider them within the context of their interaction with an example of an intended user, viz. a software project manager. However, since the data set we have used is one used by many other workers, and based on the late 1980s, this is not possible. Thus we will perform the comparisons based on the accuracy of the results, the ease of configuration and the transparency of the solutions.

8.1. Accuracy of the predictions

Table 5 lists the main results used for the comparison.
Table 6
Population behaviour for the best and worst (chosen on the basis of MMRE) of the population of solutions from the ANN and GP systems

Estimated effort   Artificial neural network      Genetic programming
                   Worst    Average   Best        Worst    Average   Best
Correlation        0.588    0.635     0.650       0.612    0.752     0.824
AMSE               6.278    5.477     5.209       14.58    11.13     7.77
Pred(25)           10       10        10          2        4.2       5
Pred(25)%          56       56        56          11.2     23.5      28
MMRE               65.45    60.63     59.23       52.12    44.55     37.95
BMMRE              74       69        66          92.47    74.57     59.23
The various different types of regression equations all gave insignificantly differing results from each other, and so only the linear LSR results are quoted. Results for ANNs are less accurate than those reported by Wittig and Finnie [46], although this is for a different data set, and this may be, in part, due to the impact of their removing outlier projects in some of the validation sets. For the GP and ANN solutions, we have generated 10 solutions to assess the reliability of accuracy. The population behaviour is summarised in Table 6. The neural network seems to converge fairly consistently to a reasonable solution, with a difference of 6% in MMRE. In contrast, the GP system is consistently more accurate for MMRE but does not converge as consistently as the ANN to a good solution, with a 14% variation in MMRE. This suggests that more work needs to be done to try to prevent premature convergence in the GP system. For AMSE and Pred(25), the ANN is more accurate and consistent. Early results from using GP suggest that optimising one particular measure, in this case MMRE, has the effect of degrading a lot of the other measures, and that a fitness function that is not specifically tied to one particular measure may give more acceptable overall results. This is illustrated in Tables 5 and 6, which give superior results for MMRE and correlation, but rather poor results for Pred(25), AMSE and BMMRE. This suggests that more research is required not only on GP, but also on the problem itself, as to which of the many measures or combination of measures is the most appropriate in practice.

8.2. Transparency of the solution

One of the benefits of LSR is that it makes explicit the weight and contribution of each input used by the prediction system. This can lead to insights about the problem, for example to direct efficiencies. CBR, or estimation by analogy, also has potential explanatory value since projects are ordered by degree of similarity to the target project. Indeed, it is instructive that this technique demonstrates the effectiveness of user-involvement in performing better, when the user is able to manipulate the data and modify predicted outputs. However, although this suggests an understanding of the data by the user, it gives little indication of the contribution of specific variables. Changing parameters could allow some exploration of this space, but complex interactions between variables and small data sets make exploration limited.

The neural nets used within this study do not allow the user to see the rules generated by the prediction system. If a particular prediction is surprising, it is hard to establish any rationale for the value generated. It is difficult to understand an ANN merely by studying the net topology and individual node weights. To extract rules from the best ANN, a method of pruning was used as suggested by Lefley [29]. This reduced the net to the following nodes and links that were making a significant contribution to the error:

Node 10 = -0.35*X7 - 0.81*X8
Node 11 = 1.60 + 0.39*X1 - 0.69*X4 - 0.67*X6 - 1.075981*X7 - 1.40*X8
Node 12 = 0.42*X5 + 0.61*X1
Node 13 = -0.28*X6 - 0.74*X1 - 3.24*X8
Node 14 = 0.32*X6 + 0.33*X1 + 0.52*X8
Output node 15 = 1.59 - 2.27*N10 - 4.22*N11 + 1.12*N12 - 2.41*N13 + 1.09*N14
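Because the pruned network is so small, it can be evaluated as plain arithmetic; the sketch below transcribes the node expressions above, on the assumption that the pruned read-out is purely linear (any activation functions in the full network are omitted, as they are not shown here).

    def pruned_ann(x):
        """Evaluate the pruned net; x[1]..x[8] are the (scaled) inputs X1..X8.
        Assumes a linear read-out: activation functions are omitted."""
        n10 = -0.35 * x[7] - 0.81 * x[8]
        n11 = 1.60 + 0.39 * x[1] - 0.69 * x[4] - 0.67 * x[6] - 1.075981 * x[7] - 1.40 * x[8]
        n12 = 0.42 * x[5] + 0.61 * x[1]
        n13 = -0.28 * x[6] - 0.74 * x[1] - 3.24 * x[8]
        n14 = 0.32 * x[6] + 0.33 * x[1] + 0.52 * x[8]
        return 1.59 - 2.27 * n10 - 4.22 * n11 + 1.12 * n12 - 2.41 * n13 + 1.09 * n14

    # Purely illustrative call with all eight scaled inputs set to 0.5:
    # pruned_ann([None] + [0.5] * 8)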
Though this does not paint a very clear picture of how a decision is made, it indicates the importance of variables X6, X7 and X8 (see Table 1), and careful examination might yield a little more information. However, even with such an analysis, ANNs do remain hard to interpret, certainly harder than, say, LSR, for small gains in terms of accuracy.

Potentially, GP can produce very transparent solutions, in the sense that the solution is an algebraic expression. At present, the good solutions for this problem have the maximum number of nodes, i.e. 64 operators or operands, and the expressions do not simplify a great deal. Further work may find solutions with fewer nodes, which may make the results more understandable analytically. In contrast, increasing the number of nodes may provide better estimates, but the results are likely to be less transparent and the computation time will increase significantly.
8.3. Ease of configuration of the system
The third factor in comparing prediction systems is what we term ease of configuration; in other words, how much effort is required to build the prediction system in order to generate useful results. Regression analysis is a well-established technique with good tool support. Even allowing for analysis of residuals and so forth, little effort needs to be
expended in building a satisfactory regression model. CBR needs relatively little work, though more might be gained by relative weighting of the inputs. The number of neighbours did not seem to have a significant effect. By contrast, we found it took some effort to configure the neural net and it required a fair degree of expertise. Generally, different architectures, algorithms and parameters made little difference to results, but some expertise is required to choose reasonable values. Although heuristics have been published on this topic [17,44], we found the process largely to be one of trial and error guided by experience. ANN techniques are not easy to use for project estimation by end-users, as some expertise is required.

GP has many parameters too: the choice of functions, crossover and reproduction rates, percentage of elitism, dealing with out of bounds conditions, creation strategy and setting maximum tree sizes and depths, i.e. code lengths, to name some of the more significant. We found that by following the decisions suggested by Koza [24] we obtained good results, but again, at present, some expertise is still needed.

9. Conclusions and future work

In this paper, we have compared techniques for predicting software project effort. These techniques have been compared in terms of accuracy, transparency and ease of configuration. Despite finding that there are differences in prediction accuracy levels, we argue that it may be other characteristics of these techniques that will have an equal, if not greater, impact upon their adoption. We note that the explanatory value of both estimation by analogy (CBR) and rule induction gives them an advantage when considering their interaction with end-users. Problems of configuring neural nets tend to counteract their superior performance in terms of accuracy. We are encouraged that this early investigation of GP shows it can provide accurate estimates. The results highlight that measuring accuracy is not simple and researchers should consider a range of measures. Further, we highlight that where there is a stochastic element, learners must also show acceptable, consistent accuracy for a population of solutions, which complicates simple comparisons. We believe these results show that this approach warrants further investigation, particularly to explore the effects of various parameters on the models in terms of improving robustness and accuracy. It also offers the potential to provide more transparent solutions, but this aspect requires further research. Perhaps a useful conclusion is that the more elaborate estimation techniques such as ANNs and GPs can provide better accuracy but require more effort in setting up and training. A trade off between accuracy from complexity and ease of interpretation also seems inevitable.

Acknowledgements

The authors are grateful to Jean-Marc Desharnais for making his data set available. We also thank the journal reviewers for their helpful comments, which have significantly improved the quality of this report.

References

[1] A. Aamodt, E. Plaza, Case-based reasoning: foundational issues, methodical variations and system approaches, AI Commun. 7 (1994) 39-59.
[2] D.W. Aha, Case-Based Learning Algorithms, 1991 DARPA Case-Based Reasoning Workshop, Morgan Kaufmann, Los Altos, CA, 1991.
[3] W. Banzhaf, P. Nordin, R.E. Keller, F.D. Francone, Genetic Programming: An Introduction, Morgan Kaufmann, Los Altos, CA, 1998.
[4] R. Bisio, F. Malabocchia, Cost estimation of software projects through case based reasoning, International Conference on Case Based Reasoning, Sesimbra, Portugal, 1995.
[5] B.W. Boehm, Software Engineering Economics, Prentice Hall, New Jersey, 1981.
[6] S. Conte, H.E. Dunsmore, V.Y. Shen, Software Engineering Metrics and Models, Benjamin/Cummings, Menlo Park, CA, 1986.
[7] J.C.W. Debuse, V.J. Rayward-Smith, Feature subset selection within a simulated annealing data mining algorithm, J. Intell. Inf. Syst. 9 (1997) 57-81.
[8] J.M. Desharnais, Analyse statistique de la productivité des projets informatiques à partir de la technique des points de fonction, Unpublished Masters Thesis, University of Montreal, 1989.
[9] J.J. Dolado, A study of the relationships among Albrecht and Mark II Function Points, lines of code 4GL and effort, J. Syst. Soft. 37 (1997) 161-172.
[10] J.J. Dolado, A validation of the component-based method for software size estimation, IEEE Trans. Soft. Engng 26 (2000) 1006-1021.
[11] J.J. Dolado, On the problem of the software cost function, Inf. Soft. Tech. 43 (2001) 61-72.
[12] S. Drummond, Measuring applications development performance, Datamation 31 (1985) 102-108.
[13] L. Fernandez, J.J. Dolado, Measurement and prediction of the verification cost of the design in a formalized methodology, Inf. Soft. Tech. 41 (1999) 421-434.
[14] D.E. Goldberg, Genetic Algorithms in Search, Optimisation and Machine Learning, Addison Wesley, Reading, MA, 1989.
[15] A.R. Gray, S.G. MacDonell, A comparison of techniques for developing predictive models of software metrics, Inf. Soft. Tech. 39 (1997) 425-437.
[16] T. Hegazy, O. Moselhi, Analogy based solution to markup estimation problem, J. Comput. Civ. Engng 8 (1994) 72-87.
[17] S. Huang, Y. Huang, Bounds on the number of hidden neurons, IEEE Trans. Neural Networks 2 (1991) 47-55.
[18] R.T. Hughes, Expert judgement as an estimating method, Inf. Soft. Tech. 38 (1996) 67-75.
[19] M. Jorgensen, Experience with the accuracy of software maintenance task effort prediction models, IEEE Trans. Soft. Engng 21 (1995) 674-681.
[20] N. Karunanithi, D. Whitley, Y.K. Malaiya, Using neural networks in reliability prediction, IEEE Soft. 9 (1992) 53-59.
[21] C.F. Kemerer, An empirical validation of cost estimation models, CACM 30 (1987) 416-429.
[22] H.C. Kennedy, C. Chinniah, P. Bradbeer, L. Morss, Construction and evaluation of decision trees: a comparison of evolutionary and concept learning methods, in: D. Corne, J.L. Shapiro (Eds.), Evolutionary Computing, Springer, Berlin, 1997.
[23] P. Kok, B.A. Kitchenham, J. Kirakowski, The MERMAID approach to software cost estimation, Proceedings of Esprit Technical Week, 1990.
[24] J.R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, Cambridge, 1993.
[25] J.R. Koza, Genetic Programming II: Automatic Discovery of Reusable Programs, MIT Press, Cambridge, 1994.
[26] J.R. Koza, F.H. Bennett, D. Andre, M.A. Keane, Genetic Programming III: Darwinian Invention and Problem Solving, Morgan Kaufmann, Los Altos, CA, 1999.
[27] P. Langley, H.A. Simon, G.L. Bradshaw, J.M. Zytkow, Scientific Discovery: Computational Explorations of the Creative Process, MIT Press, Cambridge, 1987.
[28] D. Leake, Case-Based Reasoning: Experiences, Lessons and Future Directions, AAAI Press, Menlo Park, 1996.
[29] M. Lefley, T. Kinsella, Investigating neural network efficiency and structure by weight investigation, Proceedings of the European Symposium on Intelligent Technologies, Germany, 2000.
[30] C. Mair, G. Kadoda, M. Lefley, K. Phalp, C. Schofield, M. Shepperd, S. Webster, An investigation of machine learning based prediction systems, J. Syst. Soft. 53 (1) (2000).
[31] R. San Martin, J.P. Knight, Genetic algorithms for optimization of integrated circuit synthesis, Proceedings of the Fifth International Conference on Genetic Algorithms and their Applications, Morgan Kaufmann, San Mateo, CA, 1993, pp. 432-438.
[32] M. Mitchell, An Introduction to Genetic Algorithms, MIT Press, Cambridge, 1996.
[33] Y. Miyazaki, A. Takanou, H. Nozaki, N. Nakagawa, K. Okada, Method to estimate parameter values in software prediction models, Inf. Soft. Tech. 33 (1991) 239-243.
[34] K.S. Narendra, K. Parthasarathy, Identification and control of dynamical systems using neural networks, IEEE Trans. Neural Networks 1 (1990) 4-27.
[35] E. Rich, K. Knight, Artificial Intelligence, McGraw-Hill, New York, 1995.
[36] R. Schank, Dynamic Memory: A Theory of Reminding and Learning in Computers and People, Cambridge University Press, 1982.
[37] T.J. Sejnowski, C.R. Rosenberg, Parallel networks that learn to pronounce English text, Complex Syst. 1 (1987) 145-168.
[38] M.J. Shepperd, C. Schofield, Estimating software project effort using analogies, IEEE Trans. Soft. Engng 23 (1997) 736-743.
[39] K.K. Shukla, Neuro-genetic prediction of software development effort, Inf. Soft. Tech. 42 (2000) 701-713.
[40] K. Srinivasan, D. Fisher, Machine learning approaches to estimating software development effort, IEEE Trans. Soft. Engng 21 (1995) 126-136.
[41] E. Stensrud, I. Myrtveit, Human performance estimating with analogy and regression models: an empirical validation, Proceedings of the Fifth International Metrics Symposium, IEEE Computer Society, Bethesda, MD, 1998.
[42] A.R. Venkatachalam, Software cost estimation using artificial neural networks, International Joint Conference on Neural Networks, IEEE, Nagoya, 1993.
[43] S. Vicinanza, M.J. Prietula, Case-based reasoning in software effort estimation, Proceedings of the 11th International Conference on Information Systems, 1990.
[44] S. Walczak, N. Cerpa, Heuristic principles for the design of artificial neural networks, Inf. Soft. Tech. 41 (1999) 107-117.
[45] I. Watson, F. Marir, Case-based reasoning: a review, The Knowledge Engng Rev. 9 (1994) 327-354.
[46] G. Wittig, G. Finnie, Estimating software development effort with connectionist models, Inf. Soft. Tech. 39 (1997) 469-476.
Optimal software release scheduling based on artificial neural networks Tadashi Dohi, Yasuhiko Nishio and Shunji Osaki Department of Industrial and Systems Engineering, Faculty of Engineering, Hiroshima University, 4-1 Kagamiyama 1 Chome, Higashi-Hiroshima 739-8527, Japan E-mail: [email protected]
The determination of the optimal software release schedule plays an important role in supplying sufficiently reliable software products to the actual market or users. In existing methods, the optimal software release schedule is determined by assuming a stochastic and/or statistical model called a software reliability growth model. In this paper, we propose a new method to estimate the optimal software release timing which minimizes the relevant cost criterion via artificial neural networks. Recently, artificial neural networks have been actively studied, with many practical applications, and have been applied to assess software product reliability. First, we interpret the underlying cost minimization problem as a graphical one and show that it can be reduced to a simple time series forecasting problem. Secondly, artificial neural networks are used to estimate future fault-detection times. In numerical examples with actual field data, we compare the new method based on neural networks with existing parametric methods using some software reliability growth models, and illustrate its benefit in terms of predictive performance. A comprehensive bibliography on the software release problem is presented.
These SRGMs have several advantages in supporting decision making in the development management of software products. For instance, it is useful for the software project manager to determine the optimal time to stop software testing and to deliver the system to users. This problem, called the optimal software release problem, has also been discussed in many papers. Since the seminal contribution by Okumoto and Goel [1980], many authors have formulated optimal software release problems based on several SRGMs. A comprehensive bibliography on the software release problem is presented in the following section. On the other hand, it is pointed out that most of the SRGMs explored have not been applied to practical software development projects. The reason for this is that none of these SRGMs give sufficiently accurate estimates of reliability. Thus, it should naturally be recognized that optimal software release schedules based on unrealistic SRGMs will lose their validity. Since, in general, most SRGMs in the literature have treated software as a black box and have sometimes assumed specific behaviors of the fault-detection process in the sense of expectation, they cannot be expected to function well for the assessment of software in a different situation where such a scenario can never be assumed. Moreover, the black-box approach does not always function, since we have to capture the very complex reliability behavior of real software systems based upon the limited and incomplete information observed in the testing phase. When software fault-occurrence events are observed, the fault-detection time data may be treated as time series data. Then it will be useful to predict future fault-detection times by applying an appropriate time series forecasting technique instead of the existing SRGMs. Karunanithi et al. [1991, 1992] and Karunanithi and Malaiya [1992, 1996] applied some kinds of neural networks to evaluate software reliability. Independently, Khoshgoftaar and Szabo [1994] and Shinohara et al. [1996] compared the neural network approaches with existing parametric SRGMs in terms of predictive ability. Considering the fact that artificial neural networks are actively applied in many practical applications, they should be utilized for other software engineering applications as well as for the assessment of software product reliability. For example, Srinivasan and Fisher [1995] and Venkatachalam [1993] used neural networks to estimate software development effort or cost instead of the theoretical models proposed by Boehm [1981, 1984]. In this direction, artificial neural networks with high information processing abilities have often been applied to evaluate the complex structure in software. The purpose of the present paper is to develop a new method to estimate the optimal software release time which minimizes the relevant cost criterion, by applying artificial neural networks. More precisely, it is shown that the underlying cost minimization problem can be reduced to a graphical one and is essentially equal to a time series forecasting problem. This fact shows that the earlier analytical approaches to the software release problem can be interpreted from a different perspective, and enables us to determine the optimal schedule via different statistical inference devices. This paper is organized as follows. In the following section, we review the optimal software release problems with reference to earlier works. In section 3, we formulate a
cost-based optimal software release problem and provide a graphical method. In section 4, statistical methods to estimate the optimal software release time from empirical fault-detection time data are presented. After introducing two parametric methods, the statistical estimator of the optimal software release time is defined. Section 5 introduces two typical neural network models with three layers: the multi-layer perceptron neural (MLPN) network and the recurrent neural (RN) network. The MLPN network is a supervised network, by which is meant that the data used for training and testing the network have a required response of the network, known as the target. The RN network, on the other hand, is found in the majority of successful applications, from both research and industrial viewpoints, and is highly appropriate for forecasting time series. As the learning algorithm for both networks, the back-propagation (BP) algorithm is used [Rumelhart and McClelland 1986]. The numerical examples in section 6 are devoted to comparing the neural network approaches with those based on some SRGMs in terms of predictive performance and to evaluating the proposed method quantitatively. Finally, the paper is concluded with some remarks in section 7.
2. Background and literature survey
The optimal software release problem was formulated first by Okumoto and Goel [1980]. They assumed that the temporal variation of the number of cumulative software faults detected in the testing phase followed an exponential SRGM [Goel and Okumoto 1979] based on the NHPP, and attempted to derive analytically the optimal software release time which minimizes the expected cost incurred in both testing and operational phases. Though it might not be easy to estimate cost parameters in an actual software development process, their approach can provide a cost-effective testing schedule as well as a unified criterion to represent several requirements on the software delivery schedule. Koch and Kubat [1983] analyzed a similar but somewhat different problem under the assumption that the number of cumulative software faults can be described as a Markov process, the so-called Jelinski and Moranda model [Jelinski and Moranda 1972]. Shanthikumar and Tafekci [1983] also considered a similar problem under the binomial type of SRGM. Yamada et al. [1984], Yamada and Osaki [1985, 1986], Bai and Yun [1988], Yun and Bai [1990], Ohtera and Yamada [1990], Kapur and Garg [1989, 1990, 1991], Yamada et al. [1993], Shinohara et al. [1997] and Dohi et al. [1997] took account of reliability requirements and/or formulated modified optimal software release problems for different SRGMs. Recently, Hou et al. [1996, 1997] considered the optimal software release policies for the hypergeometric distribution SRGM proposed by Tohma et al. [1989, 1991]. The literature mentioned above assumed some SRGMs to describe the fault-detection process with respect to time. Such a simple approach will be tractable if an adequate SRGM can be selected in the testing phase. We can classify these modeling approaches as the parametric problem. However, the choice of SRGM usually fluctuates as data observation progresses. In other words, especially in the initial testing phase, it may be difficult
to choose the best SRGM from a small sample of fault-detection time data. Also, the statistical hypothesis that a given SRGM is correct may be rejected after obtaining additional data. In such an unreliable situation, a Bayesian adaptive method will be useful to estimate the optimal software release time. Forman and Singpurwalla [1977, 1979], Singpurwalla [1991] and Ross [1985] applied Bayesian statistical inference techniques to estimate the optimal software release time with some reliability criteria. Musa and Ackerman [1989] evaluated the empirical problem of when to stop testing. Masuda et al. [1989] considered the similar problem of determining the release time of a software system with modular structure. The most (mathematically) sophisticated techniques to estimate the optimal software release time under the expected cost criterion were proposed by Dalal and Mallows [1988, 1990, 1992]. They treated the underlying problem as an optimal stopping problem and derived closed-form stopping rules. Recently, Yang and Chao [1995] compared two stopping rules by simulation. It is noticed that the earlier works above also depend on a parametric model structure; that is, they assume that the software inter-failure time obeys an exponential distribution with a random rate. In the framework of Bayesian adaptive inference, such a treatment seems to be valid, but the problem of model identification remains.
3. Cost-based software release problem
It is convenient to unify the optimization criterion for designing the software release plan from the viewpoint of economic justification. In this paper, we concentrate on the problem of minimizing the expected cost anticipated to occur during both the testing and operational phases. Following Okumoto and Goel [1980] and Yamada et al. [1984], suppose that the software test is started at time 0 and terminated at time T (0 ≤ T < T_LC), where T_LC (> 0) denotes the software life cycle or the warranty period. If T_LC is the software life cycle, it may be a non-negative random variable with sufficiently large finite mean. Otherwise, since the user is guaranteed free service to repair software failures during the warranty period [T, T_LC], the upper limit T_LC may be a constant which should be pre-determined (ordinarily, half a year or one year for commercial software). As seen later, although we need not specify T_LC at the moment, assume it to be constant, since it is irrelevant to deriving the optimal software release time T*. Consider the following cost components:

• c1 (> 0): the cost to remove a software fault in the testing phase,
• c2 (> 0): the cost to remove a software fault in the operational phase,
• c3 (> 0): the testing cost per unit time incurred in the testing phase,

where c2 > c1. As assumed in many works [Dalal and Mallows 1992; Koch and Kubat 1983; Okumoto and Goel 1980], it is natural to assume that fixed costs are incurred for debugging software faults. On the other hand, since the running cost for the testing phase is much larger than the holding cost for the software service team during the warranty period, only the testing cost is considered here.
Suppose that the cumulative number of faults detected up to time t ∈ [0, T_LC] is a non-negative stochastic counting process {N(t), t ≥ 0}. Then it is appropriate to adopt the expected total software cost incurred until T_LC, V(T). Define

V(T) = c1 M(T) + c2 {M(T_LC) − M(T)} + c3 T,   (1)

where M(T) = E[N(T)] is the expected cumulative number of faults detected up to time T. If the function M(T) is known completely, the problem is formulated as

min_{0 ≤ T ≤ T_LC} V(T),   (2)

which is a simple algebraic one. For instance, if the function M(T) is continuous and well defined, and further is a non-decreasing, strictly concave and bounded function of T (see, e.g., [Goel and Okumoto 1979; Jelinski and Moranda 1972]), the minimization problem in equation (2) has a unique solution T*, since d²V(T)/dT² > 0 under the assumption c2 > c1. For instance, if {N(t), t ≥ 0} satisfies the following properties:

(i) N(0) = 0,
(ii) {N(t), t ≥ 0} has independent increments,
(iii) Pr{N(t + h) − N(t) = 1} = λ(t; θ)h + o(h),
(iv) Pr{N(t + h) − N(t) ≥ 2} = o(h),

then the stochastic process is the NHPP with

Pr{N(t) = n | N(0) = 0} = {M(t)}^n exp{−M(t)} / n!,   (3)

where M(t) = M(t; θ) is the mean value function of the NHPP, representing the mean number of faults detected, and θ ∈ R^n is an arbitrary parameter vector. Goel and Okumoto [1979] proposed the following exponential SRGM:

M(t; θ) = N{1 − exp(−bt)},   (4)

where the parameters θ = (N, b) ∈ R² (0 < N < ∞, b > 0) denote the initial fault content before testing and the fault-detection rate per remaining fault, respectively. In typical software release problems, one should identify the model parameters under the assumption that the form of the mean value function is known from expert knowledge. That is, suppose that n (> 1) data of the fault-detection time intervals, 0 = x_0, x_1, ..., x_n, and the cumulative time, T_n = Σ_{i=0}^{n} x_i, are available at the current point of time in the testing phase. Then, the maximum likelihood estimation (MLE) method will be useful to estimate the model parameters and to derive the parametric
estimator M̂(T) = M(T; θ̂). More precisely, since the logarithmic likelihood function becomes

log L(θ) = Σ_{i=1}^{n} log λ(T_i; θ) − M(T_n; θ),   (5)

where λ(t; θ) = dM(t; θ)/dt, we obtain the simultaneous likelihood equations to be solved as follows:

∂ log L(θ) / ∂θ = 0.   (6)

Substituting the estimator M̂(T) into equation (2), an estimator of the optimal software release time can be calculated analytically. Yamada et al. [1984] proved the following proposition for the exponential SRGM.

Proposition 3.1. For the exponential SRGM in equation (4), if the MLEs N̂ and b̂ of the model parameters satisfy N̂b̂ > c3/(c2 − c1), then there exists a finite and unique solution

T° = (1/b̂) log{N̂b̂ (c2 − c1) / c3}   (7)

to the first-order condition of optimality dV(T)/dT = 0, and the optimal software release time is T* = min{T°, T_LC}. Otherwise, the optimal software release time becomes T* = 0 and it is optimal to release the product without software testing.

Next, we attempt to reconsider the underlying problem graphically. After some algebraic manipulations, we have:

Proposition 3.2. The minimization problem in equation (2) is equivalent to

max_{0 ≤ T ≤ T_LC} [ M(T) − c3 T / (c2 − c1) ].   (8)

The configuration of the geometrical solution method is depicted in figure 1. The result above implies that the underlying software release problem can be reduced to a graphical one: seek the point T* that maximizes the vertical distance from the straight line c3 T/(c2 − c1) to the curve M(T) in the (T, M(T)) ∈ [0, T_LC] × [0, ∞) plane, if the function M(T) is known. Hence the software release problem is independent of T_LC and is essentially the estimation problem of the curve M(T). In the existing optimal software release formulations mentioned above, it is noticed that the function M(T) was assumed in advance. In other words, if the future behavior of M(T) can be estimated at any time point s ∈ [0, T), the optimal software release time can be determined based on the estimator M̂(T). The following result based on proposition 3.2 gives a dual relationship for proposition 3.1.
Figure 1. Configuration of geometrical solution method.
Proposition 3.3. Suppose that M(T; θ) is bounded, nondecreasing with respect to T and strictly concave.

(i) If the tangent slope of the curve y = M(T; θ) at the origin is strictly larger than c3/(c2 − c1), and if the tangent slope of y = M(T; θ) at x = T_LC is strictly less than c3/(c2 − c1) in the plane (x, y) = (T, M(T)), then there exists a finite and unique optimal software release time T* (0 < T* < T_LC) which minimizes the expected total software cost.

(ii) If the tangent slope of y = M(T; θ) at x = T_LC is larger than or equal to c3/(c2 − c1), then the optimal software release time becomes T* = T_LC and it is optimal to continue the software test.

(iii) If the tangent slope of y = M(T; θ) at the origin is less than or equal to c3/(c2 − c1), then the optimal software release time becomes T* = 0 and it is optimal not to test the software product.

Consequently, if complete information on the function M(T; θ) can be obtained in advance, we can derive the optimal software release time on the graph. This method is applicable to the case where the cumulative number of faults detected is unknown. In the following section, we describe the statistical estimation procedure for the optimal software release time on the graph.
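As a concrete illustration of the graphical rule, the sketch below maximizes the vertical-distance criterion of proposition 3.2 on a grid for the exponential SRGM of equation (4), and cross-checks the result against the closed form of equation (7); the parameter values a caller would supply are illustrative, and N and b stand for the (estimated) model parameters.

    import math

    def optimal_release_grid(N, b, c1, c2, c3, t_lc, steps=100000):
        """Maximize M(T) - c3*T/(c2 - c1) over [0, T_LC] for M(T) = N*(1 - exp(-b*T))."""
        gap = lambda t: N * (1.0 - math.exp(-b * t)) - c3 * t / (c2 - c1)
        return max((t_lc * i / steps for i in range(steps + 1)), key=gap)

    def optimal_release_closed_form(N, b, c1, c2, c3, t_lc):
        """Equation (7); finite and positive only when N*b > c3/(c2 - c1)."""
        if N * b <= c3 / (c2 - c1):
            return 0.0                       # release immediately (proposition 3.1)
        return min(math.log(N * b * (c2 - c1) / c3) / b, t_lc)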
4. Statistical estimation procedure
Consider statistical methods to estimate the optimal software release time for the problem in equation (8). Suppose that the time to detect the ith fault is T_i (i = 1, 2, ..., n) and that n fault-detection time data are available, that is, 0 = T_0 < T_1 < ··· < T_n.
In the parametric methods, the unknown parameters involved in M(T) have to be estimated from the n data, and the optimal software release time (a constant) T* must be calculated algebraically using a suitable parametric estimator M̂(T). Ordinarily, the method of MLE will be applied to estimate the unknown parameters of the SRGMs and the optimal software release time. In the remainder of this section, somewhat different methods are introduced (see [Hishitani et al. 1991; Yamada 1989]).

The first method is to estimate the jth (j = n + 1, ..., N) fault-detection time T_j according to

T̂_j = M^{-1}(j; θ̂),   (9)

if the inverse function M^{-1}(·) exists. The second method focuses on the inter-failure time distribution. It is well known that the NHPP with bounded mean value function, i.e., M(t) < ∞ for all t, has the inter-failure time distribution F_k(t) = Pr{T_{k+1} − T_k ≤ t | T_0 = 0}, k = 0, 1, 2, ..., where F_k(∞) < 1. For instance, the exponential SRGM and the other models based on the NHPP possess this property (see [Musa et al. 1987]). In order to approximate the fault-detection time interval, Hishitani et al. [1991] proposed the following normalized distribution:

G_k(t) = F_k(t) / F_k(∞),   (10)

and regarded the following MTBF (mean time between faults) as an estimate:

x̂_j = E[T_j] − E[T_{j−1}],  j = n + 1, ..., N,   (11)

where the expectation in equation (11) is taken with respect to G_k(t). The two methods can be applied to estimate M(T) without using the method of MLE. If n fault-detection time interval data x_1 = T_1 − T_0, x_2 = T_2 − T_1, ..., x_n = T_n − T_{n−1} are available, and if the estimates T̂_j and x̂_j = T̂_j − T̂_{j−1} (j = n + 1, ..., m_n) are given, where

m_n = max{ m : m integer, Σ_{i=1}^{n} x_i + Σ_{j=n+1}^{m} x̂_j ≤ T_LC },   (12)

then an estimator of the mean value function is given by the step function

M̂_n(T) = 0 for 0 ≤ T < T_1,
M̂_n(T) = i for T_i ≤ T < T_{i+1},
M̂_n(T) = n for T_n ≤ T < T̂_{n+1},
M̂_n(T) = j for T̂_j ≤ T < T̂_{j+1},   (13)

for i = 1, 2, ..., n − 1 and j = n + 1, n + 2, ..., m_n − 1. The configuration of the estimator is shown in figure 2. Note that the estimator in equation (13) is not consistent, but it is very reasonable. In addition, it is pointed out that the MLE of the
Figure 2. Empirical total number of software faults.
NHPP is not always the best estimator. This motivates the idea that the two estimators based on equations (9) and (11) might fit our graphical method. From equation (13), define an estimator of the expected total software cost as follows:

V̂(T) = c1 M̂_n(T) + c2 {M̂_n(T_LC) − M̂_n(T)} + c3 T.   (14)

From proposition 3.2, the optimal software release problem can be formulated as

max_{0 ≤ T ≤ T_LC} [ M̂_n(T) − c3 T / (c2 − c1) ].   (15)

Then we have the following useful result to restrict the search space of the optimal solution.

Proposition 4.1. For an estimator M̂_n(T), the candidate of the optimal software release time for the problem in equation (15) necessarily exists among {T_n, T̂_{n+1}, ..., T̂_{m_n}}.

The proof is obvious from the fact that the function M̂_n(T) is right-continuous. From proposition 4.1, we can search for the optimal solution over a finite region consisting of m_n − n + 1 elements. Hence, if the estimator in equation (9) or (11) is used, the underlying optimal software release problem can be reduced to a time series forecasting one. This suggests that a more beneficial forecasting method should be applied to estimate the future fault-detection times. In the following section, we introduce the statistical methods to estimate x̂_j (j = n + 1, ..., m_n) using artificial neural networks.
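Proposition 4.1 makes the search easy to implement: only the jump points of the step function need to be examined. The following is a minimal sketch, assuming the forecast inter-failure times x̂_j (however obtained) are supplied by the caller:

    def release_from_times(x_obs, x_pred, c1, c2, c3):
        """Maximize M_n(T) - c3*T/(c2 - c1) over the jump points T_n, ..., T_{m_n}
        (proposition 4.1). Returns (T*, number of faults detected by T*)."""
        times, t = [], 0.0
        for x in list(x_obs) + list(x_pred):
            t += x
            times.append(t)              # cumulative fault-detection times T_1, T_2, ...
        slope = c3 / (c2 - c1)
        n = len(x_obs)
        # at a jump point T_i the step function equals i (1-based index)
        candidates = [(i + 1 - slope * times[i], times[i], i + 1)
                      for i in range(n - 1, len(times))]
        _, t_star, faults = max(candidates)
        return t_star, faults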
5. Estimation via artificial neural networks
In this section, we explain two neural network models: the MLPN and RN networks. First, let us consider the MLPN network (see figure 3). The MLPN used in this paper is constructed of three layers of cells, with interconnections between all combinations of cell layers. The input layer receives input data. The middle layer, called the hidden layer, contains processing nodes termed artificial neurons, and the output layer gives an output as an estimated value of the next software fault-detection time interval. Each processing element in the hidden layer or the output layer has an activation level, which must be computed as a function of the activation levels of the cells connected to it and the associated interconnection weights. This function is called the activation function and, commonly, the following sigmoidal function is used:

f(u) = 1 / (1 + exp{−(u − β)}),   (16)

where β is a nodal threshold (constant). As a training method, we use the well-known BP algorithm [Rumelhart and McClelland 1986]. This algorithm modifies nodal thresholds and interconnection weights to reduce the error between the output and the target, where two kinds of learning rates are needed: the parameter α is used to determine the training cycle and the parameter η is used to adjust the convergence speed of the algorithm. The target is given by the jth fault-detection time interval when the input data are those previous to the jth fault. Thus, if the accumulated error over all input data decreases below a tolerance level, then the neural network outputs the next fault-detection time. For the MLPN used in this paper, the numbers of cells in the input, the hidden and the output layers are fixed as 10, 10 and 1, respectively, where these numbers were determined through numerical experiments in advance.

Next, we explain the RN network (see figure 4). The network in figure 4 is called an Elman network and has a multi-layered structure similar to that of MLPNs. In the RN network, in addition to an ordinary hidden layer, there is another special hidden layer called the context or state layer. This layer receives feedback signals from the ordinary hidden layer, and the outputs of neurons in the context layer are fed forward to the hidden layer. If only the forward connections are to be adapted and the feedback connections are fixed at constant values, the network can be considered an ordinary feedforward network and the BP algorithm used to train it. It is well known that the RN network is better than the MLPN for forecasting time series
Figure 3. Illustration of MLPN network.
Figure 4. Illustration of RN network.
data, and it will be useful to estimate the fault-detection time intervals of the software product in the testing phase. In using the MLPN, after one training of the network with a fixed number of training data, all estimates of the future fault-detection time intervals are given under an identical network architecture. On the other hand, the RN network can estimate the time intervals sequentially, changing its internal parameters. See Karunanithi and Malaiya [1992, 1996] for more details. The estimation algorithm for the optimal software release time is the following:

Step 1. Given n fault-detection time interval data x_1, x_2, ..., x_n, train the neural networks and estimate the future fault-detection time intervals x̂_j (j = n + 1, ..., m_n), where T_LC and m_n are determined in advance.

Step 2. Obtain the estimate M̂_n(T) by plotting the points {A_1, A_2, ..., A_{m_n}} = {(1, x_1), ..., (n, x_n), (n + 1, x̂_{n+1}), ..., (m_n, x̂_{m_n})} and constructing a step function in the two-dimensional plane.

Step 3. Calculate M̂_n(T_LC) for a given T_LC, and search for the point A_{j*} maximizing the vertical distance from the straight line to the points A_i (i = 1, ..., m_n).

Step 4. Calculate the minimum total software cost.

In Step 1, by replacing the estimates produced by the neural networks with those produced by the NHPP in equations (9) and (11), it can be seen that the determination based on the SRGMs is also possible. Therefore, the graphical method proposed in section 3 is a unified approach and is applicable to every statistical inference technique. In the following section, we estimate the optimal software release time and the corresponding total software cost from actual field data, and examine the predictive performance of the proposed method based on artificial neural networks.
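Putting steps 1-4 together, the compressed sketch below uses a deliberately simple stand-in forecaster (a moving-window average rather than a trained MLPN or RN), since any time series predictor can be slotted into step 1; it reuses release_from_times from the sketch in section 4.

    def forecast_intervals(history, horizon, window=10):
        """Step 1 stand-in: forecast the next `horizon` inter-failure times."""
        xs = list(history)
        for _ in range(horizon):
            w = xs[-window:]
            xs.append(sum(w) / len(w))       # window average as a toy predictor
        return xs[len(history):]

    def estimate_release(x_obs, c1, c2, c3, t_lc, window=10):
        """Steps 2-4: extend forecasts up to T_LC (the m_n of equation (12)),
        search the jump points, and cost the result via equation (14)."""
        x_pred, total = [], sum(x_obs)
        while True:
            nxt = forecast_intervals(list(x_obs) + x_pred, 1, window)[0]
            if total + nxt > t_lc:
                break
            x_pred.append(nxt)
            total += nxt
        t_star, faults = release_from_times(x_obs, x_pred, c1, c2, c3)
        m_n = len(x_obs) + len(x_pred)
        cost = c1 * faults + c2 * (m_n - faults) + c3 * t_star   # equation (14)
        return t_star, cost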
6. Numerical examples
We analyze 6 data sets cited in Lyu [1996] (see table 1). In data set #1, which consists of 136 fault-detection time data observed in the testing phase, suppose that 95 fault-detection time interval data x_i (i = 1, 2, ..., 95) are available and that we wish to
Table 1
Test data: data sets #1-#6.
estimate the optimal software release time sequentially from the time T_95 as the software testing proceeds. The models used are G&O, Y&O, G&O-1, G&O-2, MLPN and RN. G&O and Y&O assume the SRGMs by Goel and Okumoto [1979] and Yamada and Osaki [1985], respectively, and calculate the optimal software release times and the minimum expected costs using the MLEs of the respective mean value functions M(t; θ). When the next data point (e.g., the 96th) is observed, the most recent 95 data are used to estimate the mean value function, and this cycle is repeated until the termination of the test, T_LC. In G&O-1 and G&O-2, the fault-detection time is estimated by the inverse function in equation (9) and by the MTBF based on the normalized inter-failure time distribution in equation (11), respectively. MLPN and RN denote the neural network approaches using the multi-layer perceptron neural network and the recurrent neural network, respectively. The most recent 10 data are input to the neural networks and the other data are used for training. That is, in the RN, 9 fault-detection time data and one feedback signal are used as the input. Since the software life cycle T_LC is independent of the minimization and can be sufficiently large, we can regard it as the upper limit of the computation time. The respective values of T_LC used in the experiments are given in table 1. It is important to design the neural network architecture for the assessment of software reliability. As mentioned in section 5, the network size is fixed as (10, 10, 1) units. Also, from preliminary experiments, the maximum number of training cycles is 10,000 and the parameters are (α, η) = (0.2, 0.3) for the MLPN and (α, η) = (0.9, 0.8) for the RN. The initial values of nodal thresholds and interconnection weights are determined by uniform random numbers. The model parameters included in the total software cost are assumed as follows: c1 = 1 [dollar/fault], c2 = 2 [dollar/fault] and c3 = 0.0004 [dollar/unit testing time (CPU time)]. Figure 5 shows the behavior of the estimates of the optimal software release time for data set #1, where the central straight line in the figure denotes the real optimum and is calculated so as to maximize the vertical distance between the step function representing the number of faults and the straight line c3 T/(c2 − c1) after obtaining all 136 data. It is no problem to regard this as the real optimum. From this figure, it is found that both MLPN and RN fluctuate in the neighborhood of the real optimum and that the other prediction models tend to increase mildly as the number of data
Figure 5. Behavior of estimates of the optimal software release time (data set #1).
increases. This result tells us that the SRGMs G&O, Y&O, G&O-1 and G&O-2 underestimate the optimal software release time in the initial testing phase and give rather pessimistic predictions. On the other hand, note that the MLPN and the RN do not change much as time elapses once they have output the real optimum as the release time. This powerful property of the neural network approaches depends on the geometrical structure proposed in section 3. To this end, if the neural networks output a value close to the real optimum at some point of time, the estimates after that time never move dramatically. The behavior of the estimates of the corresponding minimum software cost (data set #1) is presented in figure 6. It can be seen that the estimates show tendencies similar to figure 5. In tables 2 and 3, we calculate the mean squared error of the estimates of the optimal software release time and the minimum software cost for all 6 data sets,
Table 2
Comparison of the optimal software release time (mean squared error): data sets #1-#6.
where "mean" is taken on all observation points (£96, £97,. • •) and the squared error is normalized as
[
(estimate of the optimal release time) — (real value)! (real value) J
(17)
From these results, we recognize that the neural network approaches give better predictive performance than the existing SRGMs. In particular, it is clear that the RN provides the lowest squared errors, less than 1% in almost all cases except data set #6. This is because the RN, with its feedback structure, is superior to the MLPN in terms of time series forecasting ability. On the other hand, we find that the estimation results of the parametric methods based on the SRGMs are very poor and that no remarkable differences between G&O, G&O-1 and G&O-2 can be recognized. In other words, if we assume parametric models specifying the mean value function, we cannot conclude that the MLE method is always better than the graphical estimation method proposed in this paper. Finally, we summarize the experimental results in this paper as follows.

(i) The neural network, especially the recurrent neural network, can provide better estimates of the optimal software release time minimizing the total software cost, as well as of future fault-detection times, than existing SRGMs. This tendency becomes more remarkable as the number of observed fault-detection time data increases.
Figure 6. Behavior of estimates of the total software cost (data set #1).
(ii) In the graphical approach proposed in this paper (MLPN, RN, G&O-1, G&O-2), one can select the optimal software release time from several estimates of the fault-detection time. This is more realistic than the existing SRGMs, which determine the optimal solution as a constant point of time. The constant time derived by the MLE method may not always be feasible (i.e., we might still be testing at that time, with some test cases not yet digested).

(iii) A drawback of the neural network approach is that the network has to be adjusted in advance. However, once the network architecture is determined, no scenario for the debugging process is needed, in contrast to the existing SRGMs. Moreover, if the network architecture can be optimized, estimation with higher accuracy can be expected.
Thus, it is concluded that the method based on artificial neural networks is very useful in actual software development process planning. The graphical idea proposed in this paper is also convenient for estimating the optimal software release schedule by applying existing time series forecasting techniques.

7. Concluding remarks
In this paper, we have developed a method to estimate the optimal software release time by applying artificial neural networks. It has been shown that the underlying cost-minimization problem can be reduced to a graphical one: minimizing the vertical distance from a straight line to a curve representing the number of faults. Since the essential factor behind the software release problem is to estimate future fault-detection time intervals, we have used two typical artificial neural networks for time series forecasting. In numerical examples with real software fault-detection time data, it has been found that the predictive performance for the optimal software release time is better with neural networks than with the existing parametric SRGMs. Though this paper has assumed two classical neural networks, an effort to improve the forecasting ability will be needed in the future. For instance, if other environmental data for software testing can be observed, e.g., structural factors such as the number of lines of code, functions and modules, or testing effort and cost factors such as the number of programmers and the execution time, the neural networks may carry out more realistic information processing taking account of those factors. On the other hand, the graphical method proposed in this paper will motivate the use of other statistical time series forecasting techniques instead of neural networks. Dohi et al. [1998] applied five statistical autoregressive models to estimate the number of software faults and the optimal software release time. However, in order to go beyond the predictive results of the neural networks in this paper, further improvement of the statistical autoregressive processes has to be made.

Acknowledgement

This work was partially supported by a Grant-in-Aid for Scientific Research from the Ministry of Education, Science, Sports and Culture of Japan under Grant No. 09780411 and No. 09680426.

References

Bai, D.S. and W.Y. Yun (1988), "Optimum Number of Errors Corrected Before Releasing a Software System," IEEE Transactions on Reliability R-37, 41-44.
Boehm, B.W. (1981), Software Engineering Economics, Prentice-Hall, Englewood Cliffs, NJ.
Boehm, B.W. (1984), "Software Engineering Economics," IEEE Transactions on Software Engineering SE-10, 4-21.
Dalal, S.R. and C.L. Mallows (1988), "When Should One Stop Testing Software?" Journal of the American Statistical Association 83, 872-879.
Dalal, S.R. and C.L. Mallows (1990), "Some Graphical Aids for Deciding When to Stop Testing Software," IEEE Journal of Selected Areas in Communications 8, 169-175.
Dalal, S.R. and C.L. Mallows (1992), "When to Stop Testing Software - Some Exact and Asymptotic Results," In Bayesian Analysis in Statistics and Econometrics, Lecture Notes in Statistics, Vol. 75, Springer, New York, pp. 267-276.
Dalal, S.R. and C.L. Mallows (1992), "Buying With Exact Confidence," Annals of Applied Probability 2, 752-765.
Dohi, T., N. Kaio and S. Osaki (1997), "Optimal Software Release Policies With Debugging Time Lag," International Journal of Reliability, Quality and Safety Engineering 4, 241-255.
Dohi, T., H. Morishita and S. Osaki (1998), "A Statistical Estimation Method of Optimal Software Release Timing Applying Autoregressive Models," In Proceedings of First Euro-Japanese Workshop on Stochastic Risk Modelling for Finance, Insurance, Production and Reliability, Vol. 2.
Forman, E.H. and N.D. Singpurwalla (1977), "An Empirical Rule for Debugging and Testing Computer Software," Journal of the American Statistical Association 72, 750-757.
Forman, E.H. and N.D. Singpurwalla (1979), "Optimal Time Intervals for Testing Hypotheses on Computer Software Errors," IEEE Transactions on Reliability R-28, 250-253.
Friedman, M.A. and J.M. Voas (1995), Software Assessment, Reliability, Safety, Testability, Wiley, New York.
Goel, A.L. and K. Okumoto (1979), "Time-Dependent Error-Detection Rate Model for Software Reliability and Other Performance Measures," IEEE Transactions on Reliability R-28, 206-211.
Hishitani, J., S. Yamada and S. Osaki (1991), "Reliability Assessment Measures Based on Software Reliability Growth Model With Normalized Method," J. Information Processing Society of Japan 14, 178-183.
Hou, R.H., S.Y. Kuo and Y.P. Chang (1996), "Optimal Release Policy for Hyper-Geometric Distribution Software-Reliability Growth Model," IEEE Transactions on Reliability R-45, 646-651.
Hou, R.H., S.Y. Kuo and Y.P. Chang (1997), "Optimal Release Times for Software Systems With Scheduled Delivery Time Based on the HGDM," IEEE Transactions on Computers C-46, 216-221.
Jelinski, Z. and P.B. Moranda (1972), "Software Reliability Research," In Statistical Computer Performance Evaluation, W. Freiberger, Ed., Academic Press, New York, pp. 465-484.
Kapur, P.K. and R.B. Garg (1989), "Cost-Reliability Optimum Release Policies for a Software System Under Penalty Cost," International Journal of Systems Science 20, 2547-2562.
Kapur, P.K. and R.B. Garg (1990), "Optimal Software Release Policies for Software Reliability Growth Models Under Imperfect Debugging," Revue Francaise d'Automatique, Informatique et Recherche Operationnelle (Recherche Operationnelle/Operations Research) 24, 295-305.
Kapur, P.K. and R.B. Garg (1990), "Cost-Reliability Optimum Release Policies for a Software System With Testing Effort," Opsearch 27, 109-116.
Kapur, P.K. and R.B. Garg (1991), "Optimal Release Policies for Software Systems With Testing Effort," International Journal of Systems Science 22, 1563-1571.
Kapur, P.K. and R.B. Garg (1991), "Optimum Release Policy for Inflection S-Shaped Software Reliability Growth Model," Microelectronics and Reliability 31, 39-42.
Karunanithi, N., Y.K. Malaiya and D. Whitley (1991), "Prediction of Software Reliability Using Neural Networks," In Proceedings of 2nd International Symposium on Software Reliability Engineering, IEEE Computer Society Press, Los Alamitos, CA, pp. 124-130.
Karunanithi, N. and Y.K. Malaiya (1992), "The Scaling Problem in Neural Networks for Software Reliability Prediction," In Proceedings of Third International Symposium on Software Reliability Engineering, IEEE Computer Society Press, Los Alamitos, CA, pp. 76-82.
Karunanithi, N. and Y.K. Malaiya (1996), "Neural Networks for Software Reliability Engineering," In Handbook of Software Reliability Engineering, M.R. Lyu, Ed., McGraw-Hill, New York, pp. 699-728.
Karunanithi, N., D. Whitley and Y.K. Malaiya (1992), "Using Neural Networks in Reliability Prediction," IEEE Software 9, 53-59.
Karunanithi, N., D. Whitley and Y.K. Malaiya (1992), "Prediction of Software Reliability Using Connectionist Models," IEEE Transactions on Software Engineering SE-18, 563-574.
Khoshgoftaar, T.M. and R.M. Szabo (1994), "Predicting Software Quality, During Testing, Using Neural Network Models: A Comparative Study," International Journal of Reliability, Quality and Safety Engineering 1, 303-319.
Koch, H.S. and P. Kubat (1983), "Optimal Release Time for Computer Software," IEEE Transactions on Software Engineering SE-9, 323-327.
Lyu, M.R., Ed. (1996), Handbook of Software Reliability Engineering, McGraw-Hill, New York.
Masuda, Y., N. Miyawaki, U. Sumita and S. Yokoyama (1989), "A Statistical Approach for Determining Release Time of Software System With Modular Structure," IEEE Transactions on Reliability R-38, 365-372.
Musa, J.D., A. Iannino and K. Okumoto (1987), Software Reliability, Measurement, Prediction, Application, McGraw-Hill, New York.
Musa, J.D. and A.F. Ackerman (1989), "Quantifying Software Validation: When to Stop Testing," IEEE Software 6, 19-27.
Ohtera, H. and S. Yamada (1990), "Optimum Software-Release Time Considering an Error-Detection Phenomenon During Operation," IEEE Transactions on Reliability R-39, 596-599.
Okumoto, K. and A.L. Goel (1980), "Optimum Release Time for Software Systems Based on Reliability and Cost Criteria," Journal of Systems and Software 1, 315-318.
Rumelhart, D.E. and J.L. McClelland (1986), Parallel Distributed Processing, Vol. 1, MIT Press, Cambridge, MA.
Ross, S.M. (1985), "Software Reliability: The Stopping Rule Problem," IEEE Transactions on Software Engineering SE-11, 1472-1476.
Schick, G.J. and R. Wolverton (1978), "An Analysis of Competing Software Reliability Models," IEEE Transactions on Software Engineering SE-4, 104-120.
Shanthikumar, J.G. and S. Tüfekci (1983), "Application of a Software Reliability Model to Decide Software Release Time," Microelectronics and Reliability 23, 41-59.
Shinohara, Y., M. Imanishi, T. Dohi and S. Osaki (1996), "Software Reliability Prediction Using Neural Network Technique," In Proceedings of Second Australia-Japan Workshop on Stochastic Models in Engineering, Technology and Management, R.J. Wilson, D.N.P. Murthy and S. Osaki, Eds., Technology Centre, The University of Queensland, Brisbane, pp. 564-571.
Shinohara, Y., T. Dohi and S. Osaki (1997), "Comparisons of Optimal Release Policies for Software Systems," Computers and Industrial Engineering 33, 813-816.
Singpurwalla, N.D. (1991), "Determining an Optimal Time Interval for Testing and Debugging Software," IEEE Transactions on Software Engineering SE-17, 313-319.
Srinivasan, K. and D. Fisher (1995), "Machine Learning Approaches to Estimating Software Development Effort," IEEE Transactions on Software Engineering SE-21, 126-137.
Tohma, Y., K. Tokunaga, S. Nagase and Y. Murata (1989), "Structural Approach to the Estimation of the Number of Residual Software Faults Based on the Hyper-Geometric Distribution," IEEE Transactions on Software Engineering SE-15, 345-355.
Tohma, Y., H. Yamano, M. Ohba and R. Jacoby (1991), "The Estimation of Parameters of the Hypergeometric Distribution and Its Application to the Software Reliability Growth Model," IEEE Transactions on Software Engineering SE-17, 483-489.
Venkatachalam, A.R. (1993), "Software Cost Estimation Using Artificial Neural Networks," In Proceedings of 1993 International Joint Conference on Neural Networks, pp. 987-989.
Xie, M. (1991), Software Reliability Modelling, World Scientific, Singapore.
Yamada, S., H. Narihisa and S. Osaki (1984), "Optimal Software Release Policies With a Scheduled Software Delivery Time," International Journal of Systems Science 15, 904-915.
Yamada, S. and S. Osaki (1985), "Cost-Reliability Optimal Release Policies for Software Systems," IEEE Transactions on Reliability R-34, 422-424.
Yamada, S. and S. Osaki (1985), "Optimal Software Release Policies With Simultaneous Cost and Reliability Requirement," European Journal of Operational Research 31, 46-51.
Yamada, S. and S. Osaki (1985), "Software Reliability Growth Modeling: Models and Applications," IEEE Transactions on Software Engineering SE-11, 1431-1437.
Yamada, S. and S. Osaki (1986), "Optimal Software Release Policies for a Nonhomogeneous Software Error Detection Rate Model," Microelectronics and Reliability 26, 691-702.
Yamada, S. (1989), Software Reliability Assessment Technology, HBJ Japan, Tokyo (in Japanese).
Yamada, S., J. Hishitani and S. Osaki (1993), "Software-Reliability Growth With a Weibull Test-Effort: A Model & Application," IEEE Transactions on Reliability R-42, 100-106.
Yang, M.C.K. and A. Chao (1995), "Reliability-Estimation & Stopping-Rules for Software Testing, Based on Repeated Appearances of Bugs," IEEE Transactions on Reliability R-44, 315-321.
Yun, W.Y. and D.S. Bai (1990), "Optimum Software Release Policy With Random Life Cycle," IEEE Transactions on Reliability R-39, 167-170.
Chapter 3

ML Applications in Property and Model Discovery

Given a software system, ML methods can be used to identify or discover certain properties of the system. Such discovered properties can be indispensable in many SE tasks: to facilitate various development and maintenance activities, to understand the relationships among software components for program understanding, to identify reusable components for reuse repository construction, and to re-engineer an existing system into one that has desirable properties, to name a few. Table 23 summarizes the status of the ML methods being utilized in this application category.

Table 23. ML methods used in discovery. The discovery tasks considered are program invariants, object identification, operation boundary, mutants, and process models; the candidate ML methods are NN, IBL/CBR, DT, GA, GP/ILP, EBL, CL/BL, IAL, RL, EL, SVM, and AL.
In this chapter, we include two papers, one dealing with using NN to identify objects in procedural programs, and the other tackling the issue of detecting equivalent mutants in mutation testing using BL. The paper by Abd-El-Hafiz [1] describes a general approach to identifying objects in procedural programs using clustering NN. Currently, there are three main approaches to the identification of objects in software systems: concept analysis (based on lattice theory), the knowledge-based approach, and cluster analysis. These approaches suffer from two limitations: the inability to identify a coherent set of objects as a result of the presence of undesired connections among functions of a system under analysis, and the need for human intervention to obtain satisfactory identification results. Within cluster analysis, there are three types of algorithms: hierarchical, graph-theoretic, and optimization. The proposed approach, which is based on cluster analysis and belongs to the optimization category, addresses these limitations through two clustering NN algorithms, ART (Adaptive Resonance Theory) and SOM (Self-Organizing Maps). The ART and SOM algorithms are used to carry out the clustering process, in which the routines of a software system are partitioned so as to minimize intra-cluster distances and maximize inter-cluster distances. The weights associated with each output node of the neural network correspond to a cluster centroid (cluster prototype). The distance measure is either the Manhattan distance or the Euclidean distance. The process of weight modification and routine partitioning is iterated until stability is reached. Using a prototype tool implementing the ART and SOM algorithms, the author performed a number of experiments involving programs written in C and Pascal of up to 19,000 lines of code. The tool takes as its input a routine-attribute matrix obtained through
simple static analysis of the code, chooses one of the NN algorithms and its corresponding parameter based on the user's decision, and produces the clustering results. The analysis indicates that the clustering results of the SOM algorithm are either comparable to or better than those obtained through the ART algorithm, though at the cost of longer execution time.

The paper by Vincenzi et al. [141] deals with the issue of identifying equivalent mutants in mutation testing using BL. Mutation testing holds great promise for improving software quality. It not only leads to the detection of faults in the tested program, but also offers a criterion for the test adequacy issue. However, a major hindrance to the application of this testing technique is its computational cost. For instance, there is a large number of mutants that need to be analyzed and identified for possible equivalence with regard to the original program under test, which is computationally very expensive. The proposed approach in [141] is based on the brute-force maximum a posteriori (MAP) learning algorithm to identify the most promising group of mutants that should be analyzed. The most probable hypothesis is learned from the following settings:

1. Five C programs (each about 100 lines of code) are used to generate the training data.
2. A tool is used that supports mutation testing at the unit level and implements 71 mutant operators for C programs. These mutant operators can be divided into four categories based on the syntactic units over which the mutation is applicable: constants, operators, statements and variables.
3. A test set of 500 cases is constructed for each program. To observe the variation in the number of equivalent and non-equivalent mutants as the number of test cases increases, 19 scenarios of test cases are defined, ranging over 0, 10, 20, 30, ..., 300, 350, 400, 450 and 500 test cases. Probabilistic information on the equivalent and non-equivalent mutants generated by a given mutant operator is then collected for each scenario.
4. From the 19 scenarios, the probability of each mutant operator generating equivalent and non-equivalent mutants can be calculated.

A case study is performed on a sort program of 624 lines of code. The results indicate that for certain mutant operators, the error rate is within a reasonable range of 10%. The following papers are included here:

S. K. Abd-El-Hafiz, "Identifying objects in procedural programs using clustering neural networks", Automated Software Engineering, Vol. 7, No. 3, 2000, pp. 239-261.
A. M. R. Vincenzi et al., "Bayesian-learning based guidelines to determine equivalent mutants", International Journal of Software Engineering and Knowledge Engineering, Vol. 12, No. 6, 2002, pp. 675-689.
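As a rough sketch of the reasoning behind the Bayesian approach summarized above (our own illustration; the operator names are real C mutant operators, but the counts are invented and the actual procedure in [141] differs in detail), the most probable hypothesis for a surviving mutant can be chosen per operator as follows:

# Sketch of per-operator MAP classification of surviving mutants.
# All counts here are hypothetical and only illustrate the idea.
training = {
    # operator: (equivalent mutants, non-equivalent mutants) observed
    "CRCR": (40, 10),   # required constant replacement
    "OAAN": (5, 45),    # arithmetic operator replacement
}

def map_hypothesis(operator):
    """Most probable class for a surviving mutant of a given operator."""
    eq, ne = training[operator]
    p_eq = eq / (eq + ne)        # P(equivalent | operator)
    return ("equivalent", p_eq) if p_eq >= 0.5 else ("non-equivalent", 1 - p_eq)

for op in training:
    print(op, map_hypothesis(op))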
Identifying Objects in Procedural Programs Using Clustering Neural Networks

SALWA K. ABD-EL-HAFIZ
[email protected]
Engineering Mathematics Department, Faculty of Engineering, Cairo University, Giza, Egypt
Abstract. This paper presents a general approach for the identification of objects in procedural programs. The approach is based on neural architectures that perform an unsupervised learning of clusters. We describe two such neural architectures, explain how to use them in identifying objects in software systems, and briefly describe a prototype tool which implements the clustering algorithms. With the aid of several examples, we explain how our approach can identify abstract data types as well as groups of routines which reference a common set of data. The clustering results are compared to the results of many other object identification techniques. Finally, several case studies were performed on existing programs to evaluate the object identification approach. Results concerning two representative programs and their generated clusters are discussed.

Keywords: clustering, objects, abstract data types, neural networks

1. Introduction
The identification of objects within existing procedural programs can facilitate many maintenance activities. By extracting reusable components from existing software systems, the population of a reuse repository can be cost effective (Abd-El-Hafiz et al., 1991; Canfora et al., 1993b). That is why the extraction of reusable abstract data types has been the focus of many research activities (Canfora et al., 1996, 1993a, b; Dunn and Knight, 1993). In addition, the automatic identification of objects assists in understanding the relationships among the components of a software system. It thus facilitates the recognition and comprehension of the abstractions existing in a given system without getting distracted by implementation details (Abd-El-Hafiz, 1997; Abd-El-Hafiz and Basili, 1996; Canfora et al., 1996). Furthermore, object identification enables the re-engineering of procedural programs into functionally equivalent object-oriented ones that are easier to maintain (Newcomb and Kotik, 1995; McFall and Sleith, 1993; Siff and Reps, 1997). Some object identification approaches apply mathematical concept analysis to the object identification problem (Lindig and Snelting, 1997; Sahraoui et al., 1997; Siff and Reps, 1997; Snelting, 1996). Concept analysis is a branch of lattice theory that can be used to identify similarities among a set of objects based on their attributes (Siff and Reps, 1997). Others adopt a knowledge-based approach which uses a library of known abstract data types and rules to recognize their implementations (Dekker and Ververs, 1994). However, considerable research on object identification and software modularization, in general, is based on cluster analysis, where clustering is understood to be the grouping of similar objects and the separation of dissimilar ones. Refer for example to (Achee and Carver, 1994; Canfora et al., 1996; Hutchens and Basili, 1985; Ibba et al., 1993; Kunz, 1996; Mancoridis
et al., 1998; Schwanke, 1991; Yeh et al., 1995). Similarity is measured, either explicitly or implicitly, by using relationships among the objects or by using scores of the objects on a number of attributes. An overview of these similarity measures and the different clustering algorithms can be found in (Wiggerts, 1997).

This paper reviews some of the approaches proposed in the literature for the identification of objects in existing software systems. The limitations of these approaches are highlighted, with the emphasis being on two limitations. The first one is the inability to identify a coherent set of objects due to the existence of undesired connections among routines of the analyzed system. The second one is the reliance on human intervention in order to produce good results. Techniques which rely on intervention from an expert who understands the system are of little help to someone who is not familiar with the software. To overcome these limitations, the paper presents an object identification approach that is based on cluster analysis. More specifically, we use clustering neural networks to identify objects in procedural programs. The identification technique proposes a flexible set of attributes upon which the clustering can be based. Using two neural clustering algorithms, a hierarchy of clustering options is presented. These clustering results, in general, enable the precise identification of objects with no human intervention. Section 2 of this paper reviews related work. Section 3 describes two neural architectures which perform an unsupervised learning of clusters. Section 4 demonstrates how these clustering neural networks can be used to identify objects in software systems. This section also gives the clustering results of many examples and compares the presented approach with related approaches. Section 5 briefly describes an implemented prototype tool and evaluates our approach by applying it on several existing procedural programs. Results concerning two representative programs and their generated clusters are discussed. Finally, Section 6 summarizes the strengths and limitations of the presented approach and gives future research directions.

2. Related work
Some of the existing object identification approaches use mathematical concept analysis, which provides a method to identify sensible groupings of objects that have common attributes. Lindig and Snelting (1997) and Sahraoui et al. (1997) use concept analysis by relating each routine of a program to the global variables it accesses. Their results are not encouraging because they could not identify any useful way to decompose a program into objects. Siff and Reps (1997) obtain better results by relating the routines of the program to different attributes which include the fields of user-defined structure types that are accessed by these routines. In order to identify objects in the presence of undesired connections between the routines, attributes that reflect 'negative' information must be invented by the user of the approach (e.g., attributes of the form 'routine R does not use fields of structure T'). Thus, human expertise and effort are required to formulate the additional complementary attributes which correctly recognize objects in the code. The knowledge-based identification approach of Dekker and Ververs (1994) stores a library of common abstract data types in order to search a program for matches against them. This approach has the disadvantage of being only able to recognize abstract data types whose description is stored in the library.
Many other object identification approaches are based on cluster analysis. The clustering algorithms reviewed in this section are classified into three different categories: hierarchical, graph-theoretic, and optimization algorithms (Wiggerts, 1997). Hierarchical clustering algorithms build a hierarchy of clusters such that each level contains the same clusters as the first lower level except for two clusters which are joined to form one cluster. Graph-theoretic clustering algorithms are based on defining a model of the subject system as a graph on which notable sub-graphs and/or patterns are identified. Optimization clustering takes an initial partitioning of the system and tries to improve it by iterative adaptations according to some heuristic.

Hutchens and Basili (1985) use data bindings and a hierarchical clustering algorithm to determine how two procedures are related. When the systems under consideration exhibit strong encapsulation (i.e., hide their data), the number of publicly accessible variables becomes limited. Hence, the approach fails to identify objects in such cases. Examples of the graph-theoretic clustering algorithms can be found in (Canfora et al., 1993b, 1996; Cimitile and Visaggio, 1995; Dunn and Knight, 1993; Liu and Wilde, 1990; Livadas and Johnson, 1994; Muller et al., 1993; Yeh et al., 1995). The method used by Yeh et al. (1995) is based on constructing two kinds of graphs for C programs. The nodes of the first kind are the procedures and structure types, and the edges are the references by the procedures to the internal fields of the structures. The nodes of the second kind of graphs are the procedures and the external variables, and the edges are the references by the procedures to the variables. The candidate abstract data types and object instances are the sets of connected components in these graphs. Although their basic identification method is automatic, it can introduce some recognition pitfalls. To handle these problems, they introduce some semi-automatic techniques. Dunn and Knight (1993) have also presented an algorithm that exploits the representation of the subject program as a bipartite graph where nodes are either routines or global variables. Edges are directed from routines to global variables to specify the 'uses' relation. The algorithm performs a depth-first traversal of the graph looking for strongly connected components. Each of the resulting components is regarded as a candidate object. Equivalently, Liu and Wilde (1990) define, for each global variable x, the set P(x) of routines which directly reference it. A graph is then constructed by considering each P(x) as a node. An edge between two nodes P(x1) and P(x2) denotes that the intersection of the two sets P(x1) and P(x2) is not empty. A candidate object is identified by finding the strongly connected subgraphs. The use of first-order logic to express both the graph representing the system and the types of sub-graphs and patterns to be identified within it has also been proposed in the literature (Canfora et al., 1993a). This method has the advantage of being easy to prototype using a logic programming language. However, it produces results which are essentially equivalent to those offered by other graph-theoretic algorithms. In general, the objects identified by the above graph-theoretic approaches tend to contain spurious routines that are slightly related to the objects and require human effort and understanding to unravel. Thus, Canfora et al. (1996) proposed an improved identification method which uses a variable-reference graph that is essentially the same as the graph adopted by Dunn and Knight (1993). By exploiting simple statistical techniques, they enable a more precise identification of objects with less human intervention. Although the approach performs automatic identification of
the routines which introduce undesired connections, it does not identify the types of these connections. Furthermore, the clustering of such routines into objects is not supported. The prototype tool offers assistance only when it is desired to slice some of these routines. Graph-theoretic cluster analysis is also used by Cimitile and Visaggio (1995) in a different context. In order to identify functional abstractions in procedural code, they transform the system's call graph into a dominance graph and then interpret the dominance relationships of this graph as functional dependency relationships. They also propose a set of candidature criteria for the aggregation of the procedures into candidate modules.

It should also be mentioned that cluster analysis is used to solve the general problem of decomposing a software system into subsystems, in order to re-modularize it. A unified framework for expressing these re-modularization approaches is presented in Lakhotia (1997). Muller et al. (1993) present a semi-automatic graph-theoretic approach which uses partite graphs to construct subsystems based on their structural aspects. Schwanke (1991) uses hierarchical clustering and measures similarity using an information-sharing heuristic to identify groups of related procedures. To identify individual procedures which appear to be in the wrong module, 'maverick analysis' is used. Although this analysis helps in improving the modularization results, it requires tedious and demanding tuning of the similarity measure by an architect. That is why neural networks are used to semi-automate this tuning process. In order to produce good results, the aforementioned two approaches rely on intervention from an architect who understands the system. Hence, Mancoridis et al. (1998) present another modularization approach which overcomes this drawback. The approach makes use of traditional optimization algorithms such as hill climbing and genetic algorithms. It also shows how graph visualization tools (North and Koutsofios, 1994) can be used to visualize the clustering results. Although the clustering algorithms make use of the different relationships that exist among the modules of a software system, they do not take into account the different strengths of these relations. Other informal modularization work is based on using common sub-strings in file names (Anquetil and Lethbridge, 1998) and on using the concepts referred to in the comments and function names (Merlo et al., 1993). While Anquetil and Lethbridge (1998) use a combination of iterative and statistical algorithms, Merlo et al. (1993) use neural networks.

The object identification approach presented in this paper uses neural networks to perform clustering. The weights associated with each output node of the neural network correspond to a 'cluster centroid' or 'cluster prototype'. The objective of the clustering algorithms is to partition the routines of the software system so as to minimize intra-cluster distances while maximizing inter-cluster distances. This is performed by repeatedly modifying the weights of the network as well as the partitioning of the routines until stability is reached. Thus, the clustering algorithms can be classified as optimization algorithms. The distance measures used in the presented algorithms are either the Manhattan distance or the Euclidean distance (Mehrotra et al., 1997; Wiggerts, 1997).

3. Clustering neural networks
Clustering neural networks learn clusters in the input data without the need to be taught. That is, they perform unsupervised learning of clusters based on data correlations (i.e., similarity measures). The same input pattern is presented to the network several times, and a pattern may move from one cluster to another until the network stabilizes. In this paper, we consider Adaptive Resonance Theory (ART) networks and Kohonen's Self-Organizing Maps (SOM) (Jain et al., 1996; Mehrotra et al., 1997; Zurada, 1992). In order to demonstrate how the two neural architectures perform an unsupervised learning of clusters, we use the data shown in Table 1. This data is for a group of nine animals, each described by its own set of attributes (Knight, 1990). The group breaks down naturally into three clusters: mammals, reptiles, and birds.

Table 1. Animals data for unsupervised learning of clusters.

                has hair  has scales  has feathers  flies  lays eggs
  1. Dog            1          0            0          0        0
  2. Cat            1          0            0          0        0
  3. Bat            1          0            0          1        0
  4. Canary         0          0            1          1        1
  5. Robin          0          0            1          1        1
  6. Pigeon         0          0            1          1        1
  7. Snake          0          1            0          0        1
  8. Lizard         0          1            0          0        1
  9. Alligator      0          1            0          0        1

3.1. Adaptive resonance theory
Adaptive Resonance Theory (ART) models are neural networks that perform clustering. They allow the number of clusters to vary with problem size. The main feature of ART models is that they permit the user to control the degree of similarity between members of the same cluster by means of a user-defined constant called the vigilance parameter, P. The ART1 network only accepts binary (0/1) input vectors. As shown in figure 1, it uses two layers of neurons (or nodes) with feedforward connections (from input to output nodes) as well as feedback connections (from output to input nodes). The input layer contains as many nodes, n, as the size of the input vector. The output layer has a variable number of nodes, m, representing the number of clusters. The connections from the input layer to the output layer carry 'bottom-up' weights, B_{m x n}, and the connections from the output layer to the input layer carry 'top-down' weights, T_{n x m}.

Figure 1. An ART1 network with 4 input and 3 output neurons.

A high-level description of the ART1 clustering algorithm, which is adapted from (Mehrotra et al., 1997), is as follows:

ART1 clustering algorithm
Initialize the number of output nodes: m = 1;
Initialize the weights: b_{j,k} = 1/(1+n) and t_{k,j} = 1, for k = 1, ..., n; j = 1, ..., m;
while the network has not stabilized, do
  1. Select an input pattern x = (x_1, x_2, ..., x_n);
  2. Let the active set A contain all output nodes;
  3. Calculate y_j = Σ_k b_{j,k} x_k for each node j ∈ A;
  4. repeat
     - Let j* be a node in A such that y_{j*} = max(y_1, ..., y_m);
     - Compute s* = (s*_1, ..., s*_n), where s*_k = t_{k,j*} x_k;
     - if (Σ_k s*_k)/(Σ_k x_k) > P then associate x with node j* and update the weights; else remove j* from set A;
     until A is empty or x has been associated with some node j*;
  5. If A is empty, create a new output node whose weight vector coincides with the current input pattern x;
end-while.

In the above algorithm, the ART1 network is first initialized. Then, the input vectors are repeatedly presented to the network until it stabilizes. In a stable network, the weights no longer change and each input vector belongs to the same cluster in successive presentations. When a new input vector x is presented to the network, it is communicated to the output layer via the upward connections. At the output layer, y_j is calculated as in step 3. The output y_j represents the similarity between the weight vector b_j = (b_{j,1}, ..., b_{j,n}) and the input vector x, which is measured using the Manhattan (or Hamming) distance. In step 4, a competitive activation process occurs among nodes that are in the current "active" list A. The node with the highest y_j, namely j*, wins and the corresponding cluster is declared to match x best. Since the best match may not even be close enough to satisfy an externally chosen threshold, the final decision depends on a vigilance test. Using the top-down weights, the j*th node in the second layer produces the n-dimensional vector s*. The similarity between x and s* is compared with the given vigilance parameter P. If the proportion of 'on' bits in x that are also in s* exceeds the threshold P, with 0 < P < 1, then the match with
the j*th node is judged acceptable, and the weights (t_{1,j*}, ..., t_{n,j*}) and (b_{j*,1}, ..., b_{j*,n}) are modified to make s* resemble x to a greater extent, and computation proceeds with the next input pattern. It should be noted that large P values indicate a more strict similarity requirement than small P values. If the vigilance test fails, the j*th cluster is removed from the active set of output nodes, A, and does not participate in the remainder of the process of assigning x to an existing cluster. The process of determining the best match from the active set A is repeated until A is empty or until a best match has been found that satisfies the vigilance criterion. Step 5 states that if A is empty and no satisfactory match has been found, a new output node (cluster prototype) is created with a weight vector that matches the current input pattern x (Mehrotra et al., 1997). In ART models, 'resonance' refers to the process used to match a new input vector to one of the cluster prototypes stored in the network. The system is 'adaptive' in allowing for the addition of new cluster prototypes to the network. For more details on the clustering algorithm, refer to (Mehrotra et al., 1997; Zurada, 1992).

Figure 2 shows the results of applying the ART1 clustering algorithm to the animals data of Table 1. The input vectors correspond to the rows of Table 1. Starting with small P values (0.1 < P < 0.4) resulted in the successful identification of the mammals cluster. However, the network could not differentiate between reptiles and birds. Increasing the P values (0.4 < P < 0.9) resulted in identifying three clusters: mammals, reptiles, and birds. Figure 2 highlights these three clusters by using bold rectangles. High P values (0.6 < P < 0.9) further decomposed the mammals cluster based on whether they fly or not. Theoretically, there is an infinite number of possibilities for the P values and, correspondingly, for the clustering results. In practice, however, the possible clustering alternatives are limited because a range of P values can give the same clustering results.

Figure 2. Clustering results for the animals example.
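A compact Python rendering of the algorithm above may help; this is our sketch, with the fast-learning weight updates filled in from the standard ART1 formulation (the text above describes the update step only informally), run on the animals data of Table 1:

# ART1 sketch; the AND-based fast-learning update is an assumption taken
# from the standard ART1 literature, not spelled out in the text above.
def art1(patterns, vigilance, passes=5):
    prototypes = []                        # one binary prototype per cluster
    assignment = []
    for _ in range(passes):                # fixed passes stand in for the stability test
        assignment = []
        for x in patterns:
            active = list(range(len(prototypes)))
            chosen = None
            while active and chosen is None:
                # bottom-up competition: |prototype AND x| / (0.5 + |prototype|)
                j = max(active, key=lambda j: sum(a & b for a, b in zip(prototypes[j], x))
                                              / (0.5 + sum(prototypes[j])))
                s = [a & b for a, b in zip(prototypes[j], x)]
                if sum(s) / sum(x) > vigilance:      # vigilance test against P
                    prototypes[j] = s                # make the prototype resemble x
                    chosen = j
                else:
                    active.remove(j)
            if chosen is None:                       # no acceptable match: new cluster
                prototypes.append(list(x))
                chosen = len(prototypes) - 1
            assignment.append(chosen)
    return assignment

# Rows of Table 1: dog, cat, bat, canary, robin, pigeon, snake, lizard, alligator.
animals = [(1,0,0,0,0), (1,0,0,0,0), (1,0,0,1,0), (0,0,1,1,1), (0,0,1,1,1),
           (0,0,1,1,1), (0,1,0,0,1), (0,1,0,0,1), (0,1,0,0,1)]
print(art1(animals, vigilance=0.4))   # mammals, birds, and reptiles separate

Exactly which splits appear depends on the vigilance value and on the presentation order, mirroring the behavior reported for figure 2.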
3.2. Self-organizing maps
Kohonen's Self-Organizing Maps (SOM) have the property of topology preservation. In a topology-preserving mapping, nearby input patterns should activate nearby output units
on the map. Figure 3 shows the basic network architecture of Kohonen's SOM. It consists of a two-dimensional array of connected neurons. Each neuron is also connected to all n input nodes. The n-dimensional weight vector associated with the neuron at location j of the two-dimensional array is denoted w_j = (w_{j,1}, ..., w_{j,n}). SOM also defines a spatial neighborhood for each output node. The shape of this neighborhood can be square, rectangular or circular. During competitive learning, all weights associated with the winner and its neighboring nodes are updated (Jain et al., 1996; Mehrotra et al., 1997; Zurada, 1992).

Figure 3. An SOM network with a rectangular array of neurons.

The SOM clustering algorithm, which is adapted from (Mehrotra et al., 1997), is as follows:

SOM clustering algorithm
Select the network topology to determine which nodes are adjacent to which others;
Initialize weights to small random values;
Initialize the current neighborhood distance D(0) to a positive integer;
while the network has not stabilized, do
  1. Select an input pattern x = (x_1, x_2, ..., x_n);
  2. Calculate y_j = Σ_k (x_k - w_{j,k})^2 for each output node j;
  3. Select the output node, j*, with the minimum y_j value;
  4. Update the weights of all nodes within a topological distance of D(t) from j*;
  5. Increment t;
end-while.

Initially, the network weights are assigned random values. When an input vector x is presented to the network, the square of the Euclidean distance of x from the weight vector w_j associated with each output node is computed in step 2. In step 3, the output node, j*, with the minimum Euclidean distance is chosen as the winner of the competition. The weights of all nodes within a topological distance of D(t) from j*, where D(t) decreases
with time, are updated in step 4. D(t) refers to the length of the path connecting two nodes for the prespecified topology chosen for the network. During the learning process, the values of the weights change such that each weight vector moves towards the centroid of some subset of input patterns (Jain et al., 1996; Mehrotra et al., 1997). The design parameters include the dimensionality of the neuron array, the number of neurons in each dimension, the shape of the neighborhood, the learning rate, and the criterion used to determine whether the network has stabilized or not. With respect to the first parameter, we experimented with the two commonly used values: one- and two-dimensional arrays. In all the programs we analyzed, the two values gave comparable results. Thus, we focus on using one-dimensional arrays because they are more intuitive than two-dimensional ones. The second parameter, K, which denotes the number of nodes in the linear array, is varied to control the granularity of the resulting clusters. Small K values generate a small number of coarse-grained clusters and vice versa. The value of K should be larger than the maximum number of possible clusters for the problem but smaller than the number of input vectors. In order to simplify the analysis, we choose commonly reported values for the third and fourth parameters (Zurada, 1992). The shape of the neighborhood we use is circular and the learning rate is exponentially decaying. Finally, the total adjustments made to all neuron weights, during a complete presentation of the input vectors, are used to determine whether stability is reached or not. If these total adjustments are below a certain limit, the network is deemed stable. Although the reduction of this limit can improve the accuracy of the results, it slows the convergence. The clustering results reported in this paper are all obtained with a unity adjustments limit.

The results of applying the SOM clustering algorithm to the animals data of Table 1 are also shown in figure 2. In this small example, varying K yields the same clustering results as those of the ART1 network. It should be noted that K represents an upper limit on the resulting number of clusters. Making K greater than or equal to four in our example yields the same results because at most four clusters could be identified in the data of Table 1.
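In the same spirit, a one-dimensional SOM can be sketched in a few lines of Python (our illustration; the exponentially decaying learning rate and shrinking neighborhood are common textbook choices rather than the paper's exact settings):

# One-dimensional SOM sketch with K nodes in a linear array.
import math, random

def som_1d(patterns, k, epochs=200, seed=1):
    rng = random.Random(seed)
    n = len(patterns[0])
    w = [[0.1 * rng.random() for _ in range(n)] for _ in range(k)]  # small random weights

    def winner(x):
        # node with the minimum squared Euclidean distance to x (steps 2 and 3)
        return min(range(k), key=lambda j: sum((xi - wi) ** 2 for xi, wi in zip(x, w[j])))

    for t in range(epochs):
        alpha = 0.5 * math.exp(-3.0 * t / epochs)        # decaying learning rate
        radius = round((k // 2) * (1.0 - t / epochs))    # shrinking D(t)
        for x in patterns:
            j_star = winner(x)
            for j in range(k):
                if abs(j - j_star) <= radius:            # update the neighborhood of j*
                    w[j] = [wi + alpha * (xi - wi) for xi, wi in zip(x, w[j])]
    return [winner(x) for x in patterns]

# Rows of Table 1: dog, cat, bat, canary, robin, pigeon, snake, lizard, alligator.
animals = [(1,0,0,0,0), (1,0,0,0,0), (1,0,0,1,0), (0,0,1,1,1), (0,0,1,1,1),
           (0,0,1,1,1), (0,1,0,0,1), (0,1,0,0,1), (0,1,0,0,1)]
print(som_1d(animals, k=4))   # K only bounds the number of clusters from above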
4. Object identification via clustering neural networks
In this section, it is demonstrated how clustering neural networks can be used to identify objects in procedural programs. In general, a routine-attribute matrix similar to the one given in the animals example of Table 1 is formed. In this matrix, the rows correspond to the routines included in the system under consideration, while the columns correspond to the attributes of these routines. The entries of the matrix are either 1 or 0, depending on whether the routine has a given attribute or not. Each row of the matrix represents a single input to the neural network. By varying the parameters K and P, the neural networks' output gives multiple clustering possibilities. The related literature has adopted several strategies for choosing the attributes. Attributes used before include usage of common global variables (Canfora et al., 1993b, 1996; Dunn and Knight, 1993; Lindig and Snelting, 1997; Liu and Wilde, 1990; Sahraoui et al., 1997; Snelting, 1996; Yeh et al., 1995), data flow information (Hutchens and Basili, 1985), usage of user-defined data types in general (Canfora et al., 1993a), and usage of structure (record)
data types in particular (Siff and Reps, 1997; Yeh et al., 1995). Similar to the Siff and Reps (1997) approach, our approach is very flexible and general when it comes to the choice of attributes. Any set of attributes that may be useful in some instances can be used in our approach. In our examples and case studies, using the following attributes, either separately or jointly, yielded good clustering results:

• Usage of global variables. An attribute might be of the form 'uses global variable x'.
• Usage of structure (record) and enumeration data types. An attribute might be of the form 'uses fields of struct stack', 'has argument of type struct stack*', or 'return type is struct stack*'.
• Disjunction of attributes related to similar user-defined types or similar global variables. For instance, if T1 and T2 are two similar data types, the disjunction 'uses fields of T1 or uses fields of T2' can improve the object identification results (Siff and Reps, 1997).
• Usage of data files and/or usage of read/write statements. In some cases, such attributes identify the objects which interact with the user.

Since the neural networks can generate different clustering results at different parameter values, we form a clustering tree, similar to those shown in figures 2 and 5, to facilitate the visualization and analysis of clustering results. In this tree, the root node represents all routines in the program. Whenever the neural network generates partitions of an existing tree node, we create the corresponding sub-nodes which represent the resulting partitions. To further explain our clustering techniques and to facilitate the comparison with related object identification techniques, we use several examples adapted from Canfora et al. (1996) and Siff and Reps (1997). Despite the fact that our techniques apply to any procedural programming language, the examples in this paper are in C.

Figure 4 shows a specific C implementation of stacks and queues (Siff and Reps, 1997). Queues are represented by two stacks; one for the front and one for the back. Information is shifted from the front stack to the back stack when the back stack is empty. The queue functions make indirect use of the stack fields by calling the stack functions.

struct stack {int *base, *sp, size;};
struct queue {struct stack *front, *back;};

/* 1 */ struct stack *initStack(int sz)      {/* uses fields of struct stack */}
/* 2 */ struct queue *initQ( )               {/* uses fields of struct queue */}
/* 3 */ int isEmptyStack(struct stack* s)    {/* uses fields of struct stack */}
/* 4 */ int isEmptyQ(struct queue *q)        {/* uses fields of struct queue */}
/* 5 */ void push(struct stack* s, int i)    {/* uses fields of struct stack */}
/* 6 */ void enq(struct queue *q, int i)     {/* uses fields of struct queue */}
/* 7 */ int pop(struct stack* s)             {/* uses fields of struct stack */}
/* 8 */ int deq(struct queue *q)             {/* uses fields of struct queue */}

Figure 4. A sample C-like code for a stack and a queue (adapted from Siff and Reps, 1997).
Table 2. Attributes for the stack-queue example.

  A1: argument or return type is struct stack*
  A2: argument or return type is struct queue*
  A3: uses fields of struct stack
  A4: uses fields of struct queue

       A1  A2  A3  A4
  1     1   0   1   0
  2     0   1   0   1
  3     1   0   1   0
  4     0   1   0   1
  5     1   0   1   0
  6     0   1   0   1
  7     1   0   1   0
  8     0   1   0   1

Figure 5. The routine-attribute matrix and its clustering results for the stack-queue example. The clustering tree splits the root {1, ..., 8} into the two clusters {1, 3, 5, 7} and {2, 4, 6, 8} for K >= 2.
We would like to identify the two objects representing the two given abstract data types. Using the functions of figure 4 and the attributes of Table 2, we formed the routine-attribute matrix of figure 5. This matrix represents the input to the two clustering neural networks under consideration. We varied P between 0.1 and 0.9 with a step of 0.1 and gave K values that are greater than or equal to 2. ART1 and SOM gave the same clustering tree, which is depicted in figure 5. As shown by the two bold rectangles, the two abstract data types are correctly identified.
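The routine-attribute matrix itself is mechanical to build once the attributes are fixed; here is a small Python sketch of this step for the stack-queue example (our illustration, with the attribute names abbreviated):

# Building the routine-attribute matrix of figure 5 from figure 4.
# A1: arg/ret type is struct stack*, A2: arg/ret type is struct queue*,
# A3: uses fields of struct stack,   A4: uses fields of struct queue.
attributes = ["A1", "A2", "A3", "A4"]
routines = {
    1: {"A1", "A3"},  # initStack
    2: {"A2", "A4"},  # initQ
    3: {"A1", "A3"},  # isEmptyStack
    4: {"A2", "A4"},  # isEmptyQ
    5: {"A1", "A3"},  # push
    6: {"A2", "A4"},  # enq
    7: {"A1", "A3"},  # pop
    8: {"A2", "A4"},  # deq
}
matrix = [[int(a in routines[f]) for a in attributes] for f in sorted(routines)]
for f, row in zip(sorted(routines), matrix):
    print(f, row)
# Feeding these rows to ART1 or SOM separates {1, 3, 5, 7} from {2, 4, 6, 8}.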
4.1. Object identification in the presence of undesired links
The results generated by clustering neural networks, in the examples considered thus far, are similar to the results produced by many other techniques in the literature (see for example Siff and Reps, 1997; Yeh et al., 1995). To demonstrate the full power of clustering neural networks, we now consider their application to real-life systems. In such systems, there can be some routines which cause undesirable clustering. Canfora et al. (1996) describe two different types of undesired links: coincidental links and spurious links. A coincidental link results from a routine that actually includes implementations of several routines, each logically belonging to a different object. Spurious links are created by routines that access the supporting data structures of more than one object in order to implement system specific operations. Many object identification approaches do not yield good results when applied to examples that exhibit undesired links (see for example Dunn and Knight, 1993; Liu and Wilde, 1990; Livadas and Johnson, 1994; Yeh et al., 1995).
       A1  A2  A3  A4
  1     1   0   1   0
  2     0   1   0   1
  3     1   0   1   0
  4     0   1   1   1
  5     1   0   1   0
  6     0   1   1   1
  7     1   0   1   0
  8     0   1   0   1

Figure 6. The routine-attribute matrix and its clustering results after modifying the stack-queue example. The clustering tree splits the root {1, ..., 8} into {1, 3, 5, 7} and {2, 4, 6, 8} for P < 0.7; for 0.7 <= P <= 0.9 or K >= 4, the cluster {2, 4, 6, 8} is further divided into {2, 8} and {4, 6}.
As an example of spurious links, Siff and Reps (1997) consider the following modification of the stack-queue example given in figure 4.

/* 4 */ int isEmptyQ(struct queue *q)       {/* uses fields of struct stack and struct queue */}
/* 6 */ void enq(struct queue *q, int i)    {/* uses fields of struct stack and struct queue */}
Although such a modification may be more efficient, it causes some queue routines to access the supporting data structure of the stack routines. Figure 6 shows the routine-attribute matrix and the clustering tree after performing this modification. It is clear that the two abstract data types are still correctly identified. Because functions 4 and 6 are different from functions 2 and 8 with respect to the set of data they access, the queue abstract data type (functions 2, 4, 6, and 8) is divided into two corresponding partitions. That is, the clustering technique provides additional information about similarities among the functions of a selected cluster. Compared to the concept analysis technique presented by Siff and Reps (1997), we do not have to add a complementary attribute of the form 'does not use fields of struct queue' to correctly identify the two abstract data types.

In order to discuss the effect of both spurious and coincidental links, Canfora et al. (1996) use the example of figure 7. This example gives a sample C-like code which uses a stack, a queue, and a list. The function global_init (function #20) is an example of a coincidental connection, while functions 14-19 exemplify spurious connections. In this example, we use six attributes corresponding to the six global variables defined in the code. Each attribute has the form 'uses global variable x'. For more details on this example and on its routine-attribute matrix, refer to (Canfora et al., 1996). The results of applying ART1 and SOM are shown in figures 8 and 9, respectively. We varied P between 0.1 and 0.9 with a step of 0.1 and gave K values that are greater than or equal to 2. Only P and K values which trigger a partitioning of an existing cluster are shown in these figures. ART1 succeeds in identifying the list (functions 10-13) and stack (functions 1-5) abstract data types. However, it is unsuccessful in separating the queue abstract data type (functions 6-9). On the other hand, SOM successfully identifies all three abstract data types. Additionally, SOM provides the information that functions 14-20 can be grouped into three clusters: (15, 16), (14, 18), and (17, 19, 20). The approach of Canfora et al. (1996) fails to automatically identify such a decomposition. Their tool only uses program slicing (Weiser, 1984) to assist in overcoming the coincidental connection introduced by routine number 20.
ELEM_T stack_struct[MAXDIM]; int stack_point;
ELEM_T queue_struct[MAXDIM]; int queue_head, queue_tail, queue_num_elem;
struct list_struct {ELEM_T node_content; struct list_struct *next_node;} list;

main( ) {/* this program exploits a stack, a queue, and a list of items of type ELEM_T */}

/*  1 */ void stack_push(el)    {/* uses stack_point and stack_struct */}
/*  2 */ ELEM_T stack_pop( )    {/* uses stack_point and stack_struct */}
/*  3 */ ELEM_T stack_top( )    {/* uses stack_point and stack_struct */}
/*  4 */ BOOL stack_empty( )    {/* uses stack_point */}
/*  5 */ BOOL stack_full( )     {/* uses stack_point */}
/*  6 */ - /* 13 */ [the queue and list routines]
/* 14 */   {/* uses stack_point, stack_struct and list */}
/* 15 */   {/* uses stack_point, stack_struct, queue_struct, queue_head and queue_num_elem */}
/* 16 */   {/* uses stack_point, stack_struct, queue_struct, queue_tail and queue_num_elem */}
/* 17 */   {/* uses queue_struct, queue_tail, queue_num_elem and list */}
/* 18 */   {/* uses stack_point, stack_struct and list */}
/* 19 */   {/* uses queue_struct, queue_head, queue_num_elem and list */}
/* 20 */   {/* uses stack_point, queue_head, queue_tail, queue_num_elem and list */}

Figure 7. A sample C-like code for a stack, a queue, and a list (adapted from Canfora et al., 1996).
Figure 8. ART1 clustering results for the stack-queue-list example.
Figure 9. SOM clustering results for the stack-queue-list example.
In summary, Table 3 presents a comparison with the closely related object identification approaches. The criteria we use in the comparison are the choice of attributes on which the identification is based, the identification algorithm, the ability to identify objects in the presence of undesired connections, the need for human intervention to perform such an identification, and the ability to produce a hierarchy of objects.
Table 3. A comparison with related object identification approaches. (The 'Identification' and 'Human effort' columns refer to the case where undesired connections exist.)

  Approach                       Attributes  Algorithm                                               Identification  Human effort  Hierarchical
  The presented neural approach  Flexible    Optimization clustering                                 Yes             No            Yes (tree)
  Canfora et al. (1996)          Fixed       Graph theoretic clustering with statistical adaptation  Yes             Yes           No
  Dunn and Knight (1993)         Fixed       Graph theoretic clustering                              No              -             No
  Liu and Wilde (1990)           Fixed       Graph theoretic clustering                              No              -             No
  Siff and Reps (1997)           Flexible    Concept analysis                                        Yes             Yes           Yes (lattice)
  Yeh et al. (1995)              Fixed       Graph theoretic clustering with informal adaptations    Yes             Yes           No
Table 4. Clustering tool performance.

  Name      Type                       Language  KLOC  Routines  Attributes  ART1 P  ART1 time (sec.)  SOM K  SOM time (sec.)
  ccount    Misc. counts for C files   C          0.8        17          10     0.9              0.00      6             0.00
  schedule  Schedule univ. courses     Pascal     1.5        39          23     0.9              0.00     21             1.98
  vh        Hypertext browser          C          4.5       103          29     0.9              0.00     45             1.05
  gdbm      Data base manager          C          6.8        69          22     0.1              0.00      9             0.93
                                                                                0.9              0.00     45             4.67
  elvis     Editor (a clone of vi/ex)  C         18.7       220          37     0.9              0.11     60            16.04

5. Implementation and evaluation
A prototype tool implementing the clustering algorithms presented in this paper has been developed. The tool accepts as its input a routine-attribute matrix, which is constructed using simple static analysis of the code. The user decides which neural network model to use and its corresponding parameter value. After applying the required clustering algorithm, the clustering results are provided using a simple text-based interface. A graphical display of the clustering results, as shown in this paper, would certainly be very advantageous.

In order to evaluate our approach, several case studies, which involve C and Pascal programs, have been carried out. The size of these programs ranges up to 19,000 lines of code. Table 4 presents the performance measurements of the clustering tool on five of these programs. The counting and scheduling programs are obtained from (Frakes et al., 1991) and (Jalote, 1991), respectively. The remaining three programs are public domain C programs obtained from the directory: ftp://ftp.ms.uky.edu/pub3/gnu. Execution times shown in Table 4 were collected by running the tool on a Pentium II-400 computer with 128 MB of RAM and running the Windows98 operating system. Execution times that are less than 0.01 second are given zero values. It should be noted that the tool execution time does not only depend on the number of routines and attributes in the software system but also depends on the specific relations existing between them. Thus, the binary content of the routine-attribute matrix affects the speed with which the algorithm reaches stability. That is why the execution time of the scheduling program, for instance, is larger than that of the hypertext browser, despite the fact that the routine-attribute matrix of the browser is larger. In all of the considered programs and case studies, the clustering results of the SOM architecture are either comparable to or slightly better than the corresponding ones of the ART1 architecture. However, the execution times of the SOM architecture are larger. Because the clustering algorithms are computationally simple, the execution times are, in general, very short. The largest execution time in Table 4 is only 16.04 seconds for an 18.7 KLOC program with a 220 x 37 routine-attribute matrix. From the computational point of
view, this implies that the clustering tool can handle large real-life systems. Nevertheless, further experimentation is needed in such cases to evaluate the quality of the clustering results. Evaluation of the clustering results in the considered case studies is performed by manual inspection of the code. Because it is difficult to manually inspect the code of large software systems, input from a software architect who is familiar with the system would be needed to evaluate the clustering results of such systems. In the following two subsections we focus on the description of two representative case studies. The first case study applies the approach to a small program, the counting program. This serves to assess the accuracy of the approach and to compare its proposed clustering results with the intended clustering of the program designer. To evaluate the effectiveness of the approach when dealing with medium-size programs, the second case study applies the approach to the database management program. Although the examples presented so far focus on the identification of abstract data types, these case studies demonstrate that our approach is also appropriate for the identification of other groups of routines which reference a common set of data, e.g., object instances (Yeh et al., 1995).
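The paper describes the two network models only at this level of detail. To make the clustering step concrete, the following is a minimal sketch (ours, not the author's implementation) of ART1-style clustering of a binary routine-attribute matrix, assuming the standard ART1 choice function and vigilance test; all names and the toy data are illustrative.

```python
import numpy as np

def art1_cluster(matrix, rho=0.9, beta=1.0):
    """Cluster the rows of a binary routine-attribute matrix, ART1-style.

    matrix : (n_routines, n_attributes) array of 0/1 values.
    rho    : vigilance parameter; higher values demand more similarity
             within a cluster, so more clusters are created.
    Returns a list assigning each routine to a cluster index.
    """
    prototypes = []            # one binary prototype vector per cluster
    assignment = []
    for row in matrix.astype(bool):
        placed = False
        # Rank existing clusters by the choice function |I AND w| / (beta + |w|).
        scores = [((row & w).sum() / (beta + w.sum()), j)
                  for j, w in enumerate(prototypes)]
        for _, j in sorted(scores, reverse=True):
            w = prototypes[j]
            match = (row & w).sum() / max(row.sum(), 1)
            if match >= rho:   # vigilance test passed: accept and learn
                prototypes[j] = row & w
                assignment.append(j)
                placed = True
                break
        if not placed:         # no cluster is close enough: start a new one
            prototypes.append(row.copy())
            assignment.append(len(prototypes) - 1)
    return assignment

# Toy routine-attribute matrix: rows are routines, columns are attributes.
m = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
print(art1_cluster(m, rho=0.9))   # -> [0, 0, 1, 1]
```

Raising ρ toward 1.0 splits the routines into more, tighter clusters, which is how the hierarchy of clustering possibilities described in the conclusions is obtained.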
5.1. First case study: A counting program
This program performs different counts for C source files (Frakes et al., 1991). It provides the number of commentary source lines, the number of non-commentary source lines, and the comment-to-code ratio for C source files. These counts are reported for each function, for lines external to functions, and for the source file as a whole. The program consists of 800 lines of C code. It has 17 functions, which are shown in Table 5. The designer of the program divided the program into 7 clusters. In Table 5, each of these clusters is enclosed between two dashed lines. The attributes we use for this program are given in Table 6. Because there is a small number of global and data type definitions in this program, we use a combination of all possible attribute categories. The attributes include the only structure defined in the program (count_struct). The three defined enumeration types are also included. We consider a disjunction of the two similar enumeration types, char_class and token_type. In addition, usage of two global definitions, data files, and read/write statements is taken into account. Because ART1 and SOM gave similar clustering results for this program, we only show the results of ART1 in figure 10. The designer's view of the program clusters is correctly identified for 5 clusters. These 5 clusters are drawn in bold rectangles. The first two designer-defined clusters were joined together in one cluster (functions 1-4). The reason for joining these four functions together is that they possess none of the considered set of attributes. That is, they represent a collection of a driver and miscellaneous utility routines. Functions 6-9 are the ones that parse lines of code and classify them. Functions 11-16 implement an abstract data type for lists of line counts. Functions 13 and 14 are more similar to each other than to the rest of the abstract data type functions because they do not use fields of the count_struct structure. Functions 5 and 10 are considered similar to this abstract data type because they have argument/return types of struct count_struct*. Function 17 generates the error messages.
Table 5. Functions for the counting program.

| # | Function | # | Function |
|---|---|---|---|
| 1 | main | 10 | report_metrics |
| 2 | check_options | 11 | create_node |
| 3 | clean_command_line | 12 | destroy_node |
| 4 | get_parameters | 13 | is_empty_list |
| 5 | count_lines | 14 | create_list |
| 6 | start_tokenizer | 15 | append_element |
| 7 | classify_line | 16 | delete_element |
| 8 | get_token | 17 | error |
| 9 | find_function_name |  |  |

(The dashed lines marking the designer-defined clusters are not reproduced in this reprint.)

Table 6. Attributes for the counting program.

| # | Attribute |
|---|---|
| A1 | argument/return type is struct count_struct * |
| A2 | uses fields of struct count_struct |
| A3 | argument/return type is token_type or char_class |
| A4 | uses elements of token_type or char_class |
| A5 | argument/return type is error_type |
| A6 | uses elements of error_type |
| A7 | uses max_line |
| A8 | uses max_ident |
| A9 | uses a file data type |
| A10 | uses read/write statements |

5.2. Second case study: A data base management system
This case study uses the GNU data base manager, GDBM, which is free software written by Nelson (1993). The software system consists of 6,760 lines of C code and it has 69 functions. The system is divided by the designer into 48 files: 9 ".h" files and 39 ".c" files. Most of the 39 ".c" files include single C functions. Only 13 files, out of 39, contain groups of C functions. In Table 7, the contents of each of these 13 files are enclosed between two dashed lines. To analyze the GDBM software, all global variables and user-defined structure and enumeration data types were used to form a list of 22 attributes. We only excluded one structure from the attribute list because it was defined to conveniently group pointers to dynamic variables as well as frequently used variables. Since read/write statements are used throughout the whole program, they were not considered in the attribute list.
Figure 10. ART1 clustering results for the counting program.
Because the SOM architecture gave slightly better overall results than the ART1 architecture, we only show the results of SOM in figure 11. Due to the large number of routines, we only form the clustering tree at two K values (9 and 45). The results, which are depicted in this figure, demonstrate that a graphical visualization and manipulation of the clustering results is required when dealing with large software systems. In the remainder of this section, we discuss the SOM results and, when necessary, point out how they differ from the ART1 results.
Table 7. Functions included in 13 files of the GDBM software.

| # | Function | # | Function | # | Function |
|---|---|---|---|---|---|
| 1 | find_stack_direction | 27 | push_avail_block | 51 | my_bcopy |
| 2 | alloca | 28 | get_elem | 52 | exchange |
| 3 | i00afunca | 29 | _gdbm_put_av_elem | 53 | _getopt_internal |
| 4 | i00afuncb | 30 | get_block | 54 | getopt |
| 5 | _gdbm_new_bucket | 31 | adjust_bucket_avail | 55 | main |
| 6 | _gdbm_get_bucket | 33 | _gdbm_read_entry | 57 | first_key |
| 7 | _gdbm_split_bucket | 34 | _gdbm_findkey | 58 | next_key |
| 8 | _gdbm_write_bucket | 40 | gdbm_open | 61 | print_bucket |
| 10 | main | 41 | gdbm_init_cache | 62 | _gdbm_print_avail_list |
| 11 | usage | 42 | rename | 63 | _gdbm_print_bucket_cache |
| 20 | dbm_firstkey | 43 | gdbm_reorganize | 64 | usage |
| 21 | dbm_nextkey | 44 | get_next_key | 65 | main |
| 24 | _gdbm_alloc | 45 | gdbm_firstkey | 67 | write_header |
| 25 | _gdbm_free | 46 | gdbm_nextkey | 68 | _gdbm_end_update |
| 26 | pop_avail_block | 50 | my_index | 69 | _gdbm_fatal |

(The dashed lines marking the 13 designer-defined files are not reproduced in this reprint.)
Figure 11. SOM clustering results for the GDBM software.
To analyze the clustering results, we consider the following two questions:

1. Assuming that the 13 files in Table 7 represent the designer-defined clusters, how many of these clusters are correctly identified by the neural algorithms?
2. Do the designer-defined clusters represent the best way to decompose the system? If not, what kind of improvements are offered by the neural algorithms?

5.2.1. Identification of designer-defined clusters. The SOM architecture only identifies the two clusters (33, 34) and (67, 68). Function 69 is not included in the second cluster because it has no attributes. The ART1 architecture, on the other hand, identifies four designer-defined clusters.

5.2.2. Improvements offered by the clustering algorithms. Figure 11 shows, in double rectangles, one view of how to decompose the program into clusters. This view divides the 69 functions of the GDBM software into 18 clusters instead of the 39 ".c" files defined by the designer. Based on the required level of granularity, there can be several other decompositions. The following three points highlight the kind of improvement offered by the clustering algorithms.

1. Grouping utility and driver routines in a few clusters: Cluster C1 groups all the utility and driver routines of the software system in one cluster. The functions included in this cluster possess no attributes. Similarly, cluster C10 includes all the driver and utility functions which only use fields of the "datum" structure.

2. Identifying new clusters and improving on already existing ones: For example, consider the new clusters C2, C9, and C11. The names of the functions included in these clusters are given in Table 8. By inspecting these names, it becomes clear why they were clustered together in the shown manner. Consider also functions 61-65, which are grouped by the software designer in a program that tests the database routines and helps in debugging them. In our results, each of these functions belongs to a cluster that better matches it from the data manipulation point of view.
Table 8. Content of some SOM clusters for the GDBM software.

| Cluster | Function numbers | Function names |
|---|---|---|
| C2 | 23, 59 | delete, store |
| C6 | 16, 17 | dbm_init, dbm_open |
|  | 45, 46 | gdbm_firstkey, gdbm_nextkey |
|  | 47, 49 | gdbm_setopt, gdbm_sync |
| C9 | 36, 39, 48 | gdbm_delete, gdbm_fetch, gdbm_store |
| C11 | 15, 32 | dbm_fetch, fetch |
| C12 | 9 | dbm_close |
|  | 20, 21 | dbm_firstkey, dbm_nextkey |
|  | 57, 58 | first_key, next_key |
3. Improving the identification of abstract data types in two different ways:

• Separating the routines which define an abstract data type from the routines that use it. Consider, for example, the file space management routines, 24-31, which are grouped by the software designer in one file. Our clustering results divide them into three separate clusters: C3, C14, and C15. These three suggested clusters separate the file space management routines (C14) from the two abstract data types (C3 and C15) they utilize. Cluster C14 includes the three functions (24, 25, and 31) which allocate space, free space, and make sure that the current space is close to half full, respectively. Cluster C15 includes four functions (26, 27, 30, and 62) which define an abstract data type for an available block of data. The four functions pop, push, search for, and print an available block of data, respectively. Cluster C3 defines an abstract data type for an available element within the available block of data. The two functions of the cluster (28 and 29) search for and insert an element in the block, respectively.

• Including all the routines that define an abstract data type. For instance, the software designer defines an abstract data type for a bucket, which is a small hash table, by grouping functions 5-8 in one file. In the SOM clustering results, the four functions are included in two clusters: C17 and C18. The two clusters also include three other functions: 35, 61, and 63. Function 35 frees all memory associated with a bucket cache, function 61 prints the bucket, and function 63 prints the bucket cache. To correctly define the abstract data type for buckets, we think that clusters C17 and C18 should be combined together. That is, this improvement is partially offered by the SOM results. In the ART1 clustering results, the four functions are scattered in different clusters.

Finally, it should be mentioned that the ART1 architecture could not identify clusters C5 and C9. As depicted in Table 8, we also think that each of the clusters C6 and C12 would better be divided into three sub-clusters. This division is clear from the function names, and it is suggested by both the ART1 results and the software designer.
6. Conclusions
An approach for identifying objects in procedural programs has been presented. This approach is based on clustering neural networks. It is very flexible and general when it comes to the choice of the attributes on which the identification is based. It is also capable of identifying objects in the presence of undesired links with no human intervention. By controlling the design parameters of the two considered neural architectures, we automatically obtain a hierarchy of clustering possibilities. The design parameter of ART1, ρ, controls the degree of similarity between elements of the same cluster. The number of clusters necessary to achieve this similarity requirement is automatically determined by the network. On the other hand, the design parameter of SOM, K, represents an upper limit on the required number of clusters. That is, the user identifies the appropriate number of clusters for a specific problem by gradually increasing K. With respect to the clustering results, the two neural architectures were successful in identifying abstract data types as well as groups of routines which reference a common set of data. However, the examples and case studies showed that although the execution times of the SOM architecture are larger than those of the ART1 architecture, the SOM clustering results are slightly better. While SOM succeeded in identifying objects in the presence of undesired links, ART1 was only partially successful. In the second case study, ART1 and SOM gave complementary results. However, the overall identification results of the SOM architecture were better. Because the presented clustering approach assists in the identification of abstract data types and groups of routines which reference a common set of data, it is convenient for re-engineering procedural programs into object-oriented ones. Future work includes experimenting with the object identification approach on software systems that are larger than the ones considered so far. In such cases, a user interface that allows graphical visualization of the analysis results becomes essential. Since some of the visualized graphs may become excessively big for medium/large size systems, it might also be necessary to navigate inside large graphs (see for example Antoniol et al., 1997; North and Koutsofios, 1994).

References

Abd-El-Hafiz, S.K. 1997. Effects of decomposition techniques on knowledge-based program understanding. In Proceedings of the International Conference on Software Maintenance, Bari, Italy, pp. 21-30.
Abd-El-Hafiz, S.K. and Basili, V.R. 1996. A knowledge-based approach to the analysis of loops. IEEE Trans. on Software Engineering, 22(5):339-360.
Abd-El-Hafiz, S.K., Basili, V.R., and Caldiera, G. 1991. Towards automated support for extraction of reusable components. In Proceedings of the Conference on Software Maintenance, Sorrento, Italy, pp. 212-219.
Achee, B.L. and Carver, D.L. 1994. A greedy approach to object identification in imperative code. In Proceedings of the Third Workshop on Program Comprehension, pp. 4-11.
Anquetil, N. and Lethbridge, T. 1998. Extracting concepts from file names: a new file clustering criterion. In Proceedings of the International Conference on Software Engineering, Kyoto, Japan.
Antoniol, G., Fiutem, R., Lutteri, G., and Merlo, E. 1997. Program understanding and maintenance with the CANTO environment. In Proceedings of the International Conference on Software Maintenance, Bari, Italy, pp. 72-81.
Canfora, G., Cimitile, A., and Munro, M. 1993a. A reverse engineering method for identifying reusable abstract data types. In Proceedings of the First Working Conference on Reverse Engineering, Baltimore, Maryland, pp. 73-82.
Canfora, G., Cimitile, A., and Munro, M. 1996. An improved algorithm for identifying objects in code. Software Practice and Experience, 26(1):25-48.
Canfora, G., Cimitile, A., Munro, M., and Taylor, C.J. 1993b. Extracting abstract data types from C programs: A case study. In Proceedings of the International Conference on Software Maintenance, Montreal, Quebec, Canada, pp. 200-209.
Cimitile, A. and Visaggio, G. 1995. Software salvaging and the call dominance tree. The Journal of Systems and Software, 28(2):117-127.
Dekker, R. and Ververs, F. 1994. Abstract data structure recognition. In Proceedings of the Ninth Knowledge-Based Software Engineering Conference, pp. 133-140.
Dunn, M.F. and Knight, J.C. 1993. Automating the detection of reusable parts in existing software. In Proceedings of the 15th International Conference on Software Engineering, Baltimore, Maryland, pp. 381-390.
Frakes, W.B., Fox, C.J., and Nejmeh, B.A. 1991. Software Engineering in the UNIX Environment. Prentice Hall.
Hutchens, D.H. and Basili, V.R. 1985. System structure analysis: Clustering with data binding. IEEE Transactions on Software Engineering, SE-11(8):749-757.
Ibba, R., Natale, D., Benedusi, P., and Naddei, R. 1993. Structure-based clustering of components for software reuse. In Proceedings of the International Conference on Software Maintenance, Montreal, Quebec, Canada, pp. 210-215.
Jain, A.K., Mao, J., and Mohiuddin, K.M. 1996. Artificial neural networks: A tutorial. IEEE Computer, 29(3):31-44.
Jalote, P. 1991. An Integrated Approach to Software Engineering. Springer-Verlag.
Knight, K. 1990. Connectionist ideas and algorithms. Communications of the ACM, 33(11):59-74.
Kunz, T. 1996. Evaluating process clusters to support automatic program understanding. In Proceedings of the Fourth Workshop on Program Comprehension, pp. 198-207.
Lakhotia, A. 1997. A unified framework for expressing software subsystem classification techniques. Journal of Systems and Software, 36:211-231.
Lindig, C. and Snelting, G. 1997. Assessing modular structure of legacy code based on mathematical concept analysis. In Proceedings of the 19th International Conference on Software Engineering, pp. 349-359.
Liu, S. and Wilde, N. 1990. Identifying objects in a conventional procedural language: An example of data design recovery. In Proceedings of the Conference on Software Maintenance, San Diego, California, pp. 266-271.
Livadas, P.E. and Johnson, T. 1994. A new approach to finding objects in programs. Software Maintenance: Research and Practice, 6:249-260.
Mancoridis, S., Mitchell, B.S., Rorres, C., Chen, Y., and Gansner, E.R. 1998. Using automatic clustering to produce high-level system organizations of source code. In Proceedings of the Sixth International Workshop on Program Comprehension, Ischia, Italy.
McFall, D. and Sleith, G. 1993. Reverse engineering structured code to an object oriented representation. In Proceedings of the Fifth International Conference on Software Engineering and Knowledge Engineering, pp. 86-93.
Mehrotra, K., Mohan, C.K., and Ranka, S. 1997. Elements of Artificial Neural Networks. The MIT Press.
Merlo, E., McAdam, I., and De Mori, R. 1993. Source code informal information analysis using connectionist models. In International Joint Conference on Artificial Intelligence, vol. 2, Los Altos, CA, pp. 1339-1344.
Muller, H.A., Orgun, M.A., Tilley, S.R., and Uhl, J.S. 1993. A reverse engineering approach to subsystem structure identification. Software Maintenance: Research and Practice, 5(4):181-204.
Nelson, P.A. 1993. GDBM, the GNU Data Base Manager. Cambridge, MA: Free Software Foundation.
Newcomb, P. and Kotik, G. 1995. Reengineering procedural into object-oriented systems. In Proceedings of the Second Working Conference on Reverse Engineering, Toronto, Ontario, Canada, pp. 237-249.
North, S. and Koutsofios, E. 1994. Applications of graph visualization. In Proceedings of Graphics Interface, Banff, Alberta, pp. 235-245.
Sahraoui, H.A., Melo, W., Lounis, H., and Dumont, F. 1997. Applying concept formation methods to object identification in procedural code. Technical Report CRIM-97/05-77, CRIM.
Schwanke, R.W. 1991. An intelligent tool for re-engineering software modularity. In Proceedings of the Thirteenth IEEE International Conference on Software Engineering, Austin, Texas, pp. 83-92.
Siff, M. and Reps, T. 1997. Identifying modules via concept analysis. In Proceedings of the International Conference on Software Maintenance, Bari, Italy, pp. 170-179.
Snelting, G. 1996. Reengineering of configurations based on mathematical concept analysis. ACM Transactions on Software Engineering and Methodology, 5(2):146-189.
Weiser, M. 1984. Program slicing. IEEE Trans. on Software Engineering, SE-10(4):352-357.
Wiggerts, T.A. 1997. Using clustering algorithms in legacy systems remodularization. In Proceedings of the Working Conference on Reverse Engineering, Amsterdam, Holland, pp. 33-43.
Yeh, A., Harris, D.R., and Reubenstein, H.B. 1995. Recovering abstract data types and object instances from a conventional procedural language. In Proceedings of the Second Working Conference on Reverse Engineering, Toronto, Ontario, Canada, pp. 227-236.
Zurada, J. 1992. Introduction to Artificial Neural Systems. West Publishing Company.
BAYESIAN-LEARNING BASED GUIDELINES TO DETERMINE EQUIVALENT MUTANTS

AURI MARCELO RIZZO VINCENZI*, ELISA YUMI NAKAGAWA*, JOSÉ CARLOS MALDONADO*, MÁRCIO EDUARDO DELAMARO† and ROSELI APARECIDA FRANCELIN ROMERO*

* Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, Av. do Trabalhador Sancarlense, 400, Cx. Postal 668, CEP 13560-970, São Carlos, SP, Brazil
† Faculdade de Informática, Fundação Eurípedes Soares da Rocha, Av. Hygino Muzzy Filho, 529, CEP 17525-901, Marília, SP, Brazil
[email protected], [email protected], [email protected], [email protected], [email protected]

Mutation testing (Mutation Analysis), although powerful in revealing faults, is considered a computationally expensive criterion, due to the high number of mutants created and the effort required to determine the equivalent mutants. Using mutation-based alternative testing criteria it is possible to reduce the number of mutants, but it is still necessary to determine the equivalent ones. In this paper Bayesian Learning (one of the Artificial Intelligence techniques used in machine learning) is investigated to define the Bayesian Learning-Based Equivalent Detection Technique (BaLBEDeT), which provides guidelines to help the tester analyze the live mutants in order to determine the equivalent ones.

Keywords: Mutation testing; program equivalence analysis; Bayesian learning.

* Corresponding author.
1. Introduction

Software testing is one of the most relevant activities used to guarantee the quality and the reliability of the software under development. The success of the testing activity depends on the quality of the test set. Testing criteria have been defined and investigated to help the tester generate and evaluate test sets. In the last two decades, data-flow and mutation based testing criteria have been intensively investigated [11, 5, 20, 7, 2]. The focus of this paper is mutation testing. Mutation testing requires the development of a test set T that reveals the presence of a well-specified set of faults [5]. The faults are modeled by a set of mutant operators which, when applied to a program P under test, generate syntactically correct programs called mutants. The
quality of T is measured by its ability to distinguish the behavior of the mutants from the behavior of the original program. One problem related to mutation testing is the large number of mutants that need to be compiled and executed. In addition, a tester needs to examine many mutants and analyze them for possible equivalence with respect to (w.r.t.) the original program. For these reasons, mutation testing is considered too expensive. Despite the high cost of mutation testing, some empirical studies provided evidence that it is effective at detecting faults [24, 16]. To reduce the number of mutants generated, some approaches have been investigated [19, 16, 13, 2], but even using these alternative approaches the equivalent mutants must be determined to obtain an adequate test set. The automation of equivalent mutant determination has been pursued by many researchers [14, 15, 17, 18, 10, 9]. This work aims at providing guidelines to ease the analysis of the live mutants. Each mutant operator has specific characteristics, i.e., one mutant operator may generate more equivalent mutants than another one. Based on these characteristics and historical information previously collected [2, 22], artificial intelligence techniques can be used to guide the analysis of the live mutants, aiming at reducing the effort to determine the equivalent ones. This paper presents a case study using Bayesian Learning algorithms, which provide probabilistic information used to estimate the number of equivalent mutants. Sec. 2 presents the background on mutation testing and Bayesian Learning. Sec. 3 presents the related work. Sec. 4 describes the experimental methodology, data collection, and the results obtained. Sec. 5 describes a case study illustrating the application of the technique. Sec. 6 presents the conclusions and future work.

2. Background

In this section, the main concepts related to mutation testing [5, 6] and Bayesian Learning [21, 12], necessary for the understanding of this paper, are explained.

2.1. An overview of mutation testing
Mutation testing provides the tester with a systematic way to generate test cases as well as to evaluate how "good" a test set is. The idea is to produce from the program under test a set of other possible implementations, the mutants, containing simple syntactic changes that are modelled by the mutant operators.ᵃ In fact, mutant operators can be seen as the implementation of a fault model that represents the common errors committed during software development.

ᵃ It may happen that on applying a given operator op to a program P no mutant is generated. This occurs when P does not contain any structure in the domain of the syntactic changes imposed by op. For example, consider the SWDD operator for the C language that replaces the while command with the do-while command. If P does not have any while commands, the application of SWDD would not generate any mutant.
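For illustration only (the paper's operators target C, and the example below is ours, not from the paper), here is a relational-operator mutant of a small Python function. It also shows why equivalence is the hard case: this particular mutant cannot be killed by any test.

```python
def max_of(a, b):
    """Original program under test: returns the larger of two numbers."""
    if a > b:
        return a
    return b

def max_of_mutant(a, b):
    """Mutant: the relational operator '>' was replaced by '>='."""
    if a >= b:
        return a
    return b

# A test case "kills" the mutant if the two versions disagree on it.
# Here no input can distinguish them (when a == b both return the same
# value), so this mutant is *equivalent* -- precisely the kind of live
# mutant whose analysis this paper tries to cheapen.
for a, b in [(1, 2), (2, 1), (2, 2)]:
    assert max_of(a, b) == max_of_mutant(a, b)
print("mutant survived all tests")
```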
The goal of mutation testing is to encourage the tester to find test cases that make the mutants behave differently from the original program, thereby distinguishing the mutants. Such distinguished mutants are said to be "dead". The "live" mutants are those that behave as the original program for all the test cases in the test set. This may occur either because the mutant is equivalent to the original program or because the test set is not good enough to distinguish the mutant. In the former case, the mutants can be dismissed. In the latter, the test set should be improved. The mutation score, the ratio of the number of dead mutants to the number of non-equivalent mutants, provides the tester with a mechanism to assess the quality of the testing activity. When the mutation score reaches 1.00, it is said that the test set T is adequate w.r.t. mutation testing (MT-adequate) to test the program P, increasing the confidence in the program under test. Some alternative approaches have been investigated to deal with the cost aspects: Randomly Selected Mutation, Constrained Mutation and Selective Mutation [19, 16, 13, 2]. The goal is to determine a subset of mutations in such a way that if a test set T is obtained which is able to distinguish those mutations, then T would also distinguish the complete set of mutations.

2.2. Bayesian learning

Bayesian reasoning provides a probabilistic approach to inference. It is based on the assumption that quantities of interest are governed by probability distributions and that optimal decisions are taken by reasoning about these probabilities together with observed data. Bayesian reasoning also provides the basis for learning algorithms that directly manipulate probabilities, as well as a framework for analyzing the operations of other algorithms that do not explicitly manipulate probabilities. Bayesian learning algorithms calculate explicit probabilities for hypotheses, and are among the most practical approaches to certain types of learning problems [12].

2.2.1. Bayes theorem

The Bayes theorem is the basis for all Bayesian learning algorithms. It is given by Eq. (1):
P(h|D) = P(D|h)P(h) / P(D)    (1)
where:

• P(h) denotes the initial probability that a hypothesis h holds before we observe the training data. P(h) is often called the prior probability of h and may reflect any background knowledge we have about the chance that h is a correct hypothesis;
• P(D) denotes the prior probability that training data D will be observed;
• P(D|h) denotes the probability of observing data D given some world in which hypothesis h holds. More generally, we write P(x|y) to denote the probability of x given y; and
• P(h|D) is called the posterior probability of h, because it reflects our confidence that h holds after we have seen the training data D.

Note that the term P(D) is a constant independent of h and can be dropped, as shown in Eq. (2):

P(h|D) ∝ P(D|h)P(h)    (2)

In many learning scenarios, the learner considers a set of candidate hypotheses H and is interested in finding the most probable hypothesis h ∈ H, given the observed data D. Any such maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis. The Brute-Force MAP Learning Algorithm, which is used in this paper, is based on the MAP. The two steps of this algorithm are [12]:

1. For each hypothesis h in H, calculate the posterior probability using Eq. (2); and
2. Output the hypothesis h_MAP with the highest posterior probability:

h_MAP = argmax_{h ∈ H} P(h|D)    (3)
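As a concrete illustration of the Brute-Force MAP algorithm, here is a minimal sketch; the priors and likelihoods are invented toy values, not numbers from the paper.

```python
def brute_force_map(hypotheses, prior, likelihood, data):
    """Return the MAP hypothesis: argmax over h of P(D|h) * P(h).

    hypotheses : iterable of candidate hypotheses H
    prior      : function h -> P(h)
    likelihood : function (data, h) -> P(D|h)
    """
    # Unnormalized posteriors; P(D) is constant in h, so it can be dropped.
    posterior = {h: likelihood(data, h) * prior(h) for h in hypotheses}
    return max(posterior, key=posterior.get)

# Toy example: is a live mutant equivalent ('e') or not ('not-e')?
prior = {"e": 0.3, "not-e": 0.7}.get
def likelihood(data, h):
    # data: number of test cases executed without killing the mutant;
    # assume each test survives with prob. 1.0 if equivalent, 0.9 otherwise.
    return 1.0 if h == "e" else 0.9 ** data

print(brute_force_map(["e", "not-e"], prior, likelihood, data=20))  # -> 'e'
```

Note how the MAP choice flips toward "equivalent" as more test cases fail to kill the mutant, which is the intuition behind the guidelines developed below.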
In Sec. 5 the Brute-Force MAP Learning Algorithm is used for predicting the percentage of equivalent mutants for each mutant operator. More information about Bayes theorem and the Brute-Force MAP Learning Algorithm can be found in [12].

3. Related Work

Empirical studies have provided evidence that mutation testing is among the most promising criteria in terms of fault detection [24, 16]. However, mutation testing often imposes unacceptable demands on computing and human resources because of the large number of mutants that need to be executed and that need to be analyzed for possible equivalence with respect to the original program. Randomly Selected Mutation, Constrained Mutation and Selective Mutation [19, 16, 13, 2] are alternatives to reduce the number of mutants generated, but if we want to obtain an adequate test set, equivalent mutants must still be determined. Offutt et al. [14, 6, 15, 17, 18] have considered the problems of test data generation and equivalent mutant detection, using both constraint-based techniques and compiler optimizations. The idea explored by Offutt and Craft [14] was to implement a set of compiler-optimization heuristics and evaluate them. This approach consists of looking at the mutants which, compared to the original program, implement traditional "peep-hole" compiler optimizations [1]. Compiler optimizations
are designed to create faster, but equivalent, programs, so that a mutant which implements a compiler optimization is, by definition, an equivalent mutant. The set of implemented heuristics they used was able to detect about 10% of the equivalent mutants. DeMillo and Offutt [6], using the concept of constraint, developed an automatic way to generate test cases. Their idea is that by solving a set of constraints it is possible to generate a test case that kills any given mutant. Even if the constraints are not completely satisfied, the set of constraints is also useful to determine equivalent mutants. Empirical studies showed that the approach could achieve a detection rate of equivalent mutants of about 50% [17, 18]. Hierons et al. [10] show how amorphous slicing [8] can be used to assist the human analysis of live mutants, rather than as a way of automatically determining the equivalent ones. Another study, developed by Harman et al. [9], shows the relationship between program dependence and mutation testing. The idea of the authors is to combine dependence analysis tools with existing mutation testing tools, supporting the test data generation and the determination of equivalent mutants. The authors also proposed a new mutation testing process which starts and ends with dependence analysis phases. The pre-analysis phase removes a class of equivalent mutants from further analysis, while the post-analysis phase is used to reduce the human effort to study the few mutants that evade the automated phases of the process [9]. This paper aims at reducing the effort needed to analyze the live mutants instead of providing a way to automatically detect the equivalent ones. The idea presented here is to provide guidelines to ease the determination of equivalent mutants and also the identification of non-equivalent ones, which is useful to improve the test set. Based on historical data collected in previous experiments [2, 22], our approach, named Bayesian Learning-Based Equivalent Detection Technique (BaLBEDeT), uses the Brute-Force algorithm to estimate which is the most promising group of mutants that should be analyzed. In the next sections we present the experiment description and the case study that was carried out.

4. Experiment Description

The methodology used in the experiment comprises four steps:
• Program Selection;
• Tool Selection;
• Test Set Generation; and
• Results and Data Analysis.

Details of these steps can be found in Secs. 4.1 to 4.4.
4.1. Program selection

Five UNIX C programs, called the 5-UNIX program suite, are used: Cal, Checkeq, Comm, Look and Uniq. Although they are simple programs (about 100 LOC each), our intention is to evaluate the applicability of Bayesian Learning in this context. We will then investigate the applicability of Bayesian Learning to larger programs and in other domains.

4.2. Tool selection

To support the application of Mutation Analysis, Proteum [3] was used. This tool was developed at the Instituto de Ciências Matemáticas e de Computação da Universidade de São Paulo and at the Departamento de Informática da Universidade Estadual de Maringá, Brazil. Some facilities that ease the carrying out of empirical studies are provided, such as:

• Test case handling: execution, inclusion/exclusion and enabling/disabling of test cases;
• Mutant handling: creation, selection, execution, and analysis of mutants; and
• Adequacy analysis: mutation score and statistical reports.

With these characteristics, different combinations of test sets can be evaluated against different groups of mutants in the same test session. In this way, alternative approaches to apply mutation testing, such as Randomly Selected Mutation, Constrained Mutation and Selective Mutation, are easily applied. Proteum supports the application of mutation testing at the unit level and implements 71 mutant operators to test C programs [3]. These operators are divided into 4 classes according to where the mutation is applied: Constants, Operators, Statements and Variables. A description of the unit operators is presented elsewhere [3]. To illustrate the cost aspect related to the number of mutants, the total number of mutants generated for the Mutation Analysis criterion is provided in Table 1, considering each mutation class. Note that even for small programs, Mutation Analysis can be very expensive, so the investigation of mechanisms to reduce its application cost and to ease the analysis of live mutants is worth pursuing. For example, the Cal and Comm programs both have 119 LOC but generate significantly different numbers of mutants: 4,332 and 1,728, respectively. Considering the Constants class, 1,780 mutants were generated for the Cal program, over 5 times the number generated for the Comm program. For the Operators and Variables classes, 1,409 and 791 mutants were generated for the Cal program, respectively, over 2 times the numbers generated for the Comm program: 642 and 367, respectively. The equivalent mutants for the 5-UNIX programs have already been determined in previous experiments [2, 22]. Table 2 shows the total number of equivalent mutants manually determined for each program, considering the four mutation classes. The percentage w.r.t. the total of equivalent ones per program is also provided.
Table 1. Total and percentage of mutants generated by each mutation class.

| Program | Constants # Mut. | % Total | Operators # Mut. | % Total | Statements # Mut. | % Total | Variables # Mut. | % Total | Total |
|---|---|---|---|---|---|---|---|---|---|
| Cal | 1,780 | 41.1 | 1,409 | 32.5 | 352 | 8.1 | 791 | 18.3 | 4,332 |
| Checkeq | 1,111 | 35.9 | 937 | 30.2 | 268 | 8.6 | 783 | 25.3 | 3,099 |
| Comm | 314 | 18.2 | 642 | 37.2 | 405 | 23.4 | 367 | 21.2 | 1,728 |
| Look | 371 | 18.1 | 720 | 35.0 | 319 | 15.5 | 646 | 31.4 | 2,056 |
| Uniq | 244 | 15.1 | 621 | 38.3 | 348 | 21.5 | 406 | 25.1 | 1,619 |
| Total | 3,820 | 29.8 | 4,329 | 33.7 | 1,692 | 13.2 | 2,993 | 23.3 | 12,834 |
Table 2. Total and percentage of equivalent mutants generated by each mutation class.

| Program | Constants # Mut. | % Total | Operators # Mut. | % Total | Statements # Mut. | % Total | Variables # Mut. | % Total | Total |
|---|---|---|---|---|---|---|---|---|---|
| Cal | 72 | 21.8 | 113 | 34.2 | 12 | 3.7 | 133 | 40.3 | 330 |
| Checkeq | 2 | 0.9 | 105 | 47.1 | 5 | 2.2 | 111 | 49.8 | 223 |
| Comm | 28 | 14.4 | 123 | 63.1 | 9 | 4.6 | 35 | 17.9 | 195 |
| Look | 46 | 17.9 | 111 | 43.2 | 26 | 10.1 | 74 | 28.8 | 257 |
| Uniq | 14 | 8.4 | 119 | 71.2 | 4 | 2.4 | 30 | 18.0 | 167 |
| Total | 162 | 13.8 | 571 | 48.7 | 56 | 4.8 | 383 | 32.7 | 1,172 |
4.3. Test set generation

For each one of the 5-UNIX programs a pool of 500 test cases was constructed. Each pool is composed of:

1. ad hoc test cases based on the program specification; and
2. random test cases taken from Wong's experiments (used to compare Mutation Analysis and Data-Flow based criteria [23]).

4.4. Results and data analysis

For each program, a test session was created using all the mutant operators, and the 500 test cases were executed against all non-equivalent mutants. Note that, although the 71 mutant operators were applied, 15 mutant operators generate no mutants for the 5-UNIX programs and, therefore, there is no statistical information about them.ᵇ Next, using the features available in Proteum to enable/disable test cases, 19 subsets of test cases were evaluated. The idea is to observe the variation in the number of equivalent and non-equivalent mutants as more and more test cases are executed.
ᵇ Operators that produce no mutants, considering the 5-UNIX programs: OBAA, OBBA, OBEA, OBSA, OSAA, OSAN, OSBA, OSBN, OSEA, OSLN, OSRN, OSSN, OSSA, SGLR and Vtrr.
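The adequacy measure behind these test sessions is the mutation score defined in Sec. 2.1. A minimal sketch (ours, using the Cal figures from Tables 1 and 2):

```python
def mutation_score(dead_mutants, total_mutants, equivalent_mutants):
    """ms(P, T): dead mutants divided by non-equivalent mutants."""
    return dead_mutants / (total_mutants - equivalent_mutants)

# Cal program (Tables 1 and 2): 4,332 mutants, 330 of them equivalent.
# A test set killing all 4,002 non-equivalent mutants is MT-adequate:
print(mutation_score(4002, 4332, 330))  # -> 1.0
```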
Table 3. Average data for the 5-UNIX programs: (a) empty test set, (b) test set with 20 elements, and (c) test set with 500 elements. (The body of Table 3 and the statements of Eqs. (7) and (8) are not legible in this reprint.)

Applying Eqs. (7) and (8) for all mutant operators it is possible to estimate the probability of each one being equivalent (e) or non-equivalent (ē) for different subsets of test cases. For instance, considering Table 3 and the operator OLBN we have:

• Table 3(a), empty test set: P_nor(e|OLBN) = 0.06 and P_nor(ē|OLBN) = 0.94
• Table 3(b), test set with 20 elements: P_nor(e|OLBN) = 0.72 and P_nor(ē|OLBN) = 0.28
• Table 3(c), test set with 500 elements: P_nor(e|OLBN) = 0.97 and P_nor(ē|OLBN) = 0.03

Note that after 20 test cases have been executed (Table 3(b)) the probability of a live mutant being equivalent increases. So, the more test cases that have been executed, the higher the confidence that a live mutant is equivalent. For the 19 subsets of test cases we have calculated the probability of each mutant operator being able to produce equivalent and non-equivalent mutants. The tester, while testing another program, may select the probabilities corresponding to the number of test cases that he/she has already used and then estimate the current number of equivalent and non-equivalent mutants. The next section presents an example of this technique.
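To make the estimation step concrete, here is a minimal sketch (ours, not the authors' tool) of how the tabulated probabilities would be applied to a new program's live mutants. The OLBN value is the one quoted above; the Cccr value is the rounded figure printed in Table 5, and OASA is included for contrast.

```python
# P_nor(e|op): probability that a live mutant of operator `op` is
# equivalent, after 20 executed test cases (values as printed/rounded).
p_equiv_20 = {"OLBN": 0.72, "Cccr": 0.40, "OASA": 0.00}

def estimate(live_counts, p_equiv):
    """For each operator, split its live mutants into estimated
    equivalent and estimated non-equivalent (i.e., killable) mutants."""
    report = {}
    for op, live in live_counts.items():
        eq = live * p_equiv[op]
        report[op] = {"est_equivalent": eq, "est_will_die": live - eq}
    return report

# e.g. 905 live Cccr mutants -> about 362 equivalent, 543 killable
# (the paper reports 359.55 and 545.45 using unrounded probabilities)
print(estimate({"Cccr": 905, "OLBN": 50, "OASA": 3}, p_equiv_20))
```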
Table 4. Total and percentage of generated and equivalent mutants for the Sort program.

|  | Constants # Mut. | % Total | Operators # Mut. | % Total | Statements # Mut. | % Total | Variables # Mut. | % Total | Total |
|---|---|---|---|---|---|---|---|---|---|
| Generated | 3,769 | 16.81 | 5,104 | 22.77 | 2,745 | 12.24 | 10,801 | 48.18 | 22,419 |
| Equivalent | 857 | 31.16 | 878 | 31.93 | 124 | 4.51 | 891 | 32.40 | 2,750 |
5. A Case Study: Sort Program

In this section, we consider that we have a program to be tested and would like to estimate the number of equivalent/non-equivalent mutants generated. The probabilistic information previously obtained from the 5-UNIX programs is used to estimate these quantities, and the estimated values are compared with the real ones. The Sort program is used in our example. This program has approximately 624 LOC and is used to classify records in one or more files. Applying the 71 mutant operators to the Sort program, 22,419 mutants were generated (2,750 of them are equivalent). Nine mutant operators (OSAA, OSAN, OSBA, OSBN, OSEA, OSLN, OSRN, OSSN and OSSA) generate no mutants and were not considered. Note that these operators are a subset of the 15 that generate no mutants in the 5-UNIX programs. Table 4 shows more detailed information about the number and percentage of mutants generated (first line) and equivalent ones (second line) for each mutation class. Next, supposing that all of the 22,419 mutants have been executed with 20 and 100 test cases, Table 5 and Table 6 show the real number of live and equivalent mutants that remain and the probability of each mutant operator producing non-equivalent and equivalent mutants. According to Table 5, 905 out of 1,208 mutants of Cccr are alive after executing 20 test cases: 374 out of 905 are non-equivalent and 531 out of 905 are equivalent mutants. With respect to the 5-UNIX programs, considering a test set with 20 elements, the probability of Cccr's live mutants being equivalent is 0.40 against 0.60 for being non-equivalent, i.e., of the 905 Cccr live mutants, 545.45 are estimated to be non-equivalent and 359.55 to be equivalent, which means an error rate of around 18.9%. The error rate means the discrepancy obtained from comparing the original data about equivalent and non-equivalent mutants against the BaLBEDeT technique. Observe that the estimated numbers of non-equivalent and equivalent mutants are sometimes very different from those found in the real data. On average, considering 20 test cases, the error in the estimation is around 19.3% and the standard deviation 17.4%. Considering 100 test cases, the error is around 23.2% and the standard deviation 16.9%. Table 7 shows the error rate obtained in the estimation of non-equivalent and equivalent mutants, considering 20 and 100 test cases. For 20 test cases (Table 7(a)), 17 mutant operators with an error rate between 0%-10% were identified, and 9 of these 17 operators are classified with an error rate lower than 5%. Most of the mutant operators were classified with an error rate higher than 10%.
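The reported error rate can be reproduced directly from a table row; a quick check of our reading of the Cccr row (the computation below is ours, not from the paper):

```python
# Cccr row of Table 5: 905 live mutants, of which 374 actually die and
# 531 are equivalent; the technique estimates 545.45 will die and 359.55
# are equivalent.  The error is the misestimated fraction of live mutants.
live, real_die, est_die = 905, 374, 545.45
error_rate = abs(est_die - real_die) / live
print(f"{error_rate:.1%}")   # -> 18.9%, matching the text
```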
Table 5. Sort program: real data vs. estimated data for 20 test cases.

| Oper | Total | Live | Will die | Equiv. | Prob. non-equiv. | Prob. equiv. | Will die (est.) | Equiv. (est.) | Error (%) |
|---|---|---|---|---|---|---|---|---|---|
| Cccr | 1,208 | 905 | 374 | 531 | 0.60 | 0.40 | 545.45 | 359.55 | 18.9 |
| Ccsr | 1,542 | 1,025 | 817 | 208 | 0.98 | 0.02 | 1,007.75 | 17.25 | 18.6 |
| CRCR | 1,019 | 655 | 537 | 118 | 0.94 | 0.06 | 616.10 | 38.90 | 12.1 |
| OARN | 282 | 141 | 117 | 24 | 0.91 | 0.09 | 128.98 | 12.02 | 8.5 |
| OASA | 8 | 3 | 3 | 0 | 1.00 | 0.00 | 3.00 | 0.00 | 0.0 |
| OCOR | 16 | 16 | 0 | 16 | 0.00 | 1.00 | 0.00 | 16.00 | 0.0 |
| SCRB | 21 | 17 | 8 | 9 | 0.00 | 1.00 | 0.00 | 17.00 | 47.1 |
| SSDL | 552 | 256 | 201 | 55 | 0.70 | 0.30 | 178.75 | 77.25 | 8.7 |
| STRP | 556 | 128 | 120 | 8 | 0.91 | 0.09 | 115.97 | 12.03 | 3.1 |
| (name illegible) | - | 546 | 199 | 347 | 0.14 | 0.86 | 76.34 | 469.66 | - |
| (name illegible) | - | 3,004 | 2,584 | 420 | 0.85 | 0.15 | 2,563.51 | 440.49 | - |
| (name illegible) | - | 384 | 310 | 74 | 0.73 | 0.27 | 278.88 | 105.12 | - |

Totals over all operators: 9,214 will die and 2,750 are equivalent; average error 19.3%, standard deviation 17.4%.
Table 6. Sort program: real data vs. estimated data for 100 test cases.

| Oper | Total | Live | Will die | Equiv. | Prob. non-equiv. | Prob. equiv. | Will die (est.) | Equiv. (est.) | Error (%) |
|---|---|---|---|---|---|---|---|---|---|
| Cccr | 1,208 | 666 | 135 | 531 | 0.25 | 0.75 | 168.59 | 497.41 | 5.0 |
| Ccsr | 1,542 | 459 | 251 | 208 | 0.93 | 0.07 | 426.28 | 32.72 | 38.2 |
| CRCR | 1,019 | 313 | 195 | 118 | 0.76 | 0.24 | 237.73 | 75.27 | 13.7 |
| OARN | 282 | 70 | 46 | 24 | 0.70 | 0.30 | 49.22 | 20.78 | 4.6 |
| OASA | 8 | 2 | 2 | 0 | 1.00 | 0.00 | 2.00 | 0.00 | 0.0 |
| OCOR | 16 | 16 | 0 | 16 | 0.00 | 1.00 | 0.00 | 16.00 | 0.0 |
| SCRB | 21 | 14 | 5 | 9 | 0.00 | 1.00 | 0.00 | 14.00 | 35.7 |
| SSDL | 552 | 135 | 80 | 55 | 0.34 | 0.66 | 46.55 | 88.45 | 24.8 |
| STRP | 556 | 54 | 46 | 8 | 0.61 | 0.39 | 33.06 | 20.94 | 24.0 |
| (name illegible) | - | 438 | 91 | 347 | 0.04 | 0.96 | 17.61 | 420.39 | - |
| (name illegible) | - | 1,165 | 745 | 420 | 0.56 | 0.44 | 647.24 | 517.76 | - |
| (name illegible) | - | 204 | 130 | 74 | 0.41 | 0.59 | 83.59 | 120.41 | - |

Totals over all operators: 2,657 will die and 2,750 are equivalent; average error 23.2%, standard deviation 16.9%.
For 100 test cases (Table 7(b)), 12 operators were classified with an error rate lower than 10%, and more operators were classified with an error rate above 20% than in Table 7(a). In Artificial Intelligence, an error rate around 10% in estimation is considered reasonable [12].

Table 7. Error rate: (a) 20 test cases; and (b) 100 test cases.

(a) 20 test cases:
[0%-10%): OASA (0.0), OCOR (0.0), ORLN (0.5), Vsrr (0.7), OALN (0.7), OABN (2.6), STRP (3.1), STRI (3.8), Vprr (4.4), SMTC (5.2), Oido (5.3), OLNG (6.5), OASN (6.7), VTWD (8.1), OBAN (8.2), OARN (8.5), SSDL (8.7)
[10%-20%): OLLN (10.0), OLRN (10.5), OCNG (10.9), OESA (11.2), OBSN (11.5), CRCR (12.1), ORRN (12.3), ORAN (12.7), ORSN (14.2), SWDD (14.4), Varr (16.4), OBRN (16.9), OIPM (18.2), Ccsr (18.6), Cccr (18.9), ORBN (19.3)
[20%-30%): OAAN (20.5), VDTR (22.5), SMVB (24.1), OLAN (24.9), OLSN (25.2), OLBN (26.4), SSWM (29.0)
[30% and above): OEAA (37.4), OBLN (44.4), SCRB (47.1)

(b) 100 test cases:
[0%-10%): the full list is not legible in this reprint; from Table 6 it includes at least OASA (0.0), OCOR (0.0), OARN (4.6) and Cccr (5.0)
[10%-20%): Vprr (13.1), OALN (14.3), ORRN (16.6), OEBA (19.1)
[20%-30%): OLBN (20.7), ORBN (20.7), OLAN (21.2), VTWD (22.7), OBRN (23.4), STRP (24.0), SSDL (24.8), Oido (24.8), OEAA (25.0), OAAN (27.8), SMVB (29.0)
[30% and above): OLSN (31.4), OABN (32.0), OLLN (33.3), OCNG (34.8), SCRB (35.7), Ccsr (38.2), OBLN (44.4), SMTC (47.4), OLRN (49.2), OAAA (50.0), SSWM (50.0), OBBN (50.2), Varr (55.9), SRSR (60.4), OBNG (62.8)
It should be pointed out that more historical information should be collected in order to obtain a more representative estimate for programs with the characteristics of Sort. The idea is to define probabilistic information according to the characteristics of a set of programs. So, for each set, there would be one instance of the application of the technique that is a better estimator of the number of equivalent mutants. The results obtained here should be considered as a first attempt at providing guidelines to the tester to analyze live mutants. For example, considering Table 5, the OASA operator generates 8 mutants. After the run with 20 test cases, 3 are kept alive. The probabilistic information about this operator tells us that all OASA live mutants should die, since the probability of an OASA live mutant being equivalent is zero. Therefore, if the tester wishes to evaluate the live mutants by trying to improve the test set, he/she should first analyze the kinds of operators that are supposed to produce non-equivalent mutants. On the other hand, considering the OCOR operator, which generates 16 mutants, according to the probabilistic information all OCOR live mutants should be equivalent. So, if the tester wishes to determine the equivalent ones, he/she should first analyze this kind of operator. Thus, there are operators for which live mutants are more or less likely to be equivalent. For example, the probability of an STRP live mutant being equivalent is 0.09.

6. Conclusion and Future Work

In this paper the use of Bayesian Learning, an Artificial Intelligence technique, was investigated to provide guidelines to help the tasks of determining equivalent and non-equivalent mutants. The main contribution of this paper is the proposition of the BaLBEDeT technique (Bayesian Learning-Based Equivalent Detection Technique). Further studies are being planned to investigate the scalability of these results to larger programs. We are also interested in expanding the selection of programs to different application domains to replicate this study, in order to increase the validity of these results. A database with information about equivalent mutants in a different set of programs in different domains would lead to more precise predictions (closer to real predictions), and less effort would be needed to analyze the live mutants. A further refinement of this work would consider the frequency of execution. Proteum/IM 2.0 [4] provides this kind of information. The frequency of execution of a given mutant indicates how many test cases of the test set reach the mutated code.

Acknowledgments

The authors would like to thank the Brazilian funding agencies CNPq, FAPESP and CAPES, and Telcordia Technologies (USA), for their partial support of this research, and the anonymous referees for their valuable comments. We would also like to thank Eric Wong, who provided part of the test cases used in this experiment, and Rodrigo Funabashi Jorge, who provided information on equivalent mutants.

References

1. A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques and Tools, Addison-Wesley, 1986.
2. E. F. Barbosa, J. C. Maldonado, and A. M. R. Vincenzi, "Towards the determination of sufficient mutant operators for C", Software Testing, Verification and Reliability 11(2) (2001) 113-136.
3. M. E. Delamaro and J. C. Maldonado, "Proteum — a tool for the assessment of test adequacy for C programs", in Conference on Performability in Computing Systems (PCS'96), Brunswick, NJ, July 1996, pp. 79-95.
4. M. E. Delamaro, J. C. Maldonado, and A. M. R. Vincenzi, "Proteum/IM 2.0: An integrated mutation testing environment", in Mutation 2000 Symposium, San Jose, CA, Oct. 2000, pp. 124-134.
5. R. A. DeMillo, R. J. Lipton, and F. G. Sayward, "Hints on test data selection: Help for the practicing programmer", IEEE Computer 11(4) (1978) 34-43.
6. R. A. DeMillo and A. J. Offutt, "Constraint based automatic test data generation", IEEE Trans. on Software Engineering 17(9) (1991) 900-910.
7. R. G. Hamlet, "Testing programs with the aid of a compiler", IEEE Trans. on Software Engineering 3(4) (1977) 279-290.
8. M. Harman and S. Danicic, "Amorphous program slicing", in 5th IEEE Int. Workshop on Program Comprehension (IWPC'97), IEEE Computer Society Press, Dearborn, Michigan, May 1997, pp. 70-79.
9. M. Harman, R. Hierons, and S. Danicic, "The relationship between program dependence and mutation testing", in Mutation 2000 Symposium, San Jose, CA, Oct. 2000, pp. 15-23.
10. R. M. Hierons, M. Harman, and S. Danicic, "Using program slicing to assist in the detection of equivalent mutants", Software Testing, Verification and Reliability 9(4) (1999) 233-262.
11. W. E. Howden, "Reliability of the path analysis testing strategy", IEEE Trans. on Software Engineering 2(3) (1976) 208-214.
12. T. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
13. E. Mresa and L. Bottaci, "Efficiency of mutation operators and selective mutation strategies: An empirical study", The Journal of Software Testing, Verification and Reliability 9(4) (1999) 205-232.
14. A. J. Offutt and W. M. Craft, "Using compiler optimization techniques to detect equivalent mutants", Software Testing, Verification and Reliability 4 (1994) 131-154.
15. A. J. Offutt, Z. Jin, and J. Pan, "The dynamic domain reduction approach to test data generation", Software Practice and Experience 29(2) (1999) 167-193.
16. A. J. Offutt, A. Lee, G. Rothermel, R. H. Untch, and C. Zapf, "An experimental determination of sufficient mutant operators", ACM Transactions on Software Engineering Methodology 5(2) (1996) 99-118.
17. A. J. Offutt and J. Pan, "Detecting equivalent mutants and the feasible path problem", in COMPASS'96 — Annual Conference on Computer Assurance, IEEE Computer Society Press, Gaithersburg, MD, June 1996, pp. 224-236.
18. A. J. Offutt and J. Pan, "Automatically detecting equivalent mutants and infeasible paths", Software Testing, Verification and Reliability 7(3) (1997) 165-192.
19. A. J. Offutt, G. Rothermel, and C. Zapf, "An experimental evaluation of selective mutation", in 15th Int. Conf. on Software Engineering, Baltimore, MD, May 1993, pp. 100-107.
20. S. Rapps and E. J. Weyuker, "Selecting software test data using data flow information", IEEE Transactions on Software Engineering 11(4) (1985) 367-375.
21. S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Prentice-Hall, New Jersey, 1995.
22. A. M. R. Vincenzi, J. C. Maldonado, E. F. Barbosa, and M. E. Delamaro, "Unit and integration testing strategies for C programs using mutation-based criteria", Software Testing, Verification and Reliability 11(4) (2001) 249-268.
23. W. E. Wong, On Mutation and Data Flow, Ph.D. thesis, Department of Computer Science, Purdue University, W. Lafayette, IN, Dec. 1993.
24. W. E. Wong, A. P. Mathur, and J. C. Maldonado, "Mutation versus all-uses: An empirical evaluation of cost, strength, and effectiveness", in Int. Conf. on Software Quality and Productivity, Hong Kong, Dec. 1994, pp. 258-265.
Chapter 4
ML Applications in Transformation

One of the essential challenges in SE, as eloquently explicated by Brooks, is changeability: "The software product is embedded in a cultural matrix of applications, users, laws, and machine vehicles. These all change continually, and their changes inexorably force change upon the software product." Changes can be made to a software system through transformations. A transformation to a software product is a mapping from one model to another that aims at improving certain aspects of the transformed software product (e.g., improved modularity, desirable parallelism, improved run-time performance) while preserving all of its other properties (e.g., its functionality) [23]. A transformation is usually localized, affects a small number of classes, attributes, and operations, and is carried out in a series of small steps. In this chapter, we focus on ML applications in software product transformation. Table 24 offers a state-of-the-practice in this area.

Table 24. ML methods used in transformation. (Columns: NN, IBL, DT, GA, GP, ILP, EBL, CL, BL, AL, IAL, RL, EL, SVM, CBR; rows: Parallel Programs, Modularity, Object-oriented Applications. The check marks indicating which methods apply to each row are not legible in this reprint.)
165
The following paper will be included here: R. Schwanke and S. Hanson, "Using neural networks to modularize software", Machine Learning, Vol.15, No.2, 1994, pp.137-168.
Using Neural Networks to Modularize Software

ROBERT W. SCHWANKE AND STEPHEN JOSE HANSON
Siemens Corporate Research, Princeton, NJ 08540

Editor: Alex Waibel

Abstract. This article describes our experience with designing and using a module architecture assistant, an intelligent tool to help human software architects improve the modularity of large programs. The tool models modularization as nearest-neighbor clustering and classification, and uses the model to make recommendations for improving modularity by rearranging module membership. The tool learns similarity judgments that match those of the human architect by performing back propagation on a specialized neural network. The tool's classifier outperformed other classifiers, both in learning and generalization, on a modest but realistic data set. The architecture assistant significantly improved its performance during a field trial on a larger data set, through a combination of learning and knowledge acquisition.

Keywords: neural networks, software modularization, similarity classification.
1. Introduction

1.1. The cognitive task of programming

Software engineers today face a formidable cognitive challenge: understanding the interactions among thousands of procedures, variables, data types, macros, and files. Most software engineers work on large, long-lived programs. Consequently, they spend more of their time modifying existing code than they do creating new code. The engineer frequently must read and understand parts of the program that he did not write, or that he wrote months or years ago and no longer recognizes. Any documentation he might have available is almost certainly obsolete. The original designers of the system have probably moved on to new projects, or even new employers. Thus, he is left with only the code itself to give him the information he needs. Most significant commercial software systems comprise more than 100,000 lines of code (1600 pages, thicker than a James Michener novel). Fortunately, the code is typically organized into modules,¹ so that the programmer can deal with it in larger chunks. Even so, a large system is likely to comprise more than 10,000 procedures, variables, types, macros, etc. (hereafter called software units, or units), in more than 100 modules, and is likely to involve five or more programmers. Furthermore, the system is likely to be changing rapidly, with new major releases coming out every year, each with one quarter or more of the code different from the previous release. With rapid change comes architectural drift, as each change moves the structure of the system away from its original design. To compound his woes, the programmer may be responsible for working on several different system versions simultaneously, so he must remember how the interactions among components differ from one version to another.
The goal of the current research is to help rescue engineers from the nightmare of incomprehensible code by providing them with intelligent tools for analyzing the system, reorganizing it, documenting the new structure, and monitoring compliance with it, so that significant structural changes can be detected and evaluated early, before they become irreversible.

1.2. Why software is in modules

Recent developments in programming environments have raised questions about whether modules are as important as they once were. Cross-reference aids, "smart recompilation," and hypertext facilities, for example, treat procedures, macros, and other software units individually, practically ignoring traditional file and module boundaries. However, when programming is considered as a human cognitive activity, the importance of modules becomes clear. Reviewing this activity will also motivate the heuristic analysis and reorganization methods we are proposing.

• Modules are the building blocks of a software system's technical design. One of the goals of design is to select a set of conceptual entities that have relatively few interactions between them, so that the designers can reason about the system as a whole without much reference to the details inside individual modules.
• Modules are often used to assign technical responsibility. Each programmer on a large project becomes a specialist in certain parts of the system. Limiting the interactions between modules reduces the amount of communication needed between programmers.
• Good modularity can also limit the impact of program changes. A single conceptual change generally requires changes to several software units. For example, if every module in a system contains code that directly accesses a sorted list, changing it to a hash table will be extremely difficult. However, if other modules can access the list only by calling the "insert," "retrieve," and "remove" routines of the "SortedList" module, then only these three routines will need to be rewritten.
• Modules are the basic units of system integration and testing. Good planning depends heavily on having well-defined modules with limited dependencies on other modules, and on making sure that the dependencies do not change much between writing the plan and starting integration.

A recent study reveals that at least half of the cost of a software system occurs after the software is first delivered to the customer (cf. Chapin, 1988). The largest component of this cost is modifications to the software, including fixing bugs, adding new functions and services, and porting the software to new computers, new operating systems, and new user interface systems. Furthermore, the programmers who make these modifications spend most of their time, not in making the changes, but in understanding the code that is related to the changes.

In summary, the choice of modules for organizing a large software system affects understandability, division of labor, modifiability, integratability, and testability. Of these, understandability has the largest impact on the success of a software project. Therefore,
modules are important for their role as conceptually coherent chunks of software, and improving coherence through machine-assisted reorganization is an appropriate goal.

1.3. Overview of the article

The current research is intended to form the basis for a heuristic module architecture advisor, which recommends organizational changes that would improve the information-hiding quality of the modules in a software system. This article models modularization as a categorization activity requiring similarity judgment. Similarity between software units is computed by a function of their common and distinctive features, which is fitted to training data by a neural network. Categorization is accomplished by a nearest-neighbor classifier.

The model is examined by embedding it in a software classification tool and several interactive clustering tools, which make reclassification and clustering recommendations, respectively. The tools incorporate a learning component, which responds to rejected recommendations by using a neural network to adjust feature weights as necessary to make the classifier agree with the category assignments given explicitly by the user. The learner transforms the user's category assignments into more-similar-than judgments, "S is more similar to G than S is to B," selecting triples <S, G, B> such that a similarity function whose values minimize errors on those judgments also maximizes the classifier's accuracy on the given category assignments. The tool then learns an ordinal similarity function that optimally fits the more-similar-than judgments.

Learning is carried out through a special-purpose back-propagation neural network. The network directly compares the value of the similarity function computed on two pairs of inputs (<S, G> and <S, B>), and back-propagates error to increase similarity on the first pair while decreasing it on the second. The features of <S, G> and <S, B> are preprocessed and presented to the network as common and distinctive features. The similarity function computed by the network is constrained to compute a ratio of common to distinctive features, in keeping with accepted models of human similarity judgment.

The classifier-with-learner thus constructed compares favorably to more traditional category learning methods. It has also been installed in a module architecture advisor, and used successfully on several real software reorganization tasks. We conclude from these experiences that modeling software modularization as nearest-neighbor classification, with a similarity function based on accepted models of human similarity judgment, is a viable basis for the design of a module architecture advisor. The learning method used would be useful for a wide range of applications involving nearest-neighbor classifiers. The module architecture advisor illustrates a promising approach for designing "intelligent assistants" for expert tasks.
2. The information-hiding principle

One of the earliest and most influential writers on the subject of modularity is David L. Parnas. In 1971, he wrote of the information distribution aspects of software design (emphasis his),
The connections between modules are the assumptions which the modules make about each other. In most systems we find that these connections are much more extensive than the calling sequences and control block formats usually shown in system structure descriptions (Parnas, 1972).

The same year he formulated the information-hiding criterion, advocating that a module should be

...characterized by a design decision which it hides from all others. Its interface or definition [is] chosen to reveal as little as possible about its inner workings (Parnas, 1971).

According to Parnas, the design decisions to hide are those that are most likely to change later on. Good examples are

• data formats,
• user interface details,
• hardware (processor, peripheral devices), and
• operating system.
In practice, the information-hiding principle works in the following way. First, the designers identify the role or service that the module will provide to the rest of the system. At the same time, they identify the design decisions that will be hidden inside the module. For example, the module might provide an associative memory for use by higher-level modules and conceal whether the memory is unsorted or sorted, whether it is all in fast memory or partly on disk, and whether it uses assembly code to achieve extra-fast key hashing. The module description is then refined into a set of procedure and data types that other modules may use when interacting with the memory. For example, the memory module might provide operations to insert, retrieve, modify, and remove records. These four operations would need parameters specifying records and keys, and some way to determine when the memory is full. The module would declare and make public the data types "Key" and "Record," and the procedures "Insert," "Retrieve," "Modify," and "Remove."

Next, the associative memory module is implemented as a set of procedures, types, variables, and macros that together make, for example, a large in-core hash table. The implementation can involve additional procedures and types beyond the ones specified in the interface; only the units belonging to that module are permitted to use these "private" units. Thus, the information that the memory is implemented as a hash table is concealed from other modules. They cannot, for example, determine which order the records are stored in, because they cannot use the name of the table of records in their procedures. Later, if the implementor should decide to replace the hashing algorithm, or even to use a sorted tree, all the code that he would need to change would be in the associative memory module.

This example shows that many design decisions are represented by software unit declarations, such as

HashRecord array HashTable[TableSize]
which embodies the decision to store hash records in a fixed-size table rather than, say, a linked list or tree. In most cases, procedures that depend on the design decision will use the name of the corresponding software unit, such as

procedure Retrieve(KeyWanted: Key)
    Index = Hash(KeyWanted)
    if HashTable[Index].Key equals KeyWanted
        return HashTable[Index].Record
    else
        return FAILURE
This correspondence implies that

If two units use several of the same unit-names, they are likely to be sharing significant design information, and are good candidates for placing in the same module.

A unique aspect of our research is that we measure design coupling, rather than data or control coupling. A simple example will illustrate the difference. The diagram below illustrates four procedures (A, B, C, and D) and table T. Procedure A calls procedure B to write information into table T. Procedure D reads information from the table. Procedure C also writes information into table T. Procedures A and B have a control link between them, because A calls B. Procedures B and D have a data link between them, because data pass from B to D through the table. Likewise, A and B are data-linked through parameters, and C and D are data-linked through T. However, B and C are not data-linked, because both of them put data into T, but neither one takes data out. Finally, B, C, and D have a design link among them, because all three share assumptions about the format and interpretation of table T. If one of the procedures ever needs to be rewritten in a way that affects the table T, the other two should be examined to see if they require analogous changes.
[Diagram: procedure A calls B; B and C both write into table T; D reads from T.]
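Returning to the associative-memory example above, the idea can be made concrete in code. The sketch below is our illustration, not code from the paper; the dictionary standing in for the hash table is a hidden design decision that callers cannot observe.

# A minimal sketch (ours) of an information-hiding module in Python.
class AssociativeMemory:
    """Public operations: insert, retrieve, modify, remove.
    Whether storage is a hash table or a tree is concealed."""

    FAILURE = object()           # echoes the FAILURE result in the pseudocode

    def __init__(self):
        self._table = {}         # hidden design decision: a hash table

    def insert(self, key, record):
        self._table[key] = record

    def retrieve(self, key):
        return self._table.get(key, self.FAILURE)

    def modify(self, key, record):
        if key in self._table:
            self._table[key] = record

    def remove(self, key):
        self._table.pop(key, None)

Swapping the dictionary for a sorted structure would change only this class and leave every caller untouched, which is exactly the property Parnas advocates.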
Before Parnas's work, it was commonplace to group units into modules based on control links, leaving large numbers of design dependencies between modules. Nowadays, programmers generally agree that it is more important to group together procedures that share data and type information than to group procedures that call one another. It would be nice if the clear, simple concepts contained in a system's original design could be directly mapped into an appropriate set of implementation modules, and the mapping preserved throughout the system's lifetime. However, the implementation process always uncovers technical problems that were not apparent during the early design process, leading to changes in the design. Furthermore, design decisions are almost never so clearly separable that they can be neatly divided into subsystems and sub-subsystems. Each decision interlocks
with other decisions, so that inevitably there are some design decisions that cannot be concealed within modules, even though they are likely to change. Conversely, a module may span several loosely related decisions. In addition, there are often managerial and other non-technical influences on how a system is modularized. In the final analysis, good modularity is highly subjective.

3. A model for human software classification

We observe that programmers modularize software in much the same way that humans generally classify objects. Specifically, modules are used analogously to categories. The software units contained in a module are instances, or exemplars, of the category. The unit names appearing in an instance are its boolean-valued features. Two units can be compared by looking at their shared and distinctive features. Programmers often decide whether two units belong in the same module by such comparisons. For example, when writing a new procedure, a programmer will normally place it in the same module as other procedures that use some of the same data types and data structures in the same way. (Unlike some domains, such as thyroid assay interpretation (cf. Horn et al., 1985), we make a sharp distinction between shared "true" features and shared "false" features.)

Modules must be described by exemplars because there are many cases in which a well-designed, useful module contains two units (instances) that do not share any features. Therefore, there can be no necessary-and-sufficient feature list to describe the category. Feature diversity is intrinsic in the problem, because the information-hiding principle implies that there should be very few widely used unit names, and therefore very few features that are common to large numbers of instances. Nonetheless, a module is often designed to surround several related data structures and other private software units. Some procedures access only one or two of these data structures, while others may access all of them. Consequently, many modules contain no single "typical" member, although in some cases two or three procedures together represent the principal types of procedures contained in the category.

These observations are consistent with the literature of human classification. Humans classify things according to a few simple heuristics. First, people tend not to behave as if categories are defined by necessary and sufficient conditions; rather, they treat them as probabilistic (cf. Smith & Medin, 1981), or, more generally, as if they possess a feature "polymorphy" (cf. Hanson & Bauer, 1989). Much of the natural world promotes this view: cups, chairs, birds and so forth are labeled as such because they possess smaller feature variance within each category than between categories. Cups are cups because they possess more "cupness" than, say, "bowlness." Consequently, categorization of like objects arises partly as a contrast between clusters of objects. Another important heuristic about human classification is that not all exemplars within a category have equal status—categories are not equivalence classes. Some members of a category are better representatives of the whole category; some are more typical or more central to the "definition" of the category (Posner & Keele, 1968; Homa, 1978). And finally, humans tend to use multiple strategies when classifying, depending on the frequency of the candidate and its closeness to the most typical case.
Categories can be extended either by comparisons to an aggregate pattern, prototype, or "average" or by nearest match to an exemplar (Medin & Schaffer, 1978; Homa, 1978).
To turn this qualitative model of modularization into an operational one, we require

• a way to describe software units as sets of features,
• a way to measure similarity in terms of those features, and
• a classification rule based on that similarity measure.

3.1. Software implementation features

The information-hiding principle led us to the observation that the names used in a program unit are good clues about the design assumptions on which its implementation depends, and therefore are good indicators of which module it belongs in. For similar reasons not elaborated here, the names of other units in which a unit is used are also good clues about where it belongs, although the correlation is not as strong. Therefore, the names a unit uses, and the names of places where a unit is used, are appropriate features representing the design characteristics of that unit.

This information can be extracted from the code itself. First, a conventional cross-reference extractor analyzes the code to produce a relation, NamesUsed ⊆ Unit × Unit, where (x, y) ∈ NamesUsed if and only if the name of unit y is used in unit x. Then, UserNames = {(y, x̄)} such that (x, y) ∈ NamesUsed, and HasFeatures = NamesUsed ∪ UserNames. Notice that UserNames introduces the notation x̄, denoting a synthetic name derived from x. This represents the difference between a name used in a unit and the name of a place where the unit is used. The distinction is made so that, when HasFeatures is computed, (y, x) and (y, x̄) are distinct tuples.

Experience has shown the importance of obtaining cross-reference information that is as fine-grained as possible, e.g., the individual field names of structures and the individual literals of enumeration types. Such details are what distinguish code that implements an abstract data type from code that merely uses it. Although cross-reference analysis generates a rich set of implementation features, some important design decisions do not correspond to any particular identifier. Therefore, the relation HasFeatures may be expanded with tuples supplied by the human architect.
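The two relations translate directly into code. The following sketch is our illustration (names such as names_used are ours): it derives HasFeatures from a cross-reference relation, tagging "used-by" features so that (y, x) and (y, x̄) remain distinct.

# Sketch (ours) of the feature-extraction step.
from collections import defaultdict

def has_features(names_used):
    """Union of NamesUsed and UserNames, per unit."""
    feats = defaultdict(set)
    for x, y in names_used:
        feats[x].add(("uses", y))       # the name of y occurs in x
        feats[y].add(("used-by", x))    # the synthetic name x-bar
    return feats

# Hypothetical cross-references: (x, y) means unit x uses the name of y.
xref = [("Retrieve", "HashTable"), ("Insert", "HashTable"),
        ("Retrieve", "Hash"), ("Insert", "Hash")]
features = has_features(xref)
# features["HashTable"] == {("used-by", "Retrieve"), ("used-by", "Insert")}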
3.2. Measuring similarity

In order to design an appropriate similarity function, we first describe several important properties that the function should satisfy, and then introduce a function that satisfies them.

3.2.1. Matching and monotonicity

When a programmer judges the similarity of two procedures, she looks both at the features they have in common and the features that are distinctive to one or the other procedure. Adding a common feature increases similarity; adding a distinctive feature decreases it. She also judges the relative importance of different features. A feature representing a localized, volatile design decision deserves a greater weight than a feature representing a widely used, stable design decision. One type of feature she does not look at is one that is absent from both procedures. Identifiers that do not occur in either procedure have no impact on their similarity; they are simply irrelevant.

These properties of software similarity judgment correspond to general models of human similarity judgment, such as proposed by Tversky (1977). Tversky's model treats object descriptions as sets of features, and similarity functions as functions of common and distinctive features, defined as follows. Let A, B, C, ... be objects described by sets of features a, b, c, ..., respectively. When comparing two objects, the following computed feature sets are significant:

a ∩ b — the set of features that are common to A and B.
a − b, b − a — the sets of features that are distinctive to A or B, respectively.

The matching property restricts similarity functions to those that are functions of the common and distinctive features, and that are independent of the features that neither object has. (The property is apparently so named because it matches up features in the two sets.) A similarity function, SIM, has the matching property if there exists a function F such that

SIM(X, Y) = F(x ∩ y, x − y, y − x)

The monotonicity property embodies the idea that similarity should increase in proportion to common features and decrease in proportion to distinctive features. A similarity function, SIM, has the monotonicity property if SIM(A, B) ≥ SIM(A, C) whenever

a ∩ b ⊇ a ∩ c
a − c ⊇ a − b
c − a ⊇ c − b

and, furthermore, the inequality is strict whenever at least one of the set inclusions is proper.

3.2.2. Other desirable properties

The software domain suggests some additional characteristics that a similarity function should have:

• No maximum value. Two identical procedures with many features should be more similar than two identical procedures with few features.
• A minimum value, obtained when there are no common features. Two unrelated procedures should be just as dissimilar as two other unrelated procedures.
• The function should be defined when there are no common and no distinctive features. This surprising requirement arises because real-world software sometimes contains "stub" procedures that have no bodies and no callers, and hence no features.

3.2.3. A ratio model of similarity

We have designed a matching, monotonic similarity function with the above properties, called Ratio, similar to one proposed by Tversky, of the same name. We define

Ratio(s, n) = Weight(s ∩ n) / (1 + c · Weight(s ∩ n) + d · (Weight(s − n) + Weight(n − s)))

This function satisfies the matching and monotonicity properties as long as c and d are positive, and Weight increases monotonically with the set membership of its argument. This requirement is satisfied by defining

Weight(X) = Σ_{x ∈ X} w_x, where w_x > 0.

Weight computes the combined significance of a set of features. Although Tversky defined the function to be linear, we admit the possibility that it might be non-linear, representing correlations among features.

Ratio satisfies the requirements of software similarity measurement described above. It is also symmetric, which is not necessary to model human similarity judgment, but makes it possible to use the function in standard clustering algorithms.

There still remains the problem of how to assign weights to the features, and values to c and d. Giving all features the same weight causes high-frequency features to dominate clustering performance at the expense of rare features. Intuitively, feature weight should vary inversely with its frequency, since rarely occurring features have a better chance of being encapsulated within a module. Therefore, we have been estimating the significance of a feature by its Shannon information content, w_f = −log P(f), where P(f) is the probability that a unit in the system being studied has feature f. In a later section we will describe how to learn better estimates of these weights.
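Read as code, the Ratio function and the information-content weighting look roughly as follows. This is our sketch, with illustrative defaults for c and d; units maps unit names to their feature sets.

import math

def feature_weights(units):
    """w_f = -log2 P(f), where P(f) is the fraction of units having f."""
    n = len(units)
    counts = {}
    for feats in units.values():
        for f in feats:
            counts[f] = counts.get(f, 0) + 1
    return {f: -math.log2(k / n) for f, k in counts.items()}

def ratio_similarity(s, n, w, c=1.0, d=1.0):
    """Tversky-style ratio of common to distinctive feature weight."""
    weight = lambda fs: sum(w.get(f, 0.0) for f in fs)
    common = weight(s & n)
    distinctive = weight(s - n) + weight(n - s)
    return common / (1.0 + c * common + d * distinctive)

Note that the function is defined even for two featureless "stub" units, for which it returns the minimum value, zero.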
3.3. A nearest-neighbor classification rule

Because programmers assign procedures to modules by finding a small group of highly similar procedures, modularization can be modeled by a nearest-neighbor classification rule. To describe the rule, we use the following definitions:

Subject. A unit being considered for inclusion in a category.
Good Neighbor. A unit belonging to the category for which the subject is being considered.
Bad Neighbor. A unit belonging to any category other than the one for which the subject is being considered.
k. A parameter of the classification rule, denoting the minimum group size to which a subject might be added.

Since more than one module may be a good candidate to receive a given subject, the classification rule below incorporates a confidence measure for each of the possible categories:

Confidence. Subject S fits in category X with confidence C if and only if assigning S to X would result in S having exactly C bad neighbors more similar to it than its k-th good neighbor.

Note that C is zero when S's k nearest neighbors are all members of category X. Greater values of C imply that the immediate neighborhood of S is "polluted" with units from other categories. With the confidence measure so defined, the classification rule is straightforward:

Classification Rule. A software unit belongs in the category for which its confidence rating is best (closest to zero).

Confidence ratings also provide a sensitive way to measure how well a classifier implementing this rule conforms to a given set of classification data. If performance were measured only in terms of classification errors, the classifier might compute the correct categories, but with marginal confidence. Therefore, it is useful to measure a classifier's performance in relation to its confidence that the subjects are correctly classified as labeled in the data. However, such a measure would still be sensitive to the cluster size parameter, k. Experience has shown that setting k equal to 1 causes problems when some of the data are mislabeled. If just one unit is mislabeled, any unit for which it is the nearest neighbor will appear to be misclassified. Similarly, any particular value for k may be inappropriate for a specific unit, because highly cohesive software clusters occur in many sizes. Therefore, performance is actually measured over a range of values of k, as follows:

Classifier Performance Measure. A nearest-neighbor classifier conforms to a data set D with rating R,

R = Σ_{(S,C) ∈ D} Σ_{k=1..K} Confidence(S, C, k)

where (S, C) denotes unit S assigned to cluster C. Typical values of K range from 2 to 5.
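A direct reading of the confidence measure and classification rule, sketched by us in Python; assign maps each unit to its current category, and sim is any unit-level similarity function such as the Ratio sketch above, lifted to unit names.

def confidence(subject, category, assign, sim, k=2):
    """Bad neighbors more similar to subject than its k-th good neighbor."""
    others = [u for u in assign if u != subject]
    good = sorted((u for u in others if assign[u] == category),
                  key=lambda u: sim(subject, u), reverse=True)
    if len(good) < k:
        return float("inf")        # too few good neighbors to judge
    kth_sim = sim(subject, good[k - 1])
    return sum(1 for u in others
               if assign[u] != category and sim(subject, u) > kth_sim)

def classify(subject, assign, sim, k=2):
    """Pick the category with the best (lowest) confidence rating."""
    cats = set(assign.values())
    return min(cats, key=lambda c: confidence(subject, c, assign, sim, k))

The performance measure R is then just the sum of confidence(S, C, k) over the labeled data for k from 1 to K.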
4. Learning architectural judgment

The classification rule described in the previous section has been installed in a module architecture advisor and used profitably to help reorganize real software systems (including
the tool itself). These experiences will be discussed in section 5. However, the classifier's accuracy has been limited by the arbitrary way that values are assigned to the constants and feature weights in the similarity function. "Hand tuning" those values has been tedious and unenlightening, although doing so does improve performance. The goal of the present research is to replace hand tuning with an automatic tuning process, in which the tool learns from its mistakes. To achieve this goal, the learning task is divided into the following steps:

1. Advise the architect until the architect disagrees with the advice (indicating that learning is needed).
2. Identify a subset of the units that the architect believes are correctly classified, for use as a training set.
3. Construct a set of more-similar-than judgments that, if correctly modeled by the similarity function, would produce perfect classifier performance and confidence on the training data.
4. Train the similarity function, by back propagation, to maximally fit the more-similar-than judgments.
5. Ask the architect for additional features to explain any category assignments that the classifier has not learned.

The rest of this section assumes that the advisor can extract a training set from the current tool state. It describes how more-similar-than judgments are constructed, and describes the back propagation network used to train the similarity function. Training data selection and feature acquisition from the architect are deferred to section 5.

4.1. Constructing more-similar-than judgments

From the definition of the classifier performance measure, one can see that only similarity judgments between a subject and its near neighbors are relevant to classifier performance. In particular, Confidence(S, C, k) is proportional to the number of cases in which S is more similar to one of its bad neighbors than to its k-th nearest good neighbor. Aggregating over all values of S and k, classifier performance is equal to the number of cases, (S, G, B), for which a subject, S, is more similar to a bad neighbor, B, than to one of its K nearest good neighbors, G. Therefore, only a subject's K nearest good neighbors are relevant to it. The relevant bad neighbors are those more similar to the subject than its K-th nearest good neighbor. Therefore, the more-similar-than judgments that should be learned are all possible combinations of a subject and its relevant good and bad neighbors. Optimizing these judgments optimizes classifier performance according to the specified measure.

The optimization process brings in one additional problem: initialization. Since the goal is to learn a similarity function, and training data are selected using that same function, one must assume that, initially, the estimated similarity function is a poor predictor of actual similarity. Therefore, limiting consideration to the "K-th good neighbor"-hood might arbitrarily screen out units that are actually highly similar to the subject.
This problem is solved in the present research by setting K very large at the beginning of training, while the similarity function is poorly estimated, and gradually decreasing it as learning progresses. Initially, all weights and constants are drawn from random distributions, and K is set greater than the size of the largest module. After each training epoch, K is reduced, by exponential decay, toward a predetermined asymptotic value and the triples (S, G, B) are reconstructed.

4.2. Backpropagating similarity judgment errors

Similarity judgment is learned by computing more-similar-than judgments in a feed-forward neural network, and backpropagating errors. However, rather than being a general hidden-layer network, the network is designed to mirror the model of similarity judgment discussed above. It is described here mathematically first, and then as a network.

4.2.1. Inputs

For each triple (S, G, B), the inputs to the network are the corresponding feature sets s, g, and b.

4.2.2. Error function

The network computes the sigmoid function of the difference of two similarities:

Error(s, g, b) = (σ(Sim(s, g) − Sim(s, b)) − threshold)²

where the threshold is typically 0.95, and

σ(x) = 1 / (1 + e^(−x))

The value of Error will be near zero whenever S is much more similar to G than to B, and near threshold² when the opposite is true.

4.2.3. Similarity function

The similarity function is defined as described in section 3.2.3:

Sim(s, n) = Weight(s ∩ n) / (1 + c · Weight(s ∩ n) + d · (Weight(s − n) + Weight(n − s)))

where Weight(X) = Σ_{x ∈ X} w_x, and w_x > 0.
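Stripped of the network machinery, one training epoch amounts to gradient descent on the comparison error. The sketch below is our plain-Python approximation, not the authors' implementation: triples is the list of (s, g, b) feature-set triples built as in section 4.1, w maps every feature to a positive weight, and c and d are held fixed for brevity (the paper trains them too). It includes the positive-weight clamping discussed in section 4.2.5 below.

import math

def sim_and_grads(s, n, w, c, d):
    """Ratio similarity plus dSim/dw_f for the features present."""
    C = sum(w[f] for f in s & n)
    D = sum(w[f] for f in (s - n) | (n - s))
    denom = 1.0 + c * C + d * D
    sim = C / denom
    dC = (1.0 + d * D) / denom ** 2      # dSim/dC
    dD = -d * C / denom ** 2             # dSim/dD
    grads = {f: dC for f in s & n}
    grads.update({f: dD for f in (s - n) | (n - s)})
    return sim, grads                    # absent features get no gradient

def train_epoch(triples, w, c=1.0, d=1.0, lr=0.1, thr=0.95, eps=1e-4):
    for s, g, b in triples:              # "S is more similar to G than to B"
        sim_g, grad_g = sim_and_grads(s, g, w, c, d)
        sim_b, grad_b = sim_and_grads(s, b, w, c, d)
        sig = 1.0 / (1.0 + math.exp(-(sim_g - sim_b)))
        dloss = 2.0 * (sig - thr) * sig * (1.0 - sig)
        for f in set(grad_g) | set(grad_b):
            w[f] -= lr * dloss * (grad_g.get(f, 0.0) - grad_b.get(f, 0.0))
            w[f] = max(w[f], eps)        # keep weights positive -> monotonic

Because both similarities are computed with the same weights, a wrongly ordered triple pushes up the weights of features shared by S and G while pushing down those shared by S and B, which is the comparison paradigm in miniature.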
4.2.4. Implementation as a network

Figure 1 shows the topology of the network. The Focuser uses the current estimate of the similarity function to select a set of input triples for one training epoch. Each triple is presented to the network as a triple of feature vectors. A preprocessing stage computes the needed sets of common and distinctive features. The next stage computes Weight for all six sets, using the same feature weights each time. These six Weights are fed into two identical sub-networks, which compute Ratio on (s, g) and (s, b), using the same values for c and d each time. Finally, the comparison stage of the network computes the sigmoid function of the difference of the two similarities. Back propagation is carried out with a simple delta rule.

4.2.5. Novel aspects of the network

The most significant aspect of the network design is that it learns a similarity function by comparing the value computed by the function on two related pairs of inputs.
Figure 1. Comparison network for learning relative similarity judgments.
This works because the absolute value computed by the function is unimportant; only the relative order of the values it computes matters. Therefore, instead of training the network to compute a specified value, it must compute two values with a specified relative order. The implementation method was suggested by Tesauro, who had invented it to teach his Neurogammon system (Tesauro & Sejnowski, 1989) an ordinal move evaluation function by training it on pairs of alternative moves from the same board position and dice roll, one of which was known to be an expert's choice. He calls the approach the comparison paradigm.

To implement the comparison paradigm, the network uses the same set of link weights for both computations of Sim. When error is propagated back, each weight receives error assignments for its roles in both computations. The order of the inputs does not need to be randomized, and no training signal is needed, because the symmetric network design prevents the weights from learning a bias toward the "left" or "right" neighbor. The "training signal" is always the same: S should be more similar to G than to B.

By implementing the ratio formula directly, the network restricts the class of similarity functions to a subset of the monotonic, matching functions. This restriction was motivated by the problem domain and has proven to be acceptable in practice. However, further research could examine other monotonic, matching functions. The network also specifies a linear Weight function. Nonlinear, but monotonic, functions could be substituted, if necessary to model feature correlations, but experience has shown the linear function to be satisfactory.

Two other implementation details are worth mentioning. The weights are bounded greater than zero, so that the similarity function remains monotonic. When the weights are updated during backpropagation, any weights that drop below a small positive threshold are reset to that threshold. The other detail is that the common and distinctive feature sets are represented as feature vectors, with 1's for set members and 0's for non-members. Zero has the special property that back propagation will assign no portion of the error to a link from a zero input, meaning that error can only be assigned to weights for features that were present in at least one of the input units. This reinforces our working hypothesis that similarity is unrelated to absent features.
4.3. Learning and generalization performance

To obtain a preliminary estimate of the tool's capabilities, we applied it to a small but realistic data set. The sample data come from a real software system, which is actually an early version of our batch-clustering tool. It comprises 64 procedures, grouped into seven modules. Membership in the modules is distributed as described in table 1. The software is written in the C programming language.

To create the sample data set, we applied a cross-reference analysis tool to it, collecting every occurrence of a non-local identifier, including procedures, variables, macros, typedefs, and the individual field names of structured data types. Each such name was given a unique identification number, so that there would be no confusion when, for example, two different record types have a field with the same name.
Table 1. Module sizes.

Module      Members
attr        10
hac         12
massage     5
node        7
objects     4
outputmgt   12
simwgts     14
Each distinct non-local name occurring within a procedure was then recorded as a feature of that procedure. In those cases where one procedure called another, two features were recorded: the callee's name became a feature of the caller, and the caller's name became a feature of the callee, but marked to distinguish "called by X" from "calls X." This process produced 152 distinct feature names. However, many of these features occurred in only one procedure each, and were therefore greatly increasing the size of the problem without contributing to the similarity of two procedures. Therefore, we eliminated all such singly occurring features, leaving 95.

We expected the given data to contain classification errors, because the software had not been carefully modularized. However, we wanted to measure generalization performance on "clean" data, so that generalization errors would not be confused with training data errors. Therefore, we established two criteria for eliminating units from the data set, both of which had to be met before the unit was eliminated:

1. The classifier must fail to classify the unit correctly, even after learning.
2. The feature data must show evidence that the learning failure was due to a modularization error.

Twelve procedures met both criteria and were removed, leaving 52. The network was able to learn to classify all 52 correctly.

To test the network's generalization ability, we conducted a jackknife test, in which the 52 procedures were divided into a training set and a test set, to determine how often the similarity measure, used in a nearest-neighbor classifier, could correctly classify procedures that were not in the training data. The test consisted of 13 experiments, each using 48 procedures for training and 4 for testing, such that each procedure was used for testing exactly once. Each experiment consisted of training on triples constructed from the training set, and then using the learned similarity function to identify the nearest neighbor of each procedure in the test set. K had a final value of 3. The test procedures were classified with k equal to 1. The results of the jackknife test are shown in table 2. Each row gives the number of procedures that were in that module and how many of them were classified into each module during the jackknife test. Only one unit out of 52 was misclassified.
Table 2. Classifier performance on unseen units.

                            Classified as
Actual Module   No.   outputmgt  simwgts  attr  hac  node  massage  objects
outputmgt       11       11
simwgts         11                  10      1
attr             9                          9
hac              8                                8
node             7                                      7
massage          4                                             4
objects          2                                                      2
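The jackknife protocol just described is easy to restate in code. This sketch is ours; train and nearest_neighbor stand for the triple-based training step and the 1-nearest-neighbor classifier, respectively.

def jackknife(units, labels, train, nearest_neighbor, fold_size=4):
    """Hold out fold_size units at a time; classify each exactly once."""
    names = sorted(units)
    errors = 0
    for i in range(0, len(names), fold_size):
        held_out = names[i:i + fold_size]
        training = [u for u in names if u not in held_out]
        sim = train(training, labels)      # learn feature weights
        for u in held_out:
            guess = nearest_neighbor(u, training, labels, sim)  # k = 1
            errors += guess != labels[u]
    return errors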
From these results we conclude that the given data, the classification and similarity models, and the learning method were very well matched to one another. Naturally, it could be that the data were highly redundant, so that a jackknife test that removed only four procedures was not really withholding any knowledge. Or it could be that the process of removing errors from the data biased the results. The experiment was simply too small to tell.

4.4. Comparisons to simpler classifiers

To study the properties of software categories, we first tried clustering them in feature space. Similarity in feature space is measured by distance metrics, with small distance corresponding to great similarity. Consequently, units possessing common features will be more similar than those that do not. Furthermore, those features that appear in one category but not in the other will augment similarity of units within a group. Typical kinds of similarity measures include Euclidean, Hamming, and inner product (cosine). Such measures used with agglomerative hierarchical clustering algorithms can only produce cluster groups that were originally at least linearly separable, i.e., clusters that could be separated by a line in 2-space or a hyperplane in four or higher dimensions.

We studied the category properties in feature space by attempting to cluster the example data set in feature space, using a hierarchical, agglomerative clustering algorithm under several different similarity measures. Shown in figure 2 is a Euclidean, centroid clustering of the software units by their cross-references (see above). At the end of each leaf is a label indicating which module of a possible seven the unit is defined in. Note that separation of the category modules does not result; less than half of the group members are assigned to their proper modules. Other measures fare no better. Figure 3 and figure 4 show cluster diagrams for other similarity measures.

4.5. Comparison to neural network classifier

Neural networks represent a class of function approximating methods that can create similarity as a function of the data to which they are exposed. Within any network is a
Figure 2. Euclidean, centroid clustering.
Figure 3. Hamming, centroid clustering.
Figure 4. Hierarchical clustering with cosine correlation.
basis set of functions (see Hanson & Burr, 1990) that allow arbitrary similarity functions to result as the network learns to correctly label data points. Such supervision is also critical for category problems in which the initial feature space is required by the category labels to transform nonlinearly. Consequently, although the similarity measure may cause units with shared and distinctive features to be closer in similarity space, category labels as determined by membership in modules may require transformation of similarity by moving units that are initially far apart closer together in similarity space.

Nonetheless, having an effective similarity measure and supervision from labels still involves a complex induction problem. Methods like neural nets share with other learning approaches the problem of generalizing correctly from a limited sample of example cases. In theoretical learning work, this is currently an intense area of research with many unanswered questions (cf. Sompolinsky & Tishby, 1990; Baum & Haussler, 1988; Rivest, Haussler, & Warmuth, 1989). Consequently, it is possible to learn perfectly all examples from the domain and still incorrectly classify new examples as they appear.

We found that this phenomenon was indeed present in our sample data. We used a simple feed-forward, back-propagation network with one hidden layer of sigmoidal activation units, and trained it to classify the given objects. We found that, with just four hidden units, this network was able to classify the training data perfectly. However, this perfect learning prevented it from classifying any novel units correctly. These results are shown in figure 5.

Figure 5. Training and transfer performance by number of hidden units.

We believe that one of the reasons neural network generalization performance is so poor is that the categories are sparsely populated. There are generally more relevant features than units to be classified. Each unit has typically only seven features, which it shares
with only a handful of other units. The majority of instances have no exact duplicates in the training set. Therefore, the network is not able to make generalizations about salient features, but can only memorize the category memberships.

5. Advising architects

Since preliminary experiments on small data sets showed the classification and learning methods to be promising, we proceeded to incorporate them in a heuristic architecture advisor (Schwanke & Platoff, 1989) and to try them out on real reorganization tasks. This section describes

• the working styles supported by the tool,
• how the tool uses the classification model to give advice,
• a case study showing that the advice is useful, and
• a case study showing how the quality of the advice improves with learning.
5.1. Working styles

The tool supports three different (although overlapping) styles of work:
• Incremental change: the software is already organized into high-quality modules. The engineer wishes to identify individual weak points in the architecture, and repair them by making small changes.
• Moderate reorganization: although the software is already organized into modules, their quality is suspect. The engineer wishes to reorganize the code into new modules, but with an eye to preserving whatever is still good from the old modularity.
• Radical reorganization: either the software has never been modularized, or the existing modules are useless. The engineer wishes to organize the software without reference to any previous organization.

All these styles can be applied to a whole system or to an individual subsystem or module. The tool supports these activities with two kinds of service: clustering and maverick analysis.

5.2. Clustering

This service organizes software units into a tree of categories. It is actually a group of clustering services, each of which interacts with the user in a different way:

Batch clustering supports radical reorganization. It uses a hierarchical, agglomerative clustering (HAC) algorithm to form a category tree. The given similarity measure is used to derive a group similarity measure. The algorithm starts by placing each unit in a group by itself. It then repeatedly combines the two most similar groups. Some variations of it heuristically eliminate useless interior nodes in the category tree, so that the tree has varying arity appropriate to the data.

Incremental clustering supports incremental to moderate reorganization. It is based on the same HAC algorithm, but allows the user to apply it to any node of his category tree at any time, and to review each clustering action before it is carried out. The user selects a node in the category tree, and asks for either the nearest sibling or the two most similar children of that node. Based on the answer, he may decide to combine the indicated groups, or to make some other change in the organization. Thus clustering is carried out manually, but with advice whenever requested.

Interactive reclustering supports moderate reorganization. It starts with a given set of original categories, but tries to build a fresh classification tree out of the units in them. It uses the original category labels to decide which clustering steps should be reviewed by the user, and which can be carried out automatically, without review. It uses the hierarchical, agglomerative clustering algorithm, but before combining two groups, it checks to see whether all members of both groups were in the same original category. If so, it combines the two groups automatically. If not, it pauses and asks the user whether to combine them. To help the user decide, it presents several other relevant groups. For each of the two groups that it is recommending combining, it also presents the second-nearest neighbor as well as the nearest neighbor whose members all belonged to the same group originally. The user can then choose to combine the recommended pair, to combine some other subset of the presented groups, or to make any other organizational change he wishes. The interactive clustering algorithm then resumes its work using whatever clusters the user has formed.
Neighborhood clustering can be used as a prelude to incremental clustering. Given a set of units, this service forms the smallest clusters for which each unit is in the same cluster as its k nearest neighbors.

The batch-clustering algorithm appears not to be useful, because a mistake early in the clustering process often makes all subsequent clustering decisions wrong. Also, an architect is not likely to accept a new set of categories that is radically different from the set of modules with which he started. The interactive clustering methods are much more useful, because in each of them the architect has opportunities to override the tool's recommendations, and the tool then continues its work based on the architect's decisions. Also, the interactive tools present the architect with several good alternatives, rather than just one best choice, thus giving the architect powerful guidance without preempting his options. The neighborhood clustering algorithm is quite powerful: even specifying a neighborhood size of 1 often creates clusters with an average of four members. This means that three quarters of the clustering decisions (each decision combines two clusters) have already been made, leaving only one quarter to be made by other methods. However, all the reclustering methods suffer the same weakness: they cannot yet learn from their mistakes.
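For reference, the hierarchical, agglomerative core shared by these services can be sketched as follows. This is our illustration; the group similarity shown (average pairwise similarity) is one possible way to lift the unit similarity to groups, and construction of the full category tree is elided.

def hac(units, sim, target=1):
    """Repeatedly merge the two most similar groups (agglomerative)."""
    def group_sim(g1, g2):
        # illustrative choice: average pairwise similarity
        return sum(sim(a, b) for a in g1 for b in g2) / (len(g1) * len(g2))
    groups = [frozenset([u]) for u in units]
    while len(groups) > target:
        i, j = max(((i, j) for i in range(len(groups))
                    for j in range(i + 1, len(groups))),
                   key=lambda ij: group_sim(groups[ij[0]], groups[ij[1]]))
        merged = groups[i] | groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append(merged)
    return groups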
5.3. Maverick analysis

This service heuristically identifies software units that appear to violate the information-hiding principle. A unit is deemed to be in violation if it appears to share more implementation characteristics with units in other modules than with units in its own module. This violation is detected by using the classifier to compute the "correct" module assignment. If the computed assignment is different from the module in which it resides, the unit is listed as a maverick. (Like a stray calf on the western plains, the unit must be returned to the proper herd.) Such a misplaced unit usually indicates a conceptual weakness in the architecture. The correct repair is usually not to simply move the unit. More often, it indicates that one or more units are mixing design decisions from different modules, and that the units should be rewritten.

The maverick analyzer evaluates the category assignment of each unit. It uses the classifier both to identify the best category for the unit and to give confidence ratings for both the present and recommended category assignments. A unit is considered a maverick if it fits into another category with a better confidence rating than its rating for the category to which it is currently assigned. Because the heuristic nature of the analysis leads to a substantial number of false positives, the mavericks are presented to the architect "worst first." They are sorted in order of confidence in the recommended category (strongest first), and, among mavericks with equal reclassification confidence, in order of confidence in the current classification (weakest first).

The maverick analyzer has been used to review the organization of a modest-sized industrial-strength software system (Lange and Schwanke, 1991). The system studied contained 300 procedures, organized into 27 modules, and 900 distinct cross-reference feature names. At the time of the analysis, the programmer maintaining the code was already planning to clean up its structure, but had not yet done so.
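In code, maverick detection reduces to comparing confidence ratings. The sketch below is ours, reusing the confidence function from the nearest-neighbor sketch in section 3.3; recall that lower confidence values are better.

def mavericks(assign, sim, k=2):
    """Units that fit some other module better, sorted worst-first."""
    cats = set(assign.values())
    found = []
    for unit, current in assign.items():
        best = min(cats, key=lambda c: confidence(unit, c, assign, sim, k))
        if (best != current and
                confidence(unit, best, assign, sim, k)
                < confidence(unit, current, assign, sim, k)):
            found.append((unit, best))
    # worst first: strongest confidence in the recommended category,
    # then weakest confidence in the current one
    found.sort(key=lambda uc: (confidence(uc[0], uc[1], assign, sim, k),
                               -confidence(uc[0], assign[uc[0]], assign, sim, k)))
    return found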
Maverick analysis yielded 51 procedures that were apparently misclassified. An analyst unfamiliar with the code used the maverick list to uncover 24 specific ways in which the code modularity could be improved. Recommended improvements included the following:

• Move a procedure
• Repartition a set of modules
• Add methods to an abstract data type, and use them instead of accessing the representation
• Introduce an interface layer to separate low-level from high-level functionality
• Split a procedure that is straddling two modules
• Replace an erroneous variable reference with the correct one
• Remove dead code
The original maintainer reviewed each of these 24 recommendations and responded in one of four ways:

• Correct (5 cases)
• Helpful (6 cases): the problem was correctly identified, but a more appropriate repair was found
• Redundant (7 cases): the identified problem was due to one of the previous 11 cases
• Incorrect (6 cases)

To assess the potential benefit of learning from mistakes, the analyst then hand tuned the maverick analyzer to improve its performance, by adding five user-defined features, adding four syntactic features that were not derived from cross-references, changing eight feature weights, and marking two procedures as unquestionably belonging to the modules in which they resided. This reduced the maverick list to 23 procedures, including the 11 correct and helpful cases.

These results are promising, but not completely satisfactory. Although maverick analysis provided real benefit to a real project, 80% of the mavericks were useless or redundant. Although hand tuning reduced this number to 50%, the tuning process was difficult and unenlightening. Therefore, it seemed worthwhile to try an automatic tuning process, which, in combination with focused knowledge acquisition from the user, might produce a more useful tool.
5.4. Maverick analysis with learning

Although presenting the mavericks worst-first mitigates the problem of false positives, the architect must eventually review all the mavericks on the list or worry that he has overlooked a problem. What is worse, if he makes changes to the code because of the real mavericks he finds, he may have to re-review the whole maverick list to see if any of the old false positives have become real mavericks. Adding a learning capability to the tool could overcome these problems in two ways:
1. Translating the human's classification decisions into similarity judgments will make them applicable to other potential mavericks after the code has been changed or reorganized. Instead of re-reviewing all mavericks, the human would only have to review those that were not screened out by his previous judgments.
2. By using the architect's judgments as "relevance feedback" during an analysis session, the tool could reorder the maverick list to bring the real mavericks closer to the top of the list.

These two hypotheses have been investigated in a case study using the learning method to improve the quality of maverick analysis for a real architect reorganizing a real system. This section first describes the case study itself, then reports subsequent analysis of data taken from the study.

5.4.1. A case study

The Arch tool was applied to the code of a real software system called TSL (Balcer, Hasling, & Ostrand, 1989). This system is in production use in Siemens operating divisions, and is undergoing active maintenance. It comprises 470 procedures in 33 modules, ranging in size from 1 to 29 procedures. Three modules with fewer than four procedures were excluded from the experiment. The architect was asked to perform four kinds of actions:

1. Move those mavericks which were actually in the wrong module.
2. Identify false-positive mavericks.
3. Identify those units for which nearest-neighbor classification was not the right model to explain his modularization decisions.
4. Supply user-defined features, as necessary, to identify shared design properties that were not represented by cross-reference features.

Time did not permit us to have the programmer actually rewrite any code. The experimental procedure went as follows:

1. Initialize the tool's classifier with feature weights based on the information content of features, as described in section 3.2.3.
2. Have the architect review each of the 10 worst mavericks, either specifying the best module assignment explicitly or removing it from further analysis.
3. Move the mavericks to their new modules as specified.
4. Review the common and distinctive feature data for explicitly classified mavericks, checking for learnability.
5. Have the architect supply features to explain "unlearnable" cases.
6. Extract training data from the user session log and transmit it to the learning component.
7. Train the neural network, as described in sections 4.1 through 4.2, until classification performance on the training data stops improving.
8. Transmit the learned weights back to the maverick analyzer.
9. Generate a revised maverick list.
10. If the list is non-empty, go back to step 2.
Notes:

• Step 4. Experience has shown that, for a module assignment to be learnable, there usually must be at least one feature that the unit shares with some good neighbors that it does not share with its nearest bad neighbors.
• Step 5. In principle, we could have waited until the tool failed to learn the correct classification for the unit before asking for a user-defined feature, but time constraints prevented us from waiting for this verification. From the user's point of view, supplying the feature provides useful documentation anyway, so it is not an unreasonable procedure. This step is actually a carefully focused knowledge acquisition procedure. The architect is asked to focus her attention on just the difficult situations, and to supply just enough information to explain them. This strategy is far more desirable for the architect than supplying large amounts of knowledge a priori, because she can see the immediate benefit of supplying the knowledge: it corrects the tool's mistake.
• Step 6. To extract training data without burdening the architect unduly, most of the training set should be identified automatically. Therefore, the training set was constructed by collecting all the units that appeared to be correctly classified already, with strong confidence, before the first user session began, and adding to them all the mavericks that the architect had reviewed and explicitly classified. By this procedure we hoped to avoid putting "false negatives" in the training data, but this assumption was never directly verified.

The architect required 10 cycles through the experimental procedure before all the mavericks had been examined. His actions on each cycle are summarized in table 3.

Table 3. Architect's actions during maverick analysis.
Session   Mavericks   False   Move   Remove   Repeat   User Features   Checked
1         125         5       5                                        10
2         92          6       4                                        10
3         82          5+6     3      1        1        2               16
4         71          6+3     1+2    1+1      1        1               15
5         65          6+5     3      1        1        1               16
6         56          7+9            1        1        2               18
7         45          6+5            1                 5               12
8         27          5+1            5+3               6               14
9         12          6              4                 4               10
10        3           1                                3               1
Totals    125         53+29   16+2   15+4     4        24              114
Session: During each session, the architect set out to check the ten worst mavericks. The actual count was sometimes more and sometimes less.
Mavericks: The number of mavericks on the maverick list for each session.
False: The number of mavericks that the architect indicated were false positives.
+digit: Sometimes the architect explicitly labeled procedures other than the top ten mavericks, such as when he was providing user-defined features for the maverick and the other members of its cluster. The number of non-top-ten mavericks so labeled is shown as "+digit."
Move: The number of mavericks that the architect reassigned to a new module.
Remove: The number of mavericks that the architect indicated were not appropriate for maverick analysis.
Repeat: The number of mavericks that showed up in the top ten after having been checked previously.
User features: The number of user-defined features added to explain "unlearnable" cases.
Checked: Total number of False, Moved, Removed, and Repeated mavericks.
Several observations about the study are worth mentioning:
• All of the Moved mavericks were found in the first five sessions. The last five sessions served only to teach the tool not to report the false positives. A real user, after finding no more real mavericks in the top ten, might well choose to stop analysis, satisfied that he has found nearly all of the problem procedures.
• In the first five sessions, all of the top-ten mavericks had a reclassification confidence of 2 or less. All of the Moved mavericks had a reclassification confidence of 0 or 1.
• User-defined features were not needed until session 3.
• Nineteen of the mavericks were judged to be procedures for which the nearest-neighbor classification heuristic seemed inappropriate. For example, some of the cases involved one-of-a-kind procedures. There were another 19 procedures that were excluded by very simple heuristics, such as those belonging to modules with three or fewer members, belonging to an imported module, or with fewer than three good or bad neighbors having at least one common feature.
• Eight of the 24 mavericks requiring user-defined features were type-destructor procedures that shared more implementation information with one another than they did with other procedures defined on their own type. Possibly these eight should have been excluded rather than kept and given an extra feature.
• One hundred and fourteen of the 125 mavericks had to be checked explicitly before the tool learned to completely agree with the architect. This seems to indicate that very little generalization was taking place.
• Despite these concerns, the tool did eventually learn to incorporate all the architect's classification judgments.
5.4.2. Analyzing generalization performance

The fact that 114 mavericks had to be checked explicitly before perfect learning was achieved seemed to be due to a very liberal definition of maverick. In order to minimize the risk of false negatives, a large number of false-positive mavericks were occurring. Therefore, to measure learning performance, we decided to treat the maverick list like the result of an information retrieval (IR) query, measuring its performance on a precision/recall graph.

When an IR system produces a set of documents that approximately match a query, some of the retrieved documents are likely to be irrelevant, as judged by the end user. Since retrieval is based on the similarity between the query and each of the documents in the collection, the retrieved list is typically sorted in order of decreasing similarity, since the most-similar documents are presumed to be most likely to be relevant. Depending on the stamina of the end user, she may look at only the first few entries, or half the list, or the whole list.

Information scientists have defined two standard concepts, precision and recall, to measure the quality of a retrieved set of documents. Precision is the fraction of documents in the retrieved set that are relevant. Recall is the fraction of relevant documents that are in the retrieved set. When the retrieved set of documents is ordered by their estimated likelihood of relevance, that ordering can be evaluated by measuring precision and recall for successively longer prefixes of the list. Specifically, data are collected for each prefix of the list that ends with a relevant document. Precision and recall are plotted for each such prefix in a precision/recall graph. Perfect performance, where all the relevant documents were recalled and bunched at the top of the list, would produce a horizontal line at Y = 1.0. Random performance, where all relevant documents were recalled but uniformly distributed throughout the list, would produce a horizontal line at Y = relevant/(relevant + nonrelevant).

Two methods of estimating relevance can be compared by comparing their precision/recall graphs. However, precision and recall depend strongly on both the document collection and the specific query, so comparisons must use the same collections and queries to evaluate both methods. Normally, precision and recall are averaged over a large number of queries. However, it is also valid to compare the performance of two methods on a single query, as long as one does not generalize too much from a single example.

Precision/recall measurement can be used to measure the quality of maverick analysis by treating the maverick list as a sorted list of retrieved documents, sorted by estimated relevance. Relevance is estimated by reclassification confidence and lack of confidence in the current classification. We will compare two sets of parameters for the similarity function by comparing the maverick lists they generate for the same set of data. Various small procedural changes during the course of the case study prevented us from performing meaningful analysis directly on the experimental protocol. However, the protocol did produce a complete list of relevant documents, allowing us to analyze precision and recall with and without learning.
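The prefix-based measurement just described is straightforward to compute. The following sketch is our illustration, not part of the original study's tooling; it emits one (recall, precision) point per relevant entry in a ranked maverick list:

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Compute (recall, precision) at each prefix of a ranked list that ends
    // with a relevant item. rankedRelevance[i] is true if the architect
    // judged the i-th list entry to be a real maverick.
    std::vector<std::pair<double, double>>
    precisionRecallCurve(const std::vector<bool>& rankedRelevance, int totalRelevant) {
        std::vector<std::pair<double, double>> points;
        int relevantSeen = 0;
        for (std::size_t i = 0; i < rankedRelevance.size(); ++i) {
            if (rankedRelevance[i]) {
                ++relevantSeen;
                points.emplace_back(double(relevantSeen) / totalRelevant,   // recall
                                    double(relevantSeen) / double(i + 1));  // precision
            }
        }
        return points;
    }

A perfect ranking yields precision 1.0 at every point, while a uniformly scattered ranking hovers near relevant/(relevant + nonrelevant) throughout, matching the two baselines described above.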
First, figure 6 shows why learning was needed. Out of 125 mavericks, only 16 were actually relevant. Although precision was relatively high near the top of the list, beyond the first five real mavericks, precision dropped off rapidly.

Figure 6. Retrieval performance of the initial, untrained maverick list.

In the first iteration of the experiment, the architect's analyses of the 10 worst mavericks were used to learn new feature weights. To compare performance with and without learning, we performed the specified module reassignments, removed the original top 10 from further consideration as mavericks, and compared the maverick lists constructed with and without the learned feature weights. The results are plotted in figure 7. Here we see that, without learning, precision is low at all recall levels, not much better than "random." Precision with learning is better in all cases, and 3 to 5 times better for recall levels up to 0.63.

Figure 7. Retrieval performance of revised maverick list.

To compare the contribution of feature weights versus gross coefficient values in the similarity function, we tried varying the gross coefficients, both with and without learned feature weights. We found the variation in performance due to the gross coefficients to be negligible compared to the variation due to learning feature weights.

6. Discussion

6.1. Modeling modularization

The case studies reported here support the following hypotheses:
• Nearest-neighbor classification, with similarity measured based on common and distinctive features, is an effective model for a large fraction of the modularization decisions in software systems.
• The accuracy of the model is highly sensitive to individual feature weights.
• The accuracy of the model is relatively insensitive to the gross coefficients of the similarity function. Model performance can be substantially improved by learning from its biggest mistakes, without need for additional features.
• With a modest number of user-defined features, the learning component can adapt perfectly to the architect's judgment.

Future research combining the nearest-neighbor classification heuristic with other kinds of heuristic modularization advice may produce a practical module architecture advisor.

6.2. Related work on software similarity and modularity

Other software engineering research related to the present work falls into the following categories:
• Software similarity: work by Maarek and Kaiser (1987), for example, uses similarity and clustering to group software units in a reuse library. They rejected implementation features as irrelevant to reuse.
• Module and subsystem synthesis: Belady and Evangelisti (1982), Hutchens and Basili (1985), and Chen et al. (1990) have investigated clustering units into modules based on data bindings and data flow connection strength. None of these papers reports validation of the clusters by real maintainers. Maarek and Kaiser (1988) looked at clustering for the purpose of identifying software units that are likely to be modified at the same time. They proposed measuring affinity between software units by a combination of connection strength and how often in the past the units have been changed as part of the same task. Choi and Scacchi (1990) propose synthesizing subsystems based on articulation points in the cross-reference graph.
• Module quality analysis: Selby and Basili (1988) have measured the maintainability of a module by measuring its internal cohesion and external coupling to other modules. Porter and Selby (1990) successfully use this information to help predict the module's error proneness and cost to repair. They automatically construct decision trees from large volumes of real project data. This work is closest in spirit to the present work, in that it applied machine learning techniques to real-world data and measures success by real-world standards. However, the methods of these authors do not identify specific flaws in module quality, nor do they make reorganization recommendations.
6.3. Learning similarity vs. learning classification

The classifier and similarity learning method worked significantly better in the given problem domain than simpler methods that learn categories directly. We attribute the failure of simpler methods to the feature sparsity and category diversity inherent in the problem domain, and to the small number of examples in the training data. The information-hiding principle predicts that very few features will occur widely; most will be rare, indicating that exemplar-style category models would be more appropriate than probabilistic models. The small number of examples of each category also suggested that the actual category members be used as exemplars, rather than a small set of synthesized prototypes representing a larger set of category members. Explicitly representing and learning the similarity function also permitted us to force it to be monotonic and matching, which probably also contributed to its success.
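The "monotonic and matching" shape constraint can be illustrated with a small sketch. This is our illustration, not the paper's exact similarity function; here features are integer ids indexing the learned weight vector w, and k is a gross coefficient:

    #include <unordered_set>
    #include <vector>

    // Hedged sketch: the result is "matching" (only shared features raise it,
    // unshared features lower it) and monotonic in the per-feature weights.
    double similarity(const std::unordered_set<int>& a,
                      const std::unordered_set<int>& b,
                      const std::vector<double>& w, double k) {
        double common = 0.0, distinctive = 0.0;
        for (int f : a) {
            if (b.count(f)) common += w[f];
            else            distinctive += w[f];
        }
        for (int f : b)
            if (!a.count(f)) distinctive += w[f];
        double denom = common + k * distinctive;
        return denom > 0.0 ? common / denom : 0.0;   // in [0, 1]
    }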
6.4. Heuristic design advisors

Several characteristics of the architecture advisor may be useful in other heuristic design advisors:
• The advisor embodies a model of the judgment-based human reasoning process it is advising, rather than merely checking a mechanistic design rule. While a practical tool would incorporate as many such design rules as are useful, adding judgment-based advice extends the usefulness of the tool considerably.
• By providing advice in the form of an ordered list of good alternatives, the tool was actually more useful than if it had given a single recommendation. By examining the alternatives, the architect could better understand why one was recommended over another, even if she chose one of the "lesser" alternatives.
• By providing interactive analysis on the architect's work in progress, the tool played the role of a subservient assistant rather than a demanding master. Whether the architect acted in accordance with or in opposition to the tool's advice, the tool could analyze the new situation, sometimes revise its judgment based on the architect's actions, and identify good alternatives for the next step.
• By acquiring most of its feature data from the design artifacts themselves, the tool was able to provide a useful level of service even before it acquired additional user-defined features and learned from its mistakes.
• Further knowledge acquisition was mistake-driven. The user could know much better what information to supply when she knew what mistake needed correcting. The alternative would have been to expect the user to supply much more information, not knowing which parts of it were important.
• By using its mistakes as "relevance feedback" to reorder the priorities of its other recommendations, the tool significantly improved the quality of its advice.

7. Future work

Clearly, more case studies are needed to confirm or contradict what was found in the TSL study. Such studies are somewhat expensive because they require an engineer knowledgeable about the code to comment on all mavericks, both real and false-positive. We find that an effective combination is to team up the knowledgeable engineer with an experimenter well versed in the information-hiding principle. The experimenter pre-screens the mavericks to point out the ones that are obviously true or false, thereby reducing the engineer's effort considerably.
Next, a method is needed for incorporating learning into the interactive clustering tools. The main difficulty is to automatically extract training data from the user session. We believe that learning similarity rather than learning classification makes the knowledge learned more transferable to new problems. The similarity judgment learned during maverick analysis could be used to
• cluster the members of a module into smaller sub-modules,
• cluster modules into subsystems, and
• reanalyze the structure of the system after adding major enhancements.

Use in reanalysis would require a few extensions to the present work. Suppose that an architect did maverick analysis on a system, including several rounds of learning from relevance feedback, and saved the learned weights. After a major revision to the system, the architect's specific relevance judgments could not be reused without review, because the restructuring might have invalidated some of them. However, the learned weights represent his judgment of the relative significance of different implementation features, and that judgment would not be likely to change radically. Therefore, the learned weights could be used to compute the initial maverick list after reorganization, and then could be readjusted according to further relevance feedback. Some procedure would be needed for calibrating the a priori weight estimates for new features introduced by the enhancements, so that as a group they would be neither more nor less significant than the learned weights carried over from before the enhancements.
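One plausible calibration, offered as our assumption rather than a procedure from the paper, is to rescale the new features' a priori weights so that their group average matches the average learned weight:

    #include <numeric>
    #include <vector>

    // Scale the a priori estimates for features introduced by an enhancement
    // so that, as a group, they are neither more nor less significant than
    // the weights carried over from before the enhancement. Illustrative only.
    void calibrateNewFeatureWeights(const std::vector<double>& learned,
                                    std::vector<double>& estimates) {
        if (learned.empty() || estimates.empty()) return;
        double learnedMean = std::accumulate(learned.begin(), learned.end(), 0.0)
                             / learned.size();
        double estimateMean = std::accumulate(estimates.begin(), estimates.end(), 0.0)
                              / estimates.size();
        if (estimateMean == 0.0) return;            // nothing sensible to scale
        double scale = learnedMean / estimateMean;
        for (double& w : estimates) w *= scale;
    }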
Substantial additional research is needed to create a practical architecture re-engineering tool. Advisory heuristics are needed that apply to variables, types, and macros. Representations are needed for the role of each module in a system, and for the conceptual relationships among modules. Heuristics are needed that relate individual units to the roles of modules, and explanation techniques are needed to present the analyses to the architect.

8. Conclusions

We conclude from the work described here that it is effective to model software modularization as a nearest-neighbor clustering and classification activity. The model supported effective heuristic assistance with that activity, and effective performance improvement by learning and knowledge acquisition. Naturally, similarity-based clustering is not the only principle used in software modularization. However, research on other heuristics can now be based on identifying those cases where the nearest-neighbor heuristic does not apply.

We also conclude that nearest-neighbor classification can be learned effectively by converting classification data to more-similar-than judgments and training the similarity function by back-propagation using the comparison paradigm. We did not do any efficiency studies, whether in time, space, or features needed. However, we believe that feature sparsity is an important characteristic of the problem domain and that nearest-neighbor classification captures more information about the sparse features than is collected in statistical classifiers.

Finally, we hope that the ideas about modeling judgment, giving advice, and acquiring knowledge will prove useful in the creation of other intelligent design assistants.
Notes

1. When the programming language does not have an explicit module construct, the programmer typically uses files to represent his modules. However, our use of the term module specifically does not cover systems where every module or file contains only one procedure.
2. The conference proceedings were not actually published until the following year.
3. The astute programmer will be worrying about the problem of duplicate variable names in different scopes, such as i, j, and k, which are declared many times in large systems, but with unrelated meanings. We restrict our features to non-local names, so that private variables are not considered, and give each distinct software unit a unique name system-wide, so that there is no problem with duplicate names.
4. Since greater values of C imply poorer confidence, one might more intuitively call this a measure of doubt.
References

Balcer, M.J., Hasling, W.M., & Ostrand, T.J. (1989). Automatic generation of test scripts from formal test specifications. Proceedings of the ACM SIGSOFT 1989 Third Symposium on Software Testing, Analysis, and Verification. Key West, FL: ACM Press.
Baum, E., & Haussler, D. (1988). What size net gives valid generalization? In D. Touretzky (Ed.), Advances in neural information processing systems (Vol. 1). Morgan-Kaufmann.
Belady, L.A., & Evangelisti, C.J. (1982). System partitioning and its measure. Journal of Systems and Software, 2(2).
Chapin, N. (1988). Software maintenance life cycle. Proceedings of the Conference on Software Maintenance—1988. IEEE Computer Society Press.
Chen, Y.-F., Nishimoto, M., & Ramamoorthy, C.V. (1990). The C information abstraction system. IEEE Transactions on Software Engineering, 16(3).
Choi, S.C., & Scacchi, W. (1990). Extracting and restructuring the design of large software systems. IEEE Software, 7(1), 66-73.
Hanson, S.J., & Bauer, M. (1989). Conceptual clustering, categorization, and polymorphy. Machine Learning, 3, 343-372.
Hanson, S.J., & Burr, D.J. (1990). What connectionist models learn: Learning and representation in connectionist networks. Behavioral and Brain Sciences, 13, 471-518.
Homa, D. (1978). Abstraction of ill-defined form. Journal of Experimental Psychology: Human Learning and Memory, 4, 407-416.
Horn, K.A., Compton, P., Lazarus, L., & Quinlan, J.R. (1985). An expert system for the interpretation of thyroid assays in a clinical laboratory. Australian Computer Journal, 17(1), 7-11.
Hutchens, D.H., & Basili, V.R. (1985). System structure analysis: Clustering with data bindings. IEEE Transactions on Software Engineering, 11(8).
Lange, R., & Schwanke, R.W. (1991). Software architecture analysis: A case study. Third International Workshop on Software Configuration Management. ACM Press.
Maarek, Y.S., & Kaiser, G.E. (1987). Using conceptual clustering for classifying reusable Ada code. Using Ada: ACM SIGAda International Conference. Special issue of Ada Letters. ACM Press.
Maarek, Y.S., & Kaiser, G.E. (1988). Change management for very large software systems. Seventh Annual International Phoenix Conference on Computers and Communications. Scottsdale, AZ.
Medin, D., & Schaffer, M.M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.
Parnas, D.L. (1972). Information distribution aspects of design methodology. Information Processing 71. Amsterdam: North-Holland.
Parnas, D.L. (1971). On the criteria to be used in decomposing systems into modules (Technical Report No. CMU-CS-71-101). Computer Science Department, Carnegie-Mellon University, Pittsburgh, PA.
Porter, A.A., & Selby, R.W. (1990). Empirically guided software development using metric-based classification trees. IEEE Software, 7(3).
Posner, M.I., & Keele, S.W. (1968). On the genesis of abstract ideas. Journal of Experimental Psychology, 77, 353-363.
Rivest, R., Haussler, D., & Warmuth, M.K. (1989). Proceedings of the Second Annual Workshop on Computational Learning Theory. Morgan-Kaufmann.
Schwanke, R.W., & Platoff, M.A. (1989). Cross references are features. Second International Workshop on Software Configuration Management. Software Engineering Notes, 17(7), ACM Press.
Selby, R.W., & Basili, V.R. (1988). Error localization during software maintenance: Generating hierarchical system descriptions from the source code alone. Conference on Software Maintenance—1988. IEEE Computer Society Press.
Smith, E.E., & Medin, D.L. (1981). Categories and concepts. Cambridge, MA: Harvard University Press.
Sompolinsky, H., & Tishby, N. (1990). Learning from examples in large neural networks. Siemens Computational Learning Theory and Natural Learning Systems Workshop, September 5 & 6, Princeton, NJ.
Tesauro, G., & Sejnowski, T.J. (1989). A parallel network that learns to play Backgammon. Artificial Intelligence, 39, 357-390.

Received September 24, 1990
Accepted July 24, 1992
Final Manuscript December 15, 1992
Chapter 5

ML Applications in Generation and Synthesis

Depending on the domain from which the training data are collected and the nature of the target function to be learned, ML methods can be used to generate or synthesize various types of software products or artifacts. This chapter pertains to the applications of ML methods for software artifact generation or synthesis. Examples include: test data generation, synthesis of test-resource allocation, generation of project management rules or schedules, and synthesis of agent programs, data structures, scripts, or design schemas. Table 25 provides a glimpse of the ML methods used in this application area.

Table 25. ML methods used in generation and synthesis.
[The table maps each artifact type (test cases/data, test resources, project management rules, design repair knowledge, design schemas, data structures, project management schedules, software agents, and programs/scripts) to the ML methods applied to it (NN, IBL, DT, GA, GP, ILP, EBL, CL, BL, AL, IAL, RL, EL, SVM, CBR); the checkmark grid did not survive scanning.]
In this chapter, we include one paper, by Michael, McGraw and Schatz [102]. The paper describes a GA-based approach to test data generation. The proposed approach is based on dynamic test data generation and is geared toward generating condition-decision adequate test sets. Two genetic algorithms, a standard algorithm and a differential algorithm, are utilized in their study. Experimental results are obtained for three approaches (the GA-based generator, the gradient descent generator, and the random test generator) on programs ranging from simple examples to a real-world autopilot control program. The comparison among these approaches indicates some salient features of the GA-based approach (such as global optimization and the serendipitous coverage induced by evolutionary pressure) and its limitations (an expensive search process and more executions of the target program). The authors also point out some other possible test data generation applications for which the GA approach is considered appropriate, as well as open research issues in automatic test data generation systems.
The following paper will be included here: C. Michael, G. McGraw and M. Schatz, "Generating software test data by evolution", IEEE Trans. SE, Vol. 27, No. 12, December 2001, pp. 1085-1110.
Generating Software Test Data by Evolution

Christoph C. Michael, Member, IEEE, Gary McGraw, Member, IEEE, and Michael A. Schatz

Abstract—This paper discusses the use of genetic algorithms (GAs) for automatic software test data generation. This research extends previous work on dynamic test data generation where the problem of test data generation is reduced to one of minimizing a function [1], [2]. In our work, the function is minimized by using one of two genetic algorithms in place of the local minimization techniques used in earlier research. We describe the implementation of our GA-based system and examine the effectiveness of this approach on a number of programs, one of which is significantly larger than those for which results have previously been reported in the literature. We also examine the effect of program complexity on the test data generation problem by executing our system on a number of synthetic programs that have varying complexities.

Index Terms—Software testing, automatic test case generation, code coverage, genetic algorithms, combinatorial optimization.
[The authors are with Cigital Corporation, Suite 400, 21351 Ridgetop Circle, Dulles, VA 20166. Manuscript received 17 Dec. 1997; revised 5 Feb. 1999; accepted 24 Oct. 2000. Recommended for acceptance by D. Rosenblum.]

1 INTRODUCTION

An important aspect of software testing involves judging how well a series of test inputs tests a piece of code. Usually, the goal is to uncover as many faults as possible with a potent set of tests, since a test series that has the potential to uncover many faults is obviously better than one that can only uncover a few. Unfortunately, it is almost impossible to predict how many faults will be uncovered by a given test set. This is not only because of the diversity of the faults themselves, but because the very concept of a fault is only vaguely defined (c.f., [3]). Still, it is useful to have some standard of test adequacy, to help in deciding when a program has been tested thoroughly enough. This leads to the establishment of test adequacy criteria.

Once a test adequacy criterion has been selected, the question that arises next is how to go about creating a test set that is good with respect to that criterion. Since this can be difficult to do by hand, there is a need for automatic test data generation.

Unfortunately, test data generation leads to an undecidable problem for many types of adequacy criteria. Insofar as the adequacy criteria require the program to perform a specific action, such as reaching a certain statement, the halting problem can be reduced to a problem of test data generation. To circumvent this dilemma, test data generation algorithms use heuristics, meaning that they do not always succeed in finding an adequate test input. Comparisons of different test data generation schemes are usually aimed at determining which method can provide the most benefit with limited resources.

In this paper, we introduce GADGET (the Genetic Algorithm Data GEneration Tool), which uses a test data generation paradigm commonly known as dynamic test data generation. Dynamic test data generation was originally proposed by [1] and then investigated further by [2], [4], and [5]. During dynamic test generation, the source code of a program is instrumented to collect information about the program as it executes. The resulting information, collected during each test execution of the program, is used to heuristically determine how close the test came to satisfying a specified test requirement. This allows the test generator to modify the program's input parameters gradually, nudging them ever closer to values that actually do satisfy the requirement. In essence, the problem of generating test data reduces to the well-understood problem of function minimization.

The approach usually proposed for performing this minimization is gradient descent, but gradient descent suffers from some well-known weaknesses. Thus, it is appealing to use more sophisticated techniques for function minimization, such as genetic search [6], simulated annealing [7], or tabu search [8]. In this paper, we investigate the use of genetic search to generate test cases by function minimization.

In the past, automatic test data generation schemes have usually been applied to small programs (e.g., mathematical functions) using simplistic test adequacy criteria (e.g., branch coverage). Random test generation performs adequately on many such problems. Nevertheless, it seems unlikely that a random approach could also perform well on realistic test-generation problems, which often require an intensive manual effort. Indeed, our results suggest that random test generation performs poorly on realistic programs. The broader implication is that, due to their simplicity, toy programs fail to expose the limitations of some test-data generation techniques. Therefore, such programs provide limited utility when comparing different test generation methods. Because GADGET was designed to work on large programs written in C and C++, it is possible for us to examine the effects of program complexity on the difficulty of test data generation.

We examine a feature of the dynamic test generation problem that does not have an analog in most other function minimization problems. If we are trying to satisfy m test requirements for the same software, we have to perform many function minimizations, but the functions being minimized are sometimes quite similar. That makes it
possible to solve one problem by coincidence while trying to solve another. In other words, the test generator can find inputs that satisfy one requirement even though it is searching for inputs to satisfy a different one. On the larger programs we tested, coincidental discovery of test inputs satisfying new requirements was much more common than their deliberate detection (the GAs often satisfied one test requirement while they were trying to satisfy a different one). In fact, the ability of test data generators to satisfy coverage requirements coincidentally seems to play an important role in determining their effectiveness.

Random test generation did not perform well in our experiments. Moreover, our empirical results show an increasing performance gap between random test generation and the more sophisticated test generation methods when tests are generated for increasingly complex programs. This suggests that the additional effort of implementing more sophisticated test generation techniques is ultimately justified.

This paper begins with an overview of automatic test data generation methods (Section 2), followed by an introduction to genetic algorithms in the context of test-data generation (Section 3). In Section 4, we describe our own test-data generation system, GADGET. Finally, in Section 5, we empirically examine the performance of our system on a number of test programs.

2 TEST ADEQUACY CRITERIA AND TEST DATA GENERATION

Some test paradigms call for inputs to be selected on the basis of test adequacy criteria, which are used to ensure that certain features of the source code are exercised (in testing terminology, these features are to be covered by the test inputs). Some studies, such as [9], [10], [11], have concluded that test adequacy does, in fact, improve the ability of a test suite to reveal faults, though [12], [13], [14], [15], among others, describe situations where this is not true. Whether or not test adequacy criteria really measure the quality of a test suite, they are an objective way to measure the thoroughness of testing.

These benefits cannot be realized unless adequate test data (i.e., test data that satisfy the adequacy criteria) can be found. Manual generation of such tests can be quite time-consuming, so it would be appealing to have algorithms that can examine a program's structure and generate adequate tests automatically. It is desirable to have test data generation algorithms that are more powerful in the sense of being more capable of finding adequate tests. Our research addresses this need.

2.1 Code Coverage and Test Adequacy Criteria

Many test adequacy criteria require certain features of a program's source code to be exercised. A simple example is a criterion that says, "Each statement in the program should be executed at least once when the program is tested." Test methodologies that use such requirements are usually called coverage analyses because certain features of the source code are to be covered by the tests. A test adequacy criterion generally leads to a set of test requirements specifically stating the conditions that the tests must fulfill. For example, each statement in a program might be associated with a requirement asking that the statement in question be executed during testing. The example given above describes statement coverage.

A slightly more refined approach is branch coverage. This criterion requires every conditional branch in the program to be taken at least once. For example, supposing we want to obtain branch coverage of the following code fragment:

    if (a >= b) { do one thing }
    else { do something else }

we must satisfy two test requirements: There must be one program input that causes the value of the variable a to be greater than or equal to the value of b, and there must be one that causes the value of a to be less than that of b. One effect of these requirements is to ensure that both the "do one thing" and "do something else" sections of the program are executed.

There is a hierarchy of increasingly complex coverage criteria having to do with the conditional statements in a program. We shall refer to this hierarchy as defining levels of coverage. At the top of the hierarchy is multiple condition coverage, which requires the tester to ensure that every permutation of values for the Boolean variables in every condition occurs at least once. At the bottom of the hierarchy is function coverage, which requires only that every function be called once during testing (saying nothing about the code inside each function). Somewhere between these extremes is condition-decision coverage, which is the criterion we use in our test-data generation experiments.

A condition is an expression that evaluates to TRUE or FALSE, but does not contain any other TRUE/FALSE-valued expressions, while a decision is an expression that influences the program's flow of control. To obtain condition-decision coverage, a test set must make each condition evaluate to TRUE for at least one of the tests and make it evaluate to FALSE for at least one of the tests. Furthermore, the TRUE and FALSE branches of each decision must be exercised. Put another way, condition-decision coverage requires that each branch in the code be taken and that every condition in the code be TRUE at least once, and FALSE at least once.

With any of these coverage criteria, we must ask what to do when an existing test set fails to meet the chosen criterion. In many cases, the next step is to try to find a test set that does satisfy the criterion. Since it can be quite difficult to manually search for test inputs satisfying certain requirements, test data generation algorithms are used to automate this process.
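As a concrete illustration of condition-decision coverage (our example, not one from the paper), consider a decision built from two conditions:

    if (a < 10 && done) { take_action(); }

Condition-decision coverage requires tests in which a < 10 evaluates to TRUE and to FALSE, tests in which done evaluates to TRUE and to FALSE, and tests in which the decision as a whole takes both its TRUE and FALSE branches. Multiple condition coverage would instead demand all four combinations of the two conditions.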
2.2 Previous Work in Test Data Generation

The term "test generation" is commonly applied to a number of diverse techniques. For example, tests may be generated from a specification (in order to exercise features of the specification), or they may be generated from a state model of software operation (in order to exercise various states or combinations of states).

For a program with a complex graphical user interface, test generation may simply consist of finding tests that
exercise all aspects of the interface. On the other hand, many software systems package a diverse collection of services and, for such packages, it is often considered sufficient to ensure that most or all services are used during testing. In such cases, it is often straightforward to find inputs that exercise a given feature and, so, a test generator only has to list the features or combinations of features that are to be tested. In other words, the term "test data generation" sometimes only refers to the process of coming up with concrete test criteria.

Unfortunately, some test criteria are harder to satisfy. If we want to satisfy code coverage criteria or exercise some other precisely defined aspect of program semantics, it may be far from obvious what program inputs satisfy a given criterion. This paper is concerned with this case: We are given a set of test adequacy criteria and the goal is to find test inputs that cause the criteria to be satisfied.

There are many existing paradigms for this type of automatic test data generation. Perhaps the most commonly encountered are random test data generation, symbolic (or path-oriented) test data generation, and dynamic test data generation. In the next three sections, we will describe each of these techniques in turn. The GADGET system we describe in this paper is a dynamic test generator. In our experiments, we use random data generation as a baseline for comparison.

2.2.1 Random Test Data Generation

Random test data generation simply consists of generating inputs at random until a useful input is found. The problem with this approach is clear: With complex programs or complex adequacy criteria, an adequate test input may have to satisfy very specific requirements. In such a case, the number of adequate inputs may be very small compared to the total number of inputs, so the probability of selecting an adequate input by chance can be low. This intuition is confirmed by empirical results (including those reported in Section 5). For example, [16] found that random test generation was outperformed by other methods, even on small programs where the goal was to obtain statement coverage. More complex programs or more complex coverages are likely to present even greater problems for random test data generators. Nonetheless, random test data generation makes a good baseline for comparison because it is easy to implement and commonly reported in the literature.

2.2.2 Symbolic Test Data Generation

Many test data generation methods use symbolic execution to find inputs that satisfy a test requirement (e.g., [17], [18], [19]). Symbolic execution of a program consists of assigning symbolic values to variables in order to come up with an abstract, mathematical characterization of what the program does. Thus, ideally, test data generation can be reduced to a problem of solving an algebraic expression.

A number of problems are encountered in practice when symbolic execution is used. One such problem arises in indefinite loops, where the number of iterations depends on a nonconstant expression. To obtain a complete picture of what the program does, it may be necessary to characterize what happens if the loop is never entered, if it iterates once, if it iterates twice, and so on ad infinitum. In other words, the symbolic execution of the program may require an infinite amount of time.

Test data generation algorithms solve this problem in a straightforward way: The program is only executed symbolically for one control path at a time. Paths may be selected by the user, by an algorithm, or they may be generated by a search procedure. If one path fails to result in an expression that yields an adequate test input, another path is tried.

Loops are not the only programming constructs that cannot easily be evaluated symbolically; there are other obstacles to a practical test data generation algorithm based on symbolic execution. Problems can arise when data is referenced indirectly, as in the statement:

    a = B[c+d] / 10;

Here, it is unknown which element of the array B is being referred to by B[c+d] because the variables c and d are not bound to specific values.

Pointer references also present a problem because of the potential for aliasing. Consider that the C code fragment:

    *a = 12;
    *b = 13;
    c = *a;

results in c taking the value 12 unless the pointers a and b refer to the same location, in which case, c is assigned the value 13. Since a and b are not bound to numeric values during symbolic execution, the final value in c cannot be determined. Technically, any computable function can be computed without the use of pointers or arrays, but it is not normal practice to avoid these constructs when writing a program. Thus, although array and pointer references are not a theoretical impediment to the use of symbolic execution, they complicate the problem of symbolically executing real programs.

2.2.3 Dynamic Test Data Generation

A third class of test data generation paradigms is dynamic test data generation, introduced in [1] and exemplified by the TESTGEN system of [2], [16], as well as the ADTEST system of [5]. This paradigm is based on the idea that if some desired test requirement is not satisfied, data collected during execution can still be used to determine which tests come closest to satisfying the requirement. With the help of this feedback, test inputs are incrementally modified until one of them satisfies the requirement. For example, suppose that a hypothetical program contains the condition

    if (pos >= 21) ...

on line 324 and that the goal is to ensure that the TRUE branch of this condition is taken. We must find an input that will cause the variable pos to have a value greater than or equal to 21 when line 324 is reached. A simple way to determine the value of pos on line 324 is to execute the program up to line 324 and then record the value of pos.
Let pos324(x) denote the value of pos recorded on line 324 when the program is executed on the input x. Then, the function

    f(x) = 21 - pos324(x),  if pos324(x) < 21;
    f(x) = 0,               otherwise

is minimal when the TRUE branch is taken on line 324. Thus, the problem of test data generation is reduced to one of function minimization: To find the desired input, we must find a value of x that minimizes f(x).

In a sense, the function f (which we will also call an objective function) tells the test generator how close it is to reaching its goal. If x is a program input, then the test generator can evaluate f(x) to determine how close x is to satisfying the test requirement currently being targeted. The idea is that the test generator can now modify x and evaluate the objective function again in order to determine what modifications bring the input closer to meeting the requirement. The test generator makes a series of successive modifications to the test input using the objective function for guidance and, hopefully, this leads to a test that satisfies the requirement (in fact, f can only be said to provide heuristic information, as will become apparent when we discuss the construction of f in Section 4.3).
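In code, such an objective might look like the following sketch, where runInstrumented is a hypothetical instrumentation hook of our own, not part of TESTGEN, ADTEST, or GADGET:

    // Branch-distance objective for "pos >= 21" on line 324 (cf. the
    // formula above). runInstrumented() is assumed to execute the target
    // program on input x and report the value pos held at line 324.
    struct Input { int fields[4]; };        // stand-in for the program's inputs

    int runInstrumented(const Input& x);    // returns pos324(x)

    double objective(const Input& x) {
        int pos = runInstrumented(x);
        return pos < 21 ? 21.0 - pos : 0.0; // zero exactly when pos >= 21
    }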
In the TESTGEN system of [2], the minimization of f(x) begins by establishing an overall goal, which is simply the satisfaction of a certain test requirement. The program is executed on a seed input, and its behavior on this input is used as the basis of a search for a satisfactory input (that is, if the seed input is not satisfactory itself).

The subsequent action depends on whether the execution reaches the section(s) of code where the test requirement is supposed to hold (for example, whether it reaches line 324 in the example above). If it does, then function minimization methods can be used to find a useful input value.

If the code is not reached, a subgoal is created to bring about the conditions necessary for function minimization to work. The subgoal consists of redirecting the flow of control so that the desired section of code will be reached. The algorithm finds a branch that is responsible (wholly or in part) for directing the flow of control away from the desired location and attempts to modify the seed input in a way that will force the control of execution in the desired direction. The new subgoal can be treated in the same way as other test requirements. Thus, the search for an input satisfying a subgoal proceeds in the same way as the search for an input satisfying the overall goal. Likewise, more subgoals may be created to satisfy the first subgoal. This recursive creation of subgoals is called chaining in [2] and [4].

Korel's approach is advantageous when there is more than one path that reaches the desired location in the code. The test data generation algorithm is free to choose whichever path it wants (as long as it can force that path to be executed), and some paths may be better than others. For example, suppose we want to take the TRUE branch of the condition

    if (b > 10) ...

but suppose that b has a default value of 3. It may be that one execution path gives b a new value, while a different path simply leaves the value of b alone. As long as we only take the path that leaves b with its default value, we will never be able to make the condition TRUE; no choice of inputs can make the default value of b be anything other than 3. Therefore, the test generation algorithm must know how to select the execution path that assigns a new value to b. In the TESTGEN system, heuristics are used to select the path that seems most likely to have an impact on the target condition.

In the ADTEST system of [5], an entire path is specified in advance, and the goal of test data generation is to find an input that executes the desired path. Since it is known which branch must be taken for each condition on the path, all of these conditions can be combined in a single function whose minimization leads to an adequate test input. For example, if the desired path requires taking the TRUE branch of the condition

    if (b >= 10) ...

on line 11 and taking the FALSE branch of the condition

    if (c > 8) ...

on line 13, then one can find an adequate test input by minimizing the function f1(x) + f2(x), where

    f1(x) = 10 - b11, if b11 < 10 on line 11;  0, otherwise;
    f2(x) = c13 - 8,  if c13 > 8 on line 13;   0, otherwise.

(Here, c13 and b11 are actually functions of the input value x.) Unfortunately, this function cannot be evaluated until line 11 and line 13 are both reached. Therefore, the ADTEST system begins by trying to satisfy the first condition on the path, adding the second condition only after the first condition has been satisfied. As more conditions are reached, they are incorporated in the function that the algorithm seeks to minimize.

Another test generation system relevant to our work is the QUEST/Ada system of [20], [21]. This is a hybrid system combining random testing and dynamic testing for Ada code. Once the code is instrumented and ranges and types of input variables have been provided, the system creates test data using rule-based heuristics. For example, values of parameters are adjusted according to one such rule to increase or decrease by a fixed constant percentage. The test adequacy criterion chosen by Chang et al. is branch coverage. The system creates a coverage table for all branches and marks those that have been successfully covered. Table 1 provides an example of such a branch table. The table is consulted during analysis to determine which branches to target for testing. Partially covered branches are always chosen over completely noncovered branches.

Although QUEST/Ada does not use the dynamic test generation paradigm we have been describing, the coverage table of [21] is relevant to our research because it provides a strategy for dealing with the situation where a desired condition is not reached. Instead of picking a particular
condition, as TESTGEN does, or picking a particular path, like ADTEST, this strategy is opportunistic and seeks to cover whatever conditions it can reach. Although this is inefficient when one only wants to exercise a certain feature of the code under test, it can save quite a bit of unnecessary work if one wants to obtain complete coverage according to some criterion. We developed the coverage-table strategy independently and use it in our test-generation system.

TABLE 1. A Sample Coverage Table after Chang. [The grid itself did not survive scanning; it lists each branch with TRUE and FALSE decision columns, marking covered outcomes with an X.] This table is generated from a program flow chart. The table provides information regarding the covered branches and directs future test case generation. We adapt this approach for our generator as well.

In [22], [23], [24], simulated annealing is used in conjunction with dynamic test generation in much the same way that we use genetic algorithms. These papers only report results for small programs, but they show how dynamic test generation can be applied to numerous test generation problems other than the satisfaction of structural coverage criteria.

The GADGET test generation system, which we discuss in this paper, is a dynamic test generation system like TESTGEN and ADTEST, but it uses genetic search to perform optimization, instead of the gradient descent techniques used by TESTGEN and ADTEST; the advantages of this will be discussed in Section 3.1. In [25], we presented some preliminary results on the performance of an early prototype and, in [26], we examined the performance of the GADGET system using a number of different optimization techniques, including genetic search and simulated annealing.

2.3 Contributions of This Paper

The research described in this paper addresses two limitations commonly found in dynamic test-data generation systems. First, many systems make it difficult to generate tests for large programs because they only work on simplified programming languages. Second, many systems use gradient descent techniques to perform function minimization and, therefore, they can stall when they encounter local minima (this problem is described below in greater detail).

Limited program complexity is a drawback of TESTGEN. It can only be used with programs written in a subset of the Pascal language. Aside from problems of practicality, the problem with such limitations is that they prevent one from studying how the complexity of a program affects the difficulty of generating test data. The unchallenging demands of simple programs can make simple schemes like random test generation appear to work better in comparison to other methods than they actually do.

Both TESTGEN and the QUEST/Ada system use gradient descent to minimize the objective function. This technique is limited by its inability to recognize local minima in the objective function (see Section 3.1). Our main goal is to overcome this limitation.

In this paper, we report on GADGET, a test generation system designed for programs written in C or C++. GADGET automatically generates test data for arbitrary C/C++ programs, with no limitations on the permissible language constructs and no requirement for hand-instrumentation. (However, GADGET does not generate meaningful character strings unless those strings represent numbers, and it does not generate meaningful values for compound data-types, even though such values can be regarded as inputs if they are read from an external file.) We report test results for a program containing over 2,000 lines of source code, excluding comments. To our knowledge, this is the largest program for which results have been reported. (Although [5] reported that their system had been run on programs as large as 60,000 lines of source code, no results were presented.) The ability to generate tests for programs using all C/C++ constructs has the added benefit of allowing us to study the effects of program complexity on the difficulty of test data generation. Some experimental results were also presented in [26], but the current paper also presents a self-contained explanation of GADGET's underlying test-generation paradigm.

The GADGET system uses genetic algorithms to perform the function minimization needed during dynamic test data generation. In this respect, it differs from the TESTGEN and ADTEST systems, which use gradient descent. The advantage of using genetic algorithms is that they are less susceptible to local minima, which can cause a test-generation algorithm to halt without finding an adequate input.

Genetic algorithms were used in a different way by [27]. That system judges test inputs according to how "interesting" they are, according to user-defined criteria for what is interesting. Thus, although that system is used for generating software tests, it is a test generator in a different sense than our system. It does not strive to satisfy specific test requirements and, thus, it does not need to use semantic information about the target program (in contrast, semantic information is crucial for the other test generation techniques we have discussed).

The most frequently cited advantage of genetic algorithms, when they are compared to gradient descent methods, is that genetic algorithms are less likely to stall in a local minimum, that is, a portion of the input space where f(x) appears to be minimal but is not. There is also a second advantage when several paths to the desired location are available. Unlike gradient descent methods, which must concentrate on a single path, the implicit parallelism of genetic algorithms allows them to examine many paths at once. This presents a partial solution to the path-selection problem described in Section 2.2.3.

Certain limitations are common to all dynamic test generation systems, including our own. Existing systems are limited to programs whose inputs are scalar types
205
1090
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 27, NO. 12, DECEMBER 2001
3.1 Function Minimization Recall that, in dynamic test generation , the target progra m is repeatedly executed on a series of test inputs. This is done to evaluate an objective function SJ, which heuristicall y tells the test generator how close each input has come to satisfying the test requirement that is currently targeted , This lets the test generator make successive modifications to the test input, bringing it ever closer to satisfying the requirement. This processis equivalent to conducting a minimization of the function 3 . Thus, the ability to numerically minimize a function value is the key to dynamic test data generation . The standar d minimization technique, used by [1], [2], [5],is gradien t descent. It essentially works by making successive changes to a test input in such a way that each change decreases the value of the objective function. Insofar as the objective function tells us how close the input is to satisfying the selected test requirement, each modification b r i n g s m e i n p u t c l o s e r t o m eeting that requirement, T h e minimization method suggested in [16] is a form of g r a d i e n t d e s c e n t . Small changes in one input value ar e made ^ . ^ tQ d e t e r m i n e a d d i r e c t i o n f or m a k i , J• m o v es w h e n ^ d j r e c t i o n {g fouJ . . , , . ., ..,.,. .. .., increasingly larg e steps ar e taken in that direction until no <• , . . . . . . . ... , , ^ r t h e r l m P r o v e m e n t is obtained (in which case, the search be lns S ***" w l t h s m a 1 1 l n P u t modifications), or until the input no longer reaches the desired location (in which case, t h e m o ve is t r i e d a a i n w i t h a s m a l l e r ste s i z e S P )' W h en n o riher ^ progress can be made, a different input value is modified, and the process terminates when no more progress can be made for any input value, Reference [5] also uses gradien t descent; specifically,the system described there uses a quasi-Newton technique (see [28]). Gradien t descent can fail if a local minimum is encountered. A local minimum occurs when none of the changes of input values that ar e being considered lead to a decrease in the function value and, yet, the value is not globally minimized. The problem arises because it is only possible to consider a limited number of input values (i.e., a small section of the search space) due to resourc e limitations. The input values that ar e considered may suggest that any change of values will cause the function's value to increase, even when the curren t value is not truly minimal. This situation is illustrate d in Fig. 1. In other area s where optimization is used, the problem of GENERATION 3 GENETIC ALGORITHMS FOR TEST DATA i oca i minima has led to the development of function GENERATION minimization methods that do not blindly pursue the s mulated In dynamic test data generation, the problem of finding s t e e P e , s t S r a d i e n t " N o t a b l f am ,°? 8 ^ e s e a r e } a, M. •. • J A i. n • iui \A \ annealin g [7], tabu search [81, [291,and genetic algorithms 6 6 software tests is reduced to an optimization problem. Most r _ . „ & ' J ' . T , . l " l " . . . . . . . , ., .. . . i, .., L[6] , l [30], [31]). In this r paper , we apply two a genetic rY } existing techniques solve the optimization rproblem with ." .., ' \ " ,, /. ' , . .. . . . ~ i , . . . . algorithms to the problem of test data generation . 
Gradient descent can fail if a local minimum is encountered. A local minimum occurs when none of the changes of input values that are being considered lead to a decrease in the function value and, yet, the value is not globally minimized. The problem arises because it is only possible to consider a limited number of input values (i.e., a small section of the search space) due to resource limitations. The input values that are considered may suggest that any change of values will cause the function's value to increase, even when the current value is not truly minimal. This situation is illustrated in Fig. 1.

In other areas where optimization is used, the problem of local minima has led to the development of function minimization methods that do not blindly pursue the steepest gradient. Notable among these are simulated annealing [7], tabu search [8], [29], and genetic algorithms ([6], [30], [31]). In this paper, we apply two genetic algorithms to the problem of test data generation.

3.2 Genetic Algorithms

A genetic algorithm (GA) is a randomized parallel search method based on evolution. GAs have been applied to a variety of problems and are an important tool in machine learning and function optimization. References [30] and [31] give thorough introductions to GAs and provide lists of possible application areas. The motivation behind genetic algorithms is to model the robustness and flexibility of natural selection.
1. The exceptions are GADGET itself, for which preliminary results were published in [25], and the system described in [22], which was published after the original preparation of this paper.
Fig. 1. Illustration of a local minimum. An algorithm tries to find a value of x that will decrease f(x), but there are no such values in the immediate neighborhood of the current x. The algorithm may falsely conclude that it has found a global minimum of f.
In a classical GA, each of a problem's parameters is represented as a binary string. Borrowing from biology, an encoded parameter can be thought of as a gene, where the parameter's values are the gene's alleles. The string produced by the concatenation of all the encoded parameters forms a genotype. Each genotype specifies an individual, which is in turn a member of a population. The GA starts by creating an initial population of individuals, each represented by a randomly generated genotype. The fitness of individuals is evaluated in some problem-dependent way, and the GA tries to evolve highly fit individuals from the initial population. In our case, individuals are more fit if they seem closer to satisfying a test requirement; for example, if the goal is to make the value of the variable pos greater than or equal to 21 on line 324, then an input that results in pos having the value 20 on line 324 is considered more fit than an input that gives it the value -67.

The genetic search process is iterative: evaluating, selecting, and recombining strings in the population during each iteration (generation) until some termination condition is reached. (In our case, a success leads to termination of the search, as does a protracted failure to make any forward progress. This is a relatively common arrangement.) The basic algorithm, where P(t) is the population of strings at generation t, is:

    initialize P(t)
    evaluate P(t)
    while (termination condition not satisfied) do
        select P(t+1) from P(t)
        recombine P(t+1)
        evaluate P(t+1)
        t = t + 1
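A minimal C++ rendering of this loop is sketched below. Because evaluation, selection, and recombination are problem-dependent, they appear here as caller-supplied functions; every name is ours, not GADGET's:

    #include <functional>
    #include <utility>
    #include <vector>

    using Genotype = std::vector<bool>;  // concatenated encoded parameters
    struct Individual { Genotype genotype; double fitness = 0.0; };
    using Population = std::vector<Individual>;

    struct GenerationalGA {
        std::function<Population()> initialize;
        std::function<void(Population&)> evaluate;               // assign fitness
        std::function<Population(const Population&)> select;     // pick parents
        std::function<void(Population&)> recombine;              // crossover + mutation
        std::function<bool(const Population&, int)> terminated;  // success or stall

        Population run() const {
            int t = 0;
            Population p = initialize();      // initialize P(t)
            evaluate(p);                      // evaluate P(t)
            while (!terminated(p, t)) {
                Population next = select(p);  // select P(t+1) from P(t)
                recombine(next);              // recombine P(t+1)
                evaluate(next);               // evaluate P(t+1)
                p = std::move(next);
                t = t + 1;
            }
            return p;
        }
    };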
In the first step, evaluation, the fitness of each individual is determined. Evaluation of each string (individual) is based on a fitness function that is problem-dependent. Determining fitness corresponds to the environmental determination of survivability in natural selection, and, in our case, it is determined by the fitness function described in Section 2.2.3. The next step, selection, is used to find two individuals that will be mated to contribute to the next generation. Selection of a string depends on its fitness relative to that of other strings in the population. Most often, the two individuals are selected at random, but each individual's probability of being chosen is proportional to its fitness. This is known as roulette-wheel selection. Thus, selection is done on the basis of relative fitness. It probabilistically culls from the population individuals having relatively low fitness. The third step is crossover (or recombination), which fills the role played by sexual reproduction in nature. One type of simple crossover is implemented by choosing a random point in a selected pair of strings (encoding a pair of solutions) and exchanging the substrings defined by that point, as shown in Fig. 2.
Fig. 2. Single-point crossover of the two parents A and B produces the two children C and D. Each child consists of parts from both parents leading to information exchange.
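Fitness-proportional selection and single-point crossover can also be sketched concretely. This is a minimal illustration under our own assumptions (an 8-bit genotype and std::mt19937 as the random source), not GADGET's implementation:

    #include <random>
    #include <utility>
    #include <vector>

    struct Individual { unsigned genotype; double fitness; };

    // Roulette-wheel selection: each individual is chosen with probability
    // proportional to its (nonnegative) fitness; std::discrete_distribution
    // normalizes the weights for us.
    const Individual& rouletteSelect(const std::vector<Individual>& pop,
                                     std::mt19937& rng) {
        std::vector<double> weights;
        for (const Individual& ind : pop) weights.push_back(ind.fitness);
        std::discrete_distribution<std::size_t> wheel(weights.begin(),
                                                      weights.end());
        return pop[wheel(rng)];
    }

    // Single-point crossover of two 8-bit genotypes as in Fig. 2: the bits
    // above the crossover point are exchanged between the two parents.
    std::pair<unsigned, unsigned> crossover(unsigned a, unsigned b, int point) {
        unsigned high = (0xFFu << point) & 0xFFu;  // exchanged part
        unsigned low = ~high & 0xFFu;              // retained part
        return { (a & low) | (b & high), (b & low) | (a & high) };
    }

For example, crossover(0x5B, 0x60, 5) exchanges the top three bits of 01011011 and 01100000 and yields 123 and 64, matching the offspring computed in the worked example of Section 3.3.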
In addition to evaluation, selection, and recombination, genetic algorithms use mutation to guard against the permanent loss of gene combinations. Mutation simply results in the flipping of bits within a genome, but this flipping of bits only occurs infrequently.

The individuals in the population act as a primitive memory for the GA. Genetic operators manipulate the population, usually leading the GA away from unpromising areas of the search space and toward promising ones, without the GA having to explicitly remember its trail through the search space [32].

It is easiest to understand GAs in terms of function optimization. In such cases, the mapping from genotype (string) to phenotype (point in search space) is usually trivial. For example, in order to optimize the function f(x) = x, individuals can be represented as binary numbers encoded in normal fashion. In this case, fitness values would be assigned by decoding the binary numbers. As crossover and mutation manipulate the strings in the population, thereby exploring the space, selection probabilistically filters out strings with low fitness, exploiting the areas defined by strings with high fitness. Since the search for individuals with higher fitness is not restricted to a localized region of the objective function, this search technique is not subject to the problems associated with local minima, which were described above.

3.2.1 Differential Genetic Search

A second genetic algorithm that we have also used for generating software tests is the differential GA described in [33]. Here, an initial population is constructed as above. Recombination is accomplished by iterating over the inputs in the population. For each such input I, three mates, A, B, and C, are selected at random. A new input I' is created according to the following method, where we let Ai denote the value of the ith parameter in the input A and, likewise, for the other inputs: for each parameter value of the new input I', we let the ith value be Ai with probability p, where p is a parameter to the genetic algorithm. With probability 1 - p, we let it be Ai + a(Bi - Ci), where a is a second parameter of the GA. If I' results in a better objective function value than I, then I' replaces I; otherwise, I is kept.

This procedure can be thought of as an operation on l-dimensional vectors. First, we generate a new vector by adding a weighted difference of B and C to A. Then, we perform k-point crossover between A and the newly generated vector, obtaining our result. This is illustrated in Fig. 3. Here, the random selection has come out in such a way that the first, third, and fourth elements of A, namely, A1, A3, and A4, are copied directly to the result; perhaps p was about 2/3. But the second element of the result has the value A2 + 0.4(B2 - C2) and the fifth element has the value A5 + 0.4(B5 - C5).

Fig. 3. An illustration of how individuals are combined by a differential GA. Three individuals are chosen as parents and each element of the resulting individual is either copied from the first parent or combined from elements of all three parents.
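A sketch of this recombination rule follows, under the reconstruction above (copy the element of the first mate A with probability p, otherwise combine elements of all three mates). The function name, the use of std::mt19937, and the vector-of-doubles input type are our assumptions:

    #include <random>
    #include <vector>

    using Input = std::vector<double>;

    Input differentialRecombine(const Input& I, const Input& A, const Input& B,
                                const Input& C, double p, double alpha,
                                std::mt19937& rng) {
        std::bernoulli_distribution copyFromA(p);
        Input trial(I.size());
        for (std::size_t i = 0; i < I.size(); ++i) {
            // With probability p, copy the element of the first parent
            // directly; otherwise combine elements of all three parents.
            trial[i] = copyFromA(rng) ? A[i] : A[i] + alpha * (B[i] - C[i]);
        }
        return trial;  // the caller keeps trial only if it yields a better
                       // objective-function value than I
    }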
3.3 An Example of Genetic Search in Dynamic Test Data Generation
Our example of test generation by genetic search is based on the following simple program:

    main(int argc, char **argv)
    {
        int a = atoi(argv[1]);
        if (a > 100)
            puts("hello world\n");
    }
The program input is read in by the statement int a = atoi(argv[1]), which takes the first argument from the command line and assigns it to the variable a. (Although the test generator only generates scalar input values, those values are represented as text when the program needs them in that format.) Suppose the test requirement is simply to exercise the puts statement, causing "hello world" to be printed. The objective function will be based on the value of the variable a in the if statement, since this variable determines whether or not the puts statement is reached. The source code is instrumented in a way that causes the value 100 - a to be sent to the test generator when the if statement is executed. This is the value of the objective function F.

When the genetic algorithm is invoked, it begins by generating an initial population of test inputs; each input will be treated as one individual by the genetic algorithm. For this example, we assume four individuals are generated with the values 94, 91, 49, and -112, respectively. The value returned during the four test executions is used to determine the fitness of each of the four inputs. A smaller number indicates that the test input is closer to satisfying the criterion, but, in genetic search, larger numbers have traditionally been associated with greater fitness, and we will adopt that convention here. Therefore, the fitness of each input is obtained by taking the inverse of the heuristic value returned when the target program is executed. For the inputs -112, 49, 91, and 94, the instrumented program returns 222, 51, 9, and 6, respectively, to the execution manager. Therefore, the respective fitnesses of the four inputs are 1/222, 1/51, 1/9, and 1/6.

Once the fitness of each test input has been evaluated, the reproduction phase of the genetic algorithm begins. Two inputs are selected for reproduction according to their fitness. This is accomplished by normalizing the fitness values and using them as the respective probabilities of choosing each input as a parent during reproduction. The probabilities of selecting -112, 49, 91, and 94 are thus 0.015, 0.065, 0.368, and 0.552, respectively. With only four individuals, there is a significant probability of selecting the same individual twice, but many genetic algorithms have explicit mechanisms that prevent this. We will assume the algorithm in our example does so as well.
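As a quick check of the arithmetic, the selection probabilities quoted above can be reproduced from the objective values returned by the instrumented program. This small program is ours, not part of GADGET:

    #include <cstdio>

    int main() {
        const int inputs[] = {-112, 49, 91, 94};
        const double objective[] = {222.0, 51.0, 9.0, 6.0};  // values from the text
        double fitness[4];
        double total = 0.0;
        for (int i = 0; i < 4; ++i) {
            fitness[i] = 1.0 / objective[i];  // fitness is the inverse objective
            total += fitness[i];
        }
        for (int i = 0; i < 4; ++i)
            std::printf("input %4d: fitness 1/%.0f, selection probability %.3f\n",
                        inputs[i], objective[i], fitness[i] / total);
        // prints probabilities 0.015, 0.065, 0.368, and 0.552, as in the text
    }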
Suppose the two parents selected are 91 and 94, the two inputs with the largest probabilities. Reproduction is accomplished by representing each number in binary form, 01011011 and 01100000, respectively. A crossover point is selected at random (suppose it is 3), and crossover is performed by exchanging the first three bits of the two parents. The resulting offspring are 01000000 and 01111011, or 64 and 123. Since reproduction continues until the number of offspring is the same as the original size of the population, two more inputs are selected as parents. Suppose they are -112 and 91 and the crossover point is 5. If the inputs are converted to bit-strings using 8-bit, two's complement representation, then they are 10010000 and 01011011, respectively. Crossover produces the offspring 10010011 and 01011000, or -109 and 88.

In summary, the second generation consists of the individuals 64, 123, -109, and 88. When the program is executed on these inputs, it is found that one of them satisfies the test requirement, so the test generation process is complete (for that requirement). If none of the tests had met the requirement, the reproduction phase would have begun again using the four newly generated inputs as parents. The cycle would have been repeated until the test requirement was satisfied or until the GA halted due to insufficient progress. (Note that we would have been in trouble if all four original test inputs had had zeros in the first two binary digits. Then, no combination of crossovers could have created a satisfactory test input and we would have had to wait for an unlikely mutation. This is part of the importance of diversity, mentioned in Section 5.1 when we discuss the adjustment of the GA's parameters.)

3.4 Other Optimization Algorithms

Once the underlying framework for dynamic test data generation has been implemented, it is straightforward to add other optimization techniques. One such technique is simple gradient descent. We implemented two gradient descent algorithms: Polak-Ribiere conjugate gradient descent [28] and a reference algorithm which is slow but makes no assumptions about the shape of the objective function.

Conjugate gradient descent is best suited for optimization problems with continuous parameters, and integer parameters lead to a myriad of small plateaus where the algorithm can stall. That is, the algorithm may make such small adjustments to an integer-valued parameter that there is no effect on the program's behavior. (Continuous-valued parameters from the optimization algorithm were converted to integers where necessary by rounding.) To solve this problem, we interpolate the objective function, but our technique seeks to execute the program as infrequently as possible and is somewhat crude in other respects. For the input values x1, x2, ..., xn found by the gradient descent algorithm, we calculate the two values F(y1, y2, ..., yn) and F(z1, z2, ..., zn), where

    yi = xi,          if xi is a continuous parameter;
    yi = rnd(xi),     if xi is an integer parameter,

and

    zi = xi,             if xi is a continuous parameter;
    zi = rnd(xi) + 1,    if xi is an integer parameter and xi > rnd(xi);
    zi = rnd(xi) - 1,    if xi is an integer parameter and xi < rnd(xi),

where rnd denotes rounding to the nearest integer. (Our implementation rounds 0.5 down.) The interpolated value is then ...
... have a higher objective-function value than the seed. In this way, the algorithm implements a technique, also used by conjugate gradient descent, of bracketing a minimum in the objective function and then conducting an increasingly refined search for the minimum within the bracket.
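The construction of the rounded points y and z defined above transcribes directly into code. In this sketch, the Param type is our assumption, and the case where an integer parameter already sits exactly on an integer (which the definition leaves open) defaults to rnd(x) - 1:

    #include <cmath>
    #include <vector>

    struct Param { double value; bool isInteger; };

    // Round to the nearest integer, with 0.5 rounded down as in the text.
    double rnd(double x) { return std::ceil(x - 0.5); }

    // y: each integer parameter replaced by its nearest integer.
    // z: each integer parameter pushed to the next integer on the far side of x.
    void neighborPoints(const std::vector<Param>& x,
                        std::vector<double>& y, std::vector<double>& z) {
        y.clear();
        z.clear();
        for (const Param& p : x) {
            if (!p.isInteger) {
                y.push_back(p.value);
                z.push_back(p.value);
                continue;
            }
            double r = rnd(p.value);
            y.push_back(r);
            z.push_back(p.value > r ? r + 1 : r - 1);  // r - 1 also covers x == rnd(x)
        }
    }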
4 THE GENETIC ALGORITHM APPLIED TO TEST DATA GENERATION

In this section, we describe our approach to test data generation by genetic search. Recall from Section 2.2.3 that our approach is based on the conversion of the test-data generation problem to a function minimization problem, which allows a genetic algorithm to be applied. In the previous section, we described the genetic algorithm itself, and we now describe its application to the problem of automatic test data generation. We begin with an overview of the algorithm, and then we discuss a number of issues in greater detail, including the implementation of the fitness function and the technique for ensuring that the GA can reach the location where the fitness function is evaluated.
4.1 Overview of the Test Data Generation Algorithm
The goal for the GAs is to find a set of tests that satisfy condition-decision coverage (see Section 2.1). This leads to two test requirements for every condition in the code, namely, that the condition be true for at least one input and false for at least one input. Condition-decision coverage also requires that each branch of each decision be taken at least once, but this requirement is satisfied by any test set that meets the requirements above.

Before starting the GA, we execute the program on a seed input. Such a seed input typically satisfies many of the test requirements (see Section 5.2 for a discussion of this). The initial program execution is used to initialize the coverage table. After this, the coverage table is used to select a series of test requirements in turn. For each test requirement, the GA is initialized and attempts to satisfy the given requirement. Due to the limitation on the number of iterations (the GA must make some progress every n iterations for some n), the GA is guaranteed to halt, either because a solution has been found or because the GA has given up.

Whenever the GA generates an input that satisfies a new test requirement, whether or not that test requirement is the one the GA is currently working on, the new test input is recorded for future use (see below), and the coverage table is updated. We also record how many times the target program was executed before the new input was found. The test generation system continues to iterate over the test requirements until no further progress can be made. This happens when one attempt has been made to satisfy every reachable requirement. (A requirement is considered unreachable if no test input executes the code location where the requirement is evaluated. In other words, the test requirement refers to a condition that the GA cannot reach. The coverage table can be used to determine when there are no more reachable requirements; see Section 4.2.)

The performance of each test generator is measured as the percentage of test requirements it has satisfied. To determine this percentage, the inputs that are found by the test generator are evaluated by a commercial coverage analysis tool (DeepCover), together with the original seed input. This lets us plot the percentage of requirements satisfied as a function of the number of times the target program has been executed.

The two details that must be addressed are the same as those described in Section 2.2.3: we must find a way to reach the code location where we want our test requirement to be satisfied, and we must convert that requirement into a function that can be minimized.
4.2 Reaching the Target Condition

Recall that, in dynamic test generation, function minimization cannot be performed unless the flow of control reaches a certain point in the code. For example, if we are seeking an input that exercises the TRUE branch of a condition in line 954 of a program, we need inputs that reach line 954 before we can begin to do function minimization. Our approach is slightly different from that of [2] and [5], which concentrate on finding a specific path to the desired location. Our goal is (among other things) to cover all branches in a program. This means we can simply delay our attempts to satisfy the test requirements associated with a certain condition until we have found tests that reach that condition.

This leads to a test generation approach similar to the one employed by [21]. A table is generated to keep track of the conditional branches already covered by existing test cases, as sketched below. If neither branch of a condition has been taken, then that decision has not been reached, so we are not ready to apply function minimization to that condition. If both branches have been taken, then coverage is satisfied for that condition and we need not examine it further. However, if only one branch of a condition has been exercised, then the condition has been reached, and it is appropriate to apply function minimization in search of an input that will exercise the other branch.
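A minimal sketch of such a table follows, with our own types and method names rather than GADGET's actual data structures:

    #include <map>

    struct ConditionStatus { bool seenTrue = false; bool seenFalse = false; };

    class CoverageTable {
        std::map<int, ConditionStatus> table;  // condition id -> outcomes seen
    public:
        void record(int conditionId, bool outcome) {
            ConditionStatus& s = table[conditionId];
            (outcome ? s.seenTrue : s.seenFalse) = true;
        }
        // Reached but not fully covered: exactly one branch has been taken,
        // so this condition is a candidate for function minimization.
        bool isCandidate(int conditionId) const {
            auto it = table.find(conditionId);
            if (it == table.end()) return false;  // never reached
            return it->second.seenTrue != it->second.seenFalse;
        }
        bool isCovered(int conditionId) const {
            auto it = table.find(conditionId);
            return it != table.end() && it->second.seenTrue && it->second.seenFalse;
        }
    };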
4.3 Calculation of Fitness Functions

Recall from Section 2.2.3 that dynamic test generation involves reducing the test generation problem to one of minimizing a fitness function F. The first step is to define F. Like other dynamic test generation techniques, ours begins by instrumenting the code under test. The purpose of this instrumentation is to allow us to calculate the fitness function by executing the instrumented code.

At each condition in the code, our system adds instrumentation to report F(x) when execution reaches that condition. Table 2 shows how F(x) is calculated for some typical relational operators when we are seeking to take the TRUE branch of a condition (the functions for the FALSE branch are analogous). If the program's execution fails to reach the desired location (it terminates or times out without having executed the statement), then the fitness function takes its worst possible value.

Our experiments seek to generate condition-decision adequate test sets, so conjunctions and disjunctions can be handled by exploiting C/C++ short-circuit evaluation. For example, the second clause of the condition
    if ((c > d) && (c < f)) ...
is not reached unless the first clause evaluates to TRUE, so the requirement that states the first and second clauses must both be TRUE is replaced by a requirement stating that the second clause must be reached and must evaluate to TRUE. If both clauses are reached and both clauses take on both the value TRUE and the value FALSE (as required by condition-decision coverage), then both branches of the conditional branch will necessarily have been taken.

TABLE 2
Computation of the Fitness Function

    decision type      example           fitness function
    inequality         if (c >= d) ...   F(x) = d - c, if d > c; 0, otherwise
    equality           if (c == d) ...   F(x) = |d - c|
    true/false value   if (c) ...        F(x) = 1000, if c = FALSE; 0, otherwise
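The rows of Table 2 transcribe directly into code. This sketch covers the TRUE branch only, with the functions for the FALSE branch obtained analogously; the function names are ours:

    #include <cmath>

    // if (c >= d): zero once the TRUE branch can be taken, positive otherwise.
    double fitnessGreaterEqual(double c, double d) {
        return d > c ? d - c : 0.0;
    }

    // if (c == d): the distance between the two sides.
    double fitnessEqual(double c, double d) {
        return std::fabs(d - c);
    }

    // if (c): a Boolean offers no gradient, so a fixed penalty is used
    // (the constant 1000 is the one appearing in Table 2).
    double fitnessBoolean(bool c) {
        return c ? 0.0 : 1000.0;
    }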
4.4 Execution Control

In the GADGET system, an execution controller is in charge of running the instrumented code, coordinating GA searches, and collecting coverage results and new test cases. It begins by executing all preliminary test cases. (These preliminary cases can be supplied by the user or generated randomly by the execution controller.) After running all initial test cases, the execution controller uses the coverage table to find a condition that can be reached, but has not been covered yet (that is, no input has made the condition TRUE or else no input has made it FALSE). The genetic algorithm is invoked to make this condition take on the value that was not already observed. The GA is seeded with test cases that can successfully reach the condition (though they did not give the condition the desired value, or else the condition would already have been covered). If additional inputs are needed to get the required population size, they are generated randomly.

When the GA terminates, either because it found a successful test case or because it stopped making progress, the execution controller uses the coverage table to find a new condition that has not been covered completely. The GA is called again with the task of finding an input that covers this condition. This process continues until all conditions that have had only one value (either TRUE or FALSE) have been subjected to GA search.

The execution controller keeps track of all GA-generated test inputs that cover new program code, regardless of whether or not they satisfy the test requirement that the GA is currently working on. (In other words, GADGET takes advantage of all serendipitous coverages.) These test cases are stored for later use.
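The controller loop just described can be sketched as follows. The helper functions are assumed to be implemented elsewhere (they stand in for the coverage table, the instrumented execution, and the GA itself), and all of the names are ours:

    #include <optional>
    #include <vector>

    struct TestInput { std::vector<double> values; };

    // Assumed helpers: run the instrumented program and update the coverage
    // table; query the table; look up saved inputs; run one GA search.
    void runOnTarget(const TestInput& t);
    std::optional<int> uncoveredReachableCondition();
    bool missingOutcome(int condition);  // the TRUE/FALSE value not yet seen
    std::vector<TestInput> seedInputsReaching(int condition);
    void runGA(int condition, bool target, std::vector<TestInput> seeds);

    void executionController(const std::vector<TestInput>& preliminary) {
        for (const TestInput& t : preliminary)
            runOnTarget(t);  // preliminary cases initialize the coverage table
        // Repeatedly pick a condition that has been reached with only one
        // outcome and ask the GA for the missing outcome, seeding it with
        // inputs that already reach the condition.
        while (std::optional<int> cond = uncoveredReachableCondition())
            runGA(*cond, missingOutcome(*cond), seedInputsReaching(*cond));
    }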
4.5 Computational Costs

Since dynamic test generation is a heuristic process, we cannot give a universal characterization of its computational cost. To be specific, the optimization algorithm makes iterative improvements to a test input while trying to make that input satisfy the chosen requirements, and we cannot predict exactly how many iterations there will be.

However, there are a number of factors that influence the computational cost of each iteration. The most significant cost is that of executing the target program; recall that this program has to be executed in order to evaluate the objective function. Each new generation created by the genetic algorithm contains a number of new test inputs, and the objective function has to be evaluated for each one. Thus, the cost of dynamic test generation depends intimately on the cost of executing the target program. In addition, larger programs typically have more conditional statements, so, if we try to obtain complete coverage of a program, as we did in the experiments described in Section 5, the number of different optimizations the GA has to perform depends on the program size as well.

The second factor that determines the efficiency of test generation is the optimization algorithm itself. For example, the genetic algorithm's ability to escape local minima comes at the cost of additional resource expenditures, compared to gradient descent. In other words, the genetic algorithm makes more iterations than gradient descent would make, and it executes the target program more often.

The experiments we report on in Sections 5.3 and 5.4 required anywhere from 30 minutes to several hours on a Sun Sparc-10 workstation, though the simple programs in Section 5.2 were less expensive. Based on results reported in [4] and elsewhere, this genetic search may be considerably more expensive than gradient descent would be (but see below). Ultimately, such expense would be justified by the difficulty of generating test data by hand, which makes any automated technique desirable, especially for programs with a complex structure.

It is interesting to note that computational overhead seems to play a considerable role in the time needed to perform dynamic test generation. When running the target program on a given test input, the test generator has to invoke the program (using the Unix vfork and exec system calls in the case of GADGET) and wait for the results (GADGET uses wait). For all of our programs, even the largest, computation time was significantly affected by the expense of the Unix vfork/exec/wait calls needed to execute the target program. We empirically estimated that our operating system is capable of about 3,200 vfork/exec/wait operations per minute. In contrast, [4] reports being able to perform 200,000 to 700,000 tests in five minutes (using random test data generation). Further computational overhead comes from the fact that GADGET is written in object-oriented C++ and was not optimized for these experiments. This (together with the timing results of [4]) suggests that it is possible to obtain much more efficient operation.
Apparently, platform-dependent computational overhead greatly affects the performance of test-generation algorithms. Therefore, it seems that the number of executions of the target program is a better measure of a test generator's resource requirements than wall-clock or system time. Reference [4] does not report on the number of executions used by their nonrandom techniques and, although it appears that their gradient descent technique was much cheaper than our genetic search, their reduced overhead costs call this conclusion into question, at least if computational expense is measured in terms of the number of executions of the target program.

In the next section, we do, indeed, measure the expense of test generation by the number of target-program executions. For reasons discussed above, this seems to be the most logical measure of efficiency for a dynamic test generator: It abstracts away the cost of actually running the target program itself, as well as the overhead of calling the target program and communicating with it.

4.6 Tuning the GA

When a test is performed using one of the two genetic algorithms described in Section 3.2, the GA is first tuned by adjusting the number of individuals in the population and the number of iterations that must elapse with no progress before the GA gives up. In the standard GA, mutation is controlled by adjusting the probability that any single bit will be flipped during reproduction. The goal of this fine-tuning was to maximize the percentage of conditions covered while keeping the execution time low. A second goal was to ensure that the differential and standard GAs executed the target program about the same number of times in order to get a reasonable comparison between the two techniques.

Such tuning adjustments control the breadth and the thoroughness of the GA's search. If there are more individuals, then there are more distinct inputs that can be created by the crossover operation; in a sense, more genetic diversity is available. On the other hand, if we make the GA continue for more iterations before giving up, then it is less likely to give up simply because progress is slow. In some cases, the GA is visibly on the path to a successful input, but it gives up because an unlucky series of crossovers fails to improve the fitness of the population. The likelihood of this is reduced if more iterations are permitted.

Unfortunately, both of these improvements have a cost. Allowing more iterations means the GA will waste more time trying to cover conditions that it cannot cover. Adding more individuals means that the target program is executed more often during each iteration because a target-program execution is required every time we evaluate the fitness of an individual.
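Gathered into one place, the knobs discussed in this subsection look roughly like the following; the structure and field names are illustrative, not GADGET's actual configuration format:

    struct GATuning {
        int populationSize;             // breadth: more distinct crossover results
        int maxStagnantGenerations;     // patience before the GA gives up
        double bitMutationProbability;  // chance of flipping any single bit
    };

For example, the standard GA runs on the simple programs of Section 5.2 used 100 individuals, 15 stagnant generations, and a mutation probability of 0.001.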
5 EXPERIMENTAL RESULTS

In this section, we report on four sets of test data generation experiments. The first set of experiments involves programs that calculate simple numeric functions. The second experiment investigates how the GAs and random test generation perform on increasingly complex synthetic programs. Finally, we present results obtained by analyzing a real-world autopilot control program called b737.

5.1 Design of the Experiments

In setting up our experiments, we began by selecting a program on which to try out the test generation system; we refer to such a program as a target program. By means of source-code instrumentation, the target program is connected to an execution manager and configured to run in tandem with it; the execution manager monitors the program's behavior. The target program is instrumented by augmenting each condition with additional code. This code reports the objective-function value for each condition to the execution manager. (The objective functions are calculated as described in Section 4.3.) The program that performs this instrumentation also collects the information needed to construct the coverage table described in Section 4.2.

Several parameters of the GA were the same throughout all of our experiments. For the differential GA, the parameter p (see the end of Section 3.2) was always 0.6. The parameter a was always 0.5. The probability of mutation, number of individuals, and number of iterations permitted without progress were different for different test programs. We report the results for each of the programs we tested and give the actual values we used below. (See Section 4.6 for a discussion of how these parameters affect the performance of the GA.)

An additional parameter of the test generator is the random number seed used to create pseudorandom numbers by means of a linear congruential generator (cf. [34]). The following describes the utilization of pseudorandom numbers and, hence, the impact of the choice of a random number seed in each of the test generators; a sketch of such a generator follows the list.

• Random test generation. All inputs are generated pseudorandomly. In all of our experiments, the parameters were numeric, and inputs were generated by selecting a value for each parameter uniformly at random from the range specified in a program-specific configuration file.

• Gradient descent. The initial inputs are randomly generated using the same technique as the random test data generator. Since our implementation supports it readily, we randomly select two starting points and perform the two descents in tandem with one another. The step size has a pseudorandom element as well, as outlined earlier at the end of Section 3.4.

• Standard genetic search. The seed inputs are created pseudorandomly, as are new population members when these are needed. Each input is represented as a string of bits, and inputs are generated by setting each bit to 1 with probability 0.5. Crossover points are selected uniformly at random from the potential crossover points in the bit-string, and the decision of whether or not to mutate a bit is based on a randomly generated number as well.

• Differential genetic search. Pseudorandom numbers are used to create the seed input and to create new population members as above and, in addition, the random number generator drives the random choices used in reproduction (see Section 3.2).
We executed the test generation system described in Section 4 using each of the genetic algorithms described in Section 3.2. A single run of the test generation system involves the following steps:

1. Selection of a target program for which tests will be generated.
2. Specification of a test adequacy criterion. In these experiments, the test adequacy criterion is always condition-decision coverage.
3. Enumeration of the test requirements that the adequacy criterion induces on the target program.
4. An attempt by the test generator to satisfy all of the test requirements for the program.

We performed several complete test-generation runs with each test generation system and for each of the several test programs. (That is, each test generation system made several attempts to satisfy the test requirements associated with each program.) The exact number of runs varies between test programs, and it is given below when we describe our results for each of the programs. During each run of the test generation system, the system saves the original seed input, and it also saves any test input that satisfied a requirement not already satisfied by an earlier input.

After each run of the test generator, information collected during the run is used to assess the test generator's performance. Performance is measured in terms of the percentage of test requirements that were satisfied. To do this, the seed input is fed to a commercial coverage analysis tool (DeepCover), along with the inputs that were saved by the test generator because they satisfied new coverage requirements. This tells us how many requirements were satisfied by these inputs.

Each time the test generator saves a test input, it records how many times the target program was executed before that input was found. This lets us plot the test generator's performance as a function of the number of target-program executions, which is what we do in most of our experiments.

In addition to running the two GA-based test generators, we also apply the random test generator to each target program. We run the random test generator as many times as the two other algorithms. For each run of the random test generator, we execute the program on randomly generated input values and record the inputs that satisfy new requirements as above. The number of such executions is equal to the largest number of times the program was executed by any of the GAs. We plot the performance of the random test generator in the same way as the performance of the GAs.

5.1.1 Input Representation in the GAs

In the standard GA, test inputs are represented as a contiguous string of bits, and crossover is accomplished by selecting a random position within the bit-string. (In our experiments, the binary representation was obtained by using the machine encoding for floating-point parameters and Gray-coding others.) Representing the input as a series of bits does not work for the differential GA (see Section 3.2) because its method of reproduction is not well-suited to variables with only two values. Therefore, the differential GA sees each input as a series of variables corresponding to the input parameters of the target program. The description of the input parameters (ranges, types, etc.) is provided to the test generation system by the user, and that description is used both by the standard GA and the differential GA when test inputs are formatted for use by the target program.

5.2 Simple Programs

We began our experiments on a set of simple functions much like those reported in the literature [35], [21], [4]. The programs analyzed are as follows:

• binary search,
• bubble sort,
• number of days between two dates,
• euclidean greatest common denominator,
• insertion sort,
• computing the median,
• quadratic formula,
• Warshall's algorithm, and
• triangle classification.

These programs are roughly of the same complexity, averaging 30 lines of code and all having relatively simple decisions.

When we tested GADGET on these programs, we used 30 individuals for the differential GA and allowed 10 generations to elapse with no progress before allowing the GA to give up. For the standard GA, we used 100 individuals, allowed 15 generations to elapse before stopping the GA, and made the probability of mutation 0.001.

For these programs, random test data generation never outperforms genetic search, though sometimes both approaches have the same effectiveness. These results resemble those reported in [21] and [4]; in those papers, random test generation also performed nearly as well as more sophisticated techniques on simple programs. Random test case generation has the upper hand in these experiments because it involves significantly less computation. However, in every case, one of the GAs performs the best overall. Table 3 shows the results. The numbers reported in the table represent the highest percentage of test requirements satisfied by any single run of the test generator among a series of five such runs.

It is useful to analyze a sample program to understand what the GAs are doing. The code for triangle classification is shown below. (The comments on the right margin will be used later to show which inputs satisfied which conditions.) Note that many decisions only contain a single condition.
TABLE 3
A Comparison of Four Test Data Generation Approaches on Simple Programs

    Program                                  random
    Binary search                            80
    Bubble sort 1                            100
    Bubble sort 2                            100
    Number of days between two dates         87.0
    Euclidean greatest common denominator    100
    Insertion sort                           100
    Computing the median                     100
    Quadratic formula                        75
    Warshall's algorithm                     91.7
    Triangle classification                  48.6
The standard GA exhibits the best performance overall, but not by a significant amount. These data represent the highest percentage obtained by each method during a series of five attempts to obtain complete coverage of the program.
    #include <stdio.h>

    int triang (int i, int j, int k) {
        // returns one of the following:
        // 1 if triangle is scalene
        // 2 if triangle is isosceles
        // 3 if triangle is equilateral
        // 4 if not a triangle
        if ((i <= 0) || (j <= 0) || (k <= 0)) return 4;       // acd
        int tri = 0;
        if (i == j) tri += 1;                                 // g
        if (i == k) tri += 2;                                 // h
        if (j == k) tri += 3;                                 // i
        if (tri == 0) {                                       // bef
            if ((i+j <= k) || (j+k <= i) || (i+k <= j))       // be
                tri = 4;
            else
                tri = 1;                                      // f
            return tri;
        }
        if (tri > 3) tri = 3;
        else if ((tri == 1) && (i+j > k)) tri = 2;
        else if ((tri == 2) && (i+k > j)) tri = 2;            // h
        else if ((tri == 3) && (j+k > i)) tri = 2;
        else tri = 4;
        return tri;
    }

    void main () {
        printf("enter 3 integers for sides of triangles\n");
        int a, b, c;
        scanf("%d %d %d", &a, &b, &c);
        int t = triang(a, b, c);
        if (t == 1) printf("triangle is scalene\n");          // f
        else if (t == 2) printf("triangle is isosceles\n");   // h
        else if (t == 3) printf("triangle is equilateral\n");
        else if (t == 4) printf("this is not a triangle\n");  // abcdegi
    }

Fig. 4 shows how the four different systems (standard GA, differential GA, gradient descent, and random test generation) perform on the triangle program shown above.

5.2.1 Coverage Plots for Test Generation Problems

In Fig. 4, the number of test requirements satisfied by each test generation system is plotted against the number of executions of the target program. Here, as well as in our later results, the standard GA and differential GA did not
execute the program the same number of times. (In general, the standard GA tended to satisfy individual test requirements more quickly.) The random test generator was run last, and the number of program executions allotted to it was the maximum number of program executions needed by either of the other two systems. This creates favorable conditions for the random test generator and helps us determine when the use of a more complex test generation system is actually justified.

The plot in Fig. 4 shows features that appear throughout our test generation experiments. First, there is a large, immediate jump in the percentage of test requirements
Fig. 4. A coverage plot comparing the performance of the two GAs, gradient descent, and random test generation on the triangle program. The graph shows how performance (in terms of the percentage of the 20 test requirements that have been satisfied) improves as a function of the number of times the program is executed. Random test generation hits its peak early, but fails to improve after that. The differential GA has a better performance and executes for a longer amount of time, but the standard GA has the best performance overall, covering about 93 percent of the code on average in about 8,000 target-program executions. Gradient descent performs nearly as well as the standard GA. The curve for each system represents the pointwise mean performance of that system taken over five attempts by that system to achieve complete condition-decision coverage of the program.
TABLE 4
A Table of Sample Input Cases Generated by the Standard GA for triangle

    Key    Integer 1      Execution of target program
    a      1680498885     2
    b      1293470477     3
    c      -120192928     4
    d      841354299      6
    e      1056804119     20
    f      719320455      117
    g      743820356      5311
    h      999699718      10751
    i      799340978      16800
These data can be mapped to conditional expressions in the code shown above using the Key field.
satisfied, so that the interesting part of the vertical axis does not start at zero, but somewhere closer to 50 percent. Second, the percentage of requirements satisfied by the GAs seems to make discontinuous jumps.

The initial jump in coverage is there because the first test input satisfies many conditions. When the program executes the first time, it has to take either the TRUE branch or the FALSE branch on any condition it encounters and, each time a new condition is found to be true or false, the percentage of test requirements satisfied goes up. For example, if a program has no nested decisions and if each decision has exactly one condition, then the first program execution is forced to take either the TRUE branch or the FALSE branch of every condition in the program. For such a program, we would always achieve 50 percent coverage using only the first test.

Many of the discontinuities that appear in Fig. 4 are there for similar reasons. When an input causes the program to take a branch that had not been taken before, it may lead to the execution of other conditional statements that were not executed previously. Once again, each condition must be either TRUE or FALSE, and this leads to an instantaneous increase in the number of conditions satisfied. (Apparent discontinuities can also occur when there are few conditions in the program because then the granularity of the vertical axis is low. However, this is not the cause of the salient jumps in the coverage plots we present here. Readers may visually judge the granularity of the plots by looking at small features, such as the shallow increments that occur in the plot for the standard GA at 43 percent and 78 percent on the vertical axis.)

A further cause of discontinuities is serendipitous coverage. The GAs often satisfied one test requirement by coincidence when they were trying to satisfy a different requirement. In fact, the shorter execution time of the standard GA results largely from this phenomenon; less exertion was required of the GA because so many requirements were met serendipitously. We find this to be a recurring phenomenon in our experiments and have more to say about it in Section 5.5.4.

5.2.2 The GAs' Choices of Test Inputs

A sample of results obtained by the GA test data generation algorithm is shown in Table 4. These data can be mapped to the source code shown above by using the letters shown in the comments.
The tests in Table 4 are probably not like those that a human tester would choose, much less those that would occur in a hypothetical world where this program was used by the general public. Of course, the ability to create bizarre tests can sometimes be an advantage. Coverage criteria are often used to exercise obscure and infrequently utilized features of a program, which testers might otherwise overlook. However, it might be desirable to concentrate on inputs that are more realistic. This leads to an interesting digression on dynamic test generation techniques: Unlike the static methods described in Section 2.2.2, dynamic test generation allows certain inputs to be preferred over others by means of biases in the objective function. For example, bizarre or uncommon input combinations could be penalized by raising the value of the objective function for those inputs. One could even construct an operational profile (cf. [36]), allowing each input to be weighted according to its probability of occurring in the field. Similar biases are used in [22] to create a preference for test inputs close to boundary values.

5.2.3 Performance of Gradient Descent

In view of the above discussion, it is perhaps unsurprising that the performance of gradient descent was generally somewhere between that of the genetic algorithms and that of random test generation. In many cases, gradient descent failed because of flat spots in the objective function where there is no information to guide the algorithm's search. This was the case with the binary search program, for example. However, gradient descent appears to encounter a local minimum in the triangle program. This can be observed in the behavior of the reference algorithm for gradient descent, which empirically estimates the gradient at each step of the optimization process. In the triangle program, the reference algorithm reaches a point where all adjacent points (as defined by the neighborhood function) lead to a worsening of the objective-function value. This leads to a reduction of the mean step size (as described in Section 3.4), but the same phenomenon is encountered again, and this process is repeated until the algorithm gives up. The optimization process is empirical rather than analytic, so this does not prove the existence of a local minimum, but it provides strong evidence. (Of course, this might be prevented by a clever neighborhood function, but finding a neighborhood function that is guaranteed to eliminate local minima is a nontrivial matter.)
The test criterion whose objective function contained the local minimum was satisfied by the standard GA, albeit serendipitously.

5.2.4 Performance of the Random Test Generator

Our results for random test case generation resemble those reported elsewhere for cases where the code being analyzed was relatively simple. For the simple programs in Table 3, the different test generation techniques have roughly the same performance most of the time. In [21], random test data generation was reported to cover 93.4 percent of the conditions on average. Although its worst performance, on a program containing 11 decision points, was 45.5 percent, it outperformed most of the other test generation schemes that were tried. Only symbolic execution had better performance for these programs. Reference [4] reports on 11 programs averaging just over 100 lines of code. Overall, random test data generation was fairly successful, achieving 100 percent statement coverage for five programs and averaging 76 percent coverage on the other six.

It is also interesting to compare our results with those obtained by [16] for three slightly larger programs. Again, simple branch coverage was the goal. Random test generation achieved 67 percent, 68 percent, and 79 percent coverage, respectively, on the three programs analyzed. Symbolic test generation achieved 63 percent, 60 percent, and 90 percent coverage, while dynamic test generation achieved 100 percent, 99 percent, and 100 percent coverage.

These results suggest a common trend: Random test generation has at least an adequate performance on such simple programs but, for larger programs or more demanding coverage criteria, its performance deteriorates.
5.3 The Role of Complexity

In our second set of experiments, we created synthetic programs with conditions and decisions whose characteristics we controlled. The two characteristics we were interested in controlling were: 1) how deeply conditional clauses were nested (we call this the nesting complexity) and 2) the number of Boolean conditions in each decision (which we call the condition complexity). For example, a program with no nested conditional clauses (nesting complexity = 0) would look like the beginning part of the triangle program, in that if statements are not nested inside other if statements. The nature of the conditional expressions in each conditional clause is controlled by the second parameter (the condition complexity). The decision ((i <= 0) || (j <= 0) || (k <= 0)) from the triangle program ranks as a 3 on this scale because it contains three conditions. In this section, we will use these two parameters to characterize the complexity of our synthetic programs. We will use the expression compl(x, y) as shorthand for "nesting complexity x and condition complexity y." In our experiments, programs were generated with all complexities compl(nest, cond), nest ∈ {0, 3, 5}, cond ∈ {1, 2, 3}. In this section, we present the results for compl(0,1), compl(3,2), and compl(5,3), which illustrate the widening gap between the performance levels of the three test generators as program complexity grows. The results for the other six programs are given in the appendix.

Note that, in some cases, changing the nesting complexity has the same effect as changing the condition complexity: changing

    if (cond1 && cond2) ...

to

    if (cond1)
        if (cond2) ...

does not change the semantics of a program in C or C++, but it does change the nesting and condition complexities. However, in our synthetic programs, there is always additional code between two nested conditions, as in the illustrative fragment below. Thus, a high nesting complexity implies a more complicated relationship between the input parameters and the variables that appear in deeply nested conditions. On the other hand, a high condition complexity implies that many conditions are evaluated in the same program location, meaning that the test generator has to find a single set of variable values that satisfies many simple conditions at the same time.
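To make the two parameters concrete, the following fragment (our illustration, not one of the actual synthetic programs) has nesting complexity 2 and condition complexity 2, with intervening computation between the nested decisions:

    int synthetic(int a, int b) {
        if ((a > 10) && (b < 5)) {       // depth 1, two conditions
            int c = a * b - 7;           // additional code between conditions
            int d = c % 13;
            if ((c > 20) && (d != 3)) {  // depth 2, two conditions
                return 1;
            }
        }
        return 0;
    }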
In these tests, the differential GA had a population size of 30 and abandoned a search if no progress was made in 10 generations. The standard GA also gave up after 10 generations made no progress, but the population size was adjusted so that the standard GA executed the target program about the same number of times as the differential GA. This resulted in a population size of 270 for compl(0,1), 320 for compl(3,2), and 340 for compl(5,3). The large population sizes make up for the standard GA's tendency to give up early; this is discussed in Section 5.3.1. The mutation probability for the standard GA was 0.001.

Figs. 5, 6, and 7 show the mean performance over six runs of each test generation technique. (Note that the vertical axis, showing the percentage of test requirements satisfied, has different ranges in different plots. We have focused on the interesting portion of each plot.)

For the simplest program, random test generation and the GAs quickly achieve high coverage. All three methods make most or all progress during the very early stages of test generation. In these early stages, the GAs have essentially random behavior because evolution has not begun yet. The standard GA sometimes satisfied additional requirements closer to the end of the process. The results are shown in Fig. 5, whose horizontal scale is logarithmic to show both of the features just mentioned.
Fig. 5. The random test generator and GAs perform relatively equally on the least complex program. compl(0,1); 100 percent coverage represents coverage of 60 conditions.
Fig. 6 shows results for a program of intermediate complexity. Random test generation does not do as well as before. When we examined the details of each execution, we found that the standard GA performed well because it often managed to satisfy new criteria serendipitously; that is, it often found an input satisfying one criterion while it was searching for an input that satisfied another. It is likely that the differential GA performs a more focused search than the standard GA, and our examination of the detailed results showed that it failed to find as many inputs by coincidence.

Finally, Fig. 7 shows the results for a program with compl(5,3). Here, all test generation methods are less effective than before, and there is more to distinguish among the different techniques. In most cases, the standard GA performed better than the differential GA. The results for conjugate gradient descent and the reference algorithm are shown in Table 5. We do not show the results for gradient descent in the coverage plots because the disparity in the horizontal scale makes them
Fig. 7. The most complex program was hard for all generators, but the GAs outperform random test generation by even more. The standard GA ultimately outperforms the other two methods. compl(5,3); 45 conditions were to be covered.
hard to see; conjugate gradient descent is considerably faster than the other techniques, while the reference algorithm is somewhat slow (in any case, we expect faster performance by conjugate gradient descent a priori, so a comparison of performance over time is not as informative as it is with the two GAs). However, the table shows the total condition-decision coverage obtained by gradient descent for each of the three programs, compl(0,1), compl(3,2), and compl(5,3). For compl(0,1), the gradient descent algorithms perform comparably to the other techniques. For compl(3,2), the performance of gradient descent is somewhere between that of the standard GA and that of the differential GA. For compl(5,3), both GAs outperformed gradient descent.

Like the other optimization algorithms, gradient descent encountered problems because of flat regions in the objective function. This was the most frequent cause of failures to satisfy a test criterion. However, the reference algorithm apparently encountered one local minimum in the compl(0,1) program, one in the compl(3,2) program, and two in the compl(5,3) program. (We concluded this on the basis of the same type of empirical evidence as in Section 5.2.) For the programs with compl(5,3) or compl(3,2), the GAs also failed to satisfy the test criteria for which there were local minima in the objective function.

TABLE 5
Condition-Decision Coverage Achieved by Conjugate Gradient Descent and the Reference Gradient Descent Algorithm for compl(0,1), compl(3,2), and compl(5,3)

            compl(0,1)   compl(3,2)   compl(5,3)
    CGD        95.27        70.65        29.58
    ref.       94.03        75.0         39.3

The curves are not shown since their scale is different from that of the other curves, but conjugate gradient descent was faster than the other algorithms (as expected). The two algorithms have comparable performance. Gradient descent generally performed somewhat below genetic search and comparably to differential genetic search.
Fig. 6. As complexity increases, the GAs begin to do better and the standard GA outperforms the differential GA. compl(3,2); there are a total of 45 conditions to cover.
The test criterion with the local minimum in the compl(0,1) program was satisfied by the GAs some of the time, but only serendipitously. This suggests that the global optimization capability of the GAs is not what makes them perform better than gradient descent. The notion that other factors are more important than global optimization ability in determining the success of genetic search for these problems is also reinforced by our observations on serendipitous coverage.

It is interesting that conjugate gradient descent did not perform as well as the reference algorithm for most of these programs. In most cases, this is apparently because the reference algorithm obtained better serendipitous coverage. In the experiments we will describe next (in Section 5.4), the opposite happened: conjugate gradient descent outperformed the reference algorithm by—according to the logged status information—obtaining better serendipitous coverage. Additional results are shown in the Appendix.

The performance gap among the three techniques seems to grow when the target program becomes more complex, but it remains small when either the condition complexity or the nesting complexity is small. This suggests that dynamic test data generation is worthwhile for complex programs, though it may not be worthwhile for programs whose conditions are simple or are not nested. However, it seems that random test generation levels off after several hundred executions, during which time the GAs show the same level of performance; since evolution has not begun yet, their behavior is essentially random. This is illustrated especially well in Figs. 5 and 14.
5.3.1 The Performance of the Differential GA Compared to the Standard GA

The poor performance of the differential GA (compared to the standard GA) can probably be attributed to its numerical focus, which seems poorly suited to test generation problems. Recall from Section 3.2 that the differential GA makes numerical changes in many variables, while the standard GA, using single-point crossover, leaves most variable values intact during reproduction. The scheme used by the differential GA is desirable in numerical problems, while the standard GA seems more suited to situations where variables represent specific features of a potential solution. The question, therefore, is whether the variables in a test generation problem behave more like the parameters of a numerical function, or whether they are more like "features" describing the behavior of the program.

This somewhat cloudy question becomes more concrete if we focus on the part of the program's behavior that is of interest during test generation. The program's input is a set of parameter values, but, in a test generation problem, we generally ignore the output and are only interested in the control path taken during execution. In this sense, test generation is not what would normally be thought of as a numerical problem. Indeed, a small change in the input parameters often leads to no change at all in the control path, so the fine parameter adjustments made by the differential GA are often fruitless.

On the other hand, there are many programs where each parameter controls specific aspects of a program's behavior. This may make it more appropriate to think of the parameters as features, each one describing a different dimension of the program's behavior. If we regard the parameters as features, we see that the differential GA has a tendency to modify each feature, while the standard GA (with single-point crossover) tends to create different combinations of existing features. This gives the differential GA a disadvantage when it comes to the serendipitous discovery of new inputs. This phenomenon, which we will discuss in the next section, requires that existing control paths (that is, existing features) be preserved, and the differential GA may not be good at doing this.

In our experiments, the disadvantage of the differential GA was sometimes amplified since we insisted on having different optimization methods use roughly the same number of program executions. To make the standard GA execute the program as many times as the differential GA did during its fine-grained and fruitless searches, we increased the standard GA's population size. The standard GA tended to outperform the differential GA even when their populations were the same, and increasing the population size allowed it to widen its search while the differential GA remained tightly focused.

To make matters worse for the differential GA, small differences in performance are magnified over time. The failure to find an input that causes one branch of a decision to be executed means that no conditions within that branch can be covered in the future. Thus, the failure to satisfy a single condition can be a greater handicap than it seems.
5.4 b737: Real-World Control Software

In this study, we used the GADGET system on b737, a C program which is part of an autopilot system. This code has 69 decision points, 75 conditions, and 2,046 source lines of code (excluding comments). It was generated by a CASE tool.

We generated tests using the standard GA, the differential GA, conjugate gradient descent, the reference gradient descent algorithm, and random test data generation. For each method, 10 attempts were made to achieve complete coverage of the program. For the two genetic algorithms, we made some attempt to tune performance by adjusting the number of individuals in the population and the number of generations that had to elapse without any improvement before the GAs would give up.

For the standard genetic algorithm, we used populations of 100 individuals each and allowed 15 generations to elapse when no improvement in fitness was seen. The probability of mutation was 0.000001. For the differential GA, we used populations of 24 individuals each and allowed 20 generations to elapse when there was no improvement in fitness (the smaller number of individuals allows more generations to be evaluated with a given number of program executions, and we found that the differential GA typically needed more generations because it converges more slowly than the standard GA for these problems). As before, we attempted to generate test cases that satisfy condition-decision coverage.
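For reference, the sketch below collects the search settings just described in one place. The struct and its field names are our own illustration for exposition; the paper does not show GADGET's actual configuration interface.

    // Search settings as reported in the text (illustrative only).
    struct GAConfig {
        int    population;     // individuals per generation
        int    patience;       // generations allowed without fitness improvement
        double mutation_rate;  // mutation probability (standard GA only)
    };

    const GAConfig standard_ga     = {100, 15, 0.000001};
    const GAConfig differential_ga = {24,  20, 0.0};  // mutation rate not reported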
Fig. 8. Coverage plots comparing performance of four systems on the b737 code. The four curves represent the pointwise mean over 10 attempts by each system to achieve complete coverage. The upper and lower edges of the dark gray region represent the best and worst performances of the standard GA, while the edges of the light gray region are the best and worst performances for the differential GA. The GAs have comparable performance and both show much better performance than random test data generation. The performance of gradient descent was slightly lower; conjugate gradient descent achieved just over 90 percent condition-decision coverage. The vertical axis shows the percentage of the 75 conditions that have been covered. (Note that the two plots do not have the same horizontal or vertical scale.)
First, we tried to achieve condition-decision coverage with the two GAs. Next, we applied random test data generation to the same program. Here, we permitted the same number of program executions as was used by the genetic searches. This amounts to thousands of random tests, one for each time the fitness function was evaluated during genetic search. Note, however, that random test generation stops making progress quickly.

Fig. 8 shows the coverage plots comparing genetic and random test data generation. The graphs show the best, worst, and pointwise mean performance over 10 separate attempts by each system to achieve complete condition-decision coverage.

The best performance is that of the differential GA; eventually, all runs converged to just over 93 percent condition-decision coverage. The best runs of the standard GA also reached this level of performance, but the differential GA was faster. Random test generation only achieved 55 percent coverage. The two gradient descent algorithms were somewhere in the middle, with conjugate gradient descent achieving about 90 percent condition-decision coverage, while the reference algorithm only reached about 85 percent.

Though we did not tune the GAs extensively for performance (as we have said), we did try to adjust the population size of both GAs in order to obtain comparable resource requirements for both. The reason for doing this was to ensure that better or worse performance by one algorithm was not merely the result of a greater or smaller number of program executions. But, during this process, we found that changing the population size only had a small effect on performance. The reason was that the GAs wasted many program executions on fruitless searches, and changes in the population size simply changed the resources wasted in this way. Therefore, it is quite possible that the standard GA could have been faster if the population size had been smaller, without a significant decrease in the condition-decision coverage it achieved.

5.5 Detailed Analysis of Experiments

For any given execution of the test generator, we can divide the 75 conditions of b737 into four classes. Some were never covered, some were covered serendipitously while the GA was trying to cover a different condition, some were covered while the GA was actually working on them, and some were covered by the randomly selected inputs we used to seed the GA initially. The last class is reflected in Fig. 8 by the fact that many conditions were already covered almost immediately. For example, the standard GA covered about 76 percent of the conditions right away. Here, as in our other experiments, most inputs were discovered serendipitously. (Note that, when a test requirement is satisfied serendipitously, it often happens before the genetic algorithm makes a concerted attempt to satisfy that requirement. Therefore, the fact that many requirements were satisfied by chance does not imply that the GA would have failed to satisfy them otherwise.)

The fact that most inputs were discovered by luck means that most requirements not satisfied by chance were not satisfied at all. In this respect, the b737 experiments shed some light on the true behavior of the two different GA implementations. A quick look at parts of the source code elucidates this behavior.
We will first explain a typical individual run of the standard GA and then discuss the differential GA.
The execution of the standard GA we examine as a model was selected because its coverage results are close to the mean value of all coverage results produced by the standard GA on b737. In its 11,409 executions of b737, this run sought to satisfy test requirements on 12 different conditions in the code. Of these 12 attempts, only one was successful. The remaining 11 attempts showed little forward progress during 10 generations of evolution. While making these attempts, however, the GA coincidentally discovered 14 tests that satisfied requirements other than the ones it was working on at the time. The high degree of coverage that was finally attained was mostly due to those 14 inputs. Indeed, the most successful executions have the shortest runtimes precisely because so many inputs were found serendipitously—many conditions had already been covered by chance before the GA was ready to begin working on them.

Next, we consider a typical execution of the differential GA. During this execution, the 15,982 executions of b737 involve 25 different attempts to satisfy specific test requirements. The number of attempts depends on which requirements are satisfied and in what order. Therefore, it is not the same for all executions of the test data generator. Again, only one objective is obtained through evolution, though 10 additional input cases, not necessarily prime objectives of the GA, are found.

5.5.1 Where the GAs Failed

It is interesting to consider the conditions that the GAs never successfully covered. The standard GA failed to cover the following eight conditions:

    for (index = begin; index <= end && !termination; index++)
    if (((T <= 0.0) || (0.0 == Gain)))
    if (((o_5 > 0.0) && (o < 0.0)))
    if (FLARE)
    if (FLARE)
    if (DECRB)

The differential GA failed to cover these conditions:

    for (index = begin; index <= end && !termination; index++)
    if (o)
    if (o)
    if ((((OP-D2) < 0.0) && ((OP-D1) > 0.0)))
    if (o_3)
    if (((o_5 > 0.0) && (o < 0.0)))
    if (FLARE)
    if (((!LOCE) || ONCRS))
    if (RESET)

Most of the decisions not covered contain only a single Boolean variable, signifying a condition that can be either TRUE or FALSE. The technique we use to define our fitness function seems inadequate when the condition contains Boolean variables or enumerated types. For example, if we are trying to exercise the TRUE branch of the condition

    if (windy) ...

we simply make the objective function g(x) equal to the absolute value of windy. This makes g(x) zero when the condition is FALSE and positive otherwise. But, if windy only takes on two values (say 0 and 1), then the fitness function can only have two values as well. Any two-valued fitness function does not allow the genetic algorithm to distinguish between different inputs that fail to satisfy the test requirement. Genetic search relies on the ability to prefer some inputs over others, so two-valued variables cause problems when they appear within conditions. Our experimental results suggest that this problem is real. With an improved strategy for dealing with such conditionals, GA behavior should improve.
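To see why such two-valued conditions defeat the search, compare the objective available for a relational condition with the one just described for a Boolean flag. The sketch below is our own illustration of the general idea, using a Korel-style branch distance for the relational case; it is not GADGET's actual implementation:

    #include <cmath>

    // For a relational condition such as "T <= 0.0", a branch-distance
    // objective can take many values: it shrinks as T approaches the
    // satisfying region, so the search can rank failing inputs.
    double objective_relational(double T) {
        return (T <= 0.0) ? 0.0 : T;
    }

    // For "if (windy)" with windy restricted to 0 or 1, the objective
    // |windy| takes only two values, so all inputs on the wrong side of
    // the condition look alike and the search receives no guidance.
    double objective_boolean(int windy) {
        return std::fabs((double)windy);
    }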
The GAs also failed to cover several conditions not containing Boolean variables, in spite of the fact that such conditions provide the GAs with useful fitness functions. The conditions not covered by the GAs all occurred within decisions containing more than one condition, and this may account for the GAs' difficulties. However, it is also important to bear in mind that these conditions do not tell the whole story, since the variables appearing in the condition may be complicated functions of the input parameters.

5.5.2 The Performance of the Random Test Generator

A second question raised by our experiments is why the random test generator performed so poorly. In all of our experiments, the random test generator quickly satisfied a certain percentage of the test requirements and then leveled off, failing to satisfy any further requirements, even though, in some cases, there were thousands of additional program executions.

In many programs, it is not counterintuitive that random test generation performs poorly. The b737 program presents an intuitively striking illustration of this. This program has 186 floating-point input parameters, meaning that there are 2^11904 possible inputs if a floating-point number is represented with 64 bits. Even a program like triangle, with three floating-point parameters, has 2^192 possible inputs. With a search space of this size, nothing even approaching an exhaustive search is possible. It is clear that, for any probability density governing the random selection of inputs, one can write conditions that only have a minute probability of being satisfied by those inputs. Thus, even seemingly straightforward test requirements can be essentially impossible to satisfy using a random test generator. Some parts of the triangle program are only executed when all three parameters are the same, but a random test generator only has one chance in 2^128 of coming up with such an input if inputs are selected uniformly and independently (e.g., the second and third input parameters have to have the same value as the first, whatever that value happens to be). Since the random test generator creates tests independently at random, it is straightforward to determine that the probability of generating an equilateral triangle one or more times during 15,000 tests is about 4 x 10^-35. (See [12] for a discussion of the probability of exercising a program feature during testing that was not exercised previously.)

In some cases, the performance of the random test generator might be improved by nonuniform sampling, as discussed in Section 5.2. If the test generator were only allowed to choose integer inputs between 1 and 100—the tester would already have to know that restricting the
inputs to small integers does not preclude finding the desired test—then there would be one chance in 10,000 of finding a satisfactory input. This would give the random test generator a manageable chance of finding an input whose three parameters are all the same. Still, the same trick would also improve the performance of the GA-driven test generator. The triangle example clearly illustrates the advantages of a directed search for a satisfactory input.

5.5.3 The Performance of Gradient Descent
The coverage plots for conjugate gradient descent and the reference gradient descent algorithm are shown on the right side in Fig. 8, with a different horizontal and vertical scale than the coverage plot for the GAs. On average, conjugate gradient descent achieved 90.53 percent condition-decision coverage, just slightly less than the genetic algorithms. The reference algorithm achieved 85.51 percent. In fact, conjugate gradient descent was more expensive than genetic search in these experiments. The reason for this, according to the status information logged during execution, was that conjugate gradient descent had trouble noticing when it was stuck on a plateau. It continued searching there when it should have stopped. This could easily be fixed, but tuning gradient descent for efficiency is not our goal in this paper. Needless to say, the reference algorithm was more expensive than any of the others.

The poor performance of the reference algorithm is somewhat surprising. We would expect it to perform at least as well as conjugate gradient descent if the objective function had no local minima or plateaus. The status information logged by the reference algorithm does not show that it encountered local minima, but it did encounter plateaus. According to the logged status information, this difference in performance was, once again, the result of serendipitous coverage. Below, we will discuss the issue of serendipitous coverage in more detail.
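The fix alluded to above is indeed simple. A minimal plateau-detection rule might look like the following sketch; the class name, tolerance, and patience values are our own assumptions, not part of the implementation described in the paper:

    #include <cmath>
    #include <cstddef>

    // Abandon a search when the objective has not changed measurably
    // for `patience` consecutive evaluations.
    class PlateauDetector {
        double best;
        std::size_t unchanged;
        double tolerance;
        std::size_t patience;
    public:
        PlateauDetector(double tol = 1e-12, std::size_t pat = 50)
            : best(0.0), unchanged(0), tolerance(tol), patience(pat) {}
        // Feed each new objective value; returns true when the search
        // should give up on the current test requirement.
        bool stuck(double objective) {
            if (unchanged > 0 && std::fabs(objective - best) < tolerance)
                ++unchanged;
            else
                unchanged = 1;
            best = objective;
            return unchanged > patience;
        }
    };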
5.5.4 Serendipitous Coverage

Recall that, in the b737 code, 14 conditions were covered while the GA was trying to satisfy a different condition, while only one condition was covered while the GA was actually trying to cover it. For the differential GA, the ratio was 10 coincidental coverages to one deliberate coverage. This phenomenon was seen throughout our experiments and it is graphically illustrated in the coverage plots. There are sudden, seemingly discontinuous increases in the percentage of conditions covered. Instantaneous jumps come about because many new conditions are encountered and satisfied immediately when the program takes a new branch in the execution path (this was discussed in Section 5.2). But, sometimes, the sudden increases in coverage are rapid without being instantaneous. In our experiments, this happened when the execution of a new branch provided new opportunities for serendipitous coverage.

The most interesting question raised by this experiment is the following: If the two GAs had so much success with inputs they happened on by chance, then why didn't random test generation perform equally well? We believe that the evolutionary pressures driving the GA to satisfy
Fig. 9. The flow of control for a hypothetical program. The nodes represent decisions and the goal is to find an input that takes the TRUE branch of decision c.
even one test requirement are strong enough to force the system as a whole to delve deeper into the structure of the code. This means that, though the GA is not necessarily following the optimal algorithm of grinding through each conditional, one after the other, to meet its objectives in lockstep fashion, it is, in the end, finding good input cases.

This argument is illustrated in the diagram in Fig. 9, which represents the flow of control in a hypothetical program. The nodes represent decisions. Suppose that we do not have an input that takes the TRUE branch of the condition labeled c. Because of the coverage-table strategy, GADGET does not attempt to find such an input until decision c can be reached (such an input must take the TRUE branches of conditions a and b). When the GA starts trying to find an input that takes the TRUE branch of c, inputs that reach c are used as seeds. During reproduction, some newly generated inputs will reach c and some will not, but those that do not will have poor fitness values and they will not usually reproduce. Thus, during reproduction, the GA tends to generate inputs that reach c. Until the GA's goal is satisfied, all newly generated inputs that reach c will, by definition, take the FALSE branch and, therefore, they will all reach condition d. Each time a new input is generated that reaches c, there is a possibility it will exercise a new branch of d.

According to this explanation, gradient descent should also benefit from serendipitous coverage and, in fact, we found this to be the case (our experiments with gradient descent were performed after the preceding argument was formulated). In the case of the b737 experiments, this may also explain why conjugate gradient descent performed better than the reference gradient descent algorithm. Conjugate gradient descent tends to take small steps initially, which means that many of those inputs are likely to reach the same branches as the original seed. The reference algorithm, which takes larger steps, may make more exploratory executions that do not reach the same branches as the seed. (Of course, those inputs will lead to poor objective-function values and that direction of search will not be pursued further.)

In contrast to the inputs generated by genetic search and gradient descent, those generated completely at random may be unlikely to reach d, because many will take the FALSE branches of conditions a and b. Therefore, random inputs are less likely to exercise new branches of d.
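The argument can be made concrete with a small program shaped like Fig. 9; the variable names and predicates below are invented purely for illustration:

    // Decisions a, b, c, and d as in Fig. 9. The goal is an input that
    // takes the TRUE branch of decision c.
    int hypothetical(double x, double y) {
        int covered_c = 0, covered_d = 0;
        if (x > 0.0) {                    // decision a
            if (y > 0.0) {                // decision b
                if (x > y + 100.0) {      // decision c: the target branch
                    covered_c = 1;
                } else {
                    // Inputs that reach c but miss the target all fall
                    // through to d, so each new seed-derived input gets
                    // a fresh chance to cover a new branch of d.
                    if (x * y > 50.0) {   // decision d
                        covered_d = 1;
                    }
                }
            }
        }
        return covered_c + 2 * covered_d;
    }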
In the final analysis, both GAs clearly outperform random test data generation for a real program of several thousand lines. This is an encouraging result.

6 OPEN RESEARCH ISSUES

Our experimental results open a considerable number of research issues. These issues all have at their heart the question: How can our test data generation system be further improved? More specifically, how can we make our system find tests that satisfy an even larger proportion of the test requirements? The handful of issues addressed in this section each have the potential to improve system behavior.

• Improved handling of binary-valued variables. The fitness function should deal intelligently with conditions that contain two-valued variables (see Section 5.5.1).

• Improved handling of inputs that fail to reach the target condition. When genetic search generates an input that fails to reach the condition that we are currently trying to satisfy, that input is simply given a low fitness value. However, we already have at least one input that reaches the condition because of the way the algorithm is defined. If we assign higher fitnesses to inputs that are closer to reaching the condition, it might be possible to breed more inputs that actually reach it.

• Special purpose GAs. Much of the GA literature is concerned with investigating special purpose GAs whose parameters and mechanisms are tailored to specific tasks [31]. Results garnered from the comparison we made between the differential GA and the standard GA suggest that investigation into designer GAs would be profitable. This research would focus on GA failure and investigate ways to avoid running out of steam during test data generation. More work should also be done determining exactly why the standard GA seems to outperform the differential GA.

• Path selection. Path selection is the use of heuristics to choose an execution path that simplifies test-data generation (as used by TESTGEN; see Section 2.3). Although path selection is not vital in our test-generation approach, it may still be the case that some execution paths are better than others for satisfying a particular test requirement. If static or dynamic analysis can provide clues about which paths are best, it will not be difficult to bias a genetic search algorithm toward solutions using those paths. In fact, the work of [16], [4] suggests that such an approach can lead to a noticeable improvement in performance.

• Higher levels of coverage. In this paper, we reported on the generation of condition-decision adequate test data. However, higher levels of coverage may further discriminate among different test-generation techniques. It would be interesting to apply our technique to multiple-condition coverage as well as dataflow and mutation-based coverage measures.

6.1 Global Optimization vs. Gradient Descent

In general, gradient descent is faster than global optimization algorithms such as genetic search, so it is preferable when it provides comparable performance. However, the global optimization algorithms almost always provided better performance in our experiments.

Gradient descent can fail in two ways: First, it can encounter local minima and, second, it can encounter plateaus in the objective function. When we analyzed the behavior of the reference algorithm, it appeared that local minima did exist in our test problems for some criteria—at least in the way we defined the shape of the problem space—but plateaus appeared to be much more common. Plateaus can also make a global optimization algorithm fail, and they did so in our experiments. Nonetheless, the global optimization algorithms made up for this failing with serendipitous discoveries of new inputs. In a sense, the extra program executions, which make up the bulk of the extra resources needed for global optimization, were put to good use.

On the basis of our experiments, we are therefore reluctant to conclude that global optimization algorithms performed better because of their avoidance of local minima. Rather, it seems that their absence of assumptions about the shape of the objective function, and the extra exploration they did to make up for this lack of assumptions, paid off in an unexpected way.
Of course, the setting determines whether or not the performance gain is worth the computational effort. For many programs, it may be important to generate tests quickly, and achieving high levels of coverage may be less important. Indeed, it has been argued that coverage for its own sake is not a worthy goal during software testing. However, high test coverage levels are mandated in some settings. We will also argue, in Section 6.2, that achieving structural test coverage is not the only reason for using automatic test generation. In settings where automated test generation replaces manual test generation, the automated approach is clearly desirable, and we hope that this paper will constitute progress toward making it feasible in settings where programs are complex and manual test generation is especially laborious.

6.2 Further Applications of Test-Data Generators

Beyond the problem we have studied, there are a number of interesting potential applications of test data generation that are not related to the satisfaction of test coverage criteria. Often, we would like to know whether a program is capable of performing a certain action, whether or not it was meant to do so. For example, in a safety-critical system, we want to know whether the system can enter an unsafe state. When security is a concern, we would like to know if the program can be made to perform one or more undesirable actions that constitute security breaches. Even in standard software testing, one could conceivably perform a search for inputs that cause a program to fail, instead of simply trying to exercise all features of the program. References [22], [23], and [24] discuss a number of such extensions of dynamic test generation and give more detail than we do here about their implementation.
Fig. 10. compl(0,2); 100 percent coverage represents coverage of 60 conditions.

Fig. 11. compl(0,3); 100 percent coverage represents coverage of 45 conditions.
The genetic search techniques we are developing can be applied in all of these areas, although we expect that each area will present its own challenges and pitfalls. To our knowledge, the only test-data generation systems that can be used on real programs are our own and that of [5]; since both are recent developments, it has not been possible to explore many of the less obvious applications of test data generators. Thus, the ability to automatically satisfy test criteria will open an enormous number of new avenues for investigation.

6.3 Threats to Validity

For the most part, this paper simply reports what we observed in our experiments. By performing each experiment several times with different random number seeds, we tried to ensure that the observations we reported were not extraordinary phenomena, and we can say with a certain confidence that similar results would be obtained if the same experiments were run with different initial seeds.

However, we do not know of any legitimate basis for generalizing about one program after observing the behavior of another, nor for selecting programs at random in a way that permits statistical conclusions to be drawn about a broader class of programs. Therefore, we have not tried to obtain, say, a statistical sample of control programs to use in Section 5.4, nor have we tried to generalize about programs with different nesting complexities or condition complexities by generating many random programs with the same compl(·,·) parameters in Section 5.3. It follows, however, that our results in Section 5.4 do not predict the performance of genetic search in automatic test generation for all control software, nor do the results in Section 5.3 predict its performance for all software with a given nesting complexity and condition complexity.

Our measurements were made in two ways. First, we kept a log of certain information while test generation was in progress. Our logs show the value of the objective function when the program under test is executed, they record when new coverage criteria are satisfied, and, in
some cases, they record other information as well, such as the fact that the reference gradient descent algorithm is reducing its step size. Coverage results are obtained using the commercial coverage tool DeepCover and they are recorded each time a new criterion is satisfied; this allows us to construct convergence plots.

Although most of what we reported is simply what we observed, some conclusions about the shape of the objective function were obtained indirectly via the log files. Thus, when we report that a test generation run hits a plateau in the objective function, we mean that the algorithm was unable to find any inputs that caused the objective function value to change, and not that there were no such inputs. Likewise, when we report that the objective function had no local minima, we mean that none were observed, not that none exist.
7 CONCLUSIONS

In this paper, we have reported on results from four sets of experiments using dynamic test data generation. Test data were generated for programs of various sizes, including some that were large compared to those usually subjected to test data generation. To our knowledge, we present results for the largest program yet reported in the test generation literature. The following are some salient observations of our study:
• In our experiments, the performance of random test generation deteriorates for larger programs. In fact, it deteriorates faster than can be accounted for simply by the increased number of conditions that must be covered. This suggests that satisfying individual test requirements is harder in large programs than in small ones. Moreover, it implies that, as program complexity increases, nonrandom test generation techniques become increasingly desirable, in spite of the greater simplicity of implementing a random test generator.
Fig. 12. compl(3,1); the differential GA outperforms the other two techniques, which occurred rarely in our experiments. 100 percent coverage represents coverage of 60 conditions.
• Although the standard genetic algorithm performed best overall, there were programs for which the differential GA performed better. For most of the programs, a fairly high degree of coverage was achieved by at least one of the techniques. From the standpoint of combinatorial optimization, it is hardly surprising that no single technique excels for all problems, but, from the standpoint of test-data generation, it suggests that comparatively few test requirements are intrinsically hard to satisfy, at least when condition-decision coverage is the goal. Apparently, a requirement that is difficult to cover with one technique may often be easier with another.

• Serendipitous satisfaction of new test requirements can play an important role. In general, we found that the most successful attempts to generate test data did so by satisfying many requirements coincidentally. This coincidental discovery of solutions is facilitated by the fact that a test generator must solve a number of similar problems, and it may lead to considerable differences between dynamic test data generation and other optimization problems.
Fig. 14. compl(5,1); 100 percent coverage represents coverage of 60 conditions. All the requirements that are ever satisfied are satisfied in the early stages of the test generation process. The behavior of the GAs is essentially random because no evolution has taken place yet.
APPENDIX
RESULTS FOR SYNTHETIC PROGRAMS
This appendix shows the results of further experiments described in Section 5.3. Several interesting features, such as the remarkably poor performance of the standard GA for the program with compl(3,1) and the similar performance of the different methods for the compl(0,·) programs, persisted in several repetitions of the experiment using different GA parameters and different seeds for random number generation. This suggests (not surprisingly) that the compl metric does not capture everything needed to predict the performance of the test generators.

As in Section 5.3, the differential GA used 30 individuals and gave up on satisfying a given test requirement if 10 generations elapsed with no progress. The standard GA was also instructed to give up if no progress was made in 10 generations, and the population size was adjusted to give the same number of target-program executions as the differential GA.
Fig. 15. compl(5,2); there are a total of 60 conditions to cover.
TABLE 6
Performance of Conjugate Gradient Descent and the Reference Gradient Descent Algorithm on the Off-Diagonal Synthetic Programs

            compl(0,2)   compl(0,3)   compl(3,1)   compl(3,3)   compl(5,1)   compl(5,2)
    CGD        85.83        80.0         78.11        41.25        72.1         53.79
    ref.       89.25        84.03        81.67        55.28        72.5         62.96

Overall, the techniques are comparable to one another and their performance is somewhat below that of genetic search. A remarkable exception is compl(3,3), where the reference algorithm outperformed all others. The reference algorithm also outperformed conjugate gradient descent on compl(0,3), which may be because of the interpolation technique we used for conjugate gradient descent.
This resulted in the following population sizes for the standard GA (the corresponding results are shown in Figs. 10, 11, 12, 13, 14, and 15): compl(3,3): 280, compl(5,1): 340, and compl(5,2): 240. The performance for each size is reported in Table 6.

ACKNOWLEDGMENTS

The authors would like to thank Berent Eskikaya, Curtis Walton, Greg Kapfhammer, and Deborah Duong for many helpful contributions to this paper. Tom O'Connor and Brian Sohr contributed the GADGET acronym. Bogdan Korel provided the programs analyzed in Section 5.2. This research has been made possible by the US National Science Foundation under award number DMI-9661393 and the US Defense Advanced Research Projects Agency (DARPA) contract N66001-00-C-8056.

REFERENCES

[1] W. Miller and D.L. Spooner, "Automatic Generation of Floating Point Test Data," IEEE Trans. Software Eng., vol. 2, no. 3, pp. 223-226, Sept. 1976.
[2] B. Korel, "Automated Software Test Data Generation," IEEE Trans. Software Eng., vol. 16, no. 8, pp. 870-879, Aug. 1990.
[3] P. Frankl, D. Hamlet, B. Littlewood, and L. Strigini, "Choosing a Testing Method to Deliver Reliability," Proc. 19th Int'l Conf. Software Eng. (ICSE '97), pp. 68-78, May 1997.
[4] R. Ferguson and B. Korel, "The Chaining Approach for Software Test Data Generation," ACM Trans. Software Eng. Methodology, vol. 5, no. 1, pp. 63-86, Jan. 1996.
[5] M.J. Gallagher and V.L. Narasimhan, "Adtest: A Test Data Generation Suite for Ada Software Systems," IEEE Trans. Software Eng., vol. 23, no. 8, pp. 473-484, Aug. 1997.
[6] J.H. Holland, Adaptation in Natural and Artificial Systems. Ann Arbor, Mich.: Univ. of Michigan Press, 1975.
[7] S. Kirkpatrick, C.D. Gelatt Jr., and M.P. Vecchi, "Optimization by Simulated Annealing," Science, vol. 220, no. 4598, pp. 671-680, May 1983.
[8] F. Glover, "Tabu Search, Parts I and II," ORSA J. Computing, vol. 1, no. 3, pp. 190-206, 1989.
[9] J. Horgan, S. London, and M. Lyu, "Achieving Software Quality with Testing Coverage Measures," Computer, vol. 27, no. 9, pp. 60-69, Sept. 1994.
[10] J. Chilenski and S. Miller, "Applicability of Modified Condition/Decision Coverage to Software Testing," Software Eng. J., pp. 193-200, Sept. 1994.
[11] R. DeMillo and A. Mathur, "On the Uses of Software Artifacts to Evaluate the Effectiveness of Mutation Analysis for Detecting Errors in Production Software," Technical Report SERC-TR-92-P, Purdue Univ., 1992.
[12] R. Hamlet and R. Taylor, "Partition Testing Does Not Inspire Confidence," IEEE Trans. Software Eng., vol. 16, no. 12, pp. 1402-1411, Dec. 1990.
[13] E.J. Weyuker and B. Jeng, "Analyzing Partition Testing Strategies," IEEE Trans. Software Eng., vol. 17, no. 7, pp. 703-711, July 1991.
[14] E.J. Weyuker, "Axiomatizing Software Test Adequacy," IEEE Trans. Software Eng., vol. 12, no. 12, pp. 1128-1137, Dec. 1986.
[15] P.G. Frankl and E.J. Weyuker, "A Formal Analysis of the Fault-Detecting Ability of Testing Methods," IEEE Trans. Software Eng., vol. 19, no. 3, pp. 202-213, Mar. 1993.
[16] B. Korel, "Automated Test Data Generation for Programs with Procedures," Proc. Int'l Symp. Software Testing and Analysis, pp. 209-215, 1996.
[17] L.A. Clarke, "A System to Generate Test Data and Symbolically Execute Programs," IEEE Trans. Software Eng., vol. 2, no. 3, pp. 215-222, Sept. 1976.
[18] C.V. Ramamoorthy, S.F. Ho, and W.T. Chen, "On the Automated Generation of Program Test Data," IEEE Trans. Software Eng., vol. 2, no. 4, pp. 293-300, Dec. 1976.
[19] A.J. Offutt, "An Integrated Automatic Test Data Generation System," J. Systems Integration, vol. 1, pp. 391-409, 1991.
[20] W.H. Deason, D.B. Brown, K.H. Chang, and J.H. Cross II, "A Rule-Based Software Test Data Generator," IEEE Trans. Knowledge and Data Eng., vol. 3, no. 1, pp. 108-117, Mar. 1991.
[21] K.H. Chang, J.H. Cross II, W.H. Carlisle, and S.-S. Liao, "A Performance Evaluation of Heuristics-Based Test Case Generation Methods for Software Branch Coverage," Int'l J. Software Eng. and Knowledge Eng., vol. 6, no. 4, pp. 585-608, 1996.
[22] N. Tracey, J. Clark, and K. Mander, "The Way Forward for Unifying Dynamic Test-Case Generation: The Optimisation-Based Approach," Proc. Int'l Workshop Dependable Computing and Its Applications (DCIA), pp. 169-180, Jan. 1998.
[23] N. Tracey, J. Clark, and K. Mander, "Automated Program Flaw Finding Using Simulated Annealing," Proc. Int'l Symp. Software Testing and Analysis, Software Eng. Notes, pp. 73-81, Mar. 1998.
[24] N. Tracey, J. Clark, K. Mander, and J. McDermid, "An Automated Framework for Structural Test-Data Generation," Proc. Automated Software Eng. '98, pp. 285-288, 1998.
[25] C.C. Michael, G.E. McGraw, and M.A. Schatz, "Genetic Algorithms for Dynamic Test Data Generation," Proc. Automated Software Eng. '97, pp. 307-308, 1997.
[26] C.C. Michael, G.E. McGraw, and M.A. Schatz, "Opportunism and Diversity in Automated Software Test Data Generation," Proc. Automated Software Eng. '98, pp. 136-146, 1998.
[27] A.C. Schultz, J.J. Grefenstette, and K.A. De Jong, "Test and Evaluation by Genetic Algorithms," IEEE Expert, pp. 9-14, Oct. 1993.
[28] W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling, Numerical Recipes in C. New York: Cambridge Univ. Press, 1991.
[29] J. Skorin-Kapov, "Tabu Search Applied to the Quadratic Assignment Problem," ORSA J. Computing, vol. 2, no. 1, pp. 33-45, Winter 1990.
[30] D. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, Mass.: Addison-Wesley, 1989.
[31] M. Mitchell, An Introduction to Genetic Algorithms. Cambridge, Mass.: MIT Press, 1996.
[32] Foundations of Genetic Algorithms, G. Rawlins, ed. San Mateo, Calif.: Morgan Kaufmann, 1991.
[33] R. Storn, "On the Usage of Differential Evolution for Function Optimization," Proc. North Am. Fuzzy Information Processing Soc. (NAFIPS '96), pp. 519-523, June 1996.
[34] S.K. Park and K.W. Miller, "Random Number Generators: Good Ones are Hard to Find," Comm. ACM, vol. 31, no. 10, pp. 1192-1201, Oct. 1988.
[35] R.A. DeMillo and A.J. Offutt, "Experimental Results from an Automatic Test Case Generator," ACM Trans. Software Eng. Methodology, vol. 2, no. 1, pp. 215-222, Jan. 1993.
[36] J.D. Musa, "Operational Profiles in Software Engineering," IEEE Software, vol. 10, no. 2, pp. 14-32, 1993.

Christoph C. Michael received the BA degree in physics from Carleton College, Minnesota, in 1984, and the MSc and PhD degrees in computer science from the College of William and Mary in Virginia, in 1993. He is a senior research scientist at Cigital. He has served as principal investigator on software assurance grants from the US National Institute of Standards and Technology's Advanced Technology Program and the US Army Research Labs, as well as software security grants from the US Defense Advanced Research Projects Agency. His current research includes information system intrusion detection, software test data generation, and dynamic software behavior modeling. He is a member of the IEEE.
Gary McGraw is the vice president of corporate technology at Cigital (formerly Reliable Software Technologies), where he pursues research in software security while leading the Software Security Group. He has served as principal investigator on grants from the US Air Force Research Labs, the Defense Advanced Research Projects Agency, the US National Science Foundation, and the US National Institute of Standards and Technology's Advanced Technology Program. He also chairs the National Information Security Science and Technology Council's Malicious Code Study Group. He coauthored Java Security (Wiley, 1996), Software Fault Injection (Wiley, 1997), and Securing Java (Wiley, 1999), and is currently writing a book entitled Building Secure Software (Addison-Wesley, 2001). He is a member of the IEEE.

Michael A. Schatz graduated summa cum laude with a BS degree in mathematics from Case Western Reserve University in 1996 and an MS degree in computer engineering from Case Western Reserve University in 1997. He is a senior research associate at Cigital (formerly known as Reliable Software Technologies). He has worked on numerous projects in both research and development roles. These projects include experimentation with using fault injection to find security vulnerabilities, using genetic algorithms to generate test data for programs, and augmenting the capabilities of Reliable Software Technologies's coverage tool. He has coauthored articles for Dr. Dobb's and a number of research papers.
Chapter 6

ML Applications in Reuse

As the cost of software development becomes the bulk of any computer-based solution, it makes great economic sense to systematically reuse existing solutions. Reuse can take place at many different levels: specifications, domain knowledge, designs, development processes, systems, subsystems, and components. There are a number of technical and managerial benefits of reuse: reduced development time and risk, increased reliability and productivity, and improved standardization.

The ML applications in this chapter pertain to reuse. Issues that have been considered in this area of applications include: how to compute the similarity in a reuse library, tools for browsing software libraries, how to model the cost of rework for reusable components, how to locate and adopt software components to given specifications, how to generalize program abstractions so as to increase their chance for reuse, and how to organize reusable components such that efficient retrievals can be accommodated. The ML methods utilized in this area of applications consist of IBL/CBR, DT, GA, and EBL, as shown in Table 26.

Table 26. ML methods used in reuse. (The table indicates, for each of the tasks just listed: similarity computing, active browsing, cost of rework, knowledge representation, locating and adopting software to specifications, generalizing program abstractions, and clustering of components, which of the candidate methods NN, IBL, DT, GA, GP, ILP, EBL, CL, BL, AL, IAL, RL, EL, SVM, and CBR have been applied.)
This chapter contains one paper by Katalagarianos and Vassiliou [70]. The paper describes a CBR-based approach to locating and adopting reusable components to particular specifications. In their organizational framework, a single representational model is used to accommodate a variety of different artifacts (ranging from designs, domain knowledge, development experience, and programming knowledge, to code) in the repository. Descriptions of those artifacts are made in a component interconnection language called Telos. Components in the repository are decomposed into implementation components, design components, specification components, and dependency components. In the generalization/specialization hierarchy of specification
components, there are three different types of elements: the most general elements, the most specific elements, and the intermediate elements. The dependency components interrelate components of different types and consist of the design dependency and the implementation dependency. A design dependency component defines a dependency relationship between a design component and the corresponding specification component. An implementation dependency component defines a dependency relationship between an implementation component and the corresponding design component.

The proposed approach has two phases: a retrieval and adaptation phase, and a repository evolution phase. When retrieving components for reuse, the proposed system interacts with the application developer and searches for the most appropriate case satisfying the user's need, with possible adaptation. The repository evolution is based on learning and generalization. To demonstrate the viability of the proposed approach, a prototype system was implemented for the reuse of C++ code.

The following paper will be included here:

P. Katalagarianos and Y. Vassiliou, "On the reuse of software: a case-based approach employing a repository", Automated Software Engineering, Vol. 2, No. 1, 1995, pp. 55-86.
On the Reuse of Software: A Case-Based Approach Employing a Repository

PANAGIOTIS KATALAGARIANOS
Department of Computer Science, University of Crete and Institute of Computer Science, Foundation for Research and Technology - Hellas, FORTH

YANNIS VASSILIOU
National Technical University of Athens, Dept. of Electrical and Computer Engineering and Institute of Computer Science, Foundation for Research and Technology - Hellas, FORTH
Abstract. Systematic reuse of software has been proposed as a promising means to address the legendary productivity increase in software development. While object-oriented programming languages are, by nature, well suited for reusability-based development of applications, additional mechanisms to effectively reuse software are necessary. We present a novel language-independent method, which assumes an appropriately organized software repository and employs a simple form of Case-Based Reasoning in conjunction with the specificity-genericity hierarchy to locate and possibly adopt software to particular specifications. The method focuses on code reuse and addresses the evolving nature of the repository. Complexity issues for the main algorithms are presented. Finally, a demonstrator prototype system for reusing object-oriented code (C++) is described.
1. Introduction
Several advances towards systematic software reuse have recently been reported with library systems, classification techniques, the creation and distribution of reusable components, reuse support environments, and corporate reuse programs. Despite these efforts, it is argued that reuse has not yet delivered on its promise to significantly increase productivity and quality (Prieto-Diaz, 1993). Effective reuse of software requires a rich collection of designed-for-reuse software components and knowledge on how to locate them in a repository, adapt them if needed (for instance, by substituting parameters), and even create new ones based on information provided by other components exploiting similar characteristics.

Genericity is the technique that allows a module to be defined with parameterized types. This is a definite aid to reusability because just one generic module is defined, instead of a group of modules that differ only in the types of objects they manipulate (Meyer, 1987). A language supporting type parameterization allows specification of general container types such as list, where the specific type of the elements is left as a parameter. Thus, a parameterized class specifies an unbounded set of related types; for example, list of int. Typical languages that provide genericity are Ada, LPG, and CLU. A general form of parameterized types can also be integrated into object-oriented languages that do not provide a built-in form of genericity; Stroustrup (1988) proposed such a general form for the C++ language. Within this framework, a generic class can be defined
having one parameter of type <T>. For example, the generic class array may be defined as:
    class array {
    protected:
        <T>* element;
        int size;
    public:
        int search();
    };

Example 1
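As a hedged illustration, here is the same class written with the template notation that C++ later standardized, together with two instantiations. The constructor, the body of search, and main are our own minimal stubs, not part of the paper's example:

    template <class T>
    class array {
    protected:
        T*  element;   // storage for the parameterized element type
        int size;
    public:
        explicit array(int n) : element(new T[n]()), size(n) {}
        ~array() { delete[] element; }
        // Linear search; returns the index of key, or -1 if absent.
        int search(const T& key) const {
            for (int i = 0; i < size; ++i)
                if (element[i] == key) return i;
            return -1;
        }
    };

    int main() {
        array<int>    a(10);  // "array of int": one generic module, many types
        array<double> b(5);   // each instantiation is directly usable
        return (a.search(42) == -1 && b.search(3.5) == -1) ? 0 : 1;
    }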
Using this technique, it is possible to maintain the efficiency of object-oriented code (as we make the substitution before compiling), while retaining the benefits of genericity. However, this genericity mechanism by itself is not flexible enough, because it cannot capture the fine grain of commonality between groups of implementations of the same general data abstraction. This is because there are only two levels of modules: a) generic modules, which are parameterized and thus open to variation but not directly usable, and b) fully instantiated modules (specific modules), which are directly usable but not open to refinement.

Our conjecture is that, in order to effectively reuse object-oriented code using genericity, software developers could depend on experiential knowledge gathered and stored in the repository while developing similar software components. Specialized methods can then be incorporated in order to achieve better support from the system for the provision of this knowledge. The application of techniques and methods from artificial intelligence to software engineering is one mechanism through which reusability of software might be achieved. By abstracting and encoding the expertise of experienced software engineers into knowledge bases together with software components, system developers can gain effective access to the artifacts in the software repository as it evolves over time.

Case-Based Reasoning (CBR) (Barletta, 1991) is a method of solving problems based on the transfer of past experience to new problem situations. It has been attractive as a method for building intelligent reasoning systems because it appears relatively simple and natural. CBR as a learning paradigm has several advantages. Firstly, there are several performance enhancements, as it provides shortcuts in reasoning, the capability of avoiding past errors, and the capability of focusing on the most important parts of a problem first. Secondly, learning can be eased, since CBR does not require a causal model or a deep understanding of the domain. Thirdly, individual or generalized cases can also serve as explanations.

This paper presents a novel method, which employs a simple form of Case-Based Reasoning (CBR) in conjunction with the specificity-genericity hierarchy to semi-automatically locate the appropriate code in a software repository, possibly adapting it to particular requirements, while dealing with the evolution of the repository by adding (if needed) new components and performing the proper repository reorganization.
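As a rough sketch of the retrieval step in such a case-based scheme, the following ranks stored component descriptions against a query using a simple feature-overlap similarity. The data structures and the similarity measure are our own simplification for illustration, not the Telos-based machinery developed later in the paper:

    #include <algorithm>
    #include <cstddef>
    #include <iterator>
    #include <set>
    #include <string>
    #include <vector>

    // A stored case: a component description reduced to indexing terms.
    struct ComponentCase {
        std::string name;
        std::set<std::string> features;  // e.g., {"container", "generic", "search"}
    };

    // Jaccard overlap between query features and a stored case.
    double similarity(const std::set<std::string>& q,
                      const std::set<std::string>& c) {
        std::vector<std::string> common;
        std::set_intersection(q.begin(), q.end(), c.begin(), c.end(),
                              std::back_inserter(common));
        double uni = double(q.size() + c.size()) - double(common.size());
        return uni == 0.0 ? 0.0 : double(common.size()) / uni;
    }

    // Retrieval phase: return the most similar case as the candidate
    // for adaptation (adaptation and repository evolution would follow).
    const ComponentCase* retrieve(const std::vector<ComponentCase>& repo,
                                  const std::set<std::string>& query) {
        const ComponentCase* best = 0;
        double bestScore = -1.0;
        for (std::size_t i = 0; i < repo.size(); ++i) {
            double s = similarity(query, repo[i].features);
            if (s > bestScore) { bestScore = s; best = &repo[i]; }
        }
        return best;
    }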
The method presented in this paper has been evaluated through a prototype implementation which addresses the reuse of C++ code. Since the objective is not to describe all the functionalities of the prototype system but to illustrate the way the method is applied, a sample session of system usage is presented in Appendix A. The organizational framework is presented in section 2, while the reuse method itself is described in section 3. The main topic of section 4 is the complexity analysis of the algorithms used for selection, adaptation and repository evolution. Finally, conclusions and extensions are described in section 5.

2. Organizational Framework
Software environments typically use an environment database (repository) to provide support for all activities concerning the software development process. Several factors must be addressed in organizing the repository: a) which artifacts may be reused? b) how are these artifacts transformed into reusable components? c) what is the proper size and form of reusable components? d) how should they be classified? and e) what are proper techniques and languages for describing the components? Regarding which artifacts are candidates for reuse, the software community appears to be reaching a consensus: not only code but also higher-level concepts like designs, domain knowledge, development experience and programming knowledge. The issues of representation and presentation of the reusable artifacts do not admit a simplistic solution and certainly need to be separated. A main objective in the work reported has been the use of a single representation model for hosting all these drastically different artifacts in a repository. Effectively, the repository stores and manages descriptions of the artifacts (in the sequel called components), all expressed in a suitable data description language. An important aspect of this work has been to identify the concepts and relations in the programming domain which can be used to capture programming knowledge. These identified concepts and relations have then been explicated in component descriptions and component interconnections by introducing abstractions and generalizations, under the following assumptions:
• The software components are described and interconnected in an economic, domain-independent way that enables the formation of matching cases.

• The matching cases may be categorized effectively by indices that are easily extracted from the component descriptions (see the sketch after this list).

• The whole organization provides a strong basis for controlled evolution and expansion of the repository with quality components through the application of an evolution method.
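The retrieval machinery itself is described in section 3; purely as an intuition for the second assumption above, index-based matching of cases can be pictured along the following lines (a sketch of ours with hypothetical names, not the prototype's code):

#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical case: a stored component description keyed by simple
// feature indices extracted from it (e.g. ADT and operation names).
struct Case {
    std::string component;          // e.g. "array"
    std::set<std::string> indices;  // e.g. {"table", "search"}
};

// Return the stored case sharing the most indices with the query,
// or null when no case shares any index at all.
const Case* best_match(const std::vector<Case>& repository,
                       const std::set<std::string>& query) {
    const Case* best = nullptr;
    std::size_t best_score = 0;
    for (const Case& c : repository) {
        std::size_t score = 0;
        for (const std::string& index : query)
            if (c.indices.count(index)) ++score;
        if (score > best_score) { best_score = score; best = &c; }
    }
    return best;
}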
A consequence of describing and interconnecting the software components under these assumptions is that the repository is constrained to be methodology-specific, rather than general-purpose. This is a trade-off that experience and experimentation to date show to be unavoidable.
Several languages from various fields like software engineering, databases and knowledge representation provide various means to describe individual features of components and component interconnections. Informally, two components are interconnected whenever there exists at least one resource that belongs to one and is accessed or derived by the other. All of these languages belong to a special class called Component Interconnection Languages (CILs) (Motschnig-Pitrik and Mittermeir, 1989). CILs are not restricted to the implementation level; they use the term component to refer to a segment of software specification independently of the level to which it belongs. Using a CIL to define components and their interconnections makes it possible to cover all of the aspects considered important to support when following a specific life-cycle approach and employing specific languages and techniques. The main feature of a CIL is that it is not just a language for programming, but fundamentally a language for packaging. When properly applied, this principle can provide much functionality to the application developer, without imposing inconvenient constraints or overhead. In the work reported, the language chosen for the description of software components is Telos (Mylopoulos et al., 1990). Telos is an E-R based language, satisfying all the requirements so as to be considered a CIL, designed specifically for information system development applications. It supports a number of structuring mechanisms which have been used by knowledge representation languages as well as semantic data models, allowing the designer of a knowledge base to introduce gradually and in an orderly fashion the detail that needs to be represented. These mechanisms are: classification, aggregation, and generalization. Using Telos, the general metaclasses of components and interconnections are defined as:

IndividualClass Component in M1_Class
end Component

IndividualClass Interconnection in M1_Class with
    necessary, single
        from_component : Component;
        to_component : Component
end Interconnection

The generic attributes associated with Interconnection are listed within the declarations using the syntax <attribute label> : <attribute class>. Thus from_component : Component is one such generic attribute, which allows instances of Interconnection to have attributes whose values are instances of the metaclass Component. Attribute categories in Telos, which are all user-defined, group generic attributes and impose constraints on their instances. In the previous declaration the necessary attribute category is a constraint to be enforced at all times for Interconnection instances, while single constrains its instances not to have multi-valued attributes.
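As a rough intuition for these attribute categories (our analogy only; Telos metaclasses are far richer than this), a C++ reader might picture necessary, single as a mandatory, single-valued field, enforced below with reference members that must be bound at construction time:

#include <string>

struct Component {
    std::string name;  // stand-in for a full component description
};

// "necessary, single": every Interconnection must always carry exactly
// one from_component and one to_component -- never zero, never many.
struct Interconnection {
    const Component& from_component;
    const Component& to_component;
    Interconnection(const Component& from, const Component& to)
        : from_component(from), to_component(to) {}
};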
Since the software environment at hand covers all the stages of the software life cycle, a first partition of the set of components involved in the framework includes: a) implementation components, b) design components, and c) specification components.

2.1. Implementation Components
Object-oriented languages have data abstraction and encapsulation constructs called packages, modules, or classes that enable one to define and enforce the boundaries separating the components of a software system. In this article these abstraction constructs are referred to as classes, or implementation components. Every class provides some resources to other classes in a system, and in turn may require some resources from other classes. These resources are a) the class parts, which denote the data representation of a component, and b) the class methods. Using the Telos syntax, an implementation component description is defined as:

IndividualClass IMPL_Component in M1_Class isA Component with
    necessary, single
        language : Programming_Language
    necessary
        filename : String
    attribute
        class_parts : ClassPart;
        methods : Method
end IMPL_Component

The attribute type ClassPart is defined as:

IndividualClass ClassPart in M1_Class isA IMPL_Resource with
    necessary, single
        part_type : Type
end ClassPart

The attribute type Method is defined in a similar way. These resources need to be further specialized in order to specify their interfaces. For instance, a protected class part is defined as:

IndividualClass ProtectedPart in M1_Class isA ClassPart
end ProtectedPart

Up to this point, the features of implementation components and their resources are modeled using Telos metaclasses. These metaclasses can be instantiated to Telos classes corresponding to the different implementation components that are stored in the repository. For instance, the instantiations corresponding to the implementation component array described in Example 1 are:

IndividualClass array in S_Class, IMPL_Component with
    filename : "array.cc"
    language : "C++"
    class_parts
        : array_size;
        : array_element
    methods
        : array_search
end array

IndividualClass array_size in S_Class, ProtectedPart with
    part_type : int
end array_size

2.2. Design Components
Object-oriented design is viewed as a software (de)composition technique. It may be defined as a technique which, unlike classical (functional) design, bases the modular decomposition of a software system on the classes of objects the system manipulates. The resources distributed among the design components are partitioned into abstract data types and operations. Using the Telos syntax, a design component is defined as:

IndividualClass DES_Component in M1_Class isA Component with
    attribute
        adt : AbstractDataType;
        operations : Operation
end DES_Component

Concerning the running example, the abstract data type table (as the array is an implementation of a table) may be identified, together with a search operation.
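The paper gives no C++ rendering of the design level; as an illustrative reading only, the table abstract data type could be pictured as an interface whose single operation resource is search, with the array class of Example 1 as one possible implementation:

// Illustrative sketch: the design-level ADT "table" and its operation.
template<class T>
class table {
public:
    virtual int search(const T& key) = 0;  // the ADT's operation resource
    virtual ~table() {}
};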
2.3. Specification Components
The ability to structure a specification is vital in any software engineering environment. A specification model in the proposed framework includes both functional and non-functional specifications. Functional specifications provide a description of the functions carried out by the corresponding implementation component, while non-functional specifications impose global constraints on the operation, performance, and efficiency of any proposed solution to the functional specifications model (Chung, 1990). When modeling a specification component, both specification types are considered to be resources, described as: where ATTRIBUTES is defined as the set of attributes of the form: