LECTURE NOTES ON
EMPIRICAL SOFTWARE ENGINEERING
Editors: Natalia Juristo, Ana M. Moreno
Series Editor: S. K. Chang
World Scientific
LECTURE NOTES ON
EMPIRICAL SOFTWARE ENGINEERING
SERIES ON SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING
Series Editor-in-Chief: S. K. Chang (University of Pittsburgh, USA)
Vol. 1
Knowledge-Based Software Development for Real-Time Distributed Systems Jeffrey J.-P. Tsai and Thomas J. Weigert (Univ. Illinois at Chicago)
Vol. 2
Advances in Software Engineering and Knowledge Engineering edited by Vincenzo Ambriola (Univ. Pisa) and Genoveffa Tortora (Univ. Salerno)
Vol. 3
The Impact of CASE Technology on Software Processes edited by Daniel E. Cooke (Univ. Texas)
Vol. 4
Software Engineering and Knowledge Engineering: Trends for the Next Decade edited by W. D. Hurley (Univ. Pittsburgh)
Vol. 5
Intelligent Image Database Systems edited by S. K. Chang (Univ. Pittsburgh), E. Jungert (Swedish Defence Res. Establishment) and G. Tortora (Univ. Salerno)
Vol. 6
Object-Oriented Software: Design and Maintenance edited by Luiz F. Capretz and Miriam A. M. Capretz (Univ. Aizu, Japan)
Vol. 7
Software Visualisation edited by P. Eades (Univ. Newcastle) and K. Zhang (Macquarie Univ.)
Vol. 8
Image Databases and Multi-Media Search edited by Arnold W. M. Smeulders (Univ. Amsterdam) and Ramesh Jain (Univ. California)
Vol. 9
Advances in Distributed Multimedia Systems edited by S. K. Chang, T. F. Znati (Univ. Pittsburgh) and S. T. Vuong (Univ. British Columbia)
Vol. 10
Hybrid Parallel Execution Model for Logic-Based Specification Languages Jeffrey J.-P. Tsai and Bing Li (Univ. Illinois at Chicago)
Vol. 11
Graph Drawing and Applications for Software and Knowledge Engineers Kozo Sugiyama (Japan Adv. Inst. Science and Technology)
Forthcoming titles:
Acquisition of Software Engineering Knowledge edited by Robert G. Reynolds (Wayne State Univ.)
Monitoring, Debugging, and Analysis of Distributed Real-Time Systems Jeffrey J.-P. Tsai, Steve J. H. Yong, R. Smith and Y. D. Bi (Univ. Illinois at Chicago)
LECTURE NOTES ON
EMPIRICAL SOFTWARE ENGINEERING
Editors
Natalia Juristo
Ana M. Moreno
Universidad Politecnica de Madrid, Spain
World Scientific
New Jersey • London • Singapore • Hong Kong
Published by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: Suite 202, 1060 Main Street, River Edge, NJ 07661 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
EMPIRICAL SOFTWARE ENGINEERING Copyright © 2003 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-02-4914-4
Printed by Fulsland Offset Printing (S) Pte Ltd, Singapore
Preface

The use of reliable and validated knowledge is essential in any engineering discipline. Software engineering, as a body of knowledge that guides the construction of software systems, likewise needs tried and tested knowledge whose application produces predictable results. There are different intermediate stages on the scale from knowledge considered as proven fact to beliefs or speculations: facts regarded as founded and accepted by all, undisputed statements, disputed statements, and conjectures or speculations. Where a statement ranks on this scale depends on its factuality status, and changes in that status are driven by the path from subjectivity to objectivity, a path paved by testing, that is, by empirical comparison with reality. For software development to really be an engineering discipline and to predictably build quality software, it has to make the transition from development based on speculation to development based on facts. Software engineering needs to lay aside perceptions, bias and market-speak and provide fair and impartial analysis and information.

Several stakeholders in the software developer community can contribute, in different ways, to developing the engineering knowledge of the software discipline. Firstly, researchers are responsible for testing their proposals, providing data that demonstrate what benefits they offer and identifying the conditions under which they are best applied. These validations are usually performed, first, under controlled conditions, by means of what are called laboratory tests or in vitro experiments, which distinguishes them from the real conditions in which these artefacts would be employed in an industrial environment. For example, software development techniques or tools would be applied on projects not subjected to market pressures and with developers of certain known characteristics. Additionally, researchers also have to replicate the studies performed by their peers, either to corroborate the results under the same application conditions or to provide additional information in new contexts. Far from being a straightforward process, replication is beset by a series of problems related to variations in the hypotheses, in the factors that affect the studies or in the data collected during the experiments.
Through this experimental process, the research community can provide practitioners with tested knowledge of the benefits of applying given artefacts and of their application conditions. However, the studies need to be repeated in industrial settings to assure that these benefits also hold in an industrial environment and that practitioners can use the respective artefacts knowing beforehand what results their application will have. These are in vivo experimental studies. This handbook deals with the above two levels of the process of empirical testing.¹

¹ There is another empirical study level, not addressed in this book, which the software industry should perform routinely: having introduced a software artefact into an industrial setting, practitioners are responsible for measuring and monitoring the improvement and changes that have taken place in their development processes and software products. This activity is commonly referred to as process control.

The first three chapters focus on the process of experimental validation in laboratories, whereas the other three address this process of validation in industrial settings. Let us now turn to the individual topics discussed in each of these two groups.

Verification techniques belong to one of the software development areas that have traditionally been subjected to empirical testing. Studies aiming to identify the strengths of the different error detection techniques have been undertaken since the 1980s. The first chapter of this handbook aims to identify the knowledge available about testing techniques after 20 years of experimentation. The authors find that, despite all the studies and replications in the area, it is not easy to extract coherent knowledge about these techniques owing to the dispersion of the testing techniques and response variables analysed. From the study presented in "Limitations of Empirical Testing Technique Knowledge", Juristo, Moreno and Vegas conclude that more coherence and coordination are needed in experimental efforts. It is not enough to run one-off experiments. Coherent and coordinated replication is just as important as running new experiments. Often, it would be more beneficial to establish a given item of knowledge through replication before trying to validate new items that build on it. A single experiment is better than none, but it is not sufficient to convert an item of knowledge into a validated fact.

The second chapter of this handbook provides guidelines on how to perform replications. The lessons from this chapter aim to contribute to creating a body of factual knowledge, overcoming the problems detected in chapter one. Replicating empirical studies is a complex process that calls for an exhaustive analysis of the studies to be replicated and a precise definition of the new working conditions. In "Replicated Studies: Building a Body of Knowledge about Software Reading Techniques", Shull, Carver, Maldonado, Travassos, Conradi and Basili analyse how to run coherent replications of empirical studies. The authors further discuss which variables should change from one study to another to generate more knowledge of the different tests. The proposed approach is applied to specific validation techniques, namely reading techniques.

Having replicated studies, one of the most important tasks is to extract reliable knowledge from them. In the third chapter of the book, "Combining Data from Reading Experiments in Software Inspections - A Feasibility Study", Wohlin, Petersson and Aurum analyse the process of information extraction to be performed when we have a coherent and coordinated set of experiments. Chapter three of the handbook illustrates the types of generalised results that can be derived from combining different studies. By way of an example, the authors present some results from the combination of studies found in the software inspections area.

The remainder of the book focuses on the second level of empirical testing, that is, on performing studies in industrial settings. It is important to note that, although considerable progress has been made recently in the field of empirical validation, there is still a lack of empirical work in industrial settings. Therefore, the three remaining chapters of the book aim to propose alternatives that make this task easier.

Chapters four and five propose two alternatives for making empirical testing more appealing to industry. In chapter four, "External Experiments - A Workable Paradigm for Collaboration Between Industry and Academia", Houdek proposes an empirical validation approach by means of which to share the workload of empirical testing in an industrial setting between academia and industry, assuring, on the one hand, that the result of the study represents real conditions and, on the other, relieving industry of the work and effort required to run this sort of study. In chapter five, "(Quasi-)Experimental Studies in Industrial Settings", Laitenberger and Rombach analyse the difficulties (for example, in terms of cost or control of the different variables that can have an impact on the studies) faced by industry when running empirical studies and present a practical approach that makes it workable for practitioners to perform this sort of study by relaxing the conditions of the experiment.

Finally, chapter six, "Experimental Validation of New Software Technology", presents an analysis of the empirical validation techniques in use by researchers and practitioners. Zelkowitz, Wallace and Binkley identify commonalities and differences among these methods and propose some guidelines to assure that both kinds of models can work in a complementary way.
Natalia Juristo
Ana M. Moreno
Universidad Politecnica de Madrid, Spain
Contents
Preface
Chapter 1  Limitations of Empirical Testing Technique Knowledge
           N. Juristo, A. M. Moreno and S. Vegas  1

Chapter 2  Replicated Studies: Building a Body of Knowledge about Software Reading Techniques
           F. Shull, J. Carver, G. H. Travassos, J. C. Maldonado, R. Conradi and V. R. Basili  39

Chapter 3  Combining Data from Reading Experiments in Software Inspections - A Feasibility Study
           C. Wohlin, H. Petersson and A. Aurum  85

Chapter 4  External Experiments - A Workable Paradigm for Collaboration Between Industry and Academia
           F. Houdek  133

Chapter 5  (Quasi-)Experimental Studies in Industrial Settings
           O. Laitenberger and D. Rombach  167

Chapter 6  Experimental Validation of New Software Technology
           M. V. Zelkowitz, D. R. Wallace and D. W. Binkley  229
CHAPTER 1
Limitations of Empirical Testing Technique Knowledge

N. Juristo, A. M. Moreno and S. Vegas
Facultad de Informatica, Universidad Politecnica de Madrid
Campus de Montegancedo, Boadilla del Monte, 28660 Madrid, Spain
Engineering disciplines are characterised by the use of mature knowledge by means of which they can achieve predictable results. Unfortunately, the type of knowledge used in software engineering can be considered to be of a relatively low maturity, and developers are guided by intuition, fashion or market-speak rather than by facts or undisputed statements proper to an engineering discipline. Testing techniques determine different criteria for selecting the test cases that will be used as input to the system under examination, which means that an effective and efficient selection of test cases conditions the success of the tests. The knowledge for selecting testing techniques should come from studies that empirically justify the benefits and application conditions of the different techniques. This chapter analyses the maturity level of the knowledge about testing techniques by examining existing empirical studies about these techniques. For this purpose, we classify testing technique knowledge according to four categories.

Keywords: Testing techniques; empirical maturity of testing knowledge; testing techniques empirical studies.
1. Introduction

Engineering disciplines are characterised by using mature knowledge that can be applied to output predictable results. (Latour and Woolgar, 1986) discuss a series of intermediate steps on a scale that ranges from the most mature knowledge, considered as proven facts, to the least mature knowledge, composed of beliefs or speculations: facts given as founded and
accepted by all, undisputed statements, disputed statements and conjectures or speculations. The path from subjectivity to objectivity is paved by testing or empirical comparison with reality. It is knowledge composed of facts and undisputed statements that engineering disciplines apply to output products with predictable characteristics.

Unfortunately, software development has been characterised from its origins by a serious want of empirical facts tested against reality that provide evidence of the advantages or disadvantages of using different methods, techniques or tools to build software systems. The knowledge used in our discipline can be considered to be relatively immature, and developers are guided by intuition, fashion or market-speak rather than by the facts or undisputed statements proper to an engineering discipline. This is equally applicable to software testing, and it stands in contrast to the importance of software quality control and assurance and, in particular, of software testing. Testing is the last chance during development to detect and correct possible software defects at a reasonable price. It is a well-known fact that it is a lot more expensive to correct defects that are detected later, during system operation (Davis, 1993). Therefore, it is of critical importance to apply knowledge that is mature enough to get predictable results during the testing process.

The selection of the testing techniques to be used is one of the circumstances during testing where objective and factual knowledge is essential. Testing techniques determine different criteria for selecting the test cases that will be used as input to the system under examination, which means that an effective and efficient selection of test cases conditions the success of the tests. The knowledge for selecting testing techniques should come from studies that empirically justify the benefits and application conditions of the different techniques. However, as authors like (Hamlet, 1989) have noted, formal and practical studies of this kind do not abound, as: (1) it is difficult to compare testing techniques, because they do not have a solid theoretical foundation; (2) it is difficult to determine what testing technique variables are of interest in these studies.

In view of the importance of having mature testing knowledge, this chapter intends to analyse the maturity level of the knowledge in this area. For this purpose, we have surveyed the major empirical studies on testing in order to analyse their results and establish the factuality and objectivity level of the body of testing knowledge regarding the benefits of some techniques over others. The maturity levels that we have used are as follows:
• Program use and laboratory faults pending/confirmed: An empirical study should be/has been performed to check whether the perception of the differences between the different testing techniques is subjective or can be objectively confirmed by measurement.
• Formal analysis pending/confirmed: Statistical analysis techniques should be/have been applied to the results output to find out whether the differences observed between the techniques are really significant and are not due to variations in the environment.
• Laboratory replication pending/confirmed: Other investigators should replicate/have replicated the same experiment to confirm that they get the same results and that they are not the fruit of any uncontrolled variation.
• Field study pending/confirmed: The study should be/has been replicated using real rather than toy programs or faults.

For this purpose, the chapter has been structured as follows. Section 2 presents the chosen approach for grouping the different testing studies. Sections 3, 4, 5, 6 and 7 focus on each of the study categories described in section 2. Each of these sections will first describe the studies considered depending on the testing techniques addressed in each study and the aspects examined by each one. Each study and its results are then analysed in detail and, finally, the findings are summarised. Finally, section 8 outlines the practical recommendations that can be derived from these studies, along with their maturity level, that is, how reliable these recommendations are. Section 8 also indicates what aspects should be addressed in future studies in order to increase the body of empirical knowledge on testing techniques.

The organisation of this chapter means that it can be read differently by different audiences. Software practitioners interested in the practical results of the application of testing techniques will find section 8, which summarises the practical recommendations on the use of different testing techniques and their confidence level, more interesting. Researchers interested in raising the maturity of testing knowledge will find the central sections of this chapter, which contain a detailed description of the different studies and their advantages and limitations, more interesting. The replication of particular aspects of these studies to overcome the above-mentioned limitations will contribute to providing useful knowledge on testing techniques. Researchers will also find a quick reference to aspects of testing techniques in need of further investigation in section 8.
2. Classification of Testing Techniques

Software testing is the name that identifies a set of corrective practices (as opposed to the preventive practices applied during the software construction process), whose goal is to determine software systems quality. In testing, quality is determined by analysing the results of running the software product (there is another type of corrective measures, known as static analysis, that examine the product under evaluation at rest and which are studied in other chapters of this book).

Testing techniques determine different criteria for selecting the test cases that are to be run on the software system. These criteria can be used to group the testing techniques into families. Accordingly, techniques belonging to one and the same family are similar as regards the information they need to generate test cases (source code or specifications) or the aspect of code to be examined by the test cases (control flow, data flow, typical errors, etc.). This is not the place to describe the features of testing techniques or their families, as this information can be gathered from the classical literature on testing techniques, like, for example, (Beizer, 1990) and (Myers, 1979). For readers not versed in the ins and outs of each testing technique family, however, we will briefly mention each family covered in this chapter, the techniques of which it is composed, the information it requires and the aspect of code it examines:

• Random Testing Techniques. The random testing techniques family is composed of the oldest and most intuitive techniques. This family of techniques proposes randomly generating test cases without following any pre-established guidelines. Nevertheless, pure randomness seldom occurs in reality, and the other two variants of the family, shown in Table 1, are the most commonly used (a brief illustrative sketch follows Table 1).

• Pure random: Test cases are generated at random, and generation stops when there appear to be enough.
• Guided by the number of cases: Test cases are generated at random, and generation stops when a given number of cases has been reached.
• Error guessing: Test cases are generated guided by the subject's knowledge of the typical errors usually made when programming; generation stops when they all appear to have been covered.
Table 1. Random techniques family.
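To make the selection criteria in Table 1 concrete, the following Python sketch (not taken from the chapter; the component under test, the input range and the stopping limit are illustrative assumptions) shows how a count-guided random strategy differs from error guessing only in how the inputs are chosen and when generation stops.

```python
import random

def program_under_test(x: int) -> int:
    # Hypothetical component being tested; stands in for any unit.
    return abs(x) + 1

def random_test_cases(max_cases: int, seed: int = 0) -> list[int]:
    """Random testing guided by the number of cases: stop after max_cases inputs."""
    rng = random.Random(seed)
    return [rng.randint(-1000, 1000) for _ in range(max_cases)]

# Error guessing: inputs chosen from the tester's knowledge of typical mistakes
# (zero, negatives, extreme values); the list stops when they seem covered.
error_guessing_cases = [0, -1, 2**31 - 1]

def run_suite(cases: list[int]) -> None:
    for x in cases:
        # Oracle for this toy unit: the result should always be strictly positive.
        assert program_under_test(x) > 0, f"failure revealed by input {x}"

if __name__ == "__main__":
    run_suite(random_test_cases(max_cases=50))
    run_suite(error_guessing_cases)
```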
• Functional Testing Techniques. This family of techniques proposes an approach in which the program specification is used to generate test cases. The component to be tested is viewed as a black box, whose behaviour is determined by studying its inputs and associated outputs. Of the set of possible system inputs, this family considers a subset formed by the inputs that cause anomalous system behaviour. The key for generating the test cases is to find the system inputs that have a high probability of belonging to this subset. For this purpose, the technique divides the system input set into subsets termed equivalence classes, where each class element behaves similarly, so that all the elements of a class will be inputs that cause either anomalous or normal system behaviour. The techniques of which this family is composed (Table 2) differ from each other in terms of the rigorousness with which they cover the equivalence classes (a brief illustrative sketch follows Table 2).

• Equivalence partitioning: A test case is generated for each equivalence class found. The test case is selected at random from within the class.
• Boundary value analysis: Several test cases are generated for each equivalence class, one that belongs to the inside of the class and as many as necessary to cover the limits (or boundaries) of the class.
Table 2. Functional testing technique family.
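As an illustration (the specification of a hypothetical grading function is an assumption, not an example from the chapter), the sketch below derives test cases for the two techniques in Table 2: one representative per equivalence class, plus values on and around each class boundary.

```python
def grade(score: int) -> str:
    """Hypothetical specification: 0-49 is 'fail', 50-100 is 'pass',
    anything outside 0-100 is rejected as invalid input."""
    if not 0 <= score <= 100:
        raise ValueError("score out of range")
    return "pass" if score >= 50 else "fail"

# Equivalence partitioning: one test case per class, picked from inside the class.
equivalence_cases = {
    "invalid low": -10,   # class: score < 0
    "fail": 25,           # class: 0 <= score <= 49
    "pass": 75,           # class: 50 <= score <= 100
    "invalid high": 150,  # class: score > 100
}

# Boundary value analysis: supplement the classes with their limits.
boundary_cases = [-1, 0, 49, 50, 100, 101]
```

In this sketch the boundary suite is larger than the partitioning suite, which mirrors the ordering of rigour described in the text.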
• Control Flow Testing Techniques. Control flow testing techniques require knowledge of source code. This family selects a series of paths¹ throughout the program, thereby examining the program control model. The techniques in this family vary as to the rigour with which they cover the code. Table 3 shows the techniques of which this family is composed, giving a brief description of the coverage criterion followed, in ascending order of rigorousness.

¹ A path is a code sequence that goes from the start to the end of the program.

• Sentence coverage: The test cases are generated so that all the program sentences are executed at least once.
• Decision coverage (branch testing): The test cases are generated so that all the program decisions take the value true or false.
• Condition coverage: The test cases are generated so that all the conditions (predicates) that form the logical expression of the decision take the value true or false.
• Decision/condition coverage: Decision coverage is not always achieved with condition coverage; here, the cases generated with condition coverage are supplemented to achieve decision coverage.
• Path coverage: Test cases are generated to execute all program paths. This criterion is not workable in practice.
Table 3. Control flow testing technique family.

• Data Flow Testing Techniques. Data flow testing techniques also require knowledge of source code. The objective of this family is to select program paths to explore sequences of events related to the data state. Again, the techniques in this family vary as to the rigour with which they cover the code variable states. Table 4 reflects the techniques, along with their associated coverage criterion (a brief illustrative sketch follows Table 4).

• All-definitions: Test cases are generated to cover each definition of each variable for at least one use of the variable.
• All-c-uses/some-p-uses: Test cases are generated so that there is at least one path from each variable definition to each c-use² of the variable. If there are variable definitions that are not covered, use p-uses.
• All-p-uses/some-c-uses: Test cases are generated so that there is at least one path from each variable definition to each p-use of the variable. If there are variable definitions that are not covered, use c-uses.
• All-c-uses: Test cases are generated so that there is at least one path from each variable definition to each c-use of the variable.
• All-p-uses: Test cases are generated so that there is at least one path from each variable definition to each p-use of the variable.
• All-uses: Test cases are generated so that there is at least one path from each variable definition to each use of the definition.
• All-du-paths: Test cases are generated for all the possible paths from each definition of each variable to each use of the definition.
• All-dus: Test cases are generated for all the possible executable paths from each definition of each variable to each use of the definition.
Table 4. Data flow testing techniques.

² There is said to be a c-use of a variable when the variable appears in a computation (right-hand side of an assignation). There is said to be a p-use of a variable when the variable appears as a predicate of a logical expression.
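The following sketch (an illustration over assumed code, not an example from the studies surveyed here) contrasts the two white-box families above: a suite that achieves decision (branch) coverage of a small function, and the def-use pairs that an all-uses suite must additionally exercise.

```python
def classify(x: int, limit: int) -> str:
    total = x                  # definition of `total`, c-use of `x`
    if x > limit:              # p-use of `x` and `limit`
        total = limit          # redefinition of `total`
    return f"total={total}"    # c-use of `total`

# Decision coverage: one case per outcome of the decision `x > limit`.
decision_suite = [(5, 3), (1, 3)]      # True branch, False branch

# All-uses: every definition of `total` must reach each of its uses.
#   def `total = x`      -> c-use in the return (reached only when x <= limit)
#   def `total = limit`  -> c-use in the return (reached only when x > limit)
all_uses_suite = [(1, 3), (5, 3)]

for suite in (decision_suite, all_uses_suite):
    for args in suite:
        print(classify(*args))
```

In this tiny function the two suites happen to coincide; for larger programs the data flow criteria typically demand more, and differently chosen, cases than branch coverage.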
• Mutation Testing Techniques. Mutation testing techniques are based on modelling typical programming faults by means of what are known as mutation operators (dependent on the programming language). Each mutation operator is applied to the program, giving rise to a series of mutants (programs that are exactly the same as the original program, apart from one modified sentence, originated precisely by the mutation operator). Having generated the set of mutants, test cases are generated to examine the mutated part of the program. After generating test cases to cover all the mutants, all the possible faults should, in theory, be accounted for (in practice, however, coverage is confined to the faults modelled by the mutation operators). The problem with the techniques that belong to this family is scalability. A mutation operator can generate several mutants per line of code, so there will be a sizeable number of mutants for long programs. The different techniques within this family aim to improve the scalability of standard (or strong) mutation to achieve greater efficiency. Table 5 shows the techniques of which this family is composed and gives a brief description of the mutant selection criterion (a brief illustrative sketch follows Table 5).

• Strong (standard) mutation: Test cases are generated to cover all the mutants generated by applying all the mutation operators defined for the programming language in question.
• Selective (or constrained) mutation: Test cases are generated to cover all the mutants generated by applying some of the mutation operators defined for the programming language. This gives rise to selective mutation variants depending on the selected operators, like, for example, 2-, 4- or 6-selective mutation (depending on the number of mutation operators not taken into account) or abs/ror mutation, which only uses these two operators.
• Weak mutation: Test cases are generated to cover a given percentage of the mutants generated by applying all the mutation operators defined for the programming language in question. This gives rise to weak mutation variants, depending on the percentage covered, for example, randomly selected 10% mutation, ex-weak, st-weak, bb-weak/1, or bb-weak/n.
Table 5. Mutation testing technique family.
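As a minimal illustration of the mutation idea (the function and the operator choice are assumptions, not taken from the studies discussed in this chapter), the sketch below applies a relational-operator-replacement style mutation by hand and shows which test cases kill the resulting mutant.

```python
def max_of(a: int, b: int) -> int:
    return a if a > b else b          # original program

def max_of_mutant(a: int, b: int) -> int:
    return a if a < b else b          # mutant: `>` replaced by `<`

def kills_mutant(a: int, b: int) -> bool:
    """A test case kills the mutant if the two versions produce different outputs."""
    return max_of(a, b) != max_of_mutant(a, b)

print(kills_mutant(3, 3))   # False: equal inputs cannot distinguish the two versions
print(kills_mutant(2, 3))   # True: this case kills the mutant
```

A real mutation tool would generate many such mutants mechanically, one per applicable operator and location, which is exactly the scalability problem the selective and weak variants try to mitigate.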
Our aim is to review the empirical studies designed to compare testing techniques in order to identify what knowledge has been empirically validated. We have grouped the empirical studies reviewed into several subsets taking into account which techniques they compare:

• Intra-family studies, which compare techniques belonging to the same family to find out the best criterion, that is, which technique of all the family members should be used. We have identified:
  o Studies on the data flow testing techniques family.
  o Studies on the mutation testing techniques family.
• Inter-family studies, which study techniques belonging to different families to find out which family is better, that is, which type of techniques should be used. We have identified:
  o Comparative studies between the control flow and data flow testing techniques families.
  o Comparative studies between the mutation and data flow testing techniques families.
  o Comparative studies between the functional and control flow testing techniques families.

In the following sections, we examine all these sets of studies, together with the empirical results obtained.
3. Studies on the Data Flow Testing Techniques Family

• (Weyuker, 1990) studied criterion compliance and the number of test cases generated for the all-c-uses, all-p-uses, all-uses and all-du-paths techniques.
• (Bieman & Schultz, 1992) studied the number of test cases generated for the all-du-paths technique.
Table 6. Studies on data flow testing techniques.

The objective of this series of studies is to analyse the differences between the techniques within the data flow testing techniques family. Table 6 shows
which aspects were studied for which testing techniques. For example, Weyuker analysed the criterion compliance and the number of test cases generated by four techniques (all-c-uses, all-p-uses, all-uses and all-du-paths), whereas Bieman and Schultz studied the number of test cases generated for the all-du-paths technique alone.

(Weyuker, 1990) (see also (Weyuker, 1988)) conducts a quantitative study to check the theoretical relationship of inclusion among the test case generation criteria followed by each technique. This theoretical relationship can be represented as follows:

all-du-paths => all-uses
all-uses => all-c-uses
all-uses => all-p-uses
all-p-uses and all-c-uses cannot be compared

which would read as follows: the test cases that comply with the all-du-paths criterion satisfy the all-uses criterion; the test cases that comply with the all-uses criterion satisfy the all-c-uses criterion, and so on.

Weyuker's empirical results (obtained by studying twenty-nine programs, taken from a book on Pascal, with five or more decision sentences) reveal that the following generally holds:

all-uses => all-du-paths
all-p-uses => all-c-uses
all-p-uses => all-uses

So, the author establishes an inverse relationship with respect to the theory between all-uses and all-du-paths and between all-p-uses and all-uses. That is, she concludes that, in practice, the test cases generated to meet the all-uses criterion also normally comply with all-du-paths, and the test cases generated by all-p-uses also comply with all-uses. According to these results, it would suffice with respect to criterion compliance to use all-uses instead of all-du-paths and all-p-uses instead of all-uses, as the test cases that meet one criterion will satisfy the other. However, the number of test cases generated by each criterion needs to be examined to account for the cost (and not only the benefits) of these relationships. Analysing this variable, Weyuker gets the following relationship:
all-c-uses < all-p-uses < all-uses < all-du-paths

which would read as: more test cases are generated to comply with all-p-uses than to meet all-c-uses, and fewer than to satisfy all-uses and all-du-paths.

Bearing in mind the results concerning the number of generated test cases and criteria compliance, we could deduce that it is better to use all-p-uses than all-uses and better to use all-uses than all-du-paths, as the former generate fewer test cases and generally meet the other criterion. With respect to all-c-uses, although it generates fewer test cases than all-p-uses, the test cases generated by all-c-uses do not meet the criterion of all-p-uses, which means that it does not yield equivalent results to all-p-uses.

Note that the fact that the set of test cases generated for one criterion is bigger than for another does not necessarily mean that the technique detects more faults, as defined in other studies examined later. And the same applies to the relationship of inclusion: the fact that a criterion includes another does not say anything about the number of faults it can detect.

Another of Weyuker's results is that the number of test cases generated by all-du-paths, although exponential in theory, is in practice linear with respect to the number of program decisions. (Bieman and Schultz, 1992) partly corroborate these results using a real industrial software system, deducing that the number of test cases required to meet this criterion is reasonable. Bieman and Schultz indicate that the number of cases in question appears to depend on the number of lines of code, but they do not conduct a statistical analysis to test this hypothesis, nor do they establish what relationship there is between the number of lines of code and the number of generated test cases.

The results yielded by this group of studies have the following limitations:
• Weyuker uses relatively simple toy programs, which means that the results cannot be directly generalised to real practice.
• Bieman and Schultz do not conduct a statistical analysis of the extracted data, and their study is confined to a qualitative interpretation of the data.
• The response variable used by Weyuker and by Bieman and Schultz is the number of test cases generated. This characteristic merits analysis insofar as the fewer test cases are generated, the fewer are run and the fewer need to be maintained. However, it should be supplemented by a study of case effectiveness, which is a variable that better describes what is expected of the testing techniques.
• What the number of test cases generated by all-du-paths depends on needs to be examined in more detail, as one study says it is related to the number of decisions and the other to the number of lines of code, although neither further specifies this relationship.
However, despite these limitations, the following conclusions can be drawn:
• All-p-uses should be used instead of all-uses, and all-uses instead of all-du-paths, as they generate fewer test cases and generally cover the test cases generated by the other criteria.
• It is not clear that it is better to use all-c-uses instead of all-p-uses, as, even though all-c-uses generates fewer test cases, there is no guarantee that the generated test cases meet the criterion imposed by all-p-uses.
• Both Weyuker, using toy programs, and Bieman and Schultz, using industrial software, appear to agree that, contrary to testing theory, the all-du-paths technique is usable in practice, since it does not generate too many test cases.

These results are summarised in Table 7.
4. Studies on the Mutation Testing Techniques Family

This family is examined in three papers, which look at types of mutation that are less costly than traditional mutation. Generally, these papers aim to ascertain what the costs and benefits of using different mutation testing techniques are. These studies, along with the characteristics they examine and the techniques they address, are shown in Table 8.

As shown in Table 8, the efficiency of these techniques is measured differently. So, whereas (Offut and Lee, 1994) (see also (Offut and Lee, 1991)) and (Offut et al., 1996) (see also (Offut et al., 1993)) measure efficiency as the percentage of mutants killed by each technique, (Wong and Mathur, 1995) measure it as the percentage of generated test case sets that detect at least one fault. On the other hand, all the studies consider the cost of the techniques, identified as the number of generated test cases and/or the number of generated mutants.

The results of the three studies appear to corroborate each other as regards mutation being much more costly than any of its variants, while
there does not appear to be too drastic a loss of effectiveness for the variants as compared with strong mutation. After analysing 11 subroutines of no more than 30 LOC, Offut and Lee indicate in this respect that, for non-critical applications, it is recommendable to use weak as opposed to strong mutation, because it generates fewer test cases and kills a fairly high percentage of mutants. In particular, they suggest that bb-weak/1 and st-weak kill a higher percentage of mutants, but they also generate more test cases.
Criteria compliance:
• (Weyuker, 1990): All-p-uses includes all-uses; all-uses includes all-du-paths.

Number of test cases generated:
• (Weyuker, 1990): All-c-uses generates fewer test cases than all-p-uses; all-p-uses generates fewer test cases than all-uses; all-uses generates fewer test cases than all-du-paths; the number of test cases generated by all-du-paths is linear as regards the number of decisions in the program, rather than exponential as stated in theory.
• (Bieman & Schultz, 1992): The number of test cases generated with all-du-paths is not exponential, as stated in theory, and is reasonable; it seems to depend on the number of lines of code.

PRACTICAL RESULTS:
• All-p-uses should be used instead of all-uses, and all-uses instead of all-du-paths, as they generate fewer test cases and generally cover the test cases generated by the other criteria.
• It is not clear that it is better to use all-c-uses instead of all-p-uses, as, even though all-c-uses generates fewer test cases, coverage is not assured.
• Contrary to testing theory, the all-du-paths technique is usable in practice, since it does not generate too many test cases.

LIMITATIONS:
• It remains to ratify the laboratory results of Weyuker's study in industry.
• The results of Bieman and Schultz's study have to be corroborated using formal statistical analysis techniques.
• Technique effectiveness should be studied, as the fact that the test cases generated with one criterion cover the other criteria is not necessarily related to effectiveness.
• What the number of test cases generated by all-du-paths depends on should be studied in more detail, as one study says it depends on the number of decisions and the other on the number of lines of code.
Table 7. Results of the studies on data flow testing techniques.
• (Offut & Lee, 1994) measured the % of mutants killed³ by each technique and the number of generated test cases, for strong (standard) mutation and the weak mutation variants ex-weak, st-weak, bb-weak/1 and bb-weak/n.
• (Offut et al., 1996) measured the % of mutants killed by each technique, the number of generated test cases and the number of generated mutants, for strong (standard) mutation and 2-, 4- and 6-selective mutation.
• (Wong & Mathur, 1995) measured the % of generated test case sets that detect at least 1 fault and the number of generated mutants, for strong (standard) mutation, randomly selected 10% mutation and constrained (abs/ror) mutation.
Table 8. Studies on mutation testing techniques.
Furthermore, Offut et al. analyse 10 programs (9 of which were studied by Offut and Lee, of no more than 48 LOC) and find that the percentage of strong mutation mutants killed by each selective variant is over 99% and is, in some cases, 100%. Therefore, the authors conclude that selective mutation is an effective alternative to strong mutation. Additionally, selective mutation cuts test costs substantially, as it reduces the number of generated mutants.

As regards Wong and Mathur, they compare strong mutation with two selective variants (randomly selected 10% mutation and constrained mutation, also known as abs/ror mutation). They find, on 10 small programs, that strong or standard mutation is equally as or more effective than either of the other two techniques. However, these results are not supported statistically, which means that it is impossible to determine whether or not this difference in effectiveness is significant.
³ A mutant is killed when a test case causes it to fail.
Finally, Wong and Mathur refer to other studies they have performed, which determined that abs/ror mutation and 10% mutation generate fewer test cases than strong mutation. This gain is offset by a loss of less than 5% in terms of coverage as compared with strong mutation. This could mean that many of the faults are made in expressions and conditions, which are the aspects evaluated by abs/ror mutation. For this reason, and for non-critical applications, they suggest the possibility of applying abs/ror mutation to get good cost/benefit performance (less time and effort with respect to a small loss of coverage).
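A sketch of the bookkeeping behind these comparisons (the mutant names and the kill matrix are invented for illustration, not data from the cited studies): the mutation score of a test suite is the fraction of mutants it kills, and a selective variant such as abs/ror simply restricts which operators' mutants are counted.

```python
# killed[m] is the set of test cases that kill mutant m (hypothetical data);
# the prefix of each mutant name encodes the operator that produced it.
killed = {
    "abs_1": {"t1"}, "abs_2": {"t2", "t3"},
    "ror_1": {"t1", "t3"}, "ror_2": set(),
    "aor_1": {"t2"},
}

def mutation_score(suite: set[str], operators: set[str] | None = None) -> float:
    """Fraction of (optionally operator-restricted) mutants killed by `suite`."""
    mutants = [m for m in killed
               if operators is None or m.split("_")[0] in operators]
    dead = sum(1 for m in mutants if killed[m] & suite)
    return dead / len(mutants)

suite = {"t1", "t3"}
print(mutation_score(suite))                             # strong mutation: 0.6
print(mutation_score(suite, operators={"abs", "ror"}))   # abs/ror selective: 0.75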
In summary, the conclusions reached by this group of studies are:
• Standard mutation appears to be more effective, but is also more costly than any of the other techniques studied.
• The mutation variants provide similar, although slightly lower, effectiveness and are less costly (they generate fewer mutants and, therefore, fewer test cases), which means that the different mutation variants could be used instead of strong mutation for non-critical systems.

However, the following limitations have to be taken into account:
• The programs considered in these studies are not real, which means that the results cannot be generalised to industrial cases, and a replication in this context is required to get more reliable results.
• Offut et al. and Wong and Mathur do not use formal techniques of statistical analysis, which means that their results are questionable.
• Additionally, it would be interesting to compare the standard mutation variants with each other, and not only with standard mutation, to find out which are more effective from the cost and performance viewpoint.
• Furthermore, the number of mutants that a technique kills is not necessarily a good measure of effectiveness, because it is not explicitly related to the number of faults the technique detects.

Table 9 shows the results of this set of studies.
% mutants killed by each technique:
• (Offut & Lee, 1994): The percentage of mutants killed by weak mutation is high.
• (Offut et al., 1996): Selective mutation kills more than 99% of the mutants generated by strong mutation.

No. of test cases generated:
• (Offut & Lee, 1994): Weak mutation generates fewer test cases than strong mutation; st-weak/1 and bb-weak/1 generate more test cases than ex-weak/1 and bb-weak/n.
• (Offut et al., 1996): Although not explicitly stated, strong mutation generates more test cases than selective mutation.

No. of mutants generated:
• (Offut et al., 1996): Selective mutation generates fewer mutants than strong mutation.
• (Wong & Mathur, 1995): 10% mutation generates fewer mutants than abs/ror mutation (approx. half); abs/ror generates from 50 to 100 times fewer mutants than standard mutation.

% of generated sets that detect at least 1 fault:
• (Wong & Mathur, 1995): Standard mutation is more effective than 10% mutation in 90% of cases and equal in 10%; standard mutation is more effective than abs/ror in 40% of the cases and equal in 60%; abs/ror is equally or more effective than 10% mutation in 90% of the cases.

PRACTICAL RESULTS:
• Where time is a critical factor, it is better to use weak as opposed to standard mutation, as it generates fewer test cases and effectiveness is approximately the same.
• Where time is a critical factor, it is better to use selective (ex-weak/1 and bb-weak/n) as opposed to standard mutation, as it generates fewer mutants (and therefore fewer test cases) and its effectiveness is practically the same.
• Where time is a critical factor, it is better to use 10% selective as opposed to standard mutation, although there is some loss in effectiveness, because it generates much fewer test cases. In intermediate cases, it is preferable to use abs/ror mutation, because, although it generates more cases (from 50 to 100 times more), it raises effectiveness by 7 points. If time is not a critical factor, it is preferable to use standard mutation.

LIMITATIONS:
• It remains to ratify the laboratory results of these studies in industry.
• The results of the studies by Offut et al. and Wong and Mathur should be corroborated using formal techniques of statistical analysis.
• It remains to compare the variants of strong mutation with each other.
• The studies should be repeated with another measure of effectiveness, as the number of mutants killed by a technique is not necessarily a good measure of effectiveness.
Table 9. Results of the studies on mutation testing techniques.
5. Comparative Studies Between the Data Flow, Control Flow and Random Testing Techniques Families

The objective of this series of studies is to analyse the differences between three families, selecting, for this purpose, given techniques from each family. The selected techniques are the branch testing (decision coverage) control flow technique, all-uses and all-dus within the data flow family, and random testing in the random testing technique family. Table 10 shows the studies considered, the aspects studied by each one and for which testing techniques.
• (Frankl & Weiss, 1993) studied the number of test cases generated and the number of sets with at least 1 fault over the number of sets generated, for all-uses, branch testing (all-edges) and random (null) testing.
• (Hutchins et al., 1994) studied the number of test cases generated and the number of sets with at least 1 fault over the number of sets generated, for branch testing (all-edges), all-dus (modified all-uses) and random (null) testing.
• (Frankl & Iakounenko, 1998) studied the number of sets with at least 1 fault over the number of sets generated, for all-uses, branch testing (all-edges) and random (null) testing.
Table 10. Comparative studies of the data flow, control flow and random testing technique families.
(Frankl and Weiss, 1993) (see also (Frankl and Weiss, 1991a) and (Frankl and Weiss, 1991b)) and (Frankl and Iakounenko, 1998) study the effectiveness of the all-uses, branch testing and random testing techniques in terms of the probability of a set of test cases detecting at least one fault, measured as the number of sets of test cases that detect at least one fault over the total number of sets generated.

Frankl and Weiss use nine toy programs containing one or more faults to measure technique effectiveness. The results of the study indicate that the probability of a set of test cases detecting at least one fault is greater (from a statistically significant viewpoint) for all-uses than for all-edges in five of
the nine cases. Additionally, all-uses behaves better than random in six of the nine cases, and all-edges behaves better than random testing in five of the nine cases.

Analysing the five cases where all-uses behaves better than all-edges, Frankl and Weiss find that all-uses provides a greater probability of a set of cases detecting at least one fault with sets of the same size in four of the five cases. Also, analysing the six cases where all-uses behaves better than random testing, the authors find that all-uses provides a greater probability of a set of cases detecting at least one fault in four of these cases. That is, all-uses has a greater probability of detecting a fault not because it works with sets containing more test cases than all-edges or random testing, but thanks to the very strategy of the technique. Note that the difference in the behaviour of the techniques (of nine programs, there are five for which a difference is observed and four for which none is observed for all-uses, and six out of nine for random testing) is not statistically significant, which means that it cannot be claimed outright that all-uses is more effective than all-edges or random testing.

Analysing the five cases where all-edges behaves better than random testing, Frankl and Weiss find that in no case does all-edges provide a greater probability of a set of cases detecting at least one fault with sets of the same size. That is, in this case, all-edges has a greater probability of detecting a fault than random testing because it works with larger sets.

Frankl and Weiss also discern a relationship between technique effectiveness and coverage, but they do not study this connection in detail. Frankl and Iakounenko, however, do study this relationship and, as mentioned above, again define effectiveness as the probability of a set of test cases finding at least one fault, measured as the number of sets of test cases that detect at least one fault over the total number of generated sets. Frankl and Iakounenko deal with eight versions of an industrial program, each containing a real fault. Although the study data are not analysed statistically and its conclusions are based on graphical representations of the data, the qualitative analysis indicates that, as a general rule, effectiveness is greater when coverage is higher, irrespective of the technique. However, there are occasions where effectiveness is not 1, which means that some faults are not detected even when coverage is 100%. This means that coverage increases the probability of finding a fault, but it does not guarantee that the fault is detected. Additionally, both all-uses and all-edges appear to behave similarly in terms of effectiveness, which is a similar result to what Frankl and Weiss found. For high coverage levels, both all-uses and
all-edges behave much better than random testing. Indeed, Frankl and Weiss believe that the behaviour of random testing is unrelated to coverage. Hence, as random testing does not improve with coverage, it deteriorates with respect to the other two. Note that even when technique coverage is close to 100%, there are programs for which the technique's fault detection effectiveness is not close to 1. This leads us to suspect that there are techniques that work better or worse depending on the fault type. The better techniques for a given fault type would be the ones for which effectiveness is 1, whereas the poorer ones would be the techniques for which effectiveness is not 1, even though coverage is optimum. However, Frankl and Weiss do not further research this relationship.

The study by (Hutchins et al., 1994) compares all-edges with all-dus and with random testing. As shown in Table 10, the authors study the number of test cases generated by each technique and the effectiveness of the techniques, again measured as the number of sets that detect at least one fault over the total number of sets, as well as the relationship to coverage. Hutchins et al. consider seven toy programs (with a number of lines of code from 138 to 515), of which they generate versions with just one fault. The results of the study show that the greater coverage is, the more effective the techniques are. While there is no evidence of a significant difference in effectiveness between all-edges and all-dus, there is for random testing.

Furthermore, the authors study the sizes of the test case sets generated by all-edges and all-dus, and how big a set of cases generated using random testing would have to be for a given coverage interval to be equally effective. They reach the conclusion that the sizes generated by all-edges and all-dus are similar and that the increase in the size of a set of cases generated by random testing can vary from 50% to 160% for high coverages (over 90%). The authors further examine the study, analysing the fault types detected by each technique. They find that each technique detects different faults, which means that, although the effectiveness of all-edges and all-dus is similar, the application of one instead of the other is not an option, as they find different faults.
The limitations discovered in these studies are:
• Frankl and Weiss and Hutchins et al. use relatively simple, non-industrial programs, which means that the results cannot be directly generalised to real practice.
• Of the three studies, Frankl and Iakounenko do not run a statistical analysis of the extracted data, which means that the significance of the results is questionable.
• The evaluation of the effectiveness of the techniques studied, measured as the probability of detecting at least one fault in the programs, is not useful in real practice. Measures of effectiveness like, for example, the number of faults detected over the total number of faults are more attractive in practice (a short sketch contrasting the two measures follows this list).
• Besides technique effectiveness, Frankl and Weiss and Frankl and Iakounenko should also study technique complementarity (as in Hutchins et al.), in order to be able to determine whether or not technique application could be considered exclusive, apart from extracting results regarding similar technique effectiveness levels.

As regards the conclusions, it can be said that:
• There does not appear to be a difference between all-uses, all-edges and random testing as regards effectiveness from the statistical viewpoint, as the number of programs in which one comes out on top of the other is not statistically significant. However, from the practical viewpoint, random testing is easier to satisfy than all-edges and, in turn, all-edges is easier to satisfy than all-uses. On the other hand, all-uses is better than all-edges and random testing as a technique, whereas all-edges is better than random because it generates more test cases. It follows from the results of the above studies that, in the event of time constraints, the use of the random testing technique can be relied upon to yield an effectiveness similar to all-uses in 50% of the cases. Where testing needs to be exhaustive, the application of all-uses provides assurance, as, in the other half of the cases, this criterion yielded more efficient results thanks to the actual technique and not because it generated more test cases.
• A logical relationship between coverage and effectiveness was also detected (the greater the coverage, the greater the effectiveness). However, effectiveness is not necessarily optimum in all cases even if maximum coverage is achieved. Therefore, it would be interesting to analyse in detail the faults introduced in the programs in which the effectiveness of the techniques is below optimum, as a dependency could possibly be identified between the fault types and the techniques that detect these faults.
• Hutchins et al. discover a direct relationship between coverage and effectiveness for all-uses and all-edges, whereas no such relationship exists for random testing. For high coverage levels, the effectiveness of all-uses and all-edges is similar.
Taking into account these limitations, however, we do get some interesting results, which have been summarised in Table 11.
6. Comparisons between the Mutation and the Data-Flow Testing Techniques Families We have found two studies that compare mutation with data flow techniques. These studies, along with the characteristics studied and the techniques addressed, are shown in Table 12. (Frankl et al., 1997) (see also (Frakl et al, 1994)) compare the effectiveness of mutation testing and all-uses. They study the ratio between the sets of test cases that detect at least one fault vs. total sets for these techniques. The effectiveness of the techniques is determined at different coverage levels (measured as the percentage of mutants killed by each technique). The results for 9 programs at high coverage levels with a number of faults of no more than 2 are as follows: • Mutation is more effective for 5 of the 9 cases • All-uses is more effective than mutation for 2 of the 9 cases • There is no significant difference for the other two cases. With regard to (Wong and Mathur, 1995), they compare strong mutation, as well as two variants of strong mutation (randomly selected 10% and constrained mutation, also known as abs/ror mutation), with all-uses, studying again the ratio between the sets of test cases that detect at least 1 fault vs. total sets. For this purpose, the authors study 10 small programs, finding that the mutation techniques behave similarly to all-uses.
ASPECT STUDIED: Number of test cases generated
- (Frankl & Weiss, 1993): All-uses is a better technique than all-edges and random by the technique itself. All-edges is better than random because it generates more test cases.
- (Hutchins et al., 1994): All-edges and all-dus generate approx. the same number of test cases. To achieve the same effectiveness as all-edges and all-dus, random has to generate from 50% to 160% more test cases.

ASPECT STUDIED: No. of sets detecting at least 1 fault / no. of sets generated
- (Frankl & Weiss, 1993): There is no convincing result regarding all-uses being more effective than all-edges and random: in approximately 50% of the cases, all-uses is more effective than all-edges and random, and all-edges is more effective than random; in approximately 50% of the cases, all-uses, all-edges and random behave equally.
- (Hutchins et al., 1994): The effectiveness of all-edges and all-dus is similar, but they find different faults. Maximum coverage does not guarantee that a fault will be detected.
- (Frankl & Iakounenko, 1998): There is an effectiveness/coverage relationship in all-edges and all-uses (not so in random). There is no difference as regards effectiveness between all-uses and all-edges for high coverages.

PRACTICAL RESULTS:
- In the event of time constraints, the use of the random testing technique can be relied upon to yield an effectiveness similar to all-uses and all-edges (the differences being smaller the higher the coverage) in 50% of the cases. Where testing needs to be exhaustive, the application of all-uses provides assurance, as, in the other half of the cases, this criterion yielded more efficient results thanks to the actual technique, unlike all-edges, which was more efficient because it generated more test cases.
- All-edges should be applied together with all-dus, as they are equally effective and detect different faults. Additionally, they generate about the same number of test cases, and the random testing technique has to generate between 50% and 160% more test cases to achieve the same effectiveness as all-edges and all-dus.
- High coverage levels are recommended for all-edges, all-uses and all-dus, as this increases their effectiveness. This is not the case for the random testing technique. Even when there is maximum coverage, however, there is no guarantee that a fault will be detected.

LIMITATIONS:
- It remains to ratify the laboratory results of the studies by Hutchins et al. and Frankl and Iakounenko in industry.
- The results of the studies by Frankl and Weiss should be corroborated using formal techniques of statistical analysis.
- The type of faults should be studied in the programs where maximum effectiveness is not achieved despite there being maximum coverage, as this would help to determine technique complementarity.
- The studies should be repeated for a more practical measure of effectiveness, as the percentage of test case sets that find at least one fault is not real.
Table 11. Results of the studies comparing data flow, control flow and random testing techniques.
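To make the criteria being compared more concrete, the following sketch lists, for a toy function, the kind of obligations that all-edges (branch coverage) and all-uses (definition-use coverage) impose. The function and the hand-derived obligations are illustrative assumptions; the surveyed studies derive them from the programs' control-flow and data-flow graphs.

```python
# Toy function used to contrast branch obligations with def-use obligations.

def classify(a, b):
    m = a                 # definition d1 of m
    if b > a:             # decision 1
        m = b             # definition d2 of m
    if m > 0:             # decision 2, which uses m
        return "positive max"
    return "non-positive max"

# all-edges (branch coverage): both outcomes of decision 1 and decision 2,
#   i.e. 4 obligations; the two inputs below already satisfy them.
# all-uses: each definition of m must reach each use of m, so the pair
#   (d1, use in decision 2) is only exercised when decision 1 is false,
#   and (d2, use in decision 2) only when decision 1 is true. In larger
#   programs the set of def-use pairs grows much faster than the set of
#   branches, which is why all-uses tends to demand more test cases.

for a, b in [(1, 2), (-3, -5)]:
    print((a, b), classify(a, b))
```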
ASPECT STUDIED:
- % mutants killed by each technique: (Frankl et al., 1997)
- Ratio of generated sets that detect at least 1 fault: (Frankl et al., 1997), (Wong & Mathur, 1995)

TESTING TECHNIQUE:
- Mutation (strong or standard): (Frankl et al., 1997), (Wong & Mathur, 1995)
- All-uses: (Frankl et al., 1997), (Wong & Mathur, 1995)
- Randomly selected 10% mutation: (Wong & Mathur, 1995)
- Constrained (abs/ror) mutation: (Wong & Mathur, 1995)
Table 12. Comparative studies between the mutation and data flow testing techniques families.

We cannot conclude from these results that there is a clear difference in terms of effectiveness between mutation testing and all-uses. Additionally, the authors highlight that it is harder to get high coverage with mutation than with all-uses. The limitations are:
• The results of Frankl et al. can be considered as a first attempt at comparing mutation testing techniques with all-uses, as this study has some drawbacks. First, the faults introduced into the programs were faults that, according to the authors, "occurred naturally". However, the programs are relatively small (no more than 78 LOC), and it is not said whether or not they are real. Additionally, the fact that the programs had no more than two faults is not significant from a practical viewpoint.
• Wong and Mathur do not use real programs or formal techniques of statistical analysis, which means that their results cannot be considered conclusive until a formal analysis of the results has been conducted on real programs.
• The use of the percentage of sets that discover at least one fault as the response variable is not significant from a practical viewpoint.
• Note that a potentially interesting question for this study would have been to examine the differences in the programs for which mutation and data flow testing techniques yield different results. This could have identified a possible relationship between program or fault types
and the techniques studied, which would help to define application conditions for these techniques.
• There should be a more detailed study of the dependency between the technique and the program type to be able to more objectively determine the benefits of each of these techniques.
• In any replications of this study, it would be important to analyse the cost of technique application (in the sense of application time and number of test cases to be applied) to conduct a more detailed cost/benefit analysis.
The main results of this group are summarised in Table 13. As a general rule, mutation testing appears to be as or more effective than all-uses, although it is more costly.
STUDY: (Frankl et al., 1997)
ASPECT STUDIED: % mutants killed by each technique
RESULTS:
- It is more costly to reach high coverage levels with mutation than with all-uses.
- There is not a clear difference between mutation and all-uses.

STUDY: (Wong & Mathur, 1995)
ASPECT STUDIED: Ratio of sets generated that detect at least 1 fault
RESULTS:
- Standard mutation is more effective than all-uses in 63% of the cases and equally effective in 37%.
- Abs/ror is more effective than all-uses in 50% of the cases, equally effective in 30% and less effective in 20%.
- All-uses is more effective than 10% mutation in 40% of the cases, equally effective in 20% and less effective in 40%.

PRACTICAL RESULTS:
- If high coverage is important and time is limited, it is preferable to use all-uses as opposed to mutation, as it will be just as effective as mutation in about half of the cases.
- All-uses behaves similarly as regards effectiveness to abs/ror and 10% mutation.

LIMITATIONS:
- It remains to ratify the laboratory results of the studies in industry.
- The studies should be repeated for a more practical measure of effectiveness, as the percentage of sets of cases that find at least one fault is not real.
- It would be of interest to further examine the differences in the programs in which mutation and the data flow testing technique yield different results.
- The cost of technique application should be studied.
Table 13. Comparisons between mutation and all-uses.
7. Comparisons Between the Functional and Control-Flow Testing Techniques Families

The four studies that make up this group are reflected in Table 14. These are empirical studies in which the authors investigate the differences between control flow testing techniques and the functional testing techniques family. These studies actually also compare the two testing technique families with some static code analysis techniques, which are not taken into account for the purposes of this paper, as they are not testing techniques.

In Myers' study (Myers, 1978), inexperienced subjects choose one control flow and one functional testing technique, which they apply to a program taken from a programming book, analysing the variables: number of faults detected, time to detect faults, time to find a fault per fault type, number of faults detected combining techniques, and time taken combining techniques per fault type. Myers does not specify which particular techniques were used, which means that this study does not provide very practical results. One noteworthy result, however, is that the author does not find a significant difference as regards the number of faults detected by the two technique types. However, the author indicates that different methods detect some fault types better than others (although this result is not analysed statistically).

Myers also studies fault detection efficiency combining the results of two different people. Looking at Table 14, we find that (Wood et al., 1997) also address this factor. The conclusions are similar in the two studies, that is, more faults are detected combining the faults found by two people. However, there are no significant differences between the different technique combinations.

Of the studies in Table 14, we find that (Basili and Selby, 1987) (see also (Basili and Selby, 1985) and (Selby and Basili, 1984)) and Wood et al. use almost the same response variables: number of detected faults, percentage of detected faults, time to detect faults, number of faults detected per hour and percentage of faults detected per fault type for Basili and Selby; and number of detected faults, number of faults detected combining techniques, number of faults detected per hour and percentage of detected faults for Wood et al. Apart from these variables, (Kamsties and Lott, 1995) also take an interest in the faults that cause the different failures, studying another set of variables, as shown in Table 14: time to detect faults, number of faults found
per hour, number of faults isolated per hour, percentage of faults detected per type, percentage of faults isolated per type, time to isolate faults, percentage of faults isolated, percentage of faults detected and total time to detect and isolate faults. Whereas Basili and Selby replicate the experiment with experienced and inexperienced subjects (two and one replications, respectively), Wood et al., like Kamsties and Lott, use only inexperienced subjects.
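Since the studies in this group keep contrasting a functional technique (boundary value analysis) with structural ones (sentence, condition or decision coverage), a brief sketch of how each family derives its test cases may help. The specification, function and chosen inputs below are hypothetical examples, not material from the experiments.

```python
# Hypothetical specification: grade(score) returns "pass" for 60-100,
# "fail" for 0-59, and raises ValueError outside 0-100.

def grade(score):
    if score < 0 or score > 100:
        raise ValueError("score out of range")
    return "pass" if score >= 60 else "fail"

# Functional testing (boundary value analysis): inputs come from the
# specification's boundaries, with no knowledge of the code.
bva_cases = [-1, 0, 59, 60, 100, 101]

# Structural testing (condition coverage): inputs are chosen so that each
# atomic condition in the code (score < 0, score > 100, score >= 60)
# evaluates to both true and false at least once.
condition_cases = [-1, 101, 50, 70]

for case in bva_cases:
    try:
        print("BVA case", case, "->", grade(case))
    except ValueError as err:
        print("BVA case", case, "->", err)
```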
STUDY: (Myers, 1978)
ASPECTS STUDIED: No. faults detected; time to detect faults; time to detect faults per fault type; no. faults detected combining techniques; time combining techniques per fault type
TESTING TECHNIQUES: White box; black box (concrete techniques chosen by the subjects)

STUDY: (Basili & Selby, 1987)
ASPECTS STUDIED: No. faults detected; % faults detected; time to detect faults; no. faults found per hour; % faults detected per type
TESTING TECHNIQUES: Boundary value analysis; sentence coverage

STUDY: (Kamsties & Lott, 1995)
ASPECTS STUDIED: Time to detect faults; no. faults found per hour; no. faults isolated per hour; % faults detected per type; % faults isolated per type; time to isolate faults; % faults detected; % faults isolated; total time to detect and isolate faults
TESTING TECHNIQUES: Boundary value analysis; condition coverage

STUDY: (Wood et al., 1997)
ASPECTS STUDIED: No. faults detected; % faults detected; no. faults found per hour; no. faults detected combining techniques
TESTING TECHNIQUES: Boundary value analysis; decision coverage (branch testing)
Table 14. Comparative studies of functional and control testing techniques.
This means that Basili and Selby can further examine the effect of experience on the fault detection rate (number of faults detected per hour) and on the time taken to detect faults. As regards the first variable, the authors indicate that the fault detection rate is the same for experienced and inexperienced subjects for both techniques (boundary value analysis and sentence coverage), that is, neither experience nor the technique influences this result. With respect to time, Basili and Selby indicate that the experienced subjects take longer to detect a fault using the functional technique than with sentence coverage. This means that experienced subjects detect fewer faults with the structural technique than with the functional testing technique within a given time. For inexperienced subjects, on the other hand, the findings are inconclusive, as the results of the replications are not the same (in one replication no differences were observed between the techniques and in the other, the functional testing technique took longer to detect faults).

Also as regards time, the study by Kamsties and Lott (who, remember, worked with inexperienced subjects) indicates that the total time to detect and isolate faults is less using the functional testing technique than with condition coverage. As these authors studied the time to detect and isolate faults separately, they were able to determine statistically that it takes longer to isolate the fault using the functional technique than with condition coverage, but the time to detect the fault is less. Note that this result cannot be directly compared with the findings of Basili and Selby, where the functional technique did not take less time to detect faults, as the two consider different structural testing techniques: sentence coverage (Basili and Selby) and condition coverage (Kamsties and Lott). As regards efficiency, Kamsties and Lott indicate that the fault detection rate was greater for the functional testing technique than for condition coverage.

Kamsties and Lott note that there were no significant differences between the percentage of isolated and detected faults, that is, both techniques behaved similarly, because the program was the influential factor. This result was corroborated by the studies by Basili and Selby and Wood et al., who claim that the percentage of detected faults depends on the program and, according to Wood et al., more specifically, on the faults present in these programs.

Basili and Selby and Kamsties and Lott have also studied the percentage of faults detected by the techniques according to fault type. In this respect, whereas Basili and Selby claim that the functional technique detects more
control faults than sentence coverage, Kamsties and Lott indicate that, generally, there are no significant differences between the functional testing technique and condition coverage with regard to the percentage of isolated and detected faults by fault type.

Finally, it should be mentioned that Wood et al. also focus on the study of the number of detected faults using each technique individually and combining the results of two people applying the same or different techniques. Individually, they reach the conclusion that it is impossible to ascertain which technique is more effective, as the program (fault) is also influential. On the other hand, they find that the number of different faults detected is higher combining the results of different people, instead of considering only the results of the individual application of each technique. However, a formal analysis of the data would show that there is no significant difference between two people applying the same or different techniques, which might suggest that it is the people and not the techniques that find different faults (although this claim would require further examination).

The studies considered in this group generally include an experimental design and analysis, which means that their results are reliable. However, caution needs to be exercised when generalising and directly comparing these results for several reasons:
• They use relatively small programs, between 150 and 350 LOC, which are generally toy programs and might not be representative of industrial software.
• Most, although not all, of the faults considered in these programs are inserted by the authors ad hoc for the experiments run, which means that there is no guarantee that these are faults that would occur in real programs.
• The studies by Basili and Selby, Kamsties and Lott and Wood et al. compare the boundary value analysis technique with three different structural techniques. Hence, although some results of different studies may appear to be contradictory at first glance, a more detailed analysis would be needed to compare the structural techniques with each other.
• Although the response variables used in all the studies are quite similar, care should be exercised when directly comparing the results, because, as mentioned above, the techniques studied are not absolutely identical.
• In Myers' study, it is not very clear which concrete techniques the subjects apply, since they are only asked to apply a control flow and a functional testing technique.

The conclusions that can be drawn are:
• The boundary value analysis technique appears to behave differently compared with different structural testing techniques (particularly, sentence coverage and condition coverage). Note that, from the practical viewpoint, condition coverage is more applicable, which means that future replications should focus on condition coverage rather than sentence coverage. Nothing can be said about branch testing.
• Basili and Selby, Kamsties and Lott and Wood et al. find effectiveness-related differences between functional and structural techniques depending on the program to which they are applied. Wood et al. further examine this relationship, indicating that it is the fault type that really influences the detected faults (and, more specifically, the influential factor is the failures that these faults cause in programs), whereas Kamsties and Lott and Myers find no such difference.
• There also appears to be a relationship between the programs, or the type of faults entered in the programs, and technique effectiveness, as indicated by all three studies. However, this relationship has not been defined in detail. Basili and Selby point out that the functional technique detects more control faults. Myers also discerns a difference as regards different faults, but fails to conduct a statistical analysis. Finally, Kamsties and Lott find no such difference, which means that a more exhaustive study would be desirable.
• More faults are detected using the same technique if the results of different people are combined than individually.
• Any possible extensions of these studies should deal, whenever possible, with real problems and faults in order to be able to generalise the results obtained.
• Finally, it would be recommendable to unify the techniques under study in future replications to be able to generalise conclusions.
Taking this into account, we have summarised the results of this group in Table 15 and Table 16.
ASPECT STUDIED: No. faults detected
- (Basili & Selby, 1987): Experienced subjects: the functional technique detects more faults than the structural technique. Inexperienced subjects: in one replication there is no difference between structural and functional techniques; in the other, the functional technique detects more faults than the structural technique.
- (Wood et al., 1997): The number of detected faults depends on the program/technique combination.

ASPECT STUDIED: % faults detected
- (Basili & Selby, 1987): Experienced subjects: the functional technique detects more faults than the structural technique. Inexperienced subjects: in one replication there is no difference between the structural and functional techniques; in the other, the functional technique detects more faults than the structural technique.
- (Kamsties & Lott, 1995): Depends on the program, not the technique.
- (Wood et al., 1997): The percentage of detected faults depends on the program/technique combination.

ASPECT STUDIED: % faults detected/type
- (Basili & Selby, 1987): Boundary value analysis detects more control faults than sentence coverage. There is no difference between these techniques for other fault types.
- (Kamsties & Lott, 1995): There is no difference between techniques.

ASPECT STUDIED: No. faults detected combining techniques
- (Wood et al., 1997): Higher number of faults combining techniques.

ASPECT STUDIED: Time to detect faults
- (Basili & Selby, 1987): Experienced subjects: boundary value analysis takes longer than sentence coverage. Inexperienced subjects: boundary value analysis takes as long or longer than sentence coverage.
- (Kamsties & Lott, 1995): Inexperienced subjects: boundary value analysis takes less time than condition coverage. The time taken to detect faults also depends on the subject.

ASPECT STUDIED: No. faults detected/hour
- (Basili & Selby, 1987): The fault rate with boundary value analysis and sentence coverage does not depend on experience. The fault rate depends on the program.
- (Kamsties & Lott, 1995): Boundary value analysis has a higher fault detection rate than condition coverage.
- (Wood et al., 1997): Depends on the type of faults in the programs.

ASPECT STUDIED: % faults isolated
- (Kamsties & Lott, 1995): Depends on the program and subject, not on the technique.
Table 15. Results of the comparison of the functional and control flow testing technique families (1/2).
ASPECT STUDIED: No. faults isolated/hour
- (Kamsties & Lott, 1995): Is influenced by the subject, not by the technique.

ASPECT STUDIED: % faults isolated/type
- (Kamsties & Lott, 1995): There is no difference between techniques.

ASPECT STUDIED: Time to isolate faults
- (Kamsties & Lott, 1995): With inexperienced subjects, boundary value analysis takes longer than condition coverage.

ASPECT STUDIED: Total time to detect and isolate
- (Kamsties & Lott, 1995): With inexperienced subjects, boundary value analysis takes less time than condition coverage. Time also depends on the subject.

PRACTICAL RESULTS:
- For experienced subjects and when there is plenty of time, it is better to use the boundary value analysis technique as opposed to sentence coverage, as subjects will detect more faults, although it will take longer. On the other hand, for inexperienced subjects and when time is short, it is better to use sentence coverage as opposed to boundary value analysis, although there could be a loss of effectiveness. The time will also depend on the program.
- It is preferable to use boundary value analysis as opposed to condition coverage, as there is no difference as regards effectiveness and it takes less time to detect and isolate faults.
- There appears to be a dependency on the subject as regards technique application time, fault detection and fault isolation.
- There appears to be a dependency on the program as regards the number and type of faults detected.
- More faults are detected by combining subjects than techniques of the two families.
- If control faults are to be detected, it is better to use boundary value analysis or condition coverage than sentence coverage. Otherwise, it does not matter which of the three is used.
- The effect of the boundary value analysis and branch testing techniques on effectiveness cannot be separated from the program effect.

LIMITATIONS:
- It remains to ratify the laboratory results of the studies in industry.
- The studies compare boundary value analysis with three different structural testing techniques, hence a more detailed analysis is needed to compare the structural testing techniques with each other.
Table 16. Results of the comparison of the functional and control flow testing technique families (2/2).
8. Conclusions

As readers will be able to appreciate, the original intention of extracting empirically validated knowledge on testing techniques from this survey has been held back for several reasons. These reasons have been mentioned throughout the article and can be summarised globally as:
• Dispersion of the techniques studied by the different papers within one and the same family.
• Dispersion of the response variables examined, even for the same techniques.

Additionally, as regards the validity of the individual papers studied, we have also found some limitations that prevent their results from being generalised. Most of the papers are beleaguered by one or more of the following limitations:
• Informality of the results analyses (many studies are based solely on qualitative graph analysis).
• Limited usefulness of the response variables examined in practice, as is the case of the probability of detecting at least one fault.
• Non-representativeness of the programs chosen, either because of size or the number of faults (one or two) introduced.
• Non-representativeness of the faults introduced in the programs (unreal faults).

Despite the difficulties encountered, Table 17, Table 18, Table 19 and Table 20 show some recommendations that can be of use to practitioners, along with their maturity level and the tests still pending. Note that there is no statement on testing techniques that can be accepted as fact, as they are all pending some sort of corroboration, be it laboratory or field replication or knowledge pending formal analysis.

Furthermore, some points yet to be covered by empirical studies, and which might serve as inspiration for researchers, should be highlighted:
• The comparative study of the effectiveness of different techniques should be supplemented by a study of the fault types that each technique detects and not only the probability of detecting faults. That is, even if T1 and T2 are equally effective, this does not mean that they detect the same faults. T1 and T2 may find the same number of faults, but T1 may find faults of type A (for example, control faults) whereas T2 finds faults of type B (for example, assignment faults). This would
provide a better understanding of technique complementarity, even when they are equally effective.
• An interesting question for further examination is the differences between the programs for which different techniques yield different results. That is, given two programs P1 and P2, and two techniques T1 and T2 that behave differently with respect to P1 but equally with respect to P2 (either as regards the number of detected faults, the technique application time, etc.), identify what differences there are between these two programs. This could identify a possible relationship between program types or fault types and the techniques studied, which would help to define application conditions for these techniques.
• It would be a good idea to conduct a more detailed study of technique dependency on program type to be able to more objectively determine the benefits of each technique.
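The complementarity analysis proposed above is simple to state; the sketch below shows it on hypothetical fault identifiers. The fault sets are invented, not taken from any of the surveyed studies.

```python
# Hypothetical fault identifiers detected by two equally effective techniques.
faults_T1 = {"F1", "F2", "F3", "F5"}   # e.g. mostly control faults
faults_T2 = {"F2", "F4", "F6", "F7"}   # e.g. mostly assignment faults

both = faults_T1 & faults_T2
only_T1 = faults_T1 - faults_T2
only_T2 = faults_T2 - faults_T1
overlap = len(both) / len(faults_T1 | faults_T2)   # Jaccard overlap

print("found by both:", sorted(both))
print("only T1:", sorted(only_T1), "only T2:", sorted(only_T2))
# A low overlap despite equal fault counts suggests the techniques are
# complementary and worth applying together.
print("overlap =", round(overlap, 2))
```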
TECHNIQUE: Data flow

PRACTICAL RECOMMENDATIONS:
- If time is a problem, all-p-uses should be used instead of all-uses, and all-uses instead of all-du-paths, as they generate fewer test cases and generally cover the test cases generated by the other criteria.
- It is not clear that it is better to use all-c-uses instead of all-p-uses, as, even though all-c-uses generates fewer test cases, coverage is not assured.
- All-du-paths is not as time consuming as stated by the theory, as it generates a reasonable and not an exponential number of test cases.

MATURITY STATUS:
- For the recommendations on all-p-uses, all-uses and all-c-uses: confirmed with lab programs and faults; confirmed formally; pending lab replication; pending field study.
- For the all-du-paths recommendation: confirmed with lab programs and faults; confirmed with field study; pending formal analysis; pending lab replication.

PENDING KNOWLEDGE:
- Find out the difference in terms of effectiveness between all-c-uses, all-p-uses, all-uses and all-du-paths.
- Compare with the other techniques in the family.
- Find out whether the fact that maximum coverage does not detect a fault depends on the fault itself.
- Confirm whether the number of test cases generated by all-du-paths depends on the number of sentences or the number of decisions, as the two authors disagree.
Table 17. Conclusions for intrafamily studies (1/2).
TECHNIQUE: Mutation

PRACTICAL RECOMMENDATIONS:
- Where time is a critical factor, it is better to use selective (exweak/1 and bbweak/n) as opposed to standard mutation, as it generates fewer mutants (and, therefore, fewer test cases) and its effectiveness is practically the same.
- Where time is a critical factor, it is better to use weak as opposed to standard mutation, as it generates fewer test cases and effectiveness is approximately the same.
- Where time is a critical factor, it is better to use 10% selective as opposed to standard mutation, although there is some loss in effectiveness, because it generates far fewer test cases. In intermediate cases, it is preferable to use abs/ror mutation, because, although it generates more cases (from 50 to 100 times more), it raises effectiveness by 7 points. If time is not a critical factor, it is preferable to use standard mutation.

MATURITY STATUS:
- For the selective and weak mutation recommendations: confirmed with lab programs and faults; confirmed formally; pending lab replication; pending field study.
- For the 10% selective and abs/ror recommendation: confirmed with lab programs and faults; pending formal analysis; pending lab replication; pending field study.

PENDING KNOWLEDGE:
- Compare the different mutation variants with each other.
- Use another metric type for effectiveness, as the number of mutants killed by a technique is only useful for relative comparisons between mutation techniques.
Table 18. Conclusions for intrafamily studies (2/2).
TECHNIQUE: Data flow (all-uses, all-dus) vs. Control flow (all-edges) vs. Random

PRACTICAL RECOMMENDATIONS:
- In the event of time constraints, the use of the random testing technique can be relied upon to yield an effectiveness similar to all-uses and all-edges (the differences being smaller as coverage increases) in 50% of the cases. Where testing needs to be exhaustive, the application of all-uses provides assurance, as, in the other half of the cases, this criterion yielded more efficient results thanks to the actual technique, unlike all-edges, which was more efficient because it generated more test cases.
- High coverage levels are recommended for all-edges, all-uses and all-dus, but not for the random testing technique. Even when there is maximum coverage, however, there is no guarantee that a fault will be detected.
- All-edges should be applied together with all-dus, as they are equally effective and detect different faults. Additionally, they generate about the same number of test cases, and the random testing technique has to generate between 50% and 160% more test cases to achieve the same effectiveness as all-edges and all-dus.

MATURITY STATUS (per recommendation, in order):
- Confirmed with lab programs and faults. Confirmed formally. Pending lab replication. Pending field study.
- Confirmed with lab programs and faults. Confirmed by field study. Pending formal analysis. Pending lab replication.
- Confirmed with lab programs and faults. Confirmed formally. Pending lab replication. Pending field study.

PENDING KNOWLEDGE:
- Compare with other techniques of the family.
- Use a better metric for effectiveness.
- Find out whether the fact that maximum coverage does not detect a fault depends on the fault itself.

TECHNIQUE: Mutation (standard) vs. Data flow (all-uses)

PRACTICAL RECOMMENDATIONS:
- If high coverage is important and time is limited, it is preferable to use all-uses as opposed to mutation, as it will be just as effective as mutation in about half of the cases.
- All-uses behaves similarly as regards effectiveness to abs/ror mutation and 10% mutation.

MATURITY STATUS: Confirmed with lab programs and faults. Pending formal analysis. Pending lab replication. Pending field study.

PENDING KNOWLEDGE:
- Compare with other techniques of the family.
- Use a better metric for effectiveness.
- Find out whether the cases in which mutation is more effective than all-uses are due to the fault type.
- Study the costs of both techniques in terms of application time.
- Use another more significant metric type to measure effectiveness.
- Study the number of cases generated by the three alternatives.

Table 19. Conclusions for interfamily studies (1/2).
TECHNIQUE: Functional (boundary value analysis) vs. Control flow (sentence coverage, decision coverage, branch testing)

PRACTICAL RECOMMENDATIONS:
- For experienced subjects and when there is plenty of time, it is better to use the boundary value analysis technique as opposed to sentence coverage, as subjects will detect more faults, although it will take longer. On the other hand, for inexperienced subjects and when time is short, it is better to use sentence coverage as opposed to boundary value analysis, although there could be a loss of effectiveness. The time will also depend on the program.
- It is preferable to use boundary value analysis as opposed to condition coverage, as there is no difference as regards effectiveness and it takes less time to detect and isolate faults.
- There appears to be a dependency on the subject as regards technique application time, fault detection and fault isolation.
- There appears to be a dependency on the program as regards the number and type of faults detected.
- More faults are detected by combining subjects than techniques of the two families.
- If control faults are to be detected, it is better to use boundary value analysis or condition coverage than sentence coverage. Otherwise, it does not matter which of the three is used.
- It is impossible to ascertain whether boundary value analysis is more or less effective than branch testing, because effectiveness also depends on the program (fault).

MATURITY STATUS: Confirmed with lab programs and faults. Confirmed formally. Pending lab replication. Pending field study.

PENDING KNOWLEDGE:
- Compare control flow testing techniques with each other.
- Check whether it is true for all techniques.
- Further examine the combination of fault and failure.
- Further examine the type of faults detected by each technique.
- Classify the faults to which the techniques are sensitive.

Table 20. Conclusions for interfamily studies (2/2).
After analysing the empirical studies of testing techniques, the main conclusion is that more experimentation is needed and much more replication has to be conducted before general results can be stated. While this conclusion was to be expected, as experimental software engineering is not yet a usual practice in our field, more experimenters are needed, so that the ideas thrown into the arena can be corroborated and tested and then used reliably.
Bibliography

Basili, V.R. and Selby, R.W., 1985. Comparing the Effectiveness of Software Testing Strategies. Department of Computer Science, University of Maryland. Technical Report TR-1501. College Park.
Basili, V.R. and Selby, R.W., 1987. Comparing the Effectiveness of Software Testing Strategies. IEEE Transactions on Software Engineering. Volume SE-13 (12). Pages 1278-1296.
Beizer, B., 1990. Software Testing Techniques. International Thomson Computer Press, second edition.
Bieman, J.M. and Schultz, J.L., 1992. An Empirical Evaluation (and Specification) of the All-du-paths Testing Criterion. Software Engineering Journal. Pages 43-51. January.
Davis, A., 1993. Software Requirements: Objects, Functions and States. PTR Prentice Hall.
Frankl, P. and Iakounenko, O., 1998. Further Empirical Studies of Test Effectiveness. In Proceedings of the ACM SIGSOFT International Symposium on Foundations of Software Engineering. Pages 153-162. Lake Buena Vista, Florida, USA.
Frankl, P.G., Weiss, S.N. and Hu, C., 1994. All-Uses versus Mutation: An Experimental Comparison of Effectiveness. Polytechnic University, Computer Science Department. Technical Report PUCS-94-100.
Frankl, P.G., Weiss, S.N. and Hu, C., 1997. All-Uses vs Mutation Testing: An Experimental Comparison of Effectiveness. Journal of Systems and Software. Volume 38. Pages 235-253. September.
Frankl, P.G. and Weiss, S.N., 1991. An Experimental Comparison of the Effectiveness of the All-uses and All-edges Adequacy Criteria. Proceedings of the Symposium on Testing, Analysis and Verification. Pages 154-164. Victoria, BC, Canada.
Frankl, P.G. and Weiss, S.N., 1991. Comparison of All-uses and All-edges: Design, Data, and Analysis. Hunter College, Computer Science Department. Technical Report CS-91-03.
Frankl, P.G. and Weiss, S.N., 1993. An Experimental Comparison of the Effectiveness of Branch Testing and Data Flow Testing. IEEE Transactions on Software Engineering. Volume 19 (8). Pages 774-787. August.
Hamlet, R., 1989. Theoretical Comparison of Testing Methods. In Proceedings of the ACM SIGSOFT '89 Third Symposium on Testing, Analysis and Verification. Pages 28-37. Key West, Florida. ACM.
Hutchins, M., Foster, H., Goradia, T. and Ostrand, T., 1994. Experiments on the Effectiveness of Dataflow- and Controlflow-Based Test Adequacy Criteria. Proceedings of the 16th International Conference on Software Engineering. Pages 191-200. Sorrento, Italy. IEEE.
Kamsties, E. and Lott, C.M., 1995. An Empirical Evaluation of Three Defect-Detection Techniques. Proceedings of the Fifth European Software Engineering Conference. Sitges, Spain.
Latour, B. and Woolgar, S., 1986. Laboratory Life: The Construction of Scientific Facts. Princeton, USA: Princeton University Press.
Myers, G.J., 1978. A Controlled Experiment in Program Testing and Code Walkthroughs/Inspections. Communications of the ACM. Volume 21 (9). Pages 760-768.
Myers, G.J., 1979. The Art of Software Testing. Wiley-Interscience.
Offutt, A.J., Rothermel, G. and Zapf, C., 1993. An Experimental Evaluation of Selective Mutation. Proceedings of the 15th International Conference on Software Engineering. Pages 100-107. Baltimore, USA. IEEE.
Offutt, A.J., Lee, A., Rothermel, G., Untch, R.H. and Zapf, C., 1996. An Experimental Determination of Sufficient Mutant Operators. ACM Transactions on Software Engineering and Methodology. Volume 5 (2). Pages 99-118.
Offutt, A.J. and Lee, S.D., 1991. How Strong is Weak Mutation? Proceedings of the Symposium on Testing, Analysis, and Verification. Pages 200-213. Victoria, BC, Canada. ACM.
Offutt, A.J. and Lee, S.D., 1994. An Empirical Evaluation of Weak Mutation. IEEE Transactions on Software Engineering. Volume 20 (5). Pages 337-344.
Selby, R.W. and Basili, V.R., 1984. Evaluating Software Engineering Testing Strategies. Proceedings of the 9th Annual Software Engineering Workshop. Pages 42-53. NASA/GSFC, Greenbelt, MD.
Weyuker, E.J., 1988. An Empirical Study of the Complexity of Data Flow Testing. Proceedings of the 2nd Workshop on Software Testing, Verification and Analysis. Pages 188-195. Banff, Canada.
Weyuker, E.J., 1990. The Cost of Data Flow Testing: An Empirical Study. IEEE Transactions on Software Engineering. Volume 16 (2). Pages 121-128.
Wong, E. and Mathur, A.P., 1995. Fault Detection Effectiveness of Mutation and Data-flow Testing. Software Quality Journal. Volume 4. Pages 69-83.
Wood, M., Roper, M., Brooks, A. and Miller, J., 1997. Comparing and Combining Software Defect Detection Techniques: A Replicated Empirical Study. Proceedings of the 6th European Software Engineering Conference. Zurich, Switzerland.
CHAPTER 2
Replicated Studies: Building a Body of Knowledge about Software Reading Techniques

Forrest Shull
Fraunhofer Center—Maryland, USA
[email protected]

Jeffrey Carver
Dept. of Computer Science, University of Maryland, College Park, USA
[email protected]

Guilherme H. Travassos
COPPE—Systems Engineering and Computer Science Program, Federal University of Rio de Janeiro, Brazil
[email protected]

Jose Carlos Maldonado
Dept. of Computer Science, University of Sao Paulo at Sao Carlos, Brazil
[email protected]

Reidar Conradi
Norwegian University of Science and Technology, Norway
[email protected]

Victor R. Basili
Fraunhofer Center—Maryland and Dept. of Computer Science, University of Maryland, College Park, USA
[email protected]
An empirical approach to software process improvement calls for guiding process development based on empirical study of software technologies. This approach helps direct the evolution of new technologies, by studying the problems developers have applying the technology in practice, and validates mature technologies, by providing indication of the expected benefit and the conditions under which they apply. So, a variety of different empirical studies are necessary for a given technology over time, with evolving goals and hypotheses. Thus, what we as a field know about a software development technology is never based on a single study; rather, a "body of knowledge" must be accumulated out of many individual studies. Multiple studies also help mitigate the weaknesses inherent in any empirical study by requiring the confirmation or refutation of the original findings by means of independent replications, which can address the original threats to validity although they will invariably suffer from threats of their own. Since formal methods for abstracting results from independent studies (such as meta-analysis) have not proven feasible, we advocate a more informal approach to building up such bodies of knowledge. In this approach, replication is used to run families of studies that are designed a priori to be related. Because new studies are based upon the designs of existing ones, it becomes easier to identify the context variables that change from one study to another. By comparing and contrasting the results of studies in the same family, researchers can reason about which context variables have changed and hypothesize what their likely effects on the outcome have been. As more studies become part of the family, hypotheses can be refined, or supported with more confidence by additional data. By using this informal approach, we can work toward producing a robust description of a technology's effects, specifying hypotheses at varying levels of confidence. In this chapter, we first present a more detailed discussion of various types of replications and why they are necessary for allowing the variation of important factors in a controlled way to study their effects on the technology. For each type of replication identified, we provide an example of this informal approach and how it has been used in the evolution of a particular software development technology, software reading techniques. We present a brief description of each set of replications, focusing on the lessons learned about the reading technology based on the results of the original study and the replication together. We will also discuss what we learned about the technologies from the entire series of studies, as well as what we learned about reading techniques in general. We will indicate which of these lessons were due directly to the process of replication and could not have been learned through a single study. Based on these examples, this chapter concludes with lessons learned about replicating studies.

Keywords: Empirical software engineering; experiment replication; software reading techniques; perspective-based reading; object-oriented reading techniques; experimentation process.
1. Introduction

In Software Engineering, researchers are continually developing new tools, techniques and methods. The problem is that very often these new technologies never make it out of the research laboratory into real-world practice, and when they do, there is often little empirical data capable of showing their likely effect in practice. Therefore, software developers have a plethora of development technologies from which to choose, but often little guidance for making the decision.

Researchers and developers can both benefit from a better understanding of the practical effects and implications of new technologies. Such an understanding will allow decisions to be made not based on anecdote, hearsay, or hype, but rather based on solid empirical data. Many software engineers are still surprised to learn that 25 years of empirical research activities in software engineering have yielded important insights that can aid in the decision-making process.

Both researchers and practitioners have a general need for properly evaluated technologies that are well understood, but their specific needs and goals are slightly different. Researchers need to perform focused evolution of their own technologies in the lab to understand when further development and assessment are required or when the technology is ready for deployment to practitioners. On the other hand, developers need reliable support to help determine which technologies to select for best use in their environment.

An important way in which information can be built up about a technology is by running empirical studies. While a single well-run study can provide some specific information about the technology within a particular context, the results of any single study on almost any process depend to a large degree on a large number of relevant context variables. Thus, the results of any single study cannot be assumed a priori to apply in another context. Obtaining more general information about a technology requires the running of multiple studies under different contexts. Multiple studies allow the specific results of a single study to be validated and/or generalized across varying environments. However, some kind of approach is necessary to abstract the specific results from multiple studies into a useful and reliable body of knowledge capable of providing general recommendations about a technology. Based on its usefulness in other fields, meta-analysis, which provides a statistical basis for drawing conclusions across multiple studies, could be a promising vehicle for use in software engineering research. However, initial attempts to apply it to studies of software development technologies have
not been successful [Miller00], perhaps reflecting the level of maturity of the field. To apply meta-analysis, the field must be at a level of sophistication where different researchers are able to agree upon a common framework for planning, running, and reporting studies. While we believe that increased experience with running empirical studies will eventually lead to guidelines that facilitate the combining of software studies, in the meantime we must use less formal methods to combine results until the field reaches this level of sophistication.

We advocate a more informal approach to building up a body of knowledge while the field matures to the point of using meta-analysis. In this approach, replication is used to run families of studies that are designed a priori to be related. Because new studies are based upon the designs of existing ones, it becomes easier to identify the context variables that change from one study to another. By comparing and contrasting the results of studies in the same family, researchers can reason about which context variables have changed and hypothesize what their likely effects on the outcome have been. As more studies become part of the family, hypotheses can be refined, or supported with more confidence by additional data. By using this informal approach, we can work toward producing a robust description of a technology's effects, specifying hypotheses at varying levels of confidence. Informal approaches such as this have been used before in software engineering research, for example, to formulate, successively refine, and then test hypotheses concerning effective Object-Oriented development [Wood99]. Our work is directly in line with the multi-method approach advocated by Wood et al.

Later in this chapter, we will present a more detailed discussion of various types of replications and why they are necessary for allowing the variation of important factors in a controlled way to study their effects on the technology. For clarity, a brief working definition of a replication will be given here. While in many contexts the term replication implies repeating a study without making any changes, this definition is too narrow for our purposes. In this work, a replication will be a study that is run, based on the results and design of a previous study, whose goal is to either verify or broaden the applicability of the results of the initial study. For example, the type of replication where the exact same study is run could be used to verify the results of an original study. On the other hand, if a researcher wished to explore the applicability of the results in a different context, then the design
of the original study may be slightly modified but still considered a replication. This chapter will provide an example of this informal approach and how it has been used in the evolution of two sets of software reading techniques. We will present a brief description of each set of replications, focusing on the lessons learned about the reading technology based on the results of the original study and the replication together. We will also discuss what we learned about the technologies from the entire series of studies, as well as what we learned about reading techniques in general. We will indicate which of these lessons were due directly to the process of replication and could not have been learned through a single study. We will conclude this chapter with some lessons learned about replicating studies. Based on the replications discussed as examples, we will discuss what we learned about making replications more effective.
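To make concrete what the formal alternative mentioned above (meta-analysis) would involve, here is a minimal fixed-effect, inverse-variance sketch. The per-study effect sizes and variances are invented for illustration and do not come from any of the replications discussed in this chapter.

```python
import math

# Hypothetical (standardised effect size, variance) for three replications
# of the same comparison.
studies = [(0.40, 0.05), (0.25, 0.08), (0.55, 0.04)]

weights = [1.0 / var for _, var in studies]
pooled = sum(w * effect for (effect, _), w in zip(studies, weights)) / sum(weights)
se = math.sqrt(1.0 / sum(weights))          # standard error of the pooled effect

print(f"pooled effect = {pooled:.3f}")
print(f"95% CI = [{pooled - 1.96 * se:.3f}, {pooled + 1.96 * se:.3f}]")
```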
2. Reading Techniques

To illustrate what we mean by a body of knowledge about a particular software development technology, we give the example in this paper of recent work in the area of "software reading techniques". This section gives some background on the technology and how it relates to the larger set of software development approaches.

Reading Techniques in General

A reading technique can be defined as "a series of steps for the individual analysis of a textual software product to achieve the understanding needed for a particular task" [Shull02a]. This definition has three main parts. First, the series of steps gives the reader guidance on how to achieve the goal of the technique. By defining a concrete set of steps, we give all readers a common process to work from, which we can later improve based on experience. In contrast, in an ad hoc, or unstructured, reading process, the reader is not given direction on how to read, and readers use their own processes. Without a standardized process, improvement of the process is much more difficult. Secondly, a reading technique is for individual analysis, meaning that the aim of the technique is to support the understanding process within an individual reader. Finally, the techniques strive to give the reader the understanding that they need for a particular
task, meaning that the reading techniques have a particular goal and they try to produce a certain level of understanding related to that goal [Shull98].

The "series of steps" that each reviewer receives consists of two major components: a concrete procedure that can be followed to focus on only the information in the document under review that is important for the quality aspects of interest, and questions that explicitly ask the reader to think about the information just uncovered in order to find defects.

Using the broad description outlined above, different families of reading techniques can be instantiated for many different purposes [Basili96]. For example, a candidate Object-Oriented framework¹ could be evaluated for reuse by means of a reading technique. One option would be for a textual representation of the framework to be reviewed by an individual from the point of view of whether the functionality supported would be useful for the planned project. A reading technique could be developed for this task to give the reviewer a procedure that could be followed to understand what types of functional descriptions and interface issues to focus on. Sets of questions would be used to make the reviewer consider important quality aspects that would affect reuse, such as how the expectations for flow of control and interface parameters of the reusable framework match the expectations for the rest of the system.

The taxonomy of reading technique families developed to date is shown in Figure 1. The upper part of the tree (over the horizontal dashed line) models the problems that can be addressed by reading. Each level represents a further specialization of a particular software development task according to classification attributes that are shown in the rightmost column of the figure. The lower part of the tree (below the horizontal dashed line) models the specific solutions we have provided to date for the particular problems represented by each path down the tree. Each family of techniques is associated with a particular goal, artifact, and notation.
¹ Here we use the term "framework" according to the specific technical definition of a particular system designed for reuse: an object-oriented class hierarchy augmented with a built-in model that defines how the objects derived from the hierarchy interact with one another to implement some functionality. A framework is tailored to solve a particular problem by customizing its abstract and concrete classes, allowing the framework architecture to be reused by all specific solutions within a problem domain. By providing both design and infrastructure for developing applications, the framework approach promises to develop applications faster [Lewis95].
[Figure 1 shows the taxonomy tree of reading technique families. The problem space (above a dashed line) is classified by general goal, specific goal, document (artifact) and notation; the solution space (below the line) lists the technique families defined to date, including traceability, defect-based, perspective-based, scope-based and usability-based reading, for artifacts such as requirements, designs, code, test plans, user interfaces and Object-Oriented frameworks.]
Fig. 1. Families of reading techniques.
This tailorability is an important attribute of reading techniques, by which we mean that each reading technique defined is specific to a given artifact and to a goal. For example, one specific goal could be defect detection. Software reading is an especially useful method for detecting defects since it can be performed on all documents associated with the software process, and can be applied as soon as the documents are written. Given this goal, we could imagine software reading techniques tailored to natural language requirements documents, since requirements defects (omission of information, incorrect facts, inconsistencies, ambiguities, and extraneous information) can directly affect the quality of, and effort required for, the design of a system. For this technique, the procedure would be concerned with understanding what information in the requirements is important to verify (namely, the information that is required by downstream users of the requirements document to build the system). Questions could be concerned with verifying that information to find defects that may not be uncovered by a casual or unstructured reading.

Reading techniques for the purpose of defect detection, especially as used to augment software inspections, have been one of the most widely applied branches of the tree in Figure 1. For this reason, this subset of the technology is described in more detail in the next section.
Reading Techniques for Defect Detection

A software inspection aims to guarantee, through defect detection and removal, that a particular software artifact is complete, consistent, unambiguous, and correct enough to effectively support further system development. For instance, inspections have been used to improve the quality of a system's design and code [Fagan76]. Typically, inspections require individuals to review a particular artifact, then to meet as a team to discuss and record defects, which are sent to the document's author to be corrected. Most publications concerning software inspections have concentrated on improving the inspection meetings while assuming that individual reviewers are able to effectively detect defects in software documents on their own (e.g. [Fagan86], [Gilb93]). However, empirical evidence has questioned the importance of team meetings by showing that meetings do not contribute to finding a significant number of new defects that were not already found by individual reviewers [Porter95, Votta93].

Software reading techniques can be applied to improving the effectiveness of software inspections by improving the effectiveness of individual reviewers during their preparation activities before the inspection meeting. Reading techniques provide procedural guidelines containing tested procedures for effective individual inspection—a step that is often ignored in state-of-the-practice inspection processes. Reading techniques combine and emphasize three "best practices" that we have found helpful, based on personal experience, for effective inspections in a variety of contexts. While any of the three practices can be useful in and of itself, their integration in a unified inspection approach has been shown to be particularly valuable. Those practices are:
1. Giving each reviewer a particular and unique focus (or perspective) on the document under review. Studies such as [Basili96] and personal experience have shown that reviewers work better when they have a clear focus than when they feel themselves responsible for all types of defects, in all parts of the document. Additionally, having a unique focus means that each reviewer has clear responsibility for a certain aspect of the document and can't count on another reviewer catching any missed defects.
2. Making individual review of a document an active (rather than passive) undertaking. Reviewers tend to make a more thorough review when they are actively engaged in working with the information contained in a document than when they can get by
with merely reading it over. This has been an important principle driving the development of certain inspection approaches, such as Active Design Reviews [Knight93]. 3. Articulating a taxonomy of the types of defects of interest, and giving the reviewer an understanding of how to look for those types of issues during the individual review. Reviewers do a better job of reviewing documents when they have a good idea of what they are looking for. In tailoring reading techniques to specific project environments, defect taxonomies must be made explicit and reflected in the questions that are given to the reviewer. Two families of reading techniques that have received the most effort in terms of training and evaluation include Perspective-Based Reading (PBR) for requirements inspections, and Object-Oriented Reading Techniques (OORTs) for inspection of high-level UML designs. PBR Perspective-Based Reading (PBR) is a family of techniques that have been specifically designed for the review of English-language requirements. In planning a PBR review, the potential stakeholders of the requirements document are identified and the differing perspectives are used to give different reviewers a particular and unique quality focus: Each reviewer is asked to take the perspective of some downstream user of the requirements, and be concerned with only that quality focus during the review. Some examples of requirements stakeholders could include: • a user, who has to validate that the requirements contain the right set of functionality for his/her needs; • a designer, who has to verify that the requirements will allow a correct design to be created; • a tester, who must ensure that the functionality is specified in such a way as to be testable in the final system. Review is made into an active undertaking by asking each reviewer to create a high-level version of the work products that the appropriate requirements stakeholder would have to create as part of his or her normal work activities. In this way, the reviewer is forced to manipulate the information in the document in a way that approximates the actual work activities that the document must be able to support. Additionally, the
intermediate artifacts to be created during the review should be chosen carefully for their reusability downstream. The objective is not to duplicate work done at other points of the software development process, but to create representations that can be used as a basis for the later creation of more specific work products and that can reveal how well the requirements can support the necessary tasks. For the designer, tester, and user perspectives discussed above, the relevant work products would be a high-level design of the system described by the requirements, a test plan for the system, and an enumeration of the functionality described in the requirements, respectively. Finally, questions are distributed at key points of the procedure to focus reviewers on an explicit defect taxonomy. As the reviewer goes through the steps of constructing the intermediate artifact, he or she is asked to answer a series of questions about the work being done. There is at least one question for every applicable type of defect. When the requirements do not provide enough information to answer the questions, this is usually a good indication that they do not provide enough information to support the user of the requirements, either. This situation should lead to one or more defects being reported so that they can be fixed before the requirements need to be used to support that particular stakeholder later in the product lifecycle. As always, the defect types of interest to a project vary widely from one environment to another. However, a set of generic defect types, with associated definitions tailored to requirements inspections, is shown in Table 1 as a starting point for project-specific tailoring. OORTs Object-Oriented Reading Techniques (OORTs) are a family of techniques created for review of UML designs. The OORT family of techniques has been evolving over a long series of studies in software engineering courses where inspections are being taught, and the techniques are now at the point where they have been used for the first time by some industrial companies. Like PBR, the OORTs drive the inspection of a design by means of key perspectives: validating that the design created is sufficient for implementing the desired functionality that was set forth in the requirements (called vertical reading), and verifying that the design documents themselves are consistent enough to support detailed design and eventual implementation (known as horizontal reading). However, because UML describes a document standard that is not implemented in all environments in exactly the same way, the OORTs have been modularized. There are seven techniques altogether, one for each
Defect type: Omission
Applied to requirements: (1) some significant requirement related to functionality, performance, design constraints, attributes or external interface is not included; (2) responses of the software to all realizable classes of input data in all realizable classes of situations is not defined; (3) missing sections of the requirements document; (4) missing labeling and referencing of figures, tables, and diagrams; (5) missing definition of terms and units of measures [ANSI84].
Applied to design: One or more design diagrams that should contain some concept from the general requirements or from the requirements document do not contain a representation for that concept.

Defect type: Incorrect Fact
Applied to requirements: A requirement asserts a fact that cannot be true under the conditions specified for the system.
Applied to design: A design diagram contains a misrepresentation of a concept described in the general requirements or requirements document.

Defect type: Inconsistency
Applied to requirements: Two or more requirements are in conflict with one another.
Applied to design: A representation of a concept in one design diagram disagrees with a representation of the same concept in either the same or another design diagram.

Defect type: Ambiguity
Applied to requirements: A requirement has multiple interpretations due to multiple terms for the same characteristic, or multiple meanings of a term in a particular context.
Applied to design: A representation of a concept in the design is unclear, and could cause a user of the document (developer, low-level designer, etc.) to misinterpret or misunderstand the meaning of the concept.

Defect type: Extraneous Information
Applied to requirements: Information is provided that is not needed or used.
Applied to design: The design includes information that, while perhaps true, does not apply to this domain and should not be included in the design.
Table 1. Types of software defects, with specific definitions for requirements and design.
comparison between two (or in some cases three) UML diagrams that can effectively be compared. In this way, tailoring to project environments can be done more easily, as projects can simply choose the subset of OORTs that are appropriate for the subset of UML diagrams that they are using in
their environment. The complete set of OORTs is defined as shown in Figure 2. Each line between the software artifacts represents a reading technique that has been defined to read one against the other.
[Figure 2 shows the artifacts compared by the OORTs: the Requirements Specification (Requirements Descriptions and Use-Cases) and the High-Level Design (Class Diagrams, Class Descriptions, State Machine Diagrams, and Interaction Diagrams), with each connecting line marked as vertical or horizontal reading.]
Fig. 2. The set of OORTs (each arrow represents one technique) that has been defined for various design artifacts. OORT review is an active undertaking because effort must be expended to reconcile the different views of the same information contained in different UML diagrams being compared. The OORT procedures guide the reviewer to walk through the various diagrams, marking related information in each. For example, sequence diagrams show many specific messages being passed between objects, which taken in the aggregate might correspond to a particular high-level use case. The OORT reviewer has to understand which sequence of messages corresponds to the higher-level functional description, and mark both documents to show that the overall concept is the same in both cases. Once the equivalent information has been marked on various documents, then the reviewer can analyze whether it is represented correctly in both cases. This analysis is guided by means of specific questions included in the techniques. As in PBR, the list of questions is tied directly to the list of defect types explicitly identified in the taxonomy. An initial set of UML defect types is included in Table 1, shown alongside the requirements definitions to show how generic defect types can be redefined in various phases of the lifecycle.
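As a rough illustration of the horizontal and vertical comparisons just described, the following sketch (Python; a simplification of ours, not the published OORT procedure) treats each artifact as a set of marked concepts and reports candidate omissions and inconsistencies in the sense of Table 1, which a reviewer would then judge:

```python
from typing import Dict, List, Tuple

def compare_artifacts(source: Dict[str, str], target: Dict[str, str]) -> List[Tuple[str, str]]:
    """Compare two artifacts, each reduced to a map of concept -> its representation.

    Returns (defect_type, description) pairs: a concept present in the source but
    absent from the target is reported as a possible omission; a concept whose
    representations disagree is reported as a possible inconsistency. Every report
    is a candidate for the reviewer to judge, not an automatic defect.
    """
    reports: List[Tuple[str, str]] = []
    for concept, representation in source.items():
        if concept not in target:
            reports.append(("omission", f"'{concept}' has no representation in the target artifact"))
        elif target[concept] != representation:
            reports.append(("inconsistency", f"'{concept}' is represented differently in the two artifacts"))
    return reports

# Hypothetical vertical reading: a use case compared against a sequence diagram.
use_case = {"withdraw cash": "customer request followed by dispense", "check balance": "query and display"}
sequence_diagram = {"withdraw cash": "customer request followed by dispense"}
for defect_type, description in compare_artifacts(use_case, sequence_diagram):
    print(defect_type, "-", description)
```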
3. Building a Body of Knowledge As mentioned in the Introduction, the sophistication level of Empirical Software Engineering is not yet at the level where families of studies are organized with the goal of drawing broad conclusions. On the other hand, because of the relatively limited scope within which a specific set of results is applicable, we need to begin to aggregate results of multiple studies to move the field to this higher level of sophistication. Therefore, we need to develop a way to organize and plan the studies in order to get the most benefit from the combination of their results. This means that the studies should be planned so that the important context variables can be varied in some controlled way in order to study their effects on the results of using the technology. The tree in Figure 1 illustrates a way that results of different studies can be abstracted up to draw lessons learned about specific technologies or classes of technologies. While it is not realistic to assume the field will ever be at the point where all of the studies run on technologies have been coordinated and planned with the idea of abstraction of results in mind, we think that the hierarchy in Figure 1 provides a good model to start from to reason about how some such studies may fit together. As we do have studies that fit into the tree, we can start with those studies and abstract the results to show the value of putting studies together. If the aggregated results of studies, which were not specifically planned with the goal of result abstraction in mind, can be abstracted successfully, then there is promise that a set of studies that are planned with this goal in mind can provide even better results. Abstraction Across a Series of Studies If we can plan a series of studies on a given technology such that the studies are coordinated to address different context variables, we will be better able to draw conclusions about the technology. Each of the various context variables (e.g. the process experience of the developers using the technology, the problem domain in which it is applied, or the level of procedural constraint given to the subjects) is a dimension along which the design of a study can be altered. That is, each context variable represents a specific characterization of the environment that can be modified for each study. To rigorously test the effects of different context variables on results, it would be necessary to run further studies that made different decisions for the relevant variable. If this series of studies can be done systematically,
then a body of knowledge can be built up that represents a real understanding of some aspect of software engineering. A body of knowledge supports decision-making because the results of the individual studies can be viewed together and abstracted to give deeper insight into the technology being studied. For example: • If a new environment is substantially similar to the environment of a previous study that helped build the body of knowledge, the previous results about the use of the technology can be directly useful; • If the new environment is significantly different from others that were used to build up the body of knowledge, the aggregate data can still be useful by observing across a number of different environments which context variables have an impact on effectiveness and which do not. In this way, articulating the important context variables can serve as a means of organizing a set of related empirical studies. Thus, families of related studies are the cornerstone of building knowledge in an incremental manner. However, performing such studies in an ad hoc way (without careful planning of relationships within the family) is inadequate. For example, an attempted meta-analysis of several studies on software inspections was unable to come to any higher-level conclusions because the studies were not designed, run, or reported in ways that were compatible [Miller00]. What is necessary is a framework for planning studies and data analysis in a way that enables reasoning about the commonalities of the results. The key method for creating families of related studies is through the use of replications. The main idea behind a replication is to take a study run to investigate some technology and repeat that study. The replication normally makes some slight modifications to the original study in order to either test new hypotheses or to verify existing results. In this way, a body of knowledge is built through studies designed to relate to one another, rather than through trying to impose a framework post-hoc on unrelated studies. Types of Replications A taxonomy describing how replications can relate to the original study was proposed in the TSE special issue on empirical studies [Basili99]. In this view there were three major categories of replications based on how key
dimensions were varied. Each of the dimensions below illustrates a particular reason for running a replicated study, although it should be noted that replications often contain several different types of changes from the original (for example, a replication might be primarily intended to study some particular change to the technology itself but also might introduce some refinements into the study design at the same time). I. To duplicate as accurately as possible the original study. These replications are necessary to increase confidence in the validity of the study. They demonstrate that the results from the original study are repeatable and have been reported accurately. They are also useful for teaching new empirical researchers how to run studies. An example is the ISERN Benchmark Re-Inspection Experiment (BRIE), an experiment to investigate the effectiveness of inspections and meetings that has been packaged at http://csdl.ics.hawaii.edu/techreports/9613/96-13.html. BRIE has been designed so that it is easy to carry out, and the expected results (or at least their general characteristics) are well-known and stable across replications. Thus, experimenters in multiple environments can run the BRIE experiment and compare their results in order to calibrate their experimental methods and gain experience in running experiments. II. To vary how the particular development technology is studied. These studies seek to increase our confidence in the results concerning a specified technology by addressing the same problem as previous studies, but altering the details of how the study is run so that threats to validity can be addressed. A.
To address external validity. These replications use the same study design as the original but use a different type of sample population to address concerns about whether the original results can be extrapolated beyond the original subjects. Such replications can be used to investigate whether results transfer between different industrial environments, or from a laboratory environment to a particular industrial environment. For example, an approach used at NASA's Software Engineering Laboratory was to run small focused studies first with students, to determine the feasibility of an approach or general principles. If those results proved successful then studies could be run with (more expensive) NASA personnel to test whether the results could transfer successfully into industrial practice [Basili97].
B.
To address internal validity. These replications investigate similar hypotheses to the original study but use a different design to address threats to validity resulting from the way the study is run. The goal is for the replicated study to contain different threats to validity than the original study so that, while neither is perfect on its own, both studies contribute to raising confidence that the results are rigorous and independent of the study methodology used. For example, a study by Ciolkowski et al. [Ciolkowski97] replicated an earlier study of inspection techniques in which the effectiveness of individuals during a review was studied. In the earlier study, the effectiveness of review meetings was not studied directly, but was instead simulated from the data collected on individuals using statistical techniques. The replicated study collected data from both individuals and review meetings, allowing the accuracy of those statistical methods to be verified.
III. To vary what is studied. These replications vary details of the particular development technology under study. A.
To vary details of the technology, for improvement within an environment. These replications investigate what aspects of the technology are important by systematically varying intrinsic properties of the technology and examining the results. For example, a series of studies was undertaken at the University of Maryland to evolve a new inspection technique in an incremental manner. First the idea was shown to be feasible; then additional studies were undertaken to optimize the inspection procedure. Procedural steps were re-ordered or dropped and terminology was refined in order to better match the procedure to the work practices of the students [Shull01].
B.
To vary details of the technology, for tailoring to a new environment. These replications vary certain environmentally dependent parts of the technology to identify potentially important environmental factors that can affect the results of the process under investigation, and hence the technology's suitability for various environments. Results also demonstrate the tailoring needed for various environments.
This type of replication requires the technology and other artifacts from the original study to be supplied in sufficient detail that changes can be made. This implies that the rationales for the design decisions must be provided (so that replicators know which features are easy to change and which are intrinsic to the effectiveness of the technology) along with the finished product. For example, an experiment undertaken in 2001 in a class at the Federal University of Rio de Janeiro used a set of English-language inspection techniques that had been shown to be useful in a similar classroom environment in the United States, and translated them into Portuguese. The recorded design rationales were used to determine where key concepts needed to be translated "as is" and where local equivalents were free to be substituted. More loosely defined, there are also replications that adapt only the basic idea behind a particular technology to a new goal or new situation. For example, an inspection approach that had proven effective for requirements inspections was adopted for use in usability inspections of web pages [Zhang99]. Key concepts were retained, such as having reviewers take the perspectives of different stakeholders of the inspected product, but adapted to the new domain, for example by focusing on expert and novice users of the web-based system rather than downstream users of the requirements. Benefits and Risks of Replications Running replications involves potential benefits but also dangers, which must be kept in mind while planning and running the studies. Benefits fall into two main categories: The first benefit is that by using replications the results of a study can be validated or corrected, in the case where the original study has methodological errors. If a study is replicated and produces similar results, then confidence in the validity of those results increases; as more replications produce similar results, that confidence grows further. On the other hand, if dissimilar results are produced, further analysis may trace the discrepancy back to issues in either the original or replication, helping to debug the experimental protocols. The second benefit is that the interaction of the different variables in the study can be better understood. By replicating a study and changing the dependent and independent variables, the scope of the effect of those variables on the process under study can be better understood.
On the other hand, the dangers that can arise when replicating studies have to be taken into account. The first and most important danger is that if the replication is done poorly or by an inexperienced researcher, the results could be at best incomparable or at worst contradictory. A well run study producing contradictory results provides enough information to understand the results and how they relate to the existing body of knowledge, but a poorly run study with contradictory results can only confuse the effort to provide useful decision support, by introducing unsubstantiated data into the body of knowledge. The second danger is that in the process of trying to make the replication more interesting, the researcher might change too many variables in the study design. When this situation occurs, the results are often incomparable because there are too many potential sources of variation that might account for any differences in results. For replications, as for individual studies, the goal must be to minimize the number of rival hypotheses that can be put forward to provide an explanation of the results.
4. Building a Body of Knowledge about Reading Techniques In this section, we illustrate how the body of knowledge is being built on the subject of software reading techniques (described in an earlier section) by describing four of the latest replications done on this topic. In each case, we describe briefly the mechanics of running the replication (to explore some of the important issues involved in replications in general) and concentrate on what each replication contributes to the body of knowledge. The studies described evaluated two different families of reading techniques (PBR for requirements and OORTs for high-level object oriented designs), allowing us to discuss lessons that were learned both about the specific technologies and about reading techniques in general. In this way we aim to show that replications are useful both for addressing specific practical questions about technology use and for abstracting up basic principles concerning a technology that can be used to tailor it for additional environments. These particular studies are also useful for illustrating the four types of replications described in the previous section. Figure 3 illustrates the time frame in which each study/replication pair was run, and shows that the example replications were selected from different periods in the maturity of the techniques. Some replications followed their original studies closely in time, while others came much later.
[Figure 3 places the studies on a timeline from 1994 to 2001. Requirements reading experiments: the original NASA study was replicated at the University of Sao Paulo (to address external validity) and at the University of Maryland (to address internal validity). OO design reading experiments: the original University of Maryland studies were replicated at the University of Maryland and the Norwegian University of Science and Technology (for improvement) and at the University of Southern California (for tailoring).]
Fig. 3. Relationships between replicated studies and originals.
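The relationships summarized in Figure 3 can also be recorded as data. The sketch below (Python; our own illustrative bookkeeping, not an artifact of the studies) tags each replication described in this section with the type of replication it represents, following the taxonomy of the previous section:

```python
from enum import Enum

class ReplicationType(Enum):
    DUPLICATE_ORIGINAL = "duplicate the original study as accurately as possible"
    EXTERNAL_VALIDITY = "vary the sample population to address external validity"
    INTERNAL_VALIDITY = "vary the study design to address internal validity"
    IMPROVE_TECHNOLOGY = "vary details of the technology for improvement"
    TAILOR_TECHNOLOGY = "vary details of the technology for tailoring to a new environment"

# The four replications discussed in this section (sites as reported in the text).
replications = [
    ("PBR requirements reading", "NASA/GSFC", "University of Sao Paulo", ReplicationType.EXTERNAL_VALIDITY),
    ("PBR requirements reading", "University of Maryland", "University of Maryland", ReplicationType.INTERNAL_VALIDITY),
    ("OO design reading (OORTs)", "University of Maryland", "Norwegian University of Science and Technology", ReplicationType.IMPROVE_TECHNOLOGY),
    ("OO design reading (OORTs)", "University of Maryland", "University of Southern California", ReplicationType.TAILOR_TECHNOLOGY),
]

for technology, original_site, replication_site, rtype in replications:
    print(f"{technology}: {original_site} -> {replication_site} ({rtype.value})")
```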
Later in the chapter, we abstract some lessons learned about running replications in general and describe some guidelines for "lab packages" that collect and organize the information necessary for effective replications that take these lessons into account. We also describe prototype repositories of studies that are being built to instantiate these packages. PBR PBR was piloted at NASA's Goddard Space Flight Center [Basili96] in Maryland and has been adapted for industrial use and training at Allianz Life Insurance, Bosch Telecom, and Robert Bosch GmbH, Germany. In addition, a series of empirical studies of PBR with students and professionals through universities in several countries has shown that it is more effective than less procedural approaches, e.g. [Ciolkowski97, Sorumgard97, Shull98].
Study at University of Sao Paulo In 2000 a study was run at the University of Sao Paulo in Brazil (USP) [Shull02b], which replicated a study that was originally run at NASA [Basili96] in order to address the external validity. The study was a controlled experiment of the Perspective-Based Reading (PBR) techniques for requirements review [Basili96], designed so that it can be run in many different environments, allowing the improvement due to PBR to be studied in comparison to many other inspection approaches already in use. Due to an experience package that was made available on the web specifically for facilitating replications,2 this experiment has been replicated many times, in different contexts. The replicating researchers reused the same design and materials as the original study, but changed the sample population in order to assess whether the conclusions of the original study held outside of the original subjects. This context was a good choice for this replication, because many of the previous runs of this study had shown an increase in inspection effectiveness for subjects with a mid-range of experience (i.e. subjects with a basic knowledge of some requirements stakeholder role, although not experts in inspections) due to the use of PBR. As a result, the experimenters reasoned that this could be a promising technique for instructing software engineering students, as the procedural approach can be used to provide guidance when subjects have not had the experience yet to develop their own inspection processes. The replication was undertaken by independent researchers with the support of the original experimenters, who were consulted during the design and preparation of the replication. A pilot study was run in the local environment before the main study, to increase the researchers' expertise in the technology and debug the experimental protocol. Subjects 18 undergraduate students from the University of Sao Paulo participated. Students had previously received introductory training in the perspectives they were asked to take during the review. The experience level of subjects was very different in the original study, which used 25 professional software developers from the National Aeronautics and Space Administration/
2
http://www.cs.umd.edu/projects/SoftEng/ESEG/manual/pbr_package/manual.html
Goddard Space Flight Center (NASA/GSFC) Software Engineering Laboratory (SEL).
Procedure The central research question was to evaluate the effectiveness of a systematic (i.e. procedure-based) approach to inspections. To achieve this, the experimental design compared PBR, which provided a systematic process focused on certain classes of defects, to a nonsystematic approach that could be used for the same defect classes. The nonsystematic approach was provided by a checklist-based approach focused on the same defect taxonomy as was used to create the PBR procedures. The experimental design consisted of training the subjects in the checklist approach and allowing them to review a document, then training them in PBR and allowing them to review a second document. Improvement was measured mainly by evaluating the defect detection effectiveness of each review (the average percentage of defects found). Because it was assumed that reviewers could not avoid incorporating elements of the systematic process training in the nonsystematic process, the order of applying the two review approaches was not varied (that is, systematic training always had to come after the nonsystematic review). However, two different documents were used in each review to ensure that the results were not document-specific. (These documents were from "generic" domains with which most subjects were expected to be familiar: an ATM system and a control system for a parking garage (PGCS).) Subjects in the original experiment, being professional developers at NASA's Goddard Space Flight Center, already had their own nonsystematic procedure for inspections [SEL92]. They also had additional experimental treatments where they applied both techniques to typical documents from their own environment (called NASA_A and NASA_B). Because subjects in the replication were students who did not have existing review procedures or work documents in their environments, the reviews of NASA-specific documents were simply dropped and training in a checklist technique had to be added (as illustrated in Figure 4). Data Collection The defect lists resulting from each subject's reviews were evaluated to determine the defect detection rate, i.e. the percentage of the seeded defects that were found by each reviewer. The data from the individual reviews was later used to investigate the defect detection effectiveness that can result
from review teams by using simulation techniques to evaluate the coverage that would result when individuals were grouped. The percentage of defect occurrences found by each approach was also measured, i.e. the number of defect reports actually made as a percentage of the number that would have been made if all reviewers had been 100% effective. Subjects were asked to fill out a background questionnaire at the beginning of the experiment, and an opinion questionnaire at the end.
[Figure 4 lists the activities in order: training in checklist review (replication only), review of a generic document (ATM or PGCS), review of a NASA document (NASA_A or NASA_B, original study only), training in PBR, review of the second NASA document with PBR (original study only), and review of the second generic document with PBR.]
Fig. 4. Order of activities in the study design. (Activities that did not occur in both the original and replicated studies are marked with an X and shaded in the appropriate place.)
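The quantitative measures described in the data collection above (individual defect detection rate, the percentage of defect occurrences reported, and team coverage simulated by pooling individual results) can be sketched as follows; the code below (Python, with hypothetical data) is our illustration of those definitions, not the experimenters' analysis scripts:

```python
from itertools import combinations
from statistics import mean
from typing import Dict, Set

def detection_rate(found: Set[str], seeded: Set[str]) -> float:
    """Fraction of the seeded defects found by one reviewer (or one pooled team)."""
    return len(found & seeded) / len(seeded)

def occurrence_percentage(found_by_reviewer: Dict[str, Set[str]], seeded: Set[str]) -> float:
    """Defect reports actually made, as a fraction of the reports that would exist
    if every reviewer had found every seeded defect (i.e. been 100% effective)."""
    reported = sum(len(found & seeded) for found in found_by_reviewer.values())
    possible = len(found_by_reviewer) * len(seeded)
    return reported / possible

def simulated_team_coverage(found_by_reviewer: Dict[str, Set[str]], seeded: Set[str], team_size: int) -> float:
    """Average coverage over all nominal teams of the given size, formed by pooling
    the defects found individually (no meeting effect is modelled)."""
    rates = []
    for team in combinations(found_by_reviewer, team_size):
        pooled = set().union(*(found_by_reviewer[reviewer] for reviewer in team))
        rates.append(detection_rate(pooled, seeded))
    return mean(rates)

# Hypothetical data: three reviewers, five seeded defects.
seeded = {"D1", "D2", "D3", "D4", "D5"}
found = {"r1": {"D1", "D3"}, "r2": {"D2", "D3", "D5"}, "r3": {"D1", "D4"}}
print(mean(detection_rate(f, seeded) for f in found.values()))   # average individual detection rate
print(occurrence_percentage(found, seeded))                       # share of possible defect reports made
print(simulated_team_coverage(found, seeded, team_size=2))        # average coverage of simulated 2-person teams
```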
Lessons Learned Previous studies had shown that PBR was at least as good as, and sometimes better than, the subjects' usual approach. Moderately-experienced reviewers exhibited some of the largest improvements in review effectiveness when using PBR. Although not a statistically significant trend, this had indicated that less experienced reviewers might be more likely to benefit from procedural guidelines. The Sao Paulo replication provided more data corroborating this supposition. For one document (ATM), the PBR group had a statistically better performance measured both by the percentage of defects that were found by the group overall as well as the total number of defect reports by
unique individuals, than the group that applied the checklist. For the other document (PGCS), the effectiveness and efficiency of the group applying PBR were only slightly better. As in previous studies, each unique stakeholder perspective found certain unique defects that were found by no other perspective. This indicates that the multiple perspectives each contribute value and really do provide a unique viewpoint of the document, minimizing overlap in effort. Taking these results in combination with the results of the original study, some lessons learned can be drawn about the PBR techniques. Because these studies were run in such different environments, we have more confidence in the robustness of the external validity of the conclusions (i.e. more confidence that they do not just hold in particular environments): • Results from both experiments show that under certain conditions, the use of detailed procedural techniques can lead to improvements for subjects who have a certain minimum experience in the review perspective but are not yet experts in the inspection procedure. This was true both in the case of the NASA professionals with the relevant range of experience as well as of the Brazilian students. This provides more evidence to corroborate the theory that PBR is an effective way to bring novice reviewers up to speed, provided a certain minimal amount of previous training has been achieved. • Both the original and the replication show benefits due to the detailed PBR techniques over less systematic approaches (the nonsystematic approach of NASA and the checklist approach at USP). This was true even when subjects had significantly more experience in the less systematic approach (as at NASA). This provides additional evidence that PBR can help improve inspection effectiveness for even experienced reviewers. Results The main results of this replication have been in the area of improving packaging. At the suggestion of the replicating research team, the original experimenters are trying to improve the packaging of the experiment by clarifying such items as the instructions for subjects, the time estimates for various tasks, and the descriptions of the known defects in the artifacts to be inspected. At this point, thanks in part to a joint NSF/CNPq project, further replications have been run based on these experiences and the lessons
learned. A total of four replications have now been conducted in Brazil, some at other universities (such as the Federal University of Sao Carlos) and others in industry. These have had the side-effect of highlighting another important benefit of replications: they facilitate the dissemination both of the underlying concepts and technology and of the experimentation technology. Study at University of Maryland The study described in this section is a consequence of a later study into the Perspective-Based Reading (PBR) approach to requirements inspection. Previous studies, such as those described in the earlier section on PBR, had shown that PBR was feasible and effective at detecting defects in certain environments. Our next goal had been to expand the understanding of the other variables that might affect the use of PBR. For that reason, a study was run in 1999 to collect observational data, at a very detailed level, about the use of the techniques. One problem with that study was that conclusions about the technology were based on the experiences of "beginners," subjects who had only just learned about the technology and did not have significant experience in applying it. Thus one threat to validity was that we were measuring early effects that would increase or diminish over time, as subjects got more familiar with its use. Therefore we decided to undertake a replication to address internal validity that could explore such learning effects in more detail, and increase the confidence with which we could draw conclusions about the likely effectiveness of the process in industry. Subjects The subjects were graduate students at the University of Maryland. The subjects were paired up with one subject acting as the executor (responsible for applying the procedure) and the other as the observer (responsible for recording observations about the application). There were 26 subjects grouped into 13 pairs; each subject got a chance to play each role, so all 26 performed a review. As in the original study, around 1/3 of the subjects had industrial experience reviewing requirements. Procedure To study the technology at the level of detail that we were interested in, in both the original study and the replication, an observational approach was
used. An observational approach is a research method suitable for understanding how a process is applied. In an observational study, a subject applies the technology being studied (the executor of the technology) while a researcher (the observer) observes them to understand details of the technology in use. Often the subject is instructed to "think aloud" so the researcher can better understand his or her thought processes. These types of studies provide a level of detail about individual process steps and their usefulness that is difficult to collect using traditional post-study questionnaires [Shull99]. The observational approach was needed to understand what improvements might be necessary at the level of individual steps, for example whether subjects experience difficulties or misunderstandings while applying the technique (and how these problems may be corrected), whether each step of the technique contributes to achieving the overall goal, and whether the steps of the technique should be reordered to better correspond to subjects' own working styles. Before the study, subjects received training in the reading techniques to be applied and in observational methods. In the first inspection, roughly half of the teams inspected the requirements for the LA and the other half the requirements for the PGCS. After this inspection was complete, the team members switched roles, i.e. the process observer in the first inspection became the process executor in the second inspection. The teams also switched requirements documents, from LA to PGCS or vice-versa. There was no team meeting to collect or detect more defects. The design is summarized in Figure 5.
[Figure 5 lists the activities in order: training in PBR, training in observational methods, an observed individual review using PBR (LA or PGCS), and a second observed individual review using PBR with roles and documents swapped (replication only).]
Fig. 5. Order of activities in the study design. (Activities that did not occur in both the original and replicated studies are marked with an X and shaded in the appropriate place.)
Data Collection Quantitative data was collected, such as the time required to perform the inspection using PBR, and the number and type of defects detected. Using the observational techniques, a rich array of qualitative data was also collected, including: • Subjective evaluation of the effectiveness of the technique. • Specific problems with steps in the technique. • Usefulness of the different perspectives. • Practicality of the techniques (whether subjects would use some or all of them again). • High-level problems with the techniques. Because each team performed two requirements inspections, the subjects were also able to include in their reports information about learning. Thus it was possible to evaluate whether the qualitative data was different because each team had an additional review during which to apply the techniques and would have learned from each other over the course of the study. Lessons Learned This study, when compared with the original, helped illuminate issues of internal validity by providing interesting results about our metrics for subject experience and learning. • There may be other experience factors that affect the performance of inspectors when using the PBR techniques, because the results of the first and second studies showed opposite correlations in terms of experience vs. performance. In the replication, software development experience was negatively correlated with performance in the inspection (unlike in the original study). • There was no statistically significant difference in review effectiveness depending on whether the subject was reviewing a document from a familiar domain (PGCS) or an unfamiliar one (LA). • There were no clear and consistent quantitative differences due to learning. For each team's second review, both review effectiveness and effort required were about the same as the historical baseline for the document being inspected. Although there were no quantitative indications of a learning effect, the qualitative data provided indications that the subjects were able to achieve
the same review effectiveness in later reviews while short-cutting or otherwise modifying the techniques. The qualitative data provided the following results: • 8/13 teams felt they improved as a result of the learning opportunity: they understood the steps better, and were more confident and efficient the second time. • 7/13 teams said that as a result of the learning opportunity they were able to change the application of the procedure to better match their own work practices, i.e. they could curtail or reorder steps when using PBR a second time. • 6/13 teams provided new questions that could be asked during PBR (domain- and organization-specific). Results The main results of this study have been to propose a new set of hypotheses to be evaluated. Because the replication was the first one done in order to begin to understand learning, some questions could not be answered in the design of this study. Because we did not observe the improved performance in the second inspection that we hypothesized, we must investigate some of these other questions. The new questions include: • Would we see a stronger learning effect if the same subject (as opposed to the same team) executed PBR two times? • Is there some threshold of expertise that a subject must have in order to effectively tailor the PBR process to his or her needs? Lessons Learned about PBR Both sets of replicated studies allow us to abstract up some common observations about PBR. Although not tested statistically, by combining data sets from all studies, we can formulate some general hypotheses based on patterns seen across multiple environments. • Subject experience is not consistently related to a subject's effectiveness applying PBR. We have been trying to define heuristics that would allow developers to understand how much experience is necessary to apply PBR most effectively, but so far we have not been able to find consistent patterns that hold up across multiple studies, despite looking at several measures of experience. For example, in the study at the University of Maryland, despite the subjects having similar amounts of industrial
experience, the relationship between software development experience and PBR effectiveness was opposite between the original and replicated studies. On the other hand, the study at the University of Sao Paulo showed that earlier results with professional subjects were consistent with results seen from very much less experienced undergraduate students. Based on these results, we have observed a general trend that allows us to hypothesize that the PBR techniques may be best for novice training, and that the most experienced users get much less benefit, if any, from using PBR as compared to their usual approach. However, further testing is needed to refine this hypothesis and provide clear results. • The structured review approach of PBR seems to help reviewers, even if the specific steps in the procedure aren't exactly right. In the Sao Paulo study, results consistently showed that the systematic techniques were at least as good as, and often better than, less structured approaches. The observational study at the University of Maryland provided one possible explanation of why this could be true: a structured approach makes improvement easier because over time the steps can be updated based on the review teams' experience. OORTs OORTs were first evaluated at the University of Maryland [Travassos99] and have been the subject of a series of empirical studies with students and professionals through universities and industry in different countries [Shull01], [Melo01], and [Travassos02] that have demonstrated their effectiveness. Study at Norwegian University of Science and Technology The replication [Arif01] was undertaken by an independent researcher in order to understand the feasibility of the techniques outside of the original environment, to see if the results of the original study [Shull99] were true only for the original set of subjects, or if the results were more globally applicable. In preparing to run the replication, the researcher hypothesized that the techniques were too detailed for easy use by subjects. The replicating researcher chose to vary some features of the techniques for improvement, removing certain details and moving the instructions to a more abstract level. The modifications to the techniques included:
• summarizing the detailed instructions of the original techniques into higher-level steps;
• adding more specific instructions to the reader to indicate the type of defect that was expected to be reported at different points in the techniques;
• changing the formatting.
The goal of these changes was to make the techniques feasible to perform in a shorter amount of time and to produce more accurate and less ambiguous defect lists. Also, additional detail requested in the defect reports was to be used for checking the contribution of each step of the process to the final defect reports. The researchers at the Norwegian University of Science and Technology were novices in the technology (there was no one on site who had received training in the technology directly). Subjects There were 19 students who were members of a Master of Engineering class in Computer Science at the Norwegian University of Science and Technology, organized into 2- and 3-person teams. 10% had industrial experience in design reading. Procedure The replication made use of observational techniques, which are described in the University of Maryland study, to examine the way in which subjects applied the less-detailed techniques. A modified version of the OO reading techniques from the original study was applied to the same artifacts used in the original study. A design was used in which half of the class reviewed the LA design and the other half the PGCS, as shown in Figure 6. Five teams inspected the Parking Garage and 4 teams inspected the Loan Arranger design documents. The differences from the design of the original study were as follows: 1) In the replication, there was no requirements inspection prior to the design inspection, so there could be no analysis of whether familiarity with the system requirements had an impact on the effectiveness of the design inspection. 2) Subjects switched roles (executor and observer) in the middle of the study, rather than keeping the same role throughout.
Training in the observational techniques of the study was performed in the same way as in the earlier study. Subjects received training in the techniques and observational study methodology, reusing the presentation slides of the original researchers. However, the class training and instruction were done by proxy, i.e. by an instructor who corresponded with the replicating researcher but had not participated in the training himself.
[Figure 6 lists the activities in order: individual inspection of the system requirements (original study only), training in the OORTs, training in observational methods, and an individual review using the OORTs (LA or PGCS).]
Fig. 6. Order of activities in the study design. (Activities that did not occur in both the original and replicated studies are marked with an X and shaded in the appropriate place.) Data Collection Quantitative data was collected, namely the time required for executing the techniques and the number and type of defects detected. However, observational techniques were the most important method used in this study, providing data on: • Executor's opinion of effectiveness of technique • Problems encountered with specific steps of procedure • How closely executors followed the techniques • Practicality of the techniques (whether subjects would use some or all of them again) • The problems encountered using the techniques Lessons Learned Unfortunately, the quantitative data from this study did not allow us to assess the improvement due to the new version of the techniques. It was
hoped that the replication would provide some indication of the effectiveness of the operationalized version of the techniques in comparison to the original version, but unfortunately that was not the case. The students did report a number of problems using the techniques, but it could not be determined if these were caused by changes introduced in the operationalized version, issues implicit in the original version and carried over, or problems that were uncovered while running the study. This problem actually led to many of the lessons learned that we formulated about running replications in general. However, comparison of these results to the original study did allow us to confirm the effectiveness of some basic design features: • Horizontal and vertical techniques found different types of defects (results were consistent with the original study). • Having domain expertise was not helpful for subjects in the design inspection (results were consistent with the original study). Results Results from both studies are together contributing to a new version of the techniques, using the data from the observations about the way that readers applied the techniques. This version of the techniques will be focused more on the semantics behind the design models and less on the syntax. Additional improvements are being made in regards to training; lessons learned concerning the packaging necessary to support independent replications are discussed in a later section. Study at the University of Southern California At this point in their development, we had some confidence from multiple studies that the techniques were feasible and effective. The replication described in this section, in which the techniques were introduced into a course at the University of Southern California (USC), was one of our first attempts to have the techniques used on real development projects outside of the University of Maryland. It was especially important because several key differences existed between the two environments that required small focused changes to the techniques for tailoring. These key changes from the original study [Travassos99] fell into the following areas: • Lifecycle Model: The students in the class used the Spiral lifecycle [Boehm88] model, whereas the original techniques were designed
for use in a waterfall lifecycle. The Spiral model emphasizes iteration and incremental development, therefore the requirements and design are created concurrently and evolved over time.
• Inspection Process: In these classes, the reading techniques were used in conjunction with a Fagan-style inspection process [Fagan86]. Previous studies had investigated the reading techniques in a different inspection context, in which the majority of defect detection was done during individual review. In the Fagan-style inspection process, the individual inspectors do some individual preparation, but use the team meeting as the primary time to detect the defects.
• Inspected Artifacts: The students were also required to use a set of modeling and documentation guidelines called MBASE [Boehm99]. MBASE focuses on ensuring that a project's product models (e.g. architecture, requirements, code), process models (tasks, activities, milestones), property models (cost, schedule, performance, dependability), and success models (stakeholder win-win, IKIWISI—I'll Know It When I See It, business case) are consistent and mutually enforcing. Because the format of the artifacts created under MBASE was different from the format of the artifacts used in previous studies, some minor modifications had to be made to the reading techniques, to check conformance to the guidelines and cross-references to other MBASE documents.
Our hypothesis was that these changes could be made without destroying the feasibility of the techniques in the new environment. To test this hypothesis, one of the early OORT feasibility studies from UMCP was selected and replicated in the new classroom environment at USC. Other factors were held constant as much as possible in order to have a high degree of comparability between the studies. The replication was done as a partnership between the developers of the OORTs and local researchers at USC, with both groups working together to adapt the techniques, the original researchers responsible for the classroom training and data analysis, and the local researchers responsible for data collection. Subjects The subjects were graduate students in the Computer Science Department at the University of Southern California enrolled in a two-semester Graduate
Level Software Engineering Course. (The study described in this section took place in the Spring 2001 semester.) The majority of the 31 subjects had industrial experience (61% of the students had developed software as part of a team in industry). Procedure The replication differed from the original study in the assignment of subjects to designs. In the original study, all teams used the horizontal reading techniques to inspect their own designs to ensure that they were consistent. They corrected any defects that they found. After the designs had been corrected, the teams traded designs. Each team then performed the vertical reading techniques on a design for another team. The list of discrepancies found by the reviewers was then returned to the authors of the design for correction. In the study at USC, both horizontal and vertical reading techniques were applied by subjects to their own designs. In both of these inspections, team meetings followed individual review. These activities are summarized in Figure 7.
[Figure 7 lists the activities in order: team inspection of the system requirements (own system, both studies); team inspection of the design (own design with horizontal techniques only in the original study; own design with both horizontal and vertical techniques in the replication); and a second team inspection of the design (another team's design with vertical techniques only, original study only).]
Fig. 7. Order of activities in the study design. (Activities that did not occur in both the original and replicated studies are marked with an X and shaded in the appropriate place.)
In the overall scope of the software development process there was no control group here. This occurred for two reasons: first, the design inspection was one small part of a larger study, and the overall design did not allow for a control group. Secondly, in a classroom environment, it was not possible to provide instruction on a topic to only a portion of the class.
The subjects used the Spiral development model to create their software. The OORTs were used in one development iteration to aid in the inspection of the designs. The subjects used Fagan-style inspections in their projects. Unlike the other studies, the goal of the individual review was to prepare the individual reviewers for the main defect detection effort, which occurred at the team meeting. So, the individual inspectors used the OORTs in the preparation phase to help them make a list of potential defects, which they wanted to discuss during the team meeting. The team manager decided which subset of the techniques was relevant for the team and which techniques were assigned to which reviewers. Data Collection Questionnaires and an analysis of the defect lists were used to evaluate the effectiveness of the techniques in the development process. The quantitative data collected included both background information and the amount of time taken to use the techniques, which was used to evaluate feasibility of use. The qualitative data collected by the questionnaires concerned: • Opinions of the helpfulness of the techniques. • Problems encountered using the techniques, or extra knowledge that was needed to use the techniques. • Opinions of effectiveness of training. Analysis of the defect lists provided quantitative data about the number and types of defects found by the teams. This data was useful in determining whether the reading process uncovered defects and whether its output was useful for continuing the development process. It was necessary to collect some additional information specific to the new context at USC, such as: • Whether the techniques were able to be used with the specific design artifacts of the team; • Whether any techniques were missing for a complete review of what the team felt to be important information.
subjects in their projects. The subjects also reported that they found the techniques useful and that the time required was not prohibitive in the context of the whole project. Most subjects thought the techniques were useful enough to recommend that they be used again. More importantly, the data indicated that the techniques were tailored effectively to the new environment (Fagan-style inspections in a Spiral lifecycle model using the MBASE guidelines). This assessment was based on the results that:
• The researchers were able to analyze the environmental differences and make corresponding changes to the techniques;
• Subjects were able to use the techniques without reporting major difficulties, and found real defects;
• Qualitative feedback from the students indicated that the training in the inspection techniques was one of the "5 most effective aspects" of the course.
As in the original study, the effort required for inspection using the techniques was not prohibitive. The average time spent preparing for the team meeting using the OORTs was 1.7 hours.

Results

Due to the success of the replicated studies and the value that students in the class felt the training in the technology brought them, researchers from UMCP and USC are collaborating to train a local expert in the techniques at USC to continue the use and training of the techniques in education there. The local expert will be in a position to investigate any further tailoring of the techniques that might be needed for the local environment. As a result, further studies will be undertaken to understand how the reading techniques can be integrated with MBASE and the Spiral model to make for a more effective and unified development approach. One idea that we will concentrate on especially is whether the reading techniques can be used at different levels of detail at different points in the Spiral lifecycle, for example, to give a high-level review of the concepts at early iterations of the Spiral and then reviews at successively deeper levels of detail as the documents get fleshed out over time.

Lessons Learned about OORTs

As with PBR, both sets of replicated studies allow us to abstract up some common observations about the technology under study. Our current set of
hypotheses about the OORTs, based on patterns seen across multiple environments, includes:
• OORTs are feasible: All studies have confirmed that they do find real defects, that the different horizontal and vertical techniques are focused on different types of defects, and that the techniques can be used in a variety of environments.
• Some evidence has been provided to support hypotheses about the type of experience necessary for applying the techniques. For example, the surprising result from the original study that having a high level of experience in a domain was not beneficial during the inspection was confirmed in the replication. However, because of its counter-intuitive nature, more evidence is needed to support this hypothesis.
• Qualitative analysis of the difficulties seen across all studies shows that the design methodology of the document author must be communicated to the inspectors to cut down on the number of defects raised because of different approaches to design. For instance, the type and level of detail of information that goes in a high-level design as opposed to a low-level design must be clearly understood. Design reviewers have to be reminded of the possibility of multiple 'correct' designs for any given specification, and not to confuse "defects" with "personal design preferences." At the same time, it should be recognized that some items reported by subjects, while not defects in the sense that they require a fix to be made, should be taken as advisory comments that could improve the quality of the design. Hence, even "false positives" reported can be valuable and should not simply be discarded.
• Certain design choices and environmental assumptions may lead to process tailoring. The study at USC provided some evidence that such tailoring can be done effectively, at least when both the inspection techniques and the local environmental characteristics are sufficiently well understood.
• At this point in the evolution of the OORTs, we need controlled studies. A common thread in all the studies reported in this section has been the comparison of one version of the OORT techniques to a previous version, in order to test whether changes for improvement or tailoring have been made effectively. Based on this series of studies, the techniques have become quite sophisticated compared with the earliest versions and many of the original issues
have been addressed. It is our subjective evaluation that this body of data shows that the techniques are ready to be compared to other forms of design inspection approaches for evaluation.

Lessons Learned about Reading Techniques

By combining the results of the multiple replications discussed in this chapter, we were able to abstract up some common observations about reading techniques in general that are supported by some evidence across environments, studies, and the specific families of reading techniques being studied. Although at a high level of generality, lessons learned at this level have helped us understand the common basis of effective reading techniques and further tailor that basic approach to new types of inspections and new environments.
• A procedural approach to individual review is a feasible way both to find defects and to focus reviewers' attention on different aspects of the document under review. In all studies it was shown that:
o All reviewers reported issues that represented real defects (not just differences of perspective or opinion);
o Reviewers found different types of defects depending on the specific reading techniques in the family that they were applying.
• Experience measures are hard to match to process recommendations. The lack of a consistent correlation between any experience measure and review effectiveness is itself a result of these studies. It remains for further study to show whether this is because there simply is no correlation between effectiveness and experience, or because we haven't yet found an appropriate way to measure the relevant experience. We are continuing to explore several hypotheses (e.g. experts don't benefit as much from introducing reading techniques, while novices with sufficient background benefit the most). Such hypotheses have definite implications for practice if we can build an acceptable level of confidence: for example, they could show that reading techniques can help with cross training (i.e. bringing experts up to speed on each other's roles), or can help novices fill in when the experts aren't available to participate in reviews.
• We have varying degrees of confidence about several different design principles for developing reading techniques. Going back to the three "best practices" listed earlier, we have collected some information validating that these are good design principles:
o Focused perspectives: Results across all studies described in this chapter are clear that focused perspectives are feasible and effective. When the techniques were compared to one another, it was shown that different techniques found different things (showing that focusing of a reviewer's attention can be done effectively). When compared to checklist or ad hoc review approaches, it was shown that the focused techniques did better than unfocused inspection approaches.
o Active review: The benefits of an active review, where subjects have to work with and somehow manipulate the information in the document, have never been tested separately in a controlled study. However, some qualitative evidence has been collected to address the issue: some participants have reported feeling more comfortable following directions to work on intermediate models of the information rather than being given no guidance as to exactly how to check the information in the document. Studies are needed that can adequately test this hypothesis, e.g. by looking at how people perform when not working with models during review.
o Defect taxonomies: Although it is intuitive that making defect taxonomies explicit helps improve inspections by telling reviewers more clearly what to look for, there has been no direct indication of the effectiveness of this approach from the replicated studies described here. This approach has always been confounded with other design principles, because in the context of the studies it would not have been a fair comparison to ask people to use different approaches while giving them different amounts of guidance as to what to search for.
5. Lessons Learned about Replication

Based on the four replications discussed in the previous section, in addition to the lessons we learned about reading techniques we were also able to learn some lessons about running replications. We report those lessons here as a guide for researchers running replications in the future.
• Having a local expert on hand who is well versed in the technology being studied is crucial.
• Pilot studies by the replication researchers can be important for understanding and conveying information about the study and proper expectations to the subjects (e.g. concerning time requirements for participation).
• When replicating a study, improving the training by allowing the subjects a laboratory setting in which to practice is seen as positive by the subjects.
• "Modular" designs, such as the one used to support the replication in Sao Paulo, can facilitate replications by including both generic and environment-specific parts. With this type of design, it was easy to adapt the experiment for use in a non-NASA environment since environment-specific treatments could be removed from the design without diminishing the overall ability to investigate the hypotheses.
• When the subject of the technology is "requirements defects" it can be hard to communicate clearly enough what the issues are with the requirements document, in order to enable the replicating researchers to identify them specifically in the subjects' defect reports. Subjects have many, many ways of expressing the items they find on their defect reports, and deciding whether or not the reported items represent real defects can be difficult.
• It is hard to understand how different subjects may interpret the same instructions: It became clear in retrospect that the Brazilian students did not report some defects that they found but we would have liked to have studied, because they noticed those defects when not applying a step of the inspection procedure and thought we only cared about defects directly related to the procedure. There were also terminology problems: The word "scenario" was confusing in Brazil because they had previously learned a specific definition of the term for use case creation, and it was applied less formally in another context during the training. Another example was the word "service," which was used in the OORTs to represent one level of design abstraction but has a specific meaning in the telecom industry. Terminology must be clearly defined and explained to subjects.
The last point demonstrates how hard it is to do an exact replication of a study, even when the original and replication research teams are working closely together. There are many misunderstandings or simply different interpretations that can occur on the part of subjects due to cultural or
domain issues, and trying to identify them all beforehand is almost impossible. Because of the above lessons learned, we have formulated the following guidelines that we feel are crucial for getting the planned type of results from a replication:
• Choosing the right type of replication is necessary for running an effective study. A first step in preparing for any replication should be to analyze the replicating environment in order to understand if tailoring of the technology is necessary. At the point in the development of the technology at which this replication was run, it was not clear which aspects were important and which aspects could be altered for improvement. Therefore, if the changes are driven by specific environmental factors, the replication will have a better chance of success.
• Don't minimize the importance of "training the trainer." Effective training of the researcher who is going to perform the replication is crucial. When there is no local expert and the expectations are not clearly communicated, problems arise that most likely could have been avoided.
• An effective replication requires packaging the object of study with key context information, not just packaging the study design itself. It is clear from this study that packaging a study for replication requires a lot of work on the part of the original researchers. The study package requires much more than just the artifacts used in the study. It is difficult to capture all of the knowledge needed to replicate a study.

Moving toward greater formality in building bodies of knowledge

We stated earlier that the field is not yet sophisticated enough for us to build bodies of knowledge using formal statistical methods. Based on our experiences so far, however, we can begin to formulate a potential meta-data description of the studies that would provide a necessary step towards this goal. This meta-data would describe the context in which a study has been run, capturing the important aspects of different replications on the same topic. Such meta-data would allow readers to understand the relevance of studies to their own environment and what contributions they make to the larger body of knowledge. Both the study data and the meta-data must be used together to draw conclusions about the technology.
Capturing meta-data in a comparable way across multiple environments requires, first, that the environments can be meaningfully compared and, second, that the same definitions of the meta-data variables are used for each environment. Describing the environments in a comparable way requires measuring at a comparable level of granularity, i.e. measuring equivalent organizational units. For example, there is probably no way to describe "the" software development process at a large organization that incorporates a multitude of different development projects, each with its own technologies used and set of quality concerns. Rather, the meta-data should be measured at the level of individual projects. That is, each set of meta-data should describe one project, not larger heterogeneous organizational entities like divisions or departments. The types of questions that could be answered in a meta-analysis of software technology data basically concern three sources of context variation:
• Of the software development technology. (Is a particular change to the development process more or less effective than some alternative process?)
• Of the systems to which the technology is being applied. (Are different types of artifacts, notations, or systems more or less suited to different development technologies?)
• Of the personnel applying the technology to systems. (Do subjects with different levels of experience or skills apply the technology differently?)
Choosing meta-data that describe each of these aspects of an environment limits the specific questions that can be answered by the analysis, so it is important to choose the meta-data wisely. Choosing a common set of meta-data should take into account well-supported ideas about which variables are responsible for affecting software development effectiveness. Describing a technology is difficult, but can often be addressed by describing as specifically as possible what version of a given tool, process, or methodology was used. To describe the application profile of the systems to which the technology is applied, we have identified the following set of initial meta-data:
• Application Context: The description should contain information on whether the object of study was used in an isolated (or classroom) development task not done as part of the development of a real
project (e.g. a requirements inspection of pre-existing documents, with no action taken on the resulting defect lists) or as part of an actual development project.
• Size of project: Maximum team size required to complete the project.
• Type of application: Brief description of the domain of the application. May include a link to a prepackaged application description.
• Platform: Values might include: Mainframe, client/server, network, applications generator. (This list should be extended as needed.)
• Process Drivers (ranked): What were the primary quality factors of concern to the subjects? To answer, the quality factors from the following set should be listed in order as appropriate: Requirements, cost, schedule, dependability, option exploration.
• Time Period: Include the start and end dates, and any key milestones.
The subjects applying the object of study can be described by means of the following attributes:
• Developer skill and experience: Were the subjects students or professionals? The number of years of experience can also be measured in certain technology-specific areas. For example, in general we could measure a subject's number of years of industrial experience (although because of wide interpersonal variation this is usually not so useful). For requirements inspection technologies, we might also measure the number of years (or number of projects) of experience with inspections, with requirements inspections specifically, and with creating requirements.
• Relative skill level: Subjective assessment of how subjects compare to other subjects in similar types of populations.

Example Instantiations

As a practical way to avoid single isolated studies, and to allow these expensive undertakings to contribute to the larger body of knowledge, the United States' National Science Foundation funded the Center for Empirically Based Software Engineering (CeBASE) in 2000. CeBASE has as part of its mandate the role of improving software development by communicating to developers what heuristics and models exist to support
decision-making based on empirical research. CeBASE has researched methods for abstracting and modeling the needed information for decision support across multiple studies, and collaborates on further empirical studies where necessary to support that research. We are refining our conception of packaging by creating a repository of studies to facilitate replications in the area of reading techniques. This repository, currently still under construction, is available to CeBASE affiliates. See http://www.cebase.org/www/researchActivities/defectReduction/index.htm for more information. A related repository based on a similar setup is also being created for the Simula Research Lab in Norway, http://www.ifi.uio.no/isu/forskerbasen. The ViSEK project has a similar mandate to create an experience base of empirically validated technologies for German industry (www.visek.de). Finally, the Experimental Software Engineering Research Network (ESERNET, www.esernet.org) is also interested in families of related studies that can be used to abstract higher-level conclusions.
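As a rough sketch of how the study meta-data discussed above might be recorded in such a repository, the following encodes the application-profile and subject attributes as a simple machine-readable record. The field names, the structure and the example values are ours, chosen purely for illustration; they are not a schema published by CeBASE or any of the other repositories.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class StudyMetadata:
    """Hypothetical record describing the context of one inspection study.

    Fields follow the application-profile and subject attributes discussed
    in the text; names and structure are illustrative, not a published schema.
    """
    # Application profile
    application_context: str          # e.g. "classroom task" or "real project"
    max_team_size: int                # size of project (maximum team size)
    application_type: str             # brief domain description (may be a link)
    platform: str                     # mainframe, client/server, network, ...
    process_drivers: List[str]        # quality factors, ranked by importance
    time_period: str                  # start/end dates and key milestones
    # Subjects
    subject_type: str                 # "students" or "professionals"
    years_inspection_experience: Optional[float] = None
    years_requirements_experience: Optional[float] = None
    relative_skill_level: Optional[str] = None   # subjective assessment
    notes: List[str] = field(default_factory=list)

# Invented example values for a single classroom study.
example = StudyMetadata(
    application_context="classroom requirements inspection, no follow-up development",
    max_team_size=5,
    application_type="parking garage control system (artificial specification)",
    platform="client/server",
    process_drivers=["requirements", "schedule", "cost"],
    time_period="one semester",
    subject_type="students",
    years_inspection_experience=0.5,
    relative_skill_level="mixed; most subjects had some industrial experience",
)
```

Records of this kind, kept at the level of individual projects rather than whole organizations, could then be pooled and filtered by readers looking for studies relevant to their own environment.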
6. Conclusions In this chapter we have discussed a family of reading techniques useful for the individual inspection (reading) of software documents. Specifically, we have focused the discussion around a technique for inspection of requirements (PBR) and a technique for inspection of object-oriented design (OORTs). A series of empirical studies have been run on these two sets of techniques in order to evolve and improve them. This chapter has provided a discussion of the latest series of studies that we have run on PBR and OORTs. Based on these studies, one of the key conclusions is that there is a need to analyze the results of these studies collectively. Isolated studies only provide part of the picture, but families of coordinated studies allow researchers to build up bodies of knowledge about development technologies. In this chapter we have provided a brief discussion and justification for the need to build bodies of knowledge. One of the main techniques that we have found useful in building these bodies of knowledge is that of replications. We have explained how replications can be useful in building bodies of knowledge. In conjunction with that we provided a discussion of the various types of replications and when each type should be used.
These different types of replications are then illustrated by a discussion of the studies used to evolve and improve PBR and OORTs. For each of the replications, we have provided some justification as to why the particular type of replication was chosen. At the end of each discussion, we have discussed the results learned from the replication that could not have been learned from a single study. We conclude the discussion of the replications by providing lessons learned about PBR, OORTs and reading techniques in general from the combination of all the replications that we could not have learned from any single replication. We then have concluded the chapter by discussing how the process of replication can be improved. Our experiences have provided us with several guidelines concerning how and how not to run replications. The main vehicle for successful replications is the lab package. We have shown the necessity of such packages, and also given some discussion as to how researchers can begin to build these lab packages.
References

[ANSI84] ANSI/IEEE. "IEEE Guide to Software Requirements Specifications." Standard Std 830-1984, 1984.
[Arif01] Arif, T. and Hegde, L.C. "Inspection of Object Oriented Construction: A Study of Reading Techniques Tailored for Inspection of Design Models Expressed in UML." Prediploma Thesis, Norwegian University of Science and Technology, Nov. 2001. Available at http://www.idi.ntnu.no/grupper/su/sif8094reports/p2-public.pdf
[Basili96] Basili, V.R., Green, S., Laitenberger, O., Lanubile, F., Shull, F., Sorumgard, S. and Zelkowitz, M.V. "The Empirical Investigation of Perspective Based Reading." Empirical Software Engineering: An International Journal, 1(2): 133-164, 1996.
[Basili97] Basili, V. "Evolving and Packaging Reading Technologies." The Journal of Systems and Software, 38(1): 3-12, July 1997.
[Basili99] Basili, V., Shull, F. and Lanubile, F. "Building Knowledge through Families of Experiments." IEEE Transactions on Software Engineering, 25(4): 456-473, July 1999.
[Boehm88] Boehm, B. "A Spiral Model of Software Development and Enhancement." IEEE Computer, 21(5): 61-72, May 1988.
[Boehm99] Boehm, B., Port, D., Abi-Antoun, M. and Egyed, A. "Guidelines for the Life Cycle Objectives (LCO) and the Life Cycle Architecture (LCA) Deliverables for Model-Based Architecting and Software Engineering (MBASE)." USC Technical Report USC-CSE-98-519, University of Southern California, Los Angeles, CA 90089, February 1999.
[Ciolkowski97] Ciolkowski, C., Differding, C., Laitenberger, O. and Muench, J. "Empirical Investigation of Perspective-based Reading: A Replicated Experiment." International Software Engineering Research Network, Technical Report ISERN-97-13, 1997. http://www.iese.fhg.de/ISERN/technical reports/isern-97-13.pdf
[Fagan76] Fagan, M.E. "Design and Code Inspections to Reduce Errors in Program Development." IBM Systems Journal, 15(3): 182-211, 1976.
[Fagan86] Fagan, M.E. "Advances in Software Inspections." IEEE Transactions on Software Engineering, 12(7): 744-751, July 1986.
[Gilb93] Gilb, T. and Graham, D. Software Inspection. Addison-Wesley, Reading, MA, 1993.
[Knight93] Knight, J.C. and Myers, E.A. "An Improved Inspection Technique." Communications of the ACM, 36(11): 51-61, Nov. 1993.
[Lewis95] Lewis, T., Rosenstein, L., Pree, W., Weinand, A., Gamma, E., Calder, P., Andert, G., Vlissides, J. and Schmucker, K. Object-Oriented Application Frameworks. Manning Publications Co., Greenwich, 1995.
[Melo01] Melo, W., Shull, F. and Travassos, G.H. "Software Review Guidelines." Systems Engineering and Computer Science Program, PESC ES-556/01, COPPE/UFRJ, September 2001. http://www.cos.ufrj.br/publicacoes/reltec/es55601.pdf
[Miller00] Miller, J. "Applying Meta-Analytical Procedures to Software Engineering Experiments." Journal of Systems and Software, 54(1): 29-39, 2000.
[Porter95] Porter, A., Votta Jr., L. and Basili, V. "Comparing Detection Methods for Software Requirements Inspections: A Replicated Experiment." IEEE Transactions on Software Engineering, 21(6): 563-575, June 1995.
[SEL92] Software Engineering Laboratory Series. Recommended Approach to Software Development, Revision 3. SEL-81-305, pp. 41-62, 1992.
[Shull98] Shull, F. "Developing Techniques for Using Software Documents: A Series of Empirical Studies." Ph.D. Thesis, Computer Science Department, University of Maryland, 1998.
[Shull99] Shull, F., Travassos, G.H., Carver, J. and Basili, V. "Evolving a Set of Techniques for OO Inspections." Technical Report CS-TR-4070, UMIACS-TR-99-63, University of Maryland, October 1999. http://www.cs.umd.edu/Dienst/UI/2.0/Describe/ncstrl.umcp/CS-TR-4070
[Shull01] Shull, F., Carver, J. and Travassos, G.H. "An Empirical Methodology for Introducing Software Processes." In Proceedings of the European Software Engineering Conference, Vienna, Austria, Sept. 10-14, 2001, pp. 288-296.
[Shull02a] Shull, F. "Software Reading Techniques." In the Encyclopedia of Software Engineering, Second Edition. John Wiley & Sons, 2002.
[Shull02b] Shull, F., Basili, V., Carver, J., Maldonado, J., Travassos, G.H., Mendonca, M. and Fabbri, S. "Replicating Software Engineering Experiments: Addressing the Tacit Knowledge Problem." Accepted at the International Symposium on Empirical Software Engineering 2002, Nara, Japan, October 2002.
[Sorumgard97] Sorumgard, S. Verification of Process Conformance in Empirical Studies of Software Development. Ph.D. Thesis, Norwegian University of Science and Technology, February 1997, Chapters 10-11. http://www.idt.unit.no/~sivert/ps/Thesis.ps
[Travassos99] Travassos, G.H., Shull, F., Fredericks, M. and Basili, V.R. "Detecting Defects in Object Oriented Designs: Using Reading Techniques to Increase Software Quality." OOPSLA '99, Denver, CO, Nov. 1999.
[Travassos02] Travassos, G.H., Shull, F., Carver, J. and Basili, V.R. "Reading Techniques for OO Design Inspections." Technical Report CS-TR-4353, UMIACS-TR-2002-33, University of Maryland, 2002, 56 p. http://www.cs.umd.edu/Library/TRs/. Also available at http://www.cos.ufrj.br/publicacoes/reltec/es57502.pdf
[Votta93] Votta Jr., L.G. "Does Every Inspection Need a Meeting?" ACM SIGSOFT Software Engineering Notes, 18(5): 107-114, December 1993.
[Wood99] Wood, M., Daly, J., Miller, J. and Roper, M. "Multi-method Research: An Empirical Investigation of Object-oriented Technology." Journal of Systems and Software, 48(1): 13-26, 1999.
[Zhang99] Zhang, Z., Basili, V. and Shneiderman, B. "Perspective-based Usability Inspection: An Empirical Validation of Efficacy." Empirical Software Engineering: An International Journal, 4(1): 43-70, March 1999.
CHAPTER 3
Combining Data from Reading Experiments in Software Inspections
A Feasibility Study

Claes Wohlin
Dept. of Software Engineering and Computer Science, Blekinge Institute of Technology, Box 520, SE-372 25 Ronneby, Sweden
claes.wohlin@bth.se

Håkan Petersson
Dept. of Communication Systems, Lund University, Box 118, SE-221 00 Lund, Sweden
hakanp@telecom.lth.se

Aybüke Aurum
School of Information Systems, Technology and Management, University of New South Wales, Sydney NSW 2052, Australia
aybuke@unsw.edu.au
Software inspections have been around for 25 years, and most software engineering researchers and professionals know that they are generally a cost-effective means for removing software defects. However, this does not mean that there is consensus about how they should be conducted in terms of reading techniques, number of reviewers or the effectiveness of reviewers. Still, software inspections are probably the most extensively empirically studied technique in software engineering. Thus, a large body of knowledge is available in the literature. This paper uses 30 data sets from software inspections found in the literature to study different aspects of software inspections. As a feasibility study, the data are
amalgamated to increase our understanding and illustrate what could be achieved if we manage to conduct studies where a combination of data can be collected. It is shown how the combined data may help to evaluate the influence of several different aspects, including reading techniques, team sizes and professionals vs. students. The objective is primarily to illustrate how more general knowledge may be gained by combining data from several studies. It is concluded that combining data is possible, although there are potential validity threats. Research results are examined with reference to software inspections on three levels: organization, project and individual. Keywords: Software inspections; reading technique; empirical study; combining data.
1. Introduction

Software inspections have over the years been accepted as a key principle in software engineering. They were first formalized and described by Fagan in 1976 [Fagan76]. Since then inspections have been researched and widely applied. Several variants of inspections have been proposed [Parnas85, Bisant89, Martin92, Knight93]. Software inspections are probably also the most thoroughly empirically studied subject in software engineering [Basili96, Laitenberger97, Porter95, Porter97, Regnell00, Votta93]. Consequently, several books are now available on this subject [Gilb93, Ebenau94]. The volume of studies in this area implies that it may be possible to combine the various pieces of empirically derived information to build a body of knowledge regarding the effectiveness of software inspections and different aspects of inspections. Examples of such aspects are reading techniques, team size and performance of individual reviewers. Combining empirical information, however, is not a simple task. To build a body of knowledge in software inspections from published studies requires that the results from these studies are comparable. This imposes significant requirements on the descriptions of the published studies. For example, there are consistency issues regarding descriptions of context, subjects, artifacts and other aspects between the different studies. There have been successful attempts to produce so-called lab packages to encourage replication continuity, such as those based on Basili et al.'s study [Basili96]. This is a great starting point, but there is still much to be done. We need ways of documenting empirical studies so that it is possible to combine the results from different studies to allow both meta-analysis
[Pickard98, Miller99, Hayes99] and the pooling of data. The latter refers to the combination of data sets, which is the approach used in this paper. The objective is of course to create new or more general results by amalgamating the results from other studies. However, the validity of both meta-analysis and pooling of data may be challenged, since it is always problematic to combine information from different sources. From the published literature it is often hard to understand the exact context of a given study, and different studies may have dependencies through, for example, usage of the same artifacts or subjects. However, the alternative of not combining information or data from different studies is not attractive, since it would mean that studies are primarily interpreted as single events and generalized knowledge is hard to construct. Thus, the challenge is to try to combine information and data in such a way that the results indeed become a general collection of knowledge and experiences. This may be particularly appropriate for some examples in software inspections, especially when the inspections can be viewed as a random sample of inspections in general or when the context is limited to, for example, a specific company. The data in this paper do not fulfil these criteria, since they are based on convenience sampling [Robson93]. Hence, the main objective is to illustrate what is feasible when combining the information or data that is available. The primary objective of this paper is to illustrate the types of generalized results that can be derived if we were able to combine different studies, whether combining the data or combining the results. In particular, the intention is to show the opportunities for evaluating results at different organizational levels, including the organization itself, teams in the preparation phase of software inspections and individual performance. A secondary objective is to present some results from the combination of data from 30 data sets found in the software inspection literature. The actual results should be interpreted with some caution since the data sets are based on availability and hence they are not a true random sample from a population. However, the results may be used as a first indication of what can be expected. In addition, it is of course very important to see whether the results of our combination of data sets are more generally valid even though they are based on convenience sampling. The primary and secondary objectives are illustrated on three different levels, i.e. organization, project and individual (see Sections 5, 6 and 7), where each level has its own objectives. These are however primarily
presented to illustrate how the overall approach can be applied to the different levels of analysis. We have chosen to perform this feasibility study in an area where quite a number of experiments have been carried out. However, when performing the analysis we realized that we had insufficient knowledge of the published studies; thus it is still very hard to perform studies of this type and come to generally accepted conclusions. This points to a very important issue, namely that we must improve the way we document and report experiments. Otherwise, experimental studies will continue to be isolated studies and we will be unable to build a solid body of knowledge based on empiricism. The ability to combine data or results is, in the authors' opinion, a key issue for the success of empiricism in software engineering. With this paper, we hope to illustrate that if the challenges of combining information from different studies can be overcome, then there are opportunities to answer some important research questions in the software engineering community. The paper is structured as follows. Section 2 discusses the characterization of software inspection studies. The data sets used in the analysis and some issues related to the data sets are introduced in Section 3. Analyses and discussions are made on three different levels: organization, project and individual. These levels are discussed in Section 4. The following three sections discuss the results for the levels. In Section 5, organizational benchmarking in software inspections is discussed. Software inspection planning at the project level, in terms of team size for software inspections, is examined in Section 6. Section 7 presents the results on the individual level and finally the conclusions are presented in Section 8.
2. Characterization of Studies

2.1 Introduction

There are many reasons for combining the results and data sets of software inspections. Potential reasons include making an organizational benchmark study, conducting an internal study to maximize effectiveness within an organization, or measuring the ability of individual reviewers. The different types of studies are further discussed in Section 4, and then elaborated on with specific data in the following sections to illustrate the actual opportunities at different levels. In any case, it is important to document these studies to enable greater understanding and comparison among them. To support inspection comparisons, it is necessary to:
• Characterize each inspection process to be compared by application, environment and people factors (i.e. qualitative description),
• Use comparable measures across inspections (i.e. quantitative measurements).
It is possible to perform only a qualitative comparison using the characterization. Fuller characterization also yields the possibility of comparing inspection processes quantitatively. In this case the qualitative description may act both as a means for comparison and as a way of characterizing the process to enable quantitative comparison. The addition of measures means that it is possible to quantitatively compare inspection processes. The main measures to compare include effectiveness (how large a proportion of the defects were found?), efficiency (how many defects were found over a specific period of time? This can be described as defects found per time unit) and the number of reviewers (how many reviewers were needed to achieve a certain level of effectiveness or efficiency?).

2.2 Characterization

A key aspect of comparison is the characterization, which can either be used as a stand-alone qualitative description or as part of a quantitative evaluation, where the characterization can also be used to support the identification of suitable quantitative comparisons. The characterization includes three perspectives and five aspects that are characterized (see Table 1). The characterization is based on the authors' experience from working with software inspections. The first perspective is the normal working situation, which should capture characteristics related to the working environment and the typical applications developed. The second perspective is related to the resources in the study, i.e. the people participating in the study and the applied inspection process. The third perspective is a characterization of the unique aspects of the study. The latter refers to the fact that in many studies a document is developed for a specific study or reused from another study. In many cases, this means that a specific study is conducted in a controlled environment where other artifacts, notation and so forth differ from what the subjects are used to.
Work perspective
  Environment: Phase; Normal notation
  Application: Domain
Resources perspective
  People: Native language; Experience in application; Experience in environment
  Process: Inspection type; Roles; Individual defect detection technique (e.g. reading technique); Meeting; Tool support; Protocol; Procedure for re-work
Study perspective
  Specifics: Artifact type; Artifact notation; English or translated; Number of known defects; Experience in study application; Distance from normal artifacts

Table 1. Characterization of software inspections.

From an environmental point of view, it is important to document the type of inspection (e.g. Fagan [Fagan76] or Phased-Inspections [Knight93]) that is studied as well as the normal notation used in each phase. The characterization in Table 1 may be used both for quantitative and qualitative comparisons. The former is however the main focus here. The type of application normally developed is important. This should not only include the application domain, but also some additional information, for example, whether it is a soft or hard real-time system that is normally developed. Next, the people participating in inspections have to be characterized. This includes their native language and experience, both in terms of the application domain and the environment. The inspection process has to be documented. It is important to collect as much relevant information as possible about the process. This includes the type of inspection (e.g. Fagan, or walkthrough); the roles used in the inspection; the techniques used for individual defect detection, if any; the data collection procedure (for example, whether comments are sent by e-mail or collected during a meeting); who participates (both as reviewers and in any prospective meeting); and whether any tool support is used in the inspections. It is also essential to document how protocols are written and the procedure for re-work. The processes as
applied may be different from the processes as documented, meaning that ethnographic techniques may be appropriate. Finally, it is important to document aspects that relate to a particular study, i.e. aspects that are specific to the study at hand. This includes the type of artifact, the notation used and the number of defects in the artifact used for the study. Preferably, artifacts are developed and made available to other researchers. In these cases, it is advantageous if the artifact can be reused as is. However, in some cases this may be impossible, and it may be necessary to translate it. If this is the case, it needs to be documented that the artifact has been translated from and to particular languages. In many controlled studies, the number of defects is known. This number needs to be identified. Moreover, it is important to document the experience of the people in the application domain of the study (especially if different from their normal application domain). To increase the comparative value of a study, it is important to document the difference in the inspection situation in comparison with how inspections are either described in the literature or conducted at a specific company. In other words, the distance from the normal artifacts and the normal inspection situation to that of the study has to be captured. This is preferably done in a survey after having done the study. The descriptions should include as many aspects as possible, including application type, language, notation, complexity of the review artifact and also general opinions among developers. The developers could, for example, answer questions such as: Was this inspection representative compared to those you normally perform? Was it easier or more difficult? Did you find more, fewer or about the same number of defects as usual?
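As a loose illustration of how the Table 1 characterization might be recorded alongside a study's quantitative data, the record below mirrors the perspectives and aspects discussed above. The keys and the example values are our own invention and not part of the characterization scheme itself.

```python
# Illustrative only: one way to store the Table 1 characterization for a
# single inspection study. The values shown are invented examples.
characterization = {
    "work": {
        "environment": {"phase": "requirements", "normal_notation": "natural language"},
        "application": {"domain": "reservation system for taxis, soft real-time"},
    },
    "resources": {
        "people": {
            "native_language": "Swedish",
            "experience_in_application": "2-5 years",
            "experience_in_environment": "2-5 years",
        },
        "process": {
            "inspection_type": "Fagan",
            "roles": ["moderator", "author", "reviewer"],
            "defect_detection_technique": "checklist",
            "meeting": True,
            "tool_support": False,
            "protocol": "written minutes",
            "rework_procedure": "author fixes, moderator verifies",
        },
    },
    "study": {
        "specifics": {
            "artifact_type": "requirements specification",
            "artifact_notation": "natural language",
            "translated": False,
            "known_defects": 30,
            "experience_in_study_application": "low",
            "distance_from_normal_artifacts": "moderate",
        },
    },
}
```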
2.3 Quantitative Measures

The measure of primary interest here is the effectiveness of an inspection team. The effectiveness E_T of a team T is, in this study, calculated as:

E_T = D_T / N

where D_T is the number of unique defects found by team T and N is the total number of defects in the artifact. In the long run it would be very important to also address cost-effectiveness. However, given the availability of data, and given that effectiveness is a starting point also for judging cost-effectiveness, the focus in this paper is on effectiveness. The effectiveness of an inspection team denotes what proportion of the existing defects the team found. The efficiency of an
inspection team can be defined in several ways [Briand98] but must include the amount of effort spent by the team. To obtain comparable measures regarding, for example, the effectiveness of software inspections, it is necessary to list both the defects that were and were not discovered, as this is needed in order to determine the true effectiveness. The most common way of doing this is to conduct a controlled experiment, where the number of defects is known (either through seeding or through previously identified real defects) in a document such as an information or software artifact. The document may be from either a generic or a company-specific domain. The advantage of having a document from a generic domain is that it makes comparison easier. The disadvantage is that the document may not reflect the usual nature of such documents in a specific organization. Company-specific documents may on the other hand make comparison more difficult across different environments. The documents with seeded defects may be from any application domain. In the case of a standardized (or generic) artifact (for example in lab packages), it is preferable to find an area which is familiar to people in general. However, it is also preferable that few developers have actually developed systems in the chosen application domain, to minimize the risk of the results being affected by knowledge of that specific domain. Examples of specific domains include an elevator system or a reservation system for taxis. Most people have an intuitive feeling for how these types of systems should work, although most developers have not developed systems in these application domains. Systems in, for example, the telecommunication domain are probably not suited, since some of the software is hard to understand unless you have worked in the area. Subjects who have worked with the chosen type of system have major advantages in domain experience over those who have not. This makes comparison of subjects' inspection results difficult. Another aspect of the artifacts is the phase they represent. It is important to consider different development phases when studying software inspections. One of the main strengths of inspections in general is the possibility of applying software inspections to any type of artifact, but for comparative purposes it is important to document exactly what was inspected. As a first step, inspections of requirements specifications and code could be studied, since several experiments have been conducted which review these types of documents (see Table 1), and hence baseline data already exists. The approaches used in requirements inspections may be
extended to other artifacts in the future. The requirements review is especially useful when the specification is written in natural language and hence is readable by most developers, i.e. they need not have any knowledge of any specific high-level language. Code is also readable for developers, even if they are not experts in that specific programming language. However, the use of more common programming languages, such as Java, C or C++, is preferred as more developers are familiar with these languages.
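To make the effectiveness measure from Section 2.3 concrete, the sketch below computes E_T for nominal teams assembled from individual reviewers' defect lists; averaging over all teams of a given size corresponds to the virtual-team analysis introduced later in Section 3.2. The reviewer names and defect identifiers are invented for illustration and are not data from the studies discussed in this paper.

```python
from itertools import combinations

# Invented example: which of the N known defects each reviewer found.
TOTAL_DEFECTS = 10  # N, the number of known defects in the artifact
reviewers = {
    "A": {1, 2, 3, 7},
    "B": {2, 4, 7, 9},
    "C": {1, 5, 6},
    "D": {3, 7, 8, 9},
}

def team_effectiveness(members, found_by, total_defects):
    """E_T = D_T / N: unique defects found by the team over total known defects."""
    unique_defects = set().union(*(found_by[m] for m in members))
    return len(unique_defects) / total_defects

# Mean effectiveness over all nominal teams of each size.
for size in range(1, len(reviewers) + 1):
    scores = [team_effectiveness(team, reviewers, TOTAL_DEFECTS)
              for team in combinations(reviewers, size)]
    print(f"team size {size}: mean effectiveness {sum(scores) / len(scores):.2f}")
```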
3. Data Sets

3.1 General Information

This study is based on publicly available data, and the main objective herein is to illustrate how software inspection data may be used to evaluate several important questions regarding software inspections. It must be noted that full characterizations of the different contexts for some of the individual data sets used here are not available and hence the derived empirical results are presented with a degree of caution. The objective is to describe how, if an appropriate characterization is conducted, this type of information can be used for comparison purposes and for drawing more general conclusions about software inspections. The data has primarily been collected as part of formal experiments [Wohlin00]. For the sake of this illustration, let us imagine that the data is collected from different companies. This is done for symbolic purposes to show the feasibility and opportunity of combining data from software inspections. The data used is a collection of 30 data sets from a number of controlled experiments performed by different researchers in the field. In some of the analyses, the data sets are analyzed as one entity and for other analyses the data sets are classified based on three attributes: a) the environment in which the study was conducted (Profession), b) the type of document, and c) the reading technique used (Inspection Technique). The data sets and their attributes are shown in Table 2, and the attributes are further explained below. Three of the data sets, i.e. no. 6, 11 and 12, have been divided into subsets to lessen the effect of any large single data set in the analysis, see Section 3.2. The data sets are used without having complete information about the full context of the environments, and it should once again be pointed out that the data sets are based on availability. This means that results should
be interpreted with some caution, and primarily viewed as an illustration of what can be achieved if one is able to combine information from different studies. It also illustrates some of the problems associated with combining data sets. Several factors that may influence the results are unknown for the data sets. For example, information is not available regarding time spent in inspections, motivation of the inspectors, severity of defects and several other factors describing the actual context of each study in Table 2. It was only possible to use three attributes to describe the context, as pointed out above, and shown in the table.

3.2 Virtual Inspection Teams

A typical inspection includes an inspection meeting. At this meeting, the reviewers gather the defects, and depending on the type of inspection, they either focus on identifying as many defects as possible, or on discussing the defects they have found during the preparation. The data given from the 30 experiments contain no meeting data; only individual data showing which of the defects a specific individual found or missed. It should also be noted that a number of studies during the last ten years show that, when performing inspections where fault discovery is focused on the preparation phase, few or virtually no faults are found during meetings. For instance, Votta found in his experiment that during the meeting on average only an additional 4% of faults were found [Votta93]. When including true faults that were reported by an individual but did not get logged at the meeting, Johnson et al. found no significant difference between having a meeting or not [Johnson98]. Porter et al. even found a negative meeting gain of on average around minus 1% [Porter95]. In the experiments generating the data for this study, the focus of reviewers was to find defects during the preparation, not in the inspection meeting itself. In order to study the effect of teams, the individual data are combined to form nominal teams of a certain size, and by calculating the team's effectiveness, a virtual inspection is created. This virtual inspection does not take meeting effects into account. To investigate the whole span of possible outcomes from the data sets, all possible combinations of groups are formed. One approach for combining the data from the different data sets is to:
1. Generate all combinations of nominal teams of all sizes, for all data sets
2. Calculate the effectiveness value for all the nominal teams
3. Generate graphs and tables sorted on the number of reviewers
However, since the data sets contain different numbers of reviewers, each data set's influence on the graphs would be dissimilar. With six reviewers, 20 teams of size three could be created, while for 22 reviewers this number would be 1540. This is partly solved by dividing the three largest data sets (data sets no. 6, 11 and 12) into groups of only seven or eight reviewers. This leaves 34 data sets with five to eight reviewers in each. The differences in influence are thereby reduced. It should be noted that in the long run, the aim should be to base the comparison only on real groups, to ensure that the conclusions are based on groups that are similar to the ones found in industry. When generating the nominal teams from the data sets, team sizes from 1 up to one less than the number of available reviewers are created. This means that, when investigating team sizes larger than four, some data sets have to be excluded. In Section 6.3, two graphs showing general behavior are presented, one with team sizes up to four and one with team sizes up to six. In these two graphs, the data sets numbered 0 and 15 respectively are excluded. To further decrease the difference in data set influence, the reviewers were selected randomly. For example, in the graph showing the general effectiveness behavior of teams of size 1 to 4 (Figure 3), all reviewers from data sets number 26-30 were included, while 5 reviewers were randomly selected in the other data sets. The disadvantage of virtual groups is that there is a high dependency between the groups. On the other hand, all data sets are treated the same, and since the main concern is comparison, this should not be critical to the outcome. A random selection of all combinations is used so that all data sets get a similar weight. Otherwise data sets with the most reviewers would dominate over the others.

3.3 Dependency Concerns

Since each reviewer was included in many of the teams, there exists an obvious dependency between the data points. To evaluate some of this dependency, a simulation of virtual teams versus an approach that randomly creates teams without redraw and an approach having only real teams has been conducted. The simulation approach seems better than the random-no-redraw approach. Compared to real-teams-only, the virtual team approach generates results with the same mean value but reports less variance in the results. This should be remembered when looking at the
graphs. However, the approach of using virtual teams shows the full scope of what the effect could be of having these people as reviewers in a company and picking some of them to be included in each inspection team. There are also some dependencies among the 30 data sets. A couple of the experiments are based on an experiment kit or lab package developed during Basili et al.'s PBR experiment [Basili96]. In these data sets, the inspected documents are the same or similar to one another. In other cases, the same person has participated in more than one of the experiments. However, there are no cases where the same person inspected the same document.

3.4 Classification of the Data Sets

The characterization of the data sets is shown in Table 2. The data is characterized based on type of subjects (NASA representatives, academics and professionals other than NASA), document type (requirements specification, artificial requirements specification,1 text and code) and reading technique (ad hoc, checklist, and active-based reading; an example of the latter is perspective-based reading [Basili96]). The data provides opportunities to make controlled comparisons to evaluate if, for example, inspection rates vary by profession, document type or reading technique. The first context attribute of the experiments is connected to the environment in which the experiments took place. Software engineering experiments having students as subjects are often criticised as not being representative of real-life software inspection teams. Hence studies conducted on academics are categorized as a separate group. Several of the studies have been conducted as part of the Software Engineering Laboratory work at NASA [Basili95]. This initiative has been running for more than 20 years and hence the people involved in the studies are likely to have been exposed to more empirical research than other people from industry. As a result, NASA is separated as one group. Finally, studies conducted in other industrial settings are viewed as a third group. This results in the following three groups that are related to the environment of the studies:
1. Mix of college students, faculty members and some professionals. (Acad)
2. Professional software engineers at NASA. (NASA)
1 The term artificial requirements specification is used when the specification is developed for the sake of the experiment.
Table 2. Characterization of the 30 data sets, giving for each data set the reference, the number of reviewers, the profession of the subjects (NASA, Prof. or Acad.), the document type (Req., Artif. Req., Code or Textual) and the reading technique (Ad Hoc, Chkl or ART). [The individual table entries could not be recovered from the source.]
3. Professional software engineers from outside NASA. (Prof.) Several different types of artifacts have been used in the studies. The following four types were identified: 1. Requirements specification (Req.): This includes studies where a requirements specification from a software development project is inspected. 2. Artificial requirements specification (Artif. Req.): In several studies, requirements specifications have been developed for the sake of the study. The objective is that these should resemble real requirements specifications. A potential problem with the artificial requirements specifications is that there is a lack of real context, although it resembles a real specification. 3. Code (Code): Several studies have used code in the inspections. 4. Plain text document written in English (Textual): One study used a textual document where the defects were grammatical defects rather than software defects. This study is included to observe whether the effectiveness is significantly different when reviewing with a different purpose compared to normal software development. Finally, three different types of reading techniques are identified: 1. Ad Hoc (AdH.): This simply means that the reviewers were neither taught nor instructed to use any special kind of formal inspection or reading technique. The reviewers all performed to the best of their ability. It should be noted that there is always a risk with using ad hoc as a control group, since most reviewers apply some method and hence it is difficult to understand what their actual behavior is in comparison with other methods. 2. Checklist-based (Chkl): When checklist-based inspections are performed, a checklist is introduced to the reviewers beforehand. This list is used to guide the reviewers regarding what kind of defects to look for. The reviewers read the document using the checklist to guide their review. 3. Active Reading Technique (ART): Most of these studies use a perspective-based reading (PBR) technique. However, since we would like to use virtual groups it is not possible to guarantee that all groups include all of the different perspectives, and hence we refer to this type of inspection as active-based reviews. Thus, the results should not be interpreted as representative of PBR. It has also been discussed elsewhere [Laitenberger01] that some of the benefits of PBR come from
team effect. In short, PBR instructs the reviewer to use an active form of review by assigning different perspectives to each reviewer. The common perspectives are user, tester, and designer. With the perspective follows a detailed description of how to perform the inspection. The instructions involve active steps such as 'Construct test cases for...' or 'Make a small design of...'. Some concerns regarding PBR and the analysis here are discussed in the following paragraph. The fact that PBR (perspective-based reading) assigns different perspectives to the reviewers, combined with the use of nominal groups, leads to problems when analyzing the PBR data. The use of different roles was proposed by Fagan [Fagan76], although the emphasis on active reading is more recent. In order to make the best use of PBR, the review teams should include at least one reviewer from each perspective. This greatly limits the number of PBR-compliant inspection teams that can be generated. Our virtual team generation approach allows for inspection teams without the optimal set of perspectives. This makes it impossible to evaluate the true potential of PBR, and therefore no conclusions concerning PBR can be drawn in this study. The PBR data are renamed to ART (Active Reading Technique).
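To make the nominal team construction described in Section 3.2 concrete, the following Python sketch enumerates every virtual team of a given size from one data set and scores each team's effectiveness as the share of the document's defects found by at least one member. The reviewer names, defect identifiers and total defect count are invented for illustration and do not correspond to any of the 30 data sets.

from itertools import combinations

# Hypothetical data set: defects (by id) found by each reviewer during
# individual preparation, and the total number of defects in the document.
reviewers = {
    "A": {1, 3, 5, 8},
    "B": {2, 3, 7},
    "C": {1, 4, 5, 9},
    "D": {3, 6, 8, 10},
}
total_defects = 10

def team_effectiveness(members):
    # Effectiveness of a nominal team: union of individually found defects.
    found = set().union(*(reviewers[m] for m in members))
    return len(found) / total_defects

# Enumerate every virtual team of size 2 (6 combinations for 4 reviewers).
for team in combinations(reviewers, 2):
    print(team, round(team_effectiveness(team), 2))

With 22 reviewers and a team size of three, the same enumeration yields the 1540 combinations mentioned in Section 3.2.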
4. Levels of Study Three levels of comparisons can be identified: organization, project and individual. At the organizational level, a potential interest could be benchmarking a particular inspection process against industry standards or other specific partners. Alternatively, an objective could be to select a reading technique or to better understand the effectiveness of inspections in different development phases. Organisational benchmarking is discussed in Section 5. At the project level, it is often important to learn more about the effectiveness of different team sizes. Typically, a project manager would also like to plan the inspections within projects so as to maximize effectiveness. In these cases, a manager would like to know how many reviewers to assign in different phases of the development in order to obtain a certain degree of inspection effectiveness. The effectiveness of different team sizes is studied in Section 6.
Finally, it is important to know more about individual performance. Of particular interest is gaining an understanding of the differences between individuals in order to be able to select a suitable inspection team. It is well known that there are individual differences, but it is important to ascertain how large they are. This is investigated in Section 7.
5. Organisation: Benchmarking This section investigates the inspection data from an organizational perspective. The intention is to examine what we can learn from the data with a view to benchmarking the software inspection process of an organization. 5.1 Benchmarking in General Benchmarking is a widely used business practice and has been accepted as a key component in an organization's search for improvement in quality, competitive position or market share. According to a survey in 1992, 31% of US companies were regularly benchmarking their products and services. Another survey in the UK (1996) revealed that 85% of businesses were using benchmarking practices [Ahmed98]. In Japan, benchmarking is called "dantotsu", which means "striving to be the best of the best" [Corbett98]. Here, we would like to define benchmarking of processes as an activity that allows people to strive to be the best of the best. Thus, both qualitative and quantitative comparisons with this objective are viewed here as being benchmarking. The literature describes several types of benchmarks [Sole95, Ahmed98, Longbottom00]. Sole and Bist point out that the level of benchmarking sets the degree of the challenge, from a slight improvement in the development process to a radical change in the process [Sole95]. Benchmarking may be divided into different types, depending on with whom the comparison is made and what the objective of the comparison is. Some common types of benchmarking include:
• comparison within the same organizations (internal benchmarking),
• comparison with external organizations (external benchmarking),
• comparison with competitors (industry benchmarking),
• identification of best practices (generic benchmarking),
• comparison of discrete work processes and systems (process benchmarking),
• comparison of performance attributes, e.g. price, time to market (performance benchmarking), and
• addressing strategic issues (strategic benchmarking).
5.2 Benchmarking Goal The goal here is to be able to compare different software inspection processes. Given the characterization and standardized artifacts, it is possible to identify, for example, whether a specific inspection process is better or worse than another. Some key concerns regarding benchmarking are scalability, thresholds, simplicity and atypical situations. Scalability should not be a major problem, as long as the inspections scheduled on real projects are of limited size. Since normal recommendations on the length of the preparation phase and an inspection meeting are in the order of hours, it is feasible to benchmark a realistic approximation of the process. Scalability must also be addressed by using documents representative of what is normally seen at an organization, with respect to size and defect density. In this case, the objective is not to set quality thresholds on the documents, but rather to provide feedback on effectiveness and efficiency and expectations on these two factors in terms of group size. Simplicity is also very important because, in order to make a benchmark useful, it should be possible to replicate it without needing a number of experts present. For instance, the characterization scheme from Table 2 supports simplicity. Finally, it is often reported that software projects are so different from each other that it is not possible to compare them. It may be true that projects are very different, but it should still be possible to compare certain aspects of software projects, for example, software inspections. The differences and similarities should be captured by the characterization used and hence atypical inspections should be accounted for in subsequent analysis. Atypical inspections may be important to learn from, but they should not be part of the normal benchmarking data, since, by definition, it is not anticipated that their individual situations will recur.
5.3 Benchmarking in Software Development Benchmarking provides many opportunities for comparisons in software development; for example, compilers may be compared by compiling the same program on several different compilers and by logging compilation time and errors. Benchmarking in software development is perceived as an assessment method, which is concerned with the collection of quantitative data on topics such as effectiveness, schedules and costs [Jones95, Beitz00]. It allows the comparison between an organizational process and industry best practice. It also helps managers to determine whether significant improvements are required to maintain a particular business [Beitz00]. Here, the term benchmarking is used for both qualitative and quantitative comparisons as long as the main objective of benchmarking is fulfilled. Thus, a characterization of several processes in qualitative terms would qualify as benchmarking if the objective is to improve these processes. Informally, the following definition of benchmarking is used in this paper: process benchmarking is the comparison of similar processes in different contexts; it implies multiple points of comparison (e.g. two data points do not constitute a benchmark), and it requires a representative sample in terms of, for example, organizations and applications. Several assessment tools for software benchmarking have been developed. Maxwell and Forselius report on the development of an experience database which consists of 206 business software projects from 26 companies in Finland [Maxwell00]. This database allows managers to compare their projects with the existing projects from the database. 5.4 Benchmarking in Software Inspection To our knowledge, limited work has been done in the area of benchmarking software inspection. One of the few examples is described by Jones [Jones95] who argues that function points provide useful metrics on two components of software quality: (a) potential defects, which is the total number of defects found in a work product, and (b) defect removal effectiveness level, which is the percentage of software defects removed prior to delivery. Jones reports that in the US the average for potential defects is about five per function point, and overall defect removal effectiveness is about 82%. According to one recent study, code inspection reduces life cycle defect detection costs by 39%, and design inspection reduces life cycle defect detection costs by 44% [Briand98].
5.5 Research Questions Using the data described in Section 3, it should be possible to answer the following benchmark questions: 1. Are there any differences in terms of effectiveness between requirements specification inspections and code inspections? Assuming that the primary interest is to benchmark an organization, the artificial requirements specifications may be treated together with the requirements specifications. However, the textual documents cannot be included when answering this benchmark question. 2. Are there any differences in terms of effectiveness between the different reading techniques? This question is important to the organization as an answer to it allows for the selection of a suitable reading technique. 3. Hypothetically, we could also remove data sets 19 and 29 from the database, and assume that the organizations represented by these data sets would like to compare their inspections with the ones remaining in the database. Thus, it is possible to use the data from the other organizations and compare it with these two fictitious companies for requirements and code inspections, respectively. These types of questions and studies can be conducted as more and more data becomes available. The intention here is to illustrate how a benchmarking database for software inspections can be created. To answer the above questions, the data will be presented in box plots and discussed qualitatively. The reason is that there is dependence between the data points and hence the data do not fulfil the requirements for using statistical tests.
Effectiveness in requirements specification and code inspections In the left box plot group in Figure 1, the effectiveness in software inspections is shown for different types of documents. These box plots are shown for 1 to 4 reviewers with requirements specifications inspections to the left and code inspections to the right for the different cases. The focus is on 1 to 4 reviewers since there are very few combinations for higher numbers of reviewers, and the results would depend too much on single data points rather than representing a more general outcome. From the box plots, it seems obvious that the differences in terms of effectiveness between requirements specifications inspections and code inspections are minor. From a benchmarking perspective, this means that we can expect the effectiveness for different types of inspections (or at least for requirements specification inspections and code inspections) to be approximately the same. Given that the faults in code inspections ought to be easier to find, we could have expected a higher effectiveness in code inspections. This was, however, not the case, which is an interesting observation. This information is valuable when planning different types of inspections. In particular, it implies, for example, that any experience regarding effectiveness for code inspections could probably be transferred to inspections of requirements specifications due to the effectiveness in the different types of inspections (in our case requirements and code) being fairly similar. Effectiveness in inspections using different reading techniques In the right box plot group in Figure 1, the box plots for the effectiveness for different reading techniques are shown. Once again, the plots provide information for 1 to 4 reviewers with the plots in the following order: ad hoc, checklists and active reading technique. From the box plots, it seems as though checklists may be more effective than the other two techniques. This may result from us looking at individual inspection preparations only and not the potential team effect from having different perspectives. From a benchmarking perspective, this tells us that if we have good checklists they ought to outperform ad hoc inspections and active reading techniques on an individual preparation level. However, our study has not taken any team effects similar to the ones introduced by different perspectives in PBR into account. The benefits of a
Fig. 1. Box Plots showing effectiveness in inspections for different types of documents (left) and for different reading techniques (right). The documents are from the left: requirements specification and code. The reading techniques are from the left ad hoc, checklists and active reading technique.
team approach, in particular when using active-reading techniques such as PBR, are further discussed by Laitenberger et al. [Laitenberger01]. The challenge is to develop inspection techniques that are strong both in the sense that the individual finds many faults during the preparation and in the sense that, from a team perspective, the individuals complement each other well. New company scenario Here it is assumed that data sets 19 and 29 are not part of the experience base, and the box plots in Figure 2 are created without these two data sets. In particular, it is assumed that the data sets represent two companies which we will compare with a subset of the companies in the experience base selected on the basis of their similar characterizations. Thus, the box plots may be seen as the benchmarking experience base that new companies may use for planning and comparing their own software inspection process.
In box plots, a line in the box indicates the median value. Moreover, each box extends from the 25th percentile, the lower quartile (LQ), to the 75th percentile, the upper quartile (UQ), of the estimates. The whiskers (lines extending from the boxes) show the limit for non-outlier values. Outlier values have the following characteristics: Outlier > UQ + 1.5(UQ - LQ) or Outlier < LQ - 1.5(UQ - LQ). Box plot outliers are marked with plus signs.
The circles in the box plots represent the two new data sets (or companies from a scenario perspective), with data set 19 in the left diagram and 29 in the right. Company No. 19 could use the left diagram in Figure 2 to ascertain what to anticipate when performing inspections. It is possible to see the median value for requirements specification inspections as well as the different quartiles and whiskers. Thus, the company has a good picture of the industry standard for effectiveness of requirements specification inspections. After having conducted controlled inspections at the company, it is possible to plot how this company is performing. This is illustrated with the circles in the box plot. It can be seen that company No. 19 in this case performed below the median for all group sizes. How close the circle is to the median depends on the individuals that are added as the group size increases. It may be particularly interesting to actually study the performance of individuals and hence use this information to put together inspection teams. This issue is discussed in Section 7. To the right in Figure 2, a similar diagram is shown for code inspections. In this case, it is worth noting that company No. 29 is performing better than the industry standard.
Fig. 2. New company scenario for requirements specifications inspections (left) and code inspections (right).
In summary, the illustration above shows examples of questions that can be addressed by using benchmarking in software inspections. The actual figures are not the main issue. The illustration is based on the assumption that the data sets are indeed comparable, i.e. the characterizations of the data sets are similar. These figures indicate the level of effectiveness that we can anticipate for different reading techniques, different types of documents and different group sizes.
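Since the comparisons in this section are presented as box plots, the following sketch shows how plots of this kind could be produced from effectiveness values grouped by document type and team size. The values below are placeholders rather than the data behind Figure 1 or Figure 2.

import matplotlib.pyplot as plt

# Placeholder effectiveness values per virtual team, grouped by document
# type and team size (not the values used in the chapter's figures).
req_by_size = {1: [0.20, 0.30, 0.25, 0.35], 2: [0.40, 0.50, 0.45, 0.55]}
code_by_size = {1: [0.22, 0.28, 0.30, 0.26], 2: [0.42, 0.48, 0.50, 0.46]}

data, labels = [], []
for size in sorted(req_by_size):
    data.append(req_by_size[size])
    labels.append(f"Req {size}")
    data.append(code_by_size[size])
    labels.append(f"Code {size}")

plt.boxplot(data)                            # one box per group
plt.xticks(range(1, len(data) + 1), labels)  # label each box
plt.ylabel("Effectiveness")
plt.show()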
6. Project: Team Size In Section 4, three different levels of study were introduced. In the previous section, the opportunity of benchmarking software inspection processes was investigated. The objective in this section is to evaluate the use of the data from a project perspective. The intention is to show how a project manager may use the data to help in planning and managing software inspections. 6.1 Teamwork in Small Groups versus Large Groups Teamwork is essential for software developers. There are some major assumptions behind teamwork. Firstly, the software products in today's market are too complex to be designed by individuals. The complexity of the product has become the driving force behind the creation of teams. It has been found that the quality of the product improves as it is inspected from multiple viewpoints [Laitenberger97]. Secondly, there is a belief that people are more committed to their work if they have a voice in the design of the product or the work in general [Smart98]. According to researchers, effective teams work best when the nature of their work requires a high level of interdependence [Smart98]. Valacich and Dennis emphasize that while many factors affect group performance, group size is a key ingredient since it places a limit on the knowledge available to the meeting group [Valacich94]. Several researchers in management science have focused on determining the optimal group size for teamwork where the members of the team are assigned to a set of tasks, e.g. new product development or strategic decision-making. The findings from these studies are contradictory. Some researchers indicate a dozen is an optimal number and recommend an odd number to assure a majority in the case of conflicting ideas [Osborn57]. Some earlier studies report five people as an optimum number for a typical
group size [Hackman70, Chidambaram93, Tan94]. In [Jessup90], some experiments conducted using electronic supporting systems suggested a small number of about four persons per group as an ideal size. On the other hand, Nagasundaram and Dennis remark that it is better to have small groups in face-to-face meetings, and large groups for electronic meetings [Nagasundaram93]. In summary, team size is a key issue when trying to optimize available resources. Given the need to use software development resources effectively, there is also a need to understand how to choose an appropriate team size for software inspections. 6.2 Team Size in Software Inspection Software inspection involves teamwork, where a group of individuals work together to analyze the product in order to identify and remove defects. Inspection teams are formed from small groups of developers. Fagan [Fagan76] suggests that four people are a good inspection team size. This suggestion also ties in with Weller's industrial work, where it is reported that four-person teams are more effective and efficient than three-person teams [Weller93]. IEEE STD 1028-1997 suggests teams of 3 to 6 people [IEEE98]. Teams of 4 or 5 people are common in practice [Wheeler96]. Owens reports that although it is expensive to have more reviewers, it is more effective to have more points of view for requirements inspection (e.g. 5-6 inspectors) than for design, and more inspectors are needed for design than coding (e.g. 1-2 inspectors for coding) [Owens97]. This idea of having two people in a coding inspection is supported by other researchers [Bisant89]. Bisant and Lyle's findings illustrate that programming speed increases significantly in two-person inspections. Thus, it is clear from the literature that there is no real consensus on the number of reviewers to use, and hence this illustrates the need to combine data from different studies to obtain a more general understanding. Given the different arguments and results when identifying a good size for teams in both management and software inspections, an analysis is required to investigate the relationship between performance and team size. In particular, team size refers to the combination of individuals and not real teams. The focus here is directed at performance comparisons for various team sizes, which provides valuable information to people planning inspections. In the work presented here, we have used virtual teams, see Section 3.2. The team size in virtual inspections has also been studied by others [Biffl01].
6.3 Effectiveness in General The first two box plots, Figure 3 and Figure 4, show the effectiveness of teams from the analyzed data sets where no attribute filtering has been done. In Figure 3, all data sets are represented; however, the data sets with six reviewers or more had five reviewers randomly chosen to generate the virtual teams. The median value is 0.26 for a team size of one and 0.64 for a team size of four. In Figure 4, 17 of the data sets have been removed since they only had five or six reviewers and do not provide any data for a team size of six reviewers. The median value for 6 reviewers is 0.71. As expected, the largest gain from adding a reviewer is achieved when using two reviewers instead of one. In Figure 3 and Figure 4, it can be seen how the inspection effectiveness increases as the team size increases, and how the added value for an additional reviewer decreases as the team size grows. Moreover, it is possible to see the variation between different combinations of reviewers. For example, the figures show that for a team size of four reviewers, the median effectiveness is 0.64, and that in 75% of the cases the effectiveness is above 0.5. On the other hand, it is also possible to note that in some cases the effectiveness is only around 0.2. This type of information is important for anyone planning, controlling and managing software inspections.
Fig. 3. Team effectiveness for all data sets.
Fig. 4. Team effectiveness; data sets with fewer than 7 reviewers are not included.
6.4 Effectiveness Based on Filtered Data The next set of box plots, Figure 5 to Figure 7, shows the effectiveness of different team sizes when data sets have been filtered based on the three attributes discussed in Section 3.4: Environment, Document type and Inspection type. The legend is shown in Table 3.
Figure 5    Figure 6      Figure 7
-           Textual       -
NASA        Arti. Req.    AdH
Prof.       Req.          CheckL
Acad.       Code          ART

Table 3. Legend to Figs. 5-7.

From a visual inspection of the figures, some interesting observations can be made. In Figure 6, the large dispersion of the requirements
specifications specifications inspections is striking. The good performance of the checklist-based inspections in Figure 7 is also noteworthy. To further investigate the differences, the following research questions have been studied qualitatively: 1. Are there any differences in terms of the mean effectiveness for different (a) environments, (b) types of documents and (c) reading techniques? 2. Are there any differences in terms of variance in effectiveness for different (a) environments, (b) types of documents and (c) reading techniques?
Fig. 5. Environment.
Fig. 6. Document type.
Fig. 7. Reading technique.
The results were as follows: 1. Mean effectiveness. (a) Environment: It seems that the academic environment better supports inspections than the other two environments. This is particularly visible for 3 and 4 reviewers. It also seems that the professionals at NASA perform better than professionals in general. It is interesting to note that the order (descending) of the
groups in terms of mean effectiveness is: Academia, NASA and other professionals. One possible reason for this result is that experiments at universities often are based on an isolated artifact, which has no system context. Another factor that may contribute is that people in industry may be less motivated when performing an experiment on artifacts which are not part of their daily work. If they use an artifact from their daily work, then that artifact most likely has a complex system context which may make it harder to inspect than a stand-alone artifact. (b) Document type: There are no visible differences in mean effectiveness for different types of documents. It is interesting to note that independently of the type of artifact, the mean effectiveness is about the same, including reviews aimed at finding grammatical defects in a text document. The main observation, from Figure 6, is the large variations, which are discussed below. Future work includes investigating the combined effect of, for example, a document type and a specific reading technique. (c) Reading technique: Checklist-based reading seems to be better, at least when only looking at the combination of individual data. However, this may be a result of not having team meetings, and hence the full potential of active-reading techniques is not explored. The effect on effectiveness of having meetings or not is discussed further in Section 3.2. This result may be due to the fact that several of the defects are fairly trivial (although this cannot be fully ascertained since the severity of defects is not reported in most studies) and these are easily spotted using checklists. As pointed out above, ad hoc reading poses a problem in itself. Further studies are needed to explore whether, for example, certain individuals perform better using one technique rather than another. 2. Variations of effectiveness. (a) Environment: NASA seems to have a higher variation than the two other groups. This may be due to the fact that the studies at NASA include a mixture of real software artifacts and artificially created artifacts to a larger extent than the other groups. (b) Document type: Several interesting results were observed. Both requirements specifications and code have a higher variance than the textual documents and the artificial requirements specifications. This indicates that the real artifacts are more challenging and result
in greater dispersion between different teams. In addition, it is also worth mentioning that the requirements specifications seem to have a higher variance than the code. This implies that requirements specifications are harder to review than code, since the variation between different teams is high. (c) Reading technique: Checklist-based reading seems to have a lower variation than ad hoc and active reading techniques. Once again, this outcome could be explained by the fact that several of the defects were relatively easy to uncover. Furthermore, the checklists support finding such defects. 6.5 Planning Table The box plots provide an insight into how the effectiveness of inspection teams varies, depending on the number of reviewers in each team. To support the decision-making process regarding which team size to choose in a specific situation, information is extracted and presented in Table 4. This table shows the percentage of all the virtual inspections that had a certain minimum level of effectiveness. For example, if we would like to be at least 75% certain of finding a minimum of 50% of the defects, then 4 reviewers are needed (see the grey cell in Table 4). A table of this type could be used as a rule of thumb in different ways, depending on whether we would like to determine the team size, or if we would like to know the expected effectiveness resulting from a specified team size. 6.6 Discussion of Managing Inspection Effectiveness When planning an inspection, several issues have to be considered. For example: How many reviewers should be included? Who should participate? What roles should they have? When should the inspection meeting be held? And what inspection technique and reading technique should be used? It is common, when dealing with these more practical matters of the planning, to forget one very important question: What level of quality do we aim for in the inspected document? If we know the approximate answer to this question, we automatically have a general outline of how to answer many of the other questions. The level of quality aimed for in a specific document depends on many factors, which can be summarized by answering the one question: How important is the document? This is where the project manager or whoever plans the inspection should begin. By processing and analyzing the different
aspects of a document's importance, the manager gets an understanding of what level of quality is needed before releasing the document to the next development phase. Thus, he or she may also identify how much effort should be spent on that specific document. The number of defects within a document strongly affects the document's quality. In order to improve the quality, the number of defects must be reduced. As can be seen in Section 6.3, the number of reviewers involved greatly affects the proportion of defects detected by an inspection team. Therefore, this is one of the more important decisions to be made by the person planning an inspection. This study shows the opportunities in terms of providing a basis for supporting such decisions. The person planning the inspection can use Table 4 to get an indication of the level of risk that exists when choosing a certain number of reviewers. This risk can then be weighed against the importance of the document to obtain an estimate of the number of reviewers to use.
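As a sketch of how a table such as Table 4 could be derived from the virtual inspections, the fragment below counts, for each team size, the percentage of virtual teams whose effectiveness reaches a given minimum level. The effectiveness values are placeholders for illustration, not the data behind Table 4.

# virtual_effectiveness maps team size to a list of effectiveness values,
# one per generated virtual team (placeholder numbers for illustration).
virtual_effectiveness = {
    1: [0.20, 0.30, 0.40, 0.25],
    2: [0.45, 0.50, 0.60, 0.55],
    3: [0.60, 0.65, 0.70, 0.55],
    4: [0.70, 0.75, 0.80, 0.65],
}

levels = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

for level in levels:
    row = []
    for size in sorted(virtual_effectiveness):
        values = virtual_effectiveness[size]
        share = 100 * sum(e >= level for e in values) / len(values)
        row.append(f"{share:5.0f}")
    print(f"{level:.2f} " + " ".join(row))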
                                Team size
Effectiveness     1      2      3      4      5      6      7
0.10             90     99    100    100    100    100    100
0.20             68     93     99    100    100    100    100
0.30             36     80     93     96     97    100    100
0.40             17     59     80     88     92     94    100
0.50              9     36     62     78     86     92     95
0.60              6     20     41     58     66     77     75*
0.70              2      9     22     37     43     55     56
0.80              2      6     13     22     22     30     34
0.90              1      2      4      7     11     13      5
1.00              0      0      1      2      3      3      0

Table 4. Certainty of having a specific effectiveness for a specific team size.
* This is an example of a lower value although we have more reviewers; this is a statistical artifact based on the limited number of data sets to create the columns for six and seven reviewers.
Table 4 shows general results from a number of inspection experiments. By letting each company collect data to build their own tables, more relevant data can be attained. This would increase the accuracy of the data
by narrowing down the variety of variables, including document types, people involved and reading techniques used. Most of the measures needed to construct similar tables can be easily collected from inspections. The most difficult part would be getting an accurate figure for the total number of defects. Different approaches could be taken to acquire this measure. Two examples include using Capture-Recapture estimations [Wohlin95] or controlled experiments within the company, in which the defects are known.
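As an illustration of the capture-recapture idea, the sketch below uses the simplest two-reviewer (Lincoln-Petersen style) estimator, in which the total number of defects is estimated from how many defects each reviewer found and how many they found in common. This is a generic sketch of the principle, not the specific estimator used in [Wohlin95], and the defect sets are invented.

# Defects found by two reviewers (hypothetical defect ids).
found_a = {1, 2, 3, 5, 8, 9}
found_b = {2, 3, 4, 8, 10}
overlap = found_a & found_b

# Lincoln-Petersen style estimate of the total number of defects:
# N_hat = (n_a * n_b) / m, where m is the number found by both reviewers.
if overlap:
    n_hat = len(found_a) * len(found_b) / len(overlap)
    print(f"Estimated total defects: {n_hat:.1f}")
else:
    print("No overlap; the estimator is undefined for these data.")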
7. Individual: Reviewer Effectiveness This is the third level of investigation as pointed out in Section 4. Here we would like to study the data to better understand the effect of individual reviewers. This is important when planning inspections and also helps reviewers understand their own abilities. 7.1 Research Questions There are many aspects of individual contributions when considering inspection effectiveness. The analysis in this study, as stated previously, is based on nominal inspection teams, i.e. no meetings are held. One important role of the meeting is to identify false positives, i.e. issues incorrectly reported as defects. Information on false positives is typically unavailable from our data sets, particularly as they come from controlled experiments where the number of existing defects is known. Including meeting data would also lead to an increased scope of possible impacts an individual may have, including aspects such as how a single person affects the group psychology. The questions investigated and illustrated in this study are: 1. What impact does a single reviewer have on average in terms of inspection effectiveness? 2. What impact does the reviewer with the best individual effectiveness have in terms of the team's total inspection effectiveness? 3. What impact does the reviewer with the worst individual effectiveness have in terms of the team's total inspection effectiveness? 4. Does the combination of the two best individuals in a team always find the most unique defects?
The aim of the first question is to provide a general view of what can be expected when adding or removing a person from a team. Question two and three investigate how the best or worst individual effectiveness affects the team's effectiveness. The fourth question investigates one aspect of a very interesting area: What makes a good team? It is not only individual contributions that are important when being part of a team. 7.2 Theoretical Analysis To gain some insight into what could be expected when investigating the effect of a single reviewer, the process of inspections can be modelled statistically and simulated by Monte Carlo simulations. These two approaches are shown below. The simplest approach is to assume that all defects are equally difficult to find and all reviewers have equal ability. If the probability for a defect to be found by a reviewer is p, then the probability that at least someone in a team of size R finds a specific defect is: PT = 1 - ( 1 - P ) S
If a document contains N defects then the expected value of the number of defects found in the inspection, E(D), is:

E(D) = \sum_{D=0}^{N} \binom{N}{D} P_T^D (1 - P_T)^{N-D} \, D = P_T \cdot N
The average difference in effectiveness a single reviewer makes on a team of R reviewers is then:

\frac{N(1-(1-p)^R)}{N} - \frac{N(1-(1-p)^{R-1})}{N} = (1-p)^{R-1} - (1-p)^R = p(1-p)^{R-1}
Table 5 shows the expected differences for different values of p. For example, when finding defects with a probability of 0.4, the average difference in effectiveness when removing one reviewer from a three-reviewer team is 0.14.
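Under this equal-probability model, each cell of Table 5 is simply p(1 - p)^(R-1). The few lines below reproduce the table from the formula and make it easy to check individual entries, such as the 0.14 quoted above (p = 0.4, R = 3); the code is only an illustration of the formula.

# Expected effectiveness lost when removing one reviewer from a team of
# size R, assuming every reviewer finds each defect with probability p.
def single_reviewer_effect(p, R):
    return p * (1 - p) ** (R - 1)

for p in [x / 10 for x in range(1, 10)]:
    cells = [f"{single_reviewer_effect(p, R):.2f}" for R in range(2, 7)]
    print(f"p={p:.1f}: " + " ".join(cells))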
                 Group Size
  p        2       3       4       5       6
 0.1     0.09    0.08    0.07    0.07    0.05
 0.2     0.16    0.13    0.10    0.08    0.07
 0.3     0.21    0.15    0.10    0.07    0.05
 0.4     0.24    0.14    0.09    0.05    0.03
 0.5     0.25    0.13    0.06    0.03    0.02
 0.6     0.24    0.10    0.04    0.02    0.01
 0.7     0.21    0.06    0.02    0.01    0.00
 0.8     0.16    0.03    0.01    0.00    0.00
 0.9     0.09    0.00    0.00    0.00    0.00
Table 5. Expected differences when removing one reviewer from a team.

The previous model's assumptions are very restrictive. The assumptions can be relaxed by allowing the probability to vary among different reviewers. Then each reviewer i has the probability p_i of finding a defect. Thus:
P_T = 1 - \prod_{i=1}^{R} (1 - p_i)

E(D) = N \cdot \left( 1 - \prod_{i=1}^{R} (1 - p_i) \right)
The different probabilities (p_i, i = 1...R) then have to be estimated or represented in some way. This is done as in [Boodoo00], i.e. using a Beta distribution to represent the p_i's and using the available data sets to estimate the parameters of the Beta distribution. A simulation of the difference a single reviewer makes was run with team sizes of 2 to 6 people and the p_i's taken from the Beta distribution. 30 simulated experiments were run for each team size. The approach with virtual team combinations is used in the simulation to facilitate comparison. The result is shown in Figure 8. For example, the median of the lost effectiveness when removing one reviewer from a three-reviewer team is close to 0.13.
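A minimal Monte Carlo sketch of this relaxed model is given below: individual detection probabilities are drawn from a Beta distribution, a team of size R is simulated against N defects, and the loss in effectiveness from removing one reviewer is recorded. The Beta parameters, the number of defects and the removal of a randomly chosen reviewer are simplifying assumptions for illustration; they are not the fitted parameters or the exact procedure behind Figure 8.

import random

random.seed(1)

ALPHA, BETA = 2.0, 4.0   # placeholder Beta parameters (not the fitted ones)
N_DEFECTS = 30           # placeholder document size
RUNS = 10000

def simulate_loss(team_size):
    # Effectiveness lost when one randomly chosen reviewer is removed.
    probs = [random.betavariate(ALPHA, BETA) for _ in range(team_size)]
    found_by = [
        {d for d in range(N_DEFECTS) if random.random() < p} for p in probs
    ]
    full = set().union(*found_by)
    removed = random.randrange(team_size)
    reduced = set().union(*(s for i, s in enumerate(found_by) if i != removed))
    return (len(full) - len(reduced)) / N_DEFECTS

for R in range(2, 7):
    losses = sorted(simulate_loss(R) for _ in range(RUNS))
    print(f"team size {R}: median loss {losses[RUNS // 2]:.3f}")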
Fig. 8. Monte Carlo simulated effect of removing one reviewer. Some caution must be taken when drawing conclusions from this Monte Carlo simulation. The Beta distribution does not handle reviewers with an effectiveness equal to zero. There are 8 cases out of 255 in the data sets where a reviewer did not find any defects. In order to estimate the parameters of the Beta distribution, these cases were removed. 7.3 Analysis Procedure The following measures are used to investigate the four questions. 1. What impact does a single reviewer have on average in terms of the inspection effectiveness? Measure I: Eff(Full Team) - Eff(Team with one person removed). All possible combinations are considered. A virtual team with reviewers A, B and C generates 3 combinations: Eff(A, B, C) minus either Eff(A, B), Eff(A, C) or Eff(B, C). 2. What impact does the reviewer with the best individual effectiveness have in terms of the team's total inspection effectiveness? Measure IIa: Eff(Full Team) - Eff(Team with the reviewer with the best individual effectiveness removed)
Example: an inspection with reviewers A, B and C, who have individual effectiveness 0.43, 0.27 and 0.32, respectively. Measure IIa = Eff(A, B, C) - Eff(B, C). 3. What impact does the reviewer with the worst individual effectiveness have in terms of the team's total inspection effectiveness? Measure IIb: Eff(Full Team) - Eff(Team with the reviewer with the worst individual effectiveness removed). To investigate the difference between the best and the worst, another measure is added. Measure IIc: (Measure IIa) - (Measure IIb). 4. Does the combination of the two best individuals in a team always find the most unique defects? Measure III: (Number of times a combination of two reviewers within a team is found to have better effectiveness than the two reviewers with the best individual effectiveness) / (Number of virtual inspections). The results of these measures are presented in Section 7.4. 7.4 Results 7.4.1 Measure I Measure I is calculated for all of the possible virtual teams and all possible removals of one reviewer. The mean and variance of Measure I are calculated and presented in Table 6. To investigate whether the profession, document type or inspection technique has any impact on the result, the data sets have been filtered based on these criteria. The average mean and variance, when all data sets are treated together, are shown at the bottom of the table. The table shows that the difference a single person makes on the effectiveness varies on average from about 0.16 in the two-reviewer case to 0.04 in the six-reviewer case. The largest effect is seen when checklists are compared to Ad Hoc and the Active Reading Technique. There is a sharp drop in the checklist case between five and six reviewers. This is probably because there is only one data set using checklists for six reviewers (data set number 5), see Table 2. The largest variance is found in the data sets from inspecting requirements specifications.
                Team size 2     Team size 3     Team size 4     Team size 5     Team size 6
                Mean    Var     Mean    Var     Mean    Var     Mean    Var     Mean    Var
NASA            0.168   0.020   0.107   0.010   0.074   0.005   0.055   0.004   0.042   0.002
Prof.           0.144   0.010   0.103   0.007   0.076   0.005   0.057   0.004   0.045   0.003
Acad.           0.173   0.010   0.112   0.006   0.075   0.004   0.053   0.002   0.038   0.002
Text            0.172   0.011   0.109   0.007   0.073   0.004   0.052   0.003   0.038   0.002
Req.            0.170   0.032   0.098   0.014   0.063   0.007   0.043   0.004   0.029   0.002
Artif. Req.     0.163   0.008   0.110   0.005   0.077   0.004   0.056   0.003   0.042   0.002
Code            0.159   0.012   0.110   0.008   0.079   0.005   0.058   0.004   0.045   0.003
Ad Hoc          0.161   0.016   0.104   0.008   0.074   0.005   0.055   0.003   0.043   0.002
Chkl            0.218   0.015   0.151   0.009   0.105   0.006   0.070   0.004   0.029   0.001
ART             0.159   0.013   0.106   0.007   0.074   0.004   0.054   0.003   0.041   0.002
All             0.165   0.014   0.108   0.008   0.075   0.005   0.055   0.003   0.042   0.002
Table 6. Single reviewer impact on inspection effectiveness.
To further illustrate the outcome, a box plot for four reviewers is shown in Figure 9.
Fig. 9. One reviewer effect. From left to right (NASA, Prof., Acad.; Text, Req., Artif. Req., Code; Ad Hoc, Chkl, ART and All).
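The measures defined in Section 7.3 can be sketched for a single virtual team as follows, reusing the idea of defect-set unions. The reviewer data and total defect count below are hypothetical, introduced only to show how Measures I, IIa and IIb could be computed.

# Defects found by each member of one virtual team (hypothetical ids).
team = {
    "A": {1, 3, 5, 8, 11, 13},   # individually most effective
    "B": {2, 3, 7},              # individually least effective
    "C": {1, 4, 5, 9},
}
total_defects = 15

def eff(found_sets):
    return len(set().union(*found_sets)) / total_defects

full = eff(team.values())

# Measure I: average effect of removing one (any) reviewer.
losses = [full - eff([s for r, s in team.items() if r != removed])
          for removed in team]
measure_i = sum(losses) / len(losses)

# Measures IIa / IIb: effect of removing the individually best / worst reviewer.
best = max(team, key=lambda r: len(team[r]))
worst = min(team, key=lambda r: len(team[r]))
measure_iia = full - eff([s for r, s in team.items() if r != best])
measure_iib = full - eff([s for r, s in team.items() if r != worst])

print(round(measure_i, 3), round(measure_iia, 3), round(measure_iib, 3))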
7.4.2 Measure IIa, b and c The aim of Measure II is to capture the effect extreme reviewers, i.e. the best and the worst, have on a team's effectiveness. Table 7 shows the mean and variance over all data sets. On average, the effect of the best reviewer varies from 0.24 for two reviewers down to 0.10 for six reviewers. The effect of the reviewer with the worst personal effectiveness drops from 0.10 down to only 0.01.
                  Team size 2     Team size 3     Team size 4     Team size 5     Team size 6
                  Mean    Var     Mean    Var     Mean    Var     Mean    Var     Mean    Var
IIa, best         0.237   0.014   0.186   0.007   0.147   0.005   0.118   0.003   0.096   0.002
IIb, worst        0.092   0.004   0.047   0.002   0.027   0.001   0.016   0.001   0.010   0.000
IIc, difference   0.144   0.017   0.139   0.010   0.121   0.006   0.103   0.004   0.086   0.003
Table 7. Mean and variance of Measure II taken over all data sets for different team sizes.

Although the detailed statistics are not presented here, when Table 7's data is filtered on the different context attributes, the results are similar to Measure I (see Table 6). Checklists show slightly larger mean values up to five reviewers, while requirements specifications show the largest variance.
Fig. 10. Effect of removing best or worst reviewer from the team. From left to right for each team size: Best, Worst, Difference.
Figure 10 shows a box plot for Measure IIa-c for all data sets with different team sizes. A feature not captured in the table is the occurrence of negative differences in the effect between the best and the worst reviewers. This represents cases where the reviewer with the worst effectiveness has found more unique defects than the best reviewer (note that this condition cannot occur in the two-reviewer case). 7.5 Measure III Measure III investigates whether teams with the best individual reviewers necessarily form the best team. Measure III is presented in Table 8. It shows how often, within a team, the best combination of two reviewers is not the two reviewers with the best individual effectiveness. For example, in 62 percent of the four-reviewer teams it is possible to find a better combination of two reviewers than the effectiveness of the individuals involved would suggest. The two-reviewer case is not investigated, since there is only one combination, and it is automatically the best.
                      Team Size
                 3      4      5      6
Measure III     53%    62%    75%    74%
Table 8. Percentages of how often a better combination of two reviewers than the two with the best individual effectiveness could be found within a team.
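The check behind Measure III can be sketched as follows: for a virtual team, compare the defect coverage of the pair with the highest individual effectiveness against every other pair in the team. The team and defect identifiers are hypothetical, chosen so that a weaker reviewer contributes more unique defects than a strong but overlapping one.

from itertools import combinations

# Hypothetical team: individually strong reviewers may overlap heavily,
# while a weaker reviewer may contribute more unique defects.
team = {
    "A": {1, 2, 3, 4, 5},     # best individually
    "B": {1, 2, 3, 4, 6},     # second best individually
    "C": {7, 8, 9},           # weakest individually
}

def pair_coverage(pair):
    return len(team[pair[0]] | team[pair[1]])

ranked = sorted(team, key=lambda r: len(team[r]), reverse=True)
top_two = tuple(ranked[:2])

best_pair = max(combinations(team, 2), key=pair_coverage)
print("top two individuals:", top_two, "coverage:", pair_coverage(top_two))
print("best pair:          ", best_pair, "coverage:", pair_coverage(best_pair))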
8. Conclusions Combining data or results from different studies of software inspections provides many interesting opportunities for comparison and evaluation. This includes benchmarking at the organizational level, inspection planning at the project level and an improved understanding of individual performance. 8.1 Organizational Benchmarking The results in Section 5 show variations in the mean effectiveness across some of the studied attributes. This includes differences between environments, and differences showing that checklist-based reading outperformed ad hoc and active reading techniques, at least when only
comparing the individual results and nominal teams and ignoring the team effects for real teams. The results are certainly interesting, but further studies are required to better understand these issues. The variations observed are very interesting. They show that the variation in inspection effectiveness when using real software documents, including both requirements specifications and code, is greater than when using other types of documents. It is also noteworthy that the variation in effectiveness for requirements specifications is higher than for code. This illustrates some of the problems when performing inspections and shows the potential of performing a study of this type. Such results, if confirmed by later studies, would form an important input for anyone who plans, controls or manages software inspections. Benchmarking opens a number of interesting opportunities. The need for empirical studies and experimentation in software engineering is well known. Software inspection is an area well suited to experimentation, both in industry and in universities. By agreeing on a number of standardized artifacts, or the use of company-specific artifacts, and a characterization schema, it should be possible to run a large number of experiments worldwide and hence to rapidly develop greater understanding in this area. Software inspection benchmarking is interesting for universities as well as industry. Industry may perform benchmarking as discussed above. Universities may perform experiments allowing for more data to be collected regarding inspections, which would enable researchers to understand different aspects of software inspections more fully. This becomes particularly valuable if the universities use the same documents as are used for benchmarking in industry. For example, universities should be able to run experiments with students in courses where different reading techniques can be compared.
comes down to the question stated in Section 6.6, How important is the document? When this is known, the effectiveness of the inspection team is the next thing to be considered. The main objectives of this study include exploring how inspection teams behave, depending on team size, and identifying how inspection planning can be better supported. This was done by combining data from different controlled inspection experiments. Another aspect that was illustrated was the effect on the inspection effectiveness when altering some of the inspection attributes, such as the environment, document type and reading technique. The question of How many reviewers should be included in the review? is an important consideration when planning an inspection. This study has provided an initial answer, in terms of a table from which it is possible to estimate the number of individuals needed in a team to reach a certain effectiveness (in preparation) or vice versa (to estimate the effectiveness for a given team size). The table is based on 30 published data sets, and it provides a starting point. However, to get a better picture of the effectiveness of inspection, it is necessary for organizations to build their own experience bases. 8.3 Individual Performance in Inspections The study in Section 7 used data sets from several controlled inspection experiments to illustrate the impact an individual reviewer has on an inspection team's effectiveness. The results provide some rules of thumb to consider when planning inspections. The averages presented show a rather small individual reviewer contribution to the inspection effectiveness. However, good individual performance is still important and, in reality, may have a larger effect than the averages suggest. For example, having good reviewers can decrease the needed size of the inspection team and thereby reduce the effort cost of the inspection. Therefore, it is important for companies to analyze their own inspection process to guide their decisions and to perhaps pinpoint reviewers with inspection 'talents'. Their ability can be utilized on important projects and their knowledge may be captured and taught to others. When addressing individual performances, four questions were posed. The aim of the first question was to study how much impact, in general, an individual reviewer has on the effectiveness of an inspection team. The average effect is of course dependent on the team's size and the mean was
found to be 0.16 in the two-reviewer case down to 0.04 for six reviewers. Compared to the theoretical models in Section 7.2, the investigated data sets behaved similarly to the second model, although the theoretical model had larger median values, especially in the two-reviewer case. The second and third questions focused on the degree to which the best and worst reviewers affect the team's performance. The difference between their impact on the team's effectiveness (Measure IIc) is on average 0.14 in the two-reviewer case, down to about 0.09 with teams of size six. This can be seen as an approximation of the risk that exists when choosing members of an inspection team. These values show a limited risk but there is still an impact. For example, in a project with 10 design documents each having about 30 major defects, picking the worst reviewer instead of the best reviewer for a team of four (difference of 0.11) would on average lead to 33 defects not being found because of the choice of reviewer. There are of course, as shown in Figure 10, cases where the difference is much larger, but in general most cases show a difference below 0.3. In the example above, the most extreme outlier for four reviewers would lead to about 140 of the total 300 defects being missed because of the staffing. The answer to question four indicates that the individual effectiveness of a reviewer captures only a single dimension of the inspection task. Selecting the people with the best individual effectiveness provides no guarantee of finding the most unique defects. As soon as a team is created, the individual effectiveness is not that important. Individual expertise increases the chances of the team finding many defects, but in order to make the team effective, the reviewers' areas of focus should complement each other. This is an important issue to remember when choosing an inspection team. It was interesting to discover that the inspections that used checklists generally had a larger difference between the best and the worst reviewers. The expected impact of using a checklist would be to increase the number of defects a reviewer finds, but also to make the reviewers more homogeneous in what defects they find. However, the data show, on average, both a larger single-reviewer impact and larger differences between the best and the worst reviewers. Independently of which method is used, education of the reviewers would be beneficial. Sauer et al. argue that the dominant factor of a team's effectiveness is the amount of expertise within the group [Sauer00]. Consequently, it is important to train reviewers to increase their knowledge
and reviewing ability, and to create tools and methods that utilize the expertise within the team. 8.4 Conclusions from Combining Data An important issue to consider is how useful the combination of data from different studies can be when the data is taken from inspections with such a wide range of conditions. This study and its results are primarily to be viewed as a feasibility study, although the findings themselves also provide some interesting results in terms of summarizing some of the studies that are available in the literature. It is clear that several factors that may influence the results are not documented, which of course is a threat to the accuracy of the analysis. On the other hand, it is important to start doing these types of analyses to build a body of knowledge, rather than only generating new relatively freestanding studies. It may also be argued that having a variety of conditions increases the generality, and this is the best way to start before collecting metrics from within a specific department or company. A study of this type may be criticized for its debatable validity, due to the lack of control and knowledge about the context of many of the studies. This also includes the general problems of performing different studies where either data or results are later combined. However, the solution is certainly not to avoid doing these types of studies. On the contrary, it is necessary to start undertaking such combined studies and attempt to meet the challenges of the resulting analyses in order to gain deeper understanding of this area. Despite the potential threats to the validity, the following key findings are of major interest:
• There are no visible differences in effectiveness for different document types.
• There are clearly differences in the variation of effectiveness for different document types. The variation is larger for real documents and is higher for requirements specifications than for code.
• It is possible to determine the effectiveness for different team sizes. This can be used for decision-making when planning and managing software inspections.
• There are differences between different types of subjects. However, the results indicate that people in academia are more effective. This is probably due to the fact that the documents used in academia are more stand-alone than documents investigated in an industrial
128
C. Wohlin, H. Petersson & A. Aurum
•
•
•
setting. Thus, the result may be due to a confounding factor. This is an area for further investigation. Checklists turned out to be more effective than other types of reading techniques. This may also be a result of a confounding factor, namely experience. It is believed that less experienced reviewers would benefit more from checklists than others. Due to the fact that the experience of the subjects is unknown, we are unable to evaluate this. This is also an area for further studies. It is clear from the data that individual differences exist and hence it is important to put together an inspection team cautiously so as to make the best possible use of the available resources. A combination of the individually best reviewers is not necessarily the most effective team. This means that there are more effective combinations of reviewers than simply putting together the individuals that perform the best in the individual preparation. This stresses the need to further develop approaches using the expertise of different reviewers in an effective way.
Finally, there is of course a great need for more studies in this field to generate more individual studies that, in turn, can be utilized in meta-analysis, pooling of data or analysis of a series of experiments. In this way, a body of knowledge can be built that increases our general understanding of how to conduct cost-effective software inspections. It should also be noted that the combining of data and results is in itself an important area of research in software engineering. In particular, we must increase our understanding of when it is reasonable to combine data or results. The main objective of this paper has been to look at what can be achieved with this type of approach.
Acknowledgment

The authors would like to thank Dr. Forrest Shull, Fraunhofer USA Center for Experimental Software Engineering, Maryland, USA, and Marcus Ciolkowski, University of Kaiserslautern, Germany, for many valuable discussions regarding benchmarking in software inspections. We would also like to thank Thomas Thelin and Dr. Per Runeson at Lund University, Sweden, for many valuable discussions on software inspection research, and Peter Parkin and Irem Sevinc from the University of New South Wales, Australia.
References

[Ahmed98] Ahmed, P. K. and Rafiq, M. (1998): "Integrated Benchmarking: A Holistic Examination of Selected Techniques for Benchmarking Analysis". Benchmarking for Quality & Technology, 5(3), pp. 225-242.
[Basili95] Basili, V. R., Zelkowitz, M., McGarry, F., Page, J., Waligora, S. and Pajerski, R. (1995): "SEL's Software Process Improvement Program". IEEE Software, Vol. 12, No. 6, pp. 83-87.
[Basili96] Basili, V. R., Green, S., Laitenberger, O., Lanubile, F., Shull, F., Sørumgård, S. and Zelkowitz, M. V. (1996): "The Empirical Investigation of Perspective-Based Reading". Empirical Software Engineering: An International Journal, 1(2), pp. 133-164.
[Beitz00] Beitz, A. and Wieczorek, I. (2000): "Applying Benchmarking to Learn from Best Practices". Proceedings 2nd International Conference on Product Focused Software Process Improvement, Oulu, Finland.
[Biffl01] Biffl, S. and Gutjahr, W. (2001): "Analyzing the Influence of Team Size and Defect Detection Technique on the Inspection Effectiveness of a Nominal Team". Proceedings International Software Metrics Symposium, pp. 63-73, London, UK.
[Bisant89] Bisant, D. B. and Lyle, J. R. (1989): "Two-Person Inspection Method to Improve Programming Productivity". IEEE Transactions on Software Engineering, 15(10), pp. 1294-1304.
[Boodoo00] Boodoo, S., El Emam, K., Laitenberger, O. and Madhavji, N. (2000): "The Optimal Team Size for UML Design Inspections". National Research Council Canada, ERB-1081, NRC 44149.
[Briand98] Briand, L., El Emam, K., Laitenberger, O. and Fussbroich, T. (1998): "Using Simulation to Build Inspection Efficiency Benchmarks for Development Process". Proc. of the IEEE International Conference on Software Engineering, pp. 340-349.
[Chidambaram93] Chidambaram, L. and Bostrom, R. P. (1993): "Evolution of Group Performance Over Time". Journal of Management Information Systems, 7, pp. 7-25.
[Corbett98] Corbett, L. M. (1998): "Benchmarking Manufacturing Performance in Australia and New Zealand". Benchmarking for Quality Management & Technology, 5(4), pp. 271-282.
[Ebenau94] Ebenau, R. G. and Strauss, S. H. (1994): "Software Inspection Process". McGraw-Hill (System Design and Implementation Series). ISBN 0-07-062166-7.
[Fagan76] Fagan, M. E. (1976): "Design and Code Inspections to Reduce Errors in Program Development". IBM Systems Journal, 15(3), pp. 182-211.
[Freimut97] Freimut, B. (1997): "Capture-Recapture Models to Estimate Software Fault Content". Diploma Thesis, University of Kaiserslautern, Germany.
[Gilb93] Gilb, T. and Graham, D. (1993): "Software Inspection". Addison-Wesley Publishing Company. ISBN 0-201-63181-4.
[Hackman70] Hackman, J. R. and Vidmar, N. (1970): "Effects of Size and Task Type on Group Performance and Member Reactions". Sociometry, 33, pp. 37-54.
[Hayes99] Hayes, W. (1999): "Research Synthesis in Software Engineering: A Case for Meta-Analysis". Proc. of the IEEE International Software Metrics Symposium, pp. 143-151.
[IEEE98] IEEE (1998): "IEEE Standard for Software Reviews". The Institute of Electrical and Electronics Engineers, Inc. ISBN 1-55937-987-1.
[Jessup90] Jessup, L. M., Connolly, T. and Galegher, J. (1990): "The Effects of Anonymity on GDSS Process with an Idea Generation Task". Management Information Systems Quarterly, 14(3), pp. 313-412.
[Johnson98] Johnson, P. and Tjahjono, D. (1998): "Does Every Inspection Really Need a Meeting?". Empirical Software Engineering: An International Journal, Vol. 3, No. 1, pp. 9-35.
[Jones95] Jones, C. (1995): "Software Challenges". IEEE Computer, 28(10), pp. 102-103.
[Knight93] Knight, J. C. and Myers, E. A. (1993): "An Improved Inspection Technique". Communications of the ACM, Vol. 36, No. 11, pp. 51-61.
[Laitenberger97] Laitenberger, O. and DeBaud, J. (1997): "Perspective-Based Reading of Code Documents at Robert Bosch GmbH". Information and Software Technology, 39(11), pp. 781-791.
[Laitenberger01] Laitenberger, O., El Emam, K. and Harbich, T. G. (2001): "An Internally Replicated Quasi-Experimental Comparison of Checklist and Perspective-Based Reading of Code Documents". IEEE Transactions on Software Engineering, Vol. 27, No. 5, pp. 387-421.
[Longbottom00] Longbottom, D. (2000): "Benchmarking in the UK: An Empirical Study of Practitioners and Academics". Benchmarking: An International Journal, 7(2), pp. 98-117.
[Martin92] Martin, J. and Tsai, W. T. (1992): "N-Fold Inspection: A Requirements Analysis Technique". Communications of the ACM, 33(2), pp. 225-232, February.
[Maxwell00] Maxwell, K. D. and Forselius, P. (2000): "Benchmarking Software Development Productivity". IEEE Software, January/February 2000, pp. 80-88.
[Miller99] Miller, J. (1999): "Can Results from Software Engineering Experiments be Safely Combined?". Proc. of the IEEE International Software Metrics Symposium, pp. 152-158.
[Nagasundaram93] Nagasundaram, M. and Dennis, A. R. (1993): "When a Group is not a Group: The Cognitive Foundation of Group Idea Generation". Small Group Research, 24(4), pp. 463-489.
[Osborn57] Osborn, A. F. (1957): "Applied Imagination: Principles and Procedures of Creative Thinking". Charles Scribner's Sons, New York.
[Owens97] Owens, K. (1997): "Software Detailed Technical Reviews: Finding and Using Defects". Wescon'97 Conference Proceedings, pp. 128-133.
[Parnas85] Parnas, D. L. and Weiss, D. M. (1985): "Active Design Reviews: Principles and Practices". Proc. of the IEEE International Conference on Software Engineering, pp. 132-136.
[Pickard98] Pickard, L. M., Kitchenham, B. A. and Jones, P. W. (1998): "Combining Empirical Results in Software Engineering". Information and Software Technology, Vol. 40, pp. 811-821.
[Porter95] Porter, A. A., Votta, L. and Basili, V. R. (1995): "Comparing Detection Methods for Software Requirements Inspection: A Replicated Experiment". IEEE Transactions on Software Engineering, Vol. 21, No. 6, pp. 563-575.
[Porter97] Porter, A. A., Siy, H. P., Toman, C. A. and Votta, L. G. (1997): "An Experiment to Assess the Cost-Benefits of Code Inspections in Large Scale Software Development". IEEE Transactions on Software Engineering, Vol. 23, No. 6.
[Regnell00] Regnell, B., Runeson, P. and Thelin, T. (2000): "Are the Perspectives Really Different? - Further Experimentation on Scenario-Based Reading of Requirements". Empirical Software Engineering: An International Journal, Vol. 5, No. 4, pp. 331-356.
[Robson93] Robson, C. (1993): "Real World Research". Blackwell Publishers, UK.
[Runeson98] Runeson, P. and Wohlin, C. (1998): "An Experimental Evaluation of an Experience-Based Capture-Recapture Method in Software Code Inspections". Empirical Software Engineering: An International Journal, Vol. 3, No. 4, pp. 381-406.
[Sauer00] Sauer, C., Jeffery, D. R., Land, L. and Yetton, P. (2000): "The Effectiveness of Software Development Technical Reviews: A Behaviourally Motivated Program of Research". IEEE Transactions on Software Engineering, Vol. 26, No. 1, pp. 1-14.
[Smart98] Smart, K. L. and Thompson, M. (1998): "Changing the Way We Work: Fundamentals of Effective Teams". Proceedings of the IEEE International Communication Conference, Vol. 2, pp. 383-390.
[Sole95] Sole, T. D. and Bist, G. (1995): "Benchmarking in Technical Information". IEEE Transactions on Professional Communication, 38(2), pp. 77-82.
[Tan94] Tan, B. C. Y., Raman, K. S. and Wei, K. (1994): "An Empirical Study of the Task Dimension of Group Support System". IEEE Transactions on Systems, Man and Cybernetics, 24, pp. 1054-1060.
[Valacich94] Valacich, J. B. and Dennis, A. R. (1994): "A Mathematical Model of Performance of Computer-Mediated Groups During Idea Generation". Journal of Management Information Systems, 11(1), pp. 59-72.
[Votta93] Votta, L. G. Jr. (1993): "Does Every Inspection Need a Meeting?". Proc. of the 1st ACM SIGSOFT Symposium on Foundations of Software Engineering, ACM Press, New York, NY, pp. 107-114.
[Weller93] Weller, E. F. (1993): "Lessons from Three Years of Inspection Data". IEEE Software, 10(5), pp. 38-45.
[Wheeler96] Wheeler, D. A., Brykczynski, B. and Meeson, R. N. Jr. (1996): "Software Inspection: An Industry Best Practice". IEEE Computer Society Press, USA. ISBN 0-8186-7340-0.
[Wohlin95] Wohlin, C., Runeson, P. and Brantestam, J. (1995): "An Experimental Evaluation of Capture-Recapture in Software Inspections". Journal of Software Testing, Verification and Reliability, Vol. 5, No. 4, pp. 213-232.
[Wohlin00] Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B. and Wesslén, A. (2000): "Experimentation in Software Engineering - An Introduction". Kluwer Academic Publishers, Boston, USA.
CHAPTER 4
External Experiments - A Workable Paradigm for Collaboration Between Industry and Academia

Frank Houdek
DaimlerChrysler AG Research and Technology
P.O. Box 23 60
89013 Ulm, Germany
frank.houdek@daimlerchrysler.com
Results of empirical investigations are key input for industrial software process improvement activities. The relevance of an investigation for a given environment, however, depends on the similarities or dissimilarities of the investigation environment compared to the industrial environment which is interested in using a new technology. The optimal (or most relevant) results may be gained when performing the experiment in the industrial environment itself. Unfortunately, this takes place only rarely, as experiments in industrial environments are either expensive (when new and old technology are used in parallel) or risky (when only the new technology is used). To overcome this dilemma, we introduced the concept of external experiments, i.e. experiments which are conducted in an environment different from the industrial target environment and which try to simulate the characteristics of the industrial environment as closely as possible. This paper describes this concept in detail, presents a process for identifying, conducting and exploiting such experiments, and shows some results from past investigations which used this concept. Keywords: Empirical software engineering; external experiments; experience factory; software process improvement; industry impact; ICE3.
1. Introduction

Experimentation is essential in any engineering discipline. Experiments are useful for gaining information about influences on the various factors which determine the effect of technologies and methods. This trend can also be observed for software engineering. In recent years, a reasonable amount of attention has been paid to adapting the general experimentation methodology to the domain of software engineering (see, for instance, Basili et al. 1986, Pfleeger 1995, Wohlin et al. 1999, or Juristo and Moreno 2001). As a result, there now exist frameworks for planning, executing, and analyzing empirical investigations.

The results of empirical investigations are key input for many kinds of software process improvement activities. Knowledge about the behavior of techniques, methods, and processes helps to anticipate their impact on software process and product quality. Thus the main consumers of empirical results are software development organizations in general or software process engineering groups in particular. (Of course, there is also a general interest in empirical results from a scientific point of view. As these needs are not as focused, we do not consider them here.)

Despite the need for empirical results, it is hard for many software development organizations to acquire such results in productive environments. For instance, benchmarking different software engineering processes is rarely feasible for industrial development units. In principle, there are two major alternatives for such investigations:
1. The new technology is applied in a parallel project. If the new technology fails (or shows more negative results), the outcome of the control group, i.e. the conventionally developed product, can be used.
2. The new technology is applied without a backup project. If the new technology fails, project success may be endangered.
In general, both alternatives are unacceptable from an economic perspective: the first is too expensive, the second too risk prone. As a consequence, the application of the new technology has to be postponed until a significant amount of (external) knowledge about the impact of the new technology in similar environments is acquired.

A promising solution to this problem is to utilize producers of empirical results that are separated from the productive environments (= consumers). These are called external experimentation environments. Thus, we can conduct experiments in these external environments without
endangering the success of ongoing internal projects. If the results are promising, we then transfer the new technology into the internal processes. By using external environments, we postpone the application of new technologies by only a single development cycle.

Today, many producers of empirical results do not have a dedicated audience, so investigations are chosen for experiment-methodological reasons, i.e. experiments which can be conducted with a high degree of local control and within a limited time-frame. (This is one reason why such a great number of investigations on software inspections are available.)

However, there is one general shortcoming with external environments: we have to think about the transferability of the externally gained results to the target (= consumer) environment. This means that we need to consider the distance between the target environment e_target and the experimentation environment e_exp. Distance values may vary between 0 (e_target and e_exp behave identically) and 1 (e_target and e_exp have nothing in common). Ideally, we would like to have an experimentation environment e_exp for a given target environment e_target with distance(e_exp, e_target) -> 0, i.e. the experimentation environment has the same characteristics as the target environment. Results obtained in the experimentation environment can then be transferred to the target environment (almost) without further considerations.

A fairly good example of such a laboratory environment was the Software Engineering Laboratory (SEL) at the NASA Goddard Space Flight Center (Basili et al. 1992, McGarry et al. 1994, p. 13). SEL developed software for ground support systems in standard projects with approximately 300 engineers. In addition, they were able to experiment with different techniques, even going so far as to replicate some activities (Basili and Caldiera 1995, p. 62). The results of the experiments were extended to other software engineering groups both inside and outside NASA.

Unfortunately, most organizations cannot afford to run an experimentation environment similar to NASA's SEL. Other environments have to replace their functions. Here, environments outside industry, i.e. in the education domain such as universities in particular, can be used as a feasible substitute. Reviewing the latest journals and conference proceedings shows that a number of experiments are performed in these environments
(see the Empirical Software Engineering journal, Kluwer). In such environments, it is especially easy to examine technologies in parallel and perform replicated studies. However, transfer of the results obtained in these environments is seldom seen (Houdek et al. 1997) for a number of reasons:
• Distance. The characteristics of laboratory environments at universities and industrial target environments differ with respect to various aspects (e.g. the kind of systems developed, the degree of experience and education of the engineers, and the engineering processes used).
• Ownership. The users of the empirical results, i.e. the target environment, were not involved in the investigation process. They feel uncertain about the quality of the results (a variation of the not-invented-here syndrome). Publishing all related materials such as forms, measurement procedures, and even data helps to minimize this effect but cannot eliminate it.

As a consequence, it is necessary to bring the requirements of industry and the laboratory environments closer together. As argued before, this cannot happen in general but must be done in individual cases, i.e. for one industry partner and one laboratory partner.

In this chapter, we present an approach to identify, conduct and exploit external experiments (ICE3). Figure 1 provides a rough sketch of the idea behind the ICE3 approach. Note that the vertical position of the boxes indicates the assignment of activities: upper boxes are assigned to the target environment e_target and lower boxes to the laboratory environment e_exp.
[Figure 1 sketches the concept: the target (consumer) environment (1) identifies questions for empirical investigation and (2) evaluates whether an external experiment is feasible; the experimentation (producer) environment (3) conducts the empirical investigation; the results are then (4) evaluated with respect to the target environment and (5) exploited in the target environment.]
Fig. 1. Concept for using external environments for experimentation.
A process as defined by the ICE3 approach helps to combine the key needs of the software development organization with the ability of external experimentation environments to conduct well-defined studies. On the other hand, individual empirical investigations become more relevant for industry, as there is at least one environment (the environment posing the question) which is greatly interested in the results. Without such an approach, only occasionally would an experiment conducted in a laboratory environment e_exp meet the constraints and context criteria given by a target environment e_target.

Structure of the Chapter

The remainder of the chapter is structured as follows. First, we present a short overview of the ICE3 approach, followed by an in-depth look at the various steps of the concept. These are augmented by a running example. The subsequent section presents three applications of the ICE3 approach at DaimlerChrysler to demonstrate its applicability and to touch on its possibilities and limitations. Then we discuss some related work and end this chapter with conclusions and closing remarks.
2. Overview of the ICE3 Approach

At this point, it is crucial to realize that the ICE3 approach is a manual approach. The various steps should provide a guideline for defining meaningful investigations from the perspective of a target environment. While this is helpful in order to take almost all the relevant aspects into account, completeness or accuracy cannot be achieved. One reason lies in the mechanisms for determining the distance between the target environment and the experimentation environment. To determine this distance correctly, we would require a complete set of all the relevant factors influencing the transferability of results from one environment to another for a particular technology. It is obvious that such a set can never be defined, as the transferability of technologies does not just depend on technical and organizational aspects but also on individual, human factors.

Figure 2 offers a graphical overview of the proposed ICE3 approach. The rectangles depict objects and the arrows (A) to (E) denote activities. The starting point for the ICE3 approach is the experiences and needs in the target environment. Driven by these needs, investigation candidates are
identified (A). Then, these investigation candidates are evaluated together with any restrictions given by the intended external experimentation environment (B). If an external investigation is feasible and the proposed accuracy of the results meets the expectations of the target environment, the investigation can be conducted (C). After the experiment has been carried out, the concrete observations and results are evaluated with respect to the proposed accuracy and transferability (D). If the results seem to be relevant enough, they are exploited in the target environment (E). Depending on the actual results achieved, exploitation could mean changing the existing processes, investing in further studies, or even refraining from implementing the new technology. In any case, as a basic benefit, the experience of the target environment increases.
[Figure 2: experiences and needs in the target environment -> (A) Identification -> investigation candidates (unevaluated); together with constraints of the experimentation environment -> (B) Evaluation of the investigation candidates -> candidates for internal and external investigations -> (C) Decision and empirical investigation -> empirical results of the external investigation -> (D) Evaluation of the results gained -> statements about the relevance of the results -> (E) Exploitation of the results.]
Fig. 2. ICE3 approach for identifying, conducting and exploiting external experiments. The approach consists of five main steps (A) to (E) which consume and produce information (indicated by the boxes).
The various steps of this approach form a cycle whose starting and ending points lie in the company's experience base. This experience base may take the form of a collection of individual experiences or even a formal repository. The experience base represents the interface between internal processes and external knowledge. The internal experiences are the driver for the internal software engineering processes. Figure 3 depicts this knowledge interchange.

Thus, the ICE3 approach can be seen as a natural extension of the experience factory approach proposed by Basili et al. (1994, see also Basili and Caldiera 1995, Houdek et al. 1998, Landes et al. 1998). In the experience factory paradigm there exists an organizational unit, the experience factory, which provides starting and ongoing software development projects with quantitative and qualitative knowledge concerning their current tasks. Additionally, the experience factory also uses the ongoing projects as a means for acquiring new or refined knowledge. The ICE3 approach adds external investigations here to bridge the gap between purely external knowledge (e.g. covered by textbooks) and knowledge which can be captured in internal software projects.

Again, it is important to note that experiments are used here in pull mode, i.e. they are driven by the target's (= consumer's) needs. Here we see a significant difference from the current situation for many empirical studies.
[Figure 3: software development projects and (external) investigations raise issues and deliver observations to the experience base; the experience base matures and feeds techniques, methods and results back to the projects and investigations; external knowledge (e.g. books) also flows into the experience base.]
Fig. 3. Exchange of experience and knowledge via the experience base.
3. An In-Depth Look at ICE3

In this section, we describe the ICE3 approach in detail. A running example helps to make the approach more intelligible. The first subsection provides
the context for the example. The subsequent subsections deal with the steps of the ICE3 approach.

Introduction of the Running Example

Our example company, XYZ Consult, is engaged in developing customized software for external clients. In their engineering process they have incorporated code inspections in which the inspection meeting is optional. Figure 4 illustrates the process employed. Rectangles depict activities, circles depict artifacts, rhombs depict resources, and the diamond symbolizes a decision point. The current decision rule for the inspection meeting is as follows: if 15 or more defects are reported on all preparation forms, then perform an inspection meeting.
[Figure 4 depicts the process: the C code goes into individual preparation by the inspectors, producing preparation forms; depending on the decision rule, an inspection meeting with all participants is held, producing meeting notes; the author then performs the rework, producing the reworked C code.]
Fig. 4. Inspection process at XYZ Consult.

The employees feel uncomfortable using this rule. They observed situations where they identified 20 issues during the preparation phase and the meeting initiated as a result turned out to be superfluous. They observed other situations where they identified only 10 issues during the preparation phase, but the code was found to be error-prone. What was especially disturbing was that many of the defects could have been identified in the meeting. A number of alternative decision rules came to mind (sketched in code after this list):
• Initiate an inspection meeting if 10 or more critical issues are found during the preparation phase.
• Initiate an inspection meeting if the estimator for remaining defects indicates more than twice as many defects as those actually documented on the preparation forms.
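The current rule and the two alternatives can be written as simple predicates. The sketch below (Python) is purely illustrative; the function and parameter names are ours, and reading "15 or more defects on all preparation forms" as the total across all forms is an assumption.

    # Current rule: meet if 15 or more defects are reported on the preparation
    # forms (interpreted here as the total across all forms).
    def meet_current(defects_per_form):
        return sum(defects_per_form) >= 15

    # Alternative 1: meet if 10 or more critical issues are found during preparation.
    def meet_if_critical(critical_defects_per_form):
        return sum(critical_defects_per_form) >= 10

    # Alternative 2: meet if the estimator for remaining defects indicates more
    # than twice as many defects as those actually documented on the forms.
    def meet_if_estimator_high(defects_per_form, estimated_remaining_defects):
        return estimated_remaining_defects > 2 * sum(defects_per_form)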
Unfortunately, as the current XYZ Consult software engineering process is part of their legal contracts, XYZ Consult is not able to play with these alternatives in ongoing projects.

Identifying Investigation Candidates

Candidates for empirical investigations arise from the current situation and the observed information needs of the target environment. The concrete source may be no more than an engineer's gut feeling about the current software engineering process (or parts of it), or may, for example, stem from ideas for new techniques and methods stimulated by conferences or experience reports, customer demands for new or better software engineering technologies, or recommendations resulting from an external assessment, to mention only a few. It is essential to understand that investigation candidates are driven by the experiences and needs of the target organization.

An investigation candidate may be described in a form similar to that depicted in Figure 5. Typically, an investigation candidate cannot be filled in completely during this first phase. Information is completed and reworked during the subsequent evaluation phase (see the section 'Evaluating Investigation Candidates'). Figure 6 presents an example investigation candidate for the situation described in the previous subsection. The process of identifying investigation candidates is typically aligned with process improvement activities.
Problem: Question to be answered from an organization's point of view.
Context: Factors which characterize the current situation and the key influencing factors. These factors will be used to determine whether an external environment is sufficiently similar to the target environment.
Expected benefit: What are the goals associated with the outcome of an empirical investigation? Which quality attributes of the current process or product are to be affected?
Investigation type: Short description of the empirical study together with its major design constraints and measures.
Result type: Requirements concerning the output of the empirical investigation. These requirements may relate to confidence restrictions (e.g. statistical significance) or the type of result (e.g. management presentation).

Fig. 5. Structure of an investigation candidate.
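In a lightweight tool, the form of Fig. 5 could be captured as a simple record. The sketch below (Python) is one possible representation; the field names are our paraphrase of the form's sections, not part of the ICE3 approach itself.

    from dataclasses import dataclass, field

    @dataclass
    class InvestigationCandidate:
        # One investigation candidate, mirroring the sections of Fig. 5.
        problem: str                                  # question from the organization's point of view
        context: dict = field(default_factory=dict)   # factors characterizing the current situation
        expected_benefit: str = ""                    # goals associated with the investigation's outcome
        investigation_type: str = ""                  # study description, design constraints and measures
        result_type: str = ""                         # requirements on the output (e.g. statistical significance)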
Problem: When (i.e. under which circumstances) should I initiate an inspection meeting?
Context:
  Company: XYZ Consult
  Document type: C code
  Characteristics: Well documented, few controlling algorithms
  Inspection process: Planning (1h), kick-off meeting (0.3h), preparation (individually, with checklists, 1h-3h), inspection meeting (optional, 1h), rework (1h-10h), rework check (1h)
  Participants: Three inspectors, moderator, author, scribe
  Qualifications: Electrical engineers, computer scientists
  Experience: None to 3 years of software development
Expected benefit: Improving the inspection process, i.e. fewer unnecessary inspection meetings and fewer undetected defects.
Investigation type: Formal experiment which compares the three inspection meeting decision rules 'Meeting if 15 or more defects are found during preparation' (currently used), 'Meeting if 10 or more critical defects are found during preparation', and 'Meeting if the remaining-defect estimator proposes more than twice as many defects as the ones found during preparation', with respect to (1) synergy effects of the meeting and (2) percentage of uncovered defects.
Result type: Quantitative data on synergy effects and uncovered defects for all three decision rules. Statement whether differences are significant at the level α = 10%, tested with a one-factorial analysis of variance.
Fig. 6. Example of an investigation candidate.

If a company owns a software process engineering group (SEPG) or an experience factory (EF), it should be their responsibility to identify investigation candidates.

Evaluating Investigation Candidates

After investigation candidates have been identified, we have to evaluate them with respect to two facets: utility and feasibility. Utility means that the results of an investigation can have a positive impact on the current situation in the target environment. Not every investigation is useful. Threats to utility could be as follows:
• The investigation deals with objects which are prescribed by external restrictions (e.g. a contractor demands a particular development process) and which are therefore fixed.
• The results of an investigation cannot be transferred to the current processes due to internal restrictions (e.g. the engineers are not willing or able to accept a new technology like formal methods).

Feasibility is a measure of how well the question can be answered at all.
All combinations of feasible/non-feasible and useful/non-useful evaluations are possible, as the following examples demonstrate:
• Useful and feasible: What does the distribution of effort across our software development phases look like? Answering this question is useful as it helps project managers to plan future projects. The relevant data can be collected by using a measurement routine.
• Non-useful but feasible: What impact does the programming language used have on module length? If we assume that the customer requires a particular programming language, then we cannot use any information about the impact of languages on module length. An investigation would produce a table which would become shelfware. Answering this question, however, is fairly simple. Repeated coding of the same functionality using different languages would provide the answer.
• Useful but non-feasible: What does a 95%-correct effort estimation model for our new software product line look like? An answer to this question could be extremely useful as it would help to plan new projects very precisely. Unfortunately, this question cannot be answered as we do not have any experience with the new product line. Therefore, a solid (external) investigation is basically not possible as the foundation is missing.
• Non-useful and non-feasible: Are maintenance efforts for C++ code lower than for C code? If we assume that our legacy systems are written in C, we would not benefit from any knowledge about maintenance efforts for C++ code. Conducting such a study is difficult, too, as the relevant objects (C++ code from the target environment) are not available.

Feasibility itself is composed of two sub-attributes, quality-of-result and operability. Quality-of-result expresses the relevance of an investigation's result for the target environment. It can be seen as a measure of the distance between the target environment and the experimentation environment. Operability deals with the question of whether the desired investigation can be conducted in the intended experimentation environment with the given
result requirements (as described in the 'result type' section of the investigation candidate). Figure 7 depicts the various facets of the evaluation. In the following sections we deal with them in further detail.
[Figure 7: the evaluation of an investigation candidate comprises utility (is the answer to the question of any use?) and feasibility (can the investigation candidate's question be answered?); feasibility in turn comprises quality-of-result (quality of possible results with respect to the target environment) and operability (can an investigation provide the required significant results?).]
Fig. 7. Evaluation of an investigation candidate.
Utility

Evaluation of utility can be done in various ways. In most environments there is a solid feeling as to implementable process changes and the shortcomings of the current processes (and therefore which changes might have a positive impact). In our running example, for instance, the engineers are willing to change the inspection decision rule and believe that a better rule would improve their process by reducing unnecessary meetings and uncovering more defects. Considerations about utility are linked with the section 'Expected benefit' in the investigation candidate and can be documented there. Our considerations may also affect the 'result type'. Sometimes, qualitative information is sufficient. In other cases, empirical results must show a minimal level of (statistical) quality in order to be useful as a trigger for process changes. Investigation candidates which have been found to be non-useful can be deleted.
Feasibility

Unlike utility, which is only related to the target environment, feasibility is also related to the intended experimentation environment. Here, all possible experimentation options come into play. For a given investigation candidate, it may be the case that the feasibility of an internal investigation is lower than that of an external literature survey. In general, we observe a trade-off between the two sub-attributes of feasibility, quality-of-result and operability. Whereas quality-of-result tends to be higher for internal investigations, operability is higher for external ones (as they do not require changes to current processes). Figure 8 illustrates this trade-off qualitatively. The investigation alternatives depicted are examples of possible investigation types. Zelkowitz and Wallace (1998) provide a more complete set of investigation types. In individual cases, however, there may be deviations from the scenario in Figure 8. The following sections describe the evaluation of quality-of-result, operability, and their combination into feasibility. In each case we consider only external investigation alternatives, since the systematic usage of external environments is the main intention of this chapter. However, a similar approach may be used to make a selection from internal investigation alternatives as well.
[Figure 8: investigation alternatives ordered from internal to external; quality-of-result tends to decrease while operability tends to increase as the investigation moves further away from the target environment.]
Fig. 8. Trade-off between quality-of-result and operability.
Quality-of-Result

The key input for assessing quality-of-result for a given investigation alternative is the investigation candidate's context description. Here, we try to determine the degree to which the various factors can be imitated in the experimentation environment, i.e. how much the experimentation environment behaves like the target environment (or can be stimulated to behave that way). For documentation purposes, we can use an evaluation sheet as depicted in Figure 9. We try to quantify each attribute's relative importance and imitability. In practical applications of the ICE3 approach, however, such detailed documentation will typically be omitted.

Attribute          | Attribute value                                                                                                                                      | R   | I
Document type      | C code                                                                                                                                               | 0.8 | 1
Characteristics    | Well documented, few controlling algorithms                                                                                                          | 0.6 | 0.7
Inspection process | Planning (1h), kick-off meeting (0.3h), preparation (individually, with checklists, 1h-3h), inspection meeting (optional, 1h), rework (1h-10h), rework check (1h) | 1   | 0.8
Participants       | Three inspectors, moderator, author, scribe                                                                                                          | 0.7 | 1
Education          | Electrical engineers, computer scientists                                                                                                            | 0.8 | 0.5
Experience         | None to 3 years software development                                                                                                                 | 0.8 | 0.2

Fig. 9. Evaluation sheet for our running example. Column R contains the relative influence of the individual attribute, ranging from 0 (no influence) to 1 (major influence). Column I presents the imitability of the attribute for a given investigation alternative, ranging from 0 (attribute cannot be imitated) to 1 (attribute can be imitated perfectly).

Operability

Operability of an investigation largely depends on the constraints implied by the experimentation environment, i.e. are the available resources in the experimentation environment sufficient to produce results which meet the quality criteria set? The most important resources are the time needed to conduct a study and the subjects participating. Both resources depend on the required investigation and result types. Results with high statistical quality (i.e. significance) tend to require more data points than simple trend observations.
However, detailed estimates of the minimal number of required data points are currently an open issue in software engineering experimentation (Miller et al. 1997, Miller 2000). The type of investigation chosen plays an especially important role here.

Combination of Operability and Quality-of-Result

In the previous sections we dealt with quality-of-result and operability from a qualitative perspective. In principle, these attributes can be measured more precisely, as set out in Houdek (1999). Our experiences using the ICE3 approach have shown that such precision is difficult to achieve. Thus we restrict ourselves to a simple ordinal scale with the values low, middle, and high for both attributes. After our considerations on quality-of-result and operability, we can now combine these two values by using a portfolio as depicted in Figure 10.

Conducting Experiments

If an investigation candidate has been found to be both useful and feasible with respect to a given experimentation environment, we can proceed with conducting the actual experiments. If a candidate is lacking in its feasibility values, we have to think about better experimentation alternatives. Either we find one, or, if utility is estimated fairly high, we consider internal experimentation options.
[Figure 10: a 3 x 3 portfolio with quality-of-result on one axis and operability on the other, each on the scale low / middle / high; every cell gives the combined value for feasibility (low, middle or high).]
Fig. 10. Combination of quality-of-result and operability into a single value for feasibility.
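The chapter deliberately keeps this combination qualitative (a more precise measurement is set out in Houdek 1999). Purely as an illustration, the sketch below (Python) aggregates an evaluation sheet like Fig. 9 into a quality-of-result value and combines it with operability in the spirit of Fig. 10; the weighted average, the thresholds and the min-style combination are our assumptions, not formulas from the chapter.

    # sheet: list of (relative_influence R, imitability I) pairs, both in [0, 1],
    # as in the evaluation sheet of Fig. 9.
    def quality_of_result(sheet):
        total_weight = sum(r for r, _ in sheet)
        score = sum(r * i for r, i in sheet) / total_weight      # weighted imitability
        return "high" if score >= 0.8 else "middle" if score >= 0.5 else "low"

    def feasibility(quality, operability):
        # Portfolio in the spirit of Fig. 10: here simply the weaker of the two values.
        order = ["low", "middle", "high"]
        return order[min(order.index(quality), order.index(operability))]

    sheet = [(0.8, 1.0), (0.6, 0.7), (1.0, 0.8), (0.7, 1.0), (0.8, 0.5), (0.8, 0.2)]  # values from Fig. 9
    q = quality_of_result(sheet)          # 'middle' under the thresholds assumed above
    print(q, feasibility(q, "high"))      # -> middle middle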
If external experimentation seems a sufficiently good choice, the conventional steps of experimentation start. We do not deal with these steps here, as there is a great deal of excellent literature available on the subject (see, for instance, Judd et al. 1991, Fenton and Pfleeger 1996, Basili et al. 1986, and Rombach et al. 1993). Our approach can be seen as a systematic concept for defining an investigation goal, which is the starting point of any sound experiment.

Ideally, people from the target environment accompany all the experiment phases such as planning, pre-study, execution, and analysis. It is essential that these people buy into the study, so that the study becomes their study. As mentioned earlier, ownership is an important issue in enhancing a study's credibility (Houdek et al. 1997), and credibility is crucial to decrease bias and motivate people to adopt new technologies.

The results of this phase are all the outputs of the experiment, i.e. materials, raw data, and analysis results. These may be available in the form of a report, a loose collection of sheets, or a lab package.

Evaluation of Results

An initial analysis from the experiment's point of view was part of the conduct phase. Now we examine the experiment's outcome from the perspective of the target environment. This can be done in two steps. First, we perform a 'formal' check. We check (1) the extent to which our previous assumptions about imitability were fulfilled and (2) whether our requirements with respect to 'result type' and quality are met. Then, we consider the actual results. Within the target environment we address the following questions:
• What are the consequences of the deviations found in the 'formal' check? Deviations may be negligible, but they may also make the outcome of the study worthless if the set result quality constraints have not been met or the differences between the target environment and the experimentation environment were greater than anticipated.
• Which conclusions can be drawn for the target environment?
• What do the empirical results mean with respect to the 'expected benefit' of the investigation candidate? Do the results support the original expectations?

As in every process improvement activity, it is vital to incorporate all the parties concerned in the discussion process.
Exploitation of Results

Exploitation means making the empirical results beneficial to the target environment. Depending on the actual results, follow-up activities may have to be defined. For example, these may be further evaluations, additional internal or external studies, a pilot project employing the new technology, modifications of legacy process manuals, or rejection of the new technology.

Note that this step is not trivial at all. Here we are back in the ugly world of software process improvement (see Humphrey 1990, Humphrey 1995, Pfleeger 1999, Perry et al. 1984, or Caputo 1998), where all the organizational, human, and political issues come into play. However, there is one difference compared to many process improvement initiatives: improvements are not simply selected from reference models, best practices or common sense, but on an empirical basis instead.

At this point it is important to see that our approach does not claim to change entire software engineering processes based on the results gained in a single external study. Depending on the magnitude of the modification in question, a whole series of investigations ranging from external to internal may be appropriate. Moreover, our approach helps to fill the obvious gap between external investigations, which do not tackle the specific needs of a given target environment at all, and expensive, risky internal studies. Typically, we need a whole series of investigations to adapt and implement a technology in a dedicated target environment (see also 'Related Work').
4. Experiences

In this section, we set out three applications of the ICE3 approach within the DaimlerChrysler Group. In particular, we describe investigations on defect detection for executable specification documents, on the role of formal specification documents in contracting software development, and on requirements reuse. The external environment in each investigation was the Software Laboratory. The Software Laboratory is a joint project between the DaimlerChrysler Research Department of Software Engineering and the Software Engineering and Compiler Construction Department at the University of Ulm, Germany.
This collaboration was established by an initiative of the municipal government in 1996. Since then, a series of experiments have been run. The first experiments were done with available lab packages (e.g. a replication of the Basili/Selby experiment on defect detection techniques, Basili and Selby 1987, Kamsties and Lott 1995). Over time, the proposed ICE3 approach evolved.

Each experiment is embedded in a practical course of one semester's duration. This situation implies a number of constraints which have to be incorporated in step (B) of the ICE3 approach:
• Effort. A practical course lasts only one semester (i.e. 13 to 15 weeks) with a weekly expense of time for the students of approximately 8 hours.
• Experience. Practical courses are part of the graduate program. Thus the students have a sound knowledge of computer science. Their expertise is comparable to that of employees in their first year at DaimlerChrysler.
• Number of subjects. The University of Ulm is a small university. As a consequence, six to twelve students is a realistic expectation for the number of available subjects attending a course.
(Compared with student experiments that are embedded in lectures, we are confronted with fewer subjects but more individual effort.)

In each of the following examples, we first provide motivation for the problem on the target environment's side. In our case, this comprises the various DaimlerChrysler business units. Then we elaborate the evaluation process by focusing on the relevant context factors identified (and their imitation limitations) and the experimental design derived. Remarks as to the impact of an individual study illustrate the initial exploitation of the results. As the investigation results become part of our knowledge base, further exploitation is likely to happen.

Defect Detection on Executable Specifications

Motivation and Problem

In the last few years, executable modeling languages and their supporting tools (e.g. Statemate, Matlab, Simulink, Stateflow, MatrixX, or Rhapsody) have become more powerful. This makes them more attractive in software engineering activities, for example, for programming electronic control
units (ECUs) deployed inside a car. Executable models are used for a number of purposes. They are used to build rapid prototypes which can be utilized on the desktop or even inside a passenger car. The model may be employed to clarify requirements, evaluate concepts, augment specification documents, or even for code generation. Independent of their specific role in the software engineering process, they are subject to quality assessment like every other software deliverable.

Conventional defect detection techniques for these models are ad-hoc desktop simulations using the simulation capabilities of the respective tools. As there are often uncertainties about the correctness of a model, the question logically arose whether other defect detection techniques might lead to better results. Potential candidates are inspections, formal testing (using a combination of black- and white-box testing), and ad-hoc simulation as used before.

Evaluation and Experiment

Assessing the current situation, we found the following factors to be of high relevance for the transferability of empirical results:
• Model notation. Depending on the type of system used, either state-based notations (e.g. statecharts) or continuous models (e.g. Matlab models) are employed.
• Document size. A printout of a model may vary from several pages to more than one hundred pages.
• Engineering experience. Developers are typically electrical engineers with several years of experience in the domain of automotive ECUs.
• Model characteristics. The models typically do not make use of highly advanced modeling constructs, as these complicate either code generation or manual coding. The models are readable and well structured.
• Number of persons involved. Typically, there are several engineers involved in developing, adjusting, and testing a model.

These factors can be imitated to different degrees in the Software Laboratory environment. In particular, we found the following limitations:
• Both state-based and continuous notations can be used with computer science students. However, dedicated training is mandatory.
• Within the given time-frame, we were only able to consider small models (up to 10 pages).
• Experience is hard to imitate. We planned to imitate domain knowledge by using models of common systems (e.g. a telephone answering machine). Experience with modeling notations and defect detection, however, will be lower than in the target environment.
• The model characteristics can be imitated easily.
• Imitating team size is easy, too.
Taking these limitations into account, we defined an experimental design as set out below. After several lectures and exercises on the modeling languages and defect detection techniques used, we planned to have three experiment blocks. The first one was about defect detection for C code, to make the students more familiar with the techniques. The second and third blocks dealt with defect detection for Statemate and Matlab/Simulink models, respectively. The detailed design of these two blocks is depicted in Figure 11. We planned to divide the students into three groups. Within each group, each subject acts individually (except in the inspection case, where we had to group two or three subjects to form an inspection team). In three consecutive weeks, each subject had to assess three different given models, using a different defect detection technique each week. This design seemed to be able to provide trends related to the capabilities of the various defect detection techniques. We conducted this experiment in the summer semester of 1999.
[Figure 11: the three groups (A, B, C) each assess systems 1, 2 and 3 in consecutive weeks, applying the three defect detection techniques (systematic test, inspection, ad-hoc simulation) in rotated order, so that every group uses every technique exactly once.]
Fig. 11. Experimental design of one block in the study on defect detection techniques for executable models.
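The rotation sketched in Figure 11 amounts to a 3 x 3 Latin square: every group assesses every system and uses every technique exactly once. The snippet below (Python) generates one such assignment; the concrete pairing used in the 1999 study is not recoverable here, so the cyclic rotation is only an example.

    groups = ["A", "B", "C"]
    systems = ["Sys 1", "Sys 2", "Sys 3"]
    techniques = ["systematic test", "inspection", "ad-hoc simulation"]

    # A cyclic shift yields a Latin square: each group uses each technique once,
    # and each system is covered by each technique once across the groups.
    for g, group in enumerate(groups):
        for week, system in enumerate(systems):
            technique = techniques[(g + week) % len(techniques)]
            print(f"Group {group}, week {week + 1}, {system}: {technique}")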
Impact

The study produced some interesting results:
• As expected, the more formal approaches (i.e. inspections and formal testing) required considerably more effort.
• Unexpectedly, the formal approaches did not tremendously outperform the ad-hoc approach.
• No single defect detection technique was suitable for uncovering all the defects. In particular, we found several classes of defects which seemed likely to be detected with one technique but not with the others. Thus combining techniques will help to increase defect coverage.
• Formal testing is feasible but laborious due to insufficient tool support.

Discussions with our business units confirmed our previous assessment of the high relevance and transferability of the study for industry. They also agreed that a combined approach of inspections and ad-hoc simulation seemed to be the best defect detection approach for these kinds of models. Now there are inspection checklists for various types of models available, and model inspections are becoming more and more usual in our software engineering processes, alongside ad-hoc simulations, which are still considered useful as they help to assess the dynamic behavior of the models.

Impact of Formal Specifications on Contracting Software Development

Motivation and Problem

A large portion of the software which is currently programmed into our electronic control units (ECUs) is written by contractors. Thus the responsibility for specifying the ECU's behavior lies with DaimlerChrysler (both system and software requirements). Currently, specification documents are written as high-quality structured text augmented with tables and figures. Often, models are built within our organization to achieve a deeper understanding of the functionality (see also 'Defect Detection on Executable Specifications'). In this situation, the question arose whether these models should be given to the contractors as part of the specification documents (or even as a replacement) or whether textual specification documents lead to
better software quality. Answering this question could help to focus or even reduce the effort currently spent on building the models and textual specifications. Simply asking the contractor would not provide a really meaningful answer, as there is a lot of politics involved. Using models with well-defined semantics as specifications limits the contractor's design space, as much more detail is given by the models (and, as a consequence, requested from the supplier). It is fairly obvious that experimentation within ongoing automotive development projects was not feasible. Thus we considered conducting an external experiment.

Evaluation and Experiment

Assessing the current situation, we found the following factors to be of high relevance for the transferability of results:
• Model notation. Depending on the type of system used, either state-based notations (e.g. statecharts) or continuous models (e.g. Matlab models) are employed.
• System type. ECUs are typically hard real-time systems with some directly attached sensors and actuators. Often, they are linked with other ECUs via a communication bus (CAN bus), and the microcontrollers deployed are only capable of integer arithmetic.
• System size. The software for ECUs varies from a few kilobytes up to some megabytes. The (textual) specification documents vary from some ten pages to some hundred pages.
• Engineering experience. The technical editors responsible for the specifications are typically electrical engineers with several years of experience in the domain of automotive ECUs.
• Development process. Most ECUs are developed on the contractor's side (both hardware and software), taking the specification documents provided into consideration. During development, several increments of the system are built in order to meet the constraints given by the overall car development process.
• Quality assurance. System quality is assessed both on the contractor's and the customer's side, with techniques ranging from module testing to ECU testing and test driving.

Given these factors, we evaluated the degree to which the various factors can be imitated in the Software Laboratory environment and tried to
work out an appropriate study. In particular, we made the following decisions: • As computer science students are more familiar with state-based notations, we decided to use statecharts since they can be modeled by using Statemate. • To make the study not only relevant for the target environment but also attractive for the students, we decided to use the LEGO Mindstorms4 system, which contains an RCX brick. This brick contains a Hitachi H8300 microcontroller with 32-kilobyte RAM and integer arithmetics. By using the LegOS (Knudsen 1999) operating system, we were able to write C programs which can be crosscompiled using gcc in a LINUX environment and downloaded via an infra-red interface. In order to keep the systems fairly simple, we decided to skip interECU communication. • Due to the given time constraints, we were forced to employ small systems. A maximum specification document size of eight to twelve pages seemed to be appropriate. • To imitate domain knowledge, we decided not to focus on automotive ECUs but instead on ECUs for common understandable systems, e.g. car park gate, microwave oven, or a simple offset printing machine. • An incremental development process with several increments could not be done due to the given time constraints. Thus we decided to restrict ourselves to a single development cycle. These considerations led to the following experimental design. After some introductory lectures and exercises on writing specification documents (both text and statecharts), LEGO RCX, and LegOS, we planned to have two runs with a design as depicted in Figure 12. Between these two runs, the assignments of students to model-based or text-based development was to be changed. In each run we planned to proceed as follows: First, we assigned each student randomly to one of four teams. Two teams applied the model-based approach, the other two teams the text-based approach. For each of two given systems there is one model-based and one text-based team. Then, each team was to write a specification document for their system using the given specification approach. The input for each team was a hardware system built
4 LEGO and LEGO Mindstorms are registered trademarks of the LEGO Group.
Fig. 12. Core of experimental design of the study on the role of formal specification documents (for each of the two systems: specification, implementation, testing, and the assessment of the customer's understanding).

The specific task for each team was to write a precise specification document in their role as customer. After finishing the specification document, we used a questionnaire to assess the customer's understanding of their system. For instance, if we had requested a 'quick' response of the system in the system sketch, we asked whether this requirement had been refined or not. Then we gave each specification document to another team, which had specified a different system using the same specification approach. Now the teams acted as contractors. The specific task for each team was to write executable C code implementing the requirements on the given hardware. If there were questions regarding the specification, the contractors were able to ask the customers by email. After the two-week implementation period, the teams changed their roles again, turning back into customers. Here, they tested the contractor's implementation and evaluated whether they were willing to accept the system or not. After the first run, we carried out a second one with two different systems and changed assignments (the teams which had used the model-based approach in the first run used the text-based one in the second run and vice versa). At the end of the experiment there was a wrap-up meeting to collect the students' impressions and qualitative observations. Using this design, we expected to gain results which were sufficiently relevant to the target environment. We conducted this experiment in the winter semester of 2000/01 and again in the summer semester of 2001.
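To give a flavour of the contractors' task, the following sketch shows how a small part of such a statechart specification might be coded in C. It is purely illustrative: the system (a car park gate), its states, and the stubbed event and motor functions are hypothetical and merely stand in for the LegOS sensor and motor calls the students actually used on the RCX.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical states of a car park gate controller, mirroring the
   states of a statechart specification (all names invented here). */
typedef enum { GATE_CLOSED, GATE_OPENING, GATE_OPEN, GATE_CLOSING } gate_state;

/* In the study these would be LegOS sensor/motor calls on the RCX;
   here they are simple stubs driven by a scripted event sequence.  */
static const char *events[] = { "ticket", "fully_open", "car_passed", "fully_closed" };
static int tick = 0;
static int event_is(const char *e) { return strcmp(events[tick], e) == 0; }
static void motor(int speed) { printf("motor(%+d)\n", speed); }

/* One step of the state machine: each case corresponds to one state of
   the statechart, each 'if' to one outgoing transition.              */
static gate_state step(gate_state s)
{
    switch (s) {
    case GATE_CLOSED:  if (event_is("ticket"))       { motor(+1); return GATE_OPENING; } break;
    case GATE_OPENING: if (event_is("fully_open"))   { motor(0);  return GATE_OPEN;    } break;
    case GATE_OPEN:    if (event_is("car_passed"))   { motor(-1); return GATE_CLOSING; } break;
    case GATE_CLOSING: if (event_is("fully_closed")) { motor(0);  return GATE_CLOSED;  } break;
    }
    return s;
}

int main(void)
{
    gate_state s = GATE_CLOSED;
    for (tick = 0; tick < 4; tick++) {
        s = step(s);
        printf("after '%s': state %d\n", events[tick], (int)s);
    }
    return 0;
}
```

A text-based specification would describe the same behaviour in prose; the observation reported below, that contractors 'just coded the model', refers to exactly this close correspondence between statechart states and code.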
Impact

The experiment produced some interesting results:
• Building model-based specifications requires more effort than writing text-based specifications. Implementing systems from a model-based specification, however, requires less effort. In total, we observed slightly more effort for the model-based approach than for the text-based one.
• Model-based specifications increased the customer's understanding of the system.
• Model-based specifications drive the implementation. Although the models are intended to be purely functional specifications, they were also treated as implementation models ('we just coded the model').
• Model-based specifications reduce the care taken by developers, i.e. developers are less skeptical about requirements than in the text-based case ('the model must be right').
• We found no significant difference in terms of the defects detected during acceptance testing.
The results, especially the over-estimated accuracy of the models, justify some skepticism about purely model-based specifications. As a consequence, purely model-based specification is currently not recommended for new development projects. Rather, a combination of text and models seems to be most promising, and a number of current development projects are aligned with this policy.

Reusing Specification Documents
Motivation and Problem

Many software-based systems inside a car do not offer completely new functionality. Instead, they are enhancements and adaptations of existing functionality. Whatever functionalities a future car may offer, it is highly probable that engine control, ABS, cruise control, interior light control, and speed display (to name only a few state-of-the-art standard features) will be amongst them. As a consequence, writing specifications for such systems means reusing technical requirements. Today, reuse is frequently done merely by copying an existing document and then renaming and editing it. The
technical editor has to take care that all the necessary modifications are made in the new document. As such specification documents tend to be large (i.e. several hundred pages), this is a challenging activity. In particular, a detailed technical level of requirements without a more abstract level of description above it requires a deep system understanding in order to maintain and evolve the requirements correctly and consistently. Here we clearly see potential for improvement. A requirements reuse process is currently under development which tries to add higher levels of abstraction above the technical levels, in order to make legacy requirements documents more comprehensible and to facilitate change management. Applying such an approach in real development projects is a risky undertaking, however, as many of its consequences (both technological and human-based) are unclear. An external investigation would seem to be an effective means of gaining first insights.

Evaluation and Experiment

The crucial factors for the reuse approach are the size and complexity of the specification documents. It is fairly simple to evolve a small document. The more complex the system and its interactions at the subsystem level, the more likely it is that inconsistent and incomplete changes will be introduced. However, time (and, as a consequence, size) is hard to imitate in the Software Laboratory environment due to the given time constraints. To cope with this problem, we decided to use specification documents which contain both textual and model-based requirements. Some parts of the system were defined by means of natural-language text, other parts by statecharts. This increases the complexity of the system without increasing the document size. Other relevant influencing factors are the degree of detail of the requirements and the fact that a system (rather than software only) is specified. These factors can easily be imitated. These considerations led to the conclusion that an external study was feasible. We planned a controlled study, which again started with a large block of introductory lectures and exercises on software and system requirements, embedded systems, and requirements reuse. Then there were to be two blocks in which the traditional copy-edit approach was compared with the new requirements reuse approach (NRRA). The two blocks differed in the assignment of the students to an approach and in the systems used. Figure 13 depicts the design of one such block.
(Figure 13: starting from proposed changes, Group 1 evolves a traditional specification document into an enhanced specification, followed by an inspection; Group 2 does the same starting from a reuse-oriented specification document.)
Fig. 13. Experimental design of one block in the study on reusing requirements documentation.
The main element of the NRRA is a document structure which augments the existing technical specification with abstract descriptions, interface definitions between system functions, and trace links from the high-level descriptions to the low-level specifications. Additionally, the NRRA provides guidelines on how to proceed when reusing an existing specification for a new (enhanced) system.

Impact

At the time this chapter was written, the study was still running. Thus it is not yet possible to report on concrete results or direct impacts. However, an ongoing research project deals with the question of systematic requirements reuse. This project is the first customer for the results of the study. Depending on the empirical results, refinements of the NRRA, further studies, or pilot projects are expected to follow.
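Although the NRRA itself is a document structure plus guidelines rather than code, the idea of trace links between abstraction levels can be sketched in a few lines of C. The record layout, the requirement identifiers, and the consistency check below are invented for this illustration and are not part of the NRRA.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative only: a trace link connects a high-level function
   description to one low-level (technical) requirement.           */
struct trace_link {
    const char *high_level;  /* e.g. an abstract function description */
    const char *low_level;   /* e.g. a technical requirement ID       */
};

static const struct trace_link links[] = {
    { "Open gate on valid ticket",   "REQ-017" },
    { "Open gate on valid ticket",   "REQ-018" },
    { "Close gate after car passed", "REQ-021" },
};

static const char *low_level_reqs[] = { "REQ-017", "REQ-018", "REQ-021", "REQ-022" };

/* A simple reuse check: report every technical requirement that is not
   reachable from any high-level description and is therefore easily
   overlooked (or obsolete) when the document is evolved.              */
int main(void)
{
    for (size_t i = 0; i < sizeof low_level_reqs / sizeof low_level_reqs[0]; i++) {
        int traced = 0;
        for (size_t j = 0; j < sizeof links / sizeof links[0]; j++)
            if (strcmp(links[j].low_level, low_level_reqs[i]) == 0)
                traced = 1;
        if (!traced)
            printf("untraced requirement: %s\n", low_level_reqs[i]);
    }
    return 0;
}
```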
5. Related Work and Discussion

At present, empirical research is seen as an important part of software engineering research. Empirical work provides essential input for software process improvement activities. In 1996, Wohlin et al. (1996) introduced a framework for technology infusion. At an abstract level, they describe a process which helps to introduce a new technology in four steps: characterization of the (target) environment, technology selection, technology assessment, and technology introduction. To assess
technologies, the authors present several supplementary techniques. Figure 14 depicts this approach and the interdependencies of the various techniques. The gray arrow in the background represents 'cost, confidence in the outcome, and similarities in terms of context in comparison to the ordinary software development environment' (Wohlin et al. 1996, p. 169). Thus, their approach employs the concept of distance (and therefore the relevance of an investigation for a given target environment) implicitly. The difference between Wohlin et al. (1996) and our approach is that they characterize solely the strengths and weaknesses of the various investigation types, whereas we aim to derive investigation types based on the needs of a given target environment. The basic concept of assessing technologies on the basis of the needs and characteristics of a given target environment is common to both approaches. The idea of a series of investigations and of focusing investigations on a dedicated environment is also part of both works. Here, we enter a debatable area: one might argue that the main purpose of empirical work is to provide evidence for general conclusions. We do not accept this as the major purpose here. Instead, we believe that empirical results which are valid for a particular environment should be the first goal of empirical work: first, because they are more precise than generally aimed results, and second, because they can provide shorter-term benefit. As a consequence, generalization of results to environments other than the original target environment is not covered by the ICE3 approach at all.
(Figure 14 arranges the assessment techniques along an axis running from the industrial environment to the desk: industrial environment (pilot project, normal project), laboratory (reduced experiment, full experiment), and desk (literature study, basic impact analysis, detailed impact analysis); case study and experiment appear as the associated evaluation methods.)
Fig. 14. Alternative techniques for the assessment of software process modifications (according to Wohlin et al., 1996).
The observation that an individual investigation rarely provides enough evidence as to the effectiveness of a technology can also be found in the work of Daly et al. (1995). The authors propose that an entire series of investigations be built around a single question, the so-called multi-method approach. Their example is an investigation of object-oriented software development which was to uncover the strengths and weaknesses of this paradigm. The starting point of their investigations was an expert survey carried out among 13 experienced software engineers. The survey culminated in a strength-weakness profile. Based on this profile they developed a questionnaire which was distributed via various channels (email, news, postal service). The answers of the 275 respondents were used to refine the profile. Then, they designed laboratory experiments which were used to examine previously identified hypotheses such as 'deep inheritance hierarchies have a negative impact on the understandability of object-oriented software'.

The general significance of adequate and timely investigation of new technologies with respect to their transfer to industrial practice is also laid out in Pfleeger (1999). In her work, Pfleeger emphasizes the tremendous latency times between the invention of new technologies and their widespread acceptance. As early as 1985, Redwine and Riddle (1985) tried to measure these latency times and found intervals ranging between 11 and 23 years. A main reason for these long latency times lies in the uncertainties about the effect of new technologies on one's own (target) environment. Only a sufficient degree of sound experience can persuade the majority of users to drop their bias and adopt new technologies. Figure 15 depicts the adoption of a new technology and its distribution over time. The earlier a technology is adopted, the higher the risk. In order to tackle this problem, Pfleeger argues that there is a need for a more systematic approach to technology introduction. This process may include the following steps:
• Technology creation.
• Technology evaluation. This step has to answer the question 'whether there is any benefit to using the new technology relative to what they [the target environment] are doing' (Pfleeger 1999, p. 118). Should there be no visible benefit, the new technology can be discarded.
Fig. 15. Adoption of new technologies (according to Pfleeger 1999).
• Technology packaging and support.
• Technology diffusion.
In her work, Pfleeger does not address the 'how' of technology evaluation. Her concern is to emphasize that there are various ways to achieve confidence in technology evaluations. This aspect was also investigated by Zelkowitz et al. (1998). In their work, they deal with the perception (subjective credibility) and reality (objective credibility) of technology evaluations. They found, for instance, that case studies, field studies, and formal experiments are more credible than other kinds of studies. This observation is in line with our own, as the distance between the experimentation and target environments is the crucial dimension which decides whether external results are accepted or rejected. In the conclusions of her work, Pfleeger criticizes that 'some of the current empirical studies of software engineering technologies are loosely-related and poorly-planned' (Pfleeger 1999, p. 123). Here, she points to a major deficit in current empirical software engineering research: the needs in the target environment and the results in the experimentation environment often do not mesh. This is where our work is intended to provide a (hopefully significant) contribution.
6. Conclusions

In this chapter, we presented an approach for identifying, conducting, and exploiting external experiments (ICE3). This approach deals with the
producer-consumer problem of empirical investigations in the software engineering field: despite the need for empirical results, many organizations (= consumers) are not able to conduct empirical studies for economic reasons. Laboratory environments (= producers) have the capabilities to conduct empirical studies, but typically these studies are driven more by scientific interests than by concrete needs. To overcome this situation, needs for empirical investigations have to be handed over from consumers to producers, and the results gained have to be transferred back to the consumer side. Of course, this concept will only work if the characteristics on both the consumer and producer side are sufficiently similar. This is where the ICE3 approach comes into play. The ICE3 approach provides a process for identifying empirical needs and evaluating whether they can be imitated by a given laboratory environment so that the results are relevant to the consumer side. The ICE3 approach has now been in use at DaimlerChrysler for several years. Various empirical studies driven by business units' needs have been designed for and conducted in the Software Laboratory, a collaboration between DaimlerChrysler and the University of Ulm. The results returned have helped to improve the current software development practice in the DaimlerChrysler business units. Again, it is important to see that the ICE3 approach is human-based. As laid out in the introduction, the credibility of results is important to promote the transfer of empirical results. There has to be a close collaboration between researchers and developers. Making the ICE3 approach beneficial for all partners involved requires organizational commitment on both sides, similar (research) interests, and interaction over a significant period of time. For instance, research questions cannot just be thrown 'over the fence' to a research institute. Instead, some joint work is inevitable. We found that people who can work for some time on the other side (either developers as researchers or vice versa) help tremendously to increase the other side's understanding, which is a prerequisite for deep and fruitful collaboration. As a consequence, applying the ICE3 approach requires financial investment. The discussion of the rise and fall of the NASA SEL (Basili et al. 2002) also provides some interesting insights into the soft factors of such collaborations. Living the ICE3 approach is a way to promote software process improvement. The concept comprises not only a methodological part but
also a cultural one. Several identify-conduct-exploit cycles are needed before the full effects of the ICE3 approach can be felt. At DaimlerChrysler it took us several years to understand how to use our external laboratory environment most effectively. Increasing understanding has brought about a commensurate increase in the demand for empirical investigations, so that we are now faced with numerous candidate investigations waiting to reap the benefits of our approach.
References

Basili, V. and G. Caldiera. Improve software quality by reusing knowledge and experience. Sloan Management Review, 37(1):55-64, 1995.
Basili, V., G. Caldiera, F. McGarry, R. Pajerski, and G. Page. The Software Engineering Laboratory - an operational software experience factory. In Proceedings of the 14th International Conference on Software Engineering (ICSE), pp. 370-381, May 1992.
Basili, V., G. Caldiera, and D. Rombach. Experience factory. In J. J. Marciniak, editor, Encyclopedia of Software Engineering, volume 1, pp. 469-476. John Wiley & Sons, New York, 1994.
Basili, V., F. McGarry, R. Pajerski, and M. Zelkowitz. Lessons learned from 25 years of process improvement: The rise and fall of the NASA Software Engineering Laboratory. In Proceedings of the 24th International Conference on Software Engineering (ICSE), pp. 69-79, May 2002.
Basili, V. and R. Selby. Comparing the effectiveness of software testing strategies. IEEE Transactions on Software Engineering, 13(12):1278-1296, 1987.
Basili, V., R. Selby, and D. Hutchens. Experimentation in software engineering. IEEE Transactions on Software Engineering, 12(7):733-743, July 1986.
Caputo, K. CMM Implementation Guide - Choreographing Software Process Improvement. Addison Wesley, Reading, MA, 1998.
Daly, J., J. Miller, A. Brooks, M. Roper, and M. Wood. A multi-method approach to performing empirical research. Research report 95/189 [EfoCS-12-95], University of Strathclyde, 1995.
Fenton, N. and S. Pfleeger. Software Metrics - A Rigorous and Practical Approach. International Thomson Computer Press, London, 2nd edition, 1996.
Houdek, F. Empirically Based Quality Improvement - Systematic Use of External Experiments in Software Engineering. Logos Verlag, Berlin, 1999. Ph.D. thesis, University of Ulm (in German).
Houdek, F., F. Sazama, and K. Schneider. Minimizing risk at introducing new software technologies into industrial practice by using external experiments. In Proceedings of the 27th GI Annual Conference, pp. 388-397. Springer Verlag, 1997 (in German).
Houdek, F., K. Schneider, and E. Wieser. Establishing experience factories at Daimler-Benz - an experience report. In Proceedings of the 20th International Conference on Software Engineering (ICSE), pp. 443-447. IEEE Computer Society Press, 1998.
Humphrey, W. Managing the Software Process. SEI Series in Software Engineering. Addison Wesley, Reading, MA, 1990.
Humphrey, W. A Discipline for Software Engineering. SEI Series in Software Engineering. Addison Wesley, Reading, MA, 1995.
Juristo, N. and A. Moreno. Basics of Software Engineering Experimentation. Kluwer Academic Publishers, 2001.
Judd, C., E. Smith, and L. Kidder. Research Methods in Social Relations. Holt, Rinehart and Winston, Orlando, FL, 6th edition, 1991.
Kamsties, E. and C. Lott. An empirical evaluation of three defect-detection techniques. In W. Schafer and P. Botella, editors, Proceedings of the 5th European Software Engineering Conference, number 989 in Lecture Notes in Computer Science, pp. 362-383. Springer Verlag, September 1995.
Knudsen, J. The Unofficial Guide to LEGO Mindstorms Robots. O'Reilly & Associates, Inc., 1999.
Landes, D., K. Schneider, and F. Houdek. Organizational learning and experience documentation in industrial software projects. International Journal on Human-Computer Studies, 51:646-661, 1999.
Miller, J., J. Daly, M. Wood, M. Roper, and A. Brooks. Statistical power and its subcomponents - missing and misunderstood concepts in software engineering empirical research. Journal of Information and Software Technology, 39:285-295, 1997.
Miller, J. Applying meta-analytical procedures to software engineering experiments. Journal of Systems and Software, 54:29-39, 2000.
McGarry, F., R. Pajerski, G. Page, S. Waligora, V. Basili, and M. Zelkowitz. An overview of the Software Engineering Laboratory (SEL). Technical Report SEL-94-005, Software Engineering Laboratory, NASA Goddard Space Flight Center, Greenbelt, MD, December 1994. Available at http://sel.gsfc.nasa.gov/doc-st/docs.htm.
Pfleeger, S. Experimental design and analysis in software engineering. Annals of Software Engineering, 1:219-253, 1995.
Pfleeger, S. Understanding and improving technology transfer in software engineering. Journal of Systems and Software, 47:111-124, July 1999.
Perry, D., N. Staudenmayer, and L. Votta. People, organizations, and process improvement. IEEE Software, pp. 38-45, July 1994.
Rombach, D., V. Basili, and R. Selby, editors. Experimental Software Engineering Issues: Critical Assessment and Future Directions. Number 706 in Lecture Notes in Computer Science. Springer Verlag, Berlin, 1993.
Redwine, S. and W. Riddle. Software technology maturation. In Proceedings of the 8th International Conference on Software Engineering (ICSE), pp. 189-200, August 1985.
Wohlin, C., A. Gustavsson, M. Host, and C. Mattsson. A framework for technology introduction in software organizations. In Proceedings of the Software Process Improvement Conference (SPI), pp. 167-176, Brighton, UK, 1996.
Wohlin, C., P. Runeson, M. Host, M. Ohlsson, B. Regnell, and A. Wesslen. Experimentation in Software Engineering. Kluwer Academic Publishers, Boston, MA, 1999.
Zelkowitz, M. and D. Wallace. Experimental models for validating technology. IEEE Computer, 31(5):23-31, 1998.
Zelkowitz, M., D. Wallace, and D. Binkley. The cultural clash in software engineering technology transfer. In Proceedings of the 23rd NASA/GSFC Software Engineering Workshop, Greenbelt, MD, December 1998.
CHAPTER 5
(Quasi-)Experimental Studies in Industrial Settings

Oliver Laitenberger and Dieter Rombach
Fraunhofer Institute for Experimental Software Engineering
Sauerwiesen 6, D-67661 Kaiserslautern, Germany
{Oliver.Laitenberger, Dieter.Rombach}@iese.fhg.de
Software engineering research primarily deals with technologies that promise to enable software organisations to deliver high-quality products on time and within budget. However, in many cases researchers do not investigate the validity of these promises and, therefore, information about the relative strengths and weaknesses of those technologies in comparison with already existing ones is often missing. Although experimentation and experimental software engineering have been suggested to address this issue, and significant progress has been made in this area throughout the last couple of years, there is still a lack of experimental work in industrial settings. The reasons for this poor situation range from practical constraints, such as the costs associated with a study and the benefits for a single company, to more methodological ones, such as the level of control that can be imposed on the different treatment conditions in an industrial setting. In this chapter we present a practical approach that helps overcome most of these objections. The approach represents a balance between the benefits for practitioners and methodological rigor. In essence, it uses training situations to set up and run empirical studies. While this procedure disqualifies a study as a pure controlled experiment, the characteristics of a quasi-experiment can often be preserved. The chapter explains the principle of the approach and the differences between controlled experiments and quasi-experiments and, finally, presents an example of a quasi-experiment in an industrial setting. Keywords: Controlled experiments; quasi-experiments; empirical software engineering; software inspection.
1. Introduction

Much of the research in software engineering is devoted to the development of new technologies, such as languages, techniques, or tools, that promise to leverage current software development practices. A researcher working on one of those technologies often expects that its application, by definition, improves the quality of the software product, increases the productivity of software development processes, or both. In many cases, he or she is so convinced of the benefits of the technology that no effort is spent on its empirical validation. Hence, it is no surprise that points of view, personal observations, and intuition dominated the software engineering discipline during its initial period [71]. The lack of solid empirical knowledge about the conditions under which a technology works best has resulted in the current situation: practitioners have access to many software technologies, but do not know which one to choose for their projects. A good example of this dilemma is software testing. There is a plethora of testing techniques today, and each of them is expected to outperform the cost-effectiveness of the others. Since very little quantitative underpinning of this claim is available, a project manager has no reason to advocate a different testing approach than the one used for several years.

In previous centuries, several technical disciplines that are now called "engineering" exhibited problems similar to those frequently observed in software development today. At that time, for example, bridges had become so large and complex that they frequently collapsed when trains crossed them. Many mistakes were also made in the shipbuilding area. As a consequence, a number of ships sank at the beginning of their maiden voyage due to unsuitable naval architecture. The losses in terms of money as well as in terms of human lives in both cases were tremendous. Collapses of bridges as well as ship disasters on the maiden voyage due to architectural problems are quite rare today. Both are avoided by ensuring, before construction, that the design will satisfy its specification. The approach of choice to avoid further disasters was the use of rigorous inspection and review technologies, especially in early phases, to ensure quality. This practice paved the road towards the development of engineering in these areas. It seems that the history of the pre-engineering fields is repeating itself today in the field of software development. Compared to the pre-engineering days of bridge or shipbuilding, software developers have killed only relatively few people to
date. However, in financial terms, losses due to software defects have definitely reached levels comparable to those of the early days of bridge construction and shipbuilding, as examples such as Ariane 5, and many others before that disastrous event, show. The tremendous financial losses provided the motivation for some companies to start a metamorphosis from crafting software to developing software according to engineering principles. The strategy they selected was based on the historical examples. Among other measures, these companies increasingly invest in quality and process enhancements. They use, for example, measurement programs to get quantitative insight into their development approach, as well as training and coaching activities to improve the skills of their employees.

The lack of solid engineering principles in software development has also been recognized in software engineering research. In particular, it is the absence of empirical work that characterizes the field. In 1982, Moher and Schneider [71] described the then-current situation of experimental research in software engineering. They argued for a new research area which aims at formalizing the use of controlled group experimentation in programming language design and software engineering. In 1986, Basili et al. [2] published an influential paper in which they advocated the experimental paradigm as a learning process in software engineering. They described a framework for analyzing most of the experimental work in this area. Since then, experimental software engineering has been establishing itself as a sub-discipline of software engineering. Despite the progress that has been made in this area throughout the last decade, there has been very little synergy between research on the one hand and the software industry on the other regarding the design and conduct of experimental studies. In reviewing a number of major journals that publish empirical software engineering research, we observed that much of the experimental work reported in the literature has been performed in an academic environment with students as subjects. However, as pointed out by Brooks [8], students are often not representative of the population of software engineers in industry. Brooks argues that experienced professionals often have different abilities and problem-solving skills than beginners. Hence, experimentation in software engineering should rather be performed with professionals in industry than with students in order to draw valid conclusions [23].
There are many reasons for the lack of experimental work in field settings. A prevalent one is the cost associated with empirical studies. Considering the fact that most experiments in software engineering are performed with human subjects, designing and running an experiment is a costly endeavor. In fact, according to Hays [42], each experiment is a problem in economics. Each choice that a researcher makes for the experiment has its price. For example, the more treatments, subjects, and hypotheses one considers, the more costly an experiment is likely to be. This is particularly true for experiments in the software industry with its many budget and schedule problems. Therefore, the objective of a researcher in software engineering must be to choose an experimental approach that minimises the threats to validity within the prevailing cost constraints and maximises the benefits for the organisation. By "experiment" we mean any experimenter-controlled or naturally occurring event (a "treatment") which intervenes in the software engineering practices of developers and whose probable consequences can be empirically assessed. By "field" we understand any setting which subjects do not perceive to have been set up for the primary purpose of conducting empirical research. This is particularly the case in industrial environments. According to Campbell and Stanley [11], experiments can be divided into two major categories depending on whether the various treatment groups were formed by assigning subjects to treatments in random or non-random fashion.1 The former approach is called a "true" experiment2 and the latter a "quasi"-experiment. Most of this chapter will be spent outlining and discussing quasi-experimentation. There are three reasons for this particular focus. First, there is probably less knowledge about the design and conduct of quasi-experiments in the software engineering community than about controlled experiments [2][96]. Second, quasi-experiments are often more feasible in the software industry than controlled experiments; therefore, they represent a promising approach to increase the number of empirical studies in industry. Finally, controlled experiments sometimes break down and have to be analyzed in quasi-experimental fashion.

1 Edgington [29] claims that a "true" experiment also requires random selection from the population under study. Since this is impossible in a software engineering experiment, we rather stick to the definition of Campbell and Stanley [11].
2 Researchers in software engineering often use the term "controlled" experiment instead of "true" experiment.
The particular focus on quasi-experimentation should not be understood as a preference for quasi-experiments over controlled ones. As we explain later on, the results of quasi-experiments are often less interpretable than the results of controlled experiments. However, one often finds the situation in software engineering that random assignment is impossible or, even after taking place, cannot be maintained for the whole course of an experiment. Hence, some methodological knowledge about the design and conduct of quasi-experiments is beneficial. The purpose of this chapter therefore is to provide a relatively concise description of the quasi-experimental approach in a software engineering context. Since a large number of quasi-experimental design options are feasible, no attempt is made here to catalogue all possible design strategies. Rather, the goal is to detail the underlying rationale of quasi-experiments and to present major design options that may be particularly interesting for empirical software engineering research. A detailed example serves as an illustration of this type of empirical study. The chapter is structured as follows. It first explains the role of controlled experiments in software engineering research and why they are desirable. It continues by discussing specific issues that limit the applicability of controlled experiments in industrial settings and describes how quasi-experiments can be used to address those issues. Quasi-experimentation is elaborated by describing a promising design and by evaluating its advantages and limitations. Finally, the chapter presents an example of a quasi-experiment conducted in an industrial setting.
2. Controlled Experiments in Software Engineering

In this section, we discuss the role of controlled experiments in software engineering. We define the terminology and examine why controlled experiments (involving random assignment to treatment groups) are not more common in software engineering empirical research in organizations. The practical problems with true experiments in the field are outlined, and ways of overcoming them are mentioned.

Definition

The most fundamental reason for controlled experiments is to examine the relationship between two or more variables. Sometimes this is referred to as
exploratory research, or as research designed to determine the conditions under which certain events occur. Often the impetus for beginning such a study is the appearance of a new or improved software technology. If, for example, a researcher is interested in the effects of a new inspection technique on the number of detected defects, or the impact of a CASE tool on software development costs, he or she is going to study the relationship between two variables. The gathering of such data in systematic ways under controlled circumstances is the primary function of controlled experiments. A variable can be considered as an event or condition that a researcher observes or measures or plans to investigate and that is liable to variation (or change) [84]. Variables are usually classified as independent and dependent. Independent variables are those that the researcher directly manipulates. The dependent variables are those in which the researcher is interested. They are often defined as the measured changes in the subject or team responses. According to Kirk [56], an experiment can be characterized by (1) the manipulation of one or more independent variables, (2) the use of controls, such as the random assignment of subjects to treatment conditions, and (3) the careful observation or measurement of the dependent variables. The first and second criteria distinguish controlled experiments from other empirical research strategies. Controlled experiments are therefore vehicles for testing causal hypotheses. This allows a researcher to arrange conditions so that the expected results will occur whenever he or she wishes. Cook and Campbell describe three necessary conditions for assuming with any confidence that the relationship between two variables is causal and that the direction of the causation is from A to B [20]. These conditions are temporal antecedence, covariation of the treatment with the effect, and the implausibility of alternative explanations, such as causation running from B to A. A good experiment makes temporal antecedence clear; is sensitive and powerful enough to demonstrate that a potential cause and effect covary; rules out all third variables which might alternatively explain the relationship between the cause and effect; and also eliminates alternative hypotheses about the constructs involved in the relationship. However, inferring a causal relationship at one moment in time, using one research setting, and with one sample of respondents, would give little confidence that a demonstrated relationship is robust. Hence, a further step is the replication of the study in which some of the parameters are varied.
A controlled experiment assumes that a researcher can control the independent variables. However, there are many more variables that may have an impact on the dependent variable and that are difficult or impossible to control. These can cause inferential errors if they are confounded with the independent variables. To alleviate this situation, an experimentalist usually uses randomization. The basic idea behind randomization is to assign the chosen levels of the treatment variables in random order. The task of a researcher who sets up an experiment is to examine the influence of a particular treatment in such a way that extraneous variables will not interfere with the conclusions that the researcher wishes to draw. Moreover, he or she wants to generalize the conclusions to the populations from which the subjects were selected. However, it is the nature of any empirical study that assumptions are made that later on impose restrictions on both tasks. These assumptions are referred to as threats to internal and external validity.

Threats to Validity

Each empirical study is subject to criticism due to weaknesses in the investigated theory, design, conduct, analysis, and interpretation of the results. These concerns are denoted "threats to validity" and are usually clustered into threats to internal and to external validity. A third area, which is not discussed here, is related to construct validity.

Threats to Internal Validity

Controlled experiments help to reduce the plausibility of alternative explanations. In the ideal case, only one explanation, namely the independent variables, accounts for a change in the dependent variables. Variables other than the independent variables that could explain the results are called threats to internal validity. Several threats to internal validity have been identified [20]. The following table provides an overview of the most common ones.
History: Any event other than the treatment occurring at the time of the experiment that could influence the results or account for the pattern of data otherwise attributed to the treatment.

Maturation: Any change over time that may result from maturation processes within the subjects or teams that are not part of the treatment of research interest. Such processes may include, for example, learning effects.

Testing: Any change that may be attributed to the effects of repeated assessment. Testing constitutes an experience that, depending on the measure, may lead to systematic changes in performance. It is particularly prevalent in within-subject experimental designs.

Instrumentation: Any change that takes place in the measuring instrument or assessment procedure over time. For example, a standard questionnaire is changed during the experiment.

Statistical regression: This is a threat when an effect might be due to respondents being classified into experimental groups at the pretest on the basis of pretest scores or correlates of pretest scores. When this happens and measures are unreliable, high pretest scorers will score relatively lower at the posttest and low pretest scorers will score relatively higher at the posttest. It would be wrong to attribute such differential change to a treatment; it is due to statistical regression.

Selection biases: Any difference between subjects or groups that is due to the differential selection or assignment of subjects to groups.

Diffusion of treatments: If the experimental and control groups can communicate with each other, the controls may learn the information and thereby receive the treatment. The experiment thus becomes invalid because there is no functional difference between the treatment and the control groups.

Table 1. Major threats to internal validity.
Threats to External Validity

Although the purpose of an experiment is to demonstrate the relationship between dependent and independent variables, this is not the only goal of a researcher. He or she wants to generalize the conclusions to the population under study, that is, beyond the conditions of the experiment. This is referred to as external validity, and the characteristics that may limit the generality of the results are referred to as threats to external validity. The following table presents the main threats to external validity.
Generality across subjects: The extent to which the results can be extended to subjects whose characteristics differ from those included in the investigation.

Generality across settings: The extent to which the results extend to other situations in which the treatment functions, beyond those included in the experiment.

Reactive experimental arrangements: The possibility that subjects may be influenced by their awareness that they are participating in an investigation or in a special program.

Reactive assessment: The extent to which subjects are aware that their behavior is being assessed and that this awareness may influence how they respond.

Table 2. Major threats to external validity.
Constraints in Field Settings

Controlled experiments conducted in industrial settings with humans almost always have professional engineers as subjects. This is in contrast with "laboratory" experiments, in which university students are usually the subjects. However, working with professional engineers entails additional constraints on the design of an experiment. Three constraints were important influences on the design in an industrial setting:

Inability to Withhold Treatment. A subject in a software engineering experiment has to learn and apply all investigated technologies. From our personal experience we know that withholding a treatment may contribute to demotivation and subsequent confounding of the treatment effects. This immediately suggests a repeated-measures design.3 Even though it is dictated by the study constraints, a repeated-measures design has certain additional advantages over a between-subjects design. Repeated-measures designs have higher statistical power. This is because there will almost always be a positive correlation between the treatments [53]. Previous empirical studies of different aspects of developer performance have found that individual performance differences can vary from 4:1 to 25:1 across experienced developers with equivalent backgrounds [8]. This high subject variability can easily mask any treatment effect that is imposed on the subjects in an experiment. This has caused some methodology writers to strongly recommend repeated-measures designs, since subjects effectively serve as their own control [8]. Repeated-measures designs enable a direct and unconfounded comparison between the different treatments [53].
3 In a repeated-measures design, subjects receive two or more treatments.
Finally, repeated-measures designs have an economical advantage in that fewer subjects are required, compared to a between-subjects design, to attain the same statistical power levels [66].

No Control Group. The validity of the concept of a "no-treatment" control group in software engineering research has been questioned [55]. This is because it is not clear what a "no-treatment" group is actually doing. A suggested alleviation of this in an industrial setting is that the "no-treatment" group applies the technology that it usually employs in its practice.

Natural Assemblage of Groups. In industrial settings, the individuals that form a treatment group are often neither randomly selected nor randomly assigned to treatment groups. Therefore, multiple unknown selection factors outside our control determine the make-up of each group. Certainly, availability and schedule conflicts play a crucial role in the software industry. As explained earlier, random assignment can be regarded as a method of experimental control because it is assumed that, over large numbers of subjects, uncontrolled factors are distributed evenly over the treatment conditions [95]. Random assignment of subjects to groups is the defining contrast between true experiments and quasi-experiments [20] or observational studies [15]. While it is difficult to observe cause and effect relationships in an observational study, quasi-experiments attempt to preserve as many of the properties of true experimentation as possible given the constraints of the industrial research setting [90].4
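The power advantage of repeated measures noted above can be made concrete with a standard statistical identity (our own illustration, not taken from this chapter). For two measurements $X_1$ and $X_2$ taken on the same subject, the within-subject difference has variance

$$\mathrm{Var}(X_1 - X_2) = \sigma_1^2 + \sigma_2^2 - 2\rho\,\sigma_1\sigma_2 ,$$

so a positive correlation $\rho$ between the measurements, which is typical when subjects serve as their own control, shrinks the error variance against which a treatment effect is tested and thereby increases statistical power.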
3. Quasi-Experimentation

The major purpose of this chapter is to outline and discuss quasi-experiments as a means for empirical research in software engineering. We elaborate upon their definition and explain the various threats to validity in the context of this study type. Then, we explain two quasi-experimental designs that are particularly promising for empirical research in software engineering.
4 Campbell and Stanley point out that even in well-controlled true experiments, there are often nonrandom nuisance variables inherent to the experimental design that cannot be controlled [11]. Empirical support for this position can be found in [45].
Definition

In the previous section, we explained that controlled experiments use randomized treatment conditions to make causal inferences. The basic idea is to use an unbiased procedure to allocate sampling units to treatment conditions so that the groups differ only with respect to the treatment of interest. However, randomization is not always possible, especially in today's software industry. There, a researcher often faces the situation of a predetermined group that was not formed randomly. In these cases, quasi-experimental approaches are beneficial. The term quasi means "resembling", and the type of study we now discuss is called a quasi-experiment because in some ways it resembles a true experiment. The term "quasi-experiment" has been used in various ways, but its rise to prominence in experimentation originates with a very influential chapter by Campbell and Stanley [11]. For them, a quasi-experiment is a study involving an experimental design but where random assignment to treatment groups is impossible. Campbell and Stanley's main contribution has been to show the value and usefulness of several quasi-experimental designs for research. More generally, they have encouraged a flexible approach to the design and interpretation of empirical studies, in which the results of a study interact with the threats to validity present in those studies and the extent to which particular threats can be plausibly eliminated in particular studies. In software engineering empirical research, quasi-experimental approaches have considerable attraction for those seeking to maintain a basic experimental stance in their work. The position taken by most textbook writers on the topic is that quasi-experiments are a second-best choice: a fall-back to consider when it is impossible to randomize allocation. Cook and Campbell [20], however, prefer to stress the relative advantages and disadvantages of true and quasi-experiments, and are cautious about advocating randomized experiments even when they are feasible. They advocate possible design options without necessarily assuming the superiority of a randomized design. They state that if a randomized design is chosen, then it should be planned in such a way as to be interpretable as a quasi-experimental design, just in case something goes wrong with the randomized design, as it may well do in industrial settings.
Threats to Validity

The major difference between true experiments and quasi-experiments has to do with internal validity [11]. When subjects are randomly assigned to treatment groups, each group is similarly constituted (no selection or maturation problems); each experiences the same testing conditions and research instruments (no testing or instrumentation problems); there is no deliberate selection of high and low performers on any test, except under conditions where respondents are first matched according to pretest scores and are then randomly assigned to treatment conditions (no statistical regression problem); each group experiences the same global pattern of history (no history problem); and if there are treatment-related differences in who drops out of the experiment, this is interpretable as a consequence of the treatment and is not due to selection. Thus randomization takes care of most, but not all, of the threats to internal validity. With quasi-experimental groups, the situation is quite different. Instead of relying on randomization to rule out most internal validity threats, the researcher has to make them all explicit and then rule them out one by one. His or her task is, therefore, more laborious. It is also less enviable, since his or her final causal inference will not be as strong as if a true experiment had been conducted.

Examples of Useful Quasi-Experimental Designs

Some quasi-experimental designs are essentially defined as not being true experimental designs. They include some that are frequently used in industrial research but do not allow the investigation of cause and effect relationships. Hence we first present a brief explanation of them, since they should be avoided in software engineering empirical research. Then, we elaborate upon the designs that we consider promising for establishing cause and effect relationships. The designs that do not allow cause and effect relationships to be studied are the one-group post-test only design, the one-group pre-test post-test design, and the post-test only non-equivalent groups design. These designs are outlined in Figure 1. We use a notational system in which X stands for a treatment and O stands for an observation; subscripts 1 through n refer to the sequential order of implementing treatments or of recording observations within a quasi-experimental run. The dashed line indicates that the groups were not formed randomly.
One-group post-test only design:             X O1
One-group pre-test post-test design:         O1 X O2
Post-test only non-equivalent groups design:
    X O1
    - - - - -
       O2
Fig. 1. Experimental designs to be avoided in software engineering research.

The one-group post-test only design is sometimes called a case study and involves observations on subjects or teams who have undergone a treatment. There are neither pretest observations of the subjects or teams nor a control group of subjects or teams who did not receive the treatment. Hence, no threats to validity can be ruled out. Campbell and Stanley point out that while this design may be useful for suggesting causal hypotheses, it is inadequate for testing them [11]. In some cases, it may be possible to compare the results to a well-established baseline to draw conclusions. However, since this cannot be taken for granted, one should rather be careful with this design. The one-group pre-test post-test design involves a pretest observation on a single group of subjects or teams. This design becomes uninterpretable in settings where the internal threats of history, maturation, and regression are plausible. Unfortunately, these settings can often be found in software organizations. The post-test only non-equivalent groups design adds to the case study a non-equivalent control group which does not receive the treatment. A difference between the two groups can be attributed either to the treatment effect or to selection differences. The latter renders the design uninterpretable. In the following we discuss the design which we consider the most practical for software engineering investigations: the replicated counterbalanced repeated measures design. This type of design is sometimes called a cross-over, randomized-block, or Latin Square design. The details
of the design vary somewhat, but the basic idea is that one subject or group is tested in one sequence of conditions while another subject or group is tested in a different sequence. This is illustrated in Figure 2.
Group 1:   X1 O1 X2 O2
Group 2:   X2 O1 X1 O2
Fig. 2. Replicated counterbalanced repeated measures design.

In analyzing the results, the major interest is in comparing the mean score under treatment X1 with the mean score under treatment X2, regardless of the sequence; in other words, the data for the two groups on comparable tests are combined. This means that if there is a practice effect of some kind, it is not eliminated, but simply averaged out. This design assumes that the practice effects in the different sequences used are approximately the same. The improvement in the O2 score due to taking treatment X1 first is assumed to be about equal to the improvement in the O2 score due to taking treatment X2 first. One of the advantages of the counterbalanced design is that each subject or team serves as its own control. Since the performance of an individual or a team on two different tasks, or on the same task repeated, tends to be highly correlated, the size of the standard errors in tests of significance is reduced and thus it is easier to detect small effects. From this point of view, the design is more sensitive than a random group design. The extent of the limitations of the method depends on how much the order of testing affects the results, and how much one position in the sequence affects, or interacts with, later positions in the sequence. If, for example, there is a large effect related to the order of testing (e.g., going from one technology to the other), this tends to greatly increase the variability of the results and decreases the sensitivity of any test of significance used. This loss in sensitivity may offset any advantage gained by using each subject or team as its own control.
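The claim that practice effects are averaged out can be made explicit with a small worked example (our own illustration, under the simplifying assumption of purely additive effects). Suppose each observation is the sum of a treatment effect $\tau_1$ or $\tau_2$ and a period effect $\pi_1$ (first test) or $\pi_2$ (second test). Combining the two groups gives

$$\bar{X}_1 = \tfrac{1}{2}\big[(\tau_1+\pi_1)+(\tau_1+\pi_2)\big], \qquad \bar{X}_2 = \tfrac{1}{2}\big[(\tau_2+\pi_1)+(\tau_2+\pi_2)\big],$$

so the difference of the combined means, $\bar{X}_1-\bar{X}_2=\tau_1-\tau_2$, is free of the period (practice) effect, provided the practice effect is indeed about the same in both sequences, which is exactly the assumption stated above.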
4. An Example of a Quasi-Experiment in Industry

In this section, we describe an example of a quasi-experiment that was performed in an industrial context. The quasi-experiment investigates the
effect of a systematic reading technique (i.e., perspective-based reading) on the cost-effectiveness of inspections. The investigation involves a comparison of perspective-based reading (PBR) for defect detection in code documents with the more traditional checklist-based reading (CBR) approach. The comparison was performed in a series of three studies, a quasi-experiment and two internal replications, with a total of 60 professional software developers at Bosch Telecom GmbH. A full description of the experiment can be found in [63].

Context of the Study

A software inspection usually consists of several activities, including planning, defect detection, defect collection, and defect correction [61].5 Inspection planning is performed by an organiser who schedules all subsequent inspection activities. The defect detection and defect collection activities can be performed either by inspectors individually or in a group meeting. Recent empirical findings reveal that the synergy effect of inspection meetings is rather low in terms of impact on the defects detected [64][93][51]. Therefore, defect detection can be considered an individual rather than a group activity. Defect collection, on the other hand, is often performed in a team meeting (i.e., an inspection meeting) led by an inspection moderator. The main goals of the team meeting are to consolidate the defects the inspectors detected individually, to eliminate false positives, and to document the real defects. An inspection often ends with the correction of the detected defects by the author. Although each of these activities is important for a successful inspection, the key part of an inspection is the defect detection activity. Throughout this activity, inspectors read software documents and check whether they satisfy quality requirements, such as correctness, consistency, testability, or maintainability. Each deviation is considered a defect. Because of its importance, adequate support for inspectors during defect detection can potentially result in dramatic improvements in inspection effectiveness and efficiency. The particular type of support that we focus on in this chapter is the reading technique that is used during the defect detection activity. Under the assumption that the defect detection activity is the most important one, the value of the inspection meeting needs to be determined.
In this article, we model the inspection process in terms of its main activities. This allows us to be independent from a specific inspection implementation, such as the Fagan [30] or the Gilb and Graham one [37].
Hence, when investigating the effects of different inspection techniques, it is not sufficient to examine the overall effectiveness but to look at the detection and collection phases separately. Implementation of Reading Techniques at Bosch Telecom GmbH Checklist-based Reading The CBR approach attempts to increase inspection effectiveness and decrease the cost per defect by focusing the attention of inspectors on a defined set of questions. In our study, we provided the inspectors with a generic checklist. We limited the number of checklist items to 27 questions to fit on one page since this is recommended in the literature [13]. We structured the checklist according to the schema that is presented in [13]. The schema consists of two components: "Where to look" and "How to detect". The first component is a list of potential "problem spots" that may appear in the work product and the second component is a list of hints on how to identify a defect in the case of each problem spot. As problem spots we considered data usage (i.e., data declaration and data referencing), computation, comparison, control flow, interface, and memory. For each problem spot, we derived the checklist items from existing checklists such as [73], and books about the C Programming language [25] [54]. Hence, the problem spots as well as the questions help reveal defects typical in the context of software development with the C programming language. Fig. 1 presents an excerpt from the checklist.
Item no. | Where to look    | How to detect
1        | Data declaration | Are all variables declared before being used?
2        | Data declaration | If variables are not declared in the code module, is it ensured that these variables are global ones?
3        | Data declaration | Are the types of the variables correct?
4        | Data referencing | Are variables referenced that have no value (i.e., which have not been initialised)?
5        | Data referencing | Are the indices of arrays within the specified boundaries?
6        | Data referencing | Are all constants used correctly?

Fig. 1. Example questions of the checklist.
Perspective-Based Reading The basic idea behind the PBR approach is to inspect an artefact from the perspectives of its individual stakeholders. Since different stakeholders are interested in different quality factors or see the same quality factor quite differently [68], a software artefact needs to be inspected from each stakeholder's viewpoint. The basic goal of inspectors applying the PBR technique is therefore to examine the various documents of a software product from the perspectives of the product's various stakeholders for the purpose of identifying defects. For probing the documents from a particular perspective, the PBR technique provides guidance for an inspector in the form of a PBR scenario on how to read and examine the document. A scenario is an algorithmic guideline on how inspectors ought to proceed while reading the documentation of a software product, such as the code of a function. As depicted in Fig. 2, a scenario consists of three major sections: introduction, instructions, and questions. This scenario structure is similar to the one described in the existing formulation of PBR for requirements documents [3].
[Figure: a PBR scenario consists of (1) an introduction explaining the stakeholder's interests in the artefact, (2) instructions on extracting the information relevant for examination, and (3) questions answered while following the instructions.]
Fig. 2. Content and structure of a PBR scenario. The introduction describes a stakeholder's interest in the product and may explain the quality requirements most relevant for this perspective. The instructions describe what kind of documents an inspector is to use, how to read the documents, and how to extract the appropriate information from them. While identifying, reading, and extracting information, inspectors may already detect some defects. However, an inspector is to
follow the instructions for three reasons. First, instructions help an inspector decompose large documents into smaller parts. This is crucial because people cannot easily understand large documents. Understanding involves the assignment of meaning to a particular document or parts of it and is a necessary prerequisite for detecting more subtle defects, which are often the expensive ones to remove if detected in later development phases. In cognitive science, this process of understanding is often characterised as the construction of a mental model that represents the objects and semantic relations in a document [25]. Second, the instructions require an inspector to actively work with the documents. This ensures that an inspector is well prepared for the following inspection activities, such as the inspection meeting. Finally, the attention of an inspector is focused on the set of information that is relevant for one particular stakeholder. The particular focus avoids the swamping of inspectors with unnecessary details. Once an inspector has achieved an understanding of the artefact, he or she can examine and judge whether the artefact as described fulfils the required quality factors. For making this judgement, a set of questions focus the attention of an inspector to specific aspects of the artefact, which can be competently answered because of the attained understanding. Perspective-based Reading of Code Modules In the context of our study, the products that are inspected are functions of a software system. The description of a function, that is, the physical inspected document, consists of its implementation in the C-programming language as well as of an operational description in the specification document. Two perspectives were identified for the inspection of code documents at Bosch Telecom GmbH: A code analyst perspective and a tester perspective. In short, the code analyst's main interests are whether the code implements the right functionality while the tester checks whether the functionality was implemented right. For each perspective, a scenario was developed. An inspector reading the code document from a code analyst perspective identifies the different functions in the code module and extracts for each function a description of the functionality. This description can be compared to the specification and deviations are considered potential candidates for defects. For the extraction, he or she uses an abstraction procedure that is similar to the one suggested in Harlan Mill's reading technique "Reading by Stepwise Abstraction" [65].
An inspector reading the code module from the perspective of a tester identifies the different functions and tries to set up test cases with which he or she can ensure the correct behaviour of each function. Then, the inspector is supposed to mentally simulate each function using the test cases as input values and to compare the resulting output with the specification. Any deviation pinpoints potential defects. Apart from the instructions that describe each activity in more detail, each scenario includes some questions that help focus the attention of inspectors on specific issues. In contrast to a checklist, the number of questions is limited and can be answered based on the results of the activities. An inspection team consists of inspectors, each of whom has read the document from a different angle. The two-perspective approach represents a minimal set of viewpoints with which we try to achieve high defect coverage. If some defects remain undetected we may include a third inspector who reads a code module from a different perspective. A primary candidate is the perspective of a maintainer. This derives from the fact that Votta reports that most of the issues found in a code inspection are so-called soft maintenance issues [94]. Although these defects do not affect the functional behaviour of the software, their correction helps prevent code decay, which pays off later on.

Measurement

Dependent Variables

In this quasi-experiment we investigated four dependent variables: team defect detection effectiveness and the cost per defect with three different definitions. Team defect detection effectiveness refers to the number of defects reported by a two-person inspection team (without defects found in the meeting). As the different code modules included a different number of defects, we had to normalise the detected number of defects. We did this by dividing the number of detected defects by the total number of known defects. Hence, the dependent variable "team defect detection effectiveness" can be defined in the following manner:

Team defect detection effectiveness = Defects found by a two-person team / Total number of defects in the code module   (1)
The cost per defect for teams is defined in three different ways, depending on which phases of the inspection process are taken into account. The first definition relates the defect detection cost to the number of defects found by a two-person inspection team. The second relates the meeting cost to the number of defects found by a two-person inspection team. The third relates the sum of the defect detection cost and the meeting cost to the number of defects found by a two-person inspection team. Hence, the three instances of the dependent variable "cost per defect for teams" can be defined in the following manner:

Cost per defect for the defect detection phase = Defect detection effort of two subjects / Defects found by a two-person team (without meeting gains)   (2)

Cost per defect for the meeting phase = Meeting effort of two subjects / Defects found by a two-person team (without meeting gains)   (3)

Cost per defect for the overall inspection = (Detection effort + Meeting effort) / Defects found by a two-person team (without meeting gains)   (4)
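A minimal sketch of how definitions (1) to (4) could be computed is given below; the effort and defect figures are hypothetical and are not taken from the study.

```python
# Sketch of the dependent-variable definitions (1)-(4) on hypothetical numbers.
def team_effectiveness(defects_found_by_team, total_defects_in_module):
    # (1) normalised team defect detection effectiveness
    return defects_found_by_team / total_defects_in_module

def cost_per_defect(effort_minutes, defects_found_by_team):
    # Used for (2) detection effort, (3) meeting effort and
    # (4) detection + meeting effort; meeting gains are excluded upstream.
    return effort_minutes / defects_found_by_team

# Hypothetical team: two inspectors, 6 of 8 seeded defects found, no meeting gains
detection_effort = 120 + 140     # minutes spent by the two inspectors individually
meeting_effort = 60              # minutes of the joint inspection meeting
defects_found = 6

print(team_effectiveness(defects_found, 8))                               # (1) -> 0.75
print(cost_per_defect(detection_effort, defects_found))                   # (2)
print(cost_per_defect(meeting_effort, defects_found))                     # (3)
print(cost_per_defect(detection_effort + meeting_effort, defects_found))  # (4)
```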
Below we address some issues related to the counting of the number of defects found: Effects of Seeding Defects. We asked either the author or a very experienced software developer (if the author participated in a run6) to inject defects into the code modules. Hence, one person injected defects for the code modules that were used in either the quasi-experiment or one of its replications. These defects should be typical of the ones that are usually detected in testing and should not be detectable automatically by compilers or other tools, such as lint. There might be some bias, since one person may insert defects that are easier or more difficult to detect than the ones inserted by another person. This bias may explain differences in the defect detection effectiveness and the cost per defect across the quasi-experiment and its replications. However, we only evaluate the difference between the two reading techniques within a study. This bias, therefore, does not have an impact on the individual study results. Furthermore, the tests of
6
Authors participated in a run in two cases. For these two modules, the authors had not worked on them for at least one year. Furthermore, they did not know what defects were injected. Therefore, the authors' inclusion in the study would incur minimal contamination, if any.
homogeneity that were performed (see below) indicate that there were no differences in team performance across the three studies.

Defect Reporting. In some cases the subjects reported more defects on their defect report forms than were seeded in a code module. When a true defect was reported that was not on the list of seeded defects, we added this defect to the list of known defects and reanalysed the defect report forms of all the remaining subjects. Whether a defect was a true defect was our decision (to some extent based on discussions with the author or the person who seeded the defects).

Meeting Gains and Losses. For the evaluation of our hypotheses, data was collected after applying each reading technique and after the team meetings. The data collected after the team meetings may be distorted by meeting gains and losses, which are independent of the reading technique used. We found very few meeting gains and meeting losses. We excluded meeting gains from the team results. The meeting losses were sufficiently minor that this issue was ignored in the analysis.

Independent Variables

We controlled two independent variables in the quasi-experiment and its replications:
1. Reading technique (CBR versus PBR)
2. Order of reading (CBR -> PBR versus PBR -> CBR).

Before we explain the expected effects when comparing PBR with CBR in the form of hypotheses, we define the following notation (we use d(a|b) to denote d(a) or d(b), respectively).
Term         | Definition
ei(CBR|PBR)  | The individual defect detection effort for inspector i when using (CBR|PBR).
m(CBR|PBR)   | The meeting effort after the individual inspectors used (CBR|PBR).
D(CBR|PBR)   | The total number of unique defects found by two inspectors using (CBR|PBR).

Table 1. Notation.
We use this terminology to define our expectations.

The Effectiveness of PBR is Larger than the Effectiveness of CBR for Teams. Given the anticipated defect detection benefits of PBR, we would expect the following to hold within a given experiment:

D(CBR) < D(PBR)   (5)
Defect Detection Cost for Teams is Lower with PBR than with CBR. The PBR technique requires an inspector to perform certain activities based on the content of the inspected document and to actively work with the document instead of passively scanning through it. Moreover, individual inspectors have to document some intermediate results, such as a call graph, a description of the function, or some test cases. It is therefore expected that an inspector will spend at least as much effort for defect detection using PBR as using CBR:

ei(PBR) >= ei(CBR)   (6)

The effect that we anticipate is:

D(PBR) / (e1(PBR) + e2(PBR)) > D(CBR) / (e1(CBR) + e2(CBR))   (7)

which states that the increase in detected defects must be larger than the additional effort spent for defect detection using PBR. In this case, the total cost to detect a defect using CBR will be higher than for PBR.
which states that the increase in detected defects must be larger than the additional effort spent for defect detection using PBR. In this case, the total cost to detect a defect using CBR will be higher than for PBR. Meeting Cost for Teams is Lower with PBR than with CBR. Performing these extra activities during PBR is expected to result in a better understanding of the document. Therefore, during the meeting, inspectors do not have to spend a lot of extra effort in explaining the defects that they found to their counterpart in the team. Furthermore, it will take less effort to resolve false positives due to the enhanced
understanding. The better understanding is expected to translate into an overall reduction in meeting effort for PBR compared with CBR7:

m(CBR) > m(PBR)   (8)

It follows from (5) and (8):

m(CBR) / D(CBR) > m(PBR) / D(PBR)   (9)
Therefore, one would hypothesise that the meeting cost per defect for PBR will be less than for CBR.

Overall Inspection Cost is Lower with PBR than with CBR. Following from the above two arguments about the cost per defect for defect detection and for the meeting, we would expect the overall cost per defect for both phases to be smaller for PBR than for CBR.

Based on the above explanation of the expected effects, below we state our four null hypotheses:

H01: An inspection team is as effective or more effective using CBR than it is using PBR.
H02: An inspection team using PBR finds defects at the same or higher cost per defect than a team using CBR for the defect detection phase of the inspection.
H03: An inspection team using PBR finds defects at the same or higher cost per defect than a team using CBR for the meeting phase of the inspection.
H04: An inspection team using PBR finds defects at the same or higher cost per defect than a team using CBR for all phases of the inspection.
7
One can argue that more time could be taken during the meeting for an inspector to understand the other perspective and understanding aspects of the code not demanded by his/her own perspective. This could result in a greater meeting time for PBR than for CBR. However, in such a case one would expect inspectors to share the abstractions that they have constructed during their individual reading, making it easier for an inspector to comprehend the reasoning behind the other perspective's defects. For CBR such abstractions are not available.
Research Method

Experimental Design

Our study was conducted as part of a large training course on software inspection within Bosch Telecom GmbH. We ran the initial quasi-experiment, and then replicated it twice (we refer to each one of these as a "run"). In each run we had 20 subjects, giving a total of 60 subjects who took part in the original quasi-experiment and its two replications. The original quasi-experiment and each of its two replications took 6 days; therefore, the full study lasted 18 days. In this section we describe the rationale behind the quasi-experimental design that we have employed, and the alternatives considered. In particular, since we make causal interpretations of the results, we discuss the potential weaknesses of a quasi-experiment and how we have addressed them. As a prelude to the discussion of the experimental design, we emphasise that experimental design is, according to Hays [42], a problem in economics. Each choice that one makes for a design has its price. For example, the more treatments, subjects, and hypotheses one considers, the more costly an experiment is likely to be. This is particularly true for field experiments. Therefore, the objective is to choose a design that minimises the threats to validity within the prevailing cost constraints.

Description of the Environment

Bosch Telecom GmbH is a major player in the telecommunication market and develops high-quality telecommunication systems (e.g., modern transmission systems based on SDH technology, access networks, switching systems) containing embedded software. One major task of the embedded software is the management of these systems. In the transmission systems and access networks this comprises alarm management, configuration management, performance management, on-line software download, and on-line database down- and up-load. There are four typical characteristics of this kind of software. First, it is event triggered. Second, there are real-time requirements. Third, the software must be highly reliable, which basically means that the software system must be available 24 hours a day. Finally, the developed software system must be tailorable to different hardware configurations. Because of these characteristics and the ever-increasing competition in the telecommunications market, high product quality is one of the most crucial demands for software development projects at Bosch Telecom GmbH.
Although reviews are performed at various stages of the development process to ensure that the quality goals are met, a large percentage of defects in the software system are actually found throughout the integration testing phase. A more detailed analysis of these defects revealed that many (typically one half) have their origin in the implemented code. Since defects detected in the integration testing phase are expensive (estimate: 5000 DM per detected and fixed defect), several departments at Bosch Telecom GmbH decided to introduce code inspections to detect these defects earlier and, thus, save detection as well as correction cost.8 Part of the effort to introduce code inspections was a code inspection training course for the software developers. We organised this as a quasi-experiment and two internal replications. This quasi-experimental organisation is advantageous for both participants and researchers. The participants not only learnt how software inspection works in theory but also practised it using a checklist as well as a PBR scenario for defect detection. In fact, the practical exercises offered the developers the possibility to convince themselves that software inspection helps improve the quality of code artefacts and, therefore, is beneficial in the context of their own software development projects. In a sense, these exercises help overcome the "not applicable here (NAH)" syndrome [50] often observed in software development organisations, and alleviated many objections against software inspections on the developers' behalf. From a researcher's perspective, the training effort offered the possibility to collect inspection data in a quasi-controlled manner.

8 In this environment, code inspections are expected to reduce detection and correction costs for the following reasons. First, the integration test requires considerable effort just to set up the test environment. A particular test run consumes several hours or days. Once a failure is observed this process is stopped and the setup needs to be repeated after the correction. So, if some defects are removed beforehand, the effort for some of these cycles is saved (defect detection effort). Second, once a failure is observed it usually consumes considerable effort to locate or isolate the defect that led to the failure. In a code inspection, a defect can be located immediately. Hence, code inspection helps save correction cost (assuming that the effort for isolating defects is part of the defect correction effort).

Subjects

All subjects in our study were professional software developers of a particular business unit at Bosch Telecom GmbH. Each developer within this unit was to be trained in inspection and, thus, participated in this study. In order to capture their experiences we used a debriefing questionnaire. We captured the subjects' experience in the C-Programming language and in the
application domain on a six-point scale as the most prevalent types of experience that may impact a subject's performance. Fig. 3 shows boxplots of subjects' C-programming and application domain experiences.
[Figure: box plots of the subjects' self-rated experience on the six-point scale from "very inexperienced" to "very experienced" for C-Programming and for the Application Domain; the boxes show the min-max range, the 25%-75% range, and the median value.]
Fig. 3. Subjects' experience with the C-programming language and the application domain.

We found that subjects perceived themselves as experienced with respect to the programming language (median of 5 on the six-point scale) and rather experienced regarding software development in the application domain (median of 4 on the six-point scale). We can consider our subject pool a representative sample of the population of professional software developers. Because of cost and company constraints, many empirical studies on software inspection were performed with students, that is, novices, as subjects. Although these studies provide valuable insights, they are limited in the sense that the findings cannot be easily generalised to a broader context. As Curtis points out [22], generalisations of empirical results can only be made if the study population consists of professional software engineers rather than novices, and it has been forcefully emphasised that the characteristics of professional engineers differ substantially from those of student subjects [23]. The few existing results in the context of empirical software engineering support this statement since
differences between professional software developers and novices were found to be qualitative as well as quantitative [98]. This means that expert and novice software developers have different problem-solving processes, causing experts not only to perform tasks better, but also to perform them in a different manner than novices. Hence, in the absence of a body of studies that find the same results with expert and novice subjects, such as the one presented by Porter and Votta [78], generalisations between the two groups are questionable and more studies with professional software developers as subjects are required.

The 2x2 Factorial Counterbalanced Repeated-Measures Design

Our final experimental design is depicted in Table 2. For the original quasi-experiment, the subjects were split into two groups. The first group (Group 1) performed a reading exercise using PBR first, and then measures were collected (O1). Subsequently they performed a reading exercise using CBR, and again measures were collected (O2). The second group (Group 2) performed the treatments the other way round. The vertical sequence indicates elapsed time during the whole study. Therefore, the first replication was executed with groups 3 and 4 after groups 1 and 2, and the second replication was run with groups 5 and 6 after groups 3 and 4.
Original Quasi-Experiment
Group 1 | X_PBR / Module 1 | O1 | X_CBR / Module 2 | O2
Group 2 | X_CBR / Module 1 | O3 | X_PBR / Module 2 | O4

First Replication
Group 3 | X_PBR / Module 3 | O1 | X_CBR / Module 4 | O2
Group 4 | X_CBR / Module 3 | O3 | X_PBR / Module 4 | O4

Second Replication
Group 5 | X_PBR / Module 5 | O1 | X_CBR / Module 6 | O2
Group 6 | X_CBR / Module 5 | O3 | X_PBR / Module 6 | O4

Table 2. Design of the quasi-experiment and its two replications.
Below we discuss a number of issues related to this design and its execution: Different Code Modules Within a Run. Given that each group performs two reading exercises using different reading techniques, it is not possible for a group to read the same code modules twice, otherwise they would remember the defects that they found during the first reading, hence invalidating the results of the second reading. Therefore, a group reads different code modules each time. Within a run, a module was used once with each reading technique so as not to confound the reading technique completely with the module that is used. Different Code Modules Across Runs. During each of the quasiexperiment and its two replications different pairs of code modules were used. Since there was considerable elapsed time between each run, it was thought that past subjects may communicate with prospective subjects about the modules that were part of the training. By using different code modules, this problem is considerably alleviated. Confounding. In this design the interaction between the module and the reading technique is confounded with the group main effect. This means that the interaction cannot be estimated separately from the group effect. However, it has been argued in the past that this interaction is not important because all modules come from the same application domain, which is the application domain that the subjects work in [3] [4]. Effect of Intact Groups. As noted earlier, the groups in our study were intact (i.e., they were not formed randomly). The particular situation where there is an inability to assign subjects to groups randomly in a counterbalanced design was discussed by Campbell and Stanley [9]. In this design there is the danger that the group interaction with say practice effects confounds the treatment effect. However, if we only interpret a significant treatment effect as meaningful if it is not due to one group, then such a confounding would have to occur on different occasions in all groups in turn, which is a highly unlikely scenario [9]. Furthermore, it is noted that if there are sufficient intact groups and if they are assigned to the sequences at random, then the quasi-experiment would become a true experiment [9], which is also echoed by Spector [90]. This last point would argue for pooling the original quasi-experiment and its replications into one large experiment. However, it is known that pooling data from different studies
can mask even strong effects [89][99], making it much preferable to combine the results through a meta-analysis [58]. A meta-analysis allows explicit testing of the homogeneity of the groups. Homogeneity refers to the question of whether the different groups share a common effect size. If homogeneity is ensured, one can combine the results to come up with an overall conclusion. Otherwise, the results must be treated separately from each other. An explicit test for homogeneity is conducted during our analysis to check whether effect size estimates exhibit greater variability than would be expected if their corresponding effect size parameters were identical [44].

Replication to Alleviate Low Power. Ideally, researchers should perform a power analysis9 before conducting a study to ensure that their experimental design will find a statistically significant effect if one exists. However, in our case, such an a priori power analysis was difficult because the effect size is unknown. As mentioned earlier, there have been no previous studies that compared CBR with PBR for code documents, and therefore it was not possible to use prior effect sizes as a basis for a power analysis. We therefore defined a medium effect size, that is, an effect size of at least 0.5 [17],10 as a reasonable expectation given the claimed potential benefits of PBR. Moreover, we set an alpha level of α = 0.1. Usually, the commonly accepted practice is to set α = 0.05.11

9 The power of a statistical test of a null hypothesis is the probability that it will lead to the rejection of the null hypothesis, i.e., the probability that it will result in the conclusion that the investigated phenomenon exists [17]. Statistical power analysis exploits the relationship among the following four variables involved in statistical inference:
• Power: the statistical power of a significance test is the probability of rejecting a false null hypothesis.
• Significance level α: the risk of mistakenly rejecting the null hypothesis and thus the risk of committing a Type I error.
• Sample size: the number of subjects/teams participating in the experiment.
• Effect size: the discrepancy between the null hypothesis and the alternative hypothesis. According to Cohen [17], effect size means "the degree to which the phenomenon is present in the population", or "the degree to which the null hypothesis is false". In a sense, the effect size is an indicator of the strength of the relationship under study. It takes the value zero when the null hypothesis is true and some other nonzero value when the null hypothesis is false. In our context, for example, an effect size of 0.5 between the defect detection effectiveness of CBR inspectors and PBR inspectors would indicate a difference of 0.5 standard deviation units. Ideally, in planning a study, power analysis can be used to select a sample size that will ensure a specified degree of power to detect an effect of a particular size at some specified alpha level.

10 In the allied discipline of MIS, Cohen's guidelines for interpreting effect sizes [17] have also been suggested in the context of meta-analysis [48].
However, controlling both the Type I error (α) and the Type II error (β) requires either rather large effect sizes or rather large sample sizes. This represents a dilemma in a software engineering context, since much treatment effectiveness research in this area involves relatively modest effect sizes and, in general, small sample sizes. As pointed out in [66], if neither effect size nor sample size can be increased to maintain a low risk of error, the only remaining strategy (other than abandoning the research altogether) is to permit a higher risk of error. This explains why we used a more relaxed alpha level for our studies. With the anticipation of 10x2 inspector teams and α = 0.1, t-test power curves [57][66] for a one-tailed significance test12 indicated that the experimental design has about a 0.3 probability of detecting, at least, a medium effect size. This was deemed to be a small probability of rejecting the null hypotheses if they were false (Cohen [17] recommends a value of 0.8). While this power level was not based on observed effect sizes, it already indicated potential problems in doing a single quasi-experiment without replications. After the performance of the quasi-experiment, we found for several hypotheses that the difference between CBR and PBR was not statistically significant. One potential reason for insignificant findings is low power. Using the obtained effect size from the quasi-experiment, an a posteriori power analysis was performed for all four hypotheses. Table 3 presents the a posteriori power levels.
                 | H1   | H2   | H3   | H4
Quasi-Experiment | >0.9 | 0.49 | >0.9 | 0.69
1st Replication  | 0.51 | 0.71 | 0.84 | 0.81
2nd Replication  | 0.84 | 0.28 | 0.76 | 0.49

Table 3. A posteriori power analysis results.
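The following sketch shows how such an a priori power calculation for a paired t-test could be reproduced today, assuming the statsmodels package is available; the authors used published power curves, so the exact figure they obtained (about 0.3) need not coincide with the value this code returns.

```python
# Sketch of an a priori power calculation for a paired (matched) t-test,
# using the parameter values named in the text: 10 teams per study,
# alpha = 0.1, one-sided test, medium effect size d = 0.5.
from statsmodels.stats.power import TTestPower

analysis = TTestPower()
power = analysis.power(effect_size=0.5, nobs=10, alpha=0.1,
                       alternative='larger')
print(f"a priori power for d=0.5, n=10, alpha=0.1 (one-sided): {power:.2f}")

# The same object can be used the other way round, e.g. to ask how many
# teams would be needed to reach Cohen's recommended power of 0.8:
n_needed = analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.1,
                                alternative='larger')
print(f"teams needed for power 0.8: {n_needed:.1f}")
```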
11 It is conventional to use an α level of 0.05. Cowles and Davis [21] trace this convention to the turn of the century, but credit Fisher for adopting this and setting the trend. However, some authors note that the choice of an α level is arbitrary [46][36], and it is clear that null hypothesis testing at fixed alpha levels is controversial [19][92]. Moreover, it is noted that typical statistical texts provide critical value tables for α = 0.1 [87][88], indicating that this choice of α level is appropriate in some instances. As we explain in the text, our choice of α level was driven by power considerations.

12 We used one-sided hypothesis tests since we seek a directional difference between the two reading techniques.
It is seen that low power was a potentially strong contributor to not finding statistical significance, making it difficult to interpret these results.13 One possibility to tackle the problem of low power is to replicate an empirical study and merge the results of the studies using meta-analysis techniques. Meta-analysis techniques have been primarily designed to combine results from a series of studies, each of which had insufficient statistical power to reliably accept or reject the null hypothesis. This is the approach we have adopted and hence the prevalent reason for performing the two replications.

Replications to Increase Generalisability of Results. Replication of experimental studies provides a basis for confirming the results of the original experiment [24]. However, replications can also be useful for generalising results. A framework that distinguishes between close and differentiated replications has been suggested to explain the benefits of replication in terms of generalising results [79]. Our two replications can be considered close replications since they were performed by the same investigators, using the same design and reading artefacts, under the same conditions, within the same organisation, and during the same period of time. However, there were also some differences that facilitate generalisation of the results. First, the subjects were different. Therefore, if consistent results are obtained in the replications, we can claim that the results hold across subjects at Bosch Telecom GmbH. Second, the modules were varied. Again, if consistent results are obtained, then we can claim that they are applicable across different code modules at Bosch Telecom GmbH. By varying these two elements in the replications, one attempts to find out if the same results occur despite these differences [79].14

13 Not finding a statistically significant result using a low power study does not actually tell us very much because the inability to reject the null hypothesis may be due to the low power. A high power study that does not reject the null hypothesis has more credibility in its findings.

14 As noted in [79], however, this entails a gamble since, if the results do differ in the replications, one would not know which of the two elements that were changed is the culprit.

Process Conformance. It is plausible that subjects do not perform the CBR and PBR reading techniques but revert to the usual technique that they use in everyday practice. This may occur, for example, if they are faced with familiar documents (i.e., documents from their own application domain
within their own organisation) [4]. This is particularly an issue given that the subjects are professional software engineers who do have everyday practices. As alluded to earlier, with PBR it is possible to check this explicitly by examining the intermediate artefacts that are turned in. We did, and determined that the subjects did perform PBR as defined. For CBR, it will be seen in the post-hoc analysis section of the paper that more subjects found CBR easier to use for defect detection than their current reading technique (ad-hoc); actually more than double. Given such a discrepancy, it is unlikely that the subjects will revert to a technique that is harder to use when offered CBR. Therefore, we expect process conformance when using CBR to also be high.15 It has also been suggested that subjects, when faced with time pressure, may revert to using techniques that they are more familiar with rather than make the effort of using a new technique [4]. During the conduct of the quasi-experiment and its replications, there were no time limits. That is, the subjects were given as much time as required for defect detection using each of the reading techniques. Therefore, this would not have affected the application of the reading techniques. Furthermore, when conducting analyses using defect detection cost, time limits would not invalidate conclusions drawn since there was no artificial ceiling on effort. Feedback Between Treatments. The subjects did not receive feedback about how well they were performing after the first treatment. This alleviates some of the problems that may be introduced due to learning [4]. If subjects do not receive feedback then they are less likely to continue applying the practices they learned during the first treatment. However, we found it very important to provide feedback at the end of each run. Trainer Effects. For the quasi-experiment and its two replications, the same trainer was used. This trainer had given the same course before to multiple organisations, and therefore the quasi-experiment was not the first time that this course was given by the same person. Consequently, we expect no
15
In [4] it is suggested that subjects could be asked to cross-reference the defects that they find with the specific steps of the reading technique. This would allow more detailed checking of process conformance. However, this would have added to the effort of the subjects, and it is unknown in what ways this additional activity would affect the effort when using CBR and PBR. If such an effect does occur differently for CBR and PBR, then it would contaminate the effort data that we collected.
differences across the studies due to different trainers, nor due to the trainer improving dramatically in delivering the material.16

An Alternative Design

Researchers in empirical software engineering have to decide at some point on the experimental design. Of course, many factors impact this decision. One is the availability of subjects and subject groups. In educational settings, it frequently occurs that treatments are assigned to intact classes, where the unit of analysis is the individual student (or teams) within the classes. This situation is akin to our current study, whereby we have intact groups. In such situations one can explicitly take into account the fact that subjects are nested within groups and employ a hierarchical experimental design. This allows the researcher to pool all the data into one larger experiment [90]. One can proceed by first testing if there are significant differences amongst groups. If there is group equivalence within treatments, then one can analyse the data as if they were not nested. However, if there were group differences, the unit of analysis would have to be the group and the analysis performed on that basis. This creates a difficulty in that reverting to a group unit of analysis would substantially reduce the degrees of freedom in the analysis. Hence, it would be harder to find any statistically significant effects even if such differences existed. Given the potential for this considerable methodological disadvantage, we opted a priori not to pool the results and rather to perform a meta-analysis.

Experimental Materials

In each group the subjects inspected two different code modules. Apart from a code module, an inspector received a specification document, which can be regarded as a type of requirements document for the expected functionality and which is considered defect free. All code modules were part of running software systems of Bosch Telecom GmbH and, thus, can be considered almost defect free. Table 4 shows the size of each code module in Lines of Code (without blank lines), the average cyclomatic complexity using McCabe's complexity measure [67], and the number of defects we considered in the analysis.
16 Of course, an experienced trainer is still expected to improve as more courses are given; this is inevitable. However, the impact would be minimal.
              | Size (LOC) | Average Cycl. Complexity | Number of defects
Code Module 1 | 375        | 6.67                     | 8
Code Module 2 | 627        | 11.00                    | 9
Code Module 3 | 666        | 3.20                     | 10
Code Module 4 | 915        | 4.38                     | 16
Code Module 5 | 375        | 3.29                     | 11
Code Module 6 | 627        | 5.44                     | 11

Table 4. Characteristics of code modules.
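As a side note, McCabe's measure for a single function can be computed from its control-flow graph as V(G) = E - N + 2P; the sketch below is purely illustrative and is not the tool that produced the values in Table 4.

```python
# Illustrative only: McCabe's cyclomatic complexity V(G) = E - N + 2P for a
# control-flow graph with E edges, N nodes and P connected components
# (P = 1 for a single function).
def cyclomatic_complexity(edges, nodes, components=1):
    return edges - nodes + 2 * components

# Toy control-flow graph of a function with one if/else and one loop:
# 7 nodes, 8 edges  ->  V(G) = 8 - 7 + 2 = 3
print(cyclomatic_complexity(edges=8, nodes=7))
```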
Execution

The quasi-experiment and its two replications were performed between March and July 1998. Each consisted of two sessions; each session lasted 2.5 days and was conducted in the following manner (see Table 5). On the first day of each session, we did an intensive exercise introducing the principles of software inspection. Depending on the reading order, we then explained in detail either CBR or PBR. This explanation covered the theory behind the different reading approaches as well as how to apply them in the context of a code module from Bosch Telecom GmbH. Then, the subjects used the explained reading approach for individual defect detection in a code module. Regarding the PBR technique, the subjects either used the code analyst scenario or the tester scenario, but not both. While inspecting a code module, the subjects were asked to log all detected defects on a defect report form. After the reading exercise we asked the subjects to fill out a debriefing questionnaire. The same procedure was used on the second day for the reading approach not applied on the first day. After the reading exercises, we described how to perform inspection meetings. To provide the participants with more insight into an inspection meeting, we randomly assigned two subjects to an inspection team and let them perform two inspection meetings (one for each code module) in which they could discuss the defects found in the two reading exercises. Of course, we ensured that within each of these teams one participant read a code module from the perspective of a code analyst and one read the code module from the perspective of a tester. The inspection team was asked to log all defects upon which both agreed. We then did an initial analysis by checking all
defects that were either reported from individual subjects or from the teams against the known defect list, and presented the results on the third day (half-day). We consider the presentation of initial results a very important issue because, first, it gives experimenters the chance to get quick feedback on the results and, second, the participants have the possibility to see and interpret their own data, which motivates further data collection. At the end of each training session, each participant was given a debriefing questionnaire in which we asked the participant about the effectiveness and efficiency of inspections and the subjects' opinion about the training session.
        | Day 1 Morning                                           | Day 1 Afternoon                       | Day 2 Morning                                          | Day 2 Afternoon                          | Day 3 Morning
Group x | Introduction of Inspection Principles; PBR Explanation | Defect Detection with PBR (Module A) | CBR Explanation; Defect Detection with CBR (Module B) | Team Meetings for Module A and Module B | Feedback on Results
Group y | Introduction of Inspection Principles; CBR Explanation | Defect Detection with CBR (Module A) | PBR Explanation; Defect Detection with PBR (Module B) | Team Meetings for Module A and Module B | Feedback on Results
Table 5. Execution of the quasi-experiment and its replications.

Data Analysis Methods
Analysis Strategy

To understand how the different treatments affected individual and team results across the quasi-experiment and its replications, we started the data analysis by calculating some descriptive statistics of the individual and team results. We continued by testing the stated hypotheses using a t-test for repeated measures, i.e., a matched-pair t-test [1]. The t-test allowed us to investigate whether a difference in the defect detection effectiveness or the cost per defect ratio is due to chance.
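A minimal sketch of this analysis strategy on invented team scores, assuming scipy is available, is shown below; it runs the Shapiro-Wilk check, the matched-pair t-test, and the Wilcoxon signed-ranks test mentioned in the text.

```python
# Sketch of the paired analysis on made-up effectiveness scores for nine teams.
from scipy import stats

cbr = [0.50, 0.62, 0.55, 0.71, 0.48, 0.60, 0.58, 0.66, 0.52]   # hypothetical
pbr = [0.70, 0.78, 0.64, 0.80, 0.69, 0.75, 0.72, 0.81, 0.68]   # hypothetical
diffs = [p - c for p, c in zip(pbr, cbr)]

# Normality of the paired differences (Shapiro-Wilk W test)
print("Shapiro-Wilk:", stats.shapiro(diffs))

# Matched-pair t-test; scipy reports a two-sided p-value, so halve it for the
# one-sided (directional) hypothesis PBR > CBR when the t statistic is positive.
t_stat, p_two_sided = stats.ttest_rel(pbr, cbr)
print("paired t:", t_stat, "one-sided p:", p_two_sided / 2)

# Non-parametric counterpart: Wilcoxon signed-ranks test
print("Wilcoxon:", stats.wilcoxon(pbr, cbr, alternative="greater"))
```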
Although the t-test seems to be quite robust against violations of certain assumptions (i.e., the normality and homogeneity of the data), we also performed the Wilcoxon signed-ranks test [88], which is the non-parametric counterpart of the matched-pair t-test. The Wilcoxon signed-ranks test corroborated the findings of the t-test in all cases; hence, we do not present the detailed results of this test. Before starting the analysis, we performed the Shapiro-Wilks' W test of normality, which is the preferred test of normality because of its good power properties [91]. In all cases we could not reject the null hypothesis that our data were normally distributed. Cohen [16] provides ample empirical evidence that a non-parametric test should be substituted for a parametric test only under conditions of extreme assumption violation, and that such violations rarely occur in behavioural or psychological research. We ran the statistical tests for the quasi-experiment and the two replications separately.

Our experimental design permits the possibility of carry-over effects. Grizzle [41] points out that when there are carry-over effects from treatment 1 to treatment 2, it is impossible to estimate or test the significance of any change in performance from Period 1 to Period 2 over and beyond the carry-over effects. In that situation, the only legitimate follow-up test is an assessment of the differences between the effects of the two treatments using Period 1 data only. Hence, it is recommended that investigators first test for carry-over effects. Only if this test is not significant, even at a very relaxed level, is a further analysis of the total data set appropriate. In that case, all the obtained data may properly be used in separate tests of significance for treatments.

One goal of any empirical work is to produce a single reliable conclusion, which is, at first glance, difficult if the results of several studies are divergent or not statistically significant in each case. Close replication, and replication in general, therefore raises questions concerning how to combine the obtained results with each other and with the results of the original study. Meta-analysis techniques can be used to merge study results. Meta-analysis refers to the statistical analysis of a collection of analysis results from individual studies for the purpose of integrating the findings. Although meta-analysis allows the combination and aggregation of scientific results, there has been some criticism about its use [38]. The main critical points include diversity, design, publication bias, and dependence of studies. Diversity refers to the fact that logical conclusions cannot be drawn by comparing and aggregating empirical results that include different measuring techniques, definitions of variables, and subjects because they are too dissimilar. Design means that results of a meta-analysis cannot be interpreted because results from "poorly" designed studies are included
along with results from "good" designed studies. Publication bias refers to the fact that published research is biased in favour of significant findings because non-significant findings are rarely published. This in turn leads to biased meta-analysis results. Finally, dependence of the studies means that meta-analyses are conducted on large data sets in which multiple results are derived from the same study. Diversity, design, and publication bias do not play a role in our case. However, we need to discuss the issue of study dependency in more detail. It has been remarked that experiments using common experimental materials exhibit strong inter-correlations regarding the variables involving the materials. Such correlations result in non-independent studies [70]. In general, Rosenthal [82] discusses issues of non-independence in meta-analysis for studies performed by the same laboratory or research group, which is our case since we have two internal replications. He presents an example from research on interpersonal expectancy effects demonstrating that combining all studies across laboratories and combining studies where the unit of analysis is the laboratory results in negligible differences. Hence, in practice, combining studies from the same laboratory or research group is not perceived to be problematic. Another legitimate question is whether it is appropriate to perform a meta-analysis with a small meta-sample (in our case a meta-sample of three studies). It has been noted that there is nothing that precludes the application of meta-analytic techniques to a small meta-sample [58]: "Meta-analyses can be done with as few as two studies or with as many studies as are located. In general, the procedures are the same [...] having only a few studies should not be of great concern". For instance, Kramer and Rosenthal [58] report on a meta-analysis of only two studies evaluating the efficacy of a vaccination against SIV in monkeys using data from Cohen [18]. In the realm of software engineering, meta-analyses have also tended to have small meta-samples. For example, Hayes [43] performs a meta-analysis of five experiments evaluating DBR techniques in the context of software inspections. Miller [70] performs a meta-analysis of four experiments comparing defect detection techniques.

Comparing and Combining Results using Meta-Analysis Techniques

Meta-analysis is a set of statistical procedures designed to accumulate empirical results across studies that address the same or a related set of research questions [97]. As pointed out by Rosenthal [82], there are two major ways to merge and subsequently evaluate empirical findings in meta-analysis:
in terms of their statistical significance (e.g., p-levels) and in terms of their effect sizes (e.g., the difference between means divided by the common standard deviation). One reason for the importance of the effect size is that many statistical tests, such as the t-test, can be broken down mathematically into the following two components [84]:

Significance test = Effect size x Size of study

This relationship reveals that the result of a significance test, such as the t-test, is determined by the effect size and the size of the study. Many researchers, however, only make decisions based on whether the result of applying a particular test is statistically significant. They often do not take the effect size or the size of the study into consideration. This almost exclusive reliance of researchers on the results of null hypothesis significance testing alone has been heavily criticised in other disciplines, such as psychology and the social sciences [19][83][86]. We therefore explicitly take the effect size into consideration during the meta-analysis.

Two major meta-analytic processes can be applied to the set of studies to be evaluated: comparing and combining [83]. When studies are compared as to their significance levels or their effect sizes, we want to know whether they differ significantly among themselves with respect to significance levels or effect sizes, respectively. This is referred to as homogeneity. When studies are combined, we want to know how to estimate the overall level of significance and the average effect size, respectively. In most cases, researchers performing meta-analysis first compare the studies to determine their homogeneity. This is particularly important in a software engineering context since, there, empirical studies are often heterogeneous [7][69]. Once it is shown that the studies are, in fact, homogeneous, the researchers continue with a combination of results. Otherwise, they look for reasons that cause variations. According to this procedure, we first compared and combined p-values and, second, compared and combined effect sizes of the team results. We limited the meta-analysis to the team results since these are of our primary interest. The meta-analytic approach is described in more detail in [63].

Results

In this section we present the detailed results. Note that all t-values and effect sizes have been calculated to be consistent with the direction PBR - CBR.
Defect Detection Effectiveness

Fig. 4 shows boxplots of the teams' defect detection effectiveness.
[Figure: three box plots (quasi-experiment, 1st replication, 2nd replication) of team defect detection effectiveness by reading technique (PBR, CBR); the plots show the mean, mean ± SE, and mean ± SD for each technique.]
Fig. 4. Box-plots of the team defect detection effectiveness.
In the quasi-experiment, one member of one team dropped out of the study. Therefore we have only nine teams for the analysis.
The boxplots show that the inspection teams detected on average between 58% and 78% of the defects in a code module. These percentages are in line with the ones reported in the literature [37]. Teams using PBR for defect detection had a slightly higher team defect detection effectiveness than the same teams using CBR. In addition to the effectiveness difference, the boxplots illustrate that PBR teams exhibit less variability than CBR teams. The lower variability for PBR may be explained by the fact that the more prescriptive approach for defect detection to some extent removes the effects of human factors on the results. Hence, all the PBR teams achieved similar scores. However, before providing further explanation of these results, we need to check whether the difference between CBR and PBR is due to chance. We first assessed whether there is a carry-over effect for the team effectiveness. The results indicate no carry-over effect. Therefore we can proceed with the analysis of the data from the two periods. We investigated whether the difference among teams is due to chance. Table 6 presents the summary of the results of the matched-pair t-test for the team defect detection effectiveness.
                 | t-value | df | p-value (one-sided)
Quasi-Experiment | 3.09    | 8  | 0.007
1st Replication  | 1.37    | 9  | 0.10
2nd Replication  | 2.39    | 9  | 0.02
Table 6. t-Test results of the team defect detection effectiveness.

Taking an alpha level of α = 0.1, we can reject H01 for the quasi-experiment and the 2nd replication. We cannot reject hypothesis H01 for the 1st replication. The findings suggest a treatment effect in two out of three cases, which is not due to chance. At this point we need to decide whether there is an overall treatment effect across studies. Therefore, we performed a meta-analysis as described previously to compare and combine the results. The test for homogeneity for p-values results in p = 0.7. Hence, we cannot reject the null hypothesis that our p-values are homogeneous. This means that we can combine the p-values of the three studies according to Fisher's procedure in the following manner:
P = -2 x Σ (i = 1..k) ln p_i = 22.14
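For illustration, the same combination can be sketched with scipy's implementation of Fisher's procedure; because the published p-values are rounded, the statistic comes out near, but not exactly at, 22.14, and scipy refers it to a chi-square distribution with 2k degrees of freedom, so its combined p-value may differ from the figure quoted in the text.

```python
# Sketch of Fisher's procedure for combining the three one-sided p-values
# reported in Table 6; assumes scipy is installed.
from math import log
from scipy.stats import combine_pvalues

p_values = [0.007, 0.10, 0.02]   # quasi-experiment, 1st and 2nd replication

# The Fisher statistic itself: -2 * sum of the natural logs of the p-values
chi2_stat = -2 * sum(log(p) for p in p_values)
print("Fisher statistic:", round(chi2_stat, 2))

# scipy evaluates the statistic against chi-square with 2k degrees of freedom
stat, p_combined = combine_pvalues(p_values, method="fisher")
print("combined p-value:", p_combined)
```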
Based on the χ²-distribution, this value of P results in a p-value of p = 0.000016. Hence we can reject the null hypothesis and conclude that the resulting combination of the quasi-experiment and its replications revealed that a team using the PBR technique for defect detection had a significantly higher defect detection effectiveness than the team using CBR. We continued the analysis by looking at the effect sizes. Table 7 reveals the effect sizes of the three studies.
                 | S_pooled | Hedges g | d
Quasi-Experiment | 0.16     | 0.79     | 1.46
1st Replication  | 0.21     | 0.42     | 0.81
2nd Replication  | 0.14     | 1.06     | 0.97
Table 7. Effect sizes of the team defect detection effectiveness. Table 7 shows that the 1st replication has the lowest effect size, which explains why the results of the test were not statistically significant. To compare and combine the effect sizes, we first checked the effect size homogeneity by calculating Q. The calculated value of Q is Q = 0.91 which leads to a p-value of p=0.63. Hence, we cannot reject the null hypothesis that the effect sizes are homogeneous. The combination of the effect sizes reveals a mean effect size value of 1.08. This represents a large effect size, i.e., one would really become aware of a difference in the team defect detection effectiveness between the use of PBR and CBR. Based on our findings, we therefore can reject hypothesis Hoi. For three of the modules in our study all defects were detected using PBR. For the other three all defects except one were detected. No discernible pattern in terms of the types of defects that were not detected could be identified. Cost per Defect for the Defect Detection Phase Before looking at the team results, we first investigated how much effort each subject consumed for defect detection using either of the techniques and whether there is a difference. Table 8 depicts the average effort in
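The effect size calculations can be sketched in the same way. The snippet below assumes plain Python, uses the independent-groups formulas of Hedges and Olkin [44] for simplicity (the repeated-measures design of this study calls for their within-subject variants), and combines the Hedges' g values from Table 7 with illustrative variances based on an assumed number of teams per study; it is meant to show the mechanics, not to reproduce the reported mean effect size exactly.

```python
from statistics import mean, stdev

def hedges_g(treatment, control):
    """Standardized mean difference (treatment - control) over the pooled SD."""
    n1, n2 = len(treatment), len(control)
    s1, s2 = stdev(treatment), stdev(control)
    s_pooled = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (mean(treatment) - mean(control)) / s_pooled

def var_g(g, n1, n2):
    """Large-sample approximation of the sampling variance of Hedges' g."""
    return (n1 + n2) / (n1 * n2) + g**2 / (2 * (n1 + n2))

def combine(gs, variances):
    """Inverse-variance weighted mean effect size and homogeneity statistic Q."""
    w = [1.0 / v for v in variances]
    g_bar = sum(wi * gi for wi, gi in zip(w, gs)) / sum(w)
    Q = sum(wi * (gi - g_bar) ** 2 for wi, gi in zip(w, gs))  # chi-square, k-1 df
    return g_bar, Q

# hedges_g() would be applied to the raw team scores, which are not reproduced here;
# instead the g values from Table 7 are combined with assumed team counts.
gs = [0.79, 0.42, 1.06]
variances = [var_g(g, 9, 9) for g in gs]     # team counts assumed for illustration
g_bar, Q = combine(gs, variances)
print(f"mean effect size = {g_bar:.2f}, Q = {Q:.2f}")
```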
Cost per Defect for the Defect Detection Phase

Before looking at the team results, we first investigated how much effort each subject consumed for defect detection using either of the two techniques and whether there is a difference. Table 8 shows the average effort in minutes that a subject spent on defect detection using either CBR or PBR, together with the results of the matched-pair t-test.

                       CBR     PBR    p-value (one-sided)
Quasi-Experiment      115.2   145.6        0.0003
1st Replication       118.8   132.5        0.16
2nd Replication       168     150.5        0.13
Table 8. Average effort per subject for the defect detection phase.

We found that in the quasi-experiment and the 1st replication, the subjects consumed less effort for CBR than for PBR; however, only in the first case was the difference statistically significant. This finding seems to indicate that, where there is a significant difference, PBR requires more effort on the part of the inspector. The question is whether the extra effort is justified in terms of more defects detected and documented in the team meeting. Fig. 5 depicts boxplots of the cost per defect ratio for the two inspectors on an inspection team. Fig. 5 reveals that, on average, an inspection team consumed between 28 and 60.5 minutes per defect. The average cost per defect of PBR teams is consistently lower than that of CBR teams, and there is also less variability in the cost per defect ratio of PBR teams. Based on these findings, the extra effort for PBR, if any, seems to be justified, because PBR teams have a better cost per defect ratio than CBR teams. A formal test for carry-over effects was conducted, and none were identified.

                      t-value   df   p-value (one-sided)
Quasi-Experiment       -1.33     8        0.11
1st Replication        -1.93     9        0.04
2nd Replication        -0.75     9        0.23
Table 9. t-Test results of the cost per defect ratio for the defect detection phase.
Fig. 5. Box-plots of the cost per defect ratio for the defect detection phase.

Table 9 shows the t-test results for the cost per defect during the defect detection phase. Taking α = 0.1, we can reject H02 for the 1st replication. We cannot reject H02 for the quasi-experiment and the 2nd replication. To compare and combine the results, we first performed the homogeneity check for the p-values. The resulting statistic, with 2 degrees of freedom, corresponds to a p-value of p = 0.77. Hence, we cannot reject the hypothesis that our p-values are homogeneous. This finding allowed us to combine the p-values of the three studies. Calculating the combination of the p-values according to Fisher's procedure results in:
P = −2 × Σ_{i=1..k} ln(p_i) = 13.57
Based on the χ²-distribution, this value of P results in a p-value of p = 0.001. Hence we can reject hypothesis H02 and conclude that the combination of the quasi-experiment and its replications reveals that a team using the PBR technique for defect detection had a significantly lower cost per defect during defect detection than the same team using CBR. We continued the analysis by looking at the effect sizes. Table 10 shows the effect sizes of the three studies.
                      S pooled   Hedges g      d
Quasi-Experiment       24.03      -0.50      -0.72
1st Replication        22.59      -0.87      -0.78
2nd Replication        12.59      -0.33      -0.30
Table 10. Effect sizes of the cost per defect ratio for the defect detection phase.

Table 10 shows that the quasi-experiment and the 2nd replication have the lowest effect sizes, which explains why the results of these tests were not statistically significant. To compare and combine the effect sizes, we first checked the effect size homogeneity by calculating Q. The calculated value of Q = 0.64 leads to a p-value of p = 0.73. Hence, we cannot reject the hypothesis that the effect sizes are homogeneous. The combination of the effect sizes gives a mean effect size value of 0.6. Considering our effect size threshold of 0.5, we can conclude that we have, in fact, found an effect of practical significance. We therefore can reject hypothesis H02.

Cost per Defect for the Meeting Phase

We subsequently consider the cost per defect when the meeting phase is accounted for. Fig. 6 shows the boxplots for the quasi-experiment and its replications. Fig. 6 reveals that the average cost per defect ratio of PBR was lower than that of CBR when only the effort of the meeting phase is considered. Although there seems to be less variability for the 1st replication, there does not seem to be as much difference in variability for the quasi-experiment and the 2nd replication.
Overall, this result indicates that the meeting cost per defect is higher for CBR than for PBR.
l"* ReplicaAion
11
..*!...
n=10
a M
a
1* & 3
m\
1
;
• 1
D
£ s
I •
2
ItSld.D« • iStd.Eff. o Urn
ifitd.Oen ifitd.Ert. o Mean flsj*l 2
Technittue
Replication
X3 • CBR
PBH
IiSld.Dm. • iSld.En. °
UMn
RudHig Technique
Fig. 6. Box-plots of the cost per defect for the meeting phase.

The results of a formal test for carry-over effects did not indicate any carry-over effects.
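The carry-over checks mentioned here and earlier in the analysis can be carried out with one common procedure for the two-period cross-over design (cf. Grizzle [41]): each team's outcomes are summed over both periods, and the sums are compared between the two sequence groups with a two-sample t-test. The sketch below assumes Python with scipy and uses hypothetical team values, since the raw data are not reproduced in this chapter.

```python
# Hypothetical sketch of a carry-over test for the AB/BA cross-over design.
from scipy import stats  # assumed available

# Sum of the two period outcomes for each team, split by sequence group.
cbr_then_pbr = [55.0, 61.2, 48.9, 70.3, 66.1]   # hypothetical values
pbr_then_cbr = [58.4, 63.0, 52.7, 68.8, 60.5]   # hypothetical values

t, p = stats.ttest_ind(cbr_then_pbr, pbr_then_cbr)
print(f"carry-over test: t = {t:.2f}, two-sided p = {p:.3f}")  # large p: no evidence of carry-over
```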
                      t-value   df   p-value (one-sided)
Quasi-Experiment       -5.20     8        0.0004
1st Replication        -2.05     9        0.035
2nd Replication        -2.55     9        0.016
Table 11. t-Test results of the cost per defect ratio for the meeting phase.
Taking α = 0.1, we can reject H03 for all three studies. The findings suggest a treatment effect, not due to chance, in all three cases. For the comparison of the p-values we calculated the homogeneity statistic Σ_{j=1..k} (z_j − z̄)² = 1.396. This value with 2 degrees of freedom results in a p-value of p = 0.50. Hence, we cannot reject the null hypothesis that our p-values are homogeneous. Calculating the combination of the p-values according to Fisher's procedure results in:

P = −2 × Σ_{i=1..k} ln(p_i) = 30.59

Based on the χ²-distribution, this value of P results in a p-value of p < 0.00001. Hence we can reject hypothesis H03 and conclude that the combination of the quasi-experiment and its replications reveals that a team using PBR for defect detection had a significantly lower cost per defect for the meeting phase than the same team using CBR. We continued the analysis by looking at the effect sizes. Table 12 shows the effect sizes of the three studies.
                      S pooled   Hedges g      d
Quasi-Experiment        1.80      -1.64      -2.17
1st Replication         2.99      -0.85      -1.06
2nd Replication         1.65      -0.87      -0.86
Table 12. Effect sizes of the cost per defect ratio for the meeting phase.

Table 12 shows that the 1st replication has the lowest effect size; however, the effect size is still large enough for the test result to be statistically significant. To compare and combine the effect sizes, we first checked the effect size homogeneity by calculating Q. The calculated value of Q = 3.23 leads to a p-value of p = 0.20. Hence, we cannot reject the null hypothesis that the effect sizes are homogeneous. The combination of the effect sizes gives a mean effect size value of 1.36. This represents a large effect size considering our effect size threshold of 0.5. It was postulated that the increased effort, and thus the higher cost per defect ratio, for subjects using PBR leads to an increased understanding of the documents. We therefore investigated the subjects' perceptions of how well they understood the documents using both reading techniques. The debriefing
questionnaire contained a question asking the subjects how well they understood the inspected code artefact using either CBR or PBR. Fig. 7 presents a histogram of the results of this question from the quasi-experiment and its replications. There is a clear trend confirming our expectation that using a PBR scenario for defect detection improves a subject's understanding of the inspected code artefact.
Fig. 7. Histogram of subjects' understanding of the inspected code modules.

The above results tell a consistent story. Using PBR scenarios requires individual subjects to spend more effort on defect detection. Although this results in a higher checking rate, the cost per defect ratio of subjects using PBR for defect detection is actually lower than for subjects using CBR. Moreover, the higher preparation effort, together with the procedural support, leads to an increased understanding of the inspected code module. This, in turn, is expected to produce inspectors who can easily explain the defects they found to their counterpart on the inspection team. Furthermore, resolving false positives takes less effort because of this enhanced understanding of the document. Both effects lead to a lower cost per defect for PBR teams compared with CBR teams, which is the result that we obtained above. We therefore can reject hypothesis H03.

Cost per Defect for the Overall Inspection

We now consider the cost per defect results for the inspection process as a whole. Fig. 8 shows boxplots of the overall cost per defect.
Fig. 8. Box-plots of the overall cost per defect.

When looking at the overall cost per defect, the boxplots are consistent with the ones presented previously. PBR has a lower cost per defect ratio and less variability than CBR, and this result is consistent across all three studies. Hence, the extra effort that individual subjects spent using a PBR scenario is justified, because the inspection team reports more defects for that effort. We also observed less variability in the cost per defect when a team used a scenario rather than a checklist for defect detection, which may be due to the reduced influence of individual characteristics. A test for carry-over effects was performed, and the results did not indicate any. Table 13 presents the results of the matched-pair t-test for the team cost per defect for the overall inspection process.
                      t-value   df   p-value (one-sided)
Quasi-Experiment       -1.86     8        0.05
1st Replication        -2.35     9        0.02
2nd Replication        -1.32     9        0.11
Table 13. t-Test results of the cost per defect for the overall inspection.

Taking α = 0.1, we can reject H04 for the quasi-experiment and the 1st replication. We cannot reject H04 for the 2nd replication. We then performed a meta-analysis as described previously and calculated the homogeneity statistic Σ_{j=1..k} (z_j − z̄)² = 0.32. This value with 2 degrees of freedom results in p = 0.85. Calculating the combination of the p-values according to Fisher's procedure results in:

P = −2 × Σ_{i=1..k} ln(p_i) = 18.08

Based on the χ²-distribution, this value of P results in a p-value of p = 0.00012, which is statistically significant at the α = 0.1 level. We can therefore reject hypothesis H04 when looking at all three studies. The pooled result of the quasi-experiment and its replications reveals that a team using the PBR technique for defect detection had a significantly lower cost per defect ratio than the same team using CBR. Table 14 shows the effect sizes of the three studies. The 2nd replication has the lowest effect size value, which explains why the result of its statistical test turned out not to be statistically significant.
                      S pooled   Hedges g      d
Quasi-Experiment       26.83      -0.67      -1.01
1st Replication        24.88      -0.97      -0.97
2nd Replication        13.81      -0.51      -0.54
Table 14. Effect sizes of the cost per defect ratio for the overall inspection.
To compare and combine the effect sizes, we first checked the effect size homogeneity by calculating Q. The calculated value of Q = 0.64 leads to a p-value of p = 0.73. Hence, we cannot reject the hypothesis that the effect sizes are homogeneous. This result allows us to combine the effect sizes. The combination of the effect sizes gives a value of 0.84. Considering our effect size threshold of 0.5, we can conclude that we have, in fact, found an effect of practical significance. Given that the results for each of the individual phases, defect detection and meeting, point in the same direction, it is not surprising that the cost per defect for PBR is lower than for CBR for the inspection process as a whole. We can therefore reject hypothesis H04.

Post-Hoc Analysis

We performed a post-hoc analysis to evaluate the subjects' perceptions of the ease of use of CBR and PBR. The rationale was to determine whether subjects are likely to revert to using PBR when they apply CBR after PBR. If they find CBR easier to use, this supports the argument that there is a reasonable amount of process conformance in the PBR→CBR ordering. In a debriefing questionnaire, which subjects completed after each of the original quasi-experiment and the two replications, we asked them the following question: Which technique is the easiest one to use for defect detection? The three response categories were: PBR, CBR, and their everyday-practice reading technique. We pooled the answers of all three studies and found that 57% (34/60) selected CBR, only 18% (11/60) selected PBR, and 22% (13/60) selected their everyday-practice reading technique. This provides some assurance of process conformance, as discussed earlier. However, even if there was contamination in the form of subjects who were supposed to be using CBR actually using PBR in the PBR→CBR ordering, that would be expected to improve the results of the CBR subjects. We found that PBR is better than CBR on all of our dependent variables. Therefore, if such contamination existed, it was not sufficient to affect our results; in fact, we could then consider our results an underestimate of the beneficial impact of PBR compared with CBR.

Sample Size Requirements for Future Studies

The planning of future studies that compare CBR with PBR can benefit from the estimates of effect size that we obtained. For repeated measures
designs, we used the obtained mean effect sizes and the average correlation coefficients to estimate the minimal number of teams necessary to attain a statistical power of 80% for one-tailed tests at an alpha level of 0.1 using the paired t-test. The estimates are summarised in Table 15 (see footnote 19).
                                        Mean Effect Size   Estimated Sample Size
Team Defect Detection Effectiveness           1.08                   16
Cost per Defect (Defect Detection)            0.6                    22
Cost per Defect (Meeting)                     1.36                    8
Cost per Defect (Overall Inspection)          0.84                   13
Table 15. Estimation of sample size (number of teams).
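The sample size estimates in Table 15 can be approximated with standard power analysis routines. The sketch below assumes Python with the statsmodels package; the correlation between a team's two scores, which is needed to convert the between-treatment effect size into a repeated-measures effect size, is not reported in this excerpt, so an illustrative value is assumed and the output will not match Table 15 exactly.

```python
# Sketch: number of teams for 80% power, one-sided paired t-test, alpha = 0.1.
from statsmodels.stats.power import TTestPower

mean_effect_size = 1.08    # team defect detection effectiveness (Table 15)
assumed_correlation = 0.5  # illustrative assumption; the study's correlations are not shown here
d_within = mean_effect_size / (2 * (1 - assumed_correlation)) ** 0.5

n_teams = TTestPower().solve_power(effect_size=d_within, alpha=0.1,
                                   power=0.8, alternative='larger')
print(f"teams required: {n_teams:.1f}")
```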
Threats to Validity

It is in the nature of any empirical study that assumptions are made which later restrict the validity of the results. Here, we list the assumptions that impose threats to the internal and external validity of our study.

Threats to Internal Validity

Experiments in general, and quasi-experiments in particular, suffer from the problem that some factors may affect the dependent variables without the experimenter's knowledge [8]. This is referred to as a threat to internal validity. Although threats to internal validity must be minimised, it is often not possible to exclude them completely. For this study, we identified a potential history effect [9] [52] that may represent a threat to internal validity and that was not addressed during the study.
19. These sample size estimates are for two-person inspection teams. It is plausible that the effect size will be larger if there are more than two inspectors. Therefore, a study planned with the above sample sizes but involving more than two inspectors should attain at least 80% power.
An experimenter cannot force subjects to apply a reading technique all of the time (see footnote 20). Inspectors start their defect detection activity by reading either the specification or the code documents and may already find defects during this first reading. Hence, it is plausible that a proportion of the reported defects is not directly attributable to the application of a particular reading technique, even if the subject applies it fully. However, as pointed out in [69], there is little possibility of quantifying this proportion.

Threats to External Validity

Threats to external validity limit our ability to generalise the results of an experimental study to a wider population under different conditions. We identified three threats to external validity for the current study:
• Single Organisation. Our study was performed with subjects and code documents from a single organisation. While this enjoys greater external validity than studies with students in a "laboratory" setting, it is uncertain to what extent the results can be generalised to other organisations.
• Inspection Process. In this study, we assume that defect detection is an individual rather than a group activity. However, other inspection processes in industry consider defect detection a group activity, such as the one presented in [30].
• Type of Inspected Documents. The code documents used in this study can be claimed to be representative of industrial code documents. However, we cannot generalise our findings to other types of documents, such as design or requirements documents. To attain such generalisations, it is necessary to replicate the current study under different conditions.

20. Recall that we checked that subjects followed PBR by verifying that they had produced the necessary abstractions and test cases. Also, we showed that a relatively large number of subjects found CBR easier to use than their everyday-practice reading technique and also easier to use than PBR. Therefore, we do not expect that subjects who are supposed to use CBR will revert to either of the other two, apparently more difficult, reading techniques. Furthermore, no performance feedback was provided after the initial treatment, which should dampen the potential for continuing to use the first reading technique. Nevertheless, despite the above, we cannot be sure that the subjects followed the specified techniques all of the time during their defect detection activity and that all detected defects were found because a reading technique was applied.

Industrial Impact

From a scientific perspective, the quasi-experiment allowed for a detailed empirical investigation of two defect detection approaches without adding much effort on the part of the company. There are also two main benefits from the perspective of the company. The first is that the employees now have the knowledge required for inspection participation; this was a major goal of the training effort, and it was clearly fulfilled. The second benefit is related to the in-process usage of the PBR technique. Once inspections had been performed for some time, an analysis of the inspection data revealed that about 40 percent of the inspectors voluntarily used the PBR approach for defect detection. The data, as well as the possibility of using the more systematic approach throughout the training, convinced them of the benefits of this reading approach. In this way, the quasi-experiment offered them a forum in which to use the technology at no risk and to gather their own experiences. Quasi-experimentation therefore provides an important, low-cost and low-risk vehicle for technology transfer. It allows the study of cause and effect relationships while maintaining high relevance for the organisation. This path is pursued at the Fraunhofer Institute for Experimental Software Engineering (www.iese.fhg.de), whose mission is to promote experimental software engineering [34]. There, the experimental approach represents one of the best methods for introducing engineering-style rigour into business practice. As shown in this chapter, this way of working provides customers with measurable facts about their development practices and enables informed decision making. Measurable facts, analysis, and continuous feedback of findings are the engine for goal-oriented continuous improvement and for risk-controlled innovation. Performing quasi-experiments in the context of training sessions represents one successful instance of this strategy. It is expected that others in the software engineering community will follow this example.
Conclusion

Quasi-experimentation is more a style of investigation than a slavish following of predetermined experimental designs. If a researcher's concern is to get at cause and effect relationships, and he or she is not in a position to conduct a controlled experiment, he or she ought to be able to carry out a quasi-experiment. If the quasi-experiment is carefully planned and designed, this type of study allows the researcher to counter many threats to validity that are likely to be problematic. And if one is not solely concerned with cause and effect relationships, then some of the problems of real-life experimentation, including difficulties of generalizability, might reinforce the decision to travel down the quasi-experimental road.

In this chapter, we have shed light on quasi-experimentation as an approach to increase the volume of empirical software engineering research in industrial settings. Our basic motivation was the difficulty associated with designing and running controlled experiments in software organizations. Since quasi-experiments help overcome many of the described obstacles, we strongly believe that this is the path to follow for addressing the stated problem. Even if a true experiment is planned, it can become a quasi-experiment because of mortality or confounding variables in the experimental treatment. Anyone familiar with the problems encountered in industrial settings must realize that, although randomization and true experimentation are ideal goals in research, they are not always attainable. This is particularly the case in software engineering. Some might argue that only true experiments yield valid causal inference and that it is always logically possible to contrive alternative explanations to any quasi-experimental design. But the fact that the internal validity of quasi-experiments is lower than that of true experiments does not argue against using the evidence that quasi-experiments provide.
References

[1] A. Aron and E. Aron. Statistics for Psychology. Prentice Hall, 1st edition, 1994.
[2] V. Basili, R. Selby, and D. Hutchens. Experimentation in Software Engineering. IEEE Transactions on Software Engineering, 12(7):733-743, 1986.
[3] V. Basili, S. Green, O. Laitenberger, F. Lanubile, F. Shull, S. Sorumgard, and M. Zelkowitz. The Empirical Investigation of Perspective-based Reading. Empirical Software Engineering, 2(1):133-164, 1996.
[4] V. Basili. Evolving and Packaging Reading Technologies. Journal of Systems and Software, 38(1), July 1997.
[5] V. Basili, F. Shull, and F. Lanubile. Using Experiments to Build a Body of Knowledge. Technical Report CS-TR-3983, University of Maryland, 1998.
[6] D. Bisant and J. Lyle. A Two-Person Inspection Method to Improve Programming Productivity. IEEE Transactions on Software Engineering, 15(10):1294-1304, October 1989.
[7] L. Briand, K. El Emam, T. Fussbroich, and O. Laitenberger. Using Simulation to Build Inspection Efficiency Benchmarks for Development Projects. Proceedings of the Twentieth International Conference on Software Engineering, pages 340-349. IEEE Computer Society Press, 1998.
[8] A. Brooks, J. Daly, J. Miller, M. Roper, and M. Wood. Replication of Experimental Results in Software Engineering. International Software Engineering Research Network (ISERN) Technical Report ISERN-96-10, University of Strathclyde, 1996.
[9] R. Brooks. Studying Programmer Behavior Experimentally: The Problems of Proper Methodology. Communications of the ACM, 23(4):207-213, April 1980.
D. Campbell. Factors Relevant to the Validity of Experiments in Social Settings. Psychological Bulletin, 54:297-312, 1957.
[10] D. Campbell. From Description to Experimentation: Interpreting Trends as Quasi-Experiments. In C. W. Harris, Problems in Measuring Change, University of Wisconsin Press, 1963.
[11] D. Campbell and J. Stanley. Experimental and Quasi-Experimental Designs for Research. Houghton Mifflin, Boston, 1966. ISBN 0-395-30787-2.
[12] B. Cheng and R. Jeffery. Comparing Inspection Strategies for Software Requirements Specifications. Proceedings of the 1996 Australian Software Engineering Conference, pages 203-211, 1996.
[13] Y. Chernak. A Statistical Approach to the Inspection Checklist Formal Synthesis and Improvement. IEEE Transactions on Software Engineering, 22(12):866-874, December 1996.
[14] D. A. Christenson, H. Steel, and A. Lamperez. Statistical Quality Control Applied to Code Inspections. IEEE Journal on Selected Areas in Communication, 8(2):196-200, February 1990.
[15] W. Cochran. Planning and Analysis of Observational Studies. John Wiley & Sons, 1983.
[16] J. Cohen. Some Statistical Issues in Psychological Research. In B. B. Woleman (ed.), Handbook of Clinical Psychology, McGraw-Hill, New York, NY, 1965.
[17] J. Cohen. Statistical Power Analysis for the Behavioural Sciences. Lawrence Erlbaum Associates, second edition, 1988.
[18] J. Cohen. A New Goal: Preventing Disease, Not Infection. Science, 262:1820-1821, 1993.
[19] J. Cohen. The Earth Is Round (p<.05). American Psychologist, 49:997-1003, 1994.
[20] T. Cook and D. Campbell. Quasi-Experimentation: Design and Analysis Issues for Field Settings. Rand McNally College Publishing Company, Chicago, 1979.
[21] M. Cowles and C. Davis. On the Origins of the .05 Level of Statistical Significance. American Psychologist, 37(5):553-558, 1982.
[22] B. Curtis. Measurement and Experimentation in Software Engineering. Proceedings of the IEEE, 68(9):1144-1157, September 1980.
[23] B. Curtis. By the Way, Did Anyone Study any Real Programmers? Empirical Studies of Programmers: First Workshop, pages 256-262. Ablex Publishing Corporation, 1986.
[24] J. Daly. Replication and a Multi-Method Approach to Software Engineering Research. PhD thesis, University of Strathclyde, 1996.
[25] H. Deitel and P. Deitel. C How to Program, 2nd edition. Prentice Hall, 1994.
[26] T. van Dijk and W. Kintsch. Strategies of Discourse Comprehension. Academic Press, Orlando, 1984.
[27] E. Doolan. Experience with Fagan's Inspection Method. Software: Practice and Experience, 22(2):173-182, 1992.
[28] M. Dyer. The Cleanroom Approach to Quality Software Development. John Wiley and Sons, Inc., 1992.
[29] E. S. Edgington. Randomization Tests. Dekker, 1980.
[30] M. Fagan. Design and Code Inspections to Reduce Errors in Program Development. IBM Systems Journal, 15(3):182-211, 1976.
[31] M. Fagan. Advances in Software Inspections. IEEE Transactions on Software Engineering, 12(7):744-751, July 1986.
[32] R. Fisher. Combining Independent Tests of Significance. American Statistician, 2(5), 1948.
[33] P. Fowler. In-process Inspections of Workproducts at AT&T. AT&T Technical Journal, 65(2):102-112, March 1986.
[34] Fraunhofer Institute for Experimental Software Engineering. Annual Report, available at: www.iese.fhg.de, 2001.
[35] P. Fusaro and F. Lanubile. A Replicated Experiment to Assess Requirements Inspection Techniques. Empirical Software Engineering, 2(1):39-57, 1997.
[36] J. Gibbons and J. Pratt. P-values: Interpretation and Methodology. The American Statistician, 29(1):20-25, 1975.
[37] T. Gilb and D. Graham. Software Inspection. Addison-Wesley Publishing Company, 1993.
[38] G. Glass, B. McGaw, and M. L. Smith. Meta-Analysis in Social Research. Sage Publications, 1981.
[39] M. Graden, P. Horsley, and T. Pingel. The Effects of Software Inspections on a Major Telecommunications Project. AT&T Technical Journal, 65(3):32-40, May/June 1986.
[40] A. Greenwald. Within-Subjects Designs: To Use or Not to Use? Psychological Bulletin, 83(2), September 1976.
[41] J. Grizzle. The Two-period Change-over Design and its Use in Clinical Trials. Biometrics, 21:314-320, 1965.
[42] W. Hays. Statistics. Harcourt Brace, 1994.
[43] W. Hayes. Research Synthesis in Software Engineering: A Case for Meta-Analysis. To appear in Proceedings of the International Symposium on Software Metrics, 1999.
[44] L. Hedges and I. Olkin. Statistical Methods for Meta-Analysis. Academic Press, 1985.
[45] D. T. Heinsman and W. R. Shadish. Assignment Methods in Experimentation: When Do Nonrandomized Experiments Approximate Answers From Randomized Experiments? Psychological Methods, 1(2):154-169, 1996.
[46] R. Henkel. Tests of Significance. Sage Publications, 1976.
[47] M. Hills and P. Armitage. The Two-period Cross-over Clinical Trial. British Journal of Clinical Pharmacology, 8:7-20, 1979.
[48] M. Hwang. The Use of Meta-Analysis in MIS Research: Promises and Problems. The DATA BASE for Advances in Information Systems, 27(3):35-48, 1996.
[49] J. Cohen. Applied Multiple Regression/Correlation Analysis for the Behavioural Sciences. Lawrence Erlbaum Associates, Inc., Publishers, 1983.
[50] P. Jalote and M. Haragopal. Overcoming the NAH Syndrome for Inspection Deployment. In Proceedings of the Twentieth International Conference on Software Engineering, pages 371-378. IEEE Computer Society Press, 1998.
[51] P. Johnson and D. Tjahjono. Does Every Inspection Really Need a Meeting? Empirical Software Engineering, 3:9-35, 1998.
[52] C. Judd, E. Smith, and L. Kidder. Research Methods in Social Relations. Holt, Rinehart and Winston, 6th edition, 1991.
[53] G. Keren. A Handbook for Data Analysis in the Behavioural Sciences: Methodological Issues, Chapter 19: Between- or Within-Subjects Design: A Methodological Dilemma. Lawrence Erlbaum Associates, 1993.
[54] B. Kernighan and D. Ritchie. Programming in C. Hanser Verlag, 1990.
[55] B. Kitchenham, S. Linkman, and D. Law. Critical Review of Quantitative Assessment. Software Engineering Journal, pages 43-53, March 1994.
[56] R. Kirk. Experimental Design: Procedures for the Behavioral Sciences. Brooks/Cole Publishing Company, 3rd edition, 1995.
[57] H. Kraemer and S. Thiemann. How Many Subjects. Sage Publications, 1987.
[58] S. Kramer and R. Rosenthal. Effect Sizes and Significance Levels in Small-Sample Research. In R. Hoyle (ed.), Statistical Strategies for Small Sample Research, Sage Publications, 1999.
[59] S. Kusumoto, A. Chimura, T. Kikuno, K. Ichi Matsumoto, and Y. Mohri. A Promising Approach to Two-Person Software Review in an Educational Environment. Journal of Systems and Software, 40:115-123, 1998.
[60] O. Laitenberger and C. Atkinson. Generalizing Perspective-based Inspection to Handle Object-Oriented Development Artefacts. Proceedings of the 21st International Conference on Software Engineering, Los Angeles, USA, 1999.
[61] O. Laitenberger and J.-M. DeBaud. An Encompassing Life-cycle Centric Survey of Software Inspection. Journal of Systems and Software, 50(1):5-31, 2000. Also published as International Software Engineering Research Network (ISERN) Technical Report ISERN-98-14, Fraunhofer Institute for Experimental Software Engineering, http://www.iese.fhg.de/ISERN/pub/isern_biblio_tech.html, 1998.
[62] O. Laitenberger and J.-M. DeBaud. Perspective-based Reading of Code Documents at Robert Bosch GmbH. Information and Software Technology, 39:781-791, March 1997.
[63] O. Laitenberger, K. El Emam, and T. Harbich. An Internally Replicated Quasi-Experimental Comparison of Checklist and Perspective-based Reading of Code Documents. IEEE Transactions on Software Engineering, 2000.
[64] L. Land, C. Sauer, and R. Jeffery. Validating the Defect Detection Performance Advantage of Group Designs for Software Reviews: Report of a Laboratory Experiment Using Program Code. In 6th European Software Engineering Conference, pages 294-309. Lecture Notes in Computer Science No. 1301, ed. Mehdi Jazayeri and Helmut Schauer, 1997.
[65] R. Linger, H. Mills, and B. Witt. Structured Programming: Theory and Practice. Addison-Wesley Publishing Company, 1979.
[66] M. Lipsey. Design Sensitivity. Sage Publications, 1990.
[67] T. McCabe. A Complexity Measure. IEEE Transactions on Software Engineering, 2(4):308-320, December 1976.
[68] J. McCall. Quality Factors. In J. Marciniak, editor, Encyclopedia of Software Engineering, volume 2, pages 958-969. John Wiley and Sons, 1994.
[69] J. Miller, M. Wood, and M. Roper. Further Experiences with Scenarios and Checklists. Empirical Software Engineering, 3(1):37-64, 1998.
[70] J. Miller. Applying Meta-Analytical Procedures to Software Engineering Experiments. To appear in Journal of Systems and Software.
[71] T. Moher and G. Schneider. Methodology and Experimental Research in Software Engineering. International Journal of Man-Machine Studies, 16:65-87, 1982.
[72] G. Myers. A Controlled Experiment in Program Testing and Code Walkthroughs/Inspections. Communications of the ACM, 21(9):760-768, September 1978.
[73] National Aeronautics and Space Administration. Software Formal Inspection Guidebook. Technical Report NASA-GB-A302, August 1993. http://satc.gsfc.nasa.gov/fi/fipage.html.
[74] Panel on Statistical Methods in Software Engineering. http://www.nap.edu/readingroom/books/statsoft/, 1993.
[75] D. Parnas and D. Weiss. Active Design Reviews: Principles and Practice. Journal of Systems and Software, 7:259-265, 1987.
[76] A. Porter, H. Siy, C. Toman, and L. Votta. An Experiment to Assess the Cost-Benefits of Code Inspections in Large Scale Software Development. IEEE Transactions on Software Engineering, 23(6):329-346, June 1997.
[77] A. Porter, L. Votta, and V. Basili. Comparing Detection Methods for Software Requirements Inspections: A Replicated Experiment. IEEE Transactions on Software Engineering, 21(6):563-575, June 1995.
[78] A. Porter and L. Votta. Comparing Detection Methods for Software Requirements Inspections: A Replication Using Professional Subjects. Empirical Software Engineering, 3:355-379, 1998.
[79] R. Lindsay and A. Ehrenberg. The Design of Replicated Studies. The American Statistician, 47(3):217-228, 1993.
[80] B. Regnell, P. Runeson, and T. Thelin. Are the Perspectives Really Different? Further Experimentation on Scenario-Based Reading of Requirements. Empirical Software Engineering, 5(4), December 2000.
[81] S. Rifkin and L. Deimel. Applying Program Comprehension Techniques to Improve Inspection. Proceedings of the 19th Annual NASA Software Engineering Workshop, NASA, 1994.
[82] R. Rosenthal. Meta-Analytic Procedures For Social Research. Sage Publications, 1984.
[83] R. Rosenthal and R. Rosnow. Essentials of Behavioural Research: Methods and Data Analysis. McGraw Hill Series in Psychology, 1991.
[84] R. Rosnow and R. Rosenthal. Beginning Behavioural Research: A Conceptual Primer. Prentice Hall International Editions, 1996.
[85] K. Sandahl, O. Blomkvist, J. Karlsson, C. Krysander, M. Lindvall, and N. Ohlsson. An Extended Replication of an Experiment for Assessing Methods for Software Requirements Inspections. Empirical Software Engineering, 3:327-354, 1998.
[86] F. Schmidt. What Do Data Really Mean? Research Findings, Meta-Analysis, and Cumulative Knowledge in Psychology. American Psychologist, 47:1173-1181, 1992.
[87] D. Sheskin. Handbook of Parametric and Nonparametric Statistical Procedures. CRC Press, 1997.
[88] S. Siegel and J. Castellan. Nonparametric Statistics For The Behavioural Sciences. McGraw Hill, Inc., 2nd edition, 1988.
[89] E. Simpson. The Interpretation of Interaction in Contingency Tables. Journal of the Royal Statistical Society, B13:238-241, 1951.
[90] P. Spector. Research Designs. Number 07-023 in Quantitative Applications in the Social Sciences. Sage Publications, 1995.
[91] S. Shapiro and M. Wilk. A Comparative Study of Various Tests of Normality. Journal of the American Statistical Association, 63:1343-1372, 1968.
[92] M. Slatker, Y. B. Wu, and N. S. Suzuki-Slatker. *, **, and ***; Statistical Nonsense at the .00000 Level. Nursing Research, 40(4):248-249, 1991.
[93] L. Votta. Does Every Inspection Need a Meeting? ACM Software Engineering Notes, 18(5):107-114, December 1993.
[94] L. Votta. Does the Modern Code Inspection Have Value? Presentation at the NRC Seminar on Measuring Success: Empirical Studies of Software Engineering, March 1999. Available at: http://www.cser.ca/seminar/ESSE/slides/ESSE Votta.PDF
[95] B. Winer, D. Brown, and K. Michels. Statistical Principles in Experimental Design, 3rd edition. McGraw Hill Series in Psychology, 1991.
[96] C. Wohlin and P. Runeson. Introduction to Experimentation in Software Engineering, 2000.
[97] F. Wolf. Meta-Analysis: Quantitative Methods for Research Synthesis. SAGE University Paper, 1986.
[98] E. Youngs. Human Errors in Programming. International Journal of Man-Machine Studies, 6:361-376, 1974.
[99] G. Yule. Notes on the Theory of Association of Attributes in Statistics. Biometrika, 2:121-134, 1903.
Acknowledgement

The authors thank Khaled El Emam for his contributions to the quasi-experiment at Bosch Telecom GmbH.
Oliver Laitenberger is currently a researcher and consultant at the Fraunhofer Institute for Experimental Software Engineering (IESE) in Kaiserslautern. His main interests are software quality assurance with software inspections, inspection measurement, and inspection improvement. As a researcher, he has been working for several years on the development and evaluation of inspection technology. As a consultant, he has worked with several international companies on introducing and improving inspections. Oliver Laitenberger received the degree Diplom-Informatiker (M.S.) in computer science and economics from the University of Kaiserslautern, Germany, in 1996.

H. Dieter Rombach is a professor in the computer science department at the University of Kaiserslautern and director of the Fraunhofer Institute for Experimental Software Engineering. His research interests are predictable software methodologies, modeling and measurement of software processes and resulting products, software reuse, and integrated software-development environments. Rombach received a BS and an MS from the University of Karlsruhe, Germany, in 1975 and 1978, respectively, and a PhD from the University of Kaiserslautern, Germany, in 1984.
CHAPTER 6
Experimental Validation of New Software Technology

Marvin V. Zelkowitz
Department of Computer Science and Institute for Advanced Computer Studies, University of Maryland, College Park, Maryland 20742
and Fraunhofer Center - Maryland, College Park, Maryland 20740

Dolores R. Wallace†
SRS Information Services, NASA Goddard Space Flight Center, Greenbelt, MD 20771

David W. Binkley
Computer Science Department, Loyola College, Baltimore, Maryland
When to apply a new technology in an organization is a critical decision for every software development organization. Earlier work defines a set of methods that the research community uses when a new technology is developed. This chapter presents a discussion of the set of methods that industrial organizations use before adopting a new technology. First there is a brief definition of the earlier research methods and then a definition of the set of industrial methods. A survey taken by experts from both the research and industrial communities provides insights into how these communities differ in their approach toward technology innovation and technology transfer.

Research supported in part by National Science Foundation grants CCR9706151 and CCR0086078 to the University of Maryland.
† Research performed while an employee of the Information Technology Laboratory, National Institute of Standards and Technology, Gaithersburg, MD.
Keywords: Experimentation; models; software engineering; survey; technology transfer.
1. Introduction

Although the need to transition new technology to improve the process of developing quality software products is well understood, the computer software industry has done a poor job of carrying out that need. Often new software technology is touted as the next "silver bullet" to be adopted, only to fail and disappear within a short period. New technologies are often adopted without any convincing evidence that they will be effective, yet other technologies are ignored despite published data indicating that they will be useful. One problem is that two distinct communities are involved in the technology transition process:

1. The research community, which investigates new technology
2. The industrial community, which needs to use new technology to improve the way it develops software

The purpose of this work is to understand these communities and their differences. Just as the literary and scientific communities described by Snow had difficulties because each lacked respect for and mistrusted the other [Snow63], yet came to accept one another as each began to understand the other's role, a better mutual understanding of the research and industrial communities should help in developing research programs that meet the needs of both. The methods used by each community to evaluate new technology differ, and this "disconnect" between the two communities is part of the difficulty in moving new ideas from the research laboratory into industry. Researchers have studied the role of experimentation in computer science research, for example [Fenton94]. However, most of these studies have looked at the relatively narrow scope of how to conduct valid replicated scientific experiments within this domain. The focus of this chapter is the role of experimentation as an agent for transferring new technology into industry.

Researchers, whether in academia or industry, have a desire to develop new concepts and are rewarded when they produce new designs, algorithms, theorems, and models. The "work product" in these cases is often a published paper demonstrating the value of their new technology.
Researchers often select their research topics according to their own interests; the topics may or may not be directly related to a specific problem faced by industry. After achieving a result that they consider interesting, they have a great desire to get that result into print. Providing a good scientific validation of the technology is often not necessary for publication, and studies have shown that experimental validation of computer technology is particularly weak, e.g., [Tichy95] [Zelkowitz98]. Development professionals, however, have a desire, and are paid, to produce a product using whatever technology seems appropriate for the problem at hand. The end result is a product that produces revenue for their employer. In industry, producing and profiting from a product is what matters most, and the "elegance" of the process used to produce that product is less important than achieving a quality product on time. Being "state of the art" in industry often means doing things as well (or as poorly) as the competition, so there is considerable risk aversion in trying a new technology unless the competition is also using it.

As a consequence, researchers produce papers outlining the value of new technology, yet industry often ignores that advice since there has been no empirical justification that the technology will be effective in making their job easier. On the other hand, "gatekeepers" in industry adopt assorted "silver bullets" proposed as solutions to the "software crisis" without any good justification that they may be effective [Brooks87]. These are used for a time by large segments of the community and then discarded when they turn out not to be the solution. Clearly the research community is not generating results that are in tune with what industry needs to hear, and industry is making decisions without the benefit of good scientific developments. The two communities are severely out of touch with one another.

This chapter describes two classifications of techniques that have been used to support the introduction of new technologies by each of these communities. Not surprisingly, these two taxonomies differ, but there are basic similarities that are often overlooked by the two communities. The chapter describes both classifications and presents results from several studies that show how these methods are used in practice to demonstrate the effectiveness of a new method. Section 2 of this chapter discusses general methods for scientific experimentation, whereas Section 3 defines an experimentation model applicable to software technology. In Section 4 a model for industry
validation of new technology is presented, while Section 5 presents a survey that compares the two models - research and industrial. Section 6 presents a discussion and conclusions we derived from this activity.
2. Experimentation

Software engineering is concerned with techniques useful for the development of effective software programs, where "effective" depends upon specific problem domains. Effective software can mean software that is low cost, reliable, rapidly developed, safe, or that has some other relevant attribute. Answering the question "Is this technique effective?" requires some measurement of the relevant attribute. Saying only that a technique is "good" conveys no real information. Instead, a measurement applied to each attribute is necessary so that a judgment can be made that one technique is more or less effective than another. For some attributes, this mapping of an effectiveness attribute to a measurement scale is fairly straightforward. If effective for an attribute means low cost, then cost of development is such a measure. For other attributes (e.g., reliability, safety, and security), measures may be harder to derive. Measures like the number of failures in using the product per day, errors found during development, or MTBF (Mean Time Between Failure) indicate the reliability of a product in hardware domains. But for software, a count of the number of errors found during testing does not, by itself, indicate whether further errors remain to be found. While safety is related to reliability, it is not the same attribute: a very unreliable program can be very safe if it can turn itself off each time the software fails. Does security mean the time it takes to penetrate the software to bypass its security protection, how many "security vulnerabilities" are present in the system, or what level of information the program is allowed to process? In evaluating a new method, the researcher needs to know whether the various attributes result in an effective measurement. Experimentation determines whether these methods result in the relevant software attributes being as effective as necessary. Should the underlying theory upon which the technique is based be modified? What predictions can be made about future developments based upon using these techniques?
2.1 Pseudo Experimentation

Experimentation is one of those terms frequently used incorrectly in the computer science community. Papers are written that explain some new technology and then "experiments" are performed to show the technology is effective. In almost all of these cases, this means that the creator of the technology has implemented the technology and shown that it seems to work. Here, "experiment" really means an example that the technology exists, or an existence proof that the technique can be employed. Very rarely does it involve any collection of data to show that the technology adheres to some underlying model or theory of software development, or that it is effective, as "effective" was defined previously, in the sense that applying the technology leads to a measurable improvement in some relevant attribute.

A typical example could be the design of a new programming language, where the "experiment" would be the development of a compiler for the new language with sample programs compiled using this compiler. The designer may claim this language is better than others. However, the "success" of the experiment may be the demonstration that the compiler successfully compiles the sample programs, instead of providing data that shows the value or effectiveness of the new language. A confirming experiment would have demonstrated attributes proving the utility of the language. Without a confirming experiment, why should industry select a new method or tool? On what basis should another researcher enhance the language (or extend a method) and develop supporting tools? A scientific discipline requires more than to simply say, "I tried it, and I like it."

2.2 How to Experiment?

When one thinks of an "experiment," one often thinks of a roomful of subjects, each being asked to perform some task, followed by the collection of data from each subject for later analysis. However, there are four approaches toward experimentation [Adrion93]:

1. Scientific method. A theory to explain an observable phenomenon is developed. A given hypothesis is proposed and then alternative variations of the hypothesis are tested and data collected to verify or refute the claims of the hypothesis.
2. Engineering method. A solution to a hypothesis is developed and tested. Based upon the results of the test, the solution is improved, until no further improvement is required.

3. Empirical method. A statistical method is proposed as a means to validate a given hypothesis. Unlike the scientific method, there may not be a formal model or theory describing the hypothesis. Data is collected to statistically verify the hypothesis.

4. Analytical method. A formal theory is developed, and results derived from that theory can be compared with empirical observations.

The common thread of these methods is the collection of data on either the development process or the product itself. When researchers conduct an experiment, more properly an experiment using the scientific method described above, they are interested in the effect that a method or tool, called a factor, has on an attribute of interest. The running of an experiment with a specific assignment to the factors is called a treatment. Each agent that the researchers are studying and collecting data on (e.g., programmer, team, source program module) is called a subject or an experimental unit. The goal of an experiment is to collect enough data from a sufficient number of subjects, all adhering to the same treatment, in order to obtain a statistically significant result on the attribute of concern compared to some other treatment. (For more on experimentation, see, for example, [Campbell63].) In developing an experiment to collect data on an attribute, researchers have to be concerned with several aspects of data collection [Kitchenham96]:

1. Replication - A researcher must be able to replicate the results of an experiment to permit other researchers to reproduce the findings. To ensure this, the researcher must not confound two effects; that is, the researcher must make sure that unanticipated variables are not affecting the results. If there is not a homogeneous sample of subjects for all treatments, this confounding effect can, paradoxically, be counteracted by randomizing the factors that are not of concern.

2. Local control - Local control refers to the degree to which the treatment applied to each subject can be modified (e.g., the researcher usually has little control over the treatment in a case study, as defined in Section 3.1). Local control is a major problem in computer science research since many of the treatments incur significant costs or expenditures of time. In a blocking experiment, the researcher assumes each subject of a
treatment group comes from a homogeneous population. Thus, if the researcher randomly selects subjects from a population of students, that represents a blocked experiment of students. In a factorial design the researchers apply every possible combination of treatments for each factor. Thus, if there are three factors to evaluate, and each has 3 possible values, they need to run 3 × 3 × 3 = 27 experiments, with subjects randomly chosen from among the blocked factors (a small sketch of such an assignment follows this list).

3. Experimental validity - Researchers must also be concerned with the validity of the experimental results. They want the experiment to have internal validity; that is, the factor being measured should indeed be the factor responsible for the effect the researchers are seeking. In addition, researchers want the experiment to have external validity; that is, it should be possible to generalize the results of the experiment to other similar environments so that the results obtained are useful in other contexts. The process of randomizing the subjects, mentioned previously, is one way to ensure that researchers haven't accidentally included some other factor.

With software development, there are two additional aspects to consider:

4. Influence. In developing experiments involving large, complex, and expensive methods, such as software development, researchers need to know the impact that a given experimental design has on the results of that experiment. The authors will call this influence and classify the various methods as passive (viewing the artifacts of study as inorganic objects that can be studied with no effects on the object itself) or active (interacting with the artifacts under study, often affecting the behavior of the objects, as in the case of the well-known "Hawthorne" effect).

5. Temporal properties. Data collection may be historical (e.g., archaeological) or current (e.g., monitoring a current project). Historical data will certainly be passive, but may be missing just the information needed to come to a conclusion.
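As referenced in the discussion of factorial designs above, a full factorial assignment simply enumerates every combination of factor levels and assigns subjects from the blocked pool at random. The sketch below is illustrative only; the factor names and levels are invented and Python is assumed.

```python
import itertools
import random

factors = {
    "reading_technique": ["checklist", "scenario", "ad hoc"],
    "document_type": ["code", "design", "requirements"],
    "experience": ["novice", "intermediate", "expert"],
}

# Every combination of factor levels is one treatment: 3 x 3 x 3 = 27 treatments.
treatments = list(itertools.product(*factors.values()))
print(len(treatments), "treatments")

# Subjects drawn from a blocked (homogeneous) pool are assigned at random.
subjects = [f"subject_{i:02d}" for i in range(len(treatments))]
random.shuffle(subjects)
for subject, treatment in zip(subjects, treatments):
    print(subject, dict(zip(factors, treatment)))
```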
3. Research Models

To understand the differences between the research and industrial communities, the authors examined experimentation models for computer technology research and developed a simple taxonomy to classify that research. It was possible to identify 14 methods used by researchers to
develop new technology that have been used in the computer field (Table 3.1) and verified their usage by studying 612 papers appearing in three professional publications at 5-year intervals [Zelkowitz98] from 1985 through 1995. About 15% of the papers contained no validation at all and another third contained a weak, ineffective form of validation (called an assertion in this study). The figure for other scientific disciplines was more like 10% to 15% with no validation [Zelkowitz97].

Method              Type       Method                 Type
Assertion           Informal   No validation          Informal
Case study          Observ     Project monitoring     Observ
Dynamic analysis    Contrl     Replicated             Contrl
Field study         Observ     Simulation             Contrl
Legacy data         Hist       Static analysis        Hist
Lessons learned     Hist       Synthetic              Contrl
Literature search   Hist       Theoretical analysis   Formal

Table 3.1. Experimentation validation models.
The 14 methods can be grouped into 5 general areas:

1. Observational methods - These methods involve the monitoring of a project, as it develops, to collect data on the effectiveness of a new technology.

2. Historical methods - These methods involve an analysis of collected data to discover what happened during the development of a previously completed project.

3. Controlled methods - These methods involve careful study of alternative strategies to determine the effectiveness of one method as compared to other methods. This is the more traditional concept when one thinks of an "experiment."

4. Formal methods - These methods involve using a formal model to describe a process. Ultimate validation depends upon using another validation method to determine whether the formal model agrees with reality.
5. Informal methods - These methods are generally ad hoc and do not provide significant evidence that the technique under study provides the benefits that are claimed.

The 14 validation models can be grouped according to the above 5 general areas as follows.

Observational Methods

1. Project monitoring - In this method, the researcher collects and stores development data during project development. The available data will be whatever the project generates, with no attempt to influence or redirect the development process or methods that are being used. After the project is finished, the data will be analyzed to determine if there is anything of interest. This method rarely produces significant results.

2. Case study - A project is monitored and data is collected over time. The data that is collected is derived from a specific goal for the project. A certain attribute is monitored (e.g., reliability, cost) and data is collected to measure that attribute. This is the method that is most common in studying large projects. The impact on a given project is relatively low, but the data can be significant. The downside to this method is that the effectiveness of the new technique being studied cannot be compared to other projects not using the new technique. However, if many such projects are studied over time, assuming they represent a blocked experimental group, the relative value of a new technique can be determined.

3. Field study - Data is collected from several projects simultaneously. Typically, data is collected from each activity in order to determine the effectiveness of that activity. Often an outside group will monitor the actions of each subject group, whereas in the case study model the subjects themselves perform the data collection activities. This has an advantage over the case study in that several projects, some using the new technique and some not using it, can be studied at the same time. However, the data is usually relatively meager, so only broad generalizations can be determined. Since there is little control over the environment for each of the given projects, the results from each study are not directly comparable. In contrast, the controlled studies, described below, offer more control over the environment to allow for more precision in the observed results.
Historical Methods

4. Literature search - In this method, previously published studies are examined. It requires the investigator to analyze the results of papers and other documents that are publicly available (e.g., conference and journal articles). Meta-analysis, a technique where the results from several studies are combined to increase the amount of data that is available, is sometimes used to allow results to be obtained from a set of experiments where no single one, by itself, can support a conclusion [Miller99] (a small sketch follows this group of methods). The problem here is selection bias: it is not clear whether the published literature represents representative use of a technology. In particular, failures in the use of a technology are rarely reported, and even if reported are rarely accepted for publication. This is also called publication bias or the file drawer problem: the probability that a study reaches the literature, and is thus available for combined analysis, depends on the results of that study [Scargle99]. Since positive results are more likely to be published, this can have the effect of skewing the results.

5. Legacy data - Data from previous projects is examined for understanding in order to apply that information to a new project under development. Available data includes all artifacts involved in the product (e.g., the source program, specification, design, and testing documentation, as well as data collected in its development).

6. Lessons-learned - Qualitative data from completed projects is examined. Lessons-learned documents are often produced after a large industrial project is completed. A study of these documents often reveals qualitative aspects which can be used to improve future developments.

7. Static analysis - This is similar to the above two methods, except that it centers on an examination of the structure of the developed product. Since the developed product is implemented in some programming language (whether C, C++, or HTML for a web product), it is defined by some formal syntax, which allows an automated tool to access the source files and perform the analysis.
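The meta-analysis mentioned under literature search (item 4 above) can be carried out in several ways; one simple form is a fixed-effect, inverse-variance combination of per-study effect estimates. The Python sketch below assumes that form and uses invented study numbers; it is not the specific procedure of [Miller99].

    import math

    # Hypothetical per-study results: (effect estimate, standard error).
    studies = [(0.40, 0.20), (0.15, 0.10), (0.30, 0.25)]

    # Fixed-effect, inverse-variance pooling: weight each study by 1/se^2.
    weights = [1.0 / se**2 for _, se in studies]
    pooled = sum(w * eff for (eff, _), w in zip(studies, weights)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))

    # 95% confidence interval for the pooled effect.
    low, high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
    print(f"pooled effect {pooled:.2f}, 95% CI [{low:.2f}, {high:.2f}]")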
Controlled Methods

8. Replicated experiment - The researcher monitors multiple versions of a product. In a replicated experiment, several projects are staffed to perform a task in multiple ways. Control variables are set (e.g., duration, staff level, methods used) and statistical validity can be more readily applied. This is the "classical" scientific experiment, where a similar process is altered repeatedly to see the effects of that change. However, within software development this rarely applies, due to the cost of replication. The simpler synthetic environment method is more often used.

9. Synthetic environment - A researcher replicates one or more factors in a laboratory setting. In software development, projects are usually large and the staffing of multiple projects (e.g., the replicated experiment) in a realistic setting is usually prohibitively expensive. For this reason, most software engineering replications are performed in a smaller artificial setting, which only approximates the environment of the larger projects. For example, multiple instances of a technique (e.g., code reading) are duplicated (e.g., using students in a classroom). This provides insights into the effectiveness of the method. But since the method is applied in isolation, the impact of this method relative to the other methods used in a project is not immediately apparent. This form of experimentation leads to evolutionary changes in a development method, since only one or two factors are under study for change at any one time. Major shifts in technology cannot be tested in this manner since the proposed changes are too extensive to be tested in isolation.

10. Dynamic analysis - A product is executed to collect certain runtime information (e.g., performance). Software is often instrumented by adding debugging or testing code in such a way that features of the product can be demonstrated and evaluated when the product is executed.

11. Simulation - Related to dynamic analysis is the concept of simulation, where a researcher executes the product with artificial data, often in a model of the real environment. The limitation of simulation is how well the model corresponds to the real environment.

Formal Methods

12. Theoretical analysis - The researcher uses mathematical logic or some other formal theory to validate a technique. Validation consists of logical proofs derived from a specific set of axioms. This method also requires one of the 11 previous methods to be applied to show that the
model that was developed agrees with reality; that is, that the concrete realization of the abstract model is correct.

Informal Methods

13. Assertion - This is a weak form of validation. It is usually presented as an example use of the new technology in which the developer of the technology demonstrates its value, rather than objectively assessing its relevance compared to competing technologies. In the study of 612 published papers, almost one-third of the papers fell into this category.

14. No validation - In about 15% of the papers that were studied, there was no validation at all. The authors simply explained a new technology and claimed success. Although some of these technologies were validated in later publications, the high percentage of no validation and assertion validations (almost half of the total number of papers) is disturbing.

Table 3.2 briefly summarizes the strengths and weaknesses of each method.

Study Results

A study of 612 papers from IEEE Transactions on Software Engineering (TSE), IEEE Software magazine (Soft.) and the International Conference on Software Engineering (ICSE) for the years 1985, 1990 and 1995 [Zelkowitz98] enabled a classification of all of them according to the 14 methods presented here. Of the 612 papers, 50 were considered not applicable since they were not a research contribution (e.g., a tutorial, a report about some new activity, or some political or social issue affecting the software engineering community). The results of the remaining 562 papers from that study are summarized in Table 3.3. The authors' results were consistent with those found by Tichy in his 1995 study of 400 research papers [Tichy95]. He found that over 50% of the design papers did not have any validation in them. In a more recent paper [Tichy98], he makes a strong argument that more experimentation is needed and refutes several myths deprecating the value of experimentation.
Validation method | Description | Weakness | Strength
Project monitoring | Collection of development data | No specific goals | Provides baseline for future; Inexpensive
Case study | Monitor project in depth | Poor controls for later replication | Can constrain one factor at low cost
Field study | Monitor multiple projects | Treatments differ across projects | Inexpensive form of replication
Literature search | Examine previously published studies | Selection bias; Treatments differ | Large available database; Inexpensive
Legacy data | Examine data from completed projects | Cannot constrain factors; Data limited | Combine multiple studies; Inexpensive
Lessons learned | Examine qualitative data from completed projects | No quantitative data; Cannot constrain factors | Determine trends; Inexpensive
Static analysis | Examine structure of developed product | Not related to development method | Can be automated; Applies to tools
Replicated | Develop multiple versions of product | Very expensive; "Hawthorne" effect | Can control factors for all treatments
Synthetic | Replicate one factor in laboratory setting | Scaling up; Interactions among multiple factors | Can control individual factors; Costs moderate
Dynamic analysis | Execute developed product for performance | Not related to development method | Can be automated; Applies to tools
Simulation | Execute product with artificial data | Data may not represent reality; Not related to development method | Can be automated; Applies to tools; Evaluate in safe environment
Theoretical | Use of formal logic to prove value of technology | Not clear if formal model agrees with reality | Inexpensive validation; If model correct, is effective method
Assertion | Ad hoc validation | Insufficient validation | Basis for future experiments

Table 3.2. Summary of validation models.
Method | 1985 ICSE | 1985 Soft. | 1985 TSE | 1990 ICSE | 1990 Soft. | 1990 TSE | 1995 ICSE | 1995 Soft. | 1995 TSE | Total No. | %
Project monitoring | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0.2
Case study | 5 | 2 | 12 | 7 | 6 | 6 | 4 | 6 | 10 | 58 | 10.3
Field study | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 2 | 7 | 1.2
Literature search | 1 | 1 | 3 | 1 | 5 | 1 | 0 | 3 | 2 | 17 | 3.0
Legacy data | 1 | 1 | 2 | 2 | 0 | 2 | 1 | 1 | 1 | 11 | 2.0
Lessons learned | 7 | 5 | 4 | 1 | 4 | 8 | 5 | 7 | 8 | 49 | 8.7
Static analysis | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 4 | 0.7
Replicated | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 3 | 6 | 1.1
Synthetic | 3 | 1 | 1 | 0 | 1 | 4 | 0 | 0 | 2 | 12 | 2.1
Dynamic analysis | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 4 | 7 | 1.2
Simulation | 2 | 0 | 10 | 0 | 0 | 11 | 1 | 1 | 6 | 31 | 5.5
Theoretical | 8 | 0 | 14 | 6 | 0 | 14 | 3 | 0 | 7 | 52 | 9.3
Assertion | 12 | 13 | 54 | 12 | 19 | 42 | 4 | 14 | 22 | 192 | 34.2
No validation | 8 | 11 | 42 | 2 | 8 | 27 | 7 | 3 | 7 | 115 | 20.5
Yearly totals | 50 | 34 | 143 | 31 | 44 | 120 | 27 | 36 | 76 | 562 |

Table 3.3. Use of research validation methods.
4. Industrial Models

While Table 3.1 defines a taxonomy for evaluating research results, a better taxonomy needs to represent the efforts used by industry in its technology adoption process. A few industrial interviews and some earlier work by Brown and Wallnau [Brown96] provided a basis for defining an industrial transition taxonomy for technology evaluation, as used by industry (Table 4.1). While the transition models include some used by researchers, there are additional methods. The industrial taxonomy includes a total of 15 different models industrial organizations use to evaluate a new technology.
The major difference between the two sets of models is that the ultimate goal of the research community is to determine the effectiveness of the new technology to be tested compared to competing technologies. However, the ultimate goal of industry is to develop a product and realize revenue by delivering the product to customers. Thus achieving the best technology is not always of uppermost concern. The goal is to be good enough yet better than the competition. Cost is a factor, and industrial methods are usually skewed to those that cost less to accomplish.
Method                | Type     | Method               | Type
Case study            | Observ   | Literature search    | Hist
Demonstrator projects | Contrl   | Pilot study          | Contrl
Education             | Hist     | Project monitoring   | Observ
External              | Informal | Replicated project   | Contrl
Expert opinions       | Hist     | Synthetic benchmark  | Contrl
Feature benchmark     | Hist     | Theoretical analysis | Formal
Field study           | Observ   | Vendor opinion       | Informal
Legacy data           | Hist     |                      |
Table 4.1. Industrial transition models.

The 15 methods, using the same five categories of observational, historical, controlled, formal and informal methods, are defined as follows.

Observational Methods

1. Project monitoring - Data is continually collected on development practices. This data can be investigated when a new technology is proposed. This is also called measurement. Building a baseline of data that describes a development environment provides data useful for comparing new projects. It is an important adjunct if techniques such as case studies or field studies are also employed, since it provides a yardstick that can be used to compare a project that uses the new technology with other projects developed by that organization. This is the basic method used by the NASA Goddard Space Flight Center's Software Engineering Laboratory (SEL) in developing the large body of
knowledge on development practices that has characterized SEL research from 1975 through 2001 [Basili02].

2. Case study - Sample projects, typical of other developments for that organization, are developed, where some new technology is applied and the results of using that technology are observed. This is viewed as an initial experiment to see if the new method is an improvement over past practices. However, since this is a solitary project, there is no definitive test to determine what the results would have been if the new technology were not used. If multiple case studies are performed testing different technologies, this process is similar to the research case study model.

3. Field study - An assessment is made by observing the behavior of several other development groups over a relatively short time. There is less control over the development environment, and the method has characteristics similar to the field study research method.

Historical Methods

4. Literature search - Information obtained from professional conferences, journals, and other academic sources is used to make a determination of the effectiveness of a new technology. The advantage of this method is that there is less risk of misusing a poorly conceived new technology without positive experiences from others. The disadvantage is that it is not always clear that the tested environment is similar to the new industrial environment, so the experiences may not be the same.

5. Legacy data - Completed projects are studied in order to find new information about the technologies used to develop those projects. The technique of data mining is often used to see if any relationships are hidden within the data collected from a completed project, in order to be able to generalize the use of this new technology [Berry00]. This is similar to the legacy data research method.

6. Expert opinion - Experts in other areas (e.g., other companies, academia, other projects) are queried for their expert opinion of the probable effects of some new technology. This informal method is most similar to the lessons learned method used in the research community. This method uses individuals outside of the organization. On the other hand, the Education method (below) refers to using employees that are part of the organization.
7. Feature benchmark - Alternative technologies are evaluated and comparable data are collected. This is usually a "desk study" using documentation on those features present in the new technology. It is most similar to the static analysis research method, where the structure of a new technology is evaluated.

8. Education - The quickest way to install a new technology within an organization is to train the staff to use the new technology. The advantages are that the new technology does not need to be tested, thus saving much money, and that transfer of the technology to the new organization is relatively rapid. The real downside is that it is not clear if the new technology is applicable to the new organization's goals and processes. This method can be divided into two subcategories:

(a) People - Hire the experts in a technology to help learn about it. This immediately provides the organization with the necessary expertise to use the new technology.1

(b) Training - Course materials to teach a new technology are given to current employees. Often it is not practical to hire new employees expert in the new technology, and this method provides a larger group of individuals knowledgeable about a new technology. However, each newly trained employee cannot be considered an expert in the use of that technology.

Controlled Methods

9. Replicated project - One or more projects duplicate another project in order to test different alternative technologies on the same application. Although this is the same as the replicated experiment of the research methods, it is an expensive method, since multiple instances of a project are developed, only one of which is necessary. It is also called a shadow project.

10. Demonstrator projects - Multiple instances of an application, with essential features deleted, are built in order to observe the behavior of the new system. This has some of the same characteristics as the synthetic environment research method, where the new technology is tested multiple times in isolation from its interaction with other aspects of the system.
1 A comment once heard by one of the authors (source unknown): "The best technology transfer vehicle is often the moving van transporting a new PhD to his first job."
11. Synthetic benchmarks - A benchmark, or executing a program using a predefined data set, is often used to compare one product against others. If the benchmark is truly representative of the class of problems for which the product will be used, it is an effective evaluation tool. The major difficulty with benchmarks is that vendors often try to skew benchmarks to make their product appear effective, so a benchmark may not truly represent the product's use in actual production. But if the benchmark is truly representative, then it is one of the few methods for determining objective comparisons between competing products.

12. Pilot study - In a pilot study (also called a prototype), a sample project that uses a new technology is developed. This is generally smaller than a case study, done before scaling up to full deployment, but is more complete than a demonstrator project. It most closely relates to the simulation research method.

Formal Methods

13. Theoretical analysis - Much like the theoretical analysis research method, an organization can base its acceptance of a new technology on the validity of the mathematical model of that technology.

Informal Methods

14. Vendor opinion - Vendors (e.g., through trade shows, trade press, advertising, sales meetings) promote a new technology and convince an organization to adopt it. This can be a reasonable approach, but it is missing a critical analysis of alternative methods, since vendors are interested in promoting their own products. It is quite similar to the assertion research method.

15. External - Sometimes the need to use a new technology is not up to the organization. Outside forces can dictate the use of a new technology. For these methods, there is often little evaluation of the effectiveness of the new technology; the organization is simply instructed to change its methods. This can happen in one of two ways:

(a) Edicts - Occasionally an organization is told to use a new technology. The edict can come from upper management (e.g., the corporate headquarters of a company) or from a government rule or regulation. The mandated use of the Ada programming language during the early 1990s and the need for an organization to be rated at the Software
Engineering Institute's Capability Maturity Model (CMM) level 3 in order to secure certain government contracts are examples of the use of edicts to change the technology within organizations.

(b) State of the art - An organization will often use a new technology based upon purchaser or client desires, or government rules, to use only the latest or best technology. Examples include converting to object-oriented design technology in order to show customers that the organization is using the latest techniques for software design.

In general, the methods used by the research community can be considered as exploratory, in the researchers' attempts to understand and develop new technology. Industry, on the other hand, wants methods that work, so its techniques are more confirmatory, showing that a given method does indeed have the desired properties. As the explanations above demonstrate, there is a strong agreement between the two sets of techniques. This relationship between the exploratory research methods of Table 3.1 and the confirmatory industrial techniques of Table 4.1 is given in Table 4.2. The two industrial methods - education and external (edicts or state of the art) - do not have research analogues.

Research exploratory methods | Industrial confirmatory methods
Assertion | Vendor opinion
Case study | Case study
Dynamic analysis | Synthetic benchmarks
Field study | Field study
Legacy data | Legacy data
Lessons learned | Expert opinion
Literature search | Literature search
Project monitoring | Project monitoring
Replicated | Replicated project
Simulation | Pilot study
Static analysis | Feature benchmark
Synthetic | Demonstrator projects
Theoretical analysis | Theoretical analysis
None | Education
None | External
Table 4.2. Relationship between two transition models.
Some strengths and weaknesses of the research models are indicated in Table 3.2. Researchers principally use the methods from Table 3.1 in order to demonstrate the value of their technological improvements, and industry selects new technology to employ by using the methods in Table 4.1. How do these communities interact? How can their methods support forward growth in computer technology and its application in real systems? A better understanding of what each community understands and values could perhaps enable identification of commonalities and gaps, and from there, mechanisms to enable each community to benefit better from the other.
5. Valuation of the Models

To understand the different perceptions between those who develop technology and those who use technology, the authors surveyed the software engineering community to learn their views of the effectiveness of the various models of Tables 3.1 and 4.1. This section presents the development and results of this survey.

5.1 Development of the Survey

The survey was intentionally kept simple in order to increase the likelihood of a higher than average response rate from the sample population. The survey did not ask for proprietary data, which, while it would have provided useful quantitative results, would have further limited the response rate. Also, by keeping the questions simple, the result is a valid instrument that allows the results to be generalized to other domains readily. These tradeoffs are sometimes referred to as "Thorngate's clock" [Thorngate76]. This is the psychological equivalent of the Heisenberg uncertainty principle in quantum physics, where location and momentum cannot both be measured precisely. In this case, the choice was among accuracy, generality, and simplicity for developing the survey instrument. Selecting two meant sacrificing the third. The result was selection of simplicity and generality, and thus accuracy suffers somewhat. The results described here indicate general trends, which more in-depth surveys need to address.

Survey questions are based on a previous survey [Daly97], modified for current purposes. Each survey participant was to rank the difficulty of each of the 12 experimental models (or 13 original industrial transition models)
according to 7 criteria, criteria 1 and 2 being new and 3 through 7 being the same as the Daly criteria. Having ordinal values between 1 and 20 for each criterion supported objectivity of scoring. A value of 1 for a criterion is considered an exact match between it and the experimental model, 10 being the maximum effort that a given company would apply in practice for that model, and 20 an impossible condition for that model.

5.2 Survey Questions

The survey consisted of the following eight questions:

1. How easy is it to use this method in practice? - What is the effort in using this method? The values of 1, 10, and 20, as described above, represent relative costs for using that experimental or validation model in practice. A value of 1 indicates the method is trivial to use, a 20 means that it is impossible, and a 10 indicates it requires the maximum effort practical in an industrial setting.

2. What is the cost of adding one extra subject to the study? - If the researcher wants to add an additional subject (another data point) to the sample, what is the relative cost of doing so? This would increase the precision of the evaluation process by having additional experiments to study.

3. What is the internal validity of the method? - What is the extent to which one can draw correct causal conclusions from the study? That is, to what extent can the observed results be shown to be caused by the manipulated independent experimental variables and not by some other unobserved factor?

4. What is the external validity of the method? - What is the extent to which the results of the research can be generalized to the population under study and to other settings (e.g., from student subjects to professional programmers, from one organization to others, from classroom exercises to real projects)?

5. What is the ease of replication? - What is the ease with which the same experimental conditions can be replicated (internally or externally) in subsequent studies? It is assumed that the variables that can be controlled (i.e., the independent variables) are to be given the same value.

6. What is the potential for theory generation? - What is the potential of the study to lead to new causal theories, not anticipated a priori, that explain a phenomenon? For example, exploratory studies tend to have a high potential for theory generation.
7. What is the potential for theory confirmation? - What is the potential of the study to test an a priori well-defined theory and provide strong evidence to support it?

8. For the eighth question, each participant was asked to rank the relative importance (again using the 1-20 ranking) of each of the 7 prior questions when making a decision on using a new technology. That is, on what basis (i.e., criterion) is a decision on technology utilization made? This determines, for industry, the major influence for choosing to use a new technology.

These 8 questions led to two different survey instruments - one for ranking each of the 14 research validation methods of Table 3.1 (i.e., the research survey) and one for ranking each of the 13 evaluation methods of Table 4.1 (i.e., the industrial survey).2

5.3 Population Samples

For the two survey instruments there were three random populations to sample. Sample 1 included U.S.-based authors with email addresses published in several recent software engineering conference proceedings. These were mostly research professionals, with a few developers. Approximately 150 invitations to participate were sent to these individuals, and 45 accepted. The survey, conducted via email, was not sent until the participant agreed to fill out the form, estimated to take about an hour to read and complete. About half of the individuals returned the completed form. Sample 2 included U.S.-based authors with email addresses from several recent industry-oriented conferences. About 150 invitations to participate were sent and about 50 responded favorably to this invitation. They were then sent the industrial survey. Again, about half completed and returned the form. Sample 3 consisted of adult students in a graduate software engineering course at the University of Maryland taught by one of the authors. Almost all of the students were working professionals with experience ranging up to 24 years. This sample was given the research survey. Not surprisingly, the return rate of the form for this sample was high at 96% (44 of 46).
2 Education, synthetic benchmark and expert opinion were added as classifications after the survey was conducted, and the technique survey was dropped as a redundant technique.
It is important to realize that the respondents would be giving their subjective opinions on the value of the respective validation techniques. Not everyone returning the survey had previously used all, or even any, of the listed methods. It was simply desirable to get their views on how important they thought the methods were. However, by choosing the sample populations from those writing papers for conferences or taking courses for career advancement, the authors believe that the sample populations are more knowledgeable, in general, about validation methods than the average software development professional. The invitations were sent early in 1998, and data was collected later that year. Table 5.1 summarizes the 3 sample populations.
Sample | Transition methods | Sample size | Years exper. | Academic | Industrial R&D | Industrial developer | Other (e.g., consultants)
1 (Research) | Research | 18 | 18.6 | 9 | 3 | 3 | 3
2 (Industry) | Industry | 25 | 19.1 | 0 | 5 | 8 | 12
3 (Students) | Research | 44 | 6.6 | 1 | 5 | 27 | 11
Table 5.1. Characteristics of each survey sample.
5.4 Survey Results

In order to understand the differences between the goals of the research community and the goals of the industrial community, data were collected across all 7 criteria for all 62 who filled out one of the two research surveys.

5.4.1 Overall statistics

Figure 5.1 shows the average values for each technique using a "high-low-close" stock graph for this sample. The graph shows the average score for each of the 12 experimental methods over all 7 criteria as a small horizontal line. The vertical high and low "whiskers" show the confidence interval for α = 0.05. The length of the whiskers and their relative positions provide an indication of the level and range of confidence in each average. The interest is in the bars that do not overlap. These indicate a strong probability that they represent values from characteristically different sets. (The "7" in each criterion in the figure represents the midpoint (i.e., the literature search method) among the methods for ease in reading the figure.)
[Figure 5.1 is a high-low-close chart with one panel per criterion (easy to do, additional $, internal validity, external validity, ease of replication, theory generation, theory confirmation), showing the average score and α = 0.05 confidence interval for each of 12 methods: 1 case study, 2 dynamic analysis, 3 field study, 4 lessons learned, 5 legacy data, 6 project monitoring, 7 literature search, 8 replicated experiment, 9 simulation, 10 static analysis, 11 synthetic study, 12 theoretical analysis.]
Fig. 5.1. Range in scores for 18 responses from research community survey.

5.4.2 Practical and impractical techniques

Unfortunately, there is considerable overlap of the bars in Figure 5.1. While there are a few interesting bars (e.g., the average value of 12 for ease in performing a replicated experiment is much higher than the 4.3 value for project monitoring), almost every bar overlaps with another. Therefore it was necessary to use a weaker form of significance to get an indication of how these techniques compared. The methods for each criterion are split into three partitions: practical, neutral, and impractical, using the following procedure (recall that a low value indicates a more important technique); a small sketch of the procedure follows this list:

1. Each method whose upper confidence interval is below the average value for all techniques was placed in the practical partition. These methods are all "better than average" according to the α = 0.05 confidence criterion.

2. Each method whose lower confidence interval is above the average value for all methods was placed in the impractical partition. These methods are all "worse than average" according to the confidence criterion.

3. All other methods are in the neutral partition.
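A minimal Python sketch of this partitioning procedure is given below. It assumes a normal-approximation confidence interval around each method's mean score; the chapter does not specify the exact interval construction, and the scores shown are invented.

    import statistics

    def partition(scores_by_method, z=1.96):
        """Split methods into practical / neutral / impractical partitions
        for one criterion, based on confidence intervals of mean scores
        (lower scores are better)."""
        means = {m: statistics.mean(s) for m, s in scores_by_method.items()}
        grand_mean = statistics.mean(means.values())
        result = {"practical": [], "neutral": [], "impractical": []}
        for method, scores in scores_by_method.items():
            half_width = z * statistics.stdev(scores) / len(scores) ** 0.5
            upper = means[method] + half_width
            lower = means[method] - half_width
            if upper < grand_mean:        # whole interval below average
                result["practical"].append(method)
            elif lower > grand_mean:      # whole interval above average
                result["impractical"].append(method)
            else:
                result["neutral"].append(method)
        return result

    # Invented 1-20 scores from a handful of respondents for one criterion.
    ease_of_use = {
        "project monitoring": [3, 4, 5, 4, 6],
        "case study":         [5, 6, 7, 5, 8],
        "replicated":         [12, 14, 11, 13, 15],
        "theoretical":        [10, 12, 14, 11, 13],
    }
    print(partition(ease_of_use))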
Table 5.2 presents the practical and impractical partitions. Looking at Question 1 from the survey (Ease of use), techniques that involve real projects (e.g., case study, legacy data, and project monitoring) are all considered practical techniques with respect to this criterion. Yet none of these are viewed as practical with respect to internal validity (i.e., measuring what one wants to measure), and only legacy data is viewed as practical with respect to external validity (i.e., the results can be generalized to an entire population). Not too surprisingly, the controlled experiments (replicated and synthetic) and the theoretical analysis were viewed as impractical techniques with respect to ease of use.
Criterion | Practical | Impractical
Ease of use | Case study, Legacy data, Proj. mon. | Replicated, Synthetic, Theoretical
Addit. $ | Legacy data, Proj. mon. | Replicated, Synthetic, Theoretical
Int. val. | Dyn. anal., Simulation, Static anal. | Proj. mon., Theoretical
Ext. val. | Field study, Legacy data | Synthetic, Theoretical
Ease of repl. | Simulation | Field study, Les. learned
Theory gen. | Field study | Proj. mon.
Theory conf. | Simulation, Theoretical | Proj. mon.
Table 5.2. Practical and impractical techniques.
The data shown in Figure 5.1 is aggregate data. It is interesting to separate the two populations - the research workers and the professional developers - that make up this aggregate. Separating the data from Figure 5.1 into the two sample populations shows how each group viewed the same criteria from a different background perspective. Tables 5.3 and 5.4 present the practical and impractical techniques for these two populations according to the same rules of significance used in Table 5.2. In comparing Tables 5.3 and 5.4, three clear differences and one similarity between the two groups become evident. Consider first techniques such as dynamic analysis and static analysis, which simply "exercise" the program in a laboratory setting. The research group (Table 5.3) considers such a "laboratory" validation as practical with respect to ease of use. This is not so with the industrial group. In a similar vein, replication is viewed as practical by the research community for several of the questions, but never by the industrial community.
Practical
Ease of repl. Theory gen.
Theory conf.
Dyn. anal. Replicated
Dyn. anal. Simulation Static anal.
Replicated
Case study
Case study Field study Les. learned
Legacy data
Ease of use
Addit. $
Int. val.
Dyn. Anal Les. Learned Legacy data Static anal.
Legacy data Proj. mon. Static anal. Replicated
Impractical Replicated Synthetic
Ext. val.
Table 5.3. Practical and impractical techniques from research sample. Ease of use Addit. $ Practical
Int. val.
Ext. val.
Case study
Ease of repl.
Case study Case study Case study Legacy data Legacy data Dyn. Anal. Legacy data Proj. mon. Proj. mon. Simulation Lit. search Case study
Impractical Replicated Synthetic
Replicated Synthetic
Proj. mon. Synthetic Theory anal. Theory anal.
Theory gen.
Theory conf.
Case study Field study Field study Theory anal. Proj. mon. Proj. mon.
Theory anal. Theory anal.
Table 5.4. Practical and impractical techniques from developer sample.

Second, consider how the two groups differed in their belief in the effectiveness of theoretical analysis with respect to internal and external validity (Questions 3 and 4). Whereas the research group considered a theoretical validation as likely to be used as much as any other technique (i.e., in the neutral partition), the industrial group considered it most difficult to use. The industrial group preferred instead the "hands on" techniques of case study and legacy data over the more formal arguments. Finally, case study is an interesting technique (bold in Tables 5.3 and 5.4) that clearly shows the different biases of the two populations. The research community considers it particularly impractical with respect to internal validity and ease of replication, two important criteria for determining repeatability of a phenomenon. Yet the industrial community considers it practical for these two criteria. The authors can only guess at why this is so. They offer two possible hypotheses:

1. The research community generally deals in formal theories and cause-effect relationships, and human subjects (the "objects" of study in a case
study) are not precise. Measuring human performance is, therefore, viewed as suspect. On the other hand, the industrial community is generally wary of laboratory research results, so puts great faith in industrial experiences. 2. The research community has less access to developers and thus would find case study research hard to do. On the other hand, developers are well acquainted with other developers and would find such studies easier to accomplish. More in depth studies would be needed to distinguish between these two (or any other) hypotheses. None of the other criteria exhibited significant differences among the respondents. However, when combining the criteria into a single composite number for each technique, other differences do become apparent, as demonstrated in the next section. Several similarities exist within the first three questions. The strongest of these arises in the first two (Easy to use, and Additional cost) where both groups thought legacy data was useful. Furthermore, both also thought that replicated studies were impractical. This commonality provides a starting place to bridge the gap between the two groups.
[Figure 5.2 is a bar chart of the relative value assigned to each question, with three bars (research, industry, student) per question: 1 easy to do, 2 additional $, 3 internal validity, 4 external validity, 5 ease of replication, 6 theory generation, 7 theory confirmation.]
Fig. 5.2. Relative importance of each criterion.
5.4.3 Relative importance of each criterion

A final, eighth question of the survey was to rate the relative importance of each of the other 7 questions when making a decision on using a new technology. The purpose was to determine which of the criteria was viewed as most important when making such a decision. Figure 5.2 summarizes those answers on a single chart, with 3 columns for each question representing the three separate populations that were surveyed. (Remember that lower scores signify more important criteria.) Figure 5.2 shows that the two samples made up mostly of industrial developers (Samples 2 and 3) agreed more closely with each other than with the research sample (Sample 1). This provides some internal validity to the rating scale for this study. Furthermore, Figure 5.2 shows that the participants in Samples 2 and 3 viewed easy to do, internal validity (that the validation confirmed the effectiveness of the technique) and the ease of replicating the experiment as the most important criteria in choosing a new method. While internal validity was important, external validity was of less crucial concern. That can be interpreted as the self-interest of industry in choosing methods applicable to its own environment, with less concern about whether a method also aided a competitor. In contrast, for the research community, internal and external validity (the ability of the validation to demonstrate the effectiveness of the technique in the experimental sample and also to generalize to other samples) were the primary criteria. Confirming a theory was next, obviously influenced by the research community's orientation toward developing new theoretical foundations for technology. At the other end of the scale, cost was of less concern; it rated as last. Taken collectively, this addresses some of the problems raised at the beginning of this chapter. The research community is more concerned with theory confirmation and the validity of the experiment and less concerned about costs, whereas the industrial community is more concerned about costs and applicability in its own environment and less concerned about general scientific results, which can aid the community at large.
5.4.4 A composite measure

One final view of the data is illustrative. A quantitative comparison of all of the validation techniques may show whether any one of them was considered, in general, more important than the others. The authors generated a composite measure for evaluating the effectiveness of the various validation methods using the relative importance of the 7 given criteria. Since the respondents provided their impressions of the relative importance of each of the 7 criteria and the relative importance of each method for each criterion, it was easy to compute the weighted sum of all the criteria evaluations:

method_j = Σ_i c_i × v_i

where c_i is the average value of the i-th criterion for method j and v_i is the importance of that criterion (from Figure 5.2). In this case, the lowest composite value would determine the most significant method. Table 5.5 presents these results.
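As a rough illustration of this weighted sum, the following Python sketch uses invented criterion weights and average scores, not the survey's actual data:

    # Relative importance of each criterion (lower = more important), invented.
    importance = {"easy": 6.0, "cost": 9.0, "int_val": 5.0, "ext_val": 7.0,
                  "repl": 6.5, "gen": 8.0, "conf": 7.5}

    # Average score of each method on each criterion (lower = better), invented.
    avg_scores = {
        "simulation": {"easy": 6, "cost": 7, "int_val": 7, "ext_val": 9,
                       "repl": 6, "gen": 8, "conf": 7},
        "case study": {"easy": 5, "cost": 9, "int_val": 11, "ext_val": 9,
                       "repl": 12, "gen": 7, "conf": 9},
        "replicated": {"easy": 12, "cost": 14, "int_val": 6, "ext_val": 8,
                       "repl": 9, "gen": 9, "conf": 6},
    }

    # Composite measure: method_j = sum over criteria i of c_i * v_i,
    # where c_i is the method's average score and v_i the criterion weight.
    composite = {m: sum(scores[c] * importance[c] for c in importance)
                 for m, scores in avg_scores.items()}

    # The lowest composite value identifies the most significant method.
    for method, value in sorted(composite.items(), key=lambda kv: kv[1]):
        print(method, round(value, 1))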
Sample 1 (Research) | Sample 3 (Prof. student) | Sample 2 (Industry)
Simulation 288 | Case study 284 | Project monitoring 258
Static analysis 292 | Legacy data 314 | Legacy data 305
Dynamic analysis 298 | Field study 315 | Theoretical analysis 324
Project monitoring 301 | Simulation 333 | Literature search 325
Lessons learned 339 | Dynamic analysis 355 | Case study 326
Legacy data 345 | Static analysis 361 | Field study 327
Synthetic study 346 | Literature search 370 | Pilot study 329
Theoretical analysis 348 | Replicated experiment 387 | Feature benchmark 338
Field study 363 | Project monitoring 388 | Demonstrator project 345
Literature search 367 | Lessons learned 391 | Replicated project 361
Replicated experiment 368 | Theoretical analysis 405 | External 407
Case study 398 | Synthetic study 418 | Vendor opinion 469
Table 5.5. Composite measures.

Table 5.5 reveals some interesting observations:

1. For the research community, tool-based techniques dominate the rankings. Simulation, static analysis, and dynamic analysis are techniques that are easy to automate and can be handled in the laboratory. On the other hand, techniques that are labor intensive and require interacting with industrial groups (e.g., replicated experiment, case study, field study, legacy data) are at the bottom of the list. This confirms anecdotal experiences of the authors over the past 25 years; working with industry on real projects certainly is harder to manage than building evaluation tools in the lab.
2. For the industrial community (the professional student population), almost the opposite seems true. Those techniques that can confirm a technique in the field using industry data (e.g., case study, field study, legacy data) dominate the rankings, while "artificial" environments (e.g., theoretical analysis, synthetic study) are at the bottom. Again, this seems to support the concept that industrial professionals are more concerned with the effectiveness of the techniques in live situations than with simply validating a concept.

3. The industrial group evaluating the industrial validation methods cannot be compared directly with the other two groups, since the methods they evaluated were different; however, there are some interesting observations. For one, project monitoring (measurement - the continual collection of data on development practices) clearly dominates the ranking. The situation has apparently not changed much since a 1984 study conducted by the University of Maryland [Zelkowitz84]. In that earlier survey, the authors found that data was owned by individual project managers and was not available to the company as a whole in order to build corporate-wide experience bases. This is surprising considering the difficulty the software engineering measurement community has been having in getting industry to recognize the need to measure development practices. With models like the Software Engineering Institute's Capability Maturity Model (CMM), the SEI's Personal Software Process (PSP) and Basili's Experience Factory [Basili88] promoting measurement, perhaps the word is finally getting out about the need to measure. But actual practice does not seem to agree with the desires of the professionals in the field. For example, theoretical analysis came out fairly high in this composite score, but that does not seem to relate to experiences in the field and may be wishful thinking.

4. Finally, within the industrial group, the need to be state of the art (the edict classification) came near the bottom of the list (11th out of 12) as not important. Basing decisions on vendor opinions was last. Yet image (being state-of-the-art) and vendors often influence the decision-making process. Furthermore, vendor opinion was also judged to be least effective with respect to internal and external validity. Apparently, since vendor opinion was judged to be one of the easiest to do, users rely on it even though they know the results are not to be trusted.
6. Discussion

Software engineering has been described as being in its alchemy stage. However, some widely recognized scientific principles are emerging. One of these is the importance of validating software-engineering techniques. This chapter identifies the experimental validation techniques in use by two communities: researchers and practitioners. The study identified 14 research models found to be in use in the research community. These models are driven by the demands on the research community and reflect the biases and reward system of that community. Similarly, there are 15 different models in use by industry. These reflect the ultimate goal of industrial software engineering, which is to solve a problem given a set of constraints. In this case, the problem is the production of a piece of software and the main constraint is funding. It is clear from comparing these techniques that the research community is primarily focused on exploratory methods while industry focuses on confirmatory techniques.

There are many similarities between the two sets of models. These provide a place to begin bridging the gap between the two communities. However, the need to better understand the relationships between models exists. Some of these relationships were explored in Section 5. Section 5 also provides an example of how to set up and run an experiment. This chapter is in essence an example of data mining and a field study of software engineering validation techniques. The results of this experiment facilitate the understanding of the models currently used by the two communities and the connections between them. This, in turn, facilitates technology transfer, the ultimate goal.

Technology transfer is known to be a difficult process. A 1985 study by Redwine and Riddle showed that a typical software technology took up to 17 years to move from the research laboratory to general practice in industry [Redwine85]. (This time is consistent with other engineering technologies.) In fact, many new technologies do not even last 20 years! But once developed, it often takes up to 5 years for a new technology to be fully integrated into any one organization [Zelkowitz96]. Because of the short lifecycle of many critical technologies, we need to understand the transition process better in order to enable effective methods to be adopted more rapidly. This chapter is a first step toward understanding the models in current use by the two communities. To formalize relationships between them and
to better understand the universe of possible techniques, it is useful to have a formalism in which to place the models. One such formalization is the three-faceted approach of Shaw. Each method (an "experiment" in Shaw's terminology) fulfills three facets [Shaw01]:

1. Question - Why was the research done?
2. Strategy - How was it done?
3. Validation - How does one know it worked?

This resulted in a 3 by 5 matrix (Table 6.1), where a research project is the result of choosing one item from each column.
Question | Strategy | Validation
Feasibility | Qualitative model | Persuasion
Characterization | Technique | Implementation
Method | System | Evaluation
Generalization | Empirical model | Analysis
Selection | Analytic model | Experience
Table 6.1. Shaw's research validation model.
A case study that tries software inspections on an important new development could be classified according to this model as:

1. Question: Feasibility - Do software inspections work?
2. Strategy: Technique - Apply it on a real project.
3. Validation: Experience - See if it has a positive effect.

On the other hand, determining whether inspections or code walkthroughs are more effective could be classified as:

1. Question: Selection - Are inspections or walkthroughs more effective?
2. Strategy: Technique - Apply them on multiple real projects.
3. Validation: Evaluation - Use them and compare the results.

This chapter contains definitions for strategies and validation mechanisms as shown by the authors' research. Persuasion is most similar to the assertion method in the authors' taxonomy. In fact, this classification can be mapped into Shaw's model as given in Table 6.2.
Method | Strategy | Validation
Project monitoring | Technique | Persuasion
Case study | System | Experience
Field study | Qualitative model | Evaluation
Literature search | Technique | Evaluation
Legacy data | Empirical model | Evaluation
Lessons learned | Qualitative model | Persuasion
Static analysis | System | Evaluation
Replicated | Empirical model | Evaluation
Synthetic | Empirical model | Evaluation
Dynamic analysis | System | Evaluation
Simulation | Technique | Evaluation
Theoretical | Analytic model | Analysis
Assertion | Technique or System | Persuasion or Experience
No validation | — | Persuasion
Table 6.2. Research validation techniques modeled in Shaw's model.

Given the assumption that some new means of performing a task of software engineering has been proposed or has been attempted, the following may serve as generic definitions for the questions:

• Feasibility: the possibility or probability that a software engineering task may be conducted successfully by applying the method or technique X under study to a software engineering problem.
• Characterization: those features which identify X: its purpose, that is, what problem does it attempt to solve; its measurable characteristics; its differences from other solutions to the same problem; the circumstances under which it may be applied.
• Method / Means: the procedures by which X may be accomplished. In addition to the actual conduct of X, these include the potential for automating X and for incorporating it into an existing software engineering paradigm or culture.
• Generalization: the features or process steps that specify how X may be applied generally beyond the specific research example where it has been used.
• Discrimination: the criteria by which X is to be judged for use.
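As a rough illustration only, the mapping of Table 6.2 can be treated as a small lookup structure for classifying a study; the sketch below is a reading of that table in Python, not part of Shaw's formulation.

    # Strategy and validation facets for each research method (from Table 6.2).
    shaw_facets = {
        "project monitoring": ("Technique", "Persuasion"),
        "case study":         ("System", "Experience"),
        "field study":        ("Qualitative model", "Evaluation"),
        "literature search":  ("Technique", "Evaluation"),
        "legacy data":        ("Empirical model", "Evaluation"),
        "lessons learned":    ("Qualitative model", "Persuasion"),
        "static analysis":    ("System", "Evaluation"),
        "replicated":         ("Empirical model", "Evaluation"),
        "synthetic":          ("Empirical model", "Evaluation"),
        "dynamic analysis":   ("System", "Evaluation"),
        "simulation":         ("Technique", "Evaluation"),
        "theoretical":        ("Analytic model", "Analysis"),
    }

    def classify(question, method):
        """Return the (question, strategy, validation) triple for a study."""
        strategy, validation = shaw_facets[method]
        return question, strategy, validation

    # e.g., a feasibility question investigated through a case study.
    print(classify("Feasibility", "case study"))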
Identifying the models in current use is a first step in the understanding of technology transfer in the software field. A formal understanding of the different models used to evaluate software is a second step. While the final step is still far off, the efficient transfer of software technology is necessary.
References

[Adrion93] Adrion W. R., Research methodology in software engineering, Summary of the Dagstuhl Workshop on Future Directions in Software Engineering, W. Tichy (Ed.), ACM SIGSOFT Software Engineering Notes, 18, 1, (1993).
[Basili02] Basili V., F. McGarry, R. Pajerski and M. Zelkowitz, Lessons learned from 25 years of process improvement: The rise and fall of the NASA Software Engineering Laboratory, IEEE Computer Society and ACM International Conf. on Soft. Eng., Orlando, FL, May 2002.
[Basili95] Basili V., M. Zelkowitz, F. McGarry, J. Page, S. Waligora and R. Pajerski, SEL's software process improvement program, IEEE Software 12, 6 (1995) 83-87.
[Berry00] Berry M. and G. Linoff, Mastering Data Mining, John Wiley & Sons, 2000.
[Brooks87] Brooks F., No Silver Bullet: Essence and Accidents of Software Engineering, IEEE Computer (1987), 10-19.
[Brown96] Brown A. W. and K. C. Wallnau, A framework for evaluating software technology, IEEE Software, (September, 1996) 39-49.
[Campbell63] Campbell D. and J. Stanley, Experimental and quasi-experimental designs for research, Rand McNally, Chicago, (1963).
[Daly97] Daly J., K. El Emam and J. Miller, Multi-method research in software engineering, 1997 IEEE Workshop on Empirical Studies of Software Maintenance (WESS '97), Bari, Italy, October 3, 1997.
[Fenton94] Fenton N., S. L. Pfleeger and R. L. Glass, Science and substance: A challenge to software engineering, IEEE Software, Vol. 11, No. 4, 1994, 86-95.
[Kitchenham96] Kitchenham B. A., Evaluating software engineering methods and tools, ACM SIGSOFT Software Engineering Notes, (January, 1996) 11-15.
[Miller99] Miller J., Can software engineering experiments be safely combined?, IEEE Symposium on Software Metrics (METRICS'99), Bethesda, MD (November 1999).
[Redwine85] Redwine S. and W. Riddle, Software technology maturation, 8th IEEE/ACM International Conference on Software Engineering, London, UK, (August, 1985) 189-200.
[Scargle99] Scargle J. D., Publication Bias (The "File-Drawer Problem") in Scientific Inference, Sturrock Symposium, Stanford University, Stanford, CA, (March, 1999).
[Shaw01] Shaw M., Keynote presentation, International Conference on Software Engineering, Toronto, Canada, May 2001 (http://www.cs.cmu.edu/~shaw).
[Snow63] Snow C. P., The two cultures and the scientific revolution, New York: Cambridge University Press, 1963.
[Thorngate76] Thorngate W., "In General" vs "It Depends": Some comments on the Gergen-Schlenker debate, Personality and Social Psychology Bulletin 2, 1976, 404-410.
[Tichy95] Tichy W. F., P. Lukowicz, L. Prechelt and E. A. Heinz, Experimental evaluation in computer science: A quantitative study, J. of Systems and Software, Vol. 28, No. 1, 1995, 9-18.
[Tichy98] Tichy W., Should computer scientists experiment more?, IEEE Computer, 31, 5, 1998, 32-40.
[Zelkowitz84] Zelkowitz M. V., Yeh R. T., Hamlet R. G., Gannon J. D. and Basili V. R., Software engineering practices in the United States and Japan, IEEE Computer 17, 6 (1984) 57-66.
[Zelkowitz96] Zelkowitz M. V., Software engineering technology infusion within NASA, IEEE Trans. on Eng. Mgmt. 43, 3 (August, 1996) 250-261.
[Zelkowitz97] Zelkowitz M. and D. Wallace, Experimental validation in software engineering, Information and Software Technology, Vol. 39, 1997, 735-743.
[Zelkowitz98] Zelkowitz M. and D. Wallace, Experimental models for validating technology, IEEE Computer, 31, 5, 1998, 23-31.