TRUSTWORTHY COMPUTING Analytical and Quantitative Engineering Evaluation
M. SAHINOGLU, Ph.D. Department of Computer Science Troy University–Montgomery Campus Montgomery, Alabama
A JOHN WILEY & SONS, INC., PUBLICATION
Copyright © 2007 by John Wiley & Sons, Inc. All rights reserved. Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

Library of Congress Cataloging-in-Publication Data:
Sahinoglu, Mehmet.
Trustworthy computing: analytical and quantitative engineering evaluation / by M. Sahinoglu.
ISBN 978-0-470-08512-7 (cloth)
1. Computer security. 2. Computer software—Reliability. 3. Computer systems—Reliability. I. Title.
QA76.9.A25S249 2007
005.8—dc22
2006033567
To my late mother, Mehpare, for teaching me to be always kind and forgiving, and to my late father, Kamil, for advising me to be patient, calm, resolute, and never to give up my ideals and dreams. To my wife of 25 years, Suna, for her care, compassion, and loving devotion to her family. To my falcon sons, Gokturk, Efe, and Hakan, for their hard work and self-confidence.
CONTENTS

Foreword
Preface

1 Fundamentals of Component and System Reliability and Review of Software Reliability
  1.1 Functions of Importance in Reliability
  1.2 Hazard Rate Functions in Reliability
  1.3 Common Distributions and Random Number Generations
    1.3.1 Uniform (Rectangular) p.d.f.
    1.3.2 Triangular p.d.f.
    1.3.3 Negative Exponential p.d.f., Pareto, and Power Functions
    1.3.4 Gamma, Erlang, and Chi-Square p.d.f.'s
    1.3.5 Student's t-Distribution
    1.3.6 Fisher's F-Distribution
    1.3.7 Two- and Three-Parameter (Sahinoglu–Libby) Beta p.d.f.'s
    1.3.8 Poisson p.m.f.
    1.3.9 Bernoulli, Binomial, and Multinomial p.m.f.'s
    1.3.10 Geometric p.m.f.
    1.3.11 Negative Binomial and Pascal p.m.f.'s
    1.3.12 Weibull p.d.f.
    1.3.13 Normal p.d.f.
    1.3.14 Lognormal p.d.f.
    1.3.15 Logistic p.d.f.
    1.3.16 Cauchy p.d.f.
    1.3.17 Hypergeometric p.m.f.
    1.3.18 Extreme Value (Gumbel) p.d.f.'s
    1.3.19 Summary of the Distributions and Relationships Most Commonly Used
  1.4 Life Testing for Component Reliability
    1.4.1 Estimation Methods for Complete Data
    1.4.2 Estimation Methods for Incomplete Data
  1.5 Redundancy in System Reliability
    1.5.1 Series System Reliability
    1.5.2 Active Parallel Redundancy
    1.5.3 Standby Redundancy
    1.5.4 Other Redundancy Limitations: Common-Mode Failures and Load Sharing
  1.6 Review of Software Reliability Growth Models
    1.6.1 Software Reliability Models in the Time Domain
    1.6.2 Classification of Reliability Growth Models
  Appendix 1A: 500 Computer-Generated Random Numbers
  References
  Exercises

2 Software Reliability Modeling with Clustered Failure Data and Stochastic Measures to Compare Predictive Accuracy of Failure-Count Models
  2.1 Software Reliability Models Using the Compound Poisson Model
    2.1.1 Notation and Introduction
    2.1.2 Background and Motivation
    2.1.3 Maximum Likelihood Estimation in the Poisson∧Geometric Model
    2.1.4 Nonlinear Regression Estimation in the Poisson∧Geometric Model
    2.1.5 Calculation of Forecast Quality and Comparison of Methods
    2.1.6 Discussion and Conclusions
  2.2 Stochastic Measures to Compare Failure-Count Reliability Models
    2.2.1 Introduction and Motivation
    2.2.2 Definitions and Notation
    2.2.3 Model, Data, and Computational Formulas
    2.2.4 Prior Distribution Approach
    2.2.5 Applications to Data Sets and Computations
    2.2.6 Discussion and Conclusions
  References
  Exercises

3 Quantitative Modeling for Security Risk Assessment
  3.1 Decision Tree Model to Quantify Risk
    3.1.1 Motivation
    3.1.2 Risk Scenarios
    3.1.3 Quantitative Security Meter Model
    3.1.4 Model Application and Results
    3.1.5 Modifying the Quantitative Model for Qualitative Data
    3.1.6 Hybrid Security Meter Model for Both Quantitative and Qualitative Data
    3.1.7 Simulation Study and Conclusions
  3.2 Bayesian Applications for Prioritizing Software Maintenance
    3.2.1 Motivation
    3.2.2 Bayesian Rule in Statistics and Applications for Software Maintenance
    3.2.3 Another Bayesian Application for Software Maintenance
    3.2.4 Monte Carlo Simulation to Verify the Bayesian Analysis Proposed
    3.2.5 Discussion and Conclusions
  3.3 Quantitative Risk Assessment for Nondisjoint Vulnerabilities and Nondisjoint Threats
    3.3.1 Motivation Behind the Disjoint Notion of Vulnerabilities and Threats
    3.3.2 Fundamental Probability Laws of Independence, Conditionality, and Disjointness
    3.3.3 Security Meter Modified for Nondisjoint Vulnerabilities and Disjoint Threats
    3.3.4 Security Meter Modified for Nondisjoint Vulnerabilities and Nondisjoint Threats
    3.3.5 Discussion and Conclusions
  3.4 Simple Statistical Design to Estimate the Security Meter Model Input Data
    3.4.1 Estimating the Input Parameters in the Security Meter Model
    3.4.2 Statistical Formulas Used to Estimate Inputs in the Security Meter Model
    3.4.3 Numerical Example of the Statistical Design for the Security Meter Model
    3.4.4 Discrete Event (Dynamic) Simulation
    3.4.5 Monte Carlo (Static) Simulation
    3.4.6 Risk Management Using the Security Meter Model
    3.4.7 Discussion and Conclusions
  3.5 Statistical Inference to Quantify the Likelihood of Lack of Privacy
    3.5.1 Introduction: What Is Privacy?
    3.5.2 How to Quantify Lack of Privacy
    3.5.3 Numerical Applications for a Privacy Risk Management Study
    3.5.4 Discussion and Conclusions
  Appendix 3A: Comparison of Various Risk Assessment Approaches and CINAPEAAA
  Appendix 3B: Brief Introduction to Encryption, Decryption, and Types
  Appendix 3C: Attack Trees
  Appendix 3D: Capabilities-Based Attack Tree Analysis
  Appendix 3E: Time-to-Defeat Model
  References
  Exercises

4 Stopping Rules in Software Testing
  4.1 Effort-Based Empirical Bayesian Stopping Rule
    4.1.1 Stopping Rule in Test Case–Based (Effort) Models
    4.1.2 Introduction and Motivation
    4.1.3 Notation, Compound Poisson Distribution, and Empirical Bayes Estimation
    4.1.4 Stopping Rule Proposed for Use in Software Testing
    4.1.5 Applications and Results
    4.1.6 Discussion and Conclusions
  Appendix 4A: Analysis Tables
  Appendix 4B: Comparison of the Proposed CP Rule with Other Stopping Rules
  Appendix 4C: MESAT-1 Output Screenshots and Graphs
  4.2 Stopping Rule for High-Assurance Software Testing in Business
    4.2.1 Introduction
    4.2.2 EVM Methodology
    4.2.3 Typical SDLC Testing Management
    4.2.4 New View of Testing
    4.2.5 Case Study
    4.2.6 Discussion and Conclusions
  4.3 Bayesian Stopping Rule for Testing in the Time Domain
    4.3.1 Introduction
    4.3.2 Review of the Compound Poisson Process
    4.3.3 Stopping Rule
    4.3.4 Bayes Analysis for the Poisson∧Geometric Model
    4.3.5 Empirical Bayesian Stopping Rule
    4.3.6 Computational Example
    4.3.7 Discussion and Conclusions
  Appendix 4D: MESAT-2 Applications and Results
  References
  Exercises

5 Availability Modeling Using the Sahinoglu–Libby Probability Distribution Function
  5.1 Nomenclature
  5.2 Introduction and Motivation
  5.3 Sahinoglu–Libby Probability Model Formulation
  5.4 Bayes Estimators for Various Informative Priors and Loss Functions
    5.4.1 Squared-Error Loss Function
    5.4.2 Absolute-Error Loss Function
    5.4.3 Weighted Squared-Error Loss Function
  5.5 Availability Calculations for Simple Parallel and Series Networks
  5.6 Discussion and Conclusions
  Appendix 5A: Derivation of the Sahinoglu–Libby p.d.f.
  Appendix 5B: Derivation of the Bayes Estimator for Weighted Squared-Error Loss
  References
  Exercises

6 Reliability Block Diagramming in Complex Systems
  6.1 Introduction and Motivation
  6.2 Simple Illustrative Example
  6.3 Compression Algorithm and Various Applications
  6.4 Hybrid Tool to Compute Reliability for Complex Systems
  6.5 More Supporting Examples for the Hybrid Form
  6.6 New Polish Decoding (Decompression) Algorithm
  6.7 Overlap Technique
    6.7.1 Overlap Ingress–Egress Reliability Method
    6.7.2 Overlap Ingress–Egress Reliability Algorithm
  6.8 Multistate System Reliability Evaluation
    6.8.1 Simple Series System
    6.8.2 Active Parallel System
    6.8.3 Simple Parallel–Series System
    6.8.4 Simple Parallel System
    6.8.5 Combined System
  6.9 Discussion and Conclusions
  Appendix 6A: Overlap Algorithm Described
  Appendix 6B: Overlap Ingress–Egress Reliability Algorithm Applied, Example 1
  Appendix 6C: Overlap Ingress–Egress Reliability Algorithm Applied, Example 2
  References
  Exercises

Index
FOREWORD
Professor Mehmet Sahinoglu, Distinguished Chair Professor and Eminent Scholar of Troy University, Montgomery, Alabama, and a recipient of the 2006 Microsoft International Trustworthy Computing Curriculum Research Award, has written a new book titled Trustworthy Computing: Analytical and Quantitative Engineering Evaluation, integrating the various aspects, theories, and practices underlying software reliability, testing, and security engineering. The author has been a very prolific and creative researcher for many years in the areas of computer reliability, security, software engineering, and statistics. Rarely do we encounter active researchers like him taking time off to write books to convey advanced and innovative ideas on their subjects in ways that help us to understand complex topics and to realize and build on the concepts for practical benefit and use. It is an excellent and unique book, definitely a seminal contribution and the first of its kind. In my humble opinion, it is an outstanding addition to one of the most important areas of information technology. Professor Sahinoglu has pioneered on several major research fronts. He developed the Sahinoglu–Libby probability distribution to model and characterize the behavior of failure patterns in components, networks, and software systems. He pioneered the development of optimal algorithms and stopping rules to terminate software testing based on economic and specification requirements. Most recently, he created the concept of the security meter, which is a fast decision-theoretic tool that evaluates the ability of a set of protective measures to provide a required level of security for the system. The book itself is a commendable achievement, and it deals with security and software reliability theory in an integrated fashion, with emphasis on practical applications to software engineering and information technology. With his new book on the shelf, Dr. Sahinoglu generously shares with his readers and the scientific world, on the twenty-fifth anniversary of his Ph.D. from Texas A&M, a new vision: How can I best quantify the risk to improve the trustworthiness of cyber systems? Professor Sahinoglu is both an internationally renowned statistician and an outstanding software engineer who has been on the faculties of Case Western Reserve and
Purdue universities. The book emphasizes the theoretical foundations of the topics and provides unique insights based on his past and ongoing research in software reliability theory, security, and software engineering. I recommend this book not only for academia but also for the practicing engineer with an eye for innovative techniques to add value to his or her project solutions to improve risk quantification and trustworthy computing. I congratulate and commend him on his superb contribution to information technology. C. V. RAMAMOORTHY University of California–Berkeley
Computing trustworthiness is a fundamental issue in today’s highly connected world with its increasing risks of malicious attacks on our computers. Although the book is written primarily as a textbook for upper undergraduate and graduate students, I highly recommend this book to any professional hardware/software engineer, as it provides a truly comprehensive understanding of how to make sure that any computing device could be worthy of the trust of its users. The author provides a vigorous data-analytic approach for understanding the risks by objectively utilizing a multitude of quantitative modeling and estimation techniques. A very important aspect of this book is the use of economical effectiveness measures, such as cost-effective stopping rules, to make sure that there is a good return on an investment in risk mitigation. I especially like the book’s CD-ROM, in which hot links are provided for special terms to make the reading and computing easier. The book is well written, with many lucid illustrations of case studies. RAYMOND YEH University of Texas
Professor Sahinoglu's work in the area of quantitative measurement of trustworthiness as to reliability, security, and privacy risks is truly visionary and significantly ahead of market capabilities. His work addresses a long-standing shortcoming in the information security industry—that is, a means to accurately measure the risk of compromise as well as the discrete financial impact of various security and privacy events. Dr. Sahinoglu's book, while meant primarily as an academic textbook rather than a field guide, does provide a broad foundation of knowledge for the reader to apply in the analysis and prioritization of security and privacy risks. While the concepts are still somewhat heady for the field engineer, once they are grasped they can be adapted to a broad array of scenarios. Unlike many scenario-specific methodologies used in industry today, the information provided by Professor Sahinoglu and outlined in this book can be applied across many security disciplines and domains and should have a long future in industry. Mehmet's book delivers groundbreaking work in our field and should be a resource in every security researcher's library. STEPHEN GOLDSBY Integrated Computer Solutions, Inc.
PREFACE
This book was implanted in my young mind by my late pediatrician and congressman father (1919–1999), himself a child of deprivation during the Turkish War of Independence (1922), who advised me always to carry an extra banknote, and house and car keys, in my wallet to reduce the risk of misery and increase security! Trustworthy Computing was written during a period of six years (2000–2006) while teaching a course for students and practitioners on the recognition of data-analytical and metric aspects of security and reliability modeling, dealing with evaluating software and hardware quality and security. The course traditionally covered topics on trustworthy computing. However, over the years, I was not able to identify a single book that integrated coverage of applied and quantitative concepts dealing with security and reliability. The goal of this book is to establish metrics or indexes to identify the common enemy—malicious and nonmalicious risk—so as not solely to qualify the imminent danger within the conventional standards of high, medium, or low risk, but also to quantify it. A cross product of computer security and reliability measures constitutes a concern that dominates today's world, which is now definitely data-driven, no longer verbal. Numerical data on security breaches and chance failures surround us. Innocent and malicious risk data must be collected, analyzed, and processed objectively to convert them into useful information, not only to inform but also to instruct, answer, or aid in decision making as to how to combat the disastrous consequences in the computer-addicted world of industry, commerce, finance, science, and technology. Chance-failure-based reliability, once the mainstay, is now overshadowed by the lack of security caused by malicious failures. Unless these hostile or nonhostile problems are dealt with scientifically and objectively by employing data modeling techniques to create quantifiable indexes or metrics, no sound budgetary assessment can be obtained; guessing and acting with complacent subjectivism will not do. Students or other readers of this book should have fundamental training in statistics and probability, or be cognizant of the interpretation of scientific data. Of course, the assumption is that the empirical data used in the solution of problems are measurement error–free,
random, and unbiased. The book CD-ROM focuses on helping the reader to solve problems and to gain a sense of industrial experience. The objective is to provide an elementary and reasonably self-contained overview of the engineering aspects of trustworthiness in the general sense of the word, integrating reliability, security, and privacy. Every book must have a solid and clear purpose for coming to life. The purpose of a new text such as this is to inform senior undergraduate or beginning graduate students across the board in engineering disciplines about new advances in reliability and security modeling with a metric-based quantitative approach, as opposed to the more common verbal, qualitative, or subjective case histories, which form some of the experiential background in this book. Rather than what this book is about, what is this book not about? This book is not a collection of case stories and already available chapters that can be found in a multitude of fine books; it thus avoids repeating readily available information. It is objective, quantitative, empirical, metric-oriented, and data-driven. However, earlier methods that deal with reaching the new frontiers are also examined. Therefore, in Chapter 1, there is some, but minimal, duplication of material widely available, such as descriptions of the statistical probability distributions accompanied by their respective random number generations, hardware reliability methods for components and systems, and software reliability-growth models. There are references to the CD-ROM to enable students to work with projects that provide hands-on experience in detail. The text is applicable to wireless engineering with an over-the-air medium. The book begins with a review to provide the supplemental material necessary to train readers with no previous knowledge of the basics of reliability theory as it relates to both hardware and software, with practical applications. Although the material is available in many books and tutorials, a general treatment will enable the reader to understand the nomenclature used in the main body of the book without shuffling the pages of other books. In Chapter 1 we also study the simulation alternatives for each statistical probability distribution that exists in the literature, with a few exceptions, to model and calculate system or component availability when the analytical methods have serious shortcomings. The book continues with software reliability modeling of clustered failure data in an effort-based testing environment, taking up the less studied compound Poisson process approach in the first half of Chapter 2. Then, as a follow-up to the first half of the chapter, in the second half we study ways to compare forecast accuracy in a stochastic manner as opposed to the deterministic ways used conventionally. Multifaceted quantitative modeling of security and privacy risk is studied in Chapter 3, from quantitative, qualitative, and hybrid perspectives, with data-analytical applications and estimation techniques for the risk parameters, as well as how to handle nondisjointness of vulnerabilities or threats, and how to prioritize during the maintenance cycle after assessment. Cost-effective stopping rules in an effort- and time-based failure environment are studied in Chapter 4, where economic rules of comparison are emphasized, with applications not only to software and hardware testing but also in the very active business
and government domains. In Chapter 5 we employ the Sahinoglu–Libby probability distribution to model the availability of hardware components in cyber systems. In Chapter 6 we take up the topic of reliability block diagramming to compute source–target reliability using various novel methods for simple and complex embedded systems. Each chapter explains why there is a need for the methods proposed, comparing the material presented with that covered conventionally. All chapters work toward creating mathematical–statistical but engineering-oriented metrics to best quantify the lack of risk, or the reliability, of a system. A thorough course curriculum on how to use this text is given in the CD-ROM. Troy University's Undergraduate/Graduate Catalog lists CS4451 and CS6653, Computer Security and Reliability, 3 credit hours. The inclusion of this course under its actual representative title with a new course number and an improved course description, due partially to Microsoft's trustworthy computing curriculum research grant in 2006, was the result of many hours of work by the author, also in the capacity of Department Head of Computer Science (since 1999), to strengthen and update the CS curriculum with the changing and surprising trends evident at the beginning of the twenty-first century. This is exactly why Troy University launched an IT Colloquium Series of the Millennium. By the time this book is published in the summer of 2007, the series will have invited a distinguished computer scientist to speak each year for eight years, usually on the exciting topic of IT security and reliability. This change is also important because we plan that our students who graduate with a degree in CS will be well equipped with an appreciation of analytical and quantitative measures to assess, compare, and improve the trustworthiness of cyber systems. Students must be "sensitized and proactive before the occurrence of undesirable episodes due to breach of security and poor reliability" and act "security-conscious and reliability-literate." Simply stated, our students cannot afford to be ignorant in terms of software quality and information security concerns. In this book, objective quantification for security and reliability is asserted, not the obscure subjectivity practiced conventionally. Trustworthy computing is important, as stated in the President's Information Technology Advisory Committee report to the President of the United States in February 2005: "Ubiquitous interconnectivity = widespread vulnerability"; "fundamentally new security models and methods are needed"; and "the Federal government should intensify its efforts to promote recruitment and retention of cyber security researchers and students at research universities." The report stressed the need to develop security metrics and benchmarks; economic impact assessment and risk analysis techniques, including risk reduction and cost of defense; and automated tools to assess compliance and/or risk. In addition, in the May–June 2005 issue of IEEE Security and Privacy, the Guest Editor remarks under the title "Infrastructure Security: Reliability and Dependability of Critical Systems": "This special issue of IEEE Security & Privacy focuses on the security, agility, and robustness of large-scale critical infrastructure. Specifically it examines the challenges associated with infrastructure protection for enhanced system security, reliability, efficiency, and quality. The articles in this special
issue go a long way toward addressing two key issues in distributed denial-of-service (DDoS) and the development of a pragmatic approach to quantifying security and calculating risk. M. Sahinoglu describes a security meter, which provides a quantitative technique with an updated repository on vulnerabilities, threats, and countermeasures to calculate risk." Students using this textbook will have hands-on experience with applications-based software, available on the accompanying CD-ROM. The transdisciplinary nature of the Society of Design and Process Science (SDPS), of which the author is an elected Fellow (2002), encouraged the idea of such an interdisciplinary book. It is anticipated that the audience for the book will be advanced undergraduate and beginning graduate students in electrical, computer, and software engineering, in computer science and industrial and systems engineering, or in statistics and operations research departments, in their courses on security, reliability, or the assurance sciences in general, as well as in related programs. The draft Computing Curricula 2004 at http://www.acm.org/education/curricula.html#CC2005 provides a comparative weight of computing topics across the five degree programs: computer engineering, CE; computer science, CS; information systems, IS; information technology, IT; and software engineering, SE. In the tables in the report, such as the one below, Min (≥ 0) represents the minimum called for by the curriculum guidelines, and Max (≤ 5) represents the greatest emphasis one might expect in the typical case of a student who chooses to undertake optional work in that area or who graduates from a school that requires its students to achieve mastery beyond that required by the curriculum reports. In the knowledge areas across the board, the grading of subject matter in this proposal implies increased emphasis, which does not agree with the reality of what is offered, due probably to a lack of specialized textbooks and to the subject not being included in the core program.

Knowledge Area                              CE        CS        IS        IT        SE
                                          Min Max   Min Max   Min Max   Min Max   Min Max
Software verification and validation       1   3     1   2     1   2     1   2     4   5
Software quality                           2   3     2   3     1   2     1   2     3   4
Security issues and principles             2   3     1   4     2   3     1   3     1   3
Security implementation and management     1   2     1   3     1   3     3   5     1   3
Risk management (project safety risk)      2   4     1   1     2   3     1   4     2   4
In Appendix A of the report, Table 4.3 lists a variety of courses for computer engineering and related curricula, such as computer system engineering, software engineering, operating systems, networks, and probability and statistics, all knowledge areas to which this book is related directly or indirectly. The motivation in this alternative textbook is to go outside the box and implement new ideas, which have been tested through peer reviews in prestigious journals on the
assurance sciences. Practicing engineers will be able to use the book to benefit their case studies and projects by using the meticulously prepared CD-ROM. This book is a quantitative data-driven and metric-oriented package on assessing the dependability, and further, trustworthiness of components and systems. Dependability ≈ reliability × security, all in probabilities, where a component is only conditionally reliable (e.g., 95%) assuming that it is 100% secure. If not (e.g., 80% secure), it is only as dependable as the cross product of its reliability and security measures (e.g., 0.95 × 0.8 = 0.76 = 76%). Note that the reliability index may be assisted by a quantitative measure of availability—the readiness for use, also implying maintainability: the ability to undergo repair and evolutions—or safety (the nonoccurrence of catastrophic failures), whichever case applies. Therefore, dependability evolves to the trustworthiness of a component or system such that reliance and trust, both human and electronic, can justifiably be placed on the service delivered to its users (i.e., T = R × S × P). This is why, when multiplied by the privacy index (e.g., 92%)—where security and privacy metrics are proposed in Chapter 3—dependability yields the trustworthiness index (e.g., 0.76 × 0.92 ≈ 0.70 = 70%), which is quantifiably measurable and improvable, and if not, manageable.

Acknowledgments

This huge effort could not succeed in the form of a textbook by Wiley—whose bicentennial in 2007 I am honored to celebrate—without the encouragement that I received from the wisdom and mentorship of a great mind and humble heart, the north star of modern software engineering and science, C. V. Ramamoorthy. Professors Ramamoorthy and Yeh are both humble minds and assets to the international scientific world. In earlier decades, my Ph.D. (1977–1981) supervisors from Texas A&M's Statistics Department, emeritus Professor Larry J. Ringer, later the mayor of College Station, Texas for two terms, and Professors M. Longnecker and Omar Jenkins; and those from the Electrical and Computer Engineering Department, Professors A. D. Patton and C. Singh, both reliability experts and ECE chairmen during their long academic careers, and Professor A. K. Ayoub (deceased), contributed to my academic development, reflected in the textbook, for which I am indebted. In later years, many colleagues have also helped directly or indirectly, with words of experience and expertise, such as Dr. M. Tanik from UAB, Dr. S. Das from Troy University and emeritus from the University of Ottawa, as well as Dr. John Deely and Dr. E. H. Spafford, both from Purdue; and Steve Goldsby, CEO of ICS in Montgomery. David Tyson, my former graduate student at Troy University, and USAF Capt. Rob Barclay, both adjunct faculty and from Gunter AFB in Montgomery, assisted me with dedication whenever I needed help. I thank Ben Rice for his contributions to some of the Chapter 6 material; all the CS students at Troy University who took this course and contributed; and the secretarial staff (Angela L. Crooks in 2005 and 2006; Debbie H. Brooks in 2007), faculty, and administration, for their support. Last, but not least, I would also like to thank George Telecki, Rachel Witmer,
and Angioline Loredo, all of John Wiley & Sons, Inc., for their encouragement and trust, with plenty of understanding, during the development of this text.

Science is to know knowledge
Knowledge is to know your Self
If you don't know your Self
Then what's the point of your studies?
—Yunus Emre, the legendary mystic folk poet (1238–1320)

M. SAHINOGLU
Come, let us rely on friends for once
Let us make life easy on us
Let us be lovers and loved ones
The earth shall be left to no one.
—Yunus Emre, the legendary mystic folk poet (1238–1320)
1 FUNDAMENTALS OF COMPONENT AND SYSTEM RELIABILITY AND REVIEW OF SOFTWARE RELIABILITY

Nutshell 1.0 In this chapter we introduce some of the basic concepts and mathematical and statistical functions used in hardware or software reliability and security evaluation. In the first section of this chapter we review some common statistical properties of density functions. In the second section we introduce some functions of importance in reliability. In the third section we introduce an extended list of statistical distributions that can be used in reliability, and we discuss how to generate random variables of interest using simulation methods. In the fourth section we study the testing of reliability in a variety of data forms. In the fifth section we move away from the components and start dealing with system reliability together with redundancy aspects and limitations. In the final section we introduce and review in depth some basic concepts and models used in software reliability.

1.1 FUNCTIONS OF IMPORTANCE IN RELIABILITY

In reliability theory there are a number of density functions of particular importance because of their theoretical and practical utilization and for their usefulness in illustrating statistical and reliability concepts. In this section, densities generally used are presented together with their more important characteristics. Some of the most commonly used continuous density functions included are the exponential, normal, rectangular, Weibull, lognormal, and gamma. The
characteristics presented are those normally considered important in reliability technology, including the reliability function R(t), hazard function h(t), mean, variance, mode, and region of definition. The derivations of these characteristics are readily available in statistical texts and are not presented here. It will be an excellent exercise for the reader to verify these derivations. The following definitions are pertinent for the derivations. More detail on these definitions is provided throughout.

1. The reliability:

R(t) = \int_t^{t_u} f(t) \, dt    (1)

where f(t) is the probability density function and t_u is the upper bound on the region of definition of f(t).

2. The hazard function:

h(t) = f(t)/R(t)    (2)

3. The mean:

μ = \int_D t f(t) \, dt    (3)

where D (domain) is the region of definition of f(t).

4. The variance:

σ² = \int_D (t − μ)² f(t) \, dt = \int_D t² f(t) \, dt − μ²    (4)

5. The mode is that value of t (if it exists) such that f(t) is a maximum there for densities with a single maximum.

6. M is the median or 50th percentile if

0.50 = \int_0^M f(t) \, dt    (5)
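To make definitions 1 to 6 concrete, here is a minimal numerical sketch in Python (an illustration added to this edition, not part of the original text; the negative exponential density with rate λ = 0.5 is assumed), whose results can be checked against the closed forms R(t) = e^{−λt}, h(t) = λ, μ = 1/λ, and M = (ln 2)/λ:

```python
import numpy as np
from scipy import integrate, optimize

lam = 0.5                                   # assumed failure rate, for illustration
f = lambda t: lam * np.exp(-lam * t)        # the p.d.f.

def R(t):                                   # Eq. (1): integrate f from t to the upper bound
    return integrate.quad(f, t, np.inf)[0]

def h(t):                                   # Eq. (2): hazard = f(t)/R(t)
    return f(t) / R(t)

mean = integrate.quad(lambda t: t * f(t), 0, np.inf)[0]             # Eq. (3)
var = integrate.quad(lambda t: (t - mean)**2 * f(t), 0, np.inf)[0]  # Eq. (4)
# Eq. (5): the median M solves 0.50 = integral of f from 0 to M
M = optimize.brentq(lambda m: integrate.quad(f, 0, m)[0] - 0.5, 1e-9, 100)

print(R(2.0), np.exp(-lam * 2.0))   # both ~0.368
print(h(2.0), lam)                  # constant hazard for the exponential
print(mean, 1/lam, var, 1/lam**2)   # 2.0, 2.0, 4.0, 4.0
print(M, np.log(2)/lam)             # both ~1.386
```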
There are several functions of fundamental importance in modern reliability engineering [1]. Most of these are also important in applied and theoretical statistical studies. All are presented here for completeness. The first and fundamental function of importance is the density function (Figure 1.1). For a discrete variable (defined only at specific points t_1, t_2, ..., t_n), the density function gives the probability of occurrence of each point and is denoted P(t). For a continuous variable (defined for all t in an interval I), the density function, denoted f(t), gives the relative frequency with which the t-values occur. Characteristic of the density function is the fact that \sum_{i=1}^n P(t_i) = 1 for the discrete case and \int_D f(t) \, dt = 1 for the continuous case. D denotes the domain of definition or interval of integration. All other functions considered depend on the density function and its characteristics.

[Figure 1.1 Probability density functions.]

The second most important function from an estimation and interpretation standpoint is the cumulative density function (Figure 1.2). It is denoted F(t) and is given as follows, where k is the number of discrete values and t_0 is the lower limit of domain D in the continuous case:

F(t_k) = \sum_{i=1}^k P(t_i)    (6)

F(t) = \int_{t_0}^t f(t) \, dt    (7)
The cumulative function thus gives the probability of a value less than or equal to t (or t_k) or, equivalently, the fraction of values that are less than or equal to t (or t_k). By appropriate use of this function, it is possible to evaluate a substantial number of probabilities of interest. For example,

P(t > t_0) = 1 − F(t_0)    (8)

P(t_a < t ≤ t_b) = F(t_b) − F(t_a)    (9)

P(t ≤ t_0) = F(t_0)    (10)

[Figure 1.2 Cumulative distribution function.]

The joint density function of n independent random variables is given by

f(t_1, t_2, ..., t_n) = f_1(t_1) f_2(t_2) ··· f_n(t_n)    (11)

This function is one of primary importance in estimation because it permits definition of the likelihood function for a random sample of size n. The likelihood function is defined as the joint density of the sample. If we have a random sample t_1, t_2, ..., t_n and all t_i have the same density (as is the case in random sampling),

f(t_1, t_2, ..., t_n) = f(t_1) f(t_2) ··· f(t_n)    (12)

The marginal density is defined when the density function is of higher order (a density function of more than one variable) as the function

f_X(x) = \int_{R_y} f(x, y) \, dy    (13)

where R_y is the range of the y's, f(x, y) is the density of x and y, and f_X(x) is the marginal density. Thus, the marginal density may be considered to be the result of eliminating random variables that are not of interest. The conditional density is a density that describes a random variable (or variables) when other random variables are assigned specific values. Thus, if f(x, y) is the joint density of x and y, the conditional density f(x | y) is a function of x at a specified y such that

F(a < x < b | y) = \int_a^b f(x | y) \, dx    (14)
Consider an arbitrary failure distribution given by f(t) for t ≥ γ, and 0 otherwise. In addition to the usual statistical functions, such as the cumulative density function, F(t), which are of importance in reliability, the following are some of the others [2].

1. Reliability function and reliable life. The reliability function, R(t) (Figure 1.3), is the probability that failure occurs after time t and is defined as

R(t) = \int_t^{\infty} f(x) \, dx = 1 − F(t)    (15)

[Figure 1.3 Reliability function.]

The reliable life, ρ_R, sometimes called the minimum life, is defined for any specified R such that

R = \int_{ρ_R}^{\infty} f(t) \, dt = R(ρ_R)    (16)

The reliable life, ρ_R, is the same as the qth quantile, where q = 1 − R. A special case is when R = 1/2; ρ_R then becomes the median. Similarly, when R = 3/4, ρ_R becomes the first quartile, or 25th percentile, and when R = 1/4, ρ_R becomes the third quartile, or 75th percentile.

2. Moments. When the threshold γ is a finite value, the moments of the failure distribution may be found from R(t). The kth moment of t is defined as

μ'_k = \int_{−\infty}^{\infty} t^k f(t) \, dt = γ^k + k \int_γ^{\infty} t^{k−1} R(t) \, dt    (17)

In particular, when k = 1 and γ = 0, the mean time to failure (MTTF) is given by

MTTF = μ = \int_γ^{\infty} t f(t) \, dt = −\int_γ^{\infty} t \frac{dR}{dt} \, dt = −tR(t) \Big|_0^{\infty} + \int_0^{\infty} R(t) \, dt = \int_0^{\infty} R(t) \, dt    (18)
3. Failure rate, hazard rate, and retired life. For a period of length δ, the failure rate, G(t, δ), is defined as

G(t, δ) = \frac{1}{δ} \int_t^{t+δ} \frac{f(x)}{R(t)} \, dx = \frac{F(t + δ) − F(t)}{δR(t)} = \frac{R(t) − R(t + δ)}{δR(t)}    (19)

The hazard rate or instantaneous failure rate, h(t), is the limit of G(t, δ) as δ approaches zero:

h(t) = \frac{f(t)}{1 − F(t)}    (20)

The retired life, or replacement life, ξ, is defined for any specified h and given by

h = \frac{f(ξ)}{R(ξ)}    (21)

4. Life expectancy. Suppose that an item has survived until time T. Then the expected additional life expectancy, L(T), is given by

L(T) = μ − T for T ≤ μ;  L(T) = \frac{1}{R(T)} \int_T^{\infty} R(t) \, dt for T > μ    (22)

5. Probable life. The probable life, B(T), is the total expected life of an item of age T:

B(T) = L(T) + T    (23)
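These quantities are easy to evaluate numerically. The sketch below (an illustration added here, not from the original text; a Weibull failure law with scale α = 100 and shape β = 2 is assumed) computes the hazard rate (20), the MTTF via equation (18), the additional life expectancy (22), and the probable life (23):

```python
import numpy as np
from scipy import integrate

alpha, beta = 100.0, 2.0                      # assumed Weibull scale and shape
R = lambda t: np.exp(-(t / alpha) ** beta)    # reliability function
f = lambda t: (beta * t**(beta - 1) / alpha**beta) * R(t)  # p.d.f.

def hazard(t):                                # Eq. (20): h(t) = f(t)/[1 - F(t)] = f(t)/R(t)
    return f(t) / R(t)

mu = integrate.quad(R, 0, np.inf)[0]          # MTTF via Eq. (18)

def L(T):                                     # Eq. (22): expected additional life
    if T <= mu:
        return mu - T
    return integrate.quad(R, T, np.inf)[0] / R(T)

def B(T):                                     # Eq. (23): probable (total expected) life
    return L(T) + T

print(hazard(50.0))        # increasing in t, since beta > 1 (wear-out region)
print(mu)                  # ~88.62 = alpha * Gamma(1 + 1/beta)
print(L(30.0), B(30.0))    # 58.62 and 88.62 for this T <= mu case
```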
1.2 HAZARD RATE FUNCTIONS IN RELIABILITY

Sometimes in selecting the distribution of failure times, one must use empirical data. For nonsymmetrical probability density functions the major difference between densities will be in the long tail. However, due to limited sample sizes, we have sparse data for this tail. As an alternative, we may appeal to physical considerations to select the function, or as it is commonly called, the hazard rate. The hazard rate or hazard function is also interpreted as the instantaneous failure rate. It is also known as the force of mortality in actuarial science and as the intensity function in statistical extreme value theory. Let F(t) be the cumulative distribution function (c.d.f.) of the time-to-failure variable, T, and let f(x) be the corresponding probability density function (p.d.f.). Consider the probability that the item fails in the interval (t, t + Δt] given that it has survived until time t:

P(t < T ≤ t + Δt | T > t) = \frac{F(t + Δt) − F(t)}{1 − F(t)}    (24)

Dividing this probability by the length of the interval, Δt, gives a "per unit time" value. The hazard function or instantaneous failure rate is given by (Figure 1.4)

h(t) = \lim_{Δt→0} \frac{F(t + Δt) − F(t)}{Δt[1 − F(t)]} = \frac{f(t)}{R(t)}    (25)

[Figure 1.4 Hazard function.]
by the definition of the derivative of F(t) and since the reliability R(t) = 1 − F(t). Now, on the basis of physical considerations we can choose a functional form for h(t). Making use of the relationship (25) and that f(t) = −R'(t), we can write

h(t) \, dt = −\frac{1}{R(t)} \, dR(t)    (26)

Recalling that R(0) = 1, and integrating over the range (0, t), we obtain

\int_0^t h(x) \, dx = −\int_1^{R(t)} \frac{1}{R(x)} \, dR(x) = −\ln R(t)    (27)

\ln R(t) = −\int_0^t h(x) \, dx    (28)

and finally, the general reliability equation,

R(t) = \exp\left[−\int_0^t h(x) \, dx\right]    (29)
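Equation (29) can be checked numerically. In the short Python sketch below (an added illustration, not from the original text; a linearly increasing hazard h(t) = 0.002t, typical of wear-out, is assumed), the numerically integrated form reproduces the closed-form Rayleigh reliability exp(−0.001t²):

```python
import numpy as np
from scipy import integrate

h = lambda t: 0.002 * t            # assumed linearly increasing (wear-out) hazard

def R(t):
    # Eq. (29): R(t) = exp(-integral of h from 0 to t)
    H = integrate.quad(h, 0.0, t)[0]   # cumulative hazard
    return np.exp(-H)

# For h(t) = c*t the closed form is R(t) = exp(-c*t^2/2), a Rayleigh law
for t in (10.0, 25.0, 50.0):
    print(t, R(t), np.exp(-0.002 * t**2 / 2))   # the two columns agree
```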
There are three general types of hazard rates, as illustrated by the bathtub curve, detailed descriptions of which can be found in most textbooks on reliability (e.g., [3]). The first part of the curve represents initial failures and has a decreasing hazard rate. These failures correspond to infant mortalities, such as those caused by hereditary defects and childhood diseases. In reliability applications these failures are generally caused by poor workmanship (e.g., poorly soldered connections, nuts not tightened, untested equipment). The second part of the curve is the chance failure portion, usually represented by a constant hazard rate. Here failures are due to severe and unpredictable environmental conditions. For human mortality tables, this period would represent deaths by accidents or unusual diseases, for example. In reliability applications, failures occur because of unusual events such as shocks and sudden voltage surges. The final portion of the curve, which has an increasing hazard rate, corresponds to wear-out failures. In humans, these are failures due to heart diseases and deterioration of a body's organs. For physical processes these failures would be caused by wear, so that parts no longer fit, for example. Specifying a functional form for the hazard rate,
we can, using equation (29), find the functional form of the reliability function. Figure 1.5 shows the relationships studied above [4].

[Figure 1.5 Reversible mathematical relationships among f(t), h(t), and R(t).]

1.3 COMMON DISTRIBUTIONS AND RANDOM NUMBER GENERATIONS

In this section we present some of the popular distributions used in reliability applications [5]. A summary of the distributions that receive most of the attention in reliability engineering is shown in Table 1.1. In the following sections we use the abbreviations p.d.f. for probability density function, p.m.f. for probability mass function, and c.d.f. for cumulative density function. The random number generations for each distribution for x (or t) will follow each [7,8].

1.3.1 Uniform (Rectangular) p.d.f.

As the name implies, the uniform p.d.f. treats all values as having the same likelihood over the interval (a, b):

f(x) = \frac{1}{b − a} for a ≤ x ≤ b;  0 otherwise    (30)
TABLE 1.1 Most Commonly Used Distributions and Reliability Functions

Exponential: density f(t) = λ exp(−λt); reliability R(t) = exp(−λt); hazard h(t) = f(t)/R(t) constant at λ (CFR); mean θ = λ^{−1}; variance θ² = λ^{−2}; mode 0; median θ ln 2; range 0 ≤ t ≤ ∞; λ = failure rate.

Normal: density f(t) = [1/(σ\sqrt{2π})] e^{−(t−μ)²/2σ²}; reliability 1 − Φ[(t − μ)/σ]; hazard IFR; mean μ; variance σ²; mode μ; median μ; range −∞ ≤ t ≤ ∞; M = m = μ, symmetric.

Uniform (rectangular): density 1/(b − a); reliability (b − t)/(b − a); hazard IFR; mean (a + b)/2; variance (b − a)²/12; no mode; median (a + b)/2; range a ≤ t ≤ b.

Weibull: density (βt^{β−1}/α^β) e^{−t^β/α^β}; reliability e^{−t^β/α^β}; hazard IFR for β > 1, CFR for β = 1, DFR for β < 1; mean αΓ(1/β + 1); variance α²[Γ(2/β + 1) − Γ²(1/β + 1)]; mode α(1 − β^{−1})^{1/β} for β > 1; median α(ln 2)^{1/β}; range 0 ≤ t ≤ ∞; α = scale parameter, β = shape parameter.

Lognormal: density [1/(σt\sqrt{2π})] e^{−(\ln t − μ)²/2σ²}; reliability 1 − Φ[(1/w) \ln(t/t₀)]; hazard depends on w; mean t₀ exp(w²/2); variance t₀² exp(w²)[exp(w²) − 1]; median t₀; range 0 ≤ t ≤ ∞; here μ = \ln t₀, w = σ, t₀ = geometric mean (\ln t₀ = (1/n) \sum \ln x_i), and Φ = standard normal c.d.f.

Gamma: density t^{α−1}e^{−t/β}/[Γ(α)β^α]; reliability \int_t^{\infty} f(t) \, dt; hazard IFR for α > 1, CFR for α = 1, DFR for α < 1; mean αβ; variance αβ²; mode (α − 1)β for α > 1; median M solving \int_0^M f(t) \, dt = 0.5; range 0 ≤ t ≤ ∞; β = scale parameter, α = shape parameter.

Source: [2–6]. IFR, increasing failure rate; CFR, constant failure rate; DFR, decreasing failure rate.
The c.d.f. of the uniform random variable X is

F(x) = 0 for x < a;  \frac{x − a}{b − a} for a ≤ x ≤ b;  1 for x > b    (31)
Note that median = E(x) = (a + b)/2 and Var(x) = (b − a)²/12. No mode exists.

How to Generate Random Numbers from a Uniform p.d.f. Using the inverse transform technique, u = (x − a)/(b − a) yields x* = a + (b − a)u_i, where u_i ∼ U(0, 1) are generated through software or from a random number table such as that in Appendix 1A.
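As a one-line illustration in Python (an added sketch, not from the original text; the endpoints a = 2 and b = 5 are arbitrary):

```python
import random

def uniform_deviate(a, b):
    # Inverse transform: u = (x - a)/(b - a)  =>  x* = a + (b - a) * u
    u = random.random()          # u_i ~ U(0, 1)
    return a + (b - a) * u

sample = [uniform_deviate(2.0, 5.0) for _ in range(5)]
print(sample)                    # five deviates, all in [2, 5)
```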
1.3.2 Triangular p.d.f.

A triangular p.d.f. models a process if only minimum (a), maximum (c), and mode (b) values are known, such as the times for life testing a product until defective units are detected. It has the p.d.f.

f(x) = \frac{2(x − a)}{(b − a)(c − a)} for a ≤ x ≤ b;  \frac{2(c − x)}{(c − b)(c − a)} for b ≤ x ≤ c;  0 otherwise    (32)

E(x) = (a + b + c)/3, Var(x) = [a(a − b) + c(c − a) + b(b − c)]/18, and mode = 3E(x) − (a + c). The c.d.f. is

F(x) = 0 for x ≤ a;  \frac{(x − a)²}{(b − a)(c − a)} for a ≤ x ≤ b;  1 − \frac{(c − x)²}{(c − a)(c − b)} for b ≤ x ≤ c;  1 for x ≥ c    (33)
How to Generate Random Numbers from a Triangular p.d.f. Triangular random deviates are generated by the inverse transform technique, solving u = F(x):

x* = a + \sqrt{(b − a)(c − a)u} for 0 ≤ u < \frac{b − a}{c − a};  x* = c − \sqrt{(c − a)(c − b)(1 − u)} for \frac{b − a}{c − a} ≤ u ≤ 1    (34)
Therefore, first generating u_i, check whether it falls between the limits 0 and (b − a)/(c − a). If yes, use the first branch in the random number equation above. Otherwise, use the second branch.

1.3.3 Negative Exponential p.d.f., Pareto, and Power Functions

One of the most common distributions used in reliability is the negative exponential distribution. One characteristic that has contributed to its popularity is the ease with which the distributions of statistics under various sampling plans can be derived. It is also used widely in queuing theory, since the time between arrivals may be negative exponentially distributed. Hence, for the exponential distribution, a device that is T_0 units old but still operating has the same reliability as that of a new device (the memoryless property). Obviously, this reliability function is not suitable for the infancy and wear-out portions of the bathtub hazard rate.

How to Generate Random Numbers from a Negative Exponential p.d.f. Using the inverse transform technique, where 0 < u_i ∼ U(0,1) < 1 is the uniform random number,
0 ≤ F (x) ≤ 1,
λ>0
(35)
e−λx = 1 − ui
(36)
−λx
(37)
ln(e
) = −λx = ln(1 − ui )
x∗ = −
ln(1 − ui ) λ
or x ∗ = −
ln(ui ) λ
(38)
is the negative exponential random deviate, since if ui is uniform, so is 1 − ui . Sometimes called antithetical variables in simulation studies, 1 − ui are used to help reduce the sampling error. If the time spent in each stage, X1 and X2 , is independent and exponentially distributed with respective parameters λ1 and λ2 , where λ1 = λ2, the overall time Y = X1 + X2 is hypoexponentially distributed [i.e., Y ∼ Hypo(λ1 , λ2 )] with f (t) and R(t) given by λ1 λ2 (e−λ1 t − e−λ2 t ), t ≥0 λ2 − λ1 λ2 λ1 R(t) = e−λ1 t + e−λ2 t , t ≥0 λ2 − λ1 λ2 − λ1 f (t) =
(39) (40)
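A short Python sketch of the inverse transform for the exponential case (an added illustration, not from the original text; the unequal rates λ₁ = 1.0 and λ₂ = 2.5 are arbitrary), which also verifies the hypoexponential results (39)–(40) by simulation:

```python
import math
import random

def exp_deviate(lam):
    # Eq. (38): x* = -ln(1 - u)/lam ; 1 - u is also U(0, 1)
    return -math.log(1.0 - random.random()) / lam

lam1, lam2 = 1.0, 2.5                        # assumed unequal rates (lam1 != lam2)
n = 100_000
ys = [exp_deviate(lam1) + exp_deviate(lam2) for _ in range(n)]

print(sum(ys) / n)                           # ~ 1/lam1 + 1/lam2 = 1.4
# Empirical survival at t = 2 vs. the hypoexponential R(t) of Eq. (40)
t = 2.0
emp = sum(y > t for y in ys) / n
thry = (lam2 * math.exp(-lam1 * t) - lam1 * math.exp(-lam2 * t)) / (lam2 - lam1)
print(emp, thry)                             # both ~0.221
```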
Similarly, suppose that a process takes exactly one of k alternative stages, entering stage i with probability p_i, where 0 < p_i < 1 and \sum_{i=1}^k p_i = 1, and that stage i is exponential with rate λ_i > 0. The overall time Y is then k-phase hyperexponentially distributed, with

f(t) = \sum_{i=1}^k p_i λ_i e^{−λ_i t}, t ≥ 0    (41)

F(t) = \sum_{i=1}^k p_i (1 − e^{−λ_i t}) = \sum_{i=1}^k p_i − \sum_{i=1}^k p_i e^{−λ_i t} = 1 − \sum_{i=1}^k p_i e^{−λ_i t}, t ≥ 0    (42)

h(t) = \frac{\sum_{i=1}^k p_i λ_i e^{−λ_i t}}{\sum_{i=1}^k p_i e^{−λ_i t}}, t ≥ 0    (43)

R(t) = \frac{f(t)}{h(t)} = \sum_{i=1}^k p_i e^{−λ_i t} = 1 − F(t)    (44)
Note that h(t) is a decreasing failure rate. The hyperexponential distribution has more variability than the negative exponential distribution. Therefore, if the product is manufactured in several locations or stages, the failure density of the overall product is hyperexponential, known to be a mixture distribution. For the special case of k = 2, Y ∼ Hyper(λ_1, λ_2), f(t) and h(t) are

f(t) = p_1 λ_1 e^{−λ_1 t} + p_2 λ_2 e^{−λ_2 t}    (45)

h(t) = \frac{p_1 λ_1 e^{−λ_1 t} + p_2 λ_2 e^{−λ_2 t}}{p_1 e^{−λ_1 t} + p_2 e^{−λ_2 t}}    (46)
If λ_1 = λ_2, then Y ∼ two-stage Erlang (k = 2) (see Section 1.3.4). On the other hand, the double exponential distribution, also called the Pareto or hyperbolic or power-law distribution, used to model the amount of central processing unit (CPU) time used by an arbitrary process or the thinking time of a Web browser, to cite a few examples, has the following values:

f(x) = cx^{−c−1}, 1 ≤ x ≤ ∞, c > 0    (47)

F(x) = 1 − x^{−c}    (48)

R(x) = x^{−c}, c > 0    (49)

E(x) = \frac{c}{c − 1}, c > 1    (50)

Var(x) = \frac{c}{c − 2} − \left(\frac{c}{c − 1}\right)², c > 2    (51)

h(x) = \frac{c}{x}    (52)
A function very similar to the Pareto is the power function, where f(x) = cx^{c−1}, the power of x being positive c, not negative. The mode is 1 for c > 1 and 0 for c < 1.

F(x) = x^c    (53)

R(x) = 1 − x^c    (54)

M (median) = (0.5)^{1/c}    (55)

E(x) = \frac{c}{c + 1}    (56)

Var(x) = \frac{c}{(c + 2)(c + 1)²}    (57)
How to Generate Random Numbers from the Pareto and Power Functions By the inverse transform technique, U(0, 1) = F(x); that is, 1 − x^{−c} = u, 1 − u = x^{−c}, (1 − u)^{−1} = x^c, (1 − u)^{−1/c} = x, and therefore for the Pareto function,

x* = \left(\frac{1}{1 − u_i}\right)^{1/c}    (58)

Again using the inverse transform technique, x^c = u, and then taking the (1/c)th root of both sides, we obtain for the power function distribution

x* = (u_i)^{1/c}    (59)
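In Python, both recipes are one-liners (an added sketch, not from the original text; the tail index c = 1.5 is arbitrary):

```python
import random

def pareto_deviate(c):
    # Eq. (58): x* = (1/(1 - u))^(1/c), support [1, inf)
    return (1.0 / (1.0 - random.random())) ** (1.0 / c)

def power_deviate(c):
    # Eq. (59): x* = u^(1/c), support [0, 1]
    return random.random() ** (1.0 / c)

c = 1.5
xs = [pareto_deviate(c) for _ in range(100_000)]
print(sum(xs) / len(xs))     # ~ c/(c-1) = 3.0 per Eq. (50); noisy, since Var is infinite for c < 2
print(power_deviate(c))      # a single power-function deviate in [0, 1]
```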
1.3.4 Gamma, Erlang, and Chi-Square p.d.f.'s

The general form of the gamma distribution is

f(t) = \frac{t^{α−1} e^{−t/β}}{β^α Γ(α)} for t ≥ 0, α, β ≥ 0;  0 otherwise    (60)

When α is an integer, this distribution is also known as the Erlangian distribution, and when α = 1, the gamma density reduces to an exponential distribution with β = λ^{−1}. For example, a computer network fails when a mainframe computer
and two backup servers fail, and each may have a time to failure that is negative exponentially distributed. The Erlang distribution may be obtained as the distribution of the sum of α many independent identically distributed (i.i.d.) exponential random variables. Suppose that the failure of a device occurs when the kth shock arrives if a Poisson process with a parameter λ generates the shocks. Let the random variable T denote the arrival time of the kth shock. Then T = \sum_{i=1}^{k} T_i, where T_i is the time between the (i − 1)st shock and the ith shock. Then, from the property given above,

f(t) = \frac{λ^k t^{k−1} e^{−λt}}{Γ(k)} for t ≥ 0, with Γ(k) = (k − 1)!;  0 otherwise    (61)

R(t) = \sum_{j=0}^{k−1} \frac{(λt)^j}{j!} e^{−λt}, t ≥ 0, λ > 0    (62)
This is also called the k-stage Erlang and is the same formula as equation (60) with β = λ^{−1} and α = k.

How to Generate Random Numbers from the Erlang Distribution To generate an Erlang random deviate, we take advantage of the fact that the negative exponential is a special case of the Erlang distribution with the shape parameter α = 1. Therefore, we can generate an Erlang deviate by summing α negative exponential random deviates with mean β = λ^{−1} as follows:

t* = \sum_{k=1}^{α} \frac{−\ln(1 − u_k)}{λ}    (63)
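Equation (63) translates directly into code (an added sketch, not from the original text; shape α = 3 and rate λ = 2.0 are arbitrary):

```python
import math
import random

def erlang_deviate(alpha, lam):
    # Eq. (63): sum alpha exponential deviates, each with mean 1/lam
    return sum(-math.log(1.0 - random.random()) / lam for _ in range(alpha))

alpha, lam = 3, 2.0
ts = [erlang_deviate(alpha, lam) for _ in range(100_000)]
print(sum(ts) / len(ts))        # ~ alpha/lam = 1.5, the Erlang mean
```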
Another use of the gamma distribution is for a parallel standby system with n identical devices. In this type of system, the system fails when all devices fail, and the system operates when only one device is operating at a time. Then the time to system failure, or the sum of the n failure times, has a gamma distribution with β = λ^{−1} and α = n. If X ∼ Gamma(α, β), then \sum_{i=1}^{m} X_i is distributed as Gamma(mα, β).

How to Generate Random Numbers from the Gamma Distribution (Johnk's Rejection Technique) Let α be a noninteger shape parameter, α_1 = [α] the largest truncated integer of α, and u_i the ith uniform random number, 0 ≤ u_i ≤ 1. Then:

1. Let x = −\ln \prod_{i=1}^{α_1} u_i.
2. Set A = α − α_1, B = 1 − A.
   a. Set j = 1.
   b. Generate random number u_j and set y_1 = (u_j)^{1/A}.
   c. Generate random number u_{j+1} and set y_2 = (u_{j+1})^{1/B}.
   d. If y_1 + y_2 ≤ 1, go to f.
   e. Set j = j + 2 and go to b.
   f. Let z = y_1/(y_1 + y_2), which is a beta random deviate with parameters A and B.
3. Generate the random number u_N and set w = −\ln(u_N).
4. The random deviate desired for a gamma p.d.f. is then G = (x + zw)β.
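The following is a direct transcription of steps 1 to 4 above (an added sketch, not from the original text; the noninteger shape α = 2.7 and scale β = 1.5 are arbitrary):

```python
import math
import random

def gamma_deviate_johnk(alpha, beta):
    # Step 1: Gamma(alpha1, 1) part from a product of uniforms
    alpha1 = int(alpha)                      # [alpha], truncated integer part
    x = 0.0
    if alpha1:
        x = -math.log(math.prod((1.0 - random.random()) for _ in range(alpha1)))
    # Step 2: fractional part A = alpha - alpha1, B = 1 - A
    A = alpha - alpha1
    if A == 0.0:                             # integer shape: Erlang case, no fractional part
        return x * beta
    B = 1.0 - A
    while True:                              # steps a-e: rejection loop
        y1 = random.random() ** (1.0 / A)
        y2 = random.random() ** (1.0 / B)
        if y1 + y2 <= 1.0:
            break
    z = y1 / (y1 + y2)                       # step f: Beta(A, B) deviate
    w = -math.log(1.0 - random.random())     # step 3: unit exponential deviate
    return (x + z * w) * beta                # step 4: scale by beta

vals = [gamma_deviate_johnk(2.7, 1.5) for _ in range(100_000)]
print(sum(vals) / len(vals))                 # ~ alpha*beta = 4.05
```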
If X ∼ Gamma(α = n/2, β = 2), X is said to have a chi-square distribution with n degrees of freedom. Then E(x) = n, Var(x) = 2n, and mode = n − 2, n ≥ 2 [10].

How to Generate Random Numbers from the Chi-Square Distribution A chi-square random variable with n degrees of freedom is the sum of squares of n independent normally distributed random variables with μ = 0 and σ² = 1, that is, \sum_{i=1}^{n} [N(0, 1)]².

Case 1. For n even:

x* = −2 \ln \prod_{i=1}^{n/2} u_i    (64)

Case 2. For n odd:

x* = −2 \ln \prod_{i=1}^{(n−1)/2} u_i + [N(0, 1)]²    (65)
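Both cases fit in one routine (an added sketch, not from the original text; n = 5 is arbitrary):

```python
import math
import random

def chi_square_deviate(n):
    # Eqs. (64)-(65): -2 ln(product of n//2 uniforms), plus one squared
    # standard normal deviate when n is odd
    pairs = n // 2
    x = 0.0
    if pairs:
        x = -2.0 * math.log(math.prod((1.0 - random.random()) for _ in range(pairs)))
    if n % 2 == 1:
        x += random.gauss(0.0, 1.0) ** 2
    return x

n = 5
xs = [chi_square_deviate(n) for _ in range(100_000)]
print(sum(xs) / len(xs))       # ~ n, since E(chi-square with n d.f.) = n
```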
Another important analysis useful in standby redundancy is having two identical components, X and Y, each with a negative exponential time to failure with parameter λ. Only one component is required to be working for the system to operate. The second, spare component is a "cold standby," inactive unless called for. Then Z = X + Y has a gamma density using the convolution formula, where f(z) = λ²ze^{−λz}, z > 0. Hence, Z ∼ Gamma(α = 2, β = λ^{−1}), which is a two-stage Erlang distribution, whose reliability function is given, as in equation (62), by

R(t) = (1 + λt)e^{−λt}, t > 0    (66)

For the sake of comparison of simplex active parallel and standby parallel, the reliability values for any given λ are larger (favorable) for the standby than for the simplex. As a consequence of this theorem, the reliability expression for a standby redundant system with a total of n components, each of which has a negative exponentially distributed lifetime, is shown as in (62) by

R_{standby}(t) = \sum_{k=0}^{n−1} \frac{(λt)^k}{k!} e^{−λt}, t ≥ 0, λ > 0    (67)
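A quick numerical comparison of the simplex, two-unit active-parallel, and two-unit standby configurations (an added sketch, not from the original text; the failure rate λ = 0.1 per hour and mission time t = 10 hours are arbitrary):

```python
import math

lam, t = 0.1, 10.0                       # assumed failure rate and mission time

r = math.exp(-lam * t)                   # simplex: one component, R = e^(-lam*t)
active = 1 - (1 - r) ** 2                # two-unit active parallel: 1 - (1 - R)^2
standby = sum((lam * t) ** k / math.factorial(k) for k in range(2)) * math.exp(-lam * t)
                                         # Eq. (67) with n = 2: (1 + lam*t) e^(-lam*t)

print(r, active, standby)                # 0.368 < 0.600 < 0.736: standby is the most reliable
```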
1.3.5 Student's t-Distribution

Let X be N(0,1) and Y be χ²_ν, independent of each other. Then set T = X/\sqrt{Y/ν}, which is defined as Student's t distributed with ν degrees of freedom (d.f.) and denoted by t_ν, which has the p.d.f.

f_T(t) = \frac{Γ[(ν + 1)/2]}{\sqrt{πν} \, Γ(ν/2)} \left(1 + \frac{t²}{ν}\right)^{−(ν+1)/2}, −∞ < t < ∞    (68)

Student's t, which is popularly used in confidence intervals and hypothesis testing for small samples (n < 15) with an unknown variance σ², is symmetric. Mode = median = μ = 0.

Var(T) = \frac{ν}{ν − 2}, ν > 2    (69)
(70)
1.3.6 Fisher’s F -Distribution Let X be χν21 and Y be χν22 independently, and set Fα,ν1 ,ν2 = χν21 ν2 /χν22 ν1 . The random variable Fα−1 has a p.d.f. with ν2 and ν1 degrees of freedom interchanged compared to the original at a level of 1 − α (i.e., F1−α,ν2 ,ν1 ): fF (f ) =
[(1/2)(ν1 + ν2 )](ν1 /ν2 )ν1 /2 (ν1 /2) (ν2 /2) ×
f (ν1 /2)−1 , [1 + (ν1 /ν2 )f ](1/2)(ν1 +ν2 )
f > 0 and fF (f ) = 0,
f =0 (71)
COMMON DISTRIBUTIONS AND RANDOM NUMBER GENERATIONS
E(F ) =
ν2 , ν2 − 2
mode =
ν2 (ν1 − 2) , ν1 (ν2 + 2)
Var(F ) =
ν2 > 2
17
(72)
ν1 > 1
2ν22 (ν1 + ν2 − 2) , ν1 (ν2 − 2)2 (ν2 − 4)
(73) ν2 > 4
(74)
The F-distribution is in common use in statistical estimation theory for comparing variances in ANOVA (analysis of variance) so as to compare group means. Note that 1/F_{α,ν1,ν2} ∼ F_{1−α,ν2,ν1} [10].

How to Generate Random Numbers from an F-Distribution. Similarly, given two sources of random deviates from chi-square distributions with ν1 and ν2 degrees of freedom, one generates an F-deviate as

x*_{ν1,ν2} = (χ²_{ν1} ν2)/(χ²_{ν2} ν1)   (75)
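A sketch of equation (75), again assuming the chi_square_deviate() generator sketched in Section 1.3.4:

```python
def f_deviate(nu1, nu2):
    """F(nu1, nu2) deviate as a ratio of independent scaled chi-squares,
    equation (75)."""
    return (chi_square_deviate(nu1) / nu1) / (chi_square_deviate(nu2) / nu2)
```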
1.3.7 Two- and Three-Parameter (Sahinoglu–Libby) Beta p.d.f.'s

The beta distribution is quite useful in the theory of statistics and has a wide variety of applications in applied engineering and quality control problems when the random variable of interest varies between 0 and 1. This p.d.f. is very flexible and is used to model bounded random variables with fixed upper and lower limits. The variable can be shifted away from zero by adding a constant, and its range can be widened beyond 0 to 1 by scaling with a constant larger than 1. It is used in forming confidence intervals for tolerance limits in distribution functions. It often occurs as both a prior and a posterior p.d.f. in Bayes estimation, where it is treated as a conjugate for the binomial density. It is useful to establish a relationship between the gamma and beta distributions (see Figure 1.6). If a random variable y is distributed as Gamma(α1, β) and z is distributed as Gamma(α2, β), then x = y/(y + z) is distributed as Beta(A = α1, B = α2), with [10]

E(x) = A/(A + B)   (76)

Var(x) = AB/[(A + B + 1)(A + B)²]   (77)

mode = (A − 1)/(A + B − 2), A > 1, B > 1   (78)
f(x) = [Γ(A + B)/(Γ(A)Γ(B))] x^{A−1}(1 − x)^{B−1} = [(A + B − 1)!/((A − 1)!(B − 1)!)] x^{A−1}(1 − x)^{B−1}, 0 ≤ x ≤ 1, A and B integers   (79)

f(x) = [1/B(A, B)] x^{A−1}(1 − x)^{B−1}, A and B nonintegers   (80)

where for A and B nonintegers, B(A, B) is the beta function:

B(A, B) = Γ(A)Γ(B)/Γ(A + B) = ∫₀¹ u^{A−1}(1 − u)^{B−1} du   (81)
How to Generate Random Numbers from the Beta Distribution. Assume random variables y ∼ Gamma(A, 1) and z ∼ Gamma(B, 1); then y = −ln(∏_{i=1}^{A} u_i) and z = −ln(∏_{i=A+1}^{A+B} u_i), so x = y/(y + z) is the Beta(A, B) deviate, where u_i is a number from a uniform random generator and A and B are integers. For A and B nonintegers, use the algorithm for the beta deviate in step 2f of "How to Generate Random Numbers from the Gamma Distribution" in Section 1.3.4.
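A sketch of the integer-parameter construction just described:

```python
import math
import random

def beta_deviate(A, B):
    """Beta(A, B) deviate as x = y/(y + z) for two gamma deviates built
    from -ln of products of uniforms (A and B integers)."""
    y = -sum(math.log(1.0 - random.random()) for _ in range(A))
    z = -sum(math.log(1.0 - random.random()) for _ in range(B))
    return y / (y + z)
```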
For the three-parameter beta, otherwise known as the Sahinoglu–Libby (SL) p.d.f., consider the FOR [forced outage rate = downtime/(uptime + downtime) = (failure rate)/(failure rate + repair rate) = λ/(λ + μ)] treated with a Bayesian approach; see Chapter 5 and Appendix 5A for the derivation [11]. Let

a = number of occurrences of operative (up) times sampled
xT = total sampled uptime for a up occurrences
b = number of occurrences of inoperative (down) times sampled
yT = total sampled downtime for b down occurrences
c = shape parameter of a gamma prior for component failure rate λ
ξ = inverse scale parameter of a gamma prior for component failure rate λ
d = shape parameter of a gamma prior for component recovery rate μ
η = inverse scale parameter of a gamma prior for component recovery rate μ

Using the distribution function technique, the p.d.f. of the FOR, denoted q = λ/(λ + μ), is obtained by first deriving its c.d.f. G_Q(q) = P(Q ≤ q) = P(λ/(λ + μ) ≤ q) and then taking the derivative g_Q(q) for 0 ≤ q ≤ 1:

g_Q(q) = [Γ(a + b + c + d)/(Γ(a + c)Γ(b + d))] (ξ + xT)^{a+c} (η + yT)^{b+d} q^{a+c−1}(1 − q)^{b+d−1} / [η + yT + q(ξ + xT − η − yT)]^{a+b+c+d}
       = [Γ(α + β)/(Γ(α)Γ(β))] L^α q^{α−1}(1 − q)^{β−1} / [1 + q(L − 1)]^{α+β}   (82)

where α = a + c, β = b + d, β1 = ξ + xT, and β2 = η + yT. If L = β1/β2 = 1 (i.e., β1 = β2), the conventional two-parameter beta p.d.f. is obtained. An alternative expression is

g_Q(q) = L^{a+c} q^{a+c−1}(1 − q)^{b+d−1} / {B(b + d, a + c)[1 − (1 − L)q]^{a+b+c+d}}   (83)

where

B(b + d, a + c) = Γ(a + c)Γ(b + d)/Γ(a + b + c + d) and L = (ξ + xT)/(η + yT)   (84)
How to Generate Random Numbers from the Sahinoglu–Libby p.d.f. Assume the random variables y ∼ Gamma(α1 = a + c, β1 = ξ + xT) and z ∼ Gamma(α2 = b + d, β2 = η + yT); the random variable q = y/(y + z) has the p.d.f.

g_Q(q) = [Γ(m′ + n′)/(Γ(m′)Γ(n′))] a′^{m′} b′^{n′} q^{n′−1}(1 − q)^{m′−1} / [a′ + q(b′ − a′)]^{m′+n′}   (85)

and the c.d.f.

G_Q(q) = 1 − G_{F_{2m′,2n′}}[(a′n′/b′m′)(q^{−1} − 1)] = P[F_{2m′,2n′} > C1 = (a′n′/b′m′)(q^{−1} − 1)]   (86)

Resubstituting n′ = a + c, m′ = b + d, b′ = ξ + xT, and a′ = η + yT, we obtain for equation (85)

g_Q(q) = [Γ(a + b + c + d)/(Γ(a + c)Γ(b + d))] (η + yT)^{b+d}(ξ + xT)^{a+c} (1 − q)^{b+d−1} q^{a+c−1} / [η + yT + q(ξ + xT − η − yT)]^{a+b+c+d}   (87)
where Snedecor's F-distribution in equation (86) was given by equation (71). By the inverse transform approach, find the constant C1 = F^{−1}_{2m′,2n′}(1 − u_i) as in equation (86); then

C1 = (a′n′/b′m′)(q^{−1} − 1)  ⟹  q* = a′n′/(a′n′ + C1 b′m′), 0 < q* < 1   (88)
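A sketch of the inverse-transform route of equations (86)–(88); scipy's F percent-point function is assumed as the inverse c.d.f. of F(2m′, 2n′), and the argument names mirror the text's parameters:

```python
import random
from scipy.stats import f as f_dist

def sl_deviate(a, xT, b, yT, c, xi, d, eta):
    """Sahinoglu-Libby random deviate via equations (86)-(88)."""
    n_p, m_p = a + c, b + d            # n' and m'
    b_p, a_p = xi + xT, eta + yT       # b' and a'
    C1 = f_dist.ppf(1.0 - random.random(), 2 * m_p, 2 * n_p)
    return (a_p * n_p) / (a_p * n_p + C1 * b_p * m_p)
```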
is the SL(α = a + c, β = b + d, L = β1/β2) random deviate, where u_i is a uniform random number.

1.3.8 Poisson p.m.f.

The p.m.f. for a Poisson-distributed X is given by

P(X = x) = λ^x e^{−λ}/x!, x = 0, 1, 2, 3, ..., ∞, and λ > 0   (89)
The Poisson distribution is used to approximate the binomial distribution by letting the number of trials n → ∞ while p → 0. This function describes the number of events occurring within some designated period of time, as in inventory and quality control and in queuing models, where arrival rates are often considered to be Poisson distributed. The expected value of the Poisson random variable is E(X) = λ and its variance is Var(X) = λ; note that Var(X) = E(X). The Poisson rate λ is defined as the number of occurrences expected per unit time. The time between arrivals is then negative exponentially distributed with mean θ = λ^{−1}. If Var(X) > E(X), one has a compound Poisson distribution [12–14].

How to Generate Random Numbers from a Poisson Distribution. The relationship between the negative exponential and Poisson distributions can be used to generate Poisson random numbers. A Poisson deviate x is defined by

∑_{i=1}^{x} y_i ≤ 1 ≤ ∑_{i=1}^{x+1} y_i   (90)

where, using the inverse transform method, y_i = −λ^{−1} ln u_i = −θ ln u_i are variates from a negative exponential distribution of mean θ = λ^{−1}. Cumulative sums of the y_i are generated until inequality (90) holds. Then x* = x, the count that terminates the summation and satisfies (90), is the Poisson deviate.
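A sketch of this exponential-summation scheme (the loop structure is an implementation choice):

```python
import math
import random

def poisson_deviate(lam):
    """Poisson(lam) deviate: accumulate exponential interarrivals of mean
    1/lam until their cumulative sum first exceeds 1, per inequality (90)."""
    x, total = 0, 0.0
    while True:
        total += -math.log(1.0 - random.random()) / lam
        if total > 1.0:
            return x
        x += 1
```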
1.3.9 Bernoulli, Binomial, and Multinomial p.m.f.'s

A discrete random variable Y = {0, 1} is distributed as a Bernoulli p.m.f. if its failure probability P(Y = 0) = q and success probability P(Y = 1) = p add to unity
(i.e., p + q = 1). A sequence of n identical and independent Bernoulli trials, where X = ∑_{i=1}^{n} Y_i, results in the binomial B(X; n, p) p.m.f. The expected value of the binomial random variable is E(X) = np and its variance is Var(X) = npq, so that Var(X) < E(X). For nonindependent (first-order Markov-dependent) Bernoulli random variables in a sequence, the limit of the sum leads to a compound Poisson, useful in systems such as electric power or computer networks [15–17]:

B(X = k; n, p) = C(n, k) p^k (1 − p)^{n−k}, 0 ≤ k ≤ n, p > 0, p + q = 1   (91)

A similar method for generating random multinomial vectors involves an extension of the binomial. Suppose, for example, that for a quadrinomial p.m.f. the unit line segment is partitioned as

0 |--- p1 ---|--- p2 ---|--- p3 ---|--- p4 ---| 1

where 0 < u_i < p1 places a draw in the first class, p1 < u_i < p1 + p2 in the second, and in general the jth class is determined by

∑_{m=1}^{j−1} p_m < u_i < ∑_{m=1}^{j} p_m, with p_0 = 0 and ∑_{m=1}^{k} p_m = 1   (92)

where k = 4 classes here. The corresponding multinomial p.m.f. is

M(X1 = k1, X2 = k2, ..., Xm = km; n, p) = C(n; k1, k2, k3) p1^{k1} p2^{k2} p3^{k3} (1 − p1 − p2 − p3)^{n−k1−k2−k3}, ∑_{m=1}^{k} p_m = 1   (93)
How to Generate Random Numbers from the Bernoulli, Binomial, and Multinomial Distributions. Generate a uniform number u_i; if 0 < u_i < p, it is a Bernoulli success; otherwise (p ≤ u_i < 1), it is a failure. Next, to generate a binomial random sample of x* = k successes in n trials, draw n random uniform numbers and count those less than the given p. Moreover, to generate a multinomial random sample of x*_m = k_m successes in n trials for a given m, draw n random uniform numbers and count the numbers falling in each of the m classes, yielding n_1, n_2, ..., n_m — in the example above with m = 4 classes, the counts n_1, n_2, n_3, n_4.
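A sketch of both counting schemes (the list handling is an implementation choice):

```python
import random

def binomial_deviate(n, p):
    """Binomial(n, p) deviate: count uniforms falling below p in n trials."""
    return sum(1 for _ in range(n) if random.random() < p)

def multinomial_deviate(n, probs):
    """Multinomial counts: classify each of n uniforms by the cumulative
    segments p1, p1+p2, ... of the unit line, as in equation (92)."""
    counts = [0] * len(probs)
    for _ in range(n):
        u, cum = random.random(), 0.0
        for j, pj in enumerate(probs):
            cum += pj
            if u < cum:
                counts[j] += 1
                break
    return counts
```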
1.3.10 Geometric p.m.f.

The geometric p.m.f. of X is defined as the number of failures in a sequence of Bernoulli trials until the first success occurs. The geometric p.m.f. is often regarded as the only discrete equivalent of the continuous negative exponential distribution, sharing its memoryless (forgetfulness) property. Letting p + q = 1, the geometric
p.m.f. can be denoted in one of two ways, either (94) or (95):

P(x) = pq^x, x = 0, 1, 2, 3, ...   (94)

P(x) = pq^{x−1}, x = 1, 2, 3, ...   (95)

E(x) = q/p, Var(x) = q/p²   (96)
How to Generate Random Numbers from a Geometric p.m.f. Employing the inverse transform method, x* is the desired geometric deviate for P(X = x) = pq^{x−1}, x = 1, 2, 3, ..., where u_i ∼ U(0,1) and the result is rounded up to the next-larger integer:

x*_i = ln u_i / ln q   (97)

For the other alternative, P(X = x) = pq^x, x = 0, 1, 2, 3, ..., it is slightly different (again rounding up to the next-larger integer):

x*_i = (ln u_i / ln q) − 1   (98)
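A sketch of equations (97)–(98):

```python
import math
import random

def geometric_deviate(p, zero_based=False):
    """Geometric deviate by inverse transform: ceil(ln u / ln q) for the
    form (95) starting at 1; subtract 1 for the form (94) starting at 0."""
    q = 1.0 - p
    x = math.ceil(math.log(1.0 - random.random()) / math.log(q))
    return x - 1 if zero_based else x
```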
1.3.11 Negative Binomial and Pascal p.m.f.'s

If we observe the number of failures occurring until the kth success, X is the negative binomial (NB) random variable, and the kth success is when we stop. If instead we observe the number of trials up to and including the kth success, X is the Pascal random variable. The NB random variable thus models the number of failures encountered in achieving the required successes: for example, the number of integrated circuits tested in computer hardware to find k = 5 defective chips. The expected value of the NB random variable is E(X) = kqp^{−1} and its variance is Var(X) = kqp^{−2}; note that Var(X) > E(X). To calculate, observe n − 1 trials containing k − 1 successes; at the nth trial the long-awaited kth success occurs and the experiment is terminated:

NB(X = k; n, p) = C(n − 1, k − 1) p^{k−1} (1 − p)^{n−k}, 0 ≤ k ≤ n, p > 0, p + q = 1   (99)

Pascal(X = k; n, p) = p C(n − 1, k − 1) p^{k−1} (1 − p)^{n−k} = C(n − 1, k − 1) p^{k} (1 − p)^{n−k}, 0 ≤ k ≤ n, p > 0, p + q = 1   (100)

(Counting instead the number of failures x = n − k before the kth success gives the equivalent form C(x + k − 1, k − 1) p^k (1 − p)^x.) When k = 1 as the special case, the NB p.m.f. reduces to the geometric p.m.f. for the first success:

NB(k = 1; n, p) = C(n − 1, 0) pq^{n−1} = pq^{n−1}, where C(n, 0) = n!/[0!(n − 0)!] = 1, n = 1, 2, 3, ...   (101)

How to Generate Random Numbers from the Negative Binomial and Pascal p.m.f.'s. One way is to generate a sequence of Bernoulli random numbers, as in the binomial case: draw random uniform numbers, recording for each whether it falls below the given p, and stop when the prescribed number k of successes is reached; the trial count identifies the deviate. Since the NB is the sum of k geometric random variates less k, an alternative is

x*_i = ∑_{i=1}^{k} [(ln u_i / ln q) − 1] = ∑_{i=1}^{k} (ln u_i / ln q) − k   (102)

where each term is rounded up to the next-larger integer. For a Pascal random deviate, simply sum the k geometric deviates, x*_i = ∑_{i=1}^{k} (ln u_i / ln q), again rounding each up to the next-larger integer.
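A sketch of the geometric-sum construction of equation (102):

```python
import math
import random

def pascal_deviate(k, p):
    """Pascal deviate: sum of k geometric deviates, each rounded up."""
    q = 1.0 - p
    return sum(math.ceil(math.log(1.0 - random.random()) / math.log(q))
               for _ in range(k))

def nb_deviate(k, p):
    """Negative binomial deviate: the Pascal sum less k, equation (102)."""
    return pascal_deviate(k, p) - k
```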
1.3.12 Weibull p.d.f.

In recent years the Weibull distribution has become more popular as a reliability function. It is named after the Swedish scientist Waloddi Weibull, who used it to analyze the breaking strength of solids. A chief advantage of the Weibull distribution is that, as in the bathtub curve, its hazard rate function may be decreasing for β < 1, constant for β = 1, or increasing for β > 1. When β = 2, the Weibull is called the Rayleigh p.d.f. The hazard rate function is given by

h(t) = βt^{β−1}/α^β, t ≥ 0, α, β > 0   (103)

The Weibull density and reliability functions are, respectively,

f(t) = (βt^{β−1}/α^β) e^{−(t/α)^β}, t ≥ 0, α, β > 0   (104)

R(t) = e^{−(t/α)^β}, t ≥ 0, α, β > 0   (105)
The Weibull family of distributions is a member of the family of extreme value distributions discussed later. The Weibull distribution is probably the most widely used family of failure (e.g., electronic component, mechanical fatigue) distributions, mainly because by
proper choice of its shape parameter β, it can be used as an IFR (increasing failure rate), DFR (decreasing failure rate), or CFR (constant failure rate, as in the negative exponential case) model. Often, a third parameter, known as the threshold or location parameter, t0, is added to obtain a three-parameter Weibull, where

R(t) = e^{−[(t−t0)/α]^β}, t ≥ t0 > 0, α, β > 0   (106)

How to Generate Random Numbers Using the Weibull Distribution. Employing the inverse transform method and solving the equation u = F(x) = 1 − e^{−(x/α)^β} for x, we obtain the random deviate for the Weibull p.d.f.,

x*_i = α[−ln(1 − u_i)]^{1/β}, i = 1, 2, ...   (107)
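A sketch of equation (107):

```python
import math
import random

def weibull_deviate(alpha, beta):
    """Weibull deviate with scale alpha and shape beta by inverse
    transform, equation (107)."""
    return alpha * (-math.log(1.0 - random.random())) ** (1.0 / beta)
```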
As in the bathtub curve explained in Section 1.2, a component or, equally, a system may possess three modes of failure, based on its decreasing, constant, and increasing failure rates. The bathtub curve of an individual unit may then be taken to represent the sum of three Weibull hazards, with cumulative hazard and overall reliability

H(t) = ∫₀ᵗ λ(t′) dt′ = (t/α1)^{β1} + (t/α2)^{β2} + (t/α3)^{β3}   (108)

R(t) = exp[−H(t)] = exp[−(t/α1)^{β1}] exp[−(t/α2)^{β2}] exp[−(t/α3)^{β3}]   (109)
where β1 < 1 refers to the infancy or commissioning period, β2 = 1 corresponds to the constant-failure or useful-life period, and β3 > 1 symbolizes the wear-out period. For such a system, preventive maintenance practices must be chosen carefully for the intervals where the wear-out period is in effect. Preventive maintenance in the infancy period, where β1 < 1 is in effect, actually decreases reliability; needless to say, it has no effect during the useful-life period. Aside from preventive maintenance practices to increase instantaneous reliability, corrective maintenance plays a significant role in reducing the number of failures and the time required to make repairs. The notion of availability, A(t), the proportion of time that a system or component is in an operational state, then proves useful. Maintainability is a measure of how quickly a system can be brought back into a repaired state following the failures experienced. If a component or system cannot be repaired, its availability equals its reliability
at that point. Therefore, in general terms,

A(T) = (1/T) ∫₀ᵀ R(t) dt   (110)
Hence, as the period T increases to infinity, the numerator integral approaches the MTTF while the averaging interval T becomes infinite, and the long-run availability of a nonrepairable component or system becomes zero, since all units fail with no repair action:

A(∞) = MTTF/(MTTF + MTTR) = 0   (111)

where MTTR is the mean time to repair (≅ ∞ for a nonrepairable system).
1.3.13 Normal p.d.f.

The normal distribution is sometimes used as the wear-out density function: during the wear-out phase of the bathtub hazard curve, component life follows a normal distribution. It should be familiar to anyone who has studied statistical procedures. This p.d.f. models the distribution of a process that is the sum of a large number of component processes, although it cannot be used for negative times. The density function of the random variable of the time to failure, T, is given as

f(t) = [1/(σ√(2π))] exp[−(t − μ)²/(2σ²)], −∞ < t < ∞, σ > 0   (112)

If we denote the standard normal density of z = (t − μ)/σ, with μ = 0 and σ = 1, by

φ(z) = [1/√(2π)] exp(−z²/2), −∞ < z < ∞   (113)

and its c.d.f. by
Φ(z) = ∫_{−∞}^{z} φ(u) du   (114)

the reliability function is given by

R(t) = ∫_{t}^{∞} f(u) du = ∫_{z}^{∞} φ(z) dz = 1 − Φ(z), z = (t − μ)/σ   (115)

One difficulty with the normal distribution as a reliability function is that it allows for negative values of the random variable. If Φ(−μ/σ) is negligible, this causes no trouble. If it cannot be ignored, the truncated distribution

f(t) = [1 − Φ(−μ/σ)]^{−1} [1/(σ√(2π))] exp[−(t − μ)²/(2σ²)], t ≥ 0, μ, σ > 0   (116)
should be used. Finally, by the central limit theorem (CLT), the mean of a sample of n mutually independent random variables with finite mean and variance is asymptotically normally distributed as n → ∞. Measurement errors often have this distribution, as in the case of all the (positive and negative) deviations from Greenwich–London time for all the clocks around the globe. The CLT also works for a sequence of nonindependent and nonidentical variables in a system, given certain statistical assumptions [17–22].

How to Generate Random Numbers from a Normal Distribution

Method 1: Law of Large Numbers. An early method of generating approximately standard normal N(0, 1) deviates is to generate a random sample of n uniform deviates and calculate their mean,

X̄ = ∑_{i=1}^{n} u_i / n   (117)

For large n, the distribution of X̄ approaches normality by the CLT; in practice this happens for n as small as 12, and it improves with increasing n. Since U(0, 1) variables have a mean of 0.5 and a variance of 1/12, we can standardize by taking n = 12 to get X ∼ N(0, 1):

X = ∑_{i=1}^{12} (u_i − 0.5) = ∑_{i=1}^{12} u_i − 6   (118)

This method of generating N(0, 1) is fast, but the statistical characteristics of the deviates generated are not quite as good as those of the next method.

Method 2: Mathematical Derivation Technique. This method, also called the Box–Muller method (1958), combines the inverse transformation technique and the polar coordinates method. It generates two random deviates from the standard normal distribution from a pair of uniform random numbers (u1, u2):

X1 = √(−2 ln u2) sin(2πu1)   (119)

X2 = √(−2 ln u2) cos(2πu1)   (120)
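A sketch of both methods (the function names are illustrative):

```python
import math
import random

def normal_sum_of_uniforms():
    """Method 1: sum of 12 uniforms less 6, equation (118)."""
    return sum(random.random() for _ in range(12)) - 6.0

def normal_box_muller():
    """Method 2: one Box-Muller pair, equations (119)-(120)."""
    u1, u2 = random.random(), 1.0 - random.random()
    r = math.sqrt(-2.0 * math.log(u2))
    return r * math.sin(2.0 * math.pi * u1), r * math.cos(2.0 * math.pi * u1)
```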
1.3.14 Lognormal p.d.f.

A distribution useful in maintainability and certain fracture problems is the lognormal distribution, which models the distribution of a process that can be considered as the product (as compared to the normal distribution, which covers the
sum) of a number of processes. For example, the rate of return on a compound-interest investment is the product of the returns for a given number of periods. It is used primarily for the wear-out region of the bathtub curve, where the wear on a system may be proportional to the product of the magnitudes of the demands exerted on it. If the random variable T = t1 t2 t3 ··· tn has the lognormal distribution (whereas previously T = ∑ t_i was, by the CLT, normally distributed), the variable y = ln T = ∑ ln t_i is normally distributed. If the variables x1 and x2 have lognormal distributions, the product random variable q = x1 x2 is also lognormally distributed. The p.d.f. of T is

f(t) = [1/(σt√(2π))] exp[−(ln t − μ)²/(2σ²)], t ≥ 0, σ > 0   (121)

If we let μ = ln t0 and σ = w, then

f(t) = [1/(wt√(2π))] exp{−[ln(t/t0)]²/(2w²)}, t ≥ 0, t0, w > 0   (122)

The corresponding c.d.f. is obtained by integrating over t with a lower limit of t = 0. The result can be expressed in terms of the standardized normal integral as

F_T(t) = Φ[w^{−1} ln(t/t0)]   (123)

For small values of w, the lognormal and normal distributions are similar in shape. When t0 is the median of the random variable T, the mean and variance of the lognormal p.d.f. for T are

μ_T = t0 exp(w²/2)   (124)

σ_T² = t0² exp(w²)[exp(w²) − 1]   (125)

The lognormal distribution may be derived by the following argument. Consider a certain process where failure is due to fatigue cracking. Let X1 < X2 < ··· < Xn be a sequence of random variables denoting the size of the crack at successive stages of growth, and assume that the growth is proportional to the size of the crack: Xi − Xi−1 is randomly proportional to Xi−1, with failure occurring when the crack size reaches Xn. The proportionality factors δ_i are independent but not necessarily identically distributed random variables. Hence, skipping the intermediate steps and equations,

ln Xn = ∑ δ_i + ln X0   (126)

where X0 is the initial size of minute flaws, voids, and so on. By the CLT (Section 1.3.13), the sum ∑ δ_i converges in distribution to the normal distribution, and hence ln Xn is asymptotically normally distributed.
How to Generate Random Numbers from a Lognormal p.d.f.

Method 1: Law of Large Numbers. One follows the same steps as for the normal random number generator presented above. The relationship of a lognormal variable with median m = exp(μ) and shape parameter σ to the standard normal deviate N(0, 1), and of N(0, 1) to the uniform random numbers u_i, gives

x* ∼ m e^{σN(0,1)} = m exp[σ(∑_{i=1}^{12} u_i − 6)]   (127)

Method 2. See method 2 for the normal generator. One can apply the same derivation for the lognormal deviate, where y = ln x, y ∼ normal, and x ∼ lognormal.
1.3.15 Logistic p.d.f.

X is logistic if its p.d.f. is

f(x) = exp[−(x − a)/k] / (k{1 + exp[−(x − a)/k]}²) = sech²[(x − a)/2k] / (4k)   (128)

and its c.d.f., reliability, and hazard functions are

F(x) = [1 + e^{−(x−a)/k}]^{−1}   (129)

R(x) = 1 − F(x) = [1 + e^{(x−a)/k}]^{−1}   (130)

h(x) = f(x)/R(x) = 1/(k[1 + e^{−(x−a)/k}])   (131)

Also, E(x) = mode = median = a, and

Var(x) = b² = k²π²/3, i.e., b = 3^{−1/2} kπ   (132)

with a = 0, k = 1 for a standard logistic.

How to Generate Random Numbers from a Logistic p.d.f. By the inverse transform technique, u_i = F(x) = [1 + e^{−(x−a)/k}]^{−1} leads to (1 − u_i)/u_i = e^{−(x−a)/k}. Then taking the ln of both sides, ln[(1 − u_i)/u_i] = −(x − a)/k, we get x*_i = a − k ln[(1 − u_i)/u_i].
1.3.16 Cauchy p.d.f.

X is Cauchy, with location parameter a and scale parameter b, if

f(x) = 1/(πb{[(x − a)/b]² + 1}), −∞ < x < ∞   (133)

The standard Cauchy, which arises as the ratio of two independent N(0, 1) random variables, is

f(x) = 1/[π(x² + 1)], a = 0, b = 1   (134)

The standard Cauchy is symmetric about x = 0, its odd moments about the origin are zero, and its mode is at x = 0. The reciprocal of a Cauchy(a, b) random variable is also Cauchy(a′, b′), where a′ = a/(a² + b²) and b′ = b/(a² + b²).

How to Generate Random Numbers from a Cauchy p.d.f. A Cauchy(0, 1) deviate uses the ratio of two independent N(0, 1) standard random deviates:

x* = (∑_{i=1}^{12} u_i − 6)/(∑_{j=1}^{12} u_j − 6)   (135)

where the u_i and u_j are independent uniform variates, i, j = 1, 2, 3, ..., 12.
1.3.17 Hypergeometric p.m.f.

The probability of x successes in a sample of size n, drawn from X successes contained in a population of N elements, is hypergeometric with p.m.f. given as

P(x) = C(X, x) C(N − X, n − x) / C(N, n)   (136)

with mean and variance

E(x) = nX/N   (137)

Var(x) = (nX/N)(1 − X/N)(N − n)/(N − 1)   (138)
How to Generate Random Numbers from a Hypergeometric p.m.f. Select n rectangular (uniform) independent random numbers u_i, i = 1, 2, ..., n, in sequence. If u_i < p_i, record a success, and sum the successes to obtain x*, where

p_1 = X/N, N_1 = N, N_{i+1} = N_i − 1, p_{i+1} = (N_i p_i − d)/(N_i − 1), with d = 1 if u_i < p_i and d = 0 if u_i ≥ p_i   (139)
1.3.18 Extreme Value (Gumbel) p.d.f.'s

1. Smallest extreme value. Consider a sample of n independent identically distributed random variables from a distribution with c.d.f. F(x). Then the c.d.f. of x = min X_i, i = 1, ..., n, is

G(x) = 1 − [1 − F(x)]^n   (140)

As n gets larger there are three possible limiting distributions (as in a series system, based on the principle that the system cannot be stronger than its weakest, or minimum, element):

a. Type I distribution. If f(x) tends to zero exponentially as x → ∞ (e.g., the normal distribution), then

G(x) = 1 − exp{−exp[(x − γ)/α]}, −∞ < x < ∞, γ, α > 0   (141)

b. Type II distribution. If the range of x is unlimited from below and if for some α, β > 0, lim (−x)^α F(x) = β, then

G(x) = 1 − exp{−[(x − γ)/α]^{−β}}, −∞ < x ≤ γ, α, β > 0   (142)

c. Type III distribution. If the range of x is bounded from below, that is, F(x) = 0 for x ≤ γ < ∞, and F(x) behaves like α(x − γ)^β for some α, β > 0 as x → γ (which covers the uniform, exponential, and Weibull p.d.f.'s), then

G(x) = 1 − exp{−[(x − γ)/α]^β}, γ ≤ x < ∞, α, β > 0   (143)

The type I asymptotic distribution of the smallest extreme results from defining the hazard rate to be of the form e^x. The type I function may also be used as a failure model for a series system when the underlying distribution is exponential. The type II distribution is not very useful in reliability since it is also defined in the
negative domain. Type III functions include the Weibull distribution: if x = ln T, where T has the Weibull distribution, then x has a type I extreme value distribution.

2. Largest extreme value. Consider a sample of n i.i.d. (independent identically distributed) random variables, each having c.d.f. F(x). The distribution of the largest is

U(x) = [F(x)]^n   (144)

As n gets large, there are again three possible limiting distributions (as in a parallel system, based on the principle that the system cannot be weaker than its strongest, or maximum, element):

a. Type I distribution. If f(x) tends to zero exponentially as x → ∞ (e.g., the exponential distribution), then

U(x) = exp{−exp[−(x − γ)/α]}, −∞ < x < ∞, γ, α > 0   (145)

b. Type II distribution. If the range of x is unlimited from above and if for some α, β > 0, lim x^α[1 − F(x)] = β, then

U(x) = exp{−[(x − γ)/α]^{−β}}, x ≥ γ   (146)

c. Type III distribution. If the range of x is bounded above, F(x) = 1 for x ≥ γ, and for finite γ, 1 − F(x) behaves like α(γ − x)^β as x → γ (e.g., the uniform), then

U(x) = exp{−[−(x − γ)/α]^β}, x ≤ γ, α, β > 0   (147)

The type I asymptotic distribution of the largest value may be used for corrosive processes and for time to failure in parallel systems.

How to Generate Random Numbers from the Smallest Type I Extreme Value. Using the inverse transform technique for the smallest extreme value, x*_i = γ + α ln(−ln u_i), −∞ < x < ∞. For the largest extreme value, the sign of the logarithmic term is reversed: x*_i = γ − α ln(−ln u_i).
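A sketch of both type I generators (the parameter names are illustrative):

```python
import math
import random

def smallest_extreme_deviate(gamma_loc, alpha):
    """Type I smallest-extreme deviate by inverse transform."""
    return gamma_loc + alpha * math.log(-math.log(1.0 - random.random()))

def largest_extreme_deviate(gamma_loc, alpha):
    """Type I largest-extreme (Gumbel) deviate; the sign of the
    logarithmic term is reversed."""
    return gamma_loc - alpha * math.log(-math.log(1.0 - random.random()))
```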
1.3.19 Summary of the Distributions and Relationships Most Commonly Used

Figure 1.6 shows the relationships among the distributions in Chapter 1.
[FIGURE 1.6 Relationship among most statistical distributions in Chapter 1. The original diagram is a network of limiting and transformation relationships connecting the geometric, rectangular, negative binomial, discrete Weibull, Poisson, lognormal, normal (and standard normal), hypergeometric, beta-binomial, binomial, Bernoulli, Sahinoglu–Libby, beta, arc-sine, Cauchy (and standard Cauchy), gamma, chi-square, Erlang, F, t, exponential, Rayleigh, Weibull, Laplace, triangular, and uniform (standard and general) distributions.]
1.4 LIFE TESTING FOR COMPONENT RELIABILITY

If the random variable X is the lifetime or time to failure of a unit, the study of this variable is defined to involve life testing or reliability theory. The probability of a component surviving until time X0 is defined as the reliability of the component at time X0 and is denoted R(X0). Even though the random variable is referred to as representing time, one may have other variables, such as distance in the case of a vehicle; the same concepts apply [24].

1.4.1 Estimation Methods for Complete Data

We consider next the estimation of reliability and the parameters of the failure distributions. There will be two situations: (1) when the failure distribution is not known (the nonparametric case), and (2) when the failure distribution is known (the parametric case). In each case, there are again two situations: (1) when the exact failure times are known, and (2) when the only information available is the number of survivors at different points in time [1,2]. The data used in the estimation procedures are obtained primarily from life tests. Life testing is a procedure in which failure data are obtained from a sample of N items put into the projected operating environment. If all units fail before the test is terminated, the test is complete. Otherwise, the test is incomplete and censoring exists, in which case the usual statistical calculations for the MTTF (mean time to failure) or μ (average of the failure times) no longer make sense. Type I censoring occurs when a life test is terminated at a specified time, say T0, before all N items fail. Type II censoring occurs when the life test is terminated at the time of a particular rth failure, r < N. A more complicated multiple censoring scheme has some items removed during the test in addition to those failing. In most reliability studies, the form of the distribution of the variable (time to failure, or an operational characteristic such as distance traveled) is assumed or known. Occasionally, however, it is not possible to make an assumption concerning the form of the distribution, in which case nonparametric methods come into play. Both parametric and nonparametric methods will be studied. For both analyses, data may be grouped or ungrouped. Ungrouped data occur when individual component failures are recorded, as in a laboratory setting where the sample size is not large and sufficient instrumentation and personnel are available to measure the exact failure times. The opposite is true when many data exist with no adequate funds for personnel or equipment to record all failure times, and the only way to accommodate the failures is to stop at equal or unequal time increments to group them. Therefore, if the data consist of the number of failures within each time period with no information about the exact times of failure, they are classified as grouped data. Let's now look at these two types of empirical data [1,2,25,27,34].
Ungrouped Data. Ungrouped data consist of a series of failure recordings, t_i, i = 1, ..., N, for the N units available. The order statistics are given in ascending order of magnitude: t_(1) < t_(2) < t_(3) < ··· < t_(i) < ··· < t_(N). An estimate of the reliability function is denoted by R̂(t_(i)) = 1 − p_(i), with jumps at t_(i), where p_(i) = F̂(t_(i)). There are a number of formulas for p_(i); for example, all but the first formula shown below are cases of the general formula (i − a)/(N − 2a + 1) [4,26]:

(i − 1)/N, i/(N + 1), (i − 3/8)/(N + 1/4), (i − 0.3)/(N + 0.4) (small samples), (i − 0.44)/(N + 0.12), (i − 0.5)/N (Hazen's, for N > 20)   (148)

For simplicity we will use the one with a = 0, which gives us the second formula in the series. We will show ways to plot the reliability,

R̂(t) = 1 − F̂(t) = (N + 1 − i)/(N + 1) = e^{−H(t)}   (149)

and the cumulative hazard,

Ĥ(t) = −ln R̂(t) = ln(N + 1) − ln(N + 1 − i)   (150)

From a plot of H(t), it is possible to judge whether the hazard function is increasing, decreasing, or constant: a linear plot implies a constant hazard, a convex plot an increasing hazard, and a concave plot a decreasing hazard. Direct estimation of the hazard function is given by

ĥ(t) = 1/[(t_(i+1) − t_(i))(N − i + 1 − a)]   (151)

where a takes the same value as that used to estimate R(t). The density function is then given by f̂(t) = ĥ(t)R̂(t), as follows from equation (2) in Section 1.1. The example below illustrates the use of these formulas.

Ungrouped Example. The failure times in months were recorded on certain equipment for i = 1, ..., 9 as 7.2, 9.7, 12.3, 13.5, 16, 18.2, 18.6, 19.8, and 21.3. The descending reliability and ascending cumulative hazard estimates at the times t_i are listed in Table 1.2.
TABLE 1.2 Ungrouped Complete Data Example

 i     t_i     n_i    R(t_i)    H(t_i) = −ln R(t_i)
 0     0.0      9     1.00      0.0
 1     7.2      8     0.90      0.105
 2     9.7      7     0.80      0.223
 3    12.3      6     0.70      0.357
 4    13.5      5     0.60      0.511
 5    16.0      4     0.50      0.693
 6    18.2      3     0.40      0.916
 7    18.6      2     0.30      1.204
 8    19.8      1     0.20      1.609
 9    21.3      0     0.10      2.303
Note that n_i represents the remaining units at the ith stage and that R(0) = 1, as in Table 1.2.

Grouped Data. As mentioned above, the data are monitored such that the only failure information available is the number of surviving items at times recorded in ascending order of magnitude, t_(1) < t_(2) < ··· < t_(i) < ··· < t_(K), such that R̂(t_i) = n_i/N, i = 1, ..., K. Therefore, combining into Ĥ(t_i) = −ln R̂(t_i) = ln N − ln n_i, we acquire plots for both the reliability and the cumulative hazard. Additionally, we may want to estimate the mean and variance of the failure distribution for grouped data. Whereas for ungrouped data the mean μ is simply the arithmetic average of the time intervals (differences) between the individually recorded failures, and the variance σ² is the sum of squared deviations of these intervals from the mean divided by N − 1, for grouped data it is a different story. We approximate f(t) by a histogram with K intervals, the midvalue of the interval between t_{i−1} and t_i being M_i with frequency f_i, where ∑_{i=1}^{K} f_i = N; then, with subscript G denoting grouped data,

μ_G = (1/N) ∑_{i=1}^{K} f_i M_i   (152)

σ_G² = (1/N) ∑_{i=1}^{K} f_i (M_i − μ_G)²   (153)

To calculate a percentile, or the quantile of order P, we must locate the observation with rank r = Pn + 0.5. Then obtain the cumulative frequencies and determine the class that includes the percentile; it is the class whose cumulative frequency is the first to exceed r. Denote the lower and upper limits of this class by L and U, the frequency of this class by f, and the number of observations that
are smaller than L by m; of course, m < r [27]. Then the (100P)th percentile is given by

L + [(r − m)/f](U − L)   (154)

Grouped Example. In equal increments of 10 hours, the data in Table 1.3 were compiled in seven intervals with N = 100 units starting and n_i units remaining at the ith stage, where R(0) = 1.

TABLE 1.3 Grouped Complete Data Example

 i     T_i     n_i    R(t_i) = n_i/N    H(t_i) = −ln R(t_i)
 0     10     100     1.00              0
 1     20      60     0.60              0.51082
 2     30      40     0.40              0.91629
 3     40      25     0.25              1.38629
 4     50      10     0.10              2.30259
 5     60       5     0.05              2.99573
 6     70       0     0.00              —

1.4.2 Estimation Methods for Incomplete Data

Rather than wait for all N items to fail, it is often advantageous to halt the testing procedure earlier. This is done either by stopping the test at a given time T0 (type I) or when the rth (r fixed) failure occurs (type II). Censoring of types I and II is single censoring from the right; we assume no censoring from the left (such as starting at a nonzero threshold value) in this section. We define the data to be multiply censored if units are removed at different times during the life test. There may be a couple of reasons for multiply censored data: either the units are removed and thus become unavailable (e.g., the death of a cancer patient, who is no longer available for clinical tests), or a new or irrelevant mechanism that is not under analysis, and not known to us, caused the failure. In this section we assume a negative exponential density governing the distribution of failure times. The point and interval estimates for the reliability are
R(t) = exp(−t/μ̂), where μ̂ = (1/n) ∑_{i=1}^{n} t_i   (155)

2nμ̂/χ²_{(α/2),2n} < μ < 2nμ̂/χ²_{(1−α/2),2n}   (156)
which is a 100(1 − α)% confidence interval.

Singly (Right)-Censored Data for Replacement and Nonreplacement Tests. To summarize the formulas given below, note that in all situations the
estimate of μ is given by the total good time divided by the number of failures, depending on the rule used for stopping the test. Rep. denotes replacement and NRep. denotes nonreplacement.

1. Type I (NRep.). The estimator of μ = MTTF when testing stops at a predetermined time T0:

μ̂ = (1/r)[∑_{i=1}^{r} t_i + (n − r)T0]   (157)

A 100(1 − α)% confidence interval for μ is given by

−T0/ln(1 − p_U) < μ < −T0/ln(1 − p_L)   (158)

where p_L and p_U are the confidence limits for a binomial parameter p = 1 − R(T0) with the nonparametric estimator p̂ = 1 − r/n (good for p ≤ 0.5); or, if a negative exponential (constant failure rate) is assumed, use p̂ = 1 − exp(−t/μ̂).

2. Type I (Rep.).

μ̂ = NT0/n   (159)

is the type I (Rep.) estimator for the censored MTTF.

3. Type II (NRep.). The estimator of μ = MTTF when testing stops at the rth failure is

μ̂ = (1/r)[∑_{i=1}^{r} t_i + (n − r)t_r]   (160)

A 100(1 − α)% confidence interval for μ is given by

2rμ̂/χ²_{(α/2),2r} < μ < 2rμ̂/χ²_{(1−α/2),2r}   (161)

4. Type II (Rep.).

μ̂ = Nt_r/n   (162)
is the type II (Rep.) estimator for the censored MTTF. Equations are presented below for obtaining unbiased estimators of R(t) for NRep. and Rep. tests. Even though μ̂ is an unbiased estimator of μ, functions of it, such as the rate λ̂ = μ̂^{−1} and the reliability R̂(t) = exp(−t/μ̂), are not unbiased estimators of λ and R(t), respectively. An unbiased estimator of R(t) for the
nonreplacement and replacement tests stopped at the rth failure is, respectively,

R̂(t) = [1 − t/(rμ̂)]^{r−1}, t < rμ̂; 0 elsewhere   (163)

R̂(t) = [1 − t/(rμ̂)]^{r}, t < rμ̂; 0 elsewhere   (164)
For MLE, by the invariance property, the functions of μ̂ are still MLEs.

Multiply Censored Data. We study the nonparametric analysis of multiply censored data, where some units are removed from the test before failure for many different reasons, especially in the biomedical and medical (cancer) research community. The estimation technique used most to calculate the reliability function (or survival function, in human-related tests) is the product-limit or Kaplan–Meier estimate.

Ungrouped Data. As in the individual failure times given below, the sequence consists of a series of times, t_(1) < t_(2) < ··· < t_(i) < ··· < t_(N). Each of these times signifies either a failure or a censored removal (indicated by an asterisk). To start estimating reliability, we need a recursive relation for R(t_i) in terms of R(t_{i−1}). With no censoring, as noted earlier,
R̂(t_{i−1}) = (N + 2 − i)/(N + 1)   (165)

By taking the ratio

R̂(t_i)/R̂(t_{i−1}) = (N + 1 − i)/(N + 2 − i)   (166)

one obtains the recursive relationship

R̂(t_i) = [(N + 1 − i)/(N + 2 − i)] R̂(t_{i−1})   (167)

an equation of conditional reliability. That is, the probability that a unit survives up to time t_i is the product of the probability that it survives up to time t_{i−1} multiplied by the conditional probability (N + 1 − i)/(N + 2 − i) that it does not fail between t_{i−1} and t_i, given that it is operating at time t_{i−1}. If a censoring action occurs at t_i, the reliability remains the same, so that R̂(t_i) = R̂(t_{i−1}). Therefore,

R̂(t_i | t_{i−1}) = (N + 1 − i)/(N + 2 − i) for a failure at t_i, and 1 for censoring at t_i   (168)
This leads to the product-limit or Kaplan–Meier estimator as follows [1]:

R̂(t_i) = R̂(t_i | t_{i−1}) R̂(t_{i−1} | t_{i−2}) R̂(t_{i−2} | t_{i−3}) ··· R̂(t_1 | 0), where R̂(0) = 1   (169)

Ungrouped Example. The following failure data are from a mechanical engineering experiment: {30, 53, 54*, 68, 83, 99*, 107, 116, 149*, 158}. The corresponding conditional reliabilities for the 10 sequential activities are calculated to be

R̂(t_i | t_{i−1}) = {0.909, 0.900, 1.0, 0.875, 0.857, 1.0, 0.8, 0.75, 1.0, 0.5}   (170)

Finally, the Kaplan–Meier (reliability) product limits are

R̂(t_i) = {0.909, 0.818, 0.818, 0.716, 0.614, 0.614, 0.491, 0.368, 0.368, 0.184}   (171)
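A sketch that reproduces the product limits of equation (171) from the data above; the boolean flags mirror the asterisked (censored) times:

```python
def kaplan_meier(times, censored):
    """Product-limit (Kaplan-Meier) reliability estimates via equations
    (168)-(169); censored[i] is True for a removal."""
    N = len(times)
    R, limits = 1.0, []
    for i in range(1, N + 1):
        if not censored[i - 1]:            # failure at t_(i)
            R *= (N + 1 - i) / (N + 2 - i)
        limits.append(round(R, 3))
    return limits

data = [30, 53, 54, 68, 83, 99, 107, 116, 149, 158]
flags = [False, False, True, False, False, True, False, False, True, False]
print(kaplan_meier(data, flags))
# [0.909, 0.818, 0.818, 0.716, 0.614, 0.614, 0.491, 0.368, 0.368, 0.184]
```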
Grouped Data. Contrary to individual failure times, here all the analyst has is the number of failures recorded in separate (disjoint) intervals. Suppose that the numbers of failed and removed (nonfailed) items are recorded. New expressions for conditional reliability are derived to apply to grouped data. Let d_i be the number of defectives (or failures) during the ith interval and c_i the number censored (unfailed but removed), so that n_i (remaining) = n_{i−1} − d_i − c_i, where n_0 = N:

R̂(t_i | t_{i−1}) = 1 − d_i/(n_{i−1} − q_i c_i)   (172)

If removed items are recorded, a fraction q_i = 50% of the items removed is conventionally assumed to fail; however, the proportions q_i of failure (or death) may be entered differently at varying intervals at will, or a generic 50% may be used for all intervals. To estimate the reliability, the point and conditional reliability values are updated at the ends of the time intervals in which failures have occurred; removals alone do not change the reliability value.

TABLE 1.4 Grouped Multiply Censored Data Example

 i    t_i (CPU seconds)    n_{i−1}    d_i    c_i    n_i    R̂(t_i|t_{i−1})    R̂(t_i)      H(t_i)
 0           0                —        —      —     150         —            1.0         0.0
 1        1000               150       5      0     145      0.966667       0.966667    0.033902
 2        2000               145      19      0     126      0.868966       0.84        0.174353
 3        3000               126      66     10      50      0.454545       0.381818    0.962811
 4        4000                50      30      6      14      0.361702       0.138104    1.97975
 5        5000                14      10      4       0      0.166667       0.023017    3.77152
Grouped Example. The preceding censored grouped data in Table 1.4 were collected by an independent quality-focused nonprofit organization on the failure of identical commercial software modules sold on the market, for 0 < t < 5000 CPU seconds at 1000-second intervals. The nonparametric reliability and cumulative hazard functions were required (q_i = 0.5 for convenience).

1.5 REDUNDANCY IN SYSTEM RELIABILITY

In earlier sections of this chapter we dealt with life testing and reliability evaluations of the individual components that constitute a system. The same techniques could be expensive, inaccurate due to environment change, and mathematically intractable to perform and apply during a system analysis. Also, positive or negative correlations, even though hard to evaluate, exist between the individual components, which prohibits the convenient assumption of independence. Therefore, when more than one component is included to form a system, and also to improve the reliability of a system, the concept of redundancy begins. But whether the type of redundancy preferred and designed is beneficial to the aims of the system analyst is a different matter. Next we study a variety of systems in terms of their reliability. Another way of looking at the quality of system performance, however, is through its availability, defined as the proportion of time that a system is ready to be used. One can upgrade instantaneous reliability by adding redundancy or by improving reparability and maintainability [28,29]. In the following subsections the components are assumed to be statistically independent.

1.5.1 Series System Reliability

In a series system the system reliability is the product of the individual component reliabilities, which is lower than the reliability of the weakest component in the system. Such systems are sometimes called nonredundant due to this negative effect. The formula depicts this reality:

R = R_1 R_2 R_3 ··· R_N = ∏_{i=1}^{N} R_i, 0 < R_i < 1   (173)
If R_i = 0.9, then R = R_1 R_2 = 0.81. If R_i = 0.99, then ∏_{i=1}^{200} R_i = 0.99^{200} ≈ 0.134, down to 13.4% from 99% for a single component. However, series systems are sometimes good for purposes other than reliability, such as usability or convenience. Before the advent of calculators and computers, charts were the only means available to calculate the system reliability of N components in series. In the earlier stages of space missions, practitioners did not understand why payloads stacked on top of each other in a series format would not result in more reliable performance in space [29]. This mystery lasted until someone figured out that the system reliability of a series system is worse than its weakest link. This indicates openly that reliability modeling should start in the design stage, not
after the fact, when it is both too late and too expensive to redeem. Eventually, reliability analysts turned to parallel redundancy to upgrade system reliability.

1.5.2 Active Parallel Redundancy

In these designs, two or more components form a system without a switch, such that the failure of one does not lead to system breakdown as it did in series systems. This duplication is defined as redundancy. There are four basic active parallel topologies; assume that independence prevails between components.

Simple Active Parallel System. A simple parallel system is one in which the system input–output (ingress–egress) connection is a success if any individual component is a success. In a simple active (i.e., ready to operate without the need for a switch) parallel system, the individual components are positioned in parallel topology, where with n components, each of reliability R_i, the system reliability is given as
R_a = 1 − ∏_{i=1}^{n} (1 − R_i) = 1 − (1 − R_i)^n (identical components)   (174)

1 − R_a = (1 − R_i)^n   (175)

Reorganizing (174) as in (175) and taking the natural logarithm of both sides yields

n = ln(1 − R_a)/ln(1 − R_i)   (176)

If R_i = 0.95, then to raise the active parallel system to 99.5% reliability,

n = ln(1 − 0.995)/ln(1 − 0.95) = −5.3/−3.0 ≈ 2   (177)

Therefore, if Q = 1 − R for all components, a measure of improvement called relative efficiency is given by

E = (1 − Q^n)/(1 − Q)   (178)

and the maximum improvement potential (or relative efficiency) as n goes to ∞ is then

E_max = 1/(1 − Q) = 1/R = 100%/R   (179)
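A sketch of equation (176) as a small sizing utility (the function name is illustrative):

```python
import math

def units_needed(target_r, unit_r):
    """Smallest n with 1 - (1 - unit_r)**n >= target_r, equation (176)."""
    return math.ceil(math.log(1.0 - target_r) / math.log(1.0 - unit_r))

print(units_needed(0.995, 0.95))   # 2, as in equation (177)
```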
42
FUNDAMENTALS OF COMPONENT AND SYSTEM RELIABILITY
with n components is a success, the overall system is a success. Such systems are also called high-level redundant (HL). The reliability is RHL = 1 − (1 − R n )k
(180)
For example, given R1 = R2 = R3 = R4 = 0.9 and k = 2, n = 2, RHL = 1 − (1 − 0.92 )2 = 0.9639. For six identical components, for example, RHL = 1 − (1 − R n )k = 2R 3 − R 6 = (2)(0.9)3 − 0.96 = 0.9266. Parallel-in-Series System The parallel-in-series system is one for which maximum reliability can be obtained but in which reliability is hardest to maintain. In this system an individual component is replaced by k components in an active parallel system to increase reliability in a series of n branches. Such systems are also called low-level redundant (LL). This structure is to be used when very high reliability is necessary for a given length of time, as in a missile Þring. The reliability is RLL = [1 − (1 − R)k ]n (181) For example, given R1 = R2 = R3 = R4 = 0.9 and k = 2, n = 2, RLL = [1 − (1 − 0.9)2 ]2 = 0.9801. For six identical components this time, RLL = 1 − (1 − Rn )k = (2R − R 2 )3 = (2)(0.9)3 − 0.96 = 0.9703. We have RLL − RHL = 6R 3 (1 − R)2 = 0.0437 for our example of six identical components. Consequently, RLL > RHL . Partial Parallel Topology This structure is designated by a group of n components in parallel, out of which any k are required to operate for the system to be successful. A popular example is when of three parallel engines in a jet plane, where R + Q = 1, at least two (k = 2) must work for the jet to operate. If one expands (R + Q)3 = R 3 + 3R 2 Q + 3RQ2 + Q3 , the probability of success is given by R 3 + 3R 2 Q. In the case of n unlike components, ni=1 (Ri + Qi ) = 1. We then get rid of those terms that represent failure scenarios for the deÞned “at least k” statement of system reliability. 1.5.3 Standby Redundancy Standby redundancy is identical to the structure of an active parallel system except that it is termed “standby” redundant when a switch exists to direct the ßow of current as desired [3]. Provided that the switch operates perfectly in a more efÞcient two-component standby system, the second unit is not activated until the Þrst unit fails. The MTTF (reciprocal of λ, which is the rate of failure) of an active two-unit system is 1.5 times higher than that of a single component. On the other hand, the MTTF of a standby system is twice as high as that of a single unit. However, for λt 1, R(t) = 1 − exp(−λt) ≈ 1 − λt for a single component, R(t) ≈ 1 − (λt)2 for an active parallel system, and Þnally, for a standby system, R(t) ≈ 1 − 0.5(λt)2 . This can be interpreted as follows: For short time intervals
where λt ≪ 1, the standby system failure probability, F = 1 − R, is only one-half the failure probability of an active parallel system; that is, F_a ≈ (λt)² and F_stby ≈ 0.5(λt)². See equation (67) for a general statistical treatment of n components in standby [1–3]:

R_stby(t) = e^{−λt} ∑_{k=0}^{n−1} (λt)^k / k!, t ≥ 0, λ > 0   (182)
As an example, consider a system of two units with λ = 0.01. The reliability at t = 10 hours using equation (67) is R_stby = (1 + 0.1)[exp(−0.01 × 10)] = 0.90484 + (0.90484)(0.1) = 0.995324. On the other hand, the reliability of two identical units in an active parallel system, when R(10) = 0.90484, is R_a(10) = 1 − (1 − 0.90484)² = 0.99095. In this two-component example, the advantage of the standby system over the active parallel structure is not significantly higher. The advantage is forfeited if the switching reliability is not perfect (R_ss < 1). For a two-unit standby,

R_stby(t) = (1 + R_ss λt) exp(−λt)   (183)
Further, the reliability of an n-unit standby system with imperfect switching reliability [2,3] is

R_stby(t) = e^{−λt} [1 + R_ss ∑_{k=1}^{n−1} (λt)^k / k!], t ≥ 0, λ > 0   (184)
Additionally, a component in standby mode may fail before primary system failure. Such standby component failures happen rarely, when the secondary, tertiary, or kth standby unit is called upon after having deteriorated while waiting its turn of duty in standby mode. For two components, the system reliability is then calculated as

R_stby = {1 + R_ss (λ1/λ2)[1 − exp(−λ2 t)]} exp(−λ1 t)   (185)

where λ2 is the failure rate of the second unit while standing by. What happens when the failure rate λ3 of the second component, while standing by, differs from the first, λ1? Then

R_stby = exp(−λ1 t) + R_ss {exp[−(λ2 + λ3)t] − exp[−(λ1 + λ2 + λ3)t]}   (186)
Finally, if a system consists of a number of equal components in series supported by one or more spares, it cannot fail until the failure after the last spare is replaced. In this case, the system failure rate would be N λ, with N components in series with a constant failure rate λ. If n spares were on hand, however, the system
would not be operable until n + 1 failures happened. This results in a system MTTF of (n + 1)/(Nλ). Then

R_s(t) = exp[−Nλt/(n + 1)]   (187)

Example. A jet plane has an airborne radar system with 30 identical integrated circuits (ICs), and the pilot has three spares for a 1-hour flight. If the IC hourly failure rate is λ = 0.01, the series system reliability with no spares (n = 0) at time t = 1 hour is R_s(1) = exp[−Nλt/(n + 1)] = exp(−0.3) = 0.74082 ≈ 74%. With n = 3 spares, the reliability is increased to

R_s(1) = exp[−Nλt/(n + 1)] = exp(−0.3/4) = 0.928 ≈ 93%   (188)

When the standby component is too cold at the moment it is called upon, it may experience switching problems when activated; in that case, the switching reliability R_ss and the standby failure rate λ2 will both be reduced. At the other extreme, when the standby is hot, switching failures are reduced and the switching reliability R_ss is higher; when fully hot, the standby failure rate λ2 equals that of the primary component, λ1. In that instance, when λ1 = λ2 = λ, the reliability equation for two components becomes

R_stby = {1 + R_ss[1 − exp(−λt)]} exp(−λt) = (1 + R_ss) exp(−λt) − R_ss exp(−2λt)   (189)

From this equation, if the switching reliability is almost perfect (R_ss ≅ 1) due to hot standby, the equation converts to that of an active parallel system, as if no standby switch existed:

R_stby = 2 exp(−λt) − exp(−2λt) = R_a   (190)
1.5.4 Other Redundancy Limitations: Common-Mode Failures and Load Sharing

The positive advantages of redundant systems are forfeited when dependencies between components create unexpected disadvantages [1].

Common-Mode Failures. These occur when common connections or stresses influence the redundant components such that they fail simultaneously — as when a bird fracturing a jet engine in turn causes a commercial jet liner to crash. This is like installing a component having reliability R in series with a parallel structure, displayed as follows if R = exp(−λt):

R_a = (2R − R²)R   (191)
In the example of a twin-engine aircraft, if each engine had independent failure probability p = 10^{−6} and the common-mode failure probability were p = 10^{−9}, the system failure probability, p_a ≈ 10^{−9}, would be dominated entirely by common-mode failure. If a subscript I denotes "independent" and C denotes "common mode," and if for λ = λ_I + λ_C we define a factor β = λ_C/λ, then for an active parallel system,

R_a = [2 exp(−λ_I t) − exp(−2λ_I t)] exp(−λ_C t)   (192)

and using λ_C = βλ and λ_I = (1 − β)λ, we can reformulate this as

R_a = {2 − exp[−(1 − β)λt]} exp(−λt)   (193)

The system reliability decreases as β increases, as seen in the rare-event approximation

R_a ≈ 1 − βλt − (1 − 2β + β²/2)(λt)² + ···   (194)

as compared to 1 − (λt)² when no common-mode failure is present. Also,

MTTF_a = [2 − (2 − β)^{−1}] · MTTF   (195)
Load Sharing. This limitation is another factor that degrades system reliability in active parallel systems. The failure rate λ_L of the surviving component increases due to the added stress once the first component fails with rate λ, so that λ_L > λ. With no common-mode failures,

R_a = 2 exp(−λ_L t) + exp(−2λt) − 2 exp[−(λ + λ_L)t]   (196)

which defaults to the original equation for the active parallel structure if λ_L = λ. Now if λ_L → ∞, equation (196) reduces to the reliability of two components placed in series:

R_a = exp(−2λt) = R_series   (197)

This means that if the failure of either component causes instantaneous failure of the component sharing its load, the active parallel system is effectively no more reliable than a series arrangement of the two components.

1.6 REVIEW OF SOFTWARE RELIABILITY GROWTH MODELS

There are hundreds of papers and tens of models in a multitude of references on the subject of software reliability, a subject that has risen to unimaginable dimensions as the information age galloped full speed with a "fin de siècle" (end of the century) spirit [30–32]. It would be impractical to explain relevant
methods on an individual basis. However, for the general purpose of introducing first-time readers to these techniques, a classified breakdown of methods is in order. Due to space limitations, a sample handful of pioneering and representative techniques, with mathematical–statistical modeling applications leading to applicable and practical software reliability engineering solutions, have been carefully chosen and included. Although the author would have liked to include them all, software reliability is only a piece of the puzzle of trustworthy computing concepts taken up in this book. In this chapter we describe reliability models based on the time domain only; the effort (time-independent) domain is studied later. Software reliability models do not employ the same statistical techniques as the hardware reliability models described earlier. Simplicity and practicality are two key factors in bridging the gap between the state of the art and that of applied software reliability modeling. The assumptions must be realistic and testable as well as applicable, accurate, and valid from a predictive viewpoint. One should perform goodness-of-fit tests to assess how reasonably a model fits the data given. Examples are provided to help readers understand the comparisons. A software reliability model is essentially a mathematical–statistical technique used to model an engineering phenomenon: specifically, to obtain a quantitative measure of reliability such as the expected number of failures within a given or residual time interval, the failure intensity during operation, or the mean time between failures. These models are not simply a cookbook approach but require academic expertise in statistics and mathematics [33]. Some of these modeling concepts lie outside the general discipline of computer science and therefore cannot easily be appreciated or interpreted by software developers. By reliability in software we mean the probability that the software will fulfill its intended function without failure(s) in a specified time interval. This definition is no different from that of its hardware counterpart when hardware is replaced by software. However, as the reader will observe, software models are considerably different from hardware models. Again, there have been many books and thousands of journal and conference proceedings papers on this broad topic since the inception of software reliability science and engineering in the early 1970s [34–40]. In a pioneering software reliability study in 1967, Hudson modeled software development as a Markovian birth (fault generation during the design or debugging stages) and death (failures resulting from the triggering of faults) process with transition probabilities from one state to another [41]. He showed that the number of faults detected, which increased with time, displayed a binomial distribution whose mean value function had the form of a Weibull p.d.f. Other studies followed with the advances in software that took place in the late 1970s. Leading software reliability taxonomists have broken down the multiplicity of research papers into several general areas [31,34]:

1. Time-between-failures models. Some of the earliest examples of research on times between failures are those of Jelinski–Moranda (1972, 1975), Shooman (1972), and Schick and Wolverton (1978) [42–45].
2. Failure-counting models. A representative group of failure-counting models is that of nonhomogeneous Poisson processes, where predictions can be made for future epochs. The earlier leading models in this category are the popularly used model of Goel and Okumoto of 1979 [46] and the Musa–Okumoto logarithmic Poisson of 1984 [47]. Discrete versions of this type of model have been studied by Ohba [48], Duane [49], and Littlewood [50], all in 1984; Yamada et al. in 1986 [51]; Knafl and Sacks in 1991 [52]; and Sahinoglu, through his CPSRM (compound Poisson software reliability modeling) techniques [12–16], and Zhao and Xie [53], both in 1992. Musa's basic execution model of 1975 is also in this category [54].

3. Bayesian models. This type of model, a Bayesian estimation technique for models already studied, uses a prior distribution to represent the view from past behavior, and a posterior distribution to integrate current data with past judgment. By way of posterior distributions, after deciding on the choice of loss function and minimizing the expected loss, estimates of the unknown parameter are substituted in the reliability or hazard functions. For example, for a squared-error loss function, the best estimate is the mean of the posterior distribution; for an absolute-value loss function, the median of the posterior is the best estimate. However, if the empirical Bayesian approach is used to derive more appropriate models, they can be classified as another modeling technique. The most popular model in this category is the Littlewood–Verrall (1973) empirical Bayes model [55], which Mazzuchi and Soyer (1988) later modified using Bayesian principles [56]. There are many other papers on Bayesian treatment of the Jelinski–Moranda model: for example, those of Langberg and Singpurwalla [57] and Jewell [58], both in 1985. These models are difficult to apply without parameter estimation solutions.

4. Static (nondynamic) models. These models, which include complexity measures, failure injection, and fault seeding, do not deal with time. One of the first models was that of Nelson in 1978 [59]. An excellent must-read review paper for all interested beginners is that of Ramamoorthy and Bastani, published in 1982 [60]. Bastani and Ramamoorthy later (1986) emphasized correctness estimation of software failures rather than time-dependent probability [61]. The latter publication describes a detailed study of correctness probability, which is estimated using a type of continuity assumption. Also discussed is a fuzzy set–based input domain model that is focused on developing more theoretical models. The earlier model of Nelson [59] was a special case of an input domain–based model, extended by Munson and Khoshgoftaar in 1991 [62]; Hamlet [63] and Scott et al. [64], both in 1987; and Weiss and Weyuker in 1988 [65], in the area of software fault tolerance, a subject also studied by Littlewood and Miller in 1989 [66] and by Butler and Finelli in 1993 [67]. Software fault trees used as a conventional reliability engineering method were studied by Stalhane in 1989 [68], and Wohlin and Korner (1990) proposed a fault-spreading model [69]. The original seeding model discussed by Mills [70] has never been formally published other than as an IBM report, although Huang in 1984 [71], Duran and Wiorkowski in 1981 [72], and Schick and Wolverton in 1978 [45] have written on the topic.
5. Others. This group combines all the other topics, such as papers on the release time of software after testing by Xie in 1991 [30] and by Sahinoglu in 1995 and 2003, to name a few [13,15]. Model comparison papers have also been published, such as those by Keiller and Miller in 1991 [73], Khoshgoftaar and Woodcock in 1991 [74], and Lyu and Nikora in 1991 [75], in addition to Bendell and Mellor in 1986 [76] and Littlewood in 1987 [77]. A complete stochastic treatment that compared the predictive accuracy of competing reliability models using Bayesian principles was published by Sahinoglu in 2001 [16].

1.6.1 Software Reliability Models in the Time Domain

Next we study time-domain (not effort-based) models, in which time is either continuous (nonstop) or discrete (in distinct time units, such as days, weeks, or years). The basic goal is to model past failure data to predict behavior in the future (i.e., reliability projection) before software is released to the customer at the end of the development cycle. Reliability models are also useful to model failure patterns and provide input to maintain software before faults (defects) are triggered, causing failures. The data consist of failures per time period, meaning the number of failures discovered in a time period, or times between failures, denoting the calendar or CPU time actually observed between software failures. We take up non-time- or effort-based models in Chapter 2, where efforts are made at equal intervals (e.g., days or weeks or months) or simply effort by effort, where the effort can be a test case or any input in a calendar time period. This approach can be likened to a time domain if efforts are made at equal intervals. Any model used for prediction has to be tested for goodness of fit. In this book we do not distinguish between failures and faults, but recorded failures are actually triggered faults inherent in the software.

There is another classification that may be used with respect to the type of statistical distribution that underlies the finite failure count within a given period. We consider the Poisson process over time for the countable finite quantity of failures, the binomial model, or other types. In the Poisson model we have a Poisson process over time where the total number of failures is not known in advance. Poisson-type models assume that the numbers of failures detected within distinct time intervals are independent, with separate means: (1) with the same rate of failure, the homogeneous Poisson process (HPP); (2) with a varying rate of failure, the nonhomogeneous Poisson process (NHPP); or (3) with failures occurring in sizes or clusters rather than under the conventional assumption of a single failure at a time, the compound Poisson process, which goes beyond the HPP and NHPP. Binomial-type models are based on similar assumptions: a binomial setting in which (1) a software defect is removed whenever a failure occurs, (2) there is a known quantity of embedded defects or faults, independent of one another, in the program in advance, and (3) the hazard rates are identical for all defects. Models that differ from these two types of count processes we call "other types."
1.6.2 Classification of Reliability Growth Models

Again, for failure distribution over time, whether the distribution is negative exponential, Weibull, or other, the models differ from one another. Let's first study the negative exponential class of failure time models in Poisson, binomial, and other types. In software reliability we employ the mean value function, μ(t) = E[M(t)], to represent the expectation of failures with respect to time, where M(t) is a random process that denotes the number of failures observed until time t. On the other hand, the failure intensity function, λ(t) = μ′(t), is the first derivative of μ(t) with respect to (w.r.t.) time; λ(t) denotes the instantaneous rate of change of the expectation of failures w.r.t. time t. Note that the hazard rate h(t) = f(t)[R(t)]^{−1} is the conditional failure density given that there were no failures up to time t. Equations (20) and (25) to (29) showed these facts.

Negative Exponential Class of Failure Times  In this class, the failure intensity λ(t) is in the form of a negative exponential. The binomial types for this class have a constant hazard rate h(t) = c and λ(t) = Nc exp(−ct). The Poisson types in this class also have a constant hazard rate of h(t) = c, but with a negative exponential time to failure f(t) = c exp(−ct). However, the number of failures that occur over a given period of time for either an HPP or an NHPP is Poisson. Next, let's look at models contained in this class.

Jelinski–Moranda (J-M) De-eutrophication Model (Binomial Type)  A very early model proposed in 1972 by Jelinski and Moranda is the J-M time-between-failures (i.e., negative exponential) model [42]. The model assumes N faults (or potential failures) triggered randomly with equal probability. One also assumes that the failure fix ("as good as new") time is negligible, which leads to the software's improvement by the same amount at each fix. Now the hazard function during the time x_i = t_i − t_{i−1} between the (i − 1)st and ith failures is given by

h(x_i) = φ[N − i + 1]    (198)

where N is the total count of software faults at the very beginning, with φ a proportionality constant. The hazard function remains constant between failures but decreases in steps of φ after the removal of each fault, a fact that results in the improvement of the time between failures. Now, let's study the mathematical–statistical model in which the x_i = t_i − t_{i−1} are i.i.d. with a negative exponential p.d.f. with mean θ = [φ(N − i + 1)]^{−1}:

f(x_i) = (1/θ) exp(−x_i/θ) is the p.d.f. of interarrival times.
μ(t) = N[1 − exp(−φt)] is the finite mean value function, since lim_{t→∞} μ(t) = N.
λ(t) = Nφ exp(−φt) is the failure intensity function.

For the model above, the estimates of the parameters and reliability prediction are given by

Σ_{i=1}^{n} 1/(N̂ − i + 1) = n / [N̂ − Σ_{i=1}^{n} (i − 1)x_i / Σ_{i=1}^{n} x_i]    (199)

φ̂ = n / [N̂ Σ_{i=1}^{n} x_i − Σ_{i=1}^{n} (i − 1)x_i]    (200)

First, N̂ is estimated from the first nonlinear equation (199) and is then substituted into the second equation (200) to estimate φ̂. Then, after n = i − 1 faults have been observed, the estimate of the MTBF for the (n + 1)st fault is {ẑ(t)}^{−1} = 1/[φ̂(N̂ − n)].
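As a concrete illustration of this two-step procedure, here is a minimal numerical sketch; the function name jm_mle, the SciPy root finder, and the bracketing interval are illustrative assumptions, not the book's CD-ROM implementation.

```python
# Minimal sketch: Jelinski-Moranda MLEs from equations (199)-(200),
# given interfailure times x_1..x_n.
import numpy as np
from scipy.optimize import brentq

def jm_mle(x):
    x = np.asarray(x, dtype=float)
    n = len(x)
    s0 = x.sum()                        # sum of x_i
    s1 = (np.arange(n) * x).sum()       # sum of (i - 1) x_i, i = 1..n

    def g(N):
        # g(N) = 0 restates equation (199); denominators are N - i + 1.
        return (1.0 / (N - np.arange(n))).sum() - n / (N - s1 / s0)

    # A finite root exists only when the data exhibit reliability growth;
    # the bracket below is a heuristic search interval.
    N_hat = brentq(g, (n - 1) + 1e-6, 100.0 * n)
    phi_hat = n / (N_hat * s0 - s1)     # equation (200)
    return N_hat, phi_hat
```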
Shooman's safeguard reliability model (1972) is very similar to the J-M model [37]. These pioneering models have inspired others in sequence and have since been replaced by more modern methods.

Moranda's Geometric Model (Poisson Type)  The geometric model proposed by Moranda is a variation of the original J-M model [43]. The interarrival time for failures is also negative exponential, f(x_i) = Dφ^{i−1} exp(−Dφ^{i−1} x_i), whose mean decreases with respect to a geometric trend [i.e., h(t) = Dφ^{i−1}, i = 1, 2, ..., n; 0 < φ < 1 at the (i − 1)st failure]. The expected time between failures is E(X_i) = h^{−1}(t_{i−1}). The hazard rate decreases in a geometric progression as each failure occurs, and the functional form of the failure intensity (in terms of the expected number of failures) is geometric. The mean value and failure intensity functions, where β = −ln φ, 0 < φ < 1, in an infinite failure model [lim_{t→∞} μ(t) = ∞], are

μ(t) = (1/β) ln{[Dβ exp(β)]t + 1}    (201)

λ(t) = D exp(β) / {[Dβ exp(β)]t + 1}    (202)
To estimate the parameters, we take the natural logarithm of the likelihood function ∏_{i=1}^{n} f(X_i) and the partial derivatives with respect to φ and D. The maximum likelihood estimators (MLEs) are then solutions of the following pair:

D̂ = n / Σ_{i=1}^{n} φ̂^i X_i   and   Σ_{i=1}^{n} i φ̂^i X_i / Σ_{i=1}^{n} φ̂^i X_i = (n + 1)/2    (203)
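The pair (203) can be solved in the same spirit as for the J-M model. The sketch below (function name and bracket are illustrative) finds φ̂ by root search on the second relation and then computes D̂ from the first; note that a root in (0, 1) need not exist for every data set.

```python
# Minimal sketch: solving the Moranda geometric MLE pair (203),
# where X holds the interfailure times X_1..X_n.
import numpy as np
from scipy.optimize import brentq

def moranda_mle(X):
    X = np.asarray(X, dtype=float)
    n = len(X)
    i = np.arange(1, n + 1)

    def g(phi):
        # Second relation of (203): weighted mean index equals (n+1)/2.
        w = phi ** i * X
        return (i * w).sum() / w.sum() - (n + 1) / 2.0

    phi_hat = brentq(g, 1e-9, 1.0 - 1e-9)    # heuristic bracket
    D_hat = n / ((phi_hat ** i) * X).sum()   # first relation of (203)
    return D_hat, phi_hat
```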
Using these MLEs and their invariance property, the MLEs of the failure intensity and mean value function can be estimated by inserting them in the equations for μ(t) and λ(t).

Goel–Okumoto Nonhomogeneous Poisson Process (Poisson Type)  This Poisson-type model was proposed by Goel and Okumoto in 1979 using the number of failures observed per unit time in groups [46] (see Goel's 1985 paper [78] for a well-done overview). They suggested that the cumulative count of failures N(t) observed at time t can be modeled as an NHPP, a Poisson process with a time-varying failure rate that follows a negative exponential distribution:

P[N(t) = y] = {[μ(t)]^y / y!} e^{−μ(t)},   y = 0, 1, 2, ...    (204)

where

μ(t) = N(1 − e^{−bt})    (205)
is the mean value function for b, the per-fault detection rate. N, the expected number of faults, is not known (hence the model is not of binomial type) and has to be estimated. The failure intensity function,

λ(t) = μ′(t) = Nbe^{−bt}    (206)

is strictly decreasing for t > 0. It is not difficult to see that μ(t) and λ(t) are N times the cumulative distribution function, F(t), and the probability density function, f(t), of the negative exponential, respectively. The MLEs of N and b can be estimated as solutions of the following pair of equations, where f_i is the number of failures observed in the ith period ending at time t_i:

N̂ = Σ_{i=1}^{n} f_i / (1 − e^{−b̂t_n})   and   [t_n e^{−b̂t_n} / (1 − e^{−b̂t_n})] Σ_{i=1}^{n} f_i = Σ_{i=1}^{n} f_i (t_i e^{−b̂t_i} − t_{i−1} e^{−b̂t_{i−1}}) / (e^{−b̂t_{i−1}} − e^{−b̂t_i})    (207)
The second equation is solved for b̂ by numerical (nonlinear) techniques. Then it is substituted into the first equation to calculate N̂. One can then substitute these MLEs to find others, such as

μ̂(t) = N̂(1 − e^{−b̂t})   and   λ̂(t) = μ̂′(t) = N̂b̂e^{−b̂t}    (208)

and hence the estimated expected number of faults to be detected in the (n + 1)st observation period is given by

N̂(e^{−b̂t_n} − e^{−b̂t_{n+1}})    (209)
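For grouped data, a sketch of the numerical solution of (207) follows; the interval counts f_i, the bracketing interval, and the function name go_mle are illustrative assumptions rather than a prescribed implementation.

```python
# Minimal sketch: Goel-Okumoto MLEs from grouped data via equation (207).
# f[i] is the failure count in the ith interval ending at t[i]; t_0 = 0
# is prepended internally.
import numpy as np
from scipy.optimize import brentq

def go_mle(f, t):
    f = np.asarray(f, dtype=float)
    t = np.concatenate(([0.0], np.asarray(t, dtype=float)))
    F, tn = f.sum(), t[-1]

    def g(b):
        # g(b) = 0 restates the second (nonlinear) equation of (207).
        e = np.exp(-b * t)                       # e[i] = exp(-b t_i)
        lhs = F * tn * e[-1] / (1.0 - e[-1])
        rhs = (f * (t[1:] * e[1:] - t[:-1] * e[:-1]) / (e[:-1] - e[1:])).sum()
        return lhs - rhs

    b_hat = brentq(g, 1e-8, 10.0 / tn)           # heuristic bracket
    N_hat = F / (1.0 - np.exp(-b_hat * tn))      # first equation of (207)
    return N_hat, b_hat
```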
Okumoto and Goel also determined an optimal release time (ORT) of observation for a software product if the reliability desired is R for a specified operational period of T_o [79]:

ORT = (1/b) {ln[a(1 − e^{−bT_o})] − ln ln(1/R)}    (210)

where a denotes the expected total number of faults [N in equation (205)].
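A quick hedged illustration of equation (210): with fitted parameters a and b and a target reliability R over the operational period T_o, the rule is a one-liner. The numbers in the comment (a = 100, b = 0.02, T_o = 10, R = 0.95) are assumed values for illustration only.

```python
# Minimal sketch of the release-time rule (210).
import math

def optimal_release_time(a, b, To, R):
    return (1.0 / b) * (math.log(a * (1.0 - math.exp(-b * To)))
                        - math.log(math.log(1.0 / R)))

# e.g., a = 100 faults, b = 0.02/hour, To = 10 hours, R = 0.95:
# optimal_release_time(100, 0.02, 10, 0.95) is about 293 hours of testing.
```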
Earlier, Schneidewind (1975) adopted the same model by assuming that each time period T during which the software is observed is of the same length [80]. That is, t_i = iT, i = 1, 2, ..., n for some constant T > 0, and N = α/β, where α and β are Schneidewind's model parameters, with mean value function μ(t) = (α/β)(1 − e^{−βi}) for his ith time period. Therefore, Schneidewind's model defaults to Goel–Okumoto's NHPP and does not need repetitious coverage of the same principles [81]. There is a trend toward diminishing defect rates or failures under the negative exponential assumption. However, in real life there have been cases where the failure rate first increases (due to adding code, etc.) and then decreases (due to fixes), or sometimes cruises at a constant rate (adding code and, at the same
time, an equal effect of fixes). Goel's 1985 paper generalized the Goel–Okumoto NHPP model using a three-parameter Weibull model [78]:

μ(t) = N(1 − e^{−bt^d})    (211)

λ(t) = μ′(t) = Nbd e^{−bt^d} t^{d−1}    (212)
where the shape parameter d = 1 gives a negative exponential with a constant hazard rate, and d = 2 gives the Rayleigh model. The shape parameter d < 1 denotes infancy, d = 1 denotes useful life, and d > 1 denotes the wear-out period in the traditional bathtub curve of the hazard function for most electronic components.

Example 1. For a generalized Goel–Okumoto NHPP model, given the data N (the number of failures expected at the end of mission time) = 100 and b (the fault detection rate per fault) = 0.02, calculate the mean value of failures at 80 hours. Take d = 1 and d = 2, respectively.

For d = 1:  μ(t) = N(1 − e^{−bt}) = 100(1 − e^{−0.02(80)}) = 100(1 − e^{−1.6}) = 79.8 failures    (213)

For d = 2:  μ(t) = N(1 − e^{−bt²}) = 100(1 − e^{−0.02(80)(80)}) ≈ 100 failures    (214)

Musa's Basic Execution Time Model (Poisson Type)  John D. Musa's model was one of the earliest to use the actual central processing unit (CPU) execution time rather than the clock or calendar time, which is actually irrelevant to the operating stress of the software environment [54]. The fundamental assumptions are:

1. The cumulative number of failures, M(t), follows a nonhomogeneous Poisson process whose probability distribution functions vary with time, with mean value function μ(t) = β₀(1 − e^{−β₁t}). It is a finite failure model: lim_{t→∞} μ(t) = β₀.
2. The interfailure times are piecewise negative exponentially distributed, implying that the hazard rate for a single fault is a constant:

λ(t) = μ′(t) = β₀β₁e^{−β₁t}    (215)

The conditional reliability and hazard functions after i − 1 failures have occurred are

R(t | t_{i−1}) = exp{−[β₀ exp(−β₁t_{i−1})][1 − exp(−β₁t)]}    (216)

h(t | t_{i−1}) = β₀β₁ exp(−β₁t_{i−1}) exp(−β₁t)    (217)
Assume that n failures have occurred, that t_n is the last failure time, and that t_n + x is the stopping time. The MLEs of β₀ and β₁, which possess the invariance
property needed to estimate other functions, such as reliability, hazard, and failure intensity, are given by

β̂₀ = n / {1 − exp[−β̂₁(t_n + x)]}    (218)

n/β̂₁ − n(t_n + x) / {exp[β̂₁(t_n + x)] − 1} − Σ_{i=1}^{n} t_i = 0    (219)
Example 2. Let us consider a software program with an initial failure density of 10 failures/hour and 100 total failures to be experienced in infinite time. Determine the failure intensity, λ(t), and the number of failures predicted, μ(t), at t = 10 and 100 hours. Use the basic execution model.

SOLUTION At t = 10, λ(t) = λ₀ exp[−(λ₀/ν₀)t] = 10 exp[−(10/100)(10)] = 10 exp(−1) = 3.68 failures/CPU hour. Note that λ₀ = β₀β₁ = 10 and β₁ = 0.1, with the total number of failures β₀ = 100. μ(t) = β₀(1 − e^{−β₁t}) = 100(1 − e^{−0.1(10)}) = 100(1 − e^{−1}) = 100(1 − 0.368) = 63 failures.

At t = 100, λ(t) = λ₀ exp[−(λ₀/ν₀)t] = 10 exp[−(10/100)(100)] = 10 exp(−10) = 0.454 × 10^{−3} failures/CPU hour. μ(t) = β₀(1 − e^{−β₁t}) = 100(1 − e^{−0.1(100)}) = 100(1 − e^{−10}) ≈ 100 failures.

Musa–Okumoto Logarithmic Poisson Execution Time Model (Poisson Type)  This model is similar to the G-O NHPP model in that the number of failures experienced by a certain time t, M(t), also follows a nonhomogeneous Poisson process, here with a negative exponentially decreasing intensity function λ(t) = λ₀ exp[−θμ(t)], where μ(t) = (1/θ) ln(λ₀θt + 1) is the mean value function, θ > 0 is the failure decay parameter (the rate of reduction in the normalized failure intensity per failure), and λ₀ is the initial failure rate [47]. Hence, when μ(t) is substituted, we obtain λ(t) = λ₀/(λ₀θt + 1). Since lim_{t→∞} μ(t) = ∞, this is an infinite failure model, in contrast to the basic execution model's finite behavior. The rate of decrease reflects the fact that earlier fixes of the failures detected reduce the failure rate in the later part of testing, thus requiring fewer fixes by the end. The difference from the G-O NHPP is its mean value function: the model is called logarithmic Poisson because the number of failures expected over time is a logarithmic function. The logarithmic Poisson process is thought to be superior for highly nonuniform distributions. The data needed are actual failure times, t_i, i = 1, 2, ..., or interfailure times, x_i = t_i − t_{i−1}. If we let β₀ = θ^{−1} and β₁ = λ₀θ (which is the same as λ₀ = β₀β₁ in the basic execution model), the failure intensity and mean value functions become

λ(t) = λ₀/(λ₀θt + 1) = β₀β₁/(β₁t + 1)    (220)

μ(t) = β₀ ln(β₁t + 1)    (221)
The conditional reliability and hazard rate functions at time t after the (i − 1)st failure are

R(t | t_{i−1}) = [(β₁t_{i−1} + 1) / (β₁(t_{i−1} + t) + 1)]^{β₀}    (222)

h(t | t_{i−1}) = β₀β₁ / [β₁(t_{i−1} + t) + 1]    (223)
Note that "|" denotes "given that" or "conditional upon." Use the reparametrized model to find the MLEs from the failure intensity and mean value functions:

β̂₀ = n / ln(1 + β̂₁t_n)    (224)

(1/β̂₁) Σ_{i=1}^{n} 1/(1 + β̂₁t_i) = nt_n / [(1 + β̂₁t_n) ln(1 + β̂₁t_n)]    (225)
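Equations (224)-(225) can be solved in the same way as for the basic execution model; the sketch below (illustrative names and bracket) finds β̂₁ by root search and then β̂₀ by substitution.

```python
# Minimal sketch: Musa-Okumoto MLEs via (224)-(225), given failure
# times t_1..t_n.
import numpy as np
from scipy.optimize import brentq

def musa_okumoto_mle(t):
    t = np.asarray(t, dtype=float)
    n, tn = len(t), t[-1]

    def g(b1):
        # g(b1) = 0 restates equation (225).
        lhs = (1.0 / (b1 * (1.0 + b1 * t))).sum()
        rhs = n * tn / ((1.0 + b1 * tn) * np.log1p(b1 * tn))
        return lhs - rhs

    b1_hat = brentq(g, 1e-10, 1e3 / tn)    # heuristic bracket
    b0_hat = n / np.log1p(b1_hat * tn)     # equation (224)
    return b0_hat, b1_hat
```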
As in the basic execution model, these MLEs, calculated using their invariance property, can be substituted in the failure intensity and mean value functions to estimate λ̂(t) and μ̂(t).

Example 3. Let us consider a software program with an initial failure density of 10 failures/hour and 100 total failures to be experienced in infinite time. Find the failure intensity, λ̂(t), and the number of failures predicted, μ̂(t), at t = 10 and 100 execution hours. Also, θ = 0.02 (two defects per 100 hours will decrease with time). Use the logarithmic Poisson model.

SOLUTION At t = 10 hours,

λ(t = 10) = λ₀/(λ₀θt + 1) = 10/[(10)(0.02)(10) + 1] = 3.33 failures/CPU hour

Note that λ₀ = β₀β₁ = 10 and β₁ = 10/50 = 0.2, since β₀ = θ^{−1} = 50. Then

λ(t = 10) = β₀β₁/(β₁t + 1) = (50)(0.2)/[(0.2)(10) + 1] = 10/3 = 3.33 failures/CPU hour

Also, μ(t = 10) = θ^{−1} ln(λ₀θt + 1) = 50 ln[(10)(0.02)(10) + 1] = 50 ln(3) = 55 failures, or μ(t = 10) = β₀ ln(β₁t + 1) = 50 ln[(0.2)(10) + 1] = 50 ln(3) = 55 failures. At t = 100 hours,

λ(t = 100) = λ₀/(λ₀θt + 1) = 10/[(10)(0.02)(100) + 1] = 10/21 = 0.476 failure/CPU hour
Also, λ₀ = β₀β₁ = 10 and β₁ = 10/50 = 0.2, since β₀ = θ^{−1} = 50. Then

λ(t = 100) = β₀β₁/(β₁t + 1) = (50)(0.2)/[(0.2)(100) + 1] = 10/21 = 0.476 failure/CPU hour
Also, μ(t = 100) = θ^{−1} ln(λ₀θt + 1) = 50 ln[(10)(0.02)(100) + 1] = 50 ln(21) = 152 failures, or μ(t = 100) = β₀ ln(β₁t + 1) = 50 ln[(0.2)(100) + 1] = 50 ln(21) = 152 failures.

Littlewood–Verrall Bayesian Model  This model is the result of a Bayesian approach by Littlewood and Verrall (1973) in which they regarded software reliability measures as representing the strength of belief that a program is operating successfully [55]. This opposed the classical view taken by the majority of models, in which reliability is a measure of goodness or success in a given number of random trials. Whereas earlier models took the hazard rate to be a function of the number of defects remaining, the L-V model assumed that it was a random variable, reflecting the uncertainty in the effectiveness of the fault correction or failure prevention process. Therefore, even though the failure time distributions are negative exponential (assumed in earlier classical models to behave with a certain failure rate), that rate is a random variable under the principles of Bayesian prior and posterior analysis. The distribution of this random failure rate, powered by a gamma prior, is also a gamma posterior distribution. An identical Bayesian approach was adopted independently by Sahinoglu for the failure and repair rates of power generators [82] and was used in later research [11,21] to estimate their FOR (forced outage rate) in the estimation of the electric power system reliability index, LOLE (loss of load expected). Littlewood's differential fault model (1981), a variant of the original L-V model that uses the hazard rate as a random variable in a Bayesian framework [83], was a binomial model using a Pareto class of interfailure time distributions. However, the reliability growth is modeled as a process with two mechanisms, fault detection and fault correction, similar to some earlier models that adopted the same approach of differing stages. Later, Keiller et al. (1983) proposed a variation very similar to the initial model, using the same randomness of the hazard rate but employing a different parameter (the shape parameter, α, rather than the scale parameter, ξ) of the prior distribution to explain the effect of change on reliability [84]. Although their model used a negative exponential class of failure time distributions, it was neither of Poisson nor binomial type, but "other." There are many other Bayesian approaches, such as Liu's Bayesian geometric model and Thompson and Chelson's Bayesian model, to name but two [85,86].

Formulation of the L-V model can be summarized as follows. The sequential failure times are assumed to be independent exponential random variables with parameter λ:

f(x_i) = λ exp(−λx_i),   i = 1, 2, ..., n,   λ > 0,   x_i > 0    (226)
Now let the software failure rate λ have a prior distribution from the gamma family with shape parameter α:

θ₁(λ) = [ξ^α/Γ(α)] λ^{α−1} exp(−λξ),   λ > 0    (227)

The joint distribution of data and prior, assuming that all shape and scale parameters are identical, is given by

k(x, λ) = f(x₁, x₂, ..., x_n; λ) = [ξ^α/Γ(α)] λ^{n+α−1} exp[−λ(x_T + ξ)]    (228)

where n is the number of occurrences and x_T = Σ_{i=1}^{n} x_i represents the total sampled failure time for n occurrences. Thus, the posterior distribution for λ can be derived as

h(λ | x) = k(x, λ) / ∫ k(x, λ) dλ = [(x_T + ξ)^{n+α}/Γ(n + α)] λ^{n+α−1} exp[−λ(x_T + ξ)]    (229)

which is Gamma[n + α, (x_T + ξ)^{−1}]. For h(λ | x_i) ∼ Gamma[α + 1, (x_i + ξ_i)^{−1}], the posterior mean under a quadratic loss function is E(λ) = (α + 1)/(x_i + ξ_i). Recall that x denotes the vector of the x_i. Then the marginal distribution of the random variable x_i > 0, i = 1, 2, ..., n, given the gamma prior, can be derived as

f(x_i | α, ξ_i) = αξ_i^α / (x_i + ξ_i)^{α+1}    (230)

which is a Pareto distribution with joint density

f(x₁, x₂, ..., x_n) = α^n ∏_{i=1}^{n} ξ_i^α / ∏_{i=1}^{n} (x_i + ξ_i)^{α+1}    (231)
For model and reliability estimation, if one assumes that ξ_i = β₀ + β₁i (the linear form) or ξ_i = β₀ + β₁i² (the quadratic form), then by using the foregoing marginal distribution for the x_i's, we calculate the MLEs for α, β₀, and β₁ as solutions to the following system of equations (in the quadratic form, i is replaced by i² in ξ_i and in the numerators of equation (234)):

n/α̂ + Σ_{i=1}^{n} ln ξ̂_i − Σ_{i=1}^{n} ln(x_i + ξ̂_i) = 0    (232)

α̂ Σ_{i=1}^{n} 1/ξ̂_i − (α̂ + 1) Σ_{i=1}^{n} 1/(x_i + ξ̂_i) = 0    (233)

α̂ Σ_{i=1}^{n} i/ξ̂_i − (α̂ + 1) Σ_{i=1}^{n} i/(x_i + ξ̂_i) = 0    (234)
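The three-equation system (232)-(234) has no closed form. A minimal sketch using scipy.optimize.fsolve for the linear form ξ_i = β₀ + β₁i is shown below; the starting values are illustrative guesses only.

```python
# Minimal sketch: solving the L-V system (232)-(234), linear form.
import numpy as np
from scipy.optimize import fsolve

def lv_mle(x, alpha0=2.0, b00=1.0, b10=1.0):
    x = np.asarray(x, dtype=float)
    i = np.arange(1, len(x) + 1)

    def equations(params):
        alpha, b0, b1 = params
        xi = b0 + b1 * i                  # xi_i = beta0 + beta1 * i
        eq232 = len(x) / alpha + np.log(xi).sum() - np.log(x + xi).sum()
        eq233 = alpha * (1.0 / xi).sum() - (alpha + 1) * (1.0 / (x + xi)).sum()
        eq234 = alpha * (i / xi).sum() - (alpha + 1) * (i / (x + xi)).sum()
        return [eq232, eq233, eq234]

    return fsolve(equations, [alpha0, b00, b10])
```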
Using a uniform prior U(a, b) for the shape parameter α, Littlewood and Verrall derived the marginal distribution of the x_i's as a function of β₀ and β₁ only. Once the three unknowns α, β₀, and β₁ are estimated, the linear intensity function is, for example,

λ_linear(t) = (α − 1) / √(β₀² + 2β₁t(α − 1))    (235)

A final procedure is to compute least-squares estimates using the fact that for a Pareto p.d.f., E(X_i) = ξ_i/(α − 1). Once the parameters are estimated, reliability measures such as the reliability and failure intensity functions can be estimated. Additionally, the mean time to failure for the ith failure can be estimated as E(X_i) = MTTF = ξ_i/(α − 1), where ξ_i contains the linear or quadratic term in i. A later paper by Mazzuchi and Soyer (1988) suggested that α, β₀, and β₁ are all random variables with selected priors to estimate these unknown parameters [56]. Musa and Okumoto in 1984 proposed that ξ_i be a function related inversely to the number of failures remaining, inspired by an efficient debugging process: ξ_i = N(α + 1)/[λ₀(N − i)], where N is the number of defects expected as time lengthens, λ₀ is the initial failure intensity, i is the failure index, and α is the shape parameter of the gamma prior for the rate λ [87]. It shows that the scale parameter increases as the number of remaining failures, N − i, decreases with increasing i.

Sahinoglu's Poisson∧Geometric and Poisson∧Logarithmic Series Models  A generalized compound Poisson process model is proposed for estimation of the residual count of software failures in references 12 to 16. It is observed that conventional nonhomogeneous Poisson process models do not allow for the possibility of multiple counts, and the compound Poisson model is superior when clumping of failures exists at any given epoch [31]. Specifically, a model called Poisson∧geometric (or stuttering Poisson) is studied, in which the underlying failure process is assumed to be Poisson while a geometrically distributed number of failures may be detected at each failure epoch. The model proposed is validated using a few of Musa's data sets. Further, the Poisson∧logarithmic series model (equivalent to the negative binomial given certain assumptions) is studied similarly, where the compounding p.d.f. is logarithmic whereas the counting process is the same as before, NHPP [12,15]. The CD-ROM contains both programs used to calculate the reliability and failure functions. The results from these programs can easily be used to obtain compound Poisson plots [88].

GENERALIZATIONS OF THE POISSON MODEL  The Poisson theorem asserts that a counting process is a Poisson process if the jumps in all intervals of the same length are identically distributed and independent of past jumps (an assumption of stationary and independent increments), and the events occur singly at each epoch (an assumption of orderliness) [89]. Failure interarrival times may be
negative-exponentially distributed, but this is not sufficient to prove that the counting process is Poisson [90, p. 434]. Let us observe two generalizations (sometimes called degenerations) of the Poisson process [91,92]. The first is the well-known NHPP, obtained by dropping the "stationary increments" property in the Poisson theorem and replacing it with the "time-dependent increments" property, where the Poisson failure arrival rate β varies with time t (e.g., in software testing or unexpected ambulance calls on an ordinary day). The second is the less popularly known compound Poisson process (CPP), obtained if the orderliness property is dropped from the conventional Poisson theorem and replaced with that of stationary jumps: Let Z_n be the size of the nth jump, where {Z_n, n = 1, 2, ...} are i.i.d. random variables. Let J(t) be the number of jumps that occur during (0, t]; then N(t) is a compound Poisson process with N(t) = Z₁ + Z₂ + ··· + Z_{J(t)}, t ≥ 0 [89]. The discrete compound Poisson p.d.f. in this section is one of two types. It may be of geometric density type, with its forgetfulness property governing the failure size (x ≥ 1) distribution, where the conventional Poisson is the special case when q (= variance/mean) = 1 [93–96]. The symbol ∧ designates that the parent Poisson distribution to the left of ∧ is compounded by the compounding distribution to the right of ∧ [12]. A similar publication by Sahinoglu on the Poisson∧geometric p.d.f. reports on a study of the limiting sum of Markov Bernoulli variables [17]. Or, if the forgetfulness property does not hold, there is a positive or negative correlation between the failures in a clump upon arrival. The author uses a logarithmic-series distribution (LSD) for jump sizes with a true-contagion property (positive correlation). The sum of LSD random variables governed by a Poisson counting process results in a Poisson∧logarithmic series, which simply defaults to a negative binomial distribution (NBD) given that a certain mathematical assumption holds [97,102].

TRUNCATED POISSON∧GEOMETRIC (STUTTERING POISSON)  A compound Poisson process with a specific compounding distribution has negative exponentially distributed failure interarrival times with rate β. This implies that the p.d.f. of negative exponential interarrival times is independent of, or not influenced by, the earlier arrival epochs; hence the forgetfulness property of the Poisson process. Suppose that each Poisson arrival dictates a positive discrete amount x of failures that are i.i.d. as {f_x}. Then the total number of failures follows a CP distribution within a fixed time interval, given by [88]

P(X) = Σ_{Y=0}^{∞} [(βt)^Y e^{−βt}/Y!] f^{*Y}(X),   X = 0, 1, 2, ...,   β > 0    (236)
where f^{*Y}(X) is the Y-fold convolution of {f_x}. For a conventional Poisson process, f_x = 1 for x = 1 and f_x = 0 otherwise, so this equation reduces to a Poisson distribution in the case of a single failure per arrival. On the other hand, the geometric distribution is given as

f_x(x) = (1 − r)r^{x−1},   x = 1, 2, 3, ...    (237)
Thus, a special case of the CP distribution is the Poisson∧geometric. The rate β of the Poisson process is the average number of arrivals per unit time, and r is the probability of finding the next independent failure in the batch or clump within each arrival. Then p = 1 − r is the probability of starting the Poisson process for the next arrival. In summary, the total count of failures X = Σx_i within time interval t has a Poisson∧geometric distribution [12], where P(X = 0 | Y = 0) = e^{−βt}, or e^{−β} if t = 1:

P(X) = Σ_{Y=1}^{X} [(βt)^Y e^{−βt}/Y!] C(X − 1, Y − 1) r^{X−Y} (1 − r)^Y,   X = 1, 2, 3, ...,   0 < r < 1,   β > 0    (238)

where C(X − 1, Y − 1) denotes the binomial coefficient.
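Because the sum in (238) is finite, the p.m.f. can be evaluated directly; the following sketch is a straight transcription of the formula (the function name is illustrative).

```python
# Minimal sketch: evaluating the Poisson^geometric p.m.f. (238).
import math

def poisson_geometric_pmf(X, beta, r, t=1.0):
    if X == 0:
        return math.exp(-beta * t)
    total = 0.0
    for Y in range(1, X + 1):
        total += ((beta * t) ** Y * math.exp(-beta * t) / math.factorial(Y)
                  * math.comb(X - 1, Y - 1) * r ** (X - Y) * (1 - r) ** Y)
    return total

# Sanity check: as r -> 0 every clump has size 1, and the p.m.f.
# approaches the ordinary Poisson(beta*t) probabilities.
```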
From equation (238), the joint distribution P[X(t), Y(t)] for unit time t = 1 is given in Table 1.5. The expected value of the compound Poisson random variable for the marginal distribution of X is obtained by multiplying the first row by 0, the second row by 1, the third row by 2, and so on, and adding the columns. Summing over the rows X = 1, 2, ... and the columns Y = 1, ..., X, we get

E(X) = β/(1 − r) = β/p    (239)

TABLE 1.5  Joint Distribution of Poisson∧Geometric X and Poisson Y (for t = 1)

Each cell is P(X, Y) = (β^Y e^{−β}/Y!) C(X − 1, Y − 1) r^{X−Y}(1 − r)^Y, the product of a Poisson p.d.f. value and a geometric p.m.f. term, with P(X = 0, Y = 0) = e^{−β}:

X = 0:  Y = 0: e^{−β}
X = 1:  Y = 1: βe^{−β}(1 − r)
X = 2:  Y = 1: βe^{−β}r(1 − r);   Y = 2: (β²e^{−β}/2!)(1 − r)²
X = 3:  Y = 1: βe^{−β}r²(1 − r);  Y = 2: (β²e^{−β}/2!)(2)r(1 − r)²;   Y = 3: (β³e^{−β}/3!)(1 − r)³
X = 4:  Y = 1: βe^{−β}r³(1 − r);  Y = 2: (β²e^{−β}/2!)(3)r²(1 − r)²;  Y = 3: (β³e^{−β}/3!)(3)r(1 − r)³
X = 5:  Y = 1: βe^{−β}r⁴(1 − r);  Y = 2: (β²e^{−β}/2!)(4)r³(1 − r)²;  Y = 3: (β³e^{−β}/3!)(6)r²(1 − r)³
...

Row-weighted totals XP(X): X = 1: βe^{−β}(1 − r); X = 2: 2βe^{−β}r(1 − r) + β²e^{−β}(1 − r)²; X = 3: 3βe^{−β}r²(1 − r) + 3β²e^{−β}r(1 − r)² + (β³e^{−β}/2)(1 − r)³; ...

Column sums of XP(X, Y): Y = 1: βe^{−β}(1 − r)(1 + 2r + 3r² + 4r³ + 5r⁴ + ···); Y = 2: β²e^{−β}(1 − r)²(1 + 3r + 6r² + 10r³ + ···); ...

By mathematical series expansion, the totals add up to E(X) = β/(1 − r) of the Poisson∧geometric.

Source: [12].
Var(X) = E(X²) − [E(X)]² = β(1 + r)/(1 − r)²    (240)

which reduces to E(X) = β (the conventional Poisson case) when r = 0. Therefore, E(X²) = β(1 + r)/p² + β²/p². Similarly, E(Y) = Var(Y) = β in the Poisson process. Note that the geometric p.m.f. is a discrete analog of the continuous negative exponential p.d.f. and has a similar nonaging Markov forgetfulness property. Consequently, for X ∼ Poisson∧geometric with compounding variable x ∼ geometric(r), the ratio of variance to mean of the compound process equals the ratio of the second to the first moment of x:

Q = variance/mean = E(x²)/E(x) = (q + q²)p^{−2} / (qp^{−1}) = (1 + q)/p = (1 + 1 − p)/p = (2 − p)p^{−1}    (241)

where q = 1 − p = r. The probabilities and moments of the zero-truncated Poisson∧geometric PG(μ, Q) are (1 − e^{−βt})^{−1} times the nontruncated ones:

μ = E(X) = βt[(1 − r)(1 − e^{−βt})]^{−1}    (242)
TRUNCATED POISSON∧LOGARITHMIC SERIES  Of the number of chance mechanisms generating the negative binomial distribution (NBD), it can also be defined as a Poisson sum of logarithmic-series-distributed (LSD) random variables [101,102]. Let X = x₁ + x₂ + x₃ + ··· + x_n, where the x_i are LSD random variables with p.m.f.

f_x = aθ^x/x,   x = 1, 2, 3, ...,   a = −1/ln(1 − θ),   0 < θ < 1    (243)
Then a randomly stopped sum of x_i ∼ LSD(θ) has an NBD(k, q) with parameters

k = β/ln q    (244)

q = 1/(1 − θ)   (or β = k ln q)    (245)
where E(x) = aθ/(1 − θ). The stopping rule N(t) is a compound Poisson process. Let x ∼ LSD be reparametrized: let θ = p/q, with q = p + 1 = (1 − θ)^{−1} and p = θ(1 − θ)^{−1}, so that a = −1/ln(1 − θ) = 1/ln q. Therefore, inserting p = θ(1 − θ)^{−1} = q − 1 and a into the expression for E(x), we get E(x) = (q − 1)/ln q. Therefore,

f_x = (x ln q)^{−1} (p/q)^x,   x = 1, 2, 3, ...,   q = p + 1 > 1,   q = (1 − θ)^{−1},   a = 1/ln q,   0 < θ < 1    (246)
Then E(X) = kp for X ∼ NBD, where k = β(ln q)^{−1}. Therefore, the probability of X failures for the decapitated (zero-truncated) NBD is

P(X) = [(k + X − 1)!/((k − 1)!X!)] · p^X/(q^{k+X} − q^X)    (247)
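A hedged sketch of evaluating (247): since k = β/ln q is generally noninteger, the factorials are computed through log-gamma functions rather than math.factorial.

```python
# Minimal sketch of the decapitated NBD p.m.f. (247).
import numpy as np
from scipy.special import gammaln

def truncated_nbd_pmf(X, k, p):
    q = p + 1.0
    # (k+X-1)! / ((k-1)! X!) = Gamma(k+X) / (Gamma(k) Gamma(X+1))
    log_coef = gammaln(k + X) - gammaln(k) - gammaln(X + 1)
    return np.exp(log_coef + X * np.log(p)) / (q ** (k + X) - q ** X)
```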
See Tables 1.6 and 1.7.

TABLE 1.6  Cross Section of Historical Failure Times of Data Sets T1(X, Y) to T5(X, Y) in CPU Seconds^a

T1: 21,700 object instructions delivered in 92 calendar days and 24.6 exec. hours of real time (X = 136, Y = 133): 130 singletons, 3 doubles
3, 33, 146, 227, 342, 351, 353, 444, 556, 571, 709, ..., 5089, 5089, ..., 11811, 12559, 12559, ..., 42188, 42996, 42996, ..., 81542, 82702, 85566, 88682 (last)

T2: 27,700 object instructions delivered in 72 calendar days and 30.2 exec. hours of real time (X = 54, Y = 52): 51 singletons, 1 triplet
191, 413, 693, 983, 1273, 1658, 2228, 2838, 3203, 3593, 3858, 4228, 5028, 6238, 6645, ..., 62361, 62361, 62361, 62661, 71682, 74201, 81091, 84339, ..., 108708 (last)

T3: 23,400 object instructions delivered in 55 calendar days and 18.7 exec. hours of real time (X = 38, Y = 37): 36 singletons, 1 double
115, 115, ..., 198, 376, 570, 706, ..., 36818, 37381, 40151, ..., 58065, 64789, 67335 (last)

T4: 33,500 object instructions delivered in 71 calendar days and 14.6 exec. hours of real time (X = 53, Y = 50): 47 singletons, 3 doubles
5, 78, 219, 710, 715, 720, 748, 886, 1364, ..., 3148, 4572, 4572, ..., 4664, ..., 12850, 13021, 13021, 13664, 14551, ..., 15885, 16489, 16489, 17263, ..., 50896, 52422 (last)

T5: 2,445,000 object instructions delivered in 173 calendar days and 1785 exec. hours of real time (X = 831, Y = 810): 794 singletons, 13 doubles, 2 triplets, 1 quintuplet
37320, ..., 2712322, 2715360, 2715360, 2715360, 2861760, ..., 3014160, 3014160, 3104338, ..., 3277760, 3291260, 3291260, 3291260, 3291260, 3291260, 3294140, 3296120, 3299960, 3337820, 3337820, 3369440, ..., 7108021, 7249520, 7249520, 7251320, ..., 7407920, 7488620, 7488620, 7493936, ..., 10712788, 10712788, 10729436, ..., 15491208, 15566088, 15566088, ..., 15573288, ..., 15837888, 15837888, 16073188, 16139088, 16139088, 16175088, ..., 16277688, 16277688, ..., 16279488, 16502688, 16502688, ..., 16750473, 16754343, 16754343, 16754343, ..., 16847943, 16847943, ..., 18855143, 18696039, 18747761, 18772252, 19608286, ..., 27080864, 27080924, 27080924, 20901716, 20901716, ..., 21120288 (last)

^a The repeated entries (underlined in the original) are the failure repetitions upon one arrival, where X = total number of failures and Y = total number of arrivals [12,103,104].
TABLE 1.7  Statistical Analysis of Data Sets T1 to T5 for Poisson∧Geometric and Poisson∧Logarithmic

Data  No. of        No. of     First Moment:  Second Moment:  p = 1 − r:  Empirical q =          q (Poisson∧Geometric):  q (NBD) from Nonlinear
Set   Failures: X   Stops: Y   E(x) = X/Y     E(x²)           p = Y/X     Var/Mean, E(x²)/E(x)   q = (2 − p)/p           Variable: (q − 1)/ln q
T1    136           133        1.0226         1.0677          0.9779      1.0441                 1.0451                  1.0219
T2    54            52         1.0385         1.1539          0.9630      1.1111                 1.0769                  1.0546
T3    38            37         1.0270         1.0841          0.9737      1.0555                 1.0541                  1.0275
T4    53            50         1.06           1.18            0.9434      1.1132                 1.12                    1.0556
T5    831           810        1.0259         1.0975          0.9747      1.0698                 1.0523                  1.0345
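The Table 1.7 columns can be reproduced from the clump sizes implied by Table 1.6; the sketch below checks the T1 row (130 singletons and 3 doubles) up to rounding.

```python
# Minimal sketch: reproducing the Table 1.7 statistics from a list of
# clump sizes (failures per arrival).
import numpy as np

def clump_statistics(sizes):
    x = np.asarray(sizes, dtype=float)
    X, Y = x.sum(), len(x)              # total failures, total arrivals
    m1 = X / Y                          # first moment E(x)
    m2 = (x ** 2).mean()                # second moment E(x^2)
    p = Y / X                           # p = 1 - r
    q_emp = m2 / m1                     # empirical q = Var/Mean
    q_pg = (2 - p) / p                  # q for the Poisson^geometric
    q_nbd = (q_pg - 1) / np.log(q_pg)   # nonlinear variable (q - 1)/ln q
    return m1, m2, p, q_emp, q_pg, q_nbd

print(clump_statistics([1] * 130 + [2] * 3))   # T1 row, up to rounding
```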
Gamma, Weibull, and Other Classes of Failure Times  In the classes in which the failure intensity λ(t) is not in the form of a negative exponential but of a gamma, Weibull, or other distribution, the number of failures that occur over a given period of time is either homogeneous or nonhomogeneous Poisson, binomial, or other. Next, we look at the models that fall in these classes.

Yamada's Delayed S and Ohba's Inflection S and Hyperexponential Models (Poisson Type)  The interfailure distribution is a gamma p.d.f., but the failure count per unit time is a Poisson type of model, not binomial. Ohba, Yamada, and joint authors proposed a software reliability growth model based on the assumption that there exist two types of defects, one of which is easier to detect [48,51]. Yamada et al. (1986) claimed that a testing process is a combination of a defect detection process and a defect isolation process. They indicated that their model is more reasonable and useful when defects can be classified into two such categories. Since the model employs more parameters than those of the simple Goel–Okumoto model, it becomes more complicated to apply this technique in practice. With this method there can be a significant delay between the time of the first failure observation and the time of reporting it. The authors proposed a delayed S-shaped reliability model where the experienced growth curve of the cumulative count of defects detected becomes S-shaped. Their model is based on an NHPP to govern the failure count process, but their mean value function (similar to the Goel–Okumoto), whose limit is finite [i.e., lim_{t→∞} μ(t) = κ (total count of defects) < ∞], reflects the delay in reporting the defects as the team members become familiar with the software, usually followed by growth and then decay as the residual defects become more difficult to discover. This results in an S-shaped curve, unlike the exponential growth of the Goel–Okumoto model:

μ(t) = κ[1 − (1 + λt)e^{−λt}]    (248)

and the inter-failure-time distribution is gamma:

f(t) = λ²te^{−λt}    (249)
where t is the time of occurrence, λ is the error detection rate, and κ is the total count of defects. Ohba later (1984) proposed another S-shaped reliability model, also an NHPP, which he called the inflection S model [48]. The new model explains a software failure detection process in which the defects detected are mutually dependent. This implies that the more failures one detects, the more easily the undetected failures become recognizable. This proposal introduces a more practical approach than that of earlier models, which invariably adopt the independence assumption of defects in a software program by default. The mean value function for this approach, on the other hand, is

μ(t) = κ(1 − e^{−λt})/(1 + ie^{−λt})    (250)

where t is the time, λ is the error detection rate, i is the inflection factor, and κ is the total count of defects, or total cumulative defect rate (count per unit time).
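To see the contrast among the three shapes, the following sketch evaluates the three mean value functions for one illustrative parameter set (κ = 100, λ = 0.05, inflection factor i = 5; these numbers are assumptions, not from the text).

```python
# Minimal sketch comparing the mean value curves of the exponential
# (G-O), delayed S (248), and inflection S (250) models.
import numpy as np

kappa, lam, i = 100.0, 0.05, 5.0
t = np.linspace(0.0, 100.0, 6)

mu_exponential  = kappa * (1.0 - np.exp(-lam * t))                      # G-O
mu_delayed_s    = kappa * (1.0 - (1.0 + lam * t) * np.exp(-lam * t))    # (248)
mu_inflection_s = kappa * (1.0 - np.exp(-lam * t)) / (1.0 + i * np.exp(-lam * t))  # (250)

# The exponential curve rises fastest at t = 0, while both S-shaped
# curves start slowly (the learning period) and catch up later.
for row in zip(t, mu_exponential, mu_delayed_s, mu_inflection_s):
    print("t=%5.1f  exp=%6.2f  delayed=%6.2f  inflection=%6.2f" % row)
```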
On the other hand, the failure intensity function can be found as λ(t) = μ′(t), the first derivative of μ(t) with respect to time. Yamada's delayed S and Ohba's inflection S models are both considered to account for the learning or training period (reflected as delayed or inflection patterns of the respective methods) that analysts experience at the infancy stage of the software testing process. The mean value function (c.d.f.) and failure intensity function (p.d.f.) curves of both models are different from those of the negative exponential model. The exponential model assumes that the peak defect arrival is at time zero (the beginning) of the test phase and decays afterward. The delayed S model assumes a slightly delayed peak, and the inflection S model assumes a sharper peak later, in a symmetrical shape.

The hyperexponential model is an extended exponential model. That is, various sections of the software experience varying rates of failure, as when differing groups of people do too many different things in different languages under different circumstances. In mathematical–statistical analysis, the sum of these varying exponential curves can best be formulated by a hyperexponential. The cumulative count of failures by time t, M(t), follows a Poisson process with a mean value function μ(t) = N Σ_{i=1}^{k} p_i[1 − exp(−β_i t)], where 0 < β_i < 1, Σ_{i=1}^{k} p_i = 1, and 0 < p_i < 1; N is finite in a finite failure model. The defect counts in each testing interval, the f_i's, are given as input data, as well as the completion times of each period of software observation, the t_i's. If k = 1, it is an NHPP. Since λ(t) = μ′(t) and Np_i is the expected number of faults for the ith class,

λ(t) = N Σ_{i=1}^{k} p_i β_i exp(−β_i t)    (251)

is strictly decreasing for t > 0. For model estimation and reliability prediction, estimate N_i as the number of defects in each class from the MLE equations presented in the NHPP model. The MLE of N is then the sum of the MLEs over the k classes. If the practice suggests that there exist only two classes,
such as new versus old or easy versus difficult, the new model is called the modified exponential software reliability growth model [51]. Also, Laprie and Kanoun designed a hyperexponential model for k = 2 with an equivalent failure rate function [105], identical to the h(t) formula in Section 1.3.3:

λ(t) = (p₁β₁e^{−β₁t} + p₂β₂e^{−β₂t}) / (p₁e^{−β₁t} + p₂e^{−β₂t})    (252)
and then they derived a system availability model to integrate the two classes: k = 1 for hardware and k = 2 for software.

Schick–Wolverton Model (Binomial Type)  Schick and Wolverton modified the J-M model by assuming that the failure rate is not only proportional to the number of remaining errors but also increases linearly with operating time t, where the hazard function is [45]

h(t) = k(N − i + 1)t    (253)

The interfailure times are assumed to follow a Weibull p.d.f., with the negative exponential being a special case. Later, they proposed a more general model [106] in which the per-fault hazard rate is parabolic instead of linear as in their original model:

h(t) = k(N − i + 1)(−b₁t² + b₂t + b₃)    (254)
Critics have pointed out that the model is no longer valid, since h(t) should decrease over time; the main reason is that the later errors are hidden and difficult to encounter in operation in modern software operations.

Duane and AMSAA Model (Poisson Type)  This model was originally proposed for hardware designs by Duane [49], who discovered that the cumulative failure rate, or cumulative hazard function H(t), versus the accumulated testing time resulted in a straight line on log-log plotting paper. It is also an NHPP; the fact that the failure intensity function has the same rate as for a Weibull p.d.f. has been used for some software systems. It is a reliability growth model later adopted by AMSAA (Army Materiel Systems Analysis Activity), which uses a relationship between cumulative test time and cumulative failures. It is an infinite failure model, since lim_{t→∞} μ(t) = ∞. The reason it is also referred to as a power model (Section 1.3.3) is that the mean value function for the cumulative number of failures by time t is taken as a power of t:

μ(t) = at^b,   a > 0, b > 0  (b = 1 implies a homogeneous Poisson process)    (255)
with a mean value function μ(t) = at^b. If we divide the right- and left-hand sides by total testing time T and take the natural log of both sides, we obtain

Y = ln[μ(T)/T] = ln(aT^b/T) = ln a + (b − 1) ln T    (256)
One plots this equation versus T on ln-ln plotting paper to get a straight line. On the other hand, λ(t) = abt^{b−1} is the failure intensity function: strictly increasing for b > 1 (no reliability growth recorded), strictly decreasing for 0 < b < 1 (reliability growth recorded), and constant for b = 1 (a homogeneous Poisson process with a constant rate). Reference 107 derived the MLEs for a and b, where t_n = T, to be

â = n/T^{b̂}   and   b̂ = n / Σ_{i=1}^{n−1} ln(T/t_i)    (257)

which, when inserted in μ̂(t) = ât^{b̂} and λ̂(t) = âb̂t^{b̂−1}, give the MLEs for their respective functions. In 1974, in the AMSAA model, reference 107 also derived the MLE MTTF = μ̂ = t_{(n)}/(nb̂) for the time to the (n + 1)st failure, and constructed confidence intervals for the MTTF reliability measure for unrepairable systems.
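A minimal sketch of the failure-terminated MLEs in (257), together with the MTTF estimate t_(n)/(nb̂); the input is assumed to be the ordered failure-time list with t_n = T, and the function name is illustrative.

```python
# Minimal sketch: Duane/AMSAA power-model MLEs via equation (257)
# for a failure-terminated test, t_1 < ... < t_n = T.
import numpy as np

def amsaa_mle(t):
    t = np.asarray(t, dtype=float)
    n, T = len(t), t[-1]
    b_hat = n / np.log(T / t[:-1]).sum()   # equation (257), sum over i = 1..n-1
    a_hat = n / T ** b_hat                 # equation (257)
    mttf_hat = T / (n * b_hat)             # estimated time to the (n+1)st failure
    return a_hat, b_hat, mttf_hat
```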
APPENDIX 1A: 500 COMPUTER-GENERATED RANDOM NUMBERS

0.6953 0.0082 0.6799 0.8898 0.6515 0.3976 0.0642 0.0377 0.5739 0.5827 0.0508 0.4757 0.6805 0.2603 0.8143 0.5681 0.1501 0.8806 0.4582 0.0785 0.1158 0.2762 0.9382 0.5102
0.5247 0.9925 0.1241 0.1514 0.5027 0.7790 0.4086 0.5250 0.5181 0.0341 0.7905 0.1399 0.9931 0.7507 0.7625 0.7854 0.9363 0.7989 0.7590 0.1467 0.6635 0.7018 0.6411 0.7021
0.1368 0.6874 0.3056 0.1826 0.9290 0.0035 0.6078 0.7774 0.0234 0.7482 0.2932 0.5668 0.4166 0.6414 0.1708 0.5016 0.3858 0.7484 0.4393 0.3880 0.4992 0.6782 0.7984 0.4353
0.9850 0.2122 0.5590 0.0004 0.5177 0.0064 0.2044 0.2390 0.7305 0.6351 0.4971 0.9569 0.1091 0.9907 0.1900 0.9403 0.3545 0.8083 0.4704 0.5274 0.9070 0.4013 0.0608 0.3398
0.7467 0.6885 0.0423 0.5259 0.3134 0.0441 0.0484 0.9121 0.0376 0.9146 0.0225 0.7255 0.7730 0.2699 0.2781 0.1078 0.5448 0.2701 0.6903 0.8723 0.2975 0.2224 0.5945 0.8038
0.3813 0.2159 0.6515 0.2425 0.9177 0.3437 0.4691 0.5345 0.5169 0.4700 0.4466 0.4650 0.0691 0.4571 0.2830 0.5255 0.0643 0.5039 0.3732 0.7517 0.5686 0.4672 0.3977 0.2260
0.5827 0.4299 0.2750 0.8421 0.2605 0.1248 0.7058 0.8178 0.5679 0.7869 0.5118 0.4084 0.9411 0.9254 0.6877 0.8727 0.3167 0.9439 0.6587 0.9905 0.8495 0.5753 0.4570 0.1250
0.7893 0.3467 0.8156 0.9248 0.6668 0.5442 0.8552 0.8443 0.5495 0.1337 0.1200 0.3701 0.3468 0.2371 0.0488 0.3815 0.6732 0.1027 0.8675 0.8904 0.1652 0.6219 0.9924 0.1884
0.7169 0.8166 0.2186 0.1033 0.2871 0.4680 0.9155 0.9518 0.1167 0.7870 0.9800 0.1857 0.5029 0.3288 0.4154 0.2526 0.7872 0.5321 0.0702 0.4219 0.0200 0.5445 0.9446 0.8064 0.0014 0.7379 0.8664 0.9553 0.8635 0.3155 0.5541 0.9833 0.6283 0.2631 0.9677 0.4597 0.2905 0.3058 0.8177 0.6660 0.2039 0.2553 0.6871 0.9255 0.8398 0.8361 0.3432 0.1192
0.7410 0.7906 0.0702 0.8158 0.2201 0.2504 0.4869 0.0455 0.8792 0.8633 0.1215 0.4164 0.0873 0.7747 0.6628 0.8991 0.4923 0.2579 0.7098 0.1395 0.3774 0.4131 0.8479 0.0126 0.4308 0.0053
0.7089 0.7589 0.9716 0.3623 0.6348 0.2878 0.2685 0.8314 0.2025 0.5818 0.8102 0.3427 0.9407 0.1285 0.8998 0.2298 0.0276 0.7933 0.7964 0.0930 0.0485 0.2006 0.7880 0.6531 0.8073 0.9209
0.2579 0.8870 0.0374 0.6614 0.0367 0.1735 0.6349 0.8189 0.9320 0.0692 0.1026 0.2809 0.8747 0.0074 0.1334 0.2603 0.6734 0.0945 0.7952 0.3189 0.7718 0.2329 0.8492 0.0378 0.4681 0.9768
0.1358 0.1189 0.0683 0.7954 0.0311 0.3872 0.9364 0.6783 0.7656 0.2543 0.9251 0.8064 0.0496 0.6252 0.2798 0.6921 0.6562 0.3192 0.8947 0.6972 0.9656 0.6182 0.6859 0.4975 0.0481 0.3584
0.8446 0.7125 0.2397 0.7516 0.0688 0.6816 0.3451 0.8086 0.3815 0.5453 0.6851 0.5855 0.4380 0.7747 0.7351 0.5573 0.4231 0.3195 0.1214 0.7291 0.2444 0.5151 0.8947 0.1133 0.2918 0.0390
0.1648 0.6324 0.7753 0.6518 0.2346 0.2731 0.4998 0.1386 0.5302 0.9955 0.1559 0.2229 0.5847 0.0112 0.7330 0.8191 0.1980 0.7772 0.8454 0.8513 0.0304 0.6300 0.6246 0.3572 0.2975 0.2161
0.3889 0.1096 0.2029 0.3638 0.3927 0.3846 0.2842 0.4442 0.8744 0.1237 0.1214 0.2805 0.4183 0.3958 0.6723 0.0384 0.6551 0.4672 0.8294 0.9256 0.1395 0.9311 0.1574 0.0071 0.0685 0.6333
0.5620 0.5155 0.1464 0.3107 0.7327 0.6621 0.0643 0.9941 0.4584 0.7535 0.2628 0.9139 0.5929 0.3285 0.6924 0.2954 0.3716 0.7070 0.5394 0.7478 0.1577 0.3837 0.4936 0.4555 0.6384 0.4391
0.6555 0.3449 0.8000 0.2718 0.9994 0.8983 0.6656 0.6812 0.3585 0.5993 0.9374 0.9013 0.4863 0.5389 0.3963 0.0636 0.0507 0.5925 0.9413 0.8124 0.8625 0.7828 0.8077 0.7563 0.0812 0.6991
REFERENCES

1. E. E. Lewis, Introduction to Reliability Engineering, 2nd ed., Wiley, New York, 1996.
2. M. Sahinoglu, Reliability Theory and Applications, unpublished class notes, Middle East Technical University, Ankara, Turkey, 1982.
3. K. S. Trivedi, Probability and Statistics with Reliability: Queuing and Computer Science Applications, 2nd ed., Wiley, Hoboken, NJ, 2002.
4. L. C. Wolstenholme, Reliability Modeling: A Statistical Approach, Chapman & Hall, London, 1999.
5. N. A. J. Hastings and J. B. Peacock, A Handbook for Students and Practitioners, 2nd ed., Butterworth, London, 1975.
6. V. Rothschild and N. Logothetis, Probability Distributions, Wiley, New York, 1985.
7. J. Banks, J. S. Carson II, B. L. Nelson, and D. M. Nicol, Discrete-Event System Simulation, 3rd ed., Prentice Hall, Upper Saddle River, NJ, 2001.
8. M. Sahinoglu, Random Number Generation and Simulation, unpublished class notes, Middle East Technical University, Ankara, Turkey, 1992.
9. D. R. Anderson, D. J. Sweeny, and T. A. Williams, An Introduction to Management Science: Quantitative Approaches to Decision Making, 11th ed., Thomson South-Western, Mason, OH, 2005.
10. G. G. Roussas, A First Course in Statistics, Addison-Wesley, Reading, MA, 1973.
11. M. Sahinoglu, D. Libby, and S. R. Das, Measuring Availability Indices with Small Samples for Component and Network Reliability Using the Sahinoglu–Libby Probability Model, IEEE Trans. Instrum. Meas., 54(3), 1283–1295 (June 2005).
12. M. Sahinoglu, Compound-Poisson Software Reliability Model, IEEE Trans. Software Eng., 18, 624–630 (July 1992).
13. P. Randolph and M. Sahinoglu, A Stopping Rule for a Compound Poisson Variable, J. Appl. Stochastic Models Data Anal., 11, 135–143 (June 1995).
14. M. Sahinoglu, Alternative Parameter Estimation Methods for the Compound Poisson Software Reliability Model with Clustered Failure Data, J. Software Test. Reliab. Verification, 17, 35–57 (March 1997).
15. M. Sahinoglu, An Empirical Bayesian Stopping Rule in Testing and Verification of Behavioral Models, IEEE Trans. Instrum. Meas., 52, 1428–1443 (October 2003).
16. M. Sahinoglu, J. Deely, and S. Capar, Stochastic Bayesian Measures to Compare Forecast Accuracy of Software Reliability Models, IEEE Trans. Reliab., 50, 92–97 (March 2001).
17. M. Sahinoglu, The Limit of Sum of Markov Bernoulli Variables in System Reliability Estimation, IEEE Trans. Reliab., 39, 46–50 (April 1990).
18. M. Sahinoglu, On Central Limit Theory for Statistically Non-independent and Nonidentical Variables, J. M.E.T.U. Stud. Dev. Appl. Stat., Special Volume, pp. 69–88 (1982).
19. M. Sahinoglu and O. L. Gebizlioglu, Exact PMF Estimation of System Indices in a Boundary-Crossing Problem, Commun. Fac. Sci. Univ. Ankara Ser. A1, 36(2), 115–121 (1987).
20. A. D. Patton, C. Singh, and M. Sahinoglu, Operating Considerations in Generation Reliability Modeling: Analytical Approach, IEEE Trans. Power Appar. Syst., 100, 2656–2663 (May 1981).
21. M. Sahinoglu, M. T. Longnecker, L. J. Ringer, C. Singh, and A. K. Ayoub, Probability Distribution Function for Generation Reliability Indices: Analytical Approach, IEEE Trans. Power Appar. Syst., 102, 1486–1493 (October 1983).
22. M. Sahinoglu and A. S. Selcuk, Application of Monte Carlo Simulation Method for the Estimation of Reliability Indices in Electric Power Generation Systems, Tubitak Doga-Tr., Turk. J. Eng. Environ. Sci., 17, 157–163 (1993).
23. S. Kokoska and C. Nevison, Statistical Tables and Formulae, Springer-Verlag, New York, 1989.
24. L. J. Bain, Statistical Analysis of Reliability and Life-Testing Models: Theory and Models, Marcel Dekker, New York, 1978.
25. M. J. Crowder, A. C. Kimber, R. L. Smith, and T. J. Sweeting, Statistical Analysis of Reliability Data, Chapman & Hall, London, 1991.
26. C. Cunnane, Unbiased Plotting Positions: A Review, J. Hydrol., 37, 205–222 (1978).
27. J. Ledolter and R. V. Hogg, Applied Statistics for Engineers and Physical Scientists, 2nd ed., Macmillan, New York, 1992.
28. W. Q. Meeker and L. A. Escobar, Statistical Methods for Reliability Data, Wiley, New York, 1996.
29. R. Billinton and R. N. Allen, Reliability Evaluation of Engineering Systems: Concepts and Techniques, Plenum Press, New York, 1983; personal communication, University of Manchester Institute of Science and Technology, Manchester, England, 1975.
30. M. Xie, Software Reliability Modeling, World Scientific, Singapore, 1991.
31. M. Xie, Software Reliability Models: Selected Annotated Bibliography, Software Test. Verification Reliab., 3, 3–28 (1993).
32. W. Farr, Chap. 3 in M. R. Lyu (ed.), Handbook of Software Reliability Engineering, IEEE Computer Society Press/McGraw-Hill, New York, 1996.
33. S. H. Kan, Metrics and Models in Software Quality Engineering, Addison-Wesley, Reading, MA, 1995.
34. J. D. Musa, A. Iannino, and K. Okumoto, Software Reliability: Measurement, Prediction, Application, McGraw-Hill International, Singapore, 1987.
35. J. C. Munson, Software Engineering Measurement, Auerbach Publishing, Boca Raton, FL, 2003.
36. M. A. Friedman and J. M. Voas, Software Assessment: Reliability, Safety, Testability, Wiley, New York, 1995.
37. M. L. Shooman, Software Engineering: Design, Reliability and Management, McGraw-Hill, New York, 1983, Chap. 5.
38. M. L. Shooman, Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design, Wiley, Hoboken, NJ, 2002.
39. F. B. Bastani, Software Reliability, IEEE Trans. Software Eng., Special Issue, 1993.
40. L. Bernstein and C. M. Yuhas, Trustworthy Systems Through Quantitative Software Engineering, IEEE Computer Society, Los Alamitos, CA, 2005.
41. G. R. Hudson, Program Errors as a Birth and Death Process, Report SP-3011, System Development Corporation, Santa Monica, CA, 1967.
42. Z. Jelinski and P. B. Moranda, Software Reliability Research, in W. Freiberger (ed.), Statistical Computer Performance Evaluation, Academic Press, New York, 1972, pp. 465–497.
43. P. B. Moranda, Prediction of Software Reliability During Debugging, Proceedings of the Annual Reliability and Maintainability Symposium, Washington, DC, IEEE Reliability Society, 1975, pp. 327–333.
44. M. L. Shooman, Probabilistic Models for Software Reliability Prediction, in W. Freiberger (ed.), Statistical Computer Performance Evaluation, Academic Press, New York, 1972, pp. 485–502.
45. G. J. Schick and R. W. Wolverton, An Analysis of Competing Software Reliability Models, IEEE Trans. Software Eng., 4(2), 104–120 (1978).
46. A. L. Goel and K. Okumoto, Time-Dependent Error-Detection Rate Model for Software Reliability and Other Performance Measures, IEEE Trans. Reliab., 28(3), 206–211 (1979).
47. J. D. Musa and K. Okumoto, A Logarithmic Poisson Execution Time Model for Software Reliability Measurement, Proceedings of the 6th International Conference on Software Engineering, Orlando, FL, IEEE Computer Society, 1984, pp. 230–238.
48. M. Ohba, Software Reliability Analysis Models, IBM J. Res. Dev., 28(4), 428–443 (1984).
49. J. T. Duane, Learning Curve Approach to Reliability Monitoring, IEEE Trans. Aerospace, 2(2), 563–566 (1964).
50. B. Littlewood, Rationale for a Modified Duane Model, IEEE Trans. Reliab., 33(2), 157–159 (1984).
51. S. Yamada, S. Osaki, and H. Narihisa, Discrete Models for Software Reliability, in A. P. Basu (ed.), Reliability and Quality Control, Elsevier, New York, 1986, pp. 401–412.
52. G. J. Knafl and J. Sacks, Poisson Process with Nearly Constant Failure Intensity, Proceedings of the International Symposium on Software Reliability Engineering, Austin, TX, IEEE Computer Society, 1991, pp. 60–66.
53. M. Zhao and M. Xie, On the Log-Power Model and Its Applications, Proceedings of the International Symposium on Software Reliability Engineering, Research Triangle Park, NC, IEEE Computer Society, 1992, pp. 14–22.
54. J. D. Musa, A Theory of Software Reliability and Its Application, IEEE Trans. Software Eng., 1(3), 312–327 (1975).
55. B. Littlewood and J. L. Verrall, A Bayesian Reliability Growth Model for Computer Software, Appl. Stat., 22(3), 332–346 (1973).
56. T. A. Mazzuchi and R. Soyer, A Bayes Empirical-Bayes Model for Software Reliability, IEEE Trans. Reliab., 37(3), 248–254 (1988).
57. N. Langberg and N. D. Singpurwalla, A Unification of Some Software Reliability Models, SIAM J. Sci. Stat. Comput., 6(3), 781–790 (1985).
58. W. S. Jewell, Bayesian Extensions to a Basic Model of Software Reliability, IEEE Trans. Software Eng., 11(12), 1465–1471 (1985).
59. E. Nelson, Estimating Software Reliability from Test Data, Microelectron. Reliab., 17(1), 67–74 (1978).
60. C. V. Ramamoorthy and F. B. Bastani, Software Reliability: Status and Perspectives, IEEE Trans. Software Eng., 8(4), 354–371 (1982).
61. F. B. Bastani and C. V. Ramamoorthy, Input-Domain-Based Models for Estimating the Correctness of Process Control Programs, in A. Serra and R. E. Barlow (eds.), Reliability Theory, North-Holland, Amsterdam, 1986, pp. 321–378.
62. J. C. Munson and T. M. Khoshgoftaar, The Use of Software Complexity Metrics in Software Reliability Modeling, Proceedings of the International Symposium on Software Reliability Engineering, Austin, TX, IEEE Computer Society, 1991, pp. 2–11.
63. R. G. Hamlet, Probable Correctness Theory, Inf. Process. Lett., 25(1), 17–25 (1987).
64. R. K. Scott, J. W. Gault, and D. F. McAllister, Fault-Tolerant Software Reliability Modeling, IEEE Trans. Software Eng., 13(5), 582–592 (1987).
65. S. N. Weiss and E. J. Weyuker, An Extended Domain-Based Model of Software Reliability, IEEE Trans. Software Eng., 14(12), 1512–1524 (1988).
66. B. Littlewood and D. R. Miller, Conceptual Modeling of Coincident Failures in Multiversion Software, IEEE Trans. Software Eng., 15(12), 1596–1614 (1989).
67. R. W. Butler and G. B. Finelli, The Infeasibility of Quantifying the Reliability of Life-Critical Real-Time Software, IEEE Trans. Software Eng., 19(1), 3–12 (1993).
68. T. Stalhane, Fault Tree Analysis Applied to Software, in T. Aven (ed.), Reliability Achievement: The Commercial Incentive, Elsevier, London, 1989, pp. 166–178.
69. C. Wohlin and U. Korner, Software Faults: Spreading, Detection and Costs, Software Eng. J., 5(1), 33–42 (1990).
70. H. D. Mills, On the Statistical Validation of Computer Programs, Report FSC-72-6015, IBM Federal Systems Division, Gaithersburg, MD, 1972.
71. X. Z. Huang, The Hypergeometric Distribution Model for Predicting the Reliability of Software, Microelectron. Reliab., 24(1), 11–20 (1984).
72. J. W. Duran and J. J. Wiorkowski, Capture–Recapture Sampling for Estimating Software Error Content, IEEE Trans. Software Eng., 7(1), 147–148 (1981).
73. P. A. Keiller and D. R. Miller, On the Use and the Performance of Software Reliability Growth Models, Reliab. Eng. Syst. Saf., 32(2), 95–117 (1991).
74. T. M. Khoshgoftaar and T. G. Woodcock, Software Reliability Model Selection: A Case Study, Proceedings of the International Symposium on Software Reliability Engineering, Austin, TX, IEEE Computer Society, 1991, pp. 183–191.
75. M. R. Lyu and A. Nikora, A Heuristic Approach for Software Reliability Prediction: The Equally-Weighted Linear Combination Model, Proceedings of the International Symposium on Software Reliability Engineering, Austin, TX, IEEE Computer Society, 1991, pp. 172–181.
76. T. Bendell and P. Mellor, Software Reliability: State of the Art Report, Pergamon Infotech, London, 1986.
77. B. Littlewood (ed.), Software Reliability: Achievement and Assessment, Blackwell, Oxford, 1987.
78. A. L. Goel, Software Reliability Models: Assumptions, Limitations, and Applicability, IEEE Trans. Software Eng., 11(12), 1411–1423 (1985).
79. K. Okumoto and A. Goel, Optimum Release Time for Software Systems Based on Reliability and Other Performance Measures, J. Syst. Software, 1(4), 315–318 (1980).
80. N. F. Schneidewind, Analysis of Error Processes in Computer Software, Sigplan Not., 10(6), 337–346 (1975).
81. W. H. Farr, A Survey of Software Reliability Modeling and Estimation, NSWC TR-171, Naval Surface Warfare Center, 1333 Isaac Hull Ave SE, Washington Navy Yard, DC 20376-7107, September 1983.
82. M. Sahinoglu, Statistical Inference on the Reliability Performance Index for Electric Power Generation Systems, Ph.D. dissertation, Texas A&M University, College Station, TX, 1981, pp. 15–32.
83. B. Littlewood, Stochastic Reliability Growth: A Model for Fault-Removal in Computer Programs and Hardware Designs, IEEE Trans. Reliab., 30(4), 313–320 (October 1981).
84. P. A. Keiller, B. Littlewood, D. R. Miller, and A. Sofer, Comparison of Software Reliability Predictions, Proceedings of the 13th IEEE International Symposium on Fault Tolerant Computing, 1983, pp. 128–134.
85. G. Liu, A Bayesian Assessing Method of Software Reliability Growth, in S. Osaki and J. Cao (eds.), Reliability Theory and Applications, World Scientific, Singapore, 1987, pp. 237–244.
86. W. E. Thompson and P. O. Chelson, On the Specification of Testing of Software Reliability, Proceedings of the 1980 Annual Reliability and Maintainability Symposium, IEEE, New York, 1980, pp. 379–383.
87. J. D. Musa and K. Okumoto, A Comparison of Time Domains for Software Reliability Models, J. Syst. Software, 4(4), 277–287 (1984).
88. C. C. Sherbrooke, Discrete Compound Poisson Processes and Tables of the Geometric Poisson Distribution, Memorandum RM-4831-PR, Rand Corporation, Santa Monica, CA, July 1966.
89. E. Cinlar, Introduction to Stochastic Processes, Prentice-Hall, Englewood Cliffs, NJ, 1975.
90. R. B. D'Agostino and M. A. Stephens, Goodness of Fit Techniques, Marcel Dekker, New York, 1986.
91. P. C. Consul, Generalized Poisson Distributions, Marcel Dekker, New York, 1989.
92. R. M. Adelson, Compound Poisson Distributions, Oper. Res. Q., 17, 73–74 (1966).
93. R. F. Serfozo, Compound Poisson Approximations for Sums of Random Variables, Ann. Probab., 14, 1391–1398 (1986).
94. R. A. Fisher, The Significance of Deviations from Expectation in a Poisson Series, Biometrics, pp. 17–24 (1950).
95. W. Feller, An Introduction to Probability Theory and Its Applications, 3rd ed., Vol. 1, Wiley, New York, 1968, pp. 288–292.
96. M. Sahinoglu, Geometric Poisson Density Estimation of the Number of Software Failures, IEEE Proceedings of the 28th Annual Reliability Conference, Spring Seminar of the Central New England Council, Boston Chapter Reliability Society, April 1999, pp. 149–174.
97. Student, Biometrika, 12, 211–215 (1919).
98. M. Greenwood and G. U. Yule, An Inquiry into the Nature of Frequency Distributions Representative of Multiple Happenings, J. Roy. Stat. Soc., 83, 255–279 (1920).
99. B. Brown, Some Tables of the Negative Binomial Distribution and Their Use, Memorandum RM-4577-PR, Rand Corporation, Santa Monica, CA, June 1965.
100. F. N. David and N. L. Johnson, The Truncated Poisson, Biometrics, pp. 275–285 (December 1952).
101. Encycl. Stat. Sci., 5, 92–93, 111–113 (1988).
102. Encycl. Stat. Sci., 6, 169–176 (1988).
103. J. D. Musa, Software Reliability Data, Bell Telephone Laboratories, Whippany, NJ, 1979.
104. M. Sahinoglu, Applied Stochastic Processes: Class Notes Simplified, Middle East Technical University, Ankara, Turkey, June 1992.
105. J. C. Laprie and K. Kanoun, X-Ware Reliability and Availability Modeling, IEEE Trans. Software Eng., 18(2), 130–147 (1992).
106. G. J. Schick and R. W. Wolverton, An Analysis of Competing Software Reliability Models, IEEE Trans. Software Eng., 4(2), 104–120 (1978).
107. L. H. Crow, Reliability Analysis for Complex, Repairable Systems, in F. Proschan and R. J. Serfling (eds.), Reliability and Biometry, SIAM, Philadelphia, PA, 1974, pp. 379–410.
EXERCISES

1.1 At the end of one year of service, the reliability of a certain software product during its useful life period after the debugging process (assuming a constant failure rate) is 0.8.
(a) What is the failure rate of this software product in hours?
(b) If four of these products are put in series and active parallel independently, what are the annual reliability figures for the series and active parallel systems, respectively?
(c) For active parallel, if 30% of the component failure rate may be attributed to common-mode failures, what will the annual reliability become for the two components in parallel?
(d) Suppose that the failure rate for a software component is given as 0.08 per hour. How many components must be placed in active parallel form if a distributed system of modules will have to run for 100 hours with a system reliability of no less than 95%?
(e) Assuming now that the annual reliability of the software module is improved to 0.95, a series system of four components is formed. A second set of four components is bought and a redundant system is built. What is the reliability of the new redundant system with (1) high-level redundancy, drawing the representation in numbered blocks, and (2) low-level redundancy, drawing the representation in numbered blocks?

1.2 (a) A wear test is run on 10 PC hard drives and the following times in months are found:
27, 39+, 40, 54, 68+, 85, 93, 102, 135+, 144
Using the product-limit (Kaplan–Meier) technique to account for censoring, make a nonparametric plot of the reliability and hazard functions.
(b) A nonreplacement test is run for 60 hours on 40 microprocessors. Five failures occur at 12, 19, 28, 39, and 47 hours. Estimate the value of the constant failure rate. Also find approximate upper and lower bounds for the MTTF.

1.3 The reliability of an operating system in time is given by

R(t) = (1 − 0.2t)²,  0 ≤ t ≤ 5;  R(t) = 0,  t > 5

where the original p.d.f. of failure time was f(t) = (0.4)(1 − 0.2t) for 0 < t ≤ 5.
(a) Verify R(t) using f(t). Calculate the MTTF.
(b) Find the failure or hazard rate h(t). Is the failure rate increasing or decreasing? Justify your result.
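As a quick numerical companion to Exercise 1.1 (an illustrative sketch, not part of the original text), the constant-failure-rate relations R(t) = e^(−λt), series reliability as a product of component reliabilities, and active-parallel reliability as the complement of the product of unreliabilities can be checked directly:

```python
import math

R_annual = 0.8                 # one-year reliability of the software product
HOURS_PER_YEAR = 8760

# (a) Constant failure rate from R(t) = exp(-lambda * t)
lam = -math.log(R_annual) / HOURS_PER_YEAR
print(f"failure rate = {lam:.3e} per hour")          # ~2.55e-05 per hour

# (b) Four independent components over a one-year mission
n = 4
R_series = R_annual ** n                             # series: all must survive
R_parallel = 1 - (1 - R_annual) ** n                 # active parallel: at least one survives
print(f"series = {R_series:.4f}, parallel = {R_parallel:.4f}")   # 0.4096, 0.9984
```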
(c) How often should it be updated if failures are to be held to no more than 5%?

1.4 A constant-failure-rate device (a desktop PC) has an MTTF of 2000 hours. The vendor offers a one-year warranty. What fraction of the PCs will fail during the warranty period?

1.5 A software module being marketed is tested for two months and found to have a reliability of 0.99; the module is known to have a constant failure rate.
(a) What is the failure rate?
(b) What is the MTTF?
(c) What is the reliability of this product four years into its operation if it is in continuous use?
(d) What should the warranty time be to achieve an operational reliability of 95%?

1.6 The reliability growth models are outlined in Section 1.6.2. Among the most popular are the generalized Goel–Okumoto NHPP (when c = 1, the Weibull becomes exponential, and when c = 2, it becomes the Rayleigh model) and the Musa–Okumoto logarithmic Poisson execution time model, which have different mean value functions to predict the number of failures at the end of a mission time t. Given the following input data, calculate and compare the mean values of the number of failures expected to be predicted by the end of a time t for each model. These models are all NHPP. Use the following data as necessary: t (the time at which to predict) = 80 CPU hours; a (the number of failures expected at the end of the mission time) = υ0 = κ = 100; b = θ (the fault detection rate per fault) = 0.02; and λ0 (the initial failure intensity) = 10 per hour. (A numerical sketch follows Table E1.7.)

1.7 Table E1.7 lists uncensored grouped data on the failure of identical commercial software modules collected by an independent, quality-focused nonprofit organization. Draw a nonparametric plot of the reliability and cumulative hazard functions versus time.

TABLE E1.7

Interval (CPU seconds)    Number of Failures
0–6,000                    5
6,000–12,000              19
12,000–18,000             61
18,000–24,000             27
24,000–30,000             20
30,000–36,000             17
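For Exercise 1.6, the two mean value functions can be evaluated directly. The sketch below assumes the standard textbook forms μ(t) = a(1 − e^(−bt)) for the Goel–Okumoto NHPP with c = 1 and μ(t) = (1/θ) ln(λ0 θ t + 1) for the Musa–Okumoto logarithmic Poisson model:

```python
import math

t = 80.0                   # CPU hours at which to predict
a, b = 100.0, 0.02         # G-O: total expected failures; per-fault detection rate
lam0, theta = 10.0, 0.02   # M-O: initial failure intensity; rate parameter

mu_go = a * (1 - math.exp(-b * t))                    # Goel-Okumoto mean value function
mu_mo = (1 / theta) * math.log(lam0 * theta * t + 1)  # Musa-Okumoto mean value function
print(f"Goel-Okumoto:  mu(80) = {mu_go:.1f}")   # ~79.8 failures
print(f"Musa-Okumoto:  mu(80) = {mu_mo:.1f}")   # ~141.7 failures
```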
1.8 Calculate the source–target (s-t) reliabilities of the following systems: (a) Take s = 1, t = 13 in Figure E1.8(a). (b) Take s = 5, t = 9 in Figure E1.8(b).
FIGURE E1.8(a) [Network reliability block diagram for Exercise 1.8(a): numbered nodes 1 to 19, each with reliability 0.9, connected by links with reliability 0.8 each; the diagram itself is not recoverable from the source.]
FIGURE E1.8(b) [Network reliability block diagram for Exercise 1.8(b): numbered nodes 1 to 11, each with reliability 0.9000, connected by links with reliability 1.0 each; the diagram itself is not recoverable from the source.]
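Since Figures E1.8(a) and (b) survive here only as captions, the following generic sketch shows how an exact s–t reliability of the kind asked for in Exercise 1.8 can be computed by brute-force state enumeration. The five-link bridge network and its reliabilities are placeholders, not the networks of the figures:

```python
from itertools import product

# Hypothetical bridge network; link reliabilities are placeholders.
links = {('s', 'a'): 0.9, ('s', 'b'): 0.9, ('a', 'b'): 0.8,
         ('a', 't'): 0.9, ('b', 't'): 0.9}

def connected(up, s='s', t='t'):
    """Depth-first search over the links that are up (undirected)."""
    stack, seen = [s], {s}
    while stack:
        node = stack.pop()
        for u, v in up:
            if u == node and v not in seen:
                seen.add(v); stack.append(v)
            elif v == node and u not in seen:
                seen.add(u); stack.append(u)
    return t in seen

names = list(links)
R = 0.0
for state in product((False, True), repeat=len(names)):
    p = 1.0
    up = []
    for name, is_up in zip(names, state):
        p *= links[name] if is_up else 1.0 - links[name]
        if is_up:
            up.append(name)
    if connected(up):
        R += p            # add the probability of every connected link-state
print(f"s-t reliability = {R:.4f}")   # 0.9769 for this bridge
```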
1.9 A disk drive has a constant failure rate λ with an MTTF of 5000 hours. What is the probability of failure for one year of operation? What is the probability of failure for one year of operation if two of the drives are placed in active parallel mode with failures assumed to be independent? What if in series?
1.10 What is the probability of failure for one year of active parallel operation if the common-mode errors are characterized by β = 0.2?

1.11 Suppose that a system consists of two components placed in series, each with a failure rate λ = 1 per year. A redundant system is built consisting of four identical components. Derive and calculate expressions for the system's failure rates after t = 1 year of operation: (a) for high-level redundancy; (b) for low-level redundancy.

1.12 Suppose that without preventive maintenance, the failure (hazard) rate of a computer hard drive is given by λ(t) = (1.1 × 10⁻⁶)t + (1.0 × 10⁻⁹)t², where t is given in CPU hours and λ(t) is per year.
(a) Calculate the design-life reliability of the hard drive for a design life of 10,000 CPU hours, assuming that no preventive maintenance is performed.
(b) Suppose that by overhaul the hard drive is returned to as-good-as-new condition. How frequently should the PC company perform overhauls to accomplish a design reliability of at least 0.90?

1.13 Suppose that a computer keyboard is life-tested and has an MTTF of 68 hours and an MTTR (mean time to repair) of 1.5 hours. What is the availability with corrective maintenance? If the MTTR is reduced to 1 hour with extra measures, what MTTF can be tolerated without altering the machine availability?

1.14 An annual life test revealed that 92% of computer hard disks were found to be operable. According to these data, what is the long-term availability of these disks? If one is not happy with this availability figure and desires to reach an availability value of 0.98, how often must one perform testing and replacement procedures?

1.15 At the end of one year of service, the unreliability of certain General Electric light bulbs, assuming a constant failure rate, is 10%.
(a) What is the failure rate in hours?
(b) If two bulbs are put independently in series and active parallel, what are the annual reliability figures, respectively?
(c) For active parallel, if 20% of the component failure rate may be attributed to common-mode failures, what will the annual reliability become for the two components in active parallel?
(d) Suppose that the design failure rate for the component is given to be 0.008 per hour. How many bulbs must be placed in active parallel if a system of lamps will have to run for 100 hours with a system reliability of no less than 95%?
(e) Assuming now that the annual reliability of the bulb is improved to 0.96, a series system of three bulbs is formed. A second set of three
components is bought and a redundant system is built. What is the reliability of the new redundant system with (1) high-level redundancy; (2) low-level redundancy?

1.16 Let us consider a software program with an initial failure intensity of 10 failures/hour and 100 total failures to be experienced in infinite time. What is the failure intensity λ(τ) going to be at τ = 10 and 100 execution hours (a) for Musa's basic execution model and (b) for the Musa–Okumoto logarithmic Poisson model? Having calculated those, estimate the predicted number of failures μ(τ) at τ = 10 and τ = 100 execution hours for parts (a) and (b). (A numerical sketch follows Table E1.17(a).)

1.17 For the grouped (clustered) failure data in the format of Table E1.17(a) and presented in Table E1.17(b) for system T38 [34], use Sahinoglu's compound Poisson software reliability model to estimate the number of failures to be expected by the end of the mission time (at the end of the eleventh interval) at step 7 by applying (a) the Poisson∧geometric distribution model; (b) the Poisson∧logarithmic series distribution model; (c) the Goel–Okumoto model; (d) the Musa–Okumoto model.

TABLE E1.17(a) Grouped Failure Data Format

Interval    Total Test Time       Duration of        Number of Failures    Cumulative Number
Number      After the Interval    the Interval       in the Interval       of Failures
1           x1                    x1                 y1                    z1
2           x2                    x2 − x1            y2                    z2
...         ...                   ...                ...                   ...
p           xp                    xp − xp−1          yp                    zp
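For Exercise 1.16, a brief sketch under the usual textbook forms: Musa's basic model with λ(τ) = λ0 e^(−(λ0/ν0)τ) and μ(τ) = ν0(1 − e^(−(λ0/ν0)τ)), and the Musa–Okumoto model with λ(τ) = λ0/(λ0θτ + 1) and μ(τ) = (1/θ) ln(λ0θτ + 1). The value θ = 0.02 is borrowed from Exercise 1.6 as an assumption, since Exercise 1.16 does not fix it:

```python
import math

lam0, nu0, theta = 10.0, 100.0, 0.02   # initial intensity, total failures, M-O rate

def musa_basic(tau):
    decay = math.exp(-(lam0 / nu0) * tau)
    return lam0 * decay, nu0 * (1 - decay)       # (intensity, mean failures)

def musa_okumoto(tau):
    z = lam0 * theta * tau + 1
    return lam0 / z, math.log(z) / theta         # (intensity, mean failures)

for tau in (10.0, 100.0):
    lb, mb = musa_basic(tau)
    lo, mo = musa_okumoto(tau)
    print(f"tau={tau:5.0f}: basic lambda={lb:.3f}, mu={mb:.1f}; "
          f"M-O lambda={lo:.3f}, mu={mo:.1f}")
```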
1.18 Consider the following multiply censored data without replacement for computer hard disks, in months (+ denotes a censored unit):

30, 39, 40+, 54, 68+, 85, 93+, 102, 135, 144+

(a) Prepare a spreadsheet for these ungrouped data showing time, reliability function, and hazard function columns. Make a nonparametric plot of the reliability and hazard functions. Estimate the MTTF for the disks.
(b) Suppose that these were the initial n = 10 failures recorded at the end of the experiment (type II: stop at the nth failure) for a total of N = 20 disks. Estimate the MTTF and find the upper and lower 90% confidence interval estimates. Calculate for (1) nonreplacement and (2) replacement.
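A minimal product-limit (Kaplan–Meier) sketch of the computation that Exercises 1.2(a) and 1.18(a) call for, using the Exercise 1.18 data (True marks a failure, False a censored withdrawal):

```python
# Multiply censored hard-disk lifetimes in months (Exercise 1.18).
data = [(30, True), (39, True), (40, False), (54, True), (68, False),
        (85, True), (93, False), (102, True), (135, True), (144, False)]

at_risk = len(data)
R = 1.0
print("time  at-risk   R(t)")
for time, failed in sorted(data):
    if failed:
        R *= (at_risk - 1) / at_risk       # product-limit step at each failure time
        print(f"{time:4d}  {at_risk:7d}  {R:6.3f}")
    at_risk -= 1                           # censored units also leave the risk set
```

From the R(t) column, a cumulative hazard estimate follows as H(t) = −ln R(t) at the failure times.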
TABLE E1.17(b) Grouped Failure Data for System T38

Interval    Total Test Time at the End      Duration of the Test      Number of Failures    Cumulative Number
Number      of the Interval (CPU hours)     Interval (CPU hours)      in the Interval       of Failures
1            5                               5                         1                     1
2           15                              10                         0                     1
3           25                              10                        16                    17
4           35                              10                         1                    18
5           45                              10                         1                    19
6           50                               5                         0                    19
7           65                              15                         1                    20
8           75                              10                         3                    23
9           95                              20                         2                    25
10         120                              25                         7                    32
11         125                               5                         0                    32
1.19 Given the failure times 5.2, 6.8, 11.2, 16.8, 17.8, 19.6, 23.4, 25.4, 32.0, and 44.8 hours:
(a) Using a Weibull probability p.d.f., determine the scale and shape parameters.
(b) Using a negative exponential p.d.f., estimate the failure rate, θ.
(c) Which is a better estimate? Why?

1.20 A constant-failure-rate device (PC monitor) has MTTF = 2000 hours. The vendor offers a one-year warranty. What fraction of the PC monitors will fail during the warranty period?

1.21 Given the time-domain data sets T1, T2, T3, T4, and T5 in the data bank of this book's CD-ROM, use five software reliability models of your choice to predict the expected value and hazard functions at the end of the mission. Explain which is best, and why.

1.22 Given the grouped data for WD1, WD2, WD3, WD4, and WD5 in the data bank of the CD-ROM, use five software reliability models of your choice to predict expected value and hazard functions at the end of the mission. Which is best, and why?

1.23 Given q = 3 and mean (M) = 10, plot two curves for the p.d.f. and survival function using the CD-ROM for both Poisson∧geometric and Poisson∧log series.

1.24 Using the uniform random table in Appendix 1A from right to left, generate at least a pair of random variables from each of (a) the p.m.f.'s in Section 1.3 and (b) the p.d.f.'s in Section 1.3 from Table 1.1 only.

1.25 Generate a pair of random deviates from all the rest of the p.d.f.'s and p.m.f.'s not covered in Exercise 1.24 using Appendix 1A. (A sampling sketch for the compound Poisson case follows.)
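For the compound Poisson exercises (1.23 to 1.25), one way to generate a Poisson∧geometric deviate is by direct compounding: draw a Poisson number of clumps, then a geometric size (at least 1) for each clump. The sketch below is illustrative, and its parameter values are arbitrary:

```python
import math
import random

def poisson_deviate(mean, rng):
    """Knuth's product method for a Poisson random variate."""
    limit, k, prod = math.exp(-mean), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return k
        k += 1

def poisson_geometric(beta_t, r, rng):
    """Compound Poisson^geometric: Poisson number of clumps, geometric clump sizes."""
    total = 0
    for _ in range(poisson_deviate(beta_t, rng)):
        size = 1
        while rng.random() < r:   # P(size = k) = (1 - r) * r**(k - 1)
            size += 1
        total += size
    return total

rng = random.Random(2007)
sample = [poisson_geometric(beta_t=2.0, r=0.7, rng=rng) for _ in range(10)]
print(sample)   # distribution mean is beta_t / (1 - r) = 6.67
```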
Toil, earn, eat, and give others your wages.
Our first duty is good character and good efforts.
Hand out to others what you earn,
Do the poor people a good turn.
—Yunus Emre, the legendary mystic folk poet (1238–1320)
2

SOFTWARE RELIABILITY MODELING WITH CLUSTERED FAILURE DATA AND STOCHASTIC MEASURES TO COMPARE PREDICTIVE ACCURACY OF FAILURE-COUNT MODELS

2.1 SOFTWARE RELIABILITY MODELS USING THE COMPOUND POISSON MODEL

Nutshell 2.1

The subject matter of this section is a failure-count prediction technique for the software testing process known as the compound Poisson software reliability model (CPSRM), a generalization of the compound Poisson model for estimating the residual number of software failures in testing. Conventional nonhomogeneous Poisson process (NHPP) models do not treat or permit multiple counts per failure arrival, as the Poisson∧geometric and Poisson∧LSD (or negative binomial) models do. These two models are superior to other competing models, such as the Musa–Okumoto logarithmic Poisson method, on most data sets with clumping of failures. CPSRM is a technique to predict the failure count at a future time. It explicitly takes into account the clumping effect in an NHPP process by using the compound Poisson, where clustering was not modeled in earlier calculations. The probability density estimation of the remaining number of software failures in the event of clustering or clumping of the software failures is the subject of this section.
2.1.1 Notation and Introduction

A discrete compound Poisson prediction model, as opposed to a Poisson process, is proposed for the random variable X, which denotes the remaining number of software failures. The compounding distributions, which are assumed to govern the failure sizes at Poisson arrivals, are taken to be geometric when failures are forgetful and logarithmic series when failures are contagious. The expected value of X is calculated as a function of the Poisson and compounding distributions based on the failure sizes experienced. The compound Poisson (CP) software reliability model was first proposed for time-between-failures data in terms of CPU seconds using the maximum likelihood estimation (MLE) method to estimate unknown parameters: hence, CPMLE [10]. However, another parameter estimation technique is proposed under nonlinear regression analysis (NLR) for the compound Poisson reliability model, giving rise to the name CPNLR [12]. It is observed that the CP model, with different parameter estimation methods, produces results as satisfactory as or more favorable than those obtained using the Musa–Okumoto (M-O) model, particularly in the event of grouped or clustered (clumped) software failure data. The sampling unit may be a day, week, or month within which the failures are clumped, as dictated by the error-recording facilities within a software-testing environment. The CPNLR and CPMLE methods proposed yield comparatively more favorable results for certain software failure data structures, where the frequency distribution of the cluster size of the software failures, such as per week, displays a negative exponential behavior. Average absolute relative error (ARE), mean-squared error (MSE), and average Kolmogorov–Smirnov (K-S) statistics are used as measures of forecast quality for the proposed and competing parameter estimation techniques in predicting the number of remaining future failures expected to occur until a target stopping time. Comparisons of five simulated data sets that contain weekly recorded software failures are made to emphasize the advantages and disadvantages of competing methods by means of chronological prediction plots around the true target value and the zero percent relative error line. The generalized compound Poisson (MLE and NLR) methods proposed consistently produce more favorable predictions for software failure data whose failure clump size has a negative exponential frequency distribution over the number of weeks. Otherwise, the popularly used competing M-O logarithmic Poisson model is a better fit for data with a uniform cluster-size distribution, since the logarithm of the Poisson equation is a constant, hence uniform. The software analyst is urged to perform exploratory data analysis to recognize the nature of the software failure data before selecting a particular reliability estimation method.

After the initial papers [1,2] presenting a software reliability model based on nonhomogeneous Poisson processes with failure count models, others followed [3–5], later to be collected in a bibliography by Xie [6], who included CP methods of estimation. The difficulty lies in the limitations of existing methods for solving nonlinear parameter estimation techniques for many software
reliability models in the literature. In this chapter we study the predictive significance of certain parameter estimation methods for the compound Poisson reliability model in the event of clustered multiple-failure data as opposed to single-failure data [7]. We will study the maximum likelihood estimation (MLE) and nonlinear regression analysis (NLR) methods: CPMLE and CPNLR. Clustered data are frequently collected in software testing practice, such as in the telecommunications world, where testing is carried on, for example, in units of days, weeks, or months. Such results have been observed in Bellcore's and Jet Propulsion's software testing laboratories [8]. In earlier publications on the CP model [3,9–13] regarding clustered and grouped data [14] and Moranda's geometric de-eutrophication model, the original assumption is that the failure intensity is proportional to the current fault content [15]. Results computed using the CPMLE and CPNLR methods proposed compare favorably with the commonly used Musa–Okumoto logarithmic Poisson model [5,7] for grouped or clustered failure data on five separate data sets recorded weekly.

2.1.2 Background and Motivation

As suggested intuitively [11,12], the rarely encountered phenomenon of clumped failures within a CPU second in the case of time-between-failures data was in fact a serious disadvantage from the viewpoint of the use of CP theory [11,12,16]. However, the CP model proposed proved more practical for clustered or grouped failure data, where the multiple failures accumulated within a week or day, for example, worked considerably to the advantage of the CP model. In a publication by Sahinoglu [10], values of q (the variance-to-mean ratio) start to be significantly greater than 1. Owing to the noninfluencing phenomenon of the multiple failures found clumped within each sampling time unit, the compounding distribution was selected to be the geometric distribution after a series of goodness-of-fit studies. The resulting distribution was thus called Poisson∧geometric [16]. Note that the geometric distribution is a discrete analog of the continuous negative exponential probability density function. It has a property similar to the nonaging (Markov) property of the negative exponential distribution [10]. Another alternative, the Poisson∧logarithmic series, for the event of failures within a clump that have a contagious nature and influence each other, has been studied in the original CP software reliability model [10,17,44]. Parameter estimation for grouped, interval, or clustered data in software reliability has been studied [7,10,14] in detail in terms of both CPU seconds and grouped calendar time units. The CPMLE and CPNLR proposed for the CP model in the event of clustered failure data are new parameter estimation alternatives, which have superior characteristics in terms of prediction accuracy. The predictions will be compared with those of the M-O grouped data estimation method in terms of the average relative error (ARE), a modified version of mean-squared error (MSE), and the average Kolmogorov–Smirnov test statistic (K-S average Dn) [18–20]. Note that conventional nonhomogeneous Poisson process models do not permit the option of multiple counts per arrival, and the CP model is superior when failure clumping exists [6,10].
2.1.3 Maximum Likelihood Estimation in the Poisson∧Geometric Model

It is well accepted that the MLE method is quite straightforward, having many desired properties, such as asymptotic normality, admissibility, robustness, and consistency, as in Kendall and Stuart's classic textbook [21]. The main idea behind MLE is the use of an n-tuple joint likelihood function of the random vector X under observation to estimate the parameters of the (compound) Poisson∧geometric distribution. That is,

f_X(X_i, θ) = P(X) = \sum_{Y=1}^{X} \frac{(βt)^Y e^{−βt}}{Y!} \binom{X−1}{Y−1} r^{X−Y} (1 − r)^Y,   θ = (β, r),  β > 0,  0 < r < 1   (1)

Let X_1, X_2, ..., X_n be a vector of random samples from f_X(X_i, θ), and let L be its joint likelihood function, whose logarithm is to be maximized to find the maximum likelihood estimates of the parameters β and r:

L = L(θ) = \prod_{i=1}^{n} f_X(X_i, θ)   (2)

L* = ln L   (3)

∂L*/∂β = 0   (4)

∂L*/∂r = 0   (5)

where the ∂ operator denotes "partial derivative of." Hence, from these two nonlinear equations in the two unknowns β and r,

β̂ = \sum_i y_i / \sum_i t_i = Y_past / t_past,   i = 1, ..., n   (6)

r̂ = 1 − \sum_i y_i / \sum_i x_i = 1 − Y_past / X_past   (7)

are the maximum likelihood estimators of the unknown parameters β and r. Consequently, the expected value of X, the Poisson∧geometric random variable, within the next unit time interval {t, t + 1} is β/p, where p = 1 − r. The CP reliability model proposed suggested that the number of remaining software failures expected to occur within the next time interval {t, t_rem} was [10,13]

E[X(t_rem)] = β t_rem / (1 − r) + e   (8)

In other words, the estimate of the quotient β/(1 − r) multiplied by the unexecuted time units remaining (i.e., t_rem in CPU seconds or calendar weeks) will estimate the expected number of future failures remaining.
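As an illustrative numerical check (not part of the original text), equations (6) to (8) can be applied at a single checkpoint. The values below reproduce the first row of Table 2.1 for data set WD1: after t_past = 6 weeks, Y_past = 4 weeks contained failures, X_past = 13 failures had accumulated, and t_rem = 54 weeks remained:

```python
def cpmle_predict(y_past, x_past, t_past, t_rem):
    """CPMLE point prediction from eqs. (6)-(8)."""
    beta = y_past / t_past             # eq. (6): Poisson arrival rate per week
    p = y_past / x_past                # eq. (7): p = 1 - r
    x_tot = x_past + beta * t_rem / p  # predicted grand total via eq. (8)
    return beta, p, x_tot

beta, p, x_tot = cpmle_predict(y_past=4, x_past=13, t_past=6, t_rem=54)
print(f"beta = {beta:.5f}, 1 - r = {p:.5f}, X_tot = {x_tot:.1f}")
# beta = 0.66667, 1 - r = 0.30769, X_tot = 130.0 -- first row of Table 2.1
```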
Thus, X_rem, which is the remaining number of failures from the end of the initial testing phase until the prescribed stopping time, is a function of the past information and the remaining (residual) time, t_rem. The total number of unknown failures is X_tot, and the sum of all single or multiple failures already discovered is X_past. Note that β of the Poisson process is the average number of (failure) arrivals per unit time, and r is the probability of finding the next failure in the batch or clump (e.g., week) following that arrival. Then p = 1 − r is the probability of starting the Poisson process for the next failure arrival. The process may be likened to individual customers entering a supermarket independently during shopping hours (customer count due to Poisson) to purchase one or more products selected independently of each other (clump size due to geometric) at each customer entry. The total number of products X purchased within a time interval t then has a Poisson∧geometric distribution.

It is appropriate to describe briefly the weekly recorded software failure weekly-data (WD) sets WD1 to WD5, which were simulated on a time basis. Data set WD1 corresponds to a 61-week-long software test and has a total of 131 accumulated failures at the termination of testing activity. Similarly, the total numbers of failures and weeks for the other data sets are: 213 failures within 223 weeks for WD2, 340 failures within 41 weeks for WD3, 197 failures within 114 weeks for WD4, and 366 failures within 50 weeks for WD5. See Figures 2.1(a), 2.2(a), 2.3(a), 2.4(a), and 2.5(a) for the characterization of these clustered data sets, which are recorded on the basis of weekly intervals [5,12]. Additionally, the number of weeks (WEEKS) versus clump size (NUMBER OF FAILURES/WEEK) displays a negative exponential plot for WD1, WD2, WD4, and WD5 (see Figures 2.6, 2.7, 2.9, and 2.10). Concerning WD3, which displays a different frequency distribution from the others, the plot is quasiuniform and nonexponential (see Figure 2.8). This is why WD3 is clearly different. Using the principles noted above, Table 2.1 summarizes the CPMLE parameter estimation results for example data set WD1, illustrated in Figure 2.1(a). The results of Table 2.1 are plotted in Figure 2.1(b) and (c). Similarly, Figures 2.2(b) and (c), 2.3(b) and (c), 2.4(b) and (c), and 2.5(b) and (c) are for WD2, WD3, WD4, and WD5, respectively. Table 2.2 for WD1 can be obtained by applying an SPSS algorithm as illustrated in Figures 2.11 to 2.14.

2.1.4 Nonlinear Regression Estimation in the Poisson∧Geometric Model

The NLR technique, as a least-squares method, provides an excellent alternative to MLE and is used to estimate the unknown parameters for models that are not linear in their parameters [5,19,21]. The general improvement in the quality of estimation of the unknown parameters, and consequently in predicting the residual number of failures, is due to the nonlinear and nonnormal nature of the small or medium-sized sample studied.
FIGURE 2.1 Data set WD1: (a) failures per calendar week; (b) Xtot versus time; (c) percentage relative error versus time. [Three-panel plot: (a) error discovery rate in failures per test week; (b) X_TOTAL versus normalized % of time for CPNLR, CPMLE, and M-O; (c) % relative error versus normalized % of time for the same three methods.]
FIGURE 2.2 Data set WD2: (a) failures per calendar week; (b) Xtot versus time; (c) percentage relative error versus time. [Panels as in Figure 2.1.]
FIGURE 2.3 Data set WD3: (a) failures per calendar week; (b) Xtot versus time; (c) percentage relative error versus time. [Panels as in Figure 2.1.]
FIGURE 2.4 Data set WD4: (a) failures per calendar week; (b) Xtot versus time; (c) percentage relative error versus time. [Panels as in Figure 2.1.]
FIGURE 2.5 Data set WD5: (a) failures per calendar week; (b) Xtot versus time; (c) percentage relative error versus time. [Panels as in Figure 2.1.]
FIGURE 2.6 Failures per week for data set WD1. [Histogram of WEEKS versus NUMBER OF FAILURES/WEEK omitted.]
FIGURE 2.7 Failures per week for data set WD2. [Histogram omitted.]
FIGURE 2.8 Failures per week for data set WD3. [Histogram omitted.]
FIGURE 2.9 Failures per week for data set WD4. [Histogram omitted.]
FIGURE 2.10 Failures per week for data set WD5. [Histogram omitted.]
TABLE 2.1 CPMLE Parameter Estimation Results for Data Set WD1 (Weeks)

Time (%)   X_past   Y_past   t_past   t_rem   1 − r     β         X_tot   Rel. Error (%)
10         13       4        6        54      0.30769   0.66667   130.0   −0.76
15         22       5        9        51      0.22727   0.55556   146.7   11.98
20         25       7        12       48      0.28000   0.58333   125.0   −4.58
25         43       10       15       45      0.23256   0.66667   172     31.29
30         49       11       18       42      0.22449   0.61111   163.3   24.65
38.3       51       12       23       37      0.23529   0.52174   133     1.52
43.3       52       13       26       34      0.25000   0.50000   120     −8.39
45         55       14       27       33      0.25455   0.51852   122.2   −6.72
50         74       17       30       30      0.22973   0.56667   148     12.98
55         81       20       33       27      0.24691   0.60606   147.3   12.44
60         85       23       36       24      0.27059   0.63889   141.7   8.17
65         89       24       39       21      0.26966   0.61538   136.9   4.50
70         92       26       42       18      0.28261   0.61905   131.4   0.31
75         104      28       45       15      0.26923   0.62222   138.7   5.87
80         119      31       48       12      0.26050   0.64583   148.8   13.59
86.7       126      34       52       8       0.26984   0.65385   145.4   10.99
90         128      36       54       6       0.28125   0.66667   142.2   8.55
98.3       130      38       59       1       0.29231   0.64407   132.2   0.929
100        131      39       60       0       0.29771   0.65000   131     0.00

Following from equation (7), for estimating the number of failures remaining as a function of the remaining time, the nonlinear vector equation takes the following form, where ε represents an error vector:

X_rem = X_tot − X_past = β t_rem / (1 − r) + ε   (9)

X_tot = X_past + β t_rem / (1 − r) + ε   (10)
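In place of the SPSS session of Figures 2.11 to 2.14, the CPNLR fit can be sketched with SciPy, whose curve_fit defaults to the Levenberg–Marquardt algorithm. One caveat in this illustrative reading of equation (10): in the linearized form X_past = X_tot − [β/(1 − r)] t_rem + ε, the parameters β and r enter only through their ratio, so the sketch fits X_tot and the slope β/(1 − r) rather than β and r separately:

```python
import numpy as np
from scipy.optimize import curve_fit

# WD1 checkpoints up to the 50% mark (from Table 2.2): weeks and cumulative failures.
t_past = np.array([6, 9, 12, 15, 18, 23, 26, 27, 30], dtype=float)
x_past = np.array([13, 22, 25, 43, 49, 51, 52, 55, 74], dtype=float)
t_rem = 60.0 - t_past              # remaining weeks out of the 60-week horizon

def model(t_rem, x_tot, slope):
    # Eq. (10) rearranged: X_past = X_tot - [beta/(1-r)] * t_rem + error
    return x_tot - slope * t_rem

popt, _ = curve_fit(model, t_rem, x_past, p0=[100.0, 1.0])   # L-M by default
x_tot_hat, slope_hat = popt
print(f"X_tot ~ {x_tot_hat:.1f}, beta/(1-r) ~ {slope_hat:.2f}")
# ~131.9 and ~2.15, close to the 132.2 reported at the 50% checkpoint in Table 2.2
```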
TABLE 2.2 CPNLR Parameter Estimation Results for Data Set WD1 (Weeks)

Time (%)   X_past   Y_past   t_past   t_rem   1 − r     β         X_tot   Rel. Error (%)
10         13       4        6        54      0.03880   0.08408   131.7   0.53
15         22       5        9        51      0.03880   0.08408   131.7   0.53
20         25       7        12       48      0.00419   0.00796   117.0   −10.69
25         43       10       15       45      0.00236   0.00630   158.4   20.92
30         49       11       18       42      0.00010   0.00028   165.0   25.95
38.3       51       12       23       37      0.00028   0.00067   144.6   10.38
43.3       52       13       26       34      0.03050   0.06416   129.7   −0.99
45         55       14       27       33      0.11346   0.22707   124.6   −4.88
50         74       17       30       30      0.00005   0.00010   132.2   0.92
55         81       20       33       27      0.08637   0.19479   137.0   4.50
60         85       23       36       24      0.00413   0.00943   138.2   5.49
65         89       24       39       21      0.04978   0.11287   137.6   4.96
70         92       26       42       18      0.15697   0.34892   135.6   3.51
75         104      28       45       15      0.04810   0.10763   136.2   3.97
80         119      31       48       12      0.00271   0.00626   139.2   6.26
86.7       126      34       52       8       0.00131   0.00307   140.7   7.32
90         128      36       54       6       0.00852   0.02003   141.0   7.64
98.3       130      38       59       1       0.00001   0.00003   139.1   6.18
100        131      39       60       0       0.00282   0.00640   137.7   5.11

FIGURE 2.11 Input data and nonlinear regression windows for WD1. [SPSS screenshot omitted.]
The Levenberg–Marquardt (L-M) algorithm is employed to solve for the unknown parameter vector (β, r) in this nonlinear regression equation by means of least-squares estimation [22–25]. The method developed by Levenberg and Marquardt appears to enlarge considerably the number of practical problems that can be tackled by nonlinear estimation. The L-M method is one that appears to work well in many circumstances and is thus a sensible practical choice. However, no method can be called "best" for all nonlinear problems [22,23]. The Jacobian matrix, J(u), is an important step within nonlinear regression calculations, and if it is rank deficient or nearly so in Gauss–Newton iterations, this may admit multiple solutions. Therefore, the L-M modification transforms J(u) to a better-conditioned full-rank matrix. The L-M method is the technique used most often for nonlinear least-squares estimation problems [24–26]. The modified Levenberg–Marquardt algorithm is used, and when the iteration terminates upon reaching convergence, the parameter estimates and standard errors are listed. Table 2.2 summarizes the results for data set WD1 by CPNLR using L-M. The results are plotted along with the CPMLE results in Figure 2.1(b) and (c). Similarly, the CPNLR results for WD2 to WD5, along with the CPMLE results, are plotted in Figures 2.2 to 2.5(b) and (c).

FIGURE 2.12 Input data and nonlinear equations windows for WD1. [SPSS screenshot omitted.]

2.1.5 Calculation of Forecast Quality and Comparison of Methods

The frequency histograms shown in Figures 2.6 to 2.10 explore weekly data sets WD1 to WD5, illustrated in Figures 2.1(a) to 2.5(a).
FIGURE 2.13 Output data and convergence at X_tot = 132.23; see Table 2.2 for the 50% checkpoint. [SPSS iteration log omitted: starting from X_tot = 100, β = 1, p = 1, the L-M run stopped after 32 model evaluations and 14 derivative evaluations, with the residual sum of squares decreasing from 4428.0 to 265.05 and X_tot converging to 132.2335; iterations stop when the relative reduction between successive residual sums of squares is at most SSCON = 1.000E−08.]
The clump size distributions clearly illustrate the exponential nature of some of the data (WD1, WD2, WD4, and WD5) and the nonexponential nature of the rest (WD3). This is in line with the recommendation put forth by the working group in the Software Reliability Handbook [27]: that the starting point for exploring a software data set is to use the generalized exponential model for exponential classes and the M-O logarithmic Poisson model for nonexponential classes. Consequently, one should prefer to use the M-O model for failure data sets that have a nonexponential clump-size frequency distribution. On the other hand, CP, as a generalized Poisson model, is to be preferred for data sets that have an exponential clump-size frequency distribution. In order to use the algorithms for the CPMLE and CPNLR methods of parameter estimation, the entire study period of calendar weeks (for data sets WD1 to WD5) was divided into 5% intervals, since the exact denominations
FIGURE 2.14 ANOVA table for the nonlinear regression; r² = 0.93, and the 95% confidence interval for X_tot is (108.63, 155.83). [SPSS output omitted.]
are not always physically feasible. Based on the software failure data studied, predictions were made of the time required using the MLE and NLR methods. The MLE algorithm is carried out simply by updating the accumulated data and applying the simple algebraic equations (6) and (7). The NLR algorithm is executed by applying equations (8) to (10) using an SPSS program [24–26] (see Figures 2.11 to 2.14). Finally, the corresponding total failure count predictions and relative error percentages are plotted in Figures 2.1(b) and (c), 2.2(b) and (c), 2.3(b) and (c), 2.4(b) and (c), and 2.5(b) and (c), respectively, for WD1 to WD5. In the CP process, the underlying failure process was assumed to be Poisson, whereas a geometrically distributed number of failures may be observed upon each failure arrival in the clump. In this research, however, the Musa–Okumoto logarithmic Poisson execution time model predictions using geometric-family maximum likelihood parameter estimation are compared with those of the proposed CPMLE and CPNLR in the event of grouped or clustered failure data [11]. The comparisons are made in terms of X_tot, the total predicted number of software failures by the end of the study periods for the five weekly failure data sets WD1 to WD5. The totals and relative errors estimated are plotted versus the normalized percentage of time. As observed in Figures 2.1(b) and (c) to 2.5(b)
and (c) and Table 2.3, the methods proposed, CPNLR and CPMLE, generally introduce more advantages by recognizing the software failure clumping or clustering effect.

The average absolute relative error is the mean of the absolute relative errors calculated at each checkpoint:

ARE = (1/n) \sum_{i=1}^{n} |(estimate_i − true) / true|   (11)

Another compact measure for assessing the prediction is the mean-squared error, the average of the squared deviations. Recall that for K-S statistics, the proposed estimation model's G_n(x) is compared for goodness of fit to the empirical distribution G(x), where D_n = sup |G_n(x) − G(x)| for H_0: G_n(x) = G(x), over i = 1, ..., n retrospective prediction points. Hence, the nonparametric K-S test statistics at each epoch of estimation can be summed and divided by n to give

K-S average D_n = (1/n) \sum_{i=1}^{n} ( sup |G(x) − F(x)| )   (12)

However, a major weakness in the K-S approach is that it is possible for a model to predict poorly over the entire data set and yet have a small K-S distance [18]. K-S statistics are popularly accepted nonparametric tests of goodness of fit when no better measures exist. A final comparative tabulation of results is given in Table 2.3 to emphasize the merits of competing methods [28]. Comparisons between different measurements for different estimation techniques are based on average relative error [4,20], mean-squared error (MSE) [29,30], and K-S average D_n [18,31] as measures of forecast quality. For a numerical illustration concerning data set WD1, consider the CPNLR_ERR curve in Figure 2.1(c), showing the percentage relative error versus time for the 19 observations of CPNLR tabulated in Table 2.2. The absolute values of these 19 recordings are added, as (0.53 + 0.53 + 10.69 + ··· + 6.18 + 5.11), and divided by 19 to find an average relative error percentage of 6.89%, as given in Table 2.3.
TABLE 2.3 ARE, MSE, and K-S Average D_n for the Estimation Methods Selected

            M-O                        CPNLR                      CPMLE
       ARE      MSE     Av. D_n   ARE      MSE     Av. D_n   ARE      MSE     Av. D_n
WD1    0.3201   3247    0.428     0.0689   160     0.128     0.0885   257     0.128
WD2    0.5386   18902   0.792     0.3164   6465    0.209     0.3183   7389    0.209
WD3    0.2127   13289   0.337     0.4543   31869   0.151     0.3265   16906   0.151
WD4    0.4489   11906   0.737     0.1439   1795    0.195     0.1623   2501    0.195
WD5    0.2759   27245   0.275     0.1599   6200    0.097     0.0675   1813    0.097
Again consider Figure 2.1(c) and Table 2.2 and calculate the MSE for CPNLR using WD1: the squares of the differences between the estimated and true values are added in the form [(131 − 131.67)² + (131 − 131.67)² + (117 − 131)² + ··· + (137.68 − 131)²] and then divided by n − 1 = 19 − 1 = 18 to find MSE = 160, as given in Table 2.3:

MSE = (1/(n − 1)) \sum_{i=1}^{n} (true − estimate_i)²   (13)
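The worked WD1/CPNLR numbers above can be reproduced in a few lines from the X_tot column of Table 2.2 (an illustrative sketch; the K-S average D_n needs the full empirical distributions and is omitted):

```python
x_tot_cpnlr = [131.7, 131.7, 117.0, 158.4, 165.0, 144.6, 129.7, 124.6, 132.2,
               137.0, 138.2, 137.6, 135.6, 136.2, 139.2, 140.7, 141.0, 139.1, 137.7]
true = 131.0
n = len(x_tot_cpnlr)

are = sum(abs(x - true) / true for x in x_tot_cpnlr) / n       # eq. (11)
mse = sum((true - x) ** 2 for x in x_tot_cpnlr) / (n - 1)      # eq. (13)
print(f"ARE = {are:.4f}, MSE = {mse:.0f}")   # ~0.069 and ~160, matching Table 2.3
```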
For the K-S test, the usual approach of finding the maximum deviation D_i, i = 1, ..., n, between the empirical (cumulative) and true distributions is used for each of the n = 19 observations. These D_i are added and divided by n = 19 to estimate the averaged K-S statistic, K-S Av. D_n = 0.128, for CPNLR as given in Table 2.3. The smaller the Av. D_n, the better the predictor. Table 2.3 indicates that with the CPNLR method, the ARE and MSE results are more favorable for WD1, WD2, and WD4. For WD5, the CPMLE results are best in terms of ARE and MSE. For WD3, however, M-O performs better. The K-S statistics (i.e., average D_n) produced identical results for both the CPMLE and CPNLR parameter estimation methods, due to their inherently close performance. Unlike ARE and MSE, the K-S statistic is not sensitive enough to detect differences between these two parameter estimation techniques (CPMLE and CPNLR), but it is sensitive enough to tell the difference between the CP and M-O methods of parameter estimation. However, the nonparametric K-S statistics are the least favorable for all the data sets WD1 to WD5 for the competing M-O model. Tables 2.4 to 2.6 display WD1, WD3, and WD5.
TABLE 2.4 Grouped Failure Weekly Data Set: WD1^a

Weeks 1–20:  5, 2, 2, 0, 0, 4, 0, 0, 9, 2, 0, 1, 8, 7, 3, 0, 0, 6, 0, 0
Weeks 21–40: 0, 0, 2, 0, 0, 1, 3, 9, 6, 4, 1, 5, 1, 1, 1, 2, 0, 0, 4, 0
Weeks 41–60: 1, 2, 6, 0, 6, 11, 3, 1, 2, 3, 0, 2, 1, 1, 0, 1, 0, 0, 1, 1

^a Number of failures per week; n = 60, sample mean = 2.18, sample standard deviation = 2.71, sample variance = 7.34, sum = 131.
TABLE 2.5 Grouped Failure Weekly Data Set: WD3^a

Weeks 1–14:  4, 12, 15, 9, 28, 19, 8, 7, 4, 8, 9, 12, 8, 4
Weeks 15–28: 14, 19, 23, 12, 22, 12, 13, 19, 10, 5, 5, 5, 7, 7
Weeks 29–41: 1, 3, 1, 2, 0, 1, 9, 1, 0, 0, 0, 1, 1

^a Number of failures per week; n = 41, sample mean = 8.29, sample standard deviation = 7.16, sample variance = 51.26, sum = 340.
2.1.6 Discussion and Conclusions

In terms of ARE, MSE, and K-S Av. D_n, the newly proposed parameter estimation methods—compound Poisson nonlinear regression (CPNLR) and compound Poisson maximum likelihood estimation (CPMLE)—are generally superior to the Musa–Okumoto logarithmic Poisson model in predicting the outcomes for the grouped or clustered software failure data sets described in this chapter. The basic reason is that the CP models proposed evaluate clumping or clustering effects of software failures not evaluated by the usual Poisson or logarithmic Poisson processes for certain data structures. For weekly recorded software failure data sets WD1 to WD5, this phenomenon occurred in four out of five cases: for WD1, WD2, and WD4, CPNLR performed better, with CPMLE a better choice for WD5. The M-O Poisson interval estimation model did better only for WD3. This result was not a surprise, because an exploratory data analysis showed that WD3
TABLE 2.6 Grouped Failure Weekly Data Set: WD5^a

Weeks 1–17:  3, 1, 4, 4, 11, 11, 18, 8, 6, 10, 16, 6, 13, 16, 3, 4, 2
Weeks 18–34: 11, 8, 12, 2, 6, 4, 5, 5, 7, 5, 7, 1, 11, 4, 4, 20, 9
Weeks 35–50: 10, 4, 5, 7, 9, 7, 6, 4, 11, 3, 7, 9, 8, 5, 7, 7

^a Number of failures per week; n = 50, sample mean = 7.32, sample standard deviation = 4.25, sample variance = 18.1, sum = 366.
(see Figure 2.8) has a nonexponential or quasiuniform frequency distribution for the random variable "clump size" per week as a result of the logarithmic effect. On the other hand, the frequency plots for WD1, WD2, WD4, and WD5 (see Figures 2.6, 2.7, 2.9, and 2.10) displayed an exponential nature. As outlined in the Software Reliability Handbook [27], the M-O logarithmic Poisson model is a better fit for a nonexponential class of failure counts, whereas for the exponential class, generalized exponential models, such as the generalized Poisson in this work, are sound. One way to see this is that the natural logarithm of the Poisson equation yields a constant value. No single summary measure alone is adequate to determine the best parameter estimation method for a given data set. Sometimes an estimation method can generate more favorable predictive results at, or after, a certain percentage of the normalized time, as observed in a plot. However, the three performance measures given in equations (11) to (13) and cited in the literature serve to summarize the situation compactly for the entire range covering all prediction points [4,11,20,31]. As a matter of fact, one would prefer to see gradually smaller predictive error percentages, because the further advanced the testing process is, the more expensive it becomes. Recall that of the two prospective (forward) performance measures, ARE recognizes the absolute value penalty,
whereas MSE penalizes the deviations more severely by squaring them. Consequently, the K-S statistics are different in the retrospective (backward) sense, where the actual empirical distribution is compared to the cumulative distribution function estimated. This suggests that one should consider at least a pair or a triplet of measures of goodness of fit—ARE, MSE, and K-S Av. D_n—instead of a single measure. Various diagnostic analyses in terms of plots are given in Figures 2.6 to 2.10 in accordance with the fundamental exploratory approach of "first model validation, then parameter estimation," rather than simply the approach of nonexploratory data-oriented work [32].

In conclusion, the CP parameter estimation methods proposed present a valid alternative technique with favorable prediction accuracy for estimating the total number of residual failures. Predictions and corresponding relative-error percentages can be observed in Figures 2.1 to 2.5 for the weekly recorded failure data sets WD1 to WD5. Table 2.3 provides an overall comparison of the estimation results due to different models and their parameter estimation techniques in terms of popular measures of forecast quality of fit. It has been observed that the average relative error and mean-squared error outputs for the CPNLR and CPMLE methods are more favorable than those of the competing M-O method. This is also validated by the trends in parts (b) and (c) of Figures 2.1 to 2.5, where the CP (MLE and NLR) methods produce predictions and related errors closer to the target (or true) count and the zero percent relative error line, respectively, except for Figure 2.3(b) and (c), where the predictions of the M-O model are clearly closer. Another striking difference in these graphs is that the M-O model [7] mostly underestimates the target value, whereas the CP models alternate in over- or underestimating the target. The CPMLE method is very straightforward to apply, with negligible algebra involved, whereas CPNLR may occasionally require extensive computational time to reach convergence through the nonlinear iterative process involved in the L-M method. It is also more secure to use NLR estimates, due to the nonlinear and especially nonnormal nature of the expressions observed in equations (8) to (10). However, first-guess, easy-to-obtain MLE results are also useful for obtaining an initial value during the L-M nonlinear solution process. Despite all the advantages of the nonlinear regression method, MLE—due to its generally accepted statistical advantages of admissibility, consistency, and asymptotic normality—can still provide more favorable results, as in WD5.

For further research to follow up on deterministic measures such as ARE and MSE, stochastic performance measures in the form of statistical tests are sought to assess the predictive quality of parameter estimation methods. This topic is dealt with in Section 2.2. Additionally, similar to MSE in equation (13), the quantity \sum_{i=1}^{n} [(estimate_i − true)²/true], over n inspection points, can be approximated by the popular chi-squared distribution with n − 1 degrees of freedom, χ²_{n−1}, and tested using the chi-square statistical tables, given that the "true > 10" assumption holds. Thus, if closed-form probability distribution functions can be derived for what used to be deterministic measures such as ARE and MSE, any two competing parameter estimation methods would be fully comparable in terms of statistical
Further, Bayesian methods using informative and noninformative priors are studied in Section 2.2 to assess the probability that one method's predictive accuracy scoring is better than its alternative's. Finally, it should be remembered that not all software reliability models and related parameter estimation methods are best for all software failure data types. Exploratory data and goodness-of-fit analyses are necessary to judge the behavior of the software failure data before deciding on the type of software reliability prediction model to be used.

2.2 STOCHASTIC MEASURES TO COMPARE FAILURE-COUNT RELIABILITY MODELS

Nutshell 2.2 Absolute RE (relative error) and SqRE (squared relative error) are random variables suggested as measurements of the forecast accuracy for the total number of software failures estimated at the end of a mission time. The purpose is to compare the predictive merit of competing software reliability models, an important concern to software reliability analysts. This technique calculates the Bayes probability of how much better the prediction accuracy of one method is relative to that of a competitor. The Bayesian approach is more realistic in an assessment of predictive merit than (1) comparing merely the average values of ARE and SqRE, as done conventionally, or (2) conducting statistical hypothesis tests of pairwise means of ARE and SqRE, an approach somewhat more sensible than (1) because it incorporates the variability of predicted values, which (1) does not. To implement the Bayesian technique, noninformative or flat priors (across the board) are used first, then informative (specified) priors.

2.2.1 Introduction and Motivation

This chapter is related to the general problem of ranking the usual means discussed in the literature by Berger and Deely in 1988, and is an improved extension of the 2001 publication by Sahinoglu et al. focusing on statistical measures for comparing the predictive merits of software reliability models [33,34]. There is increasing pressure to develop and quantify measures of computer software reliability. With the ascent of software reliability models [6,35], there is now even more pressure to assess the predictive quality of these measures, both in the sense of their goodness of fit and in pairwise comparisons [11-13,18-20,36]. However, current methods used to compare these models use constant measures, and hence their results do not reflect the uncertainty inherent in these observations. In particular, the predictive accuracy of various methods is compared through measures such as average absolute relative error (ARE) and mean-squared error (MSE), as in Section 2.1, both of which are constant measures and thus do not consider the effect of stochastic (random) variability. Earlier, the author suggested designing and analyzing more precise methods for choosing the best predictive procedure through frequentist methods such as one- and
two-sample t-tests of hypotheses, which do consider this inherent variability [36,37]. In this section we propose studying several data-supported Bayesian methods of comparative assessment which acknowledge the stochastic variation in the observed sequence of failure data. In addition to assessing the quality of fit of an individual estimation technique, such as through ARE, it was desirable to obtain comparisons between competing techniques by two-sample t-tests and, further, by calculating the probability of one method scoring better than another [34]. Such research was necessary in order to choose between the many new and old reliability models [5, Chap. 13]. Pairs of certain reliability models' predictive accuracy have already been compared employing statistical hypothesis tests in the frequentist sense. It was observed that a constant difference between the means of the random variable ARE of any two methods did not necessarily prove statistically significant as to which of two competing estimation procedures was better. An alternative way of measurement, through a more severe squared penalty reflected in the random variable SqRE, is also considered. In this chapter we bring a new dimension to a comparative assessment of the predictive accuracy of two competing failure-count methods. In developing Bayesian methods, an innovative approach is proposed, not only to allow for determining which method is better, but additionally, to describe quantitatively how much better one is than the other. This is done by experimenting with prior noninformative distributions for the unknown parameters in the light of a priori software engineering field experience. Informative prior analyses are left out of the context of this chapter, due to their complexity, except for summary charts. Results show the trend in comparisons beginning from a purely arithmetic approach by comparing absolute values, to those using statistical t-tests, and finally, to a probability-based Bayesian approach using (non)informative priors.

2.2.2 Definitions and Notation

Let $y_1, \ldots, y_n$ denote the true failures observed over n time intervals, called checkpoints. For some given estimation procedure, let $X_{est}(k)$ denote the estimate of the total number of software failures to be observed over the n time intervals. The true number of failures over the n intervals is given by $X_{true} = \sum_{k=1}^{n} y_k$. Define the random variables |RE| (absolute relative error) and SqRE (squared relative error) as follows [11,12,36,37]:

$|RE|(k) = \dfrac{|X_{est}(k) - X_{true}|}{X_{true}}$   (14)

$SqRE(k) = \dfrac{[X_{est}(k) - X_{true}]^2}{X_{true}}$   (15)
Then, the popularly used ARE is the arithmetic average of | RE |, as in (11). AvSqRE is the arithmetic average of SqRE over n checkpoints. To summarize and review:
|RE|       absolute relative error
k          checkpoint between 1 and n
y_k        observations (true failures at checkpoint k)
X_est(k)   forecast value of the number of software failures, estimated at time point k, 1 < k < n
X_true     true number of software failures over k = 1, ..., n; $X_{true} = \sum_{k=1}^{n} y_k$
X_j        error random variable, j = 1, 2, for the two competing methods to compare
ARE        arithmetic average of |RE| of sample observations
SqRE       squared |RE| of sample observations
AvSqRE     or SRE, arithmetic average of SqRE
CPMLE      compound Poisson MLE method of estimation
CPNLR      compound Poisson nonlinear regression method of estimation
MO         Musa-Okumoto logarithmic Poisson method of estimation
WD         weekly data
X          ARE random variable for CPNLR
Y          ARE random variable for CPMLE
Z          ARE random variable for MO
U          SRE random variable for CPNLR
V          SRE random variable for CPMLE
W          SRE random variable for MO
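As a quick illustration of equations (14) and (15), the following sketch computes |RE|(k), SqRE(k), and their averages for a hypothetical weekly failure series; the counts are placeholders, not one of the WD data sets:

```python
# Minimal sketch of equations (14) and (15); y and x_est are hypothetical.
y = [3, 5, 2, 7, 4, 6, 1, 3]                 # y_k: failures observed per week
x_true = sum(y)                              # X_true = sum of y_k (here 31)
x_est = [28, 30, 33, 31, 30, 31, 31, 31]     # X_est(k) at each checkpoint

abs_re = [abs(xe - x_true) / x_true for xe in x_est]    # |RE|(k), eq. (14)
sq_re  = [(xe - x_true) ** 2 / x_true for xe in x_est]  # SqRE(k), eq. (15)

ARE    = sum(abs_re) / len(abs_re)   # arithmetic average of |RE|
AvSqRE = sum(sq_re) / len(sq_re)     # arithmetic average of SqRE (SRE)
print(f"ARE = {ARE:.4f}, AvSqRE = {AvSqRE:.4f}")
```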
2.2.3 Model, Data, and Computational Formulas

We are interested in comparing one method of predicting software reliability against another method, based on the data observed and on predictions obtained from these methods. The five time-based simulated weekly-data sets were described briefly in Section 2.1.3; the t-test results for them appear in Tables 2.7 and 2.8.

TABLE 2.7 t-Tests with Decision, Arithmetic Difference, and Probabilistic Results for Comparing the Means of |RE|^a

Data Set   CPMLE versus CPNLR                      MO versus CPNLR                         MO versus CPMLE
WD1        t = 0.97 {0.024}    Accept H0 [0.7790]  t = 3.89 {0.264}     Reject H0 [0.9999]  t = 3.46 {0.237}     Reject H0 [0.9993]
WD2        t = 0.043 {0.003}   Accept H0 [0.5139]  t = 2.75 {0.236}     Reject H0 [0.9913]  t = 2.54 {0.233}     Reject H0 [0.9853]
WD3        t = -1.3 {-0.06}    Accept H0 [0.8523]  t = -2.74 {-0.180}   Reject H0 [0.9912]  t = -1.63 {-0.120}   Accept H0 [0.9099]
WD4        t = 0.33 {0.0197}   Accept H0 [0.6055]  t = 4.09 {0.322}     Reject H0 [0.9999]  t = 3.64 {0.303}     Reject H0 [0.9996]
WD5        t = 13.77 {0.544}   Reject H0 [0.9820]  t = 0.35 {0.123}     Accept H0 [0.8689]  t = -0.33 {-0.421}   Reject H0 [0.9858]

^a {.} are simple arithmetic differences of the respective means (ARE) with n = 18, and [.] are Bayesian probabilities that one method scores worse (+ difference) or better (- difference). "Accept" denotes "do not reject equality of means" (i.e., H01: μ_AREi = μ_AREj, i ≠ j).
TABLE 2.8 t-Tests with Decision, Arithmetic Difference, and Probabilistic Results for Comparing the Means of SqRE^a

Data Set   CPMLE versus CPNLR                       MO versus CPNLR                        MO versus CPMLE
WD1        t = 0.8 {0.76}       Accept H0 [0.7384]  t = 3.51 {23.6}     Reject H0 [0.9993]  t = 3.39 {0.237}     Reject H0 [0.9988]
WD2        t = 0.43 {4.35}      Accept H0 [0.6321]  t = 5.73 {58.4}     Reject H0 [0.9991]  t = 4.59 {54.05}     Reject H0 [0.9996]
WD3        t = -0.71 {-8.87}    Accept H0 [0.7151]  t = -1.03 {-19.5}   Accept H0 [0.7956]  t = -0.51 {-10.63}   Accept H0 [0.6563]
WD4        t = 0.56 {3.59}      Accept H0 [0.6693]  t = 3.53 {51.33}    Reject H0 [0.9992]  t = 3.28 {47.74}     Reject H0 [0.9973]
WD5        t = -2.19 {-11.98}   Reject H0 [0.9676]  t = 2.27 {57.47}    Reject H0 [0.9728]  t = 2.77 {69.49}     Reject H0 [0.9919]

^a {.} are simple arithmetic differences of the respective means (SRE) with n = 18, and [.] are Bayesian probabilities that one method scores worse (+ difference) or better (- difference). "Accept" denotes "do not reject equality of means" (i.e., H02: μ_SREi = μ_SREj, i ≠ j).
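A sketch of the two-sample t statistic behind Tables 2.7 and 2.8, in the large-sample form t = (mean1 - mean2)/sqrt(s1^2/n1 + s2^2/n2); the |RE| series here are hypothetical stand-ins rather than the actual WD checkpoint values:

```python
# Sketch of the two-sample t-test used for Tables 2.7 and 2.8 (n = 18 per
# method in the book); the series below are hypothetical illustrations.
import math

def two_sample_t(x, y):
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    v1 = sum((xi - m1) ** 2 for xi in x) / (n1 - 1)   # sample variances
    v2 = sum((yi - m2) ** 2 for yi in y) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

are_cpmle = [0.10, 0.12, 0.09, 0.11, 0.08, 0.10]      # hypothetical |RE| values
are_cpnlr = [0.07, 0.08, 0.06, 0.09, 0.07, 0.08]
print(f"t = {two_sample_t(are_cpmle, are_cpnlr):.2f}")  # compare with t_crit at alpha = 0.05
```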
WD1 has 131 failures in 64 weeks; WD2 has 213 failures in 224 weeks; WD3 has 340 failures in 41 weeks; WD4 has 197 failures in 114 weeks; and WD5 has 366 failures in 50 weeks. The ARE and SRE of WD1 are shown in Figures 2.15 and 2.16. Detailed explanations of the competing methods CPMLE, CPNLR, and MO are given in references 5 and 10 to 12. In a frequentist treatment of the problem, examined earlier by establishing hypothesis tests, it was shown that a difference in the mean values of ARE or SqRE may not necessarily be statistically significant. In this study, we search for the probability that one method's error mean is higher (worse) or lower (better) than another's by a specific margin, in proportion to the difference between them. We approach this problem computationally, using Bayesian noninformative prior distributions and their posterior distributions [38-44]. Note that t-values are calculated for testing H01: μ_ARE1 = μ_ARE2
FIGURE 2.15 Absolute relative error versus proportion sampled (curves for CPNLR, CPMLE, and MO), obtained from Figure 2.1 for WD1.
FIGURE 2.16 AvSqRE (or SRE) versus proportion sampled (curves for CPNLR, CPMLE, and MO), obtained from Figure 2.1 for WD1.
and H02: μ_SRE1 = μ_SRE2, with the decision on their right as to whether to accept or reject at α (type I error probability) = 0.05, as in Table 2.7. The Bayesian probabilities are studied in detail in Section 2.2.4. From Tables 2.7 and 2.8 we can conclude that the t-test results with "Accept H0" show that the ARE or SqRE predictive error indices for the competing methods are not significantly different at a 0.05 type I error probability level. For the MO versus CPNLR hypothesis decisions on WD3 and WD5, the ARE and AvSqRE (SRE) tables do not concur; we may side with Table 2.8, which supports the squared-penalty definition, to be on the conservative side. With a method and its prediction we associate an "error" random variable, denoted by Xj, where j = 1, 2 for the two methods being compared. In this work, the error random variables will be the means ARE1 and ARE2 (or SqRE1 and SqRE2) as defined. We assume the following:

1. Xj is normally distributed with unknown mean μj and standard deviation σj, which will be taken as known.
2. The sample size n is large enough to facilitate a large-sample approach to the problem, so that normal theory can be utilized.
3. Even though μ1 and μ2 are unknown, method 1 is better than method 2 if μ1 < μ2 in probability.
4. The quantitative measure of how much better method 1 is than method 2 is obtained by assessing the difference μ1 - μ2. This difference is unknown and can only be estimated, but the Bayesian model here produces a probability assessment of the magnitude of this difference.
5. In particular, as a comparison criterion, we compute the posterior probability that μ1 is smaller than μ2 by a tolerance b; that is, we compute the quantity
P = P(μ1 ≤ μ2 - b | X1, X2),   b ≥ 0   (16)
where b = γ [greater mean of ARE (or SRE) of X2 - smaller mean of ARE (or SRE) of X1]. A casual perusal of the comparison criterion in equation (16) should indicate why it can be used to make realistic and quantitative comparisons between any two methods being studied. It should also be pointed out here that if comparisons among a group of three or more methods were desired, equation (16) could be suitably modified to give the desired comparison; that is, the posterior probability that any one of several methods is sufficiently smaller than all of the others could be computed. These details are discussed extensively for the general problem of ranking normal means; here we restrict the problem to comparing only two methods at a time [33]. We now introduce the Bayesian model with the relevant formulas necessary to compute the criterion function in equation (16).

2.2.4 Prior Distribution Approach

For development of the prior distribution on μ1 and μ2, we use a hierarchical Bayesian model which assumes, a priori, that the unknown means are exchangeable. This has the desirable property that knowledge of one mean gives some information about the other. For a fuller discussion of this general model, see Berger [38] and Sahinoglu et al. [34]. Let μ1 and μ2 have a normal distribution, say π(μ1, μ2 | β, τ²), with mean β and variance τ², where the hyperparameters β and τ² have hyperprior distributions denoted by h1 and h2, respectively. Thus, the prior distribution on μ1 and μ2 is given by the mixture
$\pi(\mu_1, \mu_2) = \iint \pi(\mu_1, \mu_2 \mid \beta, \tau^2)\, h_1(\beta)\, h_2(\tau^2)\, d\beta\, d\tau^2$   (17)
Choices for h1 and h2 depend on the type of prior information available in the problem. It is also true that a choice for π other than normal may be indicated by the prior information. Even so, when π is normal, the closed form of this prior is not available, but this will not be necessary to compute the value of the criterion function in equation (16); rather, only the conditional distributions will be used, as we show next. We can now derive the computational formulas required to obtain the value for P given in the criterion function. Using the conditional probability rules for densities, we can write

$P = P(\mu_1 \le \mu_2 - b \mid X_1, X_2) = \iint P(\mu_1 \le \mu_2 - b \mid X_1, X_2, \beta, \tau^2)\, h_1(\beta \mid X_1, X_2, \tau^2)\, h_2(\tau^2 \mid X_1, X_2)\, d\beta\, d\tau^2$   (18)
and then note that the conditional distribution of μ1 - μ2 is normal, with mean and variance given by

$m = \left( \dfrac{\sigma_1^2}{\sigma_1^2 + \tau^2} - \dfrac{\sigma_2^2}{\sigma_2^2 + \tau^2} \right)\beta + \tau^2 \left( \dfrac{X_1}{\sigma_1^2 + \tau^2} - \dfrac{X_2}{\sigma_2^2 + \tau^2} \right)$   (19)

$\mathrm{var} = \tau^2 \left( \dfrac{\sigma_1^2}{\sigma_1^2 + \tau^2} + \dfrac{\sigma_2^2}{\sigma_2^2 + \tau^2} \right)$   (20)

Thus, we can write the first term in the integral above as

$P(\mu_1 \le \mu_2 - b \mid X_1, X_2, \beta, \tau^2) = \Phi\left( \dfrac{-b - m}{\sqrt{\mathrm{var}}} \right)$   (21)
where b = γ (greater mean of X2 - smaller mean of X1) and γ ≥ 0 is given, as in equation (16). Equation (21) allows numerical calculation of P quite easily for various choices of h1 and h2, but this would not be true if π(μ1, μ2 | β, τ²) were not chosen as normal. Even so, in that case the Monte Carlo evaluation of P is straightforward. We now give details for the noninformative case; the informative case is beyond the scope of this book, although it will be mentioned for comparison. For this case we assume that only vague opinions of the values for μ1 and μ2 are available. This knowledge is reflected by taking h1 as a normal distribution whose variance approaches infinity, and h2 as

$h_2(\tau^2) = (\sigma_1^2 + \sigma_2^2 + 2\tau^2)^{-1}$   (22)
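Before turning to the noninformative details, note that equations (19) to (21) can be evaluated directly for given hyperparameter values. In the sketch below, the X and σ inputs echo the WD1 ARE summary values of Table 2.9 (the σ's in the model are the standard errors of the method means), while β and τ² are illustrative assumptions:

```python
# Sketch of equations (19)-(21): P(mu1 <= mu2 - b | X1, X2, beta, tau^2).
# beta and tausq below are hypothetical hyperparameter values.
import math

def norm_cdf(z):                      # standard normal c.d.f., Phi
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def cond_prob(x1, x2, s1sq, s2sq, beta, tausq, b):
    w1, w2 = s1sq + tausq, s2sq + tausq
    m = beta * (s1sq / w1 - s2sq / w2) + tausq * (x1 / w1 - x2 / w2)  # eq. (19)
    var = tausq * (s1sq / w1 + s2sq / w2)                             # eq. (20)
    return norm_cdf((-b - m) / math.sqrt(var))                        # eq. (21)

# WD1 ARE means and standard errors for CPNLR (method 1) and CPMLE (method 2)
print(cond_prob(x1=0.070, x2=0.094, s1sq=0.016**2, s2sq=0.019**2,
                beta=0.08, tausq=0.05, b=0.0))
```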
A truly noninformative case for this variance-type random variable τ² would have been the improper choice 1/τ². However, this does not lead to a proper posterior, and hence the foregoing choice was used. It is also the case that we can approach the same situation by taking the limit of uniform distributions on larger and larger intervals for τ². The details of the choices above can be found in the work of Berger and Deely [33] and Sahinoglu et al. [34]. Given this model, the following formula can be derived:

$P = \int_0^{\Delta^2} \Phi\left( \sqrt{\Delta^2 - t} - \dfrac{\gamma \Delta^2}{\sqrt{\Delta^2 - t}} \right) \dfrac{e^{-t/2}\, t^{-1/2}}{f(X_1, X_2)}\, dt$   (23)

where $\int_0^{\Delta^2} e^{-t/2}\, t^{-1/2}\, dt / f(X_1, X_2)$ integrates to 1.0, $f(X_1, X_2)$ being the normalizing factor, and Φ denotes the c.d.f. of the standard normal. To solve this integral by Monte Carlo simulation, simulate

$P\left[ Z \le \sqrt{\Delta^2 - Z_1^2} - \dfrac{\gamma \Delta^2}{\sqrt{\Delta^2 - Z_1^2}} \;\cap\; Z_1^2 \le \Delta^2 \right]$   (24)

a large number N of times for a given γ > 0, where Z1 is a standard normal variable (see the book CD-ROM for applications). That is, draw a standard normal variable Z1 that satisfies Z1² ≤ Δ²; otherwise, draw another standard normal variable that does. For each of the feasible choices m = 1, ..., M that satisfy this criterion, calculate the expression

$\Phi\left( \sqrt{\Delta^2 - Z_1^2} - \dfrac{\gamma \Delta^2}{\sqrt{\Delta^2 - Z_1^2}} \right)$   (25)

that is, the quantity

$q_m = P\left[ Z \le \sqrt{\Delta^2 - Z_1^2} - \dfrac{\gamma \Delta^2}{\sqrt{\Delta^2 - Z_1^2}} \right]$   (26)

from standard normal tables. Divide the value of $\mathrm{sum} = \sum_{m=1}^{M} q_m$ by the N simulation runs; the final result becomes P = sum/N. Note that

$\Delta = \dfrac{\text{greater mean of } X_2 - \text{smaller mean of } X_1}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$   (27)

However, for noninformative or flat priors when γ = 0 (e.g., calculating for WD1), we obtain

$P = \dfrac{\Phi(\Delta) - 0.5\, e^{-0.5\Delta^2}}{2\Phi(\Delta) - 1} = \dfrac{\Phi(0.9662) - 0.5\, e^{-0.5(0.9662)^2}}{2\Phi(0.9662) - 1} = \dfrac{0.83303 - 0.31351}{1.66606 - 1} = 0.779$   (28)

as in Table 2.7, where Δ = 0.966. See the book CD-ROM and click on the "FLAT" Java program for solutions and plots.
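The "FLAT" Java program on the CD-ROM is the authoritative implementation; the following Python sketch of the Monte Carlo recipe of equations (24) to (26), together with the closed form of equation (28), reproduces P ≈ 0.779 for Δ = 0.9662 at γ = 0:

```python
# Sketch of the Monte Carlo estimator of equations (24)-(26) and the
# closed-form special case (28). Delta is the standardized difference of
# equation (27); all code structure here is an assumption, not the FLAT source.
import math, random

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_flat(delta, gamma, n_draws=200_000, seed=1):
    rng = random.Random(seed)
    d2, total = delta * delta, 0.0
    for _ in range(n_draws):
        z1 = rng.gauss(0.0, 1.0)
        while z1 * z1 > d2:                 # redraw until Z1^2 <= Delta^2
            z1 = rng.gauss(0.0, 1.0)
        root = math.sqrt(d2 - z1 * z1)
        total += norm_cdf(root - gamma * d2 / root)   # q_m, eq. (26)
    return total / n_draws                  # P = sum / N

def p_gamma0(delta):                        # closed form, eq. (28)
    return ((norm_cdf(delta) - 0.5 * math.exp(-0.5 * delta * delta))
            / (2.0 * norm_cdf(delta) - 1.0))

print(p_gamma0(0.9662))      # ~0.779, as in Table 2.7 for WD1
print(p_flat(0.9662, 0.0))   # Monte Carlo check
```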
2.2.5 Applications to Data Sets and Computations

Tables 2.9 to 2.13 cover data sets WD1 to WD5, where the γ constant is varied between 0.0 and 1.0. X denotes ARE for CPNLR, Y denotes ARE for CPMLE, and Z denotes ARE for MO. Similarly, U denotes SqRE for CPNLR, V denotes SqRE for CPMLE, and W denotes SqRE for MO. Each table contains the probability that μi > μj, where i, j = X, Y, Z for ARE and i, j = U, V, W for SqRE, for i ≠ j. The means (m1, m2), standard errors (σm1, σm2), and standard deviations (σ1, σ2) of the greater- and smaller-mean variables in each comparison (n = 18 checkpoints between the 10th and 95th percentiles, as in Tables 2.1 and 2.2) are listed in Tables 2.9 to 2.13 for the Bayesian comparative probabilities [34]:

TABLE 2.9 Bayesian Noninformative Prior Analysis Results for Data Set WD1

           γ = 0.0  0.25    0.5     0.75    1.0     m1      σm1     σ1      m2      σm2     σ2
ARE
P(Y > X)   0.7799   0.6617  0.5346  0.3928  0.2377  0.094   0.019   0.08    0.070   0.016   0.0678
P(Z > X)   0.9999   0.9947  0.9489  0.7592  0.3862  0.334   0.066   0.28    0.070   0.016   0.0678
P(Z > Y)   0.9994   0.9877  0.9218  0.7156  0.3773  0.334   0.066   0.28    0.097   0.019   0.0806
AvSqRE
P(V > U)   0.7384   0.6335  0.5264  0.4127  0.2913  1.96    0.773   3.28    1.20    0.544   2.31
P(W > U)   0.9993   0.9868  0.9212  0.7069  0.3789  24.8    6.70    28.42   1.20    0.544   2.31
P(W > V)   0.9988   0.9829  0.9049  0.6988  0.3579  24.8    6.70    28.42   1.96    0.773   3.28

TABLE 2.10 Bayesian Noninformative Prior Analysis Results for Data Set WD2

           γ = 0.0  0.25    0.5     0.75    1.0     m1      σm1     σ1      m2      σm2     σ2
ARE
P(Y > X)   0.5139   0.5075  0.5007  0.4916  0.4877  0.3357  0.054   0.23    0.3327  0.043   0.1833
P(Z > X)   0.9913   0.9465  0.8176  0.6007  0.3232  0.5685  0.074   0.31    0.3327  0.043   0.1833
P(Z > Y)   0.9853   0.9225  0.7950  0.5710  0.3034  0.5685  0.074   0.31    0.3357  0.054   0.23
AvSqRE
P(V > U)   0.6321   0.5699  0.5105  0.4399  0.3862  34.69   8.324   35.32   30.34   5.877   24.93
P(W > U)   0.9993   0.9848  0.9120  0.6956  0.3645  88.74   15.87   67.33   30.34   5.877   24.93
P(W > V)   0.9988   0.9631  0.8656  0.6432  0.3257  88.74   15.87   67.33   34.69   8.324   35.32
TABLE 2.11 Bayesian Noninformative Prior Analysis Results for Data Set WD3

           γ = 0.0  0.25    0.5     0.75    1.0     m1      σm1     σ1      m2      σm2     σ2
ARE
P(X > Y)   0.8523   0.7262  0.5801  0.4148  0.2444  0.4044  0.0227  0.096   0.3446  0.0402  0.171
P(X > Z)   0.9912   0.9446  0.8239  0.6013  0.3227  0.4044  0.0227  0.096   0.2247  0.0616  0.261
P(Y > Z)   0.9099   0.7829  0.6324  0.4531  0.2437  0.3446  0.0402  0.171   0.2247  0.0616  0.261
AvSqRE
P(U > V)   0.7151   0.6195  0.5118  0.4065  0.3105  58.59   6.231   26.44   49.72   10.70   45.40
P(U > W)   0.7956   0.6727  0.5436  0.3929  0.2770  58.59   6.231   26.44   39.09   17.91   75.99
P(V > W)   0.6563   0.5836  0.5184  0.4424  0.3740  49.72   10.70   45.40   39.09   17.91   75.99

TABLE 2.12 Bayesian Noninformative Prior Analysis Results for Data Set WD4

           γ = 0.0  0.25    0.5      0.75     1.0      m1      σm1     σ1      m2      σm2     σ2
ARE
P(Y > X)   0.6055   0.5505  0.50931  0.45862  0.40735  0.1713  0.0454  0.193   0.1516  0.037   0.1567
P(Z > X)   0.9999   0.9965  0.95663  0.77155  0.39987  0.4739  0.0695  0.295   0.1516  0.037   0.1567
P(Z > Y)   0.9996   0.9877  0.93118  0.71945  0.38255  0.4739  0.0695  0.295   0.1713  0.045   0.1926
AvSqRE
P(V > U)   0.6693   0.5938  0.50761  0.42550  0.34967  12.70   5.386   22.85   9.112   3.564   15.12
P(W > U)   0.9992   0.9873  0.91697  0.71516  0.37614  60.44   14.12   59.91   9.112   3.564   15.12
P(W > V)   0.9973   0.9731  0.88532  0.65300  0.35569  60.44   14.12   59.91   12.70   5.386   22.85
TABLE 2.13 Bayesian Noninformative Prior Analysis Results for Data Set WD5

           γ = 0.0  0.25    0.5     0.75    1.0     m1      σm1     σ1      m2      σm2     σ2
ARE
P(Y > X)   0.9820   0.9106  0.7833  0.5696  0.2989  0.7120  0.0223  0.095   0.1679  0.0326  0.138
P(Z > X)   0.8689   0.7415  0.5638  0.4218  0.2215  0.2913  0.1256  0.533   0.1679  0.0326  0.138
P(Z > Y)   0.9858   0.9251  0.7959  0.5615  0.2897  0.7120  0.0223  0.095   0.2913  0.1256  0.533
AvSqRE
P(U > V)   0.9676   0.8863  0.7290  0.5287  0.2976  16.93   4.695   19.92   4.953   2.828   12.0
P(W > U)   0.9728   0.8970  0.7465  0.5290  0.2767  74.44   24.92   105.73  16.93   4.695   19.92
P(W > V)   0.9919   0.9477  0.8279  0.6026  0.3321  74.44   24.92   105.73  4.953   2.828   12.0
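The summary columns of Tables 2.9 to 2.13 tie back to the t statistics of Tables 2.7 and 2.8 through equation (27), since the tabulated σm values are already the standard errors s/√n. A short sketch verifying two WD1 entries:

```python
# Sketch: recovering t statistics of Tables 2.7 and 2.8 from the m and
# sigma_m columns of Table 2.9 via equation (27).
import math

def delta(m1, m2, se1, se2):
    return (m1 - m2) / math.sqrt(se1 ** 2 + se2 ** 2)

# WD1, ARE, CPMLE vs CPNLR: reproduces t ~ 0.97 of Table 2.7
print(round(delta(0.094, 0.070, 0.019, 0.016), 2))
# WD1, AvSqRE, MO vs CPMLE: reproduces t ~ 3.39 of Table 2.8
print(round(delta(24.8, 1.96, 6.70, 0.773), 2))
```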
2.2.6 Discussion and Conclusions
The data in Tables 2.7 and 2.8 (supported by simple arithmetic differences and two-sample t-tests, as well as the bracketed Bayesian probabilities of how much better or worse one method may score than another with γ = 0) are plotted in Figures 2.17 and 2.18 for WD1 only. Tables 2.9 to 2.13 show in detail what happens as the γ tolerance constant increases from γ = 0 (or b = 0, which is purely the hypothesis test of equality of means, such as H01: μ_AREi = μ_AREj, i ≠ j) to γ = 1 (or b = X̄2 - X̄1) in the criterion equation (16): the probability that a competing method's predictive accuracy is poorer (+ difference) or better (- difference) decreases, in support of the t-tests given in Tables 2.7 and 2.8. With the increase in γ, the difference between the two sample means tested in the hypothesis setting is decreased; thus, the probability that one mean is greater than the other is decreased. Note that
FIGURE 2.17 Noninformative probabilities from Table 2.9 for the ARE of data set WD1: P(Y > X), P(Z > X), and P(Z > Y) versus the gamma multiplier.
FIGURE 2.18 Noninformative probabilities from Table 2.9 for the AvSqRE (or SRE) of data set WD1: P(V > U), P(W > U), and P(W > V) versus the gamma multiplier.
one can also conduct two-sample t-tests of equality of means by indicating a tolerance or threshold in the null hypothesis, as shown on the book's CD-ROM. Empirically, from the examples studied, a noninformative Bayesian probability exceeding 0.9 for ARE comparisons (note that lower probabilities are recorded for SRE with the squared penalty) using criterion function (16) concurs strongly with the rejection of the equality of means at a significance level of α = 0.05. For example, the Bayesian noninformative (or flat, where anything goes, with no restraint on the prior information of the variance) probabilities of CPNLR predicting more accurately than CPMLE for ARE and SRE are 0.7791 and 0.7386, respectively, for the special case γ = 0. Recall that this probability was not adequate to reject the equality of ARE. Further, Tables 2.14 and 2.15 clearly show that these probabilities fall to about 0.65 and 0.56 for ARE and SRE, respectively, when the upper boundary C for τ² is no longer infinite (anything goes) but is restrained to reasonable values. The trend is illustrated in Figure 2.19: a value C much greater than 0.002 for ARE is already so large that the prior is effectively noninformative; similarly, C much greater than 2.4 for SRE in Table 2.15 is too large, and hence we quickly approach the infinite (flat) case.

TABLE 2.14 Informative Prior Results of Comparing P(Y > X) for the ARE in WD1 from Table 2.9^a

τ²(0, C)        γ = 0.0   γ = 0.1   γ = 0.25
(0, ∞)          0.77985   0.72499   0.66167
(0, 0.001468)   0.70375   0.68742   0.61351
(0, 0.001101)   0.69223   0.70382   0.59847
(0, 0.000734)   0.68632   0.67054   0.57585
(0, 0.000367)   0.64455   0.64581   0.56924

^a τ² = (0, C = ∞) ⇒ noninformative range, where C is the constant upper boundary of τ².
TABLE 2.15 Informative Prior Results of P(V > U) for AvSqRE in WD1 from Table 2.9^a

τ²(0, C)   γ = 0.0   γ = 0.1   γ = 0.25
(0, ∞)     0.73841   0.69524   0.63301
(0, 2.4)   0.67962   0.66515   0.52232
(0, 1.8)   0.65597   0.64137   0.50739
(0, 1.2)   0.62332   0.60757   0.47465
(0, 0.6)   0.55602   0.59866   0.45853

^a τ² = (0, C = ∞) ⇒ noninformative range, where C is the constant upper boundary of τ².
FIGURE 2.19 Informative probabilities from Table 2.14 for the ARE of WD1 (γ = 0, 0.1, and 0.25) versus the upper boundary C (×10,000).
Therefore, the informative treatment of the problem can be productive in the case of borderline decisions, because using informative priors rather than none produces more secure results, such as those shown in Table 2.7 for MO versus CPNLR in WD3, with an arithmetic difference of -0.18, a t-test statistic of -2.74, and a Bayesian noninformative comparative probability of 0.9912. If an informative approach were taken here, with restraints placed on the upper values of the prior variance τ², then MO scoring better than CPNLR would be contested, due to an informative probability lower than that of a flat prior (<0.9912), as the trends in Table 2.14 show. In such contested comparisons, it is useful to test using SRE (the squared penalty) in addition to ARE (the absolute penalty), as in Table 2.8, resulting in a failure to reject (i.e., acceptance of) the hypothesis of equality of SRE between MO and CPNLR, in contrast to the earlier rejection: the Bayesian noninformative comparative probability of rejection went to a weaker 0.7956 from an earlier stronger 0.9912. The converse is true for the same comparison between MO and CPNLR in data set WD5, changing the "accept" decision to a "reject" when we use Table 2.8 for SRE in place of the ARE comparison in Table 2.7: the Bayesian noninformative comparative probability of rejection went up to a stronger 0.9728 from an earlier weaker 0.8689. Note that stronger Bayesian comparative probabilities signal a rejection of the equality of predictive accuracies between two competing models. This way of quantifying whether one method is better (lower ARE or SRE) or worse (higher ARE or SRE) than another is far more realistic than deciding deterministically that one method is better merely by comparing the ARE or SRE values, or deciding stochastically by performing statistical hypothesis tests of pairwise means, an approach itself more realistic than simply comparing sample mean values [12,34]. In this way, one can place a measure of quantification on how much better or worse one method is than another in predictive accuracy.
REFERENCES
However, this quantiÞcation cannot be tested yet, so statistical hypotheses tests may be used to assist decision makers to reject the equality of means or failure to reject. One may test the Bayesian probability as in H0 : P = P0 . More in-depth study concerning informative priors is required. In this chapter, half-normal distribution is used for informative prior distribution of the means, ARE and SRE, to obtain Table 2.14 or 2.15. However, the formulation is beyond the scope of this chapter because ARE and SRE are positive quantities whose ideal values peak around zero. Recall that an absolute penalty of deviation of prediction from the true value can be attributed to alpha testing (before the release of software), in the case of ARE. However, the more severe squared penalty deviation of prediction from the true value may well be attributed to beta testing (after the release of software), as errors are more costly to redeem after software has been released to the end user. The impact of this methodology on software reliability measurement employing a variety of models is rather signiÞcant. It opens a new avenue for comparing and contrasting the predictive accuracy of competing methods’ ARE (due to an absolute penalty) and SRE (due to a squared penalty) in terms of how much better or worse they are rather than whether they are good or bad types through a qualitative comparison, as performed earlier [5,12,18–20]. In brief, the newly proposed stochastic measures of comparison are more accurate than the arithmetic comparisons of AREs performed conventionally by software analysts. Finally, with the rising number of software reliability estimation models, it is equally important to assess the predictive accuracy (or forecast quality) of these modeling techniques [12,18,34,45–48]. The material presented here is a novel attempt to quantify the probability of how much better one method’s prediction ability is than another’s rather than simply contrasting one to another. The probabilistic Bayesian computational technique proposed is superior to those that simply mention that one competing method has less ARE or SRE, or that the t-test of the equality of the corresponding means is rejected. REFERENCES 1. N. F. Schneidewind, Analysis of Error Processes in Computer Software, Proceedings of the International Conference on Reliable Software, April 21–23, 1975, pp. 337–346. 2. N. F. Schneidewind and H. M. Hoffmann, An Experiment in Software Error Data Collection and Analysis, IEEE Trans. Software Eng., 5(3), 276–286 (1979). 3. A. L. Goel and K. Okumoto, Time-Dependent Error-Detection Rate Model for Software Reliability and Other Performance Measures, IEEE Trans. Reliab., 28(3), 206–211 (1979). 4. S. Yamada and S. Osaki, Software Reliability Growth Modeling: Models and Applications, IEEE Trans. Software Eng., 11(12), 1431–1437 (1985). 5. J. D. Musa, A. Iannino, and K. Okumoto, Software Reliability: Measurement, Prediction, Application, McGraw-Hill, New York, 1987.
6. M. Xie, Software Reliability Models: A Selected Annotated Bibliography, Software Test. Verification Reliab., 3(1), 3–28 (1993).
7. J. D. Musa and K. Okumoto, A Logarithmic Poisson Execution Time Model for Software Reliability Measurement, Proceedings of the 7th International Conference on Software Engineering, Orlando, FL, IEEE Computer Society Press, Los Alamitos, CA, 1984, pp. 230–238.
8. S. R. Dalal and A. A. McIntosh, When to Stop Testing for Large Software Systems with Changing Code, IEEE Trans. Software Eng., 20(4), 318–323 (1994).
9. M. Sahinoglu, The Limit of Sum of Markov Bernoulli Variables in System Reliability Evaluation, IEEE Trans. Reliab., 39(1), 46–50 (1990).
10. M. Sahinoglu, Compound-Poisson Software Reliability Model, IEEE Trans. Software Eng., 18(7), 624–630 (1992).
11. M. Sahinoglu and U. Can, An Efficient Predictive NLR Model for Reliability Modelling in Software Testing, Proceedings of the 2nd Bellcore/Purdue Symposium on Issues in Software Reliability Estimation, Bellcore, Livingston, NJ, October 12–13, 1992, pp. 29–38.
12. M. Sahinoglu and U. Can, Alternative Parameter Estimation Methods for the Compound Poisson Software Reliability Model with Clustered Failure Data, J. Software Test. Reliab. Verification, 17, 35–57 (1997).
13. P. Randolph and M. Sahinoglu, A Compound Poisson Stopping Rule, J. Appl. Stochastic Models Data Anal., 11(2), 135–143 (1995).
14. G. J. Knafl, Solving Maximum Likelihood Equations for Two-Parameter Software Reliability Models Using Grouped Data, Proceedings of the International Symposium on Software Reliability Engineering, 1992, pp. 205–213.
15. P. B. Moranda, Prediction of Software Reliability During Debugging, Proceedings of the Annual Reliability and Maintainability Symposium, Washington, DC, IEEE Reliability Society, 1975, pp. 27–33.
16. S. Kotz and N. L. Johnson (eds.), Encyclopedia of Statistical Sciences, Wiley, New York, 1988, Vol. 5, pp. 92, 111–113; Vol. 6, pp. 169–176.
17. M. Sahinoglu, An Empirical Bayesian Stopping Rule in Testing and Verification of Behavioral Models, IEEE Trans. Instrum. Meas., 52, 1428–1443 (October 2003).
18. T. Downs and A. Scott, Evaluating the Performance of Software-Reliability Models, IEEE Trans. Reliab., 41(4), 533–538 (1992).
19. T. M. Khoshgoftaar, B. B. Bhattacharya and G. D. Richardson, Predicting Software Errors, During Development, Using Nonlinear Regression Models: A Comparative Study, IEEE Trans. Reliab., 41(3), 390–395 (1992).
20. T. M. Khoshgoftaar, J. C. Munson, B. B. Bhattacharya and G. D. Richardson, Predictive Modeling Techniques of Software Quality from Software Measures, IEEE Trans. Software Eng., 18(11), 979–987 (1992).
21. M. G. Kendall and A. Stuart, The Advanced Theory of Statistics, Vol. 2, Hafner, New York, 1961.
22. N. R. Draper and H. Smith, Applied Regression Analysis, Wiley, New York, 1966.
23. Y. Bard, Nonlinear Parameter Estimation, Academic Press, New York, 1974.
24. J. J. Moré, The Levenberg–Marquardt Algorithm: Implementation and Theory, in G. A. Watson (ed.), Numerical Analysis, Lecture Notes in Mathematics, No. 630, Springer-Verlag, Berlin, 1977, pp. 105–116.
25. W. J. Kennedy, Jr. and J. E. Gentle, Statistical Computing, Marcel Dekker, New York, 1980.
26. SPSS Reference Guide, SPSS, Chicago, 1990, pp. 475–488.
27. D. M. Siefert and G. E. Stark, Software Reliability Handbook: Achieving Reliable Software, American Institute of Astronautics and Aeronautics, Reston, VA, 1992.
28. A. Iannino, J. D. Musa, K. Okumoto and B. Littlewood, Criteria for Software Reliability Model Comparisons, IEEE Trans. Software Eng., 10(6), 687–691 (1984).
29. N. F. Schneidewind, Software Reliability Model with Optimal Selection of Failure Data, IEEE Trans. Software Eng., 19(11), 1095–1104 (1993).
30. N. F. Schneidewind and H. M. Hoffmann, An Experiment in Software Error Data Collection and Analysis, IEEE Trans. Software Eng., 5(3), 276–286 (1979).
31. A. A. Abdel-Ghaly, P. Y. Chan and B. Littlewood, Evaluation of Competing Software Reliability Predictions, IEEE Trans. Software Eng., 12(9), 950–967 (1986).
32. M. Zhao and M. Xie, On the Log-Power Model and Its Applications, Proceedings of the International Symposium on Software Reliability Engineering, Research Triangle Park, NC, IEEE Computer Society, 1992, pp. 14–22.
33. J. O. Berger and J. J. Deely, A Bayesian Approach to Ranking and Selection of Related Means and Alternatives to AOV Methodology, J. Am. Stat. Assoc., 83, 364–373 (1988).
34. M. Sahinoglu, J. Deely and S. Capar, Stochastic Bayesian Measures to Compare Forecast Accuracy of Software Reliability Models, IEEE Trans. Reliab., 50, 92–97 (March 2001).
35. A. L. Goel, Software Reliability Models: Assumptions, Limitations and Applicability, IEEE Trans. Software Eng., 11(12), 1411–1423 (1985).
36. M. Sahinoglu and S. Capar, Statistical Measures to Evaluate and Compare Predictive Quality of Software Reliability Estimation Methods, Proceedings of the International Statistical Institute (ISI'97), IP-46, Istanbul, Turkey, August 18–26, 1997, pp. 525–528.
37. J. L. Romeu, Discussion of Invited Paper: Statistical Measures to Evaluate and Compare Predictive Quality of Software Reliability Estimation Methods, Proceedings of the International Statistical Institute (ISI'97), IP-46, August 18–26, 1997.
38. J. O. Berger, Statistical Decision Theory and Bayesian Analysis, Springer-Verlag, New York, 1985.
39. J. J. Deely and J. B. Keats, Bayes Stopping Rules for Reliability Testing with the Exponential Distribution, IEEE Trans. Reliab., 43(2), 288–293 (1994).
40. J. J. Deely and A. F. M. Smith, Quantitative Refinements for Comparisons of Institutional Performance, J. Roy. Stat. Soc., A (1997).
41. J. J. Deely and W. J. Zimmer, Choosing a Quality Supplier: A Bayesian Approach, Bayesian Stat., 3, 585–592 (1988).
42. A. E. Gelfand and A. F. M. Smith, Bayesian Statistics Without Tears: A Sampling–Resampling Perspective, Am. Stat., 46(2), 84–88 (May 1992).
43. M. Sahinoglu and A. K. Alkhalidi, A Compound Poisson∧LSD Stopping Rule for Software Reliability, presented at the 5th World Meeting of ISBA, Satellite Meeting to ISI-97, Istanbul, Turkey, August 1997.
44. M. Sahinoglu, Negative Binomial Density of the Software Failure Count, Proceedings of the 5th International Symposium on Computer and Information Sciences (ISCIS), Vol. 1, 1990, pp. 231–239. 45. T. A. Mazzuchi and R. Soyer, A Bayes Empirical-Bayes Model for Software Reliability, IEEE Trans. Reliab., 37(3), 248–254 (1988). 46. T. A. Mazzuchi and R. Soyer, Software Reliability Software Assessment Using Posterior Approximations, in R. M. Heiberger (ed.), Computer Science and Statistics: Proceedings of the 19th Symposium on the Interface, 1987, pp. 400–402. 47. N. F. Schneidewind, Method for Validating Software Metrics, IEEE Trans. Software Eng., 18, 410–422 (1992). 48. W. J. Zimmer and J. J. Deely, A Bayesian Ranking of Survival Distributions Using Accelerated or Correlated Data, IEEE Trans. Reliab., 45(3), 499–504 (1996).
EXERCISES

2.1 Using the input data for WD1 in Tables 2.1 and 2.4, and applying the straightforward analytical CPMLE method as in equations (6) to (8), verify the Xtot calculations for 20, 50, and 70%. You should get Xtot = 125, Xtot = 148, and Xtot = 131.4 using a calculator.

2.2 Do the same as in Tables 2.2 and 2.4 for WD1 by applying SPSS (Statistical Software Package) to implement the CPNLR technique, which follows the L-M algorithm, using equations (8) and (10). Stop at 50% of the observations and then estimate as follows:
(a) Start the SPSS Windows version available for PCs.
(b) Go to "Files" and open the "50PER.SAV" input data file, which contains the input data up to and including 50% of the measurements, as in Figure 2.11. If it is not available, prepare your own input table.
(c) Click on "Statistics" in the menu bar, then click "Regression" and "Nonlinear." Stay in Figure 2.12.
(d) Click on the nonlinear regression frame and enter "XPAST" in the "Dependent" window.
(e) In the "Model Expression" window, enter XTOT − (BETA/P)*TREM.
(f) In the "Nonlinear Regression: Parameters" window, enter the unknown initial values: Name: XTOT, Starting (initial) Value: 100, then click ADD; Name: BETA, Starting (initial) Value: 1, then click ADD; Name: P, Starting (initial) Value: 1, then click ADD. Now click on "CONTINUE."
(g) In the Loss-Function submenu, choose "Sum of Squared Residuals" for L-M.
(h) In the Constraints submenu, choose "Unconstrained."
(i) Click OK to finalize the calculations and receive the convergence table in Figure 2.13.
(j) Read R-Square = 0.93 (very good) and XTOT = 132.23 with 95% upper and lower confidence estimates.

2.3 Repeat Exercise 2.2 for 70% cumulated data.

2.4 Download data set WD2 from the CD-ROM and proceed as in Exercises 2.1, 2.2, and 2.3.

2.5 Download data set WD3 from the CD-ROM (see Table 2.5) and proceed as in Exercise 2.4.

2.6 Download data set WD4 from the CD-ROM and proceed as in Exercise 2.4.

2.7 Download data set WD5 from the CD-ROM (see Table 2.6) and proceed as in Exercise 2.4.

2.8 Verify Tables 2.7 and 2.8 for the t-tests using the two-sample t-test program on the CD-ROM.

2.9 (Noninformative Bayes) Using Tables 2.9 to 2.13 and the FLAT program on the CD-ROM, as in Figures E2.9(a) and E2.9(b) of WD1 for comparing CPNLR with CPMLE and MO, respectively, verify the tables. Note that there may be differences due to round-off error.
FIGURE E2.9(a)
FIGURE E2.9(b)
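For readers without SPSS, the Levenberg-Marquardt fit of Exercises 2.2 to 2.7 can be sketched with SciPy. Note that in the expression XTOT − (BETA/P)*TREM only the ratio BETA/P is identifiable from a straight-line fit, so the sketch below estimates that ratio as a single slope parameter; the (TREM, XPAST) pairs are hypothetical placeholders, not the 50PER.SAV contents:

```python
# Hedged, SPSS-free sketch of the L-M fit of Exercise 2.2 using SciPy.
import numpy as np
from scipy.optimize import curve_fit

def model(trem, xtot, slope):          # slope stands in for BETA/P
    return xtot - slope * trem

trem = np.array([40.0, 35.0, 30.0, 25.0, 20.0, 15.0, 10.0])   # time remaining
xpast = np.array([72.0, 81.0, 90.0, 99.0, 107.0, 116.0, 124.0])

# curve_fit uses Levenberg-Marquardt by default for unconstrained problems
params, cov = curve_fit(model, trem, xpast, p0=[100.0, 1.0])
xtot, slope = params
print(f"XTOT = {xtot:.2f}, BETA/P = {slope:.3f}")
```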
2.10 Using some other competing software reliability failure-count prediction techniques whose output data you know, compare any pair of methods employing the two-sample t-test and noninformative Bayesian approaches, using the FLAT software on the CD-ROM. At least 15 precalculated predictions are required to fulfill the underlying statistical normal-theory assumptions. Also, compare the ARE or SRE versus 0 (zero), as in a one-sample right-tailed t-test, to decide whether your indicator of predictive quality can be accepted or rejected versus perfect accuracy with 0.0% (or 0.1% or 0.2%) precision.
Men of God’s truth are an ocean, Lovers must plunge into that sea; The sages too, should risk a dive, To bring out the best jewelry. —Yunus Emre, the legendary mystic folk poet (1238–1320)
3 QUANTITATIVE MODELING FOR SECURITY RISK ASSESSMENT

3.1 DECISION TREE MODEL TO QUANTIFY RISK

Nutshell 3.1 Several security risk templates employ nonquantitative attributes to express a risk's severity. This approach is highly subjective and void of actual figures. The author's design provides a quantitative technique with an updated repository on vulnerabilities, threats, and countermeasures to calculate risk.

3.1.1 Motivation

Every day, I drive to my office at Troy University's Computer Science Department, located at Gunter Air Force Base in Montgomery, Alabama. During the morning journey, I often glance at two billboards along the way. One shows the weather condition quantitatively, such as 68°F (it does not say "room temperature" or "warm" or "cold"). The second is at the gate where I show my pass to enter, where the billboard says "Protection: ALPHA" or "BRAVO" or "CHARLIE" or "DELTA," from least to most severe (green, yellow, orange, and red in the civilian sector), a qualitative indicator of the base's daily security based on a national data repository. Like other passersby, I do not know how to differentiate today's risk from yesterday's. I wish there were an index value, such as 95 out of 100, so that I could tell just how secure we were thought to be, similar to the way we quantify the weather. As part of my research to quantify risk (assess it numerically), a security meter design that is system-specific under the prevalent conditions per unit time is proposed [1]. This technique provides a purely quantitative and a semiquantitative (hybrid) alternative to frequently used qualitative
models that do not generate public interest and awareness of a practical, user-friendly, or tangible security concept. Those of Symantec at www.symantec.com, those at www.dsdlabs.com/security.htm of Data System Design Labs, those in NSF's 2002 workshop report at www.nsf.gov/pubs/nsf03209/start.htm, and recently the popular attack trees and similar schemes are qualitative and attribute based, not numerical and cost convertible [2-4]. Moreover, quantitative risk measurements are needed to compare alternatives objectively and to calculate monetary figures for reducing or minimizing existing risk. There are virtually no such quantitative measures in academia or the corporate world, other than high, medium, or low denominations. Among the existing analyses that favor a quantitative study, such as capability-based attack trees or attack patterns, non-Bayesian spam filtering or intrusion-aware design, or TTD (time-to-defeat) models, either (1) there is no probabilistic framework regarding whether to add or to multiply risks, or (2) the risk calculations are handled loosely, one by one, in a nonsystematic approach [5,6]. See the appendixes for descriptions of methods of analysis that, within a nonprobabilistic framework, do not provide an accurate overall risk assessment, although they carry an empirical value to qualify the risk, which is better than none at all. In this chapter, along with the purely quantitative security meter model as the data allow, we look at a proposed modification of the decision tree-based security meter model for qualitative attributes in case quantitative data are not available, given that the underlying probabilistic assumptions hold [1,7-9]. Following the security meter analysis, prioritization of software maintenance is also achieved by ranking the vulnerabilities from worst to least severe through a Bayesian analysis [10]. Further, imperfections such as the nondisjointness of vulnerabilities and threats are treated using the laws of probability [11]. The model proposed is practical and simple to use for beginners in the field, but it also provides a mathematical-statistical foundation on which strategists or practitioners can construct a practical risk valuation. The probabilistic assumptions for the Monte Carlo simulation revolve around using a simple uniformly distributed random variable for the input variables by assigning an upper and a lower bound. The simulation can be improved by using other statistical distributions [12].

3.1.2 Risk Scenarios

Conventionally, risk scenarios involve possible chance-based catastrophic failures, with scarce modeling of maliciously designed human interventions that threaten inherent system vulnerabilities. Risk scenarios involving critical computer communication networks are now more pervasive and severe than ever before because of the colossal redemption cost of nonmalicious chance failures that occur due to insufficient testing and lack of adequate reliability. We can use software reliability modeling and testing techniques to examine these chance failures in more detail [13,14]. But software security testing is a new field [15]. For the intentional failures or malicious activities that critically increase the risk of ill-defined attacks, a physical scenario has not been thoroughly modeled, at
least not one that considers a unified, consistent scheme of vulnerabilities, threats, and countermeasures. A quantitative risk assessment provides results in numbers that management can understand, whereas a qualitative approach, although easier to implement, makes it difficult to trace generalized descriptive results. The security meter design proposed, which fills a void in the arena of much-sought-after quantitative risk evaluation, compares favorably to most current assessments, which provide qualitative results. This is achieved by constructing a probabilistically accurate quantitative model to measure security risk [1]. This concrete numerical approach, which works for all systems, can further facilitate security risk management and security testing. This means that the final risk measure, calculated as a percentage, can be tested, improved, compared, and budgeted, as opposed to attributes such as high, medium, or low, which cannot be managed or quantified numerically and monetarily for an objective assessment. Banks and other financial institutions employ several commercially available security risk templates, mostly in verbal or qualitative form, that express the severity of a risk by a classification of attributes such as low, medium, or high. This approach is not only highly subjective but also lacks actual risk figures. Quantitative risk figures help mitigate or avoid future errors by allowing risk managers to compare project alternatives objectively and to identify priorities for software maintenance. In existing analyses that favor quantitative study, either a probabilistic framework regarding whether to add or multiply risks does not exist, or the risk calculations are handled on a case-by-case basis without a network-oriented conclusion. Without a probabilistic framework such as the one suggested in the security meter design (Figure 3.1), conclusions regarding the severity of a risk may be misleading and costly, due to over- or underestimation, especially during periods of military conflict, when risk scenarios are underestimated. The security meter design could be useful not only for commercial companies and military or government entities whose job it is to run daily risk assessments, but also for regular end users, such as persons sending e-mail from household PCs. Much statistical planning and design remains to be done to reach a point where end users of all types have a consistently updated repository of vulnerabilities, threats, and countermeasures in continuous time.
FIGURE 3.1 Quantitative security meter probability model with probabilistic inputs (vulnerability, threat, lack of countermeasure), constant inputs (utility cost, criticality), and output (residual risk and expected maximum cost to avoid risk), where the black box is a probabilistic decision tree diagram that performs the operations.
3.1.3 Quantitative Security Meter Model

Let's look more closely at the security meter model, including a description of its input and output, all described in a probabilistic decision tree diagram approach. The same principles will be applied, within a modified approach, to those cases in which not all quantitative data are available for the input parameters.

Risk Management Risk management is the total process of identifying, measuring, and minimizing the uncertain events that can affect resources. This definition also implies the process of bringing management (remedial action) and control into the risk analysis. A basic ingredient of risk assessment and analysis is the concept of vulnerability: a weakness in any information system, system security procedure, internal controls, or implementation that an attacker could exploit. It can also be a weakness such as a coding bug or a design flaw in a system. An attack occurs when an attacker with a reason to strike takes advantage of a vulnerability to threaten an asset [16-19]. The second most important ingredient in risk assessment is the concept of threat: any circumstance or event with the potential to affect an information system adversely through unauthorized access, destruction, disclosure, modification of data, or denial of service. Similarly, a threat to a system is a potential event that will have an unwelcome consequence if it becomes an attack on an asset [20]. Computer vulnerabilities replace software failures [21]. We can define risk as the possibility that a particular threat will affect an information system adversely by exploiting a particular vulnerability. The third ingredient in the risk analysis, after vulnerability and threat, is a countermeasure, or the lack thereof. A countermeasure is an action, device, procedure, technique, or measure that reduces risk to an information system. Consequently, residual risk is the portion of risk remaining after a countermeasure is applied. If a perfect countermeasure existed, there would be no residual risk. The security meter design identifies the deterministic (constant) and probabilistic (random) inputs for the targeted output of residual risk, namely an attack, as well as the projected cost to avoid or mitigate the risk that has been calculated.

Fundamental Laws of Probability and Statistics
Law 1: The probability P(V) of any event V satisfies 0 ≤ P(V) ≤ 1 [22].
Law 2: If S is the sample space in a probability model, the sum of the probabilities of all outcomes is P(S) = 1.
Law 3: Two events V1 and V2 are mutually exclusive (or disjoint) if they have no outcomes in common and therefore can never occur simultaneously. If V1 and V2 are disjoint, P(V1 or V2) = P(V1) + P(V2). This is the addition rule for disjoint events.
Law 4: The complement of any event V is the event that V does not occur, denoted V^C. The complement rule states that P(V^C) = 1 − P(V).
Probabilistic Inputs The suggested vulnerability values vary between 0.0 and 1.0 (zero to 100%), adding up to unity. In a probabilistic sample space of feasible outcomes of the random variable of vulnerability, the sum of the probabilities should add up to 1. This is like the probabilities of the faces of a die, 1 to 6, totaling 1 whether the die is fair or tilted. If a cited vulnerability is not exploited in reality, it cannot be included in the model or in the Monte Carlo simulation study (which we examine in more detail later). A vulnerability has from one to several threats. A threat is defined as the probability of the exploitation of a vulnerability or weakness within a specific time frame. Each threat has a countermeasure (CM) that ranges between 0 and 1 (with respect to the first law of probability), whose complement gives the lack of countermeasure (LCM). The binary CM and LCM values should add up to 1, keeping in mind the second law of probability. The security risk analyst can define, for instance, a network server (v1) as a vulnerability located in a remote unoccupied hut, in which a threat (t11), such as persons without proper access, or a fire (t12), could result in the destruction of assets without countermeasures such as a motion sensor (CM111) or a fire alarm (CM121), respectively. Let's go over some words regarding malicious failures: that is, words that are part of the malware (software that has an evil purpose) taxonomy. Varying forms of malware exist. For instance, a computer virus, which infects software by inserting itself, is a self-replicating code attached to another code, with a payload that varies from the innocent (e.g., scaring you) to the harmful (e.g., discarding or changing useful existing files). A worm replicates, too, but it does not infect. The distinction is not all that obvious and is usually confused. A logic bomb is malware executed only when specific trigger conditions are met. A Trojan horse is malware with hidden side effects not included in the specifications and therefore not intended by the user executing the software. A hacker (a black-hat) is a person who harms your software unethically; white-hats use their skills to help develop software. Those who attack by borrowing software tools are script kiddies or anklebiters [2]. (Actually, according to the dictionary, a hacker is a person who makes furniture using an axe!)

Deterministic Inputs System criticality, a constant that indicates how critical or disruptive a system is in the event of entire loss, is taken to be a single value ranging from 0.0 to 1.0 (zero to 100%). Criticality is low if the residual risk is of little or no significance, such as the malfunctioning of an office printer. But in the case of a nuclear power plant, criticality is close to 100%, because its security has vital safety ramifications for humans. Capital (investment) cost is the total expected asset loss in monetary units (dollars, etc.) for a particular system if it is destroyed completely and can no longer be utilized, excluding the other costs had the system continued to generate added value. If there is a shadow or economic ripple effect, a multiplier is needed.

Decision Tree Diagram Given that a simple sample system or component has two or more outcomes for each risk factor (vulnerability, threat, and countermeasure), the following probabilistic framework holds: the sums Σ vi = 1 and
FIGURE 3.2 General-purpose decision tree diagram for the model proposed: each vulnerability Vi branches into its threats Tij, and each threat branches into CM and LCM; each LCM leaf contributes the product (Vi × Tij × LCM), and the leaf contributions are summed to give the output, the total residual risk.
Σj tij = 1 for each i, and LCM + CM = 1 for each ij, within the tree diagram structure in Figure 3.2. Using the probabilistic inputs, we get the residual risk:

residual risk = vulnerability × threat × lack of countermeasure   (1)
We can calculate the residual risks for all vulnerabilities with their threats and LCMs, as well as the total residual risk. That is, if we add all the residual risks due to lack of countermeasures, as in Figure 3.2, we can find the overall residual risk. We apply the criticality factor to the residual risk to calculate the final risk. Then we apply the capital investment cost to the final risk to determine the expected cost of loss (ECL), which helps us to budget to avoid (before the attack) or repair (after the attack) the entire risk:

final risk = residual risk × criticality   (2)

ECL = final risk × capital cost   (3)
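A minimal sketch of the calculation in equations (1) to (3) over a Figure 3.2-style tree; the two-vulnerability layout and all numeric values here are hypothetical, not those of Table 3.1:

```python
# Security meter sketch: residual risk, final risk, and ECL per eqs. (1)-(3).
# Each vulnerability carries its probability and (threat, P(threat), LCM) rows;
# vulnerabilities sum to 1, and each vulnerability's threats sum to 1.
tree = {
    "v1": (0.2, [("t1", 0.35, 0.30),
                 ("t2", 0.65, 0.40)]),
    "v2": (0.8, [("t1", 0.40, 0.10),
                 ("t2", 0.60, 0.25)]),
}

total_residual = 0.0
for v_name, (v_prob, threats) in tree.items():
    for t_name, t_prob, lcm in threats:
        total_residual += v_prob * t_prob * lcm   # eq. (1): v x t x LCM

criticality = 0.5        # deterministic input, 0.0 to 1.0 (assumed)
capital_cost = 2000.0    # total expected asset loss, in dollars (assumed)

final_risk = total_residual * criticality        # eq. (2)
ecl = final_risk * capital_cost                  # eq. (3)
print(f"residual = {total_residual:.4f}, final = {final_risk:.4f}, ECL = ${ecl:.2f}")
```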
3.1.4 Model Application and Results

A risk analyst conducts Monte Carlo simulation to mimic the relationship among vulnerabilities, threats, and countermeasures as they exist in real life. That is, a certain vulnerability is challenged by a threat, which becomes an attack at the next level if the threat is not countermeasured: by a firewall in a computer, a motion sensor in the case of intrusion, or a fire alarm in the case of fire, according to what the actual situation presents. If fully countermeasured (i.e., CM = 1 or LCM = 0), no attack occurs, as is clear from equation (1), where the residual risk is zero since one of the factors is zero. The expected cost of loss (posthumously, if no action is taken) or the expected cost of repair to avoid the entire risk (if a countermeasure is taken proactively) can be determined using equation (3). Risk analysis has various inputs, such as the vulnerability types
and each threat's countermeasure; criticality and capital cost are constants, as is the number of simulations. From these input values we determine the expected monetary loss needed to mitigate the residual risk. To represent each risk factor, such as vulnerability (vi), threat (tij), and countermeasure (CMijk), by an educated guess, we assume uniform (or rectangular) density parameters that can take on values between a lower limit a and an upper limit b. The lower and upper limits are tabulated in Table 3.1 for all risk factors. The average, or expected value, of a uniformly distributed random variable is μ = (a + b)/2. When placed in a decision tree diagram as in Figures 3.2 and 3.3, the expected values produce an expected output, which should be verified by the Monte Carlo simulation, in which thousands of runs are conducted, converging on the output expected. Let's examine a sample application whose input data, tabulated in Table 3.1, revolve around a home office PC. In this hypothetical example, we assume that there exist five reportedly recognized types of vulnerabilities (v1 to v5), each with two or three projected threats. Again, for each threat there is a countermeasure (CM) or lack of one (LCM), whose probabilities sum to 1. Figure 3.3 shows the expected values of the input random variables of Table 3.1, which produce the theoretical expected output. Residual risk is what is left of the risk, the product of vulnerability and threat, after a countermeasure is applied to circumvent it; that is, residual risk = risk × LCM. If we have a perfect CM (CM = 1), the LCM is zero (LCM = 1 − CM = 0), which results in zero residual risk. The residual risks corresponding to each threat of a given vulnerability are added to find the total residual risk.
FIGURE 3.3 Spreadsheet for the results expected for the first application in Table 3.1.
TABLE 3.1 Probabilistic Input Data for Vulnerabilities, Threats, and Countermeasures for a Home PC

Vulnerability v1 (a = 0.1, b = 0.3), μv1 = 0.2:
  t1 (a = 0.1, b = 0.6), μt1 = 0.35; LCM1 (a = 0.1, b = 0.5), μLCM1 = 0.3, μCM1 = 0.7 by subtraction
  t2 by subtraction, μt2 = 0.65; LCM2 (a = 0.2, b = 0.6), μLCM2 = 0.4, μCM2 = 0.6 by subtraction

Vulnerability v2 (a = 0.0, b = 0.4), μv2 = 0.2:
  t1 (a = 0.2, b = 0.6), μt1 = 0.40; LCM1 (a = 0.1, b = 0.7), μLCM1 = 0.4, μCM1 = 0.6 by subtraction
  t2 (a = 0.1, b = 0.3), μt2 = 0.20; LCM2 (a = 0.0, b = 0.2), μLCM2 = 0.1, μCM2 = 0.9 by subtraction
  t3 by subtraction, μt3 = 0.40; LCM3 (a = 0.1, b = 0.4), μLCM3 = 0.25, μCM3 = 0.75 by subtraction

Vulnerability v3 (a = 0.0, b = 0.2), μv3 = 0.1:
  t1 (a = 0.1, b = 0.5), μt1 = 0.30; LCM1 (a = 0.1, b = 0.4), μLCM1 = 0.25, μCM1 = 0.75 by subtraction
  t2 by subtraction, μt2 = 0.70; LCM2 (a = 0.0, b = 0.3), μLCM2 = 0.15, μCM2 = 0.85 by subtraction

Vulnerability v4 (a = 0.0, b = 0.1), μv4 = 0.05:
  t1 (a = 0.1, b = 0.4), μt1 = 0.25; LCM1 (a = 0.1, b = 0.4), μLCM1 = 0.25, μCM1 = 0.75 by subtraction
  t2 (a = 0.0, b = 0.5), μt2 = 0.25; LCM2 (a = 0.2, b = 0.6), μLCM2 = 0.4, μCM2 = 0.60 by subtraction
  t3 by subtraction, μt3 = 0.50; LCM3 (a = 0.2, b = 0.6), μLCM3 = 0.4, μCM3 = 0.60 by subtraction

Vulnerability v5 by subtraction, μv5 = 0.45:
  t1 (a = 0.1, b = 0.5), μt1 = 0.30; LCM1 (a = 0.1, b = 0.3), μLCM1 = 0.2, μCM1 = 0.80 by subtraction
  t2 by subtraction, μt2 = 0.70; LCM2 (a = 0.0, b = 0.3), μLCM2 = 0.15, μCM2 = 0.85 by subtraction
This total residual risk results from the entire set of vulnerabilities and their attached threats, whether countermeasured (to circumvent the risk) by a preventive measure such as a firewall against a hacker threat, or not. We obtain the final risk, using equation (2), by multiplying the total residual risk by the criticality factor. As explained earlier, if the criticality is zero, there is no final risk. If this equipment is critical for your job or school or the
nation, as in the event of a threat to a nuclear power plant, you attach to it a high criticality factor, such as 1.0. For the home PC example above, using equation (2) with a criticality factor of 0.4 gives a final risk of 0.09575, or 9.58%. Using equation (3) with a sample invested capital cost of $2500, the expected cost of loss due to the final risk is $239.38:

final risk = residual risk × criticality = (0.239375)(0.4) = 0.09575    (4)

ECL = final risk × capital cost = (0.09575)($2500) = $239.38    (5)
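To see the verification mechanism at work, here is a minimal Monte Carlo sketch in Java for just the v1 slice of Table 3.1; the full study repeats the same sampling for all five vulnerabilities and converges on the $239.38 figure, and the seed and run count here are arbitrary choices:

    import java.util.Random;

    // Sketch of the Monte Carlo check for the v1 branches of Table 3.1:
    // v1 ~ U(0.1, 0.3); t1 ~ U(0.1, 0.6), t2 = 1 - t1 (by subtraction);
    // LCM1 ~ U(0.1, 0.5), LCM2 ~ U(0.2, 0.6).
    public class MonteCarloMeter {
        static double uniform(Random r, double a, double b) {
            return a + (b - a) * r.nextDouble();
        }

        public static void main(String[] args) {
            Random r = new Random(42);
            int runs = 5000;
            double sum = 0.0;
            for (int i = 0; i < runs; i++) {
                double v1 = uniform(r, 0.1, 0.3);
                double t1 = uniform(r, 0.1, 0.6);
                double t2 = 1.0 - t1;                  // by subtraction
                double lcm1 = uniform(r, 0.1, 0.5);
                double lcm2 = uniform(r, 0.2, 0.6);
                sum += v1 * t1 * lcm1 + v1 * t2 * lcm2;
            }
            // Expected value for this slice: 0.2*(0.35*0.3 + 0.65*0.4) = 0.073,
            // i.e., v1's contribution to the total residual risk of 0.239375.
            System.out.printf("simulated residual risk of v1 slice = %.5f%n",
                              sum / runs);
        }
    }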
A Monte Carlo simulation study produced $239.377, compared to the expected value of $239.38, after the first 5000 runs using equations (2) and (3). The difference between the expected (theoretical) result of Figure 3.3 and the MC simulation is negligible. The purpose of the Monte Carlo simulation, explained later, is to mimic actual operation, verify the theoretical results, and come up with a realization. A multiplier of 1.0 is used for the shadow cost or ripple effect, to indicate that there will be no repercussions beyond the known value of the asset.

3.1.5 Modifying the Quantitative Model for Qualitative Data

In the event that we do not possess purely quantitative values for each of the attributes in the decision tree diagram of Figure 3.2, and all we have are qualitative adjectives such as H (high; often), M (medium; sometimes), or L (low; seldom), we need to modify our approach as shown in Figure 3.4. We can then use the probabilities of H, M, and L [i.e., P(H), P(M), and P(L)] as long as the addition rule of unity holds for disjoint events (the second and third laws of probability). Such outcomes of vulnerability (first branch of the tree diagram) or threat (second branch) may include H + L = 1, or M + L + L = 1, or L + L + L + L = 1, where H = 0.75, M = 0.5, and L = 0.25, and where for simplicity we have dropped the probability symbol P. Another feasible possibility is H + L = 1, or M + L + L + L = 1, or 5L = 1, to signify at most five outcomes for either a vulnerability or a threat variable, where H = 0.8, M = 0.4, and L = 0.2. For up to five vulnerabilities with H + M = 1, or M + L + L + L = 1, or 5L = 1, we have H = 0.6, M = 0.4, and L = 0.2. If, for example, as shown in Figure 3.4, H = 0.75, M = 0.5, and L = 0.25, the total risk = HHL + HLL + LMH + LLH + LLH = HHL + 3LLH + LMH = (0.75²)(0.25) + (3)(0.25²)(0.75) + (0.25)(0.5)(0.75) = 0.140625 + 0.140625 + 0.09375 = 0.375, a 37.5% risk of losing the system's security. Each branch now carries a letter to signify a certain quantity for the vulnerability, threat, or countermeasure, where the four fundamental laws of probability on p. 122 hold.

3.1.6 Hybrid Security Meter Model for Both Quantitative and Qualitative Data

If we do not possess purely quantitative data, and all we have is a hybrid of quantitative risk values (i.e., probabilities between 0 and 1) and qualitative attributes such as H, M, or L, the purely qualitative model of Figure 3.4 transforms into the hybrid model shown in Figure 3.5.
FIGURE 3.4 Purely qualitative decision tree diagram modified on the security meter. [The branches carry the attributes H, M, and L in place of numerical probabilities; the LCM leaves are H*H*L, H*L*L, L*M*H, L*L*H, and L*L*H.]
FIGURE 3.5 Modified hybrid decision tree diagram of the security meter. [Some branches carry the attributes H and L while others carry numerical probabilities (0.2/0.8, 0.35, 0.3/0.7, and 0.4/0.6); the LCM leaves include H*H*L, H*L*L, L*0.4*0.2, L*0.35*0.3, and L*L*0.4.]
There will be some branches with letters expressing uncertainty (represented by probabilities according to the fundamental laws of probability) and some quantitative probability values obtained, with certainty, from past data monitoring. Figure 3.5 differs from Figure 3.4 in that the branches may carry both qualitative attributes and quantitative values for the same vulnerability or threat variable. This model is quite feasible if the analyst abides by the laws of probability as well as respecting the details of the low, medium, or high attributes for the problem under scrutiny. There will be compromises to obey the security meter principles, but that is a small price to pay for the numerical assessment of security risk through a simple and trustworthy security meter model. We can combine a mixture of these inputs as long as the
fundamental laws of probability hold. This necessity arises when a risk analyst is not sure about the risk values but can identify only certain quantitative risks combined with uncertain adjectives such as high, medium, or low. For example, we may have H + 0.25 = 1, or M + 0.25 + 0.25 = 1, or 0.3 + 0.2 + L + L = 1, or 4L = 1, where H = 0.75, M = 0.5, and L = 0.25. In Figure 3.5, for some branches the risk probabilities are known, and for others H, M, or L is given, as long as the fundamental laws of probability hold. For Figure 3.5's example, H = 0.75 and L = 0.25 hold true and M is not used. Then the total risk = HHL + HLL + L(0.4)(0.2) + L(0.35)(0.3) + LL(0.4) = (0.75²)(0.25) + (0.75)(0.25²) + (0.25)(0.4)(0.2) + (0.25)(0.35)(0.3) + (0.25²)(0.4) = 0.140625 + 0.046875 + 0.02 + 0.02625 + 0.025 = 0.25875, or a 25.9% risk. As for the qualitative or hybrid model, there may be limitations on the number of vulnerability or threat outcomes according to the choice of estimated values for H, M, and L, reflecting the best educated guess. The analyst may sometimes have to go an extra step and choose H (high), M (medium), L (low), and W (rare). For example, where 8W = 1, M + 3L = 1, and H + 2W = 1, implying that H = 0.75, M = 0.40, L = 0.2, and W = 0.125, there may exist at most eight possible outcomes of the vulnerability or threat variable. This scenario is one of many feasible ones.

3.1.7 Simulation Study and Conclusions

The security meter's proposed mathematical accuracy is verified by a Monte Carlo statistical simulation study. Five thousand runs, one of which is shown in Figure 3.6 with a result of 24.85%, are conducted by generating random variables for each vulnerability, threat, and countermeasure.
FIGURE 3.6 Screenshot of a spreadsheet for the residual risk of the security meter simulation runs.
The security meter method then takes effect by multiplying the conditional probabilities along each branch, with respect to equation (1), Figures 3.2 and 3.3, and Table 3.1, to calculate the residual risks and sum them into the total residual risk. The average of a selected number of run cycles, such as 10,000, yields the final Monte Carlo result after 50 million runs. Equations (2) and (3) are then used to reach the final risk and cost. Figure 3.6 displays the input data for v(a, b), t(a, b), and CM(a, b), which are taken to be uniformly distributed, U(a, b). The lower and upper bound values for the last windows, in the case of the fifth vulnerability or the second or third threats in Figure 3.6, are left blank, as the software will complement them to 1.0 to obey the fundamental probability law of addition; otherwise, it will reject the random deviate and seek a new one. The budgetary portfolio at the end of such quantitative analyses is an asset. In this hypothetical, educational example, $239.38 is needed for proactive defense or to repair damage after the fact. Figure 3.6 shows the final Monte Carlo simulation result for the final risk and the monetary extent of the physical damage.

An industrial product, carried through a security meter (SM) project from the design stage into the application stage for company or end-user benefit, is envisioned. Security is a process under construction, not a product, but we need accurate and reliable products in order to calculate security quantitatively, thus improving security inch by inch rather than word by word, as is done conventionally. A ubiquitous use of this practical technique would be to install security meter software on everyone's PC, accompanied by the required data bank or repository, to provide a daily report. Your home or office PC's security index (%) ranking will provide a framework for relative improvement. A married couple enrolled in my course at Troy University in 2002 would have been helped had they taken the security meter concept seriously. A week before they suffered a hard drive crash and lost all their cyber-belongings, their security meter class project had indicated a relatively high (60%) residual risk. They thought this was the course instructor's fantasy, but it became a stark reality. Evidently, they had no firewall or virus protection software against well-known vulnerabilities, and had plenty of threats based on their home PC's real data.

Inclusion of a scheme that caters to a dynamic criticality rating of an information system, or an asset valued with respect to time rather than by a static factor, is very realistic to consider. The system asset may be more critically valuable in monetary terms on certain days of the month, such as the first three or last two days, than on others. Also, a time-dependent shadow or opportunity-cost (ripple-effect) factor needs to be integrated as a multiplier into the calculations. If one loses an asset worth, say, $5000, the lost opportunity may amount to more than the face value alone; it may be 10 times as much, $50,000, due to lost opportunities, or shadow and ripple effects. The opportunity cost factor may be time critical, based on the asset's utility fluctuations. The same idea holds for a security meter functioning as a dynamic time-flowing indicator, not as a deterministic constant: no matter how granulated the per-unit time may appear, the war between threats and countermeasures with changing vulnerabilities will offer an SM(t) snapshot.
It may also be that more than a single security meter prevails in a system of multiple components. Therefore, various buckets of vulnerabilities and threats (rather than the singly selected ones of an idealized security meter model) can be treated in a dynamic time flow, as happens in real time.
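Before moving on, the qualitative and hybrid tallies of Sections 3.1.5 and 3.1.6 are easy to make concrete; a minimal Java sketch, using the letter values H = 0.75, M = 0.5, and L = 0.25 and the branch products of Figures 3.4 and 3.5:

    // Sketch of the qualitative evaluation of Section 3.1.5: map the adjectives
    // H, M, L to the probabilities used in the text and sum the leaf products.
    public class QualitativeMeter {
        static final double H = 0.75, M = 0.5, L = 0.25;

        public static void main(String[] args) {
            // Leaf products of Figure 3.4: HHL + HLL + LMH + LLH + LLH.
            double qualitative = H*H*L + H*L*L + L*M*H + L*L*H + L*L*H;
            System.out.printf("qualitative total risk = %.6f%n", qualitative); // 0.375

            // Hybrid tree of Figure 3.5: some branches carry numbers directly.
            double hybrid = H*H*L + H*L*L + L*0.4*0.2 + L*0.35*0.3 + L*L*0.4;
            System.out.printf("hybrid total risk = %.6f%n", hybrid);           // 0.25875
        }
    }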
3.2 BAYESIAN APPLICATIONS FOR PRIORITIZING SOFTWARE MAINTENANCE

Nutshell 3.2 Quantitative risk figures are needed to compare alternatives objectively and to quantify monetary measures for budgeting to reduce or minimize the existing risk. They are also needed to determine the costs of maintaining the vulnerabilities or weaknesses of the software. Maintenance priorities can be assessed by using the security meter technique combined with Bayesian procedures. Some examples are cited from hypothetical applications, since the simpler the examples, the easier it is to comprehend the philosophy behind the maintenance-priority problem.

3.2.1 Motivation

Software maintenance is the general process of making changes to improve a system after its installation [23]. This strategy does not generally involve major architectural modifications. Software maintenance can involve (1) the repair of software faults, ranging from coding errors, which are the cheapest to fix, to design errors, which are more expensive, and finally to requirement errors, which are the most expensive; (2) the adaptation of software to a new operating environment; and (3) adding to or modifying the system's functionality due to internal and external factors, such as changing laws and changing markets or business structures. The method we propose addresses the first two items, corrective and adaptive action, by providing a quantitatively comparative risk assessment technique. Software maintenance consumes 60 to 80% of most companies' software budgets and is thus the largest single contributor to high software costs [24]. Moreover, growth in system size averages 10% per year, and maintenance expenses generally increase as systems age. Therefore, research efforts have begun to integrate design and maintenance management policies to reduce unanticipated side effects [25]. Corrective maintenance identifies and corrects software performance and application failures, whereas adaptive maintenance adjusts the software to conform to new data requirements or processing environments, minimizing the functionality risk that arises when the environment changes. Traditionally, maintenance cost does not measure the future expected loss due to failures, only the historical cost of fixing the software. That is, subjective judgments should be supported by quantitatively objective risk assessments to determine not only the proper type of maintenance but also where efforts should be focused [24,25]. But to
date, there has been little theoretical support for these assessments. The security meter approach can provide a quantitative comparison and inform the analyst of a budgetary portfolio, paving the way for prioritizing, maintaining, or replacing a module to determine the most cost-effective maintenance strategy [1,7–11].

3.2.2 Bayesian Rule in Statistics and Applications for Software Maintenance

Risk analysis simulation is used to analyze a problem and determine the budget needed to cover the expected cost of loss. Risk analysis has various inputs, such as types of vulnerability and threat and a countermeasure for each threat. Criticality and utility (capital) cost are constants, as is the number of simulation runs. From these simplest input values, the output cost to mitigate the residual risk is determined (see Table 3.2):

residual risk = 4(0.5)(0.5)(0.5) = 0.5    (6)

final risk = residual risk × criticality = (0.5)(0.5) = 0.25, where criticality = 0.5    (7)

expected cost = utility cost × final risk = ($1000)(0.25) = $250, where utility cost = $1000    (8)
Using a single shot for one simulation trial, as in Figures 3.7 and 3.8, let's apply Bayesian principles to determine the vulnerability that requires the most maintenance. Let's ask a Bayesian type of question as it relates to our maintenance problem: What is the probability that the office computer software risk is due to the server (e.g., fire, system down) or due to e-mail (e.g., virus, hacking)? Another example of a similar tree diagram is presented in Figure 3.7. Let

P(A) = P(vulnerability) = 0.05, P(B | A) = P(threat | vulnerability) = 0.017, P(C | A, B) = P(CM | T and V)    (9)

Now let's go over the Bayesian rule [26]. If A, B, and C are any events whose probabilities are not 0 or 1, then

P(A | B) = P(A ∩ B)/P(B) = P(B | A)P(A)/P(B)    (10)

TABLE 3.2 Vulnerability–Threat–Countermeasure Spreadsheet for a PC

Vulnerability: Server (0.5)
  Threats: 1. Fire (0.5); 2. System down (0.5)
  Countermeasures: 1. Smoke detectors (0.5); 2. In-house generator (0.5)
Vulnerability: E-mail (0.5)
  Threats: 1. Virus (0.5); 2. Hacking (0.5)
  Countermeasures: 1. Antivirus software (0.5); 2. Firewall (0.5)
FIGURE 3.7 Simplest tree diagram for two threats for each of the two vulnerabilities, as in Table 3.2, with varying input data. [Branches: A = 0.05 with B = 0.017 (CM split 0.6/0.4) and B^C = 0.983 (CM split 0.7/0.3); A^C = 0.95 with B = 0.0001 (CM split 0.8/0.2) and B^C = 0.9999 (CM split 0.9/0.1). LCM leaves: A·B·C^C = 0.00034, A·B^C·C^C = 0.014745, A^C·B·C^C = 0.000019, A^C·B^C·C^C = 0.0949905; output: total residual risk ≈ 0.11.]
where

P(B) = P(A)P(B | A) + P(A^C)P(B | A^C)    (11)

and then

P(A | B) = P(A ∩ B)/[P(A)P(B | A) + P(A^C)P(B | A^C)]    (12)

Extending the analysis from two events to three, in cascading form,

P(C) = P(A)P(B | A)P(C | B, A) + P(A)P(B^C | A)P(C | A, B^C) + P(A^C)P(B | A^C)P(C | B, A^C) + P(A^C)P(B^C | A^C)P(C | A^C, B^C)    (13)

Applying the probabilities shown in Figure 3.7 yields

P(B) = (0.05)(0.017) + (0.95)(0.0001) = 0.00085 + 0.000095 = 0.000945    (14)

P(C) = (0.05)(0.017)(0.6) + (0.05)(0.983)(0.7) + (0.95)(0.0001)(0.8) + (0.95)(0.9999)(0.9) = 0.00051 + 0.034405 + 0.000076 + 0.8549145 = 0.8899055    (15)

P(C^C) = 1 − P(C) = 0.1100945    (16)
For the a posteriori (i.e., Bayesian) probability calculations: given the risk event C^C, what percentage comes from vulnerability source A and what percentage from source A^C?

P(A | C^C) = [P(A)P(B | A)P(C^C | B, A) + P(A)P(B^C | A)P(C^C | A, B^C)]/P(C^C) = [(0.05)(0.017)(0.4) + (0.05)(0.983)(0.3)]/0.1100945 = 0.1370    (17)

P(A^C | C^C) = 1 − 0.1370 = 0.8630    (18)
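A minimal sketch of this posterior computation in Java, plugging in the branch probabilities quoted with equations (14) to (17):

    // Sketch of equations (13)-(18): trace what share of the residual risk
    // event C^c comes from vulnerability A versus A^c.
    public class BayesPriority {
        public static void main(String[] args) {
            double pA = 0.05,  pBgA = 0.017,   pBcgA = 0.983;    // branch of A
            double pAc = 0.95, pBgAc = 0.0001, pBcgAc = 0.9999;  // branch of A^c
            // P(C | ...) = probability the threat is countermeasured on each leg.
            double pCgAB = 0.6, pCgABc = 0.7, pCgAcB = 0.8, pCgAcBc = 0.9;

            double pC = pA * pBgA * pCgAB + pA * pBcgA * pCgABc
                      + pAc * pBgAc * pCgAcB + pAc * pBcgAc * pCgAcBc;  // eq (15)
            double pCc = 1 - pC;                                        // eq (16)

            // Posterior share of the residual risk traced to vulnerability A.
            double pAgCc = (pA * pBgA * (1 - pCgAB)
                          + pA * pBcgA * (1 - pCgABc)) / pCc;           // eq (17)
            System.out.printf("P(C^c) = %.7f, P(A | C^c) = %.4f%n", pCc, pAgCc);
            // Prints P(C^c) = 0.1100945 and P(A | C^c) = 0.1370.
        }
    }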
Knowing that vulnerability A initially contributed 5% of the failures and now contributes 13.70% of the risk, it needs to be explored carefully. The other vulnerability, A^C, which a priori contributed 95%, now causes 86.30% of the risk and thus has relatively less effect on the overall risk picture than its prior suggests. Therefore, vulnerability A needs to have the highest maintenance priority. Let's look next at another example in Figure 3.8, where the screenshot shows one realization of the Monte Carlo simulation. We need to find the following Bayesian posterior probabilities:

P(fire | risk) = 0.058375/0.468838 = 0.1246    (19)

P(system down | risk) = 0.070589/0.468838 = 0.1506    (20)

P(virus attack | risk) = 0.154611/0.468838 = 0.3298    (21)

P(hacking attack | risk) = 0.185262/0.468838 = 0.3950    (22)

FIGURE 3.8 Screenshot for Table 3.2 in the Monte Carlo simulation study.
From these probabilities it is evident that the posterior risk due to chance failures of the first vulnerability is 0.1246 + 0.1506 = 0.2752, or 27.52%, whereas the prior contribution of chance failures at the very beginning was less: 0.2654, or 26.54%, in Figure 3.8. The prior contribution of the malicious-failure vulnerability, on the other hand, was 0.7345, or 73.45%, and its posterior contribution turned out to be 0.3298 + 0.3950 = 0.7248, or 72.48%. What this means is that although malicious causes of the second vulnerability constitute 73.45% of the totality of failures, these causes generate 72.48% of the risk. The implication is that more stringent software maintenance is required for the first vulnerability than for the second. Also, the threat of a hacking attack (39.5%) is more severe than the threat of a virus attack (33%). For corrective maintenance, two remedial measures are feasible, in order of applicability [10]. (The exclamation marks show "first priority" maintenance, as in the last column of Figure 3.3.)

1. We need to work to improve the countermeasures for vulnerability A, noting especially that the "system down" threat is greater than the fire threat (i.e., 15.06% > 12.46%) in this example.
2. After preventive or corrective measures are taken regarding the vulnerability with the highest priority, the security meter analysis must be rerun to compute the updated posterior Bayesian probabilities. This is to see whether any improvement is recorded, by comparing the expected costs of loss pre- and postmaintenance.

3.2.3 Another Bayesian Application for Software Maintenance

Using a single shot for one simulation trial, shown in Table 3.3 and assumed to be a hypothetical example, let's use the Bayesian approach to determine the vulnerability that will require the most maintenance. What is the probability that the office computer software risk is due to chance (e.g., design error, system down) or is malicious (e.g., virus, hacking)? Statistically, we need to find the following Bayesian probabilities:

P(design error | risk) = 0.097097/0.506371 = 0.1917    (23)

P(system down | risk) = 0.118578/0.506371 = 0.2341    (24)

P(virus attack | risk) = 0.151268/0.506371 = 0.2987    (25)

P(hacking attack | risk) = 0.139429/0.506371 = 0.2755    (26)
TABLE 3.3 One Simulation Result for the Security Meter Example in Table 3.2

Vulnerability (a, b; random value)              Threat (a, b; random value)               LCM_L  LCM_U  LCM Random Value  Risk
Chance failure A (0.2, 0.8; 0.464538)           Design error (0.4, 0.6; 0.496143)         0.4    0.6    0.421288          0.097097
Chance failure A                                System down (by subtraction; 0.503857)    0.4    0.6    0.506611          0.118578
Malicious failure B (by subtraction; 0.535462)  Virus (0.4, 0.6; 0.490508)                0.4    0.6    0.575932          0.151268
Malicious failure B                             Hacking (by subtraction; 0.509492)        0.4    0.6    0.511076          0.139429

Total residual risk: 0.506371
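A small Java sketch of the posterior ratios in equations (23) to (26): each branch risk in Table 3.3 divided by the total residual risk.

    // Sketch of equations (23)-(26): posterior threat shares from Table 3.3.
    public class TablePosteriors {
        public static void main(String[] args) {
            String[] threat = {"design error", "system down", "virus", "hacking"};
            double[] risk = {0.097097, 0.118578, 0.151268, 0.139429}; // Table 3.3
            double total = 0.0;
            for (double x : risk) total += x;        // total residual risk, 0.506371
            for (int i = 0; i < risk.length; i++)
                System.out.printf("P(%s | risk) = %.4f%n", threat[i], risk[i] / total);
            // Shares match equations (23)-(26) up to rounding.
        }
    }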
From these Bayesian posterior probabilities it is clear that the posterior risk due to chance failures of the first vulnerability is 0.1917 + 0.2341 = 0.4258, or 42.58%. The premaintenance (prior) contribution of chance failures was greater: 0.4645, or 46.45%, whereas for malicious failures it was 0.5354, or 53.54%. The posterior contribution of the malicious failures turned out to be 0.2987 + 0.2755 = 0.5742, or 57.42%. What this means is that although malicious causes of the second vulnerability constitute 53.54% of failures, they generate 57.42% of the risk. The implication is that greater software maintenance is required on the second vulnerability than on the first. For corrective maintenance at this final stage, two remedial measures are feasible, in order of applicability:

1. We need to improve the countermeasures for vulnerability B, noting especially that virus attacks constitute the larger share of its threat (29.87% > 27.55%) in this example.
2. After preventive or corrective measures are taken regarding the vulnerability with the highest priority, the security meter analysis must be rerun to compute the updated Bayesian a posteriori probabilities, to see whether any improvement is recorded by comparing the expected costs of loss pre- and postmaintenance.

3.2.4 Monte Carlo Simulation to Verify the Bayesian Analysis Proposed

In our simulation for Table 3.2, where the risk analysis is done for a home office, 1000 simulation runs are conducted, with 5000 trials each, totaling 5 million runs. Results are calculated using a Java code. For each of the risk factors, such as vulnerability, threat, and lack of countermeasure, the assumed uniform density parameters, the lower limit a and the upper limit b, both between 0 and 1, are designated as input data. We generate uniformly distributed random variables for all attributes given a and b; the average of U(a, b) is (a + b)/2. The simulation screenshot is shown in Figure 3.8. It is observed that the expected residual risk (0.5) and the simulated residual risk (0.4999984) are almost identical after 5 million simulations. In addition to verification of the expected (theoretical) results, an advantage of simulation is that it yields a realization of a scenario or data set where none exists. This property was utilized above in the Bayesian calculations to prioritize the maintenance schedule.

3.2.5 Discussion and Conclusions

The proposed security meter approach provides a quick bird's-eye view of a component's or system's software security risk [1]. Some of the earlier techniques, such as attack trees, do not provide a probabilistically accurate overall picture. Vulnerabilities that need more surveillance can be ranked from most to least severe through Bayesian analysis. This is very useful for prioritization purposes, saving time and effort in the vast arena of software maintenance [10].
The model proposed is supported by a Monte Carlo simulation, which provides a purely quantitative alternative to the conventional qualitative models summarized in the appendixes. One assumes that the available vulnerability–threat–countermeasure input data are reliable, a concern that we study in Section 3.4.
3.3 QUANTITATIVE RISK ASSESSMENT FOR NONDISJOINT VULNERABILITIES AND NONDISJOINT THREATS

Nutshell 3.3 A Monte Carlo simulation study for the simplest statistical assumption illustrated the validity of the decision tree approach satisfactorily, citing some examples from hypothetical applications. In actual life scenarios, however, the components of the overall risk picture are nondisjoint (non-mutually exclusive) rather than purely disjoint. The earlier models designed for disjoint events are reformulated here for nondisjoint scenarios.

3.3.1 Motivation Behind the Disjoint Notion of Vulnerabilities and Threats

In the detailed treatment of the security meter as a novel quantitative risk assessment technique, all the vulnerabilities were assumed to be disjoint, as were the ensuing threats [1]. However, when the vulnerabilities of the quantitative security risk assessment are not perfectly separated (i.e., they are nondisjoint, or not mutually exclusive), a new probabilistic approach is needed to replace the special case of disjoint outcomes. The security meter's decision tree diagram has been reformulated in light of this reality [11].

3.3.2 Fundamental Probability Laws of Independence, Conditionality, and Disjointness

Here we continue the fundamental laws of probability begun in Section 3.1.3.

Law 5: If V1 and V2 are two independent events, P(V1 and V2) = P(V1)P(V2). For three events,

P(V1 and V2 and V3) = P(V1 ∩ V2 ∩ V3) = P(V1)P(V2)P(V3)    (27)

This is the general multiplication rule for independent events [22].

Law 6: If P(V1) > 0, the conditional probability of V2 given V1, where ∩ represents AND, is

P(V2 | V1) = P(V1 ∩ V2)/P(V1)    (28)
Law 7: It follows from the previous laws that P(V2 | V1) = P(V2) if V1 and V2 are independent.

Law 8: If V1 and V2 are two dependent events, P(V1 ∩ V2) = P(V1)P(V2 | V1). For three events,

P(V1 and V2 and V3) = P(V1)P(V2 | V1)P(V3 | V1 and V2)    (29)

This is the multiplication rule for dependent events [22].

Law 9: If V1 and V2 are two disjoint (mutually exclusive) events,

P(V1 ∪ V2) = P(V1) + P(V2)    (30)

For three events,

P(V1 or V2 or V3) = P(V1 ∪ V2 ∪ V3) = P(V1) + P(V2) + P(V3)    (31)

Law 10: If V1 and V2 are two nondisjoint (non-mutually exclusive) events,

P(V1 ∪ V2) = P(V1) + P(V2) − P(V1 ∩ V2)    (32)

For three events,

P(V1 ∪ V2 ∪ V3) = P(V1) + P(V2) + P(V3) − P(V1 ∩ V2) − P(V1 ∩ V3) − P(V2 ∩ V3) + P(V1 ∩ V2 ∩ V3)    (33)

This is the addition rule for nondisjoint events [22].

3.3.3 Security Meter Modified for Nondisjoint Vulnerabilities and Disjoint Threats

In Figure 3.9, V1 and V2 are given as disjoint; note that, for simplicity, the threat outcomes are also assumed to be disjoint. The modified diagram for the nondisjoint case is shown in Figure 3.10.
FIGURE 3.9 Simplest tree diagram for two threats for each of the two vulnerabilities. [Each LCM leaf contributes P(Vi)·P(Tj | Vi)·P(LCM | Vi, Tj), and the leaves are summed to give the output, the total residual risk.]
FIGURE 3.10 Modified tree diagram for nondisjoint vulnerabilities with disjoint threats. [The vulnerability layer uses the disjoint cells V1 ∩ V2^C (0.45), V1^C ∩ V2 (0.35), and V1 ∩ V2 (0.20); with P(Tj | ·) = 0.5 and P(LCM | ·) = 0.5 on every branch, the leaf risks are 0.1125, 0.1125, 0.0875, 0.0875, 0.05, and 0.05, summing to a total residual risk of 0.5.]
In a hypothetical example, as in Figure 3.10, let P(V1) = 0.65, P(V2) = 0.55, and P(V1 ∩ V2) = 0.2, so that P(V1 ∩ V2^C) = 0.45, P(V2 ∩ V1^C) = 0.35, and P(V1^C ∩ V2^C) = 0. In a Venn diagram setting, observe that the sets solely V1: (V1 ∩ V2^C), solely V2: (V2 ∩ V1^C), both V1 and V2: (V1 ∩ V2), and, if applicable, none of V1 and V2: (V1^C ∩ V2^C) are now mutually exclusive, or disjoint. Since P(V1) + P(V2) = 1.2 exceeds P(V1 ∪ V2) = P(V1) + P(V2) − P(V1 ∩ V2) = 1.0, V1 and V2 are not disjoint. Additionally, since P(V1 ∩ V2) = 0.2 is not equal to P(V1)P(V2) = (0.65)(0.55) = 0.3575, V1 and V2 cannot be independent. Two nondisjoint events may sometimes be independent, if this equality holds; if they are disjoint, they are absolutely dependent. Therefore, because V1 and V2 are both nondisjoint and nonindependent, as vulnerabilities may be in real life, the tree diagram of Figure 3.9 is no longer acceptable and is modified into Figure 3.10. The same rules still apply for the Monte Carlo simulation, since the modified sets, being disjoint, add to unity. In real life, nondisjoint and dependent vulnerabilities with nondisjoint threats occur frequently in the form of "buckets," where common events intersect. Note that for the disjoint threats given the vulnerabilities, P(T1 | V1) = 0.5, P(T2 | V1) = 0.5, P(T1 | V2) = 0.5, and P(T2 | V2) = 0.5; as before, P(LCM) = P(CM) = 0.5 for simplicity. One needs to formulate a similar working table for the threats when they are not disjoint. For a dichotomous threat scenario, one would have disjoint sets such as solely T1: (T1 ∩ T2^C), solely T2: (T2 ∩ T1^C), both T1 and T2: (T1 ∩ T2), and, if applicable, none of T1 and T2: (T1^C ∩ T2^C). This approach is finally generalized to n > 2 vulnerabilities or threats. For n = 3, as in Moore and McCabe's coffee, tea, and cola drinkers problem [26, p. 355], the disjoint sets constitute the following: solely V1: (V1 ∩ V2^C ∩ V3^C), solely V2: (V2 ∩ V1^C ∩ V3^C), solely V3: (V3 ∩ V1^C ∩ V2^C), solely V1
and V2: (V1 ∩ V2 ∩ V3^C), solely V1 and V3: (V1 ∩ V3 ∩ V2^C), solely V2 and V3: (V2 ∩ V3 ∩ V1^C), all of V1, V2, and V3: (V1 ∩ V2 ∩ V3), and, if applicable, none of V1, V2, and V3: (V1^C ∩ V2^C ∩ V3^C). All are now mutually exclusive sets modifying the tree diagram of Figure 3.9. Thus, the 2^n rule (i.e., 4 sets for n = 2 and 8 for n = 3) holds for the number of disjoint sets.

3.3.4 Security Meter Modified for Nondisjoint Vulnerabilities and Nondisjoint Threats

When the threat events are also not disjoint from each other, equation (32) prevails, as in Figure 3.11. Where T1 and T2 were initially given as disjoint, with P(T1) + P(T2) = P(T1 ∪ T2), the relevant conditional probabilities, such as P(T1 | T2) and P(T2 | T1), now come into play: P(T1 ∪ T2) = P(T1) + P(T2) − P(T1 ∩ T2), since the threats are no longer disjoint. If there are more than two outcomes, P(T1 ∪ T2 ∪ T3) = P(T1) + P(T2) + P(T3) − P(T1 ∩ T2) − P(T1 ∩ T3) − P(T2 ∩ T3) + P(T1 ∩ T2 ∩ T3). Note that here neither the vulnerability nor the threat outcomes are disjoint; see Figure 3.11 for the modified diagram. When two or three events are not disjoint, they may or may not be independent, by equation (27); but when disjoint, they are definitely dependent, given no null [P(φ) = 0] or sure [P(S) = 1] sets. In a hypothetical example, as in Figure 3.11 and for simplicity of calculation, let P(T1) = 0.65, P(T2) = 0.55, and P(T1 ∩ T2) = 0.2; then P(T1 ∩ T2 | V1, V2) = 0.2, P(T1 ∩ T2^C | V1, V2) = 0.45, P(T2 ∩ T1^C | V1, V2) = 0.35, and P(T1^C ∩ T2^C | V1, V2) = 0. In a Venn diagram setting, observe that the sets solely T1: (T1 ∩ T2^C), solely T2: (T2 ∩ T1^C), both T1 and T2: (T1 ∩ T2), and, if applicable, none of T1 and T2: (T1^C ∩ T2^C) are all now mutually exclusive, or disjoint.

FIGURE 3.11 Modified tree diagram for two nondisjoint vulnerabilities and two nondisjoint threats. [Each LCM leaf takes the form P(Vi cell)·P(Tj cell | V1, V2)·P(LCM | V1, V2, T1, T2), with P(LCM) = 0.5 throughout; the printed leaf risks sum to a total residual risk of 0.338.]
Therefore, because the nondisjoint properties of T1 and T2 as threats may occur in real life, the tree diagram of Figure 3.9 is no longer acceptable and is modified into Figure 3.11. The same rules still apply for the Monte Carlo simulation, since the disjoint sets are additive to unity. As before, P(LCM) = P(CM) = 0.5 for simplicity. This approach is finally generalized to n > 2 vulnerabilities and threats. For n = 3, again as in Moore and McCabe's coffee, tea, and cola drinkers problem [26, p. 355], the dependent but disjoint sets constitute the following: solely T1: (T1 ∩ T2^C ∩ T3^C), solely T2: (T2 ∩ T1^C ∩ T3^C), solely T3: (T3 ∩ T1^C ∩ T2^C), solely T1 and T2: (T1 ∩ T2 ∩ T3^C), solely T1 and T3: (T1 ∩ T3 ∩ T2^C), solely T2 and T3: (T2 ∩ T3 ∩ T1^C), all of T1, T2, and T3: (T1 ∩ T2 ∩ T3), and, if applicable, none of T1, T2, and T3: (T1^C ∩ T2^C ∩ T3^C). All of these are now mutually exclusive and dependent sets modifying the tree diagram of Figure 3.9. Thus, the 2^n rule holds for the number of disjoint threat sets, as it did for the vulnerabilities.

3.3.5 Discussion and Conclusions

For the security meter design to be effective, one assumes that vulnerability, threat, and countermeasure data are available and reliable. A statistical design, studied in the next section, must be devised to assist in reliable data collection and parameter estimation. In this section we studied, formulated, and incorporated the effect of lack of disjointness among the vulnerabilities or the threats themselves. The classical assumption of statistical disjointness, or mutual exclusiveness, no longer holds for most real-life problems. The difficulty of data collection and parameter estimation is a challenge for practitioners in the testing field. The budgetary portfolio, in terms of the expected cost of loss at the end of the proposed quantitative analyses, is an additional asset when comparing maintenance practices to assess an objective improvement over conventionally popular subjective routines.
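As a wrap-up of the disjointification idea of Sections 3.3.3 and 3.3.4, here is a minimal Java sketch that rebuilds the mutually exclusive vulnerability cells of Figure 3.10 from the nondisjoint inputs and accumulates the residual risk under the disjoint-threat simplification:

    // Sketch of Section 3.3.3: from P(V1), P(V2), and P(V1 AND V2), build the
    // disjoint cells of Figure 3.10 and sum the residual risk over the leaves.
    public class NondisjointMeter {
        public static void main(String[] args) {
            double pV1 = 0.65, pV2 = 0.55, pBoth = 0.20;       // as in Figure 3.10
            double pOnlyV1 = pV1 - pBoth;                      // V1 only = 0.45
            double pOnlyV2 = pV2 - pBoth;                      // V2 only = 0.35
            double pNeither = 1 - (pOnlyV1 + pOnlyV2 + pBoth); // 0 in this example
            double[] cells = {pOnlyV1, pOnlyV2, pBoth};

            double pT = 0.5, pLCM = 0.5;  // disjoint threats T1, T2; P(LCM) = 0.5
            double residual = 0.0;
            for (double cell : cells)
                residual += cell * pT * pLCM * 2;  // two threat branches per cell
            System.out.printf("neither = %.2f, total residual risk = %.2f%n",
                              pNeither, residual); // 0.00 and 0.50, per Figure 3.10
        }
    }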
3.4 SIMPLE STATISTICAL DESIGN TO ESTIMATE THE SECURITY METER MODEL INPUT DATA

Nutshell 3.4 The security meter design provides the conveniences of the quantitative form highly desired in the security world. The validity of the decision tree approach increases only if the input values fed into the security meter model are calculated correctly. This is possible only with a carefully crafted statistical design that mimics real-life events rather than a simply hypothetical situation. An empirical study is presented and verified by discrete event and Monte Carlo simulations. The design improves over time as more data are collected.
3.4.1 Estimating the Input Parameters in the Security Meter Model

Using an accurate statistical estimation design that mimics actual events, we can evaluate risk [27,31]. The challenge is to create a practical statistical data collection scheme to estimate a risk model's input parameters in terms of probabilities. In pursuit of a practical but accurate statistical sampling plan in which security breaches are recorded and risks estimated, let's review the relevant security principles briefly, one by one. Undesirable threats that take advantage of hardware and software weaknesses, or vulnerabilities, can cause the violation and breakdown of availability (readiness for use), integrity (accuracy), confidentiality, and nonrepudiation, as well as other aspects of software security such as authentication, privacy, and encryption [2,28]. If you keep the security meter's tree model as in Figures 3.1 to 3.3 and work from the final stage back toward the beginning, first and foremost, an attack happens. If there is no attack, there is no need for a security meter model and no need for security precautions or modeling. Earlier there were no breaches of cyber security because there were no computers, or rather, none were interconnected. We must therefore collect data on malicious attacks, both prevented and not prevented.

Let's start retrospectively with the quantities known. Suppose that an attack occurs and is recorded. On the other hand, we also have somehow to monitor attempts that did not turn into attacks. At the least, we need a percentage of failed attacks (preventions) versus successful attacks (penetrations): of 100 such attempts, we must determine how many succeeded, which provides an estimate of the percentage of LCM (lack of countermeasure). We can then trace the cause at the threat level retroactively in the tree diagram. Imagine that the firewall did not catch it, resulting in a virus attack, which reveals the threat exactly. As a result of this attack, whose threat is known, the e-mail system may be disabled; the vulnerability is thus the e-mail itself. We have completed the taxonomical "line of attack" on the tree diagram as illustrated in Figures 3.2 and 3.10, as well as in Tables 3.1, 3.2, and 3.3. The only difficult data to collect are those that would help us estimate events that do not happen.

Overall, we resort to the outcome-frequency approach. That is, of 100 such cyber attacks, which actually harmed the target operation maliciously in some manner? How many attacks were, or were not, prevented (countermeasured) by smoke detectors, antivirus software, or firewalls? Of the attacks not prevented by a CM device, how many were caused by threat 1, threat 2, and so on, of a certain vulnerability? We can then calculate the percentages of vulnerabilities A or B or C. The only way to calculate the count of CM preventions is either by guessing a healthy estimate of the prevention ratio (e.g., 1 to 5% of all attacks are prevented by CM devices) or by having sniffing software ready to count a probable attack detected prematurely, even if it does not result in actual harm. A sniffing event is feasible for a physical attack such as a fire, which is visible and thus can be prevented by a smoke detector. But how does one sniff an intangible virus, or a hacker who attempts to attack but does not quite make it to the end? At present, a partial answer to this question is to use
effective commercial tools or certain popular firewalls with which one can detect and quarantine, or simply remove, possible causes of a crash. Those detected can be counted as the number countermeasured, and those that cause the cyber crashes can be counted as the ones that could not be countermeasured. To this end, statistical techniques such as the ratio of responders to a poll versus nonresponders can be used [29, Chap. 13]. It is always a challenging research topic to estimate those polled anonymously who did not respond, since you do not have an accurate count of those who were polled: you do not know how many respondents-to-be actually received a questionnaire or were reached by hard (traditional) or soft (electronic) mail.

3.4.2 Statistical Formulas Used to Estimate Inputs in the Security Meter Model

We will employ the relative frequency (based on the law of large numbers) approach [22]. Let X be the total number of saves, or crashes prevented by a CM device, within a time unit such as a month or a year. Let Y be the number of unpreventable crashes that caused a breakdown for various reasons. Let's assume that a track analysis showed the following in an all-doubles 2 × 2 × 2 security meter model such as that in Figure 3.9 and Table 3.2. Of Y crashes, there were Y11(v1, t1) counts due to threat t1 and Y12(v1, t2) counts due to threat t2, all stemming from vulnerability 1. Further, it was determined that there were Y21(v2, t1) crashes due to threat t1 and Y22(v2, t2) crashes due to threat t2, all stemming from vulnerability 2. One could generalize this to Y(vi, tj) = Yij caused by the ith vulnerability and its jth threat. Similarly, one assumes that there were X(vi, tj) = Xij "saves" that could have happened on the ith vulnerability and its jth threat. Then

Y(no. of crashes) = Σi Σj Y(vi, tj) = Σi Σj Yij,  i = 1, 2, ..., I; j = 1, 2, ..., J    (34)

X(no. of saves) = Σi Σj X(vi, tj) = Σi Σj Xij,  i = 1, 2, ..., I; j = 1, 2, ..., J    (35)

Then we can find the probability estimates for the threats, P(vi, tj), by taking the ratios

Pij = (Xij + Yij)/(Yi + Xi) for a given i and j = 1, 2, ..., J, where Yi = Σj Yij and Xi = Σj Xij    (36)

It follows that for the probabilities of the vulnerabilities,

Pi = Σj (Xij + Yij) / Σi Σj (Xij + Yij),  i = 1, 2, ..., I; j = 1, 2, ..., J    (37)

Finally, the probability of LCM, P(LCMij) for i = 1, 2, ..., I and j = 1, 2, ..., J, is estimated as

P(LCMij) = Yij/(Yij + Xij) for a given i and j    (38)

P(CMij) = 1 − P(LCMij)    (39)
3.4.3 Numerical Example of the Statistical Design for the Security Meter Model

Assume two vulnerabilities and two threats in a CM–LCM setup as in Figure 3.9 and Table 3.2. X (total number of attacks detected, or crashes prevented) ≈ 360/year, where X11 = 98, X12 = 82, X21 = 82, X22 = 98. Y (total number of attacks undetected, or crashes not prevented) ≈ 10/year, where Y11 = 2, Y12 = 3, Y21 = 3, Y22 = 2. When we implement equations (34) to (39), we obtain

P11 (threat 1 probability for vulnerability 1) = (X11 + Y11)/(X11 + Y11 + X12 + Y12) = 100/185 = 0.54
P12 (threat 2 probability for vulnerability 1) = (X12 + Y12)/(X11 + Y11 + X12 + Y12) = 85/185 = 0.46
P21 (threat 1 probability for vulnerability 2) = (X21 + Y21)/(X21 + Y21 + X22 + Y22) = 85/185 = 0.46
P22 (threat 2 probability for vulnerability 2) = (X22 + Y22)/(X21 + Y21 + X22 + Y22) = 100/185 = 0.54

P1 (probability for vulnerability 1) = (X11 + Y11 + X12 + Y12)/(X11 + X12 + X21 + X22 + Y11 + Y12 + Y21 + Y22) = 185/370 = 0.5
P2 (probability for vulnerability 2) = (X21 + Y21 + X22 + Y22)/(X11 + X12 + X21 + X22 + Y11 + Y12 + Y21 + Y22) = 185/370 = 0.5

The probabilities of LCM and CM for the vulnerability–threat pairs in Figure 3.9 are

P(LCM11) = Y11/(X11 + Y11) = 2/100 = 0.02; hence, P(CM11) = 1 − 0.02 = 0.98
P(LCM12) = Y12/(X12 + Y12) = 3/85 = 0.035; hence, P(CM12) = 1 − 0.035 = 0.965
P(LCM21) = Y21/(X21 + Y21) = 3/85 = 0.035; hence, P(CM21) = 1 − 0.035 = 0.965
P(LCM22) = Y22/(X22 + Y22) = 2/100 = 0.02; hence, P(CM22) = 1 − 0.02 = 0.98
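The estimation chain of equations (34) to (39) is mechanical enough to script; a minimal Java sketch for the 2 × 2 × 2 counts above:

    // Sketch of equations (34)-(39): Xij = saves, Yij = crashes, indexed
    // by [vulnerability][threat], for the 2 x 2 x 2 design.
    public class InputEstimator {
        public static void main(String[] args) {
            double[][] X = {{98, 82}, {82, 98}};   // prevented attacks (saves)
            double[][] Y = {{2, 3}, {3, 2}};       // unprevented attacks (crashes)

            double grand = 0.0;
            for (int i = 0; i < 2; i++)
                for (int j = 0; j < 2; j++) grand += X[i][j] + Y[i][j];    // 370

            for (int i = 0; i < 2; i++) {
                double rowTotal = 0.0;
                for (int j = 0; j < 2; j++) rowTotal += X[i][j] + Y[i][j]; // 185
                System.out.printf("P(v%d) = %.2f%n", i + 1, rowTotal / grand);
                for (int j = 0; j < 2; j++) {
                    double pT = (X[i][j] + Y[i][j]) / rowTotal;            // eq (36)
                    double pLCM = Y[i][j] / (X[i][j] + Y[i][j]);           // eq (38)
                    System.out.printf("  P(t%d|v%d) = %.2f, P(LCM) = %.3f%n",
                                      j + 1, i + 1, pT, pLCM);
                }
            }
        }
    }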
We place the estimated input values for the security meter in Figure 3.12 to calculate the residual risk. Once you build the probabilistic model from the empirical data, as above, which should verify the final results, you can forecast or predict any taxonomic activity, whatever the number of vulnerabilities, threats, or crashes (Figure 3.13). For the study above, the total number of crashes is 10 of 370, a ratio of 10/370 = 0.027. Using this probabilistically accurate model, we can predict what will happen in a different setting or year for a given explanatory set of data.

FIGURE 3.12 Simplest tree diagram for two threats and two vulnerabilities. [Branches: V1 = 0.5 with T1 = 0.54, LCM = 0.02 (risk 0.0054) and T2 = 0.46, LCM = 0.035 (risk 0.00805); V2 = 0.5 with T1 = 0.46, LCM = 0.035 (risk 0.00805) and T2 = 0.54, LCM = 0.02 (risk 0.0054); output: total residual risk = 0.0269, or 2.69%.]

If a clue suggests to us 500 episodes of vulnerability V1, then by the avalanche effect we can fill in
all the other blanks, such as V2 = 500. Then (0.5405)(500) = 270.2 episodes of T1 and (0.4595)(500) = 229.7 of T2. Of the 270.2 T1 episodes, (0.02)(270.2) = 5.4054 are LCM outcomes, yielding 5.4 crashes; therefore, antivirus devices or firewalls have led to 264.8 preventions, or saves. Again, for T2 of V1 there are (0.035)(229.7) = 8.1081 crashes and (0.965)(229.7) = 221.6 saves. The same holds for V2 in this example, due to the symmetric data. See Figure 3.13 for 1000 attacks. If the asset is $2500 and the criticality constant is 0.4, the expected cost of loss is

ECL = residual risk × criticality × asset = (0.027)(0.4)($2500) = $27    (40)
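A short sketch of this forward prediction in Java; the probabilities are the rounded estimates above, so the crash counts differ slightly from the text's unrounded figures:

    // Sketch of the avalanche prediction: propagate 500 episodes of V1
    // down the estimated tree of Figure 3.12.
    public class Avalanche {
        public static void main(String[] args) {
            double episodes = 500;
            double t1 = episodes * 0.5405, t2 = episodes * 0.4595; // threat split
            double crashes1 = t1 * 0.02,  saves1 = t1 * 0.98;      // LCM11 branch
            double crashes2 = t2 * 0.035, saves2 = t2 * 0.965;     // LCM12 branch
            System.out.printf("t1: %.1f crashes, %.1f saves; t2: %.1f crashes, %.1f saves%n",
                              crashes1, saves1, crashes2, saves2);
            // Roughly 5.4 crashes / 264.8 saves and 8.0 crashes / 221.7 saves;
            // the text's 8.1081 uses the unrounded ratio 3/85.
        }
    }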
3.4.4 Discrete Event (Dynamic) Simulation

The analyst is expected to simulate a component, such as a server, 10 times from the beginning of the year (e.g., 1/1/2008) to the end of 1000 years (i.e., 1/1/3008), an 8,760,000-hour period, with a life cycle of crashes and saves. The input and output for the simulation of random deviates are given in Figure 3.14. At the end of this planned time period, the analyst fills in the elements of the tree diagram for a 2 × 2 × 2 security meter model as in Figure 3.12. Recall that the rates are the reciprocals of the means under the assumption of a negative exponential probability density function for the time to crash. For example, if λ = 98 per 8760 hours, the mean time to crash is 8760/98 = 89.38 hours. Use the input of Section 3.4.3 [30].

3.4.5 Monte Carlo (Static) Simulation

Using the information in Section 3.4.4, the analyst is expected to use the principles of Monte Carlo simulation to simulate the 2 × 2 × 2 security meter of Table 3.2 and Figures 3.9 and 3.12. One employs the Poisson distribution to generate rates for each leg in the tree diagram of the 2 × 2 × 2 model shown in Figure 3.15. The rates are given as the counts of saves or crashes annually. The necessary rates of occurrence for Poisson random value generation were given in the empirical data example above. For each security meter realization, one obtains a risk value and averages it over n = 10,000 runs in increments of 1000. Averaging over n = 1000 runs should give the same value as in Figure 3.15; using the same data, we get the same results [30].
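For the discrete-event view, a minimal Java sketch of the exponential interarrival mechanism; the seed, the single 98-per-year leg, and the simple event-counting loop are illustrative simplifications of the full 2 × 2 × 2 design:

    import java.util.Random;

    // Sketch of Section 3.4.4: with negative exponential time to crash,
    // rate = count per 8760-hour year and mean gap = 8760/rate hours.
    public class CrashClock {
        public static void main(String[] args) {
            Random r = new Random(7);
            double rate = 98.0 / 8760.0;     // lambda = 98 per year, per hour
            double horizon = 8_760_000.0;    // 1000 years, in hours
            long events = 0;
            double clock = 0.0;
            while (true) {
                clock += -Math.log(1 - r.nextDouble()) / rate; // exponential deviate
                if (clock > horizon) break;
                events++;
            }
            // Expect roughly 98 * 1000 = 98,000 crashes over the horizon.
            System.out.printf("mean gap = %.2f h, simulated events = %d%n",
                              1.0 / rate, events);
        }
    }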
FIGURE 3.13 (a) Estimation of the model parameters given the breakdown of attacks; (b) prediction.
FIGURE 3.14 Discrete event simulation results of the 2 × 2 × 2 security meter design.

FIGURE 3.15 Monte Carlo simulation results of the 2 × 2 × 2 security meter sampling design.
3.4.6 Risk Management Using the Security Meter Model

Once the security meter has been applied and the residual risk calculated, a security risk manager will want to know how much to spend on improving countermeasures (firewall, IDS, virus protection, etc.) to mitigate the risk. On the negative side, there is a cost accrued per 1% improvement of the CM, the only parameter of the model that one may alter voluntarily.
FIGURE 3.16 Risk management template to break even at $542.84 for a total 176% CM improvement. [Base server (asset = $8000, criticality = 0.40): total residual risk 27.64%, ECL $884.39. Improved server: total residual risk 10.00%, ECL $341.55. Delta ECL = $542.84, against improvement costs of $3.08 per 1% of CM gain.]
The average cost C per 1% will be known, covering personnel, equipment, and all other expenses. On the positive side, the expected cost of loss (ECL) decreases, a gain in ECL, as the software and hardware improvements are applied to the CM facilities. At the break-even point the pros and cons are equal, guiding the security manager on how to move to a better stage from that point on. In the base server of Figure 3.16, the policy requirement of mitigating the residual risk from 27.64% down to 10% or less in the improved server is illustrated through an optimization scheme. If the cost is C = $3.08 per unit-percent improvement in the CM, then for each improvement, such as increasing from 70% to 99% on the branch v1t1, (29)($3.08) = $89.32 is accrued. The sum total, (176%)($3.08 per 1%) = $542.84 in improvement cost, and the ECL gain, $884.39 − $341.55 = $542.84 from lowering the residual risk, are identical at break-even. This is an example of how the security meter can be used effectively for risk mitigation [54].
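A minimal arithmetic sketch of this break-even comparison in Java, using the figures quoted above:

    // Sketch of the break-even logic of Figure 3.16: the spend to raise CM
    // percentages is compared against the drop in expected cost of loss.
    public class BreakEven {
        public static void main(String[] args) {
            double costPerPercent = 3.08;      // dollars per 1% CM improvement
            double totalImprovement = 176;     // summed CM gain, in percent
            double spend = totalImprovement * costPerPercent;

            double eclBase = 884.39, eclImproved = 341.55;
            double gain = eclBase - eclImproved; // reduction in expected loss
            System.out.printf("spend = $%.2f, ECL gain = $%.2f%n", spend, gain);
            // The two sides are approximately equal at the break-even point
            // (the quoted figures carry rounding).
        }
    }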
3.4.7 Discussion and Conclusions

The incentives to evaluate security risks are sufficient that we should and could, rather than might, be making meaningful estimates. In this section we developed a new scientific way to estimate and infer probabilities: empirically, by observing the frequencies of outcomes and by calculating the losses associated with security outcomes [31]. In this way we are kept informed about the extent of the cost of bringing hardware and software systems to a desirable percentage of security. The difficulty of data collection and parameter estimation poses a challenge to practitioners in the testing field. The author has employed the concept of simple relative frequency, otherwise known as a counting technique [22]. Although we cannot predict the outcome of a random experiment, for large values of N (hence the law of large numbers) we can predict the relative frequency (the number of desirable events divided by the sample size) with which the outcome will fall within a desirable set [22]. Further, as time elapses, the sample size n approaches N and the relative frequency f approaches the axiomatic probability p, with the sampling error becoming negligible. One can then establish a family of statistical distributions, such as (truncated) symmetric normal or nonsymmetric gamma probability distribution functions, to fit the random variables of interest, the numbers of saves or crashes within a given time period in a given work environment, as the sample size increases. This is why this introductory sampling plan is significant in showing how to break the ice regarding security professionals' inertia, or disinclination to use quantitative designs.

Finally, the dynamic, time-dependent discrete event simulation of the security meter model verifies the suggested statistical sampling design and proves its validity; the same applies if one employs static, time-independent Monte Carlo simulation. We get the same result, 2.69%, in Figures 3.12, 3.14, and 3.15. For further research, the challenge lies in implementing this quantitative model: how to classify into taxonomies the counts of saves and crashes for a desired vulnerability–threat–countermeasure track in the security model. Finally, a risk management example [54] is added to show how the security meter model can be employed effectively to mitigate the residual risk in terms of real dollars. This is achieved by calculating a break-even point: when the total expenses accrued for improving the CM devices equal the positive gain in the expected cost of loss from lowering the residual risk. This practice will give risk managers a solid base from which to work toward risk mitigation. Simulation of cyber-breach activities can be emulated through the implementation of software projects that mimic the expensive, risky, and compromising real global conditions of information security in information management [32].

3.5 STATISTICAL INFERENCE TO QUANTIFY THE LIKELIHOOD OF LACK OF PRIVACY

Nutshell 3.5 In this section we analyze briefly the formulation of probability distribution functions for the estimation of lack of privacy. The privacy meter approach is time dependent. Examples are given to quantify and improve the risk of privacy through risk management.

3.5.1 Introduction: What Is Privacy?

Privacy is a concern because of the anxiety related to any perceived potential risk of coming to harm if information collected and stored is abused or misused.
Privacy violations cause possible negative and adverse consequences. Trust in privacy is based on the likelihood that information will not be abused [55]. The thin line between the commonsense rules of thumb of consent, transparency, and proportionality and the fair use of information, or its violation, is very difficult to identify, as the laws are not absolutely clear or internationally accepted [56,57]. Privacy is all about data protection, not about data restriction [58]. A breach of privacy, or information piracy, can be defined differently at varying locations and under varying conditions, including the time and circumstances that dictate the event. Protecting information privacy and the fair use of information are complementary, in that personal data must be protected from unauthorized exposure, and it must be ensured that this information will be used fairly in the economy as a pillar of corporate security [59]. Last but not least, some argue that a strong sense of security implies less personal privacy; others argue that security attacks could not happen without identity theft, which itself points to a lack of privacy [55]. The consensus, therefore, is that a sense of security is needed for the privacy of the general population in daily life. Security is thus the external shield of the internal world of privacy, and whereas security is tangible, privacy is generally intangible and abstract. So far, the quantification of privacy, or its lack, has been only at the level of spreadsheets and tabulations that provide averages, means, or percentages [60]. In this brief analysis we outline a technique for conducting a simple statistical inference to calculate and manage the likelihood of a lack of privacy. Only if the source permits can we quantify and estimate the likelihood of a breach of privacy. A real example follows [61–63].

3.5.2 How to Quantify Lack of Privacy

Given a set of data indicating privacy invasions, such as phishing, spamming, spoofing, or tampering, probability distribution functions are proposed for conducting a statistical inference. The objective is to estimate the probability (likelihood) of the number of breaches within a given period of time under the conditions encountered. Once the p.d.f. has been determined, the cumulative and survival probability functions can be estimated, permitting us to estimate the probability of encountering fewer or more than a given number of privacy breaches or incidents. Since the rate of breach is not constant throughout the time period of interest, and because breaches may occur in clumps or clusters rather than as single outcomes, the nonhomogeneous Poisson process arises as a special case. A computer code will illustrate how to calculate the probability likelihood (or exact density), then the cumulative probability, and finally the survival (the complement of the cumulative) probability, when the breaches within a cluster are assumed to be contagious (positively correlated) or uncorrelated. In repairable system reliability, repair actions take place in response to the failures observed, and the system is returned to the field as good as new. As explained in Chapter 1, a random or stochastic model may experience a constant failure rate (CFR), an increasing failure rate (IFR), or a decreasing failure rate (DFR). In a homogeneous Poisson process (HPP; simply called a Poisson process) there are no trends and the rate is CFR:
P[N(t) = n] = (λt)^n e^(−λt) / n!,    n = 0, 1, 2, . . . , ∞        (41)
where

E[N(t)] = Var[N(t)] = λt        (42)
If there are no trends in the failure data, the process is defined to be a renewal process, where the interarrival times may come from any i.i.d. (independent, identically distributed) Ti ∼ F(·), where F(·) is finite. For a nonhomogeneous Poisson process (NHPP), there are trends, such as a DFR or IFR, where

P[N(t) = n] = [Λ(t)]^n e^(−Λ(t)) / n!,    n = 0, 1, 2, . . . , ∞        (43)

with Λ(t) = ∫_0^t u(x) dx the mean value function of the intensity u(x).
The failure probabilities for an interval starting at s and ending at s + t are given by

P[N(t + s) − N(s) = n] = [∫_s^(t+s) u(x) dx]^n e^(−∫_s^(t+s) u(x) dx) / n!        (44)

where

E[N(t + s) − N(s)] = ∫_s^(t+s) u(x) dx        (45)
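As a brief illustration, the interval probability in equation (44) can be evaluated numerically. The sketch below is a minimal example in Python; the linearly increasing intensity u(x) (an IFR case) is our own hypothetical choice, and only the form of equations (44) and (45) is taken from the text.

    import math
    from scipy import integrate

    def u(x):
        # hypothetical increasing intensity (IFR case); not from the text
        return 0.5 + 0.1 * x

    def nhpp_interval_prob(n, s, t):
        # Eq. (45): expected number of breaches over (s, s + t]
        m, _ = integrate.quad(u, s, s + t)
        # Eq. (44): Poisson probability with that mean
        return m**n * math.exp(-m) / math.factorial(n)

    print(nhpp_interval_prob(3, s=1.0, t=2.0))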
Furthermore, more than a single breach may occur in an interval for the NHPP, where the number of events at each interval is represented by a compound Poisson process. That is, if the governing process is NHPP and the size of clusters is geometric with the forgetfulness property, the compound Poisson process is Poisson∧geometric. If the outcomes within a cluster are correlated, assuming a compounding p.m.f. of a logarithmic series, the compound Poisson process is defined to be a Poisson∧logarithmic series, or simply a negative binomial. These topics are studied in detail in Chapter 1.

3.5.3 Numerical Applications for a Privacy Risk Management Study

Given the following privacy breaches (phishing activity) at a national state agency for May–June 2006 on different days [64]: 14, 32, 28, 25, 25, 19, 24, 25, 22, 24, we wish to conduct a privacy likelihood analysis. Total (M) = 213; average (daily) = 23.7; variance (daily) = 26.25; q = variance/average = 1.11. We conduct this experiment using Poisson∧geometric (stuttering Poisson) and Poisson∧logarithmic series (NBD) models:

1. By assuming a Poisson∧geometric approach, where the outcomes in each cluster are assumed to be independent or uncorrelated in relation to each other, the following software results are obtained:
PG Output.txt    q = 1.11; Mean = 213.0; RHO = 0.052; LAMBDA = 201.89

  x     Density f(x)       Cumulative P(x)    Survival S(x)
 211    0.25848572E-01     0.46633222E+00     0.53366778E+00
 212    0.25946626E-01     0.49227885E+00     0.50772115E+00
 213    0.25934851E-01     0.51821370E+00     0.48178630E+00
 214    0.25813957E-01     0.54402766E+00     0.45597234E+00
 215    0.25586021E-01     0.56961368E+00     0.43038632E+00
 216    0.25254425E-01     0.59486810E+00     0.40513190E+00
 217    0.24823783E-01     0.61969189E+00     0.38030811E+00
 218    0.24299833E-01     0.64399172E+00     0.35600828E+00
 219    0.23689313E-01     0.66768103E+00     0.33231897E+00
 220    0.22999817E-01     0.69068085E+00     0.30931915E+00
If the company or agency sets a threshold that defines the risk of privacy violation, such as X = 220 breaches, then the "probability of equaling or exceeding 220" [i.e., P(X ≥ 220)] = 0.31, or 31%.

2. By assuming a Poisson∧logarithmic series (NBD) approach, where the outcomes in each cluster are assumed to be contagious (correlated positively), the following software results are obtained. These results are almost identical to those above.

NB Output.txt    q = 1.11; Mean = 213.0; XK = 0.1936E+04; P = 0.11000000E+00

  x     Density f(x)       Cumulative P(x)    Survival S(x)
 211    0.25849353E-01     0.46635510E+00     0.53364490E+00
 212    0.25947117E-01     0.49230222E+00     0.50769778E+00
 213    0.25935045E-01     0.51823726E+00     0.48176274E+00
 214    0.25813853E-01     0.54405111E+00     0.45594889E+00
 215    0.25585623E-01     0.56963674E+00     0.43036326E+00
 216    0.25253744E-01     0.59489048E+00     0.40510952E+00
 217    0.24822836E-01     0.61971332E+00     0.38028668E+00
 218    0.24298641E-01     0.64401196E+00     0.35598804E+00
 219    0.23687901E-01     0.66769986E+00     0.33230014E+00
 220    0.22998217E-01     0.69069807E+00     0.30930193E+00
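The negative binomial figures above can be checked with a few lines of code. The sketch below is a minimal example assuming SciPy's (n, p) parameterization of the negative binomial, with p = 1/q and n = Mean/(q − 1); these choices reproduce the tabulated XK ≈ 1936 and the survival value S(220) ≈ 0.309. It is an illustration only, not the author's privacy-meter software.

    from scipy import stats

    mean, q = 213.0, 1.11       # total count and variance-to-mean ratio from the data
    p = 1.0 / q                 # success probability in SciPy's nbinom
    n = mean / (q - 1.0)        # dispersion parameter; matches XK = 0.1936E+04

    print(n)                                # ~1936.4
    print(stats.nbinom.pmf(213, n, p))      # density f(213), ~0.0259
    print(stats.nbinom.sf(220, n, p))       # survival S(220), ~0.309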
Assume now that the company (e.g., a bank) or agency wishes to conduct a privacy risk mitigation study by buying certain anti-phishing or antipiracy software and contracting with a software security firm for auditing and probing. Has the company accomplished its goals by some specific later time? Suppose that overall the bank has spent $1 million to assure privacy risk mitigation. After the countermeasures are taken, the bank collects new data: 14, 32, 28, 25, 25, 19, 24, 22, 22, 4, and runs a new data analysis, with a positive correlation between the outcomes in each cluster [31,65]:
NB Output.txt    q = 3.19; Mean = 190.0; XK = 0.867E+02; P = 0.219E+01

  x     Density f(x)       Cumulative P(x)    Survival S(x)
 211    0.10511787E-01     0.81138208E+00     0.18861792E+00
 212    0.10135891E-01     0.82151797E+00     0.17848203E+00
 213    0.97602257E-02     0.83127820E+00     0.16872180E+00
 214    0.93858812E-02     0.84066408E+00     0.15933592E+00
 215    0.90138881E-02     0.84967797E+00     0.15032203E+00
 216    0.86452144E-02     0.85832318E+00     0.14167682E+00
 217    0.82807642E-02     0.86660395E+00     0.13339605E+00
 218    0.79213753E-02     0.87452532E+00     0.12547468E+00
 219    0.75678187E-02     0.88209314E+00     0.11790686E+00
 220    0.72207978E-02     0.88931394E+00     0.11068606E+00
The conclusion is that after the countermeasures are taken, P(X ≥ 220) = 0.11 = 11%. The risk defined by the bank for exceeding the threshold has gone down to 11% from an earlier 31%: a solid mitigation of 20 percentage points, amounting to a benefit of $2 million if each 1% slot on average signifies a benefit of $100,000 through avoided identity thefts. Overall, the bank profits $1 million from this transaction, since the $2 million benefit clearly exceeds the $1 million cost:

profit = benefit − cost = $2,000,000 (benefit) − $1,000,000 (improvement cost) = $1,000,000        (46)
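The break-even arithmetic in equation (46) is easy to script when exploring other threshold or cost scenarios. A minimal sketch, using only the figures quoted above:

    risk_before, risk_after = 0.31, 0.11   # P(X >= 220) before and after countermeasures
    value_per_point = 100_000              # $ benefit per 1% of risk avoided (given above)
    improvement_cost = 1_000_000           # $ spent on countermeasures

    benefit = (risk_before - risk_after) * 100 * value_per_point
    profit = benefit - improvement_cost
    print(round(benefit), round(profit))   # 2,000,000 and 1,000,000 dollars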
3.5.4 Discussion and Conclusions

In this section we studied how to quantify lack of privacy, similar to quantifying lack of security. The privacy meter is a mathematical-statistical inferential method through which the likelihood of a breach of trust is computed using compound Poisson processes. We then saw how privacy risk is managed and mitigated quantitatively through a solid budgetary approach. This approach is superior to the conventional descriptive or averaging privacy measures. A similar approach can be applied to time-dependent security risk estimation.

APPENDIX 3A: COMPARISON OF VARIOUS RISK ASSESSMENT APPROACHES AND CINAPEAAA

In this chapter, in which we study quantitative risk analyses, we have shown various methods to compute expected losses in the framework of statistical science and probability theory. In doing so, we based our analysis on the monetary values of the assets and the probabilities of the likelihood of vulnerabilities and attached threats. The results will be more scientific, more usable, and more reliable when the data supporting the models originate from trustworthy sources by actual experimentation, as studied in Section 3.4. There are other methods, such as fuzzy logic, attack trees, capability-based attack trees, and time-to-defeat
models, studied below in the appendixes, or data mining, that are outside our scope here. Blakley, McDermott, and Geer appropriately claim: "In business terms, a risk is the possibility of an event which would reduce the value of the business (an asset) were it to occur" [32]. In fact, Blakley et al. note the low use of quantitative methods in disciplines other than IT security, such as finance, health care, and safety. Today, organizations face a variety of "harming threats" from cyberspace that were unthinkable 15 or 20 years ago [33,34]. Risk assessment methods may be classified as conventional qualitative, unconventional quantitative, and recently, hybrid [1]. Landoll notes: "A quantitative approach to determining risk and even presenting security risk has the advantages of being objective and expressed in terms of dollar figures" [35]. Despite these advantages, decision makers tend to lean toward qualitative risk assessments, due to their ease of use and laxer input data requirements. A decision tree or diagram, which is gaining popularity in quantitative risk assessment, is a model of the evaluation of a discrete function wherein the value of a variable is first determined and the next action is chosen accordingly [1,36–39]. However, there is widespread reluctance to apply numerical methods. A primary reason for this reluctance is the difficulty of collecting trustworthy data regarding security breaches [40–44]. A collection of various works, including Bayesian techniques, is included to help readers focus on this dilemma of assessing risk: qualitatively, quantitatively, or combined [45–48]. "Data, data, data. . . ," says Wentzel, who favors Sahinoglu's security meter model as a solid way out of the confusion [49]. In qualitative risk analyses, which most conventional risk analysts prefer out of convenience, assets can be classified on a scale such as crucial or critical, very significant, significant, or not significant. Qualitative criticality, in turn, can be rated on a scale of "to be fixed immediately," "to be fixed soon," "should be fixed sometime," and "to be fixed if convenient." Vulnerabilities and associated threats can be rated on a scale of highly likely, likely, unlikely, or highly unlikely. On the subject of countermeasures and risk mitigation, the qualitative approach ranges from strong (high) to acceptable (medium) and unacceptable (low), as opposed to the probabilistic values proposed here. Among the well-known security models used to establish a security policy, the following are most popular [3]: the Bell–LaPadula model, the Biba model, the Chinese wall model, the Clark–Wilson model, the Harrison–Ruzzo–Ullman model, and the information flow (entropy-equivocation and lattice-based) models. Next, let's study the elements of CINAPEAAA: confidentiality, integrity, nonrepudiation, authentication, privacy, encryption, anonymity, availability, and audit. Confidentiality concerns the protection of sensitive information from unauthorized disclosure; information is not disclosed to unauthorized parties. That is, trust no one! For the concept of integrity, sometimes called accuracy, we watch that information is not altered by unauthorized parties and that records of alterations are not destroyed: assuming, moreover, that the data were correct at the beginning, and because all changes have been made correctly and accountably (usually required to maintain integrity), the data are still correct as they stand.
As for availability, information, such as operation time, component redundancy, or fault tolerance, should be available to the user when needed. In other words, even the most secure system is no good if we cannot accomplish the target mission. Availability, operational readiness against fire, damage, disaster, vandalism, and so on, is an indispensable operational feature, at least physically. Encryption algorithms are required to protect the confidentiality of data. An encryption algorithm or cipher enciphers the clear text through a crypto key, eK(X) denoting that the plaintext X is encrypted under key K. Then the decryption action dK(X) deciphers the cipher text to retrieve the plaintext (see more on this topic in Appendix 3B). Authentication is something you have, or you know, or you are, or a combination of these, as is true of a company password. In the case of nonrepudiation, the user cannot deny an operation that he or she has made. A few of the methods available are time stamps, a trusted third party, or an electronic signature. Privacy relates to information that is not disclosed without the consent of the subject. Anonymity means that identity information is not disclosed. Audit is defined as the maintenance, tracking, and communication of event information within the service, host, or network.

APPENDIX 3B: BRIEF INTRODUCTION TO ENCRYPTION, DECRYPTION, AND TYPES

On the wide topic of encryption, many resources explain the subject thoroughly, which is why we present only an introduction. The history of cryptography dates back to ancient Egypt, to India, Mesopotamia, and Babylon, and to central Asia. Encrypted messages were broken during the American Revolutionary War. The German Enigma machine, developed as early as 1918, was used beginning in 1926, but Polish, French, and British scientists cracked the code during World War II. Since World War II, computers have transformed the codebreaking process, leading to important contributions in military and intelligence applications [50]. Cryptography can be defined as the art or science of storing information in a form that allows it to be revealed only to those intended, hiding it from those not intended. Cryptology includes both cryptography and cryptanalysis. The original information is plaintext and the hidden information is cipher text. Encryption is the procedure of converting plaintext into cipher text by using an encryption engine (usually, a computer program). Decryption is the procedure employed to convert cipher text into plaintext by using a decryption engine (again, usually a computer program). Modern cryptographic systems use private and public key systems. Private (symmetric, secret, or single) key systems use a single key. An identical but separate key is necessary for each pair of users to exchange messages, and the sender–receiver pair must keep the key secret. While a user should keep his or her private key secret, a public key is known publicly. The private and public keys are related mathematically in a public key system. If a message is encrypted with a private key, the message can be decrypted by the recipient
using the public key. Similarly, anyone can send an encrypted message by encrypting it using the recipient's public key; the sender does not need to know the recipient's private key, and the message is decrypted using that private key. In a symmetric system, by contrast, the same "secret key" both encrypts and decrypts the information at stake, and one needs a separate secret key for each channel to accommodate. Managing a large number of secret keys can be cumbersome. This tool can provide authentication and access control but cannot provide verification. Congruence, which is part of the discrete math curriculum in computer science education through its use of modulus operations, often involves cryptology, the study of secret messages. Julius Caesar, one of the earliest cryptologists, created secret messages by shifting each letter ahead by three (i.e., C is sent as F, etc.), an early example of encryption. Caesar's encryption method can be represented by a function f that assigns to each nonnegative integer p ≤ 25 (labeling the 26 letters of the alphabet 0 through 25) the integer f(p) in the set {0, 1, 2, 3, . . . , 25}, with f(p) = (p + 3) mod 26 [53]. In encrypting the message MEET YOU SOON, we first replace letters with numbers:

12  4  4  19    24  14  20    18  14  14  13        (47)
Now replace each of the numbers p with f(p) = (p + 3) mod 26:

15  7  7  22    1  17  23    21  17  17  16        (48)
Translating this back to letters, one reads "PHHW BRX VRRQ." The process of finding the original message from the encrypted message is defined as decryption.
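As a quick illustration of the shift cipher just described, here is a minimal Python sketch; the function name and the handling of spaces are our own choices:

    def caesar(text, shift=3):
        # f(p) = (p + shift) mod 26, applied letter by letter; non-letters pass through
        out = []
        for ch in text:
            if ch.isalpha():
                p = ord(ch.upper()) - ord("A")
                out.append(chr((p + shift) % 26 + ord("A")))
            else:
                out.append(ch)
        return "".join(out)

    print(caesar("MEET YOU SOON"))       # PHHW BRX VRRQ
    print(caesar("PHHW BRX VRRQ", -3))   # decryption recovers MEET YOU SOON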
In asymmetrical (or public key) encryption such as RSA or El Gamal, there exist two complementary keys. What is encrypted with one private key has to be decrypted with the other key. Knowledge of one key should provide no information about the other key. Public keys are often linked to identity. They can provide access control, authentication, and verification through digital signatures. Public key encryption is based on a mathematical relationship between prime numbers and the computational difficulty of doing specific mathematical operations on large numbers, such as factoring. RSA, for example, depends on the difficulty of factoring large numbers. There are also digital signatures separate from encryption, which are more related to authentication and privacy. One-time signatures, such as El Gamal signatures, DSA, and RSA digital signatures, are among the most popular [52].

RSA (Rivest–Shamir–Adleman) Encryption Scheme

Two very large prime numbers are chosen; here we use small values, p1 = 3 and p2 = 11, to make the scheme easier to understand. We can obtain an exponent value from the following [28]:

(p1 − 1)(p2 − 1) + 1 = x        (49)
For our small prime numbers,

x = (3 − 1)(11 − 1) + 1 = (2)(10) + 1 = 21        (50)
Now multiply p1 by p2 to obtain a modulus value m; in our case, m = (3)(11) = 33. For any value v ranging from 0 to m − 1, the identity v = v^x mod m holds. Then we factor the exponent value such that factor 1 (f1) multiplied by factor 2 (f2) equals the exponent value. In our case,

f1 f2 = (3)(7) = 21 = x        (51)

Therefore, f1 = 3 and f2 = 7. One of the factors is chosen as our public key, the other as our private key. Selecting the smaller of the two keys as the public key makes life easier for the public. To encrypt a message, someone takes the known public key and uses it to encrypt the message using the formula

encrypted = plaintext^f1 mod m        (52)
For our simple example, let's say that the letter G is being encoded; to make the arithmetic easier, treat it as the seventh letter of our alphabet, so G is assigned the value 7. Then encrypted = 7^3 mod 33 = 13. To decrypt, you use the formula in reverse:

decrypted = encrypted^f2 mod m = 13^7 mod 33 = 7        (53)

We have our original message "G = 7" back. Even with small inputs, we work with numbers large enough to need a calculator. Imagine, when we use large prime numbers, how difficult it becomes to crack these ciphers.
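The toy RSA example above can be replayed directly with Python's built-in modular exponentiation; this is a minimal sketch of the scheme exactly as presented, not a production implementation:

    p1, p2 = 3, 11
    m = p1 * p2                     # modulus, 33
    x = (p1 - 1) * (p2 - 1) + 1     # exponent value, 21
    f1, f2 = 3, 7                   # factors of x: public and private keys

    msg = 7                         # the letter G
    enc = pow(msg, f1, m)           # 7^3 mod 33 = 13
    dec = pow(enc, f2, m)           # 13^7 mod 33 = 7, the original message
    print(enc, dec)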
El Gamal Encryption Scheme

In the El Gamal public key algorithm, which is probably the second most widely used public key cipher, the prime number is again a very large number. Let b ∈ Zp be a primitive base element or an integer of large order mod p. Let a be the private decryption key of user A, and let

y = b^a mod p        (54)

be the corresponding public encryption key, with a secret random number k ∈ Zp−1. Let M ∈ Zp be a message carrying an integer less than p. To send a message block M to A, the sender chooses a secret random k:

r = b^k mod p        (55)
s = b^(ak) mod p        (56)
t = M b^(ak) mod p        (57)

Denote decryption by D. Then, for r, t ∈ Zp−1,

D_k(r, t) = M b^(ak) b^(−ak) mod p = M        (58)
Example Using small numbers for convenience, let p (modulus) = 31, b (base) = 6, a (secret exponent) = 5, 1 < a < p − 1, and message M = 15
[the fifteenth letter in the English alphabet (O)]. Let's choose k = 2 as our random message key. Then [52]

y (public key) = b^a mod p = 6^5 mod 31 = 7776 mod 31 = 26        (59)
r = b^k mod p = 6^2 mod 31 = 5        (60)
s = b^(ak) mod p = 6^(5·2) mod 31 = 25        (61)
t = M b^(ak) mod p = (15)(6^(5·2)) mod 31 = 3        (62)
Here is how the procedure operates. To deliver a message 1 < M < p − 1, the sender chooses a random number 1 < k < p − 1 and then computes r and t, where b^(ak) is a message key, with one of its factors (k and a) known. The pair {r, t} encodes {k} in such a way that the private key can be used to compute the original message key {k} and recover the original message {M}. Therefore, our message encrypts into {r, t} = (5, 3). Observe that a single number, M = 15, encrypts into two distinct numbers; this doubling of size is a major disadvantage of El Gamal. To decrypt D2(r = 5, t = 3):

r^(−a) t mod p = [(5^(−5))(3)] mod 31 = [(25^5)(3)] mod 31 = [(9,765,625)(3)] mod 31 = 29,296,875 mod 31 = 15 = M (back!)        (63)
Note that 5 and 25 are inverses mod 31 [since (5)(25) mod 31 = 1]; that is, 25 is what multiplies 5 to give 1 mod 31 [51,52].
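The whole El Gamal example can be reproduced in a few lines; this is a minimal sketch of the textbook numbers above, where computing the modular inverse r^(−a) via Fermat's little theorem is our own implementation choice:

    p, b = 31, 6          # public modulus and base
    a = 5                 # A's private decryption key
    y = pow(b, a, p)      # public key y = b^a mod p = 26

    M, k = 15, 2          # message 'O' and the sender's random key
    r = pow(b, k, p)      # 5
    t = (M * pow(y, k, p)) % p        # M * b^(ak) mod p = 3

    # decryption: M = r^(-a) * t mod p, with r^(-a) = r^(p-1-a) mod p by Fermat
    M_back = (pow(r, p - 1 - a, p) * t) % p
    print(r, t, M_back)   # 5 3 15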
APPENDIX 3C: ATTACK TREES

Threats are usually defined as malicious actions carried out by "bad guys," who exploit vulnerabilities to destroy assets. One way to identify threats is to categorize them by the damage done to assets; Howard and LeBlanc have listed them in a number of categories [19]. Attack trees provide a formal, structural way to describe system security, based on various "sensible" attacks [4]. Basically, you represent attacks against a system in a tree structure, with the goal as the root node and ways of achieving that goal as leaf nodes. How do you create such an attack tree? First, you list all possible attack goals. Then, try to think of all attacks against each goal and add them to the tree. Repeat this process down the tree until the list is complete. Discuss your tree with someone else and add any nodes that the person suggests. Of course, there is always the chance that you have omitted an attack, but you will improve with time. Like any security analysis, creating attack trees requires a certain mindset and takes practice. Once you have the attack tree and have researched all the node values, use the attack tree to reach security decisions. You can look at the values of the root node to see if the system's goal is vulnerable to attack. Determine whether the system is vulnerable to a particular type of attack (e.g., password guessing), and list the assumptions.
To obtain another user's password, we could ask the operator, or guess it, or spy on it illegally. Since the operator will usually not tell us, we may move on to guessing, which we can do either online or off-line (by obtaining an encrypted password or by mounting a dictionary attack). Spying is usually done in person, through a microphone, or using a camera. Attack trees provide a formal methodology for analyzing the security of systems and subsystems. They provide a way to think about security, and they form the basis of understanding a security process (Figure 3C.1).

FIGURE 3C.1 Possible attack tree for stealing a password. (Branches include guessing online or off-line via a dictionary attack on an encrypted password; asking the operator; insider intrusion; social engineering using SSN, birthday, maiden name, etc.; and spying on the password via shoulder surfing, taking pictures, audio-taped conversations, or a bug.)

Figure 3C.2, a more exciting example, is a simple attack tree targeting a bank safe [4]. To open the safe, attackers can pick the lock, learn the combination, cut open the safe, or install the safe improperly so that they can open it more easily later. To learn the combination, they might be able to find the combination written down or get the combination from the safe owner. To eavesdrop on someone stating the safe combination, also called shoulder surfing, attackers have to listen in on the conversation and get the safe owner to state or confess the combination. Assigning "expensive or high" and "not expensive or low" to nodes is useful, but it would be better to show exactly how expensive in terms of dollars or how critical in terms of probabilities. It is also possible to assign continuous values to nodes.

FIGURE 3C.2 Attack nodes with costs of attack. (Node labels include: open safe $10K; pick lock $30K; brute-force open $40K; learn combo $20K; find written combo $75K; install to open later $90K; get combo from target $25K; threaten $60K; bribery $25K; blackmail $80K; shoulder surfing $40K; listen to conversation $10K; convince target to confess $30K.)

Figure 3C.2 also shows the tree with different costs assigned to the leaf nodes, where the costs have propagated up the tree and the cheapest attack has been highlighted. This attack tree can be used to determine where a system is vulnerable.
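Propagating costs up an attack tree is mechanical enough to script. Below is a minimal sketch; the tree structure and the AND/OR choices are our own assumptions for illustration, with leaf costs loosely taken from Figure 3C.2 (in thousands of dollars). An OR node costs the attacker only its cheapest child, whereas an AND node requires all children, so its children's costs add; the root value is then the cheapest overall attack.

    # Each node is ("leaf", name, cost) or (kind, name, [children]) with kind "or"/"and".
    def cost(node):
        kind, name, body = node
        if kind == "leaf":
            return body
        children = [cost(c) for c in body]
        return min(children) if kind == "or" else sum(children)

    open_safe = ("or", "open safe", [
        ("leaf", "pick lock", 30),
        ("leaf", "brute-force open", 40),
        ("leaf", "install to open later", 90),
        ("or", "learn combo", [
            ("leaf", "find written combo", 75),
            ("or", "get combo from target", [
                ("leaf", "threaten", 60),
                ("leaf", "bribery", 25),
                ("leaf", "blackmail", 80),
                ("and", "shoulder surfing", [
                    ("leaf", "listen to conversation", 10),
                    ("leaf", "convince target to confess", 30),
                ]),
            ]),
        ]),
    ])

    print(cost(open_safe))   # 25: bribery is the cheapest attack in this sketch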
APPENDIX 3D: CAPABILITIES-BASED ATTACK TREE ANALYSIS

Attack trees graphically show how an asset can be attacked [5]. The topmost (or root) node in an attack tree represents the attacker's goal (Figure 3D.1). This overall goal is decomposed into nodes representing increasingly detailed tasks which, by themselves or in combination, will result in the attacker obtaining his or her objective. Associated with the detailed tasks are estimates, based on expert opinion, of the resources required by the attacker to perform the operation. Resources include money, technical ability, materials, and how noticeable the attack is. By estimating the capabilities of the adversary it is possible to eliminate those portions of the attack tree model that are unattainable. This greatly reduces the problem of defending the asset. Further analysis can show which of the remaining attacks are to be preferred by the adversary (i.e., bring the greatest benefit or the lowest expenditure of resources) and which are most harmful to the victim. This allows a true determination of risk.

Steps in Capabilities-Based Attack Tree Analysis

Attack tree analysis is quick to learn, simple to use, and easy to understand. It can be broken down into five steps.

1. Create a model of ways in which the system can be attacked (i.e., the attack scenarios).
2. Predict how your enemies will attack by comparing their capabilities with your vulnerabilities and estimating the benefits they will obtain from each attack.
3. Evaluate the negative impact on the victim of each attack scenario.
4. Combine your attack predictions with victim impact to determine the level of risk associated with each attack scenario.
5. Use your findings to propose a strategy of countermeasures. Incorporate the countermeasures in your model and repeat steps 2 to 4 to evaluate the effectiveness of the proposals.

FIGURE 3D.1 Capabilities-based attack tree analysis. (The example tree for burgling a house includes entering via an open passage by picking the lock, breaking down the door, stealing a key, or breaking or cutting glass; a garage attack; a chimney or roof attack; cutting a hole in a wall; tunneling through the floor; and social engineering.)
Why Conventional Risk Analysis Does Not Work for Hostile Threats

At the most fundamental level, all risk analysis systems try to determine two things: the likelihood that an undesirable event will occur and the damage that will result. For some types of risks (e.g., natural disasters), it is easy to find statistics describing the frequencies of such hazards as hurricanes, tornados, ice storms, and floods. These figures can easily be combined with projected damage to arrive at an accurate risk estimate. Unfortunately, accidental risks are no longer our main problem. In an increasingly hostile world, neither information systems nor physical infrastructure are safe from open attack. There are no statistics describing the frequency of such attacks. Attack trees provide a powerful mechanism to document the multitude of diverse types of attacks on the entire enterprise and to suggest improvements to requirements and design. They are, however, only a small part of the answer as to how to use intrusion scenarios to improve survivability engineering. The lack of accurate adversary models and risk analysis models is a serious issue [5].
APPENDIX 3E: TIME-TO-DEFEAT MODEL

The time to defeat (TTD) is the length of time required to compromise or defeat a given security characteristic in a given service, host, or network. The definition of compromise varies but includes host and service compromise, loss of service, network exposure, unauthorized access, and data theft. The quantification of IT security is expressed with two components: (1) an accurate, defendable, repeatable, and consistent quantitative metric, and (2) establishment of a set of measurable items that reflect and represent IT security in a comprehensive and consistent manner. TTD, combined with the five-A characteristics (availability, authentication, authorization, audit, and accuracy), allows for a complete measurement solution that is founded on mathematical accuracy and strength, combined with a deep knowledge of IT security issues and practice. Analysis of the five security characteristics provides the following benefits: (1) identification of weaknesses in security areas across any level of granularity (services, hosts, networks, or groups of each); (2) using TTD, a set of common data points that allow for statistical analysis; and (3) a standard set of accepted security constructs. Let's look at certain descriptive graphs of the TTD model's five A's [6]. The enterprise time-to-defeat graph aggregates and summarizes the data from all networks analyzed to provide an overall sense of security within the environment. This is the highest-level overview: enterprises contain networks, networks comprise hosts, and services are identified on the hosts. In the example shown in Figure 3E.1 we see that the overall levels of security are low, as indicated by the red or minimum TTD values. The maximum values calculated in this environment are generally stable, except for the authentication characteristic.
FIGURE 3E.1 Demo network with its minimum time-to-defeat values.
For a highly secured and managed environment, both the maximum and minimum values should be consistently high across the five security characteristics. Low authentication values are a common problem that often results in unauthorized system access and stolen identities and credentials. The effects of low authentication reach beyond simple access; if the system in question contains important assets and/or information, or if it exposes such a system, the effects of compromise are severe. The detailed listing of the enterprise time-to-defeat information identifies the networks that comprise the environment (the networks analyzed). In this sample, only one network has been defined, the "demo network." The display shows the smallest time values for that network in the summary. In a typical environment, multiple distinct networks would be analyzed. The results summarized in Figure 3E.1 allow for a broader understanding of the areas of weakness that span an organization: areas that can then be treated effectively with a security process, or a policy and technology. The weakest networks within an enterprise are identified immediately and, when correlated with important company assets, help to provide a firm understanding of the security risk that is present [6]. Viewing the analysis at the enterprise level, with network summaries, also creates an understandable picture of the security posture as it crosses networks, departments, and organizations. A large disparity between the shortest and longest times can indicate the presence of vulnerabilities, misconfigurations, failures in policy compliance, and weak security policy. A large standard deviation in time summarizes inconsistencies that merit examination. Identifying the areas of security that are weakest also
allows organizations to prioritize and determine which solutions to investigate and activate first.

REFERENCES

1. M. Sahinoglu, Security Meter: A Practical Decision Tree Model to Quantify Risk, IEEE Security Privacy, 3, 18–24 (2005).
2. E. Forni, Certification and Accreditation, AUM Lecture Notes, DSD (Data Systems Design) Labs, 2002, http://www.dsdlabs.com/security.htm.
3. D. Gollman, Computer Security, 2nd ed., Wiley, Chichester, West Sussex, England, 2006.
4. B. Schneier, Applied Cryptography, 2nd ed., Wiley, New York, 1995, http://www.counterpane.com.
5. Capabilities-Based Attack Tree Analysis, www.amenaza.com; http://www.attacktrees.com/.
6. Time to Defeat (TTD) Model, www.blackdragonsoftware.com.
7. M. Sahinoglu, Security Meter: A Probabilistic Framework to Quantify Security Risk, Certificate of Registration, U.S. Copyright Office, Short Form TXu 1-134-116, December 2003.
8. M. Sahinoglu, A Quantitative Risk Assessment, Proceedings of the Troy Business Meeting, San Destin, FL, 2005.
9. M. Sahinoglu, Security-Meter Model: A Simple Probabilistic Model to Quantify Risk, 55th Session of the International Statistical Institute, Sydney, Australia, Conference Abstract Book, 2005, p. 163.
10. M. Sahinoglu, Quantitative Risk Assessment for Software Maintenance with Bayesian Principles, Proceedings of the International Conference on Software Maintenance, ICSM Proc. II, Budapest, Hungary, 2005, pp. 67–70.
11. M. Sahinoglu, Quantitative Risk Assessment for Dependent Vulnerabilities, Proceedings of the International Symposium on Product Quality and Reliability (52nd Year) (RAMS'06), Newport Beach, CA, 2006.
12. M. Sahinoglu, D. Libby, and S. R. Das, Measuring Availability Indices with Small Samples for Component and Network Reliability Using the Sahinoglu–Libby Probability Model, IEEE Trans. Instrum. Meas., 54(3), 1283–1295 (June 2005).
13. M. Sahinoglu and E. H. Spafford, A Bayes Sequential Statistical Procedure for Approving Products in Mutation-Based Software Testing, in W. Ehrenberger (ed.), Proceedings of the IFIP Conference on Approving Software Products (ASP'90), Garmisch-Partenkirchen, Germany, Elsevier Science (North Holland), Amsterdam, pp. 43–56, September 1990.
14. M. Sahinoglu, An Empirical Bayesian Stopping Rule in Testing and Verification of Behavioral Models, IEEE Trans. Instrum. Meas., 52, 1428–1443 (October 2003).
15. B. Potter and G. McGraw, Software Security Testing, IEEE Security Privacy, 2(5), 81–85 (2004).
16. O. H. Alhazmi and Y. K. Malaiya, Quantitative Vulnerability Assessment of Systems Software, Proceedings of the International Symposium on Product Quality and Reliability (RAMS'05), Alexandria, VA, January 2005.
17. R. Weaver, Guide to Network Defense and Countermeasures, 2nd ed., Thomson Publishing, Stamford, CT, 2007.
18. S. A. Scherer, Software Failure Risk, Plenum Press, New York, 1992.
19. M. Howard and D. LeBlanc, Writing Secure Code, 2nd ed., Microsoft Press, Redmond, WA, 2002.
20. F. Swiderski and W. Snyder, Threat Modeling, Microsoft Press, Redmond, WA, 2004.
21. I. Krusl, E. Spafford, and M. Tripunitara, Computer Vulnerability Analysis, COAST TR 98-07, Department of Computer Sciences, Purdue University, West Lafayette, IN, May 1998.
22. R. V. Hogg and A. T. Craig, Introduction to Mathematical Statistics, 3rd ed., Macmillan, New York, 1970.
23. J. Keyes, Software Engineering Handbook, Auerbach Publications, Boca Raton, FL, 2003.
24. E. B. Swanson and C. M. Beath, Departmentalization in Software Development and Maintenance, Commun. ACM, 33(6), 658–667 (June 1990).
25. G. Parikh, Handbook of Software Maintenance, Wiley, New York, 1986.
26. D. S. Moore and G. P. McCabe, Introduction to the Practice of Statistics, 4th ed., W. H. Freeman, New York, 2003.
27. G. Cybenko, Why Johnny Can't Evaluate Security Risk, IEEE Security Privacy, 4(5) (2006).
28. S. Goldsby, CEO/ICS, Information Security, presented at the TSUM/CIS Millenium Colloquium, Montgomery, AL, April 2000, and the Roundtable for Security, IDPT, Dallas, TX, 2000.
29. W. G. Cochran, Sampling Techniques, 3rd ed., Wiley, New York, 1970.
30. C. Nagle and P. Cates, CS6647: Simulation Term Project, Troy University, Montgomery, AL, Fall 2005.
31. M. Sahinoglu, A Simple Design to Estimate the Parameters of the Security-Meter Model to Quantify and Manage Software Security Risk, IEEE Trans. Instrum. Meas. (accepted for publication, October–December 2007).
32. B. Blakley, E. McDermott, and D. Geer, Information Security Is Information Risk Management, Proceedings of the 2001 Workshop on New Security Paradigms (NSPW'01), 2001, pp. 97–104.
33. E. Brynjolfsson, The Productivity Paradox of Information Technology, Commun. ACM, 36(12), 66–77 (1993).
34. B. I. Dewey and P. B. DeBlois, Current IT Issues Survey Report, EDUCAUSE Q., pp. 12–30 (November 2006).
35. D. Landoll, The Security Risk Assessment Handbook, Auerbach Publications, Boca Raton, FL, 2006.
36. H. D. Sherali, J. Desai, and T. S. Glickman, Cascading Risk Management Using Event Tree Optimization, http://filebox.vt.edu/users/jidesai/Event%20Tree%20Optimization.pdf, 2005.
37. B. Moret, Decision Trees and Diagrams, Comput. Surv., 14(4), 593–623 (1982).
38. S. C. Palvia and S. R. Gordon, Tables, Trees and Formulas in Decision Analysis, Commun. ACM, 35(10), 104–113 (1992).
39. M. Moussa, J. Y. Ruwanpura, and G. Jergeas, Decision Tree Module Within Decision Support Simulation System, Proceedings of the 2004 Winter Simulation Conference, 2004, pp. 1268–1276.
40. G. Stoneburner, A. Goguen, and A. Feringa, Risk Management Guide for Information Technology Systems, Special Publication 800-30, National Institute of Standards and Technology, U.S. Department of Commerce, Washington, DC, 2002, http://csrc.nist.gov/publications/nistpubs/800-30/sp800-30.pdf.
41. A. Arora, D. Hall, C. A. Pinto, D. Ramsey, and R. Telang, Measuring the Risk-Based Value of IT Security Solutions, IT Prof., 6(6), 35–42 (2004).
42. G. Bakus, Recent Advances in Risk Assessment and Decision Analysis, http://bioweb.usc.edu/courses/2003-spring/documents/bisc102-bakus_EIA_recent.pdf, January 2002.
43. F. Farahmand, S. Navathe, G. Sharp, and P. Enslow, Managing Vulnerabilities of Information Systems to Security Incidents, Proceedings of the 5th International Conference on Electronic Commerce, ACM Press, New York, 2003, pp. 348–354.
44. N. R. Mead and T. Stehney, Security Quality Requirements Engineering (SQUARE) Methodology, Proceedings of the 2005 Workshop on Software Engineering for Secure Systems-Building Trustworthy Applications, St. Louis, MO, ACM Press, New York, 2005, pp. 1–7.
45. M. S. Feather, S. L. Cornford, and T. W. Larson, Combining the Best Attributes of Qualitative and Quantitative Risk Management Tool Support, 15th IEEE International Conference on Automated Software Engineering (ASE'00), Vol. 309, 2000.
46. A. Mosleh, E. R. Hilton, and P. S. Browne, Bayesian Probabilistic Risk Analysis, ACM SIGMETRICS Performance Eval. Rev., 13(1), 5–12 (1985).
47. W. Sonnenreich (n.d.), Return on Security Investment (ROSI): A Practical Quantitative Model, SageSecure, New York, retrieved May 2006 from http://www.infosecwriters.com/text_resources/pdf/ROSI-Practical_Model.pdf.
48. H. Wei, D. Frinke, O. Carter, and C. Ritter, Cost–Benefit Analysis for Network Intrusion Detection Systems, presented at the 28th Annual Computer Security Conference, Computer Security Institute, Washington, DC, October 29–31, 2001.
49. L. Wentzel, Quantitative Risk Assessment, Nova Southeastern University, Fort Lauderdale, FL, and personal communication, May–June 2006.
50. R. T. Meyers III, The Past, Present, and Uncertain Future of Encryption, in M. Sahinoglu and C. Bayrak (eds.), Proceedings of the CIS Millenium Conference on IT, Troy University, Montgomery, AL, April 2000.
51. J. Boncek, Math Department, Troy University, Montgomery, AL, personal communication, July 2006.
52. A. Yasinsac, Computer Science Department, Florida State University, Tallahassee, FL, personal communication, July–August 2006.
53. K. H. Rosen, Discrete Mathematics and Its Applications, 4th ed., WCB/McGraw-Hill, Boston, 1999.
54. M. Sahinoglu and J. Cecil, Working Paper CS 4451, Troy University, Montgomery, AL, Spring 2006.
55. N. J. Rifon, www.ippsr.msu.edu/Documents/ForumPresentations/May05Rifon.pdf, accessed May 2005.
56. A. J. J. T. Singewald, Information Privacy in the EU, Proceedings (in PowerPoint) of the International Conference on the Digital Information Industry, Seoul, South Korea, November 14–15, 2006, pp. 3–22.
57. M. Siegert, Direct Marketing in Germany in the Mirror of Data Protection Law, Proceedings (in PowerPoint) of the International Conference on the Digital Information Industry, Seoul, South Korea, November 14–15, 2006, pp. 23–56.
58. J. G. H. M. Birken, Current Status of Personal Information Protection and Future Tasks in Asia, Proceedings (in PowerPoint) of the International Conference on the Digital Information Industry, Seoul, South Korea, November 14–15, 2006, pp. 57–111.
59. Beomsoo Kim, Complementarity Between Protecting Information Privacy and the Fair Use of Information, Proceedings (in PowerPoint) of the International Conference on the Digital Information Industry, Seoul, South Korea, November 14–15, 2006, pp. 311–329.
60. B. Huberman, E. Adar, and L. R. Fine, Valuating Privacy, IEEE Security and Privacy, pp. 22–25 (November/December 2005).
61. M. Sahinoglu, A Universal Quantitative Risk Assessment Design to Manage and Mitigate, Proceedings (in PowerPoint) of the International Conference on the Digital Information Industry, Seoul, South Korea, November 14–15, 2006, pp. 333–405.
62. M. Sahinoglu, Universal (Time-Independent) Security-Meter Design to Quantify Risk and Time-Dependent Stochastic Model to Quantify Lack of Privacy, Invited Seminar, Department of CIS, University of Alabama at Birmingham, December 1, 2006.
63. M. Sahinoglu, Universal (Time-Independent) Security-Meter Design to Quantify Risk and Time-Dependent Stochastic Model to Quantify Lack of Privacy, Invited Seminar, Department of ECE, University of Massachusetts, Amherst, Massachusetts, December 8, 2006.
64. Korea Information Security Agency, www.krcert.or.kr, Seoul, South Korea.
65. M. Sahinoglu, Statistical Inference to Quantify and Manage the Risk of Privacy, Proceedings of the ISI'07 (Session 22: Risk), Lisbon, Portugal, August 2007.
EXERCISES

To use the applications and data files, click on "Security Meter" and "Privacy" in TWC-Solver on the CD-ROM.

3.1 Discrete Event Simulation of the Security Meter Problem You are expected to simulate a component, such as a server, from the beginning of the year (e.g., 1/1/08) to the end of the year (12/31/08), an 8760-hour period, with a life cycle of hits (crashes) or saves (e.g., by anti-malware). The input data are supplied below for the simulation of random values, at the end of which you will fill in the elements of the security meter tree diagram. Recall that the rates are the reciprocals of the means under the assumption of a negative exponential p.d.f. representing the distribution of time to crash. For example, if rate = 98/8760, the mean time to crash (MTTC) is 8760/98. Use the data in Table E3.1 for a 2 × 2 × 2 tree diagram. Assume a security meter diagram of a double-vulnerability and double-threat scenario as in Tables 3.2 and 3.3. Let X (total number of crash
TABLE E3.1 Vulnerability–Threat–Countermeasure Spreadsheet

Vulnerability          Threat                          Countermeasure
Chance failure         1. Design and coding error      1. Prerelease testing
                       2. System power outage          2. In-house generator
Intentional failure    1. Virus                        1. Install antivirus software
                       2. Hacking                      2. Install firewall
preventions) be approximately 1/day; assume 366 per year. That is, let X11 = 98, X12 = 85, X21 = 85, X22 = 98. Let Y (total number of crashes not prevented) = 10/year. That is, Y11 = 3, Y12 = 4, Y21 = 5, Y22 = 2. (a) Calculate the probabilities of all branches in the security meter tree diagram to obtain the risk and expected cost of loss if criticality = 0.5 and the capital cost = $1000. (b) By means of a discrete event simulation technique using a negative exponential distribution, verify the results of part (a).

3.2 Monte Carlo Simulation of the Security Meter Problem Using all the information in Exercise 3.1(a), apply Monte Carlo principles to simulate a 2 × 2 × 2 security meter. Use the Poisson distribution for generating rates for each leg in the tree diagram of the 2 × 2 × 2 setup in Table E3.1. The necessary rates of occurrence for the Poisson random value generation were given in the empirical data example above. For each security meter realization, get a risk value and then average it over n = 1000 to 5000 in increments of 1000. When you average over n = 5000 runs, you should get the same value as in Exercise 3.1. Calculate the ECL for a criticality constant of 0.5 and a capital cost of $1000.

3.3 Comparison of Techniques to Assess Risk in Information Systems Compare the security meter approach with the other approaches given in the chapter (i.e., attack trees, capabilities-based attack trees, and the TTD model) in terms of the advantages and disadvantages of each, studying the ease of analytical calculations and economic interpretations and the availability of the input data.

3.4 Bayesian Rule in Statistics and Applications for Software Maintenance Given Figure 3.6's simulation tableau, prioritize the five vulnerabilities from most to least urgent in terms of their significance to be mitigated.

3.5 Security Meter Modified with Nondisjoint Vulnerabilities and Nondisjoint Threats Given Figures 3.10 and 3.11 for both vulnerabilities and threats, convert the security meter tree diagram theoretically (no values attached) from n = 2 to n = 3, where n is the number of vulnerabilities or threats:
(a) Vulnerabilities n = 3 only, threats remaining at n = 2
(b) Vulnerabilities remaining at n = 2, threats n = 3 only
(c) Both vulnerabilities and threats at n = 3

3.6 Security Meter Modified with Dependent Vulnerabilities and Dependent Threats (Applied) (a) Apply the following initial probabilities to your derivations in Exercise 3.5 for the V's: P(V1) = 0.55, P(V2) = 0.25, P(V3) = 0.45, P(V1 ∩ V2) = 0.15, P(V1 ∩ V3) = 0.25, P(V2 ∩ V3) = 0.10, P(V1 ∩ V2 ∩ V3) = 0.05. Use Exercise 3.5(a). (b) Repeat part (a) but replace the V's by T's. Use Exercise 3.5(b). (c) Repeat part (a) for V's and T's at the same time. Use Exercise 3.5(c).

3.7 Security Meter Modified for Purely Qualitative Data Using Figure 3.4, design and calculate a feasible security meter design with qualitative data for your PC in a 3 × 3 × 3 setup. Choose your own input values.

3.8 Security Meter Modified for Hybrid Data Using Figure 3.5, design and calculate a feasible security meter design with hybrid data for your PC in a 3 × 3 × 3 setup. Choose your own input values.

3.9 Basic Security Meter for a Personal and Office Computer (a) You are expected to collect data for your PC as in Exercise 3.1, in a 2 × 2 security meter design, or articulate and create data to best estimate your risk. Assuming a criticality of 0.8 and a capital cost of $1000, calculate the expected cost of loss. (b) Proceed as in part (a), but this time do the same for your office computer and calculate your residual risk and the ECL.

3.10 Modified (for Qualitative Attributes) Security Meter for a Home and Office PC Repeat Exercise 3.9, this time using H (high), M (medium), L (low), and W (rare), making sure that your risk values obey the laws of probability.

3.11 Hybrid (Qualitative and Quantitative Together) Security Meter for a Home and Office PC Repeat Exercise 3.9, this time using H (high), M (medium), L (low), and W (very low) and the quantitative values selected. Make sure that your risk values obey the laws of probability.

3.12 Security Meter for a Personal and Office Computer's Maintenance Planning Repeat Exercise 3.9, this time employing Bayesian principles. Decide which vulnerability needs a higher priority, and why.

3.13 Security Meter for a Personal and Office Computer Repeat Exercise 3.9, this time employing statistical principles on nondisjointness for vulnerability and threat simultaneously. Choose your own input values.

3.14 General Questions About the Security Meter (a) If one of the pillars of information system security is nonrepudiation, what are the others that complete CINAPEAAA?
(b) State three countermeasures against hacking and viruses on your home office PC. (c) For your home computer, suppose that the probability for the vulnerability of loss of e-mail files is 0.8 and for the fire hazard is 0.2. Threats against your e-mail system are due to a virus attack (0.6) and hacking (0.4), against both of which an encryption code is installed as well as a firewall (0.9). For fire, threats are from a nearby forest (0.3) and old electrical wires in the office (0.7). Countermeasure probabilities against both threats are weak (0.3). The entire setup is in a highly critical scenario (0.9). This office is worth $100,000. How much do you risk losing? Use a formula and be exact. Which vulnerability requires the highest priority for repair? (d) Is safety a software property? Is reliability the same as safety? Give an example in which they are not the same. If reliability is the y-axis of a Pythagorean triangle and safety is the x-axis, what is the hypotenuse called? How best can you make your software safety conscious?

3.15 More on the Security Meter Write an Excel program to mimic the security meter spreadsheet shown in Table E3.15 and obtain the correct risk results. Assume criticality = 0.3 and capital cost = $2000, and verify that total residual risk = 0.28 and ECL = ($2000)(0.084) = $168. Use the security application in the CD-ROM.
TABLE E3.15 Home System

Vulnerability        Threat                       Lack of Countermeasure
Power failure: 0.2   1. Loss of data: 0.75        1. Backup generator: 0.6
                     2. Hardware failure: 0.25    2. Surge protector: 0.1
High speed: 0.3      1. Virus: 0.5                1. Antivirus software: 0.2
                     2. Intrusion: 0.5            2. Firewall: 0.2
Hard failure: 0.2    1. Loss of data: 0.75        1. Data backup: 0.6
                     2. System down: 0.25         2. Alternative laptop: 0.1
Soft failure: 0.3    1. Lack of operation: 1.0    1. Software backup: 0.1
3.16 Security Meter Risk Management (a) Verify the results shown in Table E3.16 using a hand calculator and an Excel program. Also use the security application in the CD-ROM. (b) Apply a risk management algorithm as in Section 3.4.6 to mitigate your total residual risk to (1) 25%, (2) 20%, (3) 15%, (4) 10%, and (5) 5% by determining break-even points when you spend $Z per 1% improvement of the risk by improving the CM devices. Calculate the optimal Z and ECL for (1) to (5) to accomplish these objectives.
TABLE E3.16 Staff Server

Vulnerability    Threat    CM      LCM     Residual Risk
0.35             0.48      0.7     0.3     0.0504
                 0.16      0.42    0.58    0.03248
                 0.32      0.7     0.3     0.0336
                 0.04      0.8     0.2     0.0028
0.14             0.32      0.7     0.3     0.01344
                 0.02      0.7     0.3     0.00084
                 0.66      0.97    0.03    0.002772
0.51             0.32      0.7     0.3     0.04896
                 0.59      0.7     0.3     0.09027
                 0.09      0.46    0.54    0.024786

Total residual risk                        0.300348
Total residual risk percentage             30.03%

Final risk = (total residual risk)(criticality) = (0.300348)(0.4) = 0.1201392
ECL = (final risk)(capital cost) = (0.1201392)($8000) = $961.11
(c) Repeat part (a) using $3 per 1% improvement of the CM devices where target mitigations are not given. What is the new ECL and percentage mitigation achieved in the residual risk? (d) Repeat part (b) of Exercise 3.15 using part (c).
Firm hands will lose their grip one day, And tongues that talk will stop to decay: The wealth you loved and stored away, Will go to some inheritor’s way. —Yunus Emre, the legendary mystic folk poet (1238–1320)
4 STOPPING RULES IN SOFTWARE TESTING

Nutshell 4.0 Software testing and product reliability have always been two inseparable issues, but the analysis of stopping rules to render this activity cost-effective has traditionally been ignored. It is now anticipated that 50 to 75% of software expenses stem from testing [1]. Software testing in reliability is a broad topic that has been widely studied (see, e.g., textbooks such as the Handbook of Software Reliability Engineering [2] and the Software Engineering Handbook [3], among others [4,5]). Even though there are many extensive sources in the literature on testing software, there has been no in-depth analysis, specifically of the intricacies and complexities and, more fundamentally, of the science of when to stop most efficiently and economically. Usually, the stopping rule is either a time-to-release date, which is nothing more than a commercial benchmark or time constraint, or a rough percentage of how many bugs were detected out of a given prescribed total, with no statistical data or trend modeling merged with cost-effective economic stopping rules. The focus is on determining, given the results of a testing process, whether white box (coverage) or black box (functional) testing, when it is most economical to halt testing and release software under the conditions prevailing for a prescribed period of time. We are dealing with one way of conducting a quality control analysis of software testing activity with the goal of achieving a quality product most economically and accurately. The data are one of two types: either stopping at the end of a time period T, such as at an increment of Tk − Tk−1 for a time-based model, or at the end of a certain amount of testing, at the Nth test case, such as stopping at an increment of Nk − Nk−1 for the test case–based or, synonymously, effort-based model. In this
chapter we deal with the stopping rules in time- and effort-based models and their applications using a programming code collected under the general title MESAT. MESAT-1 is application software for effort-based data, and MESAT-2 is for time-based data. Although this chapter works with empirical data on chance or random failures that cause disruption of the intended service of particular hardware or software, the same logic can be utilized for malicious (not chance-related) attacks that cause security breaches in security testing: attacks replace test cases, and crashes replace the failures, with other penetrations countermeasured. Moreover, provided that the data are applicable to the mathematical-statistical and engineering model proposed, the practices described in this chapter can also easily find use in the vast world of quality control testing of defective items, such as those of an automotive or airline manufacturing assembly line. The subject matter is a feasible alternative to existing statistical process control rules for accepting or rejecting a certain product before its release. Therefore, it is a new paradigm in the larger question of quality control testing, being one step ahead of just-in-time statistical process control.
4.1 EFFORT-BASED EMPIRICAL BAYESIAN STOPPING RULE

Nutshell 4.1 MESAT-1 is a cost-efficient stopping-rule algorithm used to save substantial numbers of test vectors in achieving a given degree of coverage reliability. Through cost–benefit analysis, the author has shown how cost-efficiently his proposed stopping-rule algorithm performs compared with those employing conventionally exhaustive "shotgun" or "testing-to-death" approaches. This cost-effective technique is valued for its industrial potential, both for keeping a tight rein on budgetary constraints and for using a scientific one-step-ahead formula to optimize resource utilization. This quantitative evaluation employing a stopping rule is in sharp contrast to conventional techniques that require billions of test vectors to guarantee a certain degree of reliability.

4.1.1 Stopping Rule in Test Case–Based (Effort) Models

Software-testing stopping rules are decision-making tools used to minimize effectively the time and cost involved in software testing. The algorithms serve to guide the testing process such that if a certain level of branch or fault (or failure) coverage is obtained without the expectation of further significant coverage, the testing strategy can be stopped or changed to accommodate further, more advanced testing strategies. By combining cost analysis with a variety of stopping-rule algorithms, a comparison can be made to determine an optimally cost-effective stopping point. A novel cost-effective stopping rule using empirical Bayesian principles for a nonhomogeneous Poisson counting process (NHPP) compounded with a logarithmic series distribution (LSD) is derived and applied
to digital software testing and verification [6]. It is assumed that the software failures, or branches covered, whichever the case may be, clustered as a result of the use of a given test case, are positively correlated (i.e., contagious). This assumption implies that the occurrence of one software failure (or coverage detection of a branch) positively influences the occurrence (or detection) of the next. This phenomenon of clustering of failures or branches is often observed in software testing practice. The random variable wi of the failure clump size of the interval is assumed to have LSD(θ), justified for the given data sets by employing chi-square goodness-of-fit testing, while the distribution of the number of test cases is Poisson(λ). Then the distribution of the total number of failures observed or, similarly, of covered branches, X, is a compound Poisson∧LSD [i.e., negative binomial distribution (NBD)], provided that a certain mathematical identity holds. For each checkpoint in time, either the software satisfies a desired reliability attached to an economic criterion, or software testing is allowed to continue with the next test case application. Using a one-step-look-ahead formula derived for the model, the stopping rule proposed is applied to five test case–based data sets acquired by testing embedded chips through complex VHDL models. Further, multistrategy testing is conducted to show its superiority over single-stage testing. Results are interpreted satisfactorily from a practitioner's viewpoint as an innovative alternative to the ubiquitous test-it-to-death approach, which is known to waste billions of test cases in the tedious process of finding more bugs. Moreover, the dynamic stopping-rule algorithm proposed can validly be employed as an alternative paradigm to the existing just-in-time statistical process control methods, which are static in nature, for the manufacturing industry, provided that the underlying statistical assumptions hold. A detailed comparative literature survey of stopping-rule methods is included in the Appendix in terms of pros and cons and cost-effectiveness.
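The Poisson∧LSD compounding just described is easy to sanity-check by simulation. The sketch below is a minimal illustration with hypothetical parameters (λ and θ are chosen arbitrarily, not taken from the MESAT data sets); it verifies the identity that a Poisson(λ) sum of LSD(θ) clump sizes is negative binomial with k = −λ/ln(1 − θ) and success probability 1 − θ:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    lam, theta = 25.0, 0.4   # hypothetical Poisson rate and LSD parameter

    def total_failures():
        # N test-case arrivals, each uncovering an LSD-distributed clump of failures
        n = rng.poisson(lam)
        return stats.logser.rvs(theta, size=n, random_state=rng).sum() if n else 0

    totals = np.array([total_failures() for _ in range(20000)])

    # matching negative binomial: k = -lam/ln(1-theta), success probability 1-theta
    k = -lam / np.log(1.0 - theta)
    print(totals.mean(), stats.nbinom.mean(k, 1.0 - theta))   # means agree
    print(totals.var(),  stats.nbinom.var(k, 1.0 - theta))    # variances agree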
in terms of the number of test patterns (cases) used? Branch or decision coverage testing is a white-box testing technique in which each test case is written to ensure that every decision has a true and a false outcome at least once (i.e., each branch is traversed at least once). Branch coverage generally satisfies statement coverage, since every statement lies on a subpath from some branch statement. In white-box testing, test cases are written to ensure that each decision and the conditions within that decision take on all possible values at least once. It is a stronger logic-coverage technique than decision/condition coverage because it covers all the conditions that cannot be tested using decision coverage alone; it also satisfies statement coverage. One method of creating test cases with this technique is to build a truth table, write down all the conditions and their complements, and then eliminate any duplicates that exist. In addition, how do we decide when a given test strategy (i.e., the way to generate test patterns or cases) has reached its potential and a new (better) test strategy should be activated? When designing a VLSI system, embedded or not, at the behavioral level, one of the most important steps is verifying the system's functionality before it is released to the logic and product development design phase. It is widely believed that the quality of a behavioral model is correlated with the branch or fault coverage experienced during the verification process [10–18]. However, measuring coverage is just a small part of ensuring that a behavioral model meets the desired quality goal. A more important question is how to increase coverage during verification to a certain level under a given time-to-market constraint. Current methods use brute force, with billions of test cases applied without knowing the effectiveness of the techniques used to generate them [19–21]. One may consider behavioral models as the oracles in industry against which the final chip is tested when produced. In the experimental sets in this chapter, branch coverage (in five data sets, DR1 to DR5) is used as a measure of the quality of verifying and testing behavioral models. Minimum effort to achieve a given quality level can be realized by using the empirical Bayesian stopping rule proposed above. The stopping rule guides the process to switch to a different testing strategy, using different types of patterns (i.e., random versus functional) or different sets of parameters to generate patterns, test cases, or test vectors, when the current strategy is not expected to increase coverage. This leads to the practice of mixed-strategy testing. We demonstrate use of the stopping-rule algorithm on complex VHDL models, having observed that switching phases at certain points, guided by the stopping rule, yields the same or even better coverage with fewer testing patterns. This method is an innovative alternative that helps save millions of test patterns, and hence reduce cost, in the colossal testing process of embedded chips, versus the conventionally used test-it-to-death exhaustive testing approach. Many physical events occur according to an independent Poisson process, and at each of these Poisson events, one or more other events can occur. This is identified as overdispersion in many life sciences–oriented textbooks, as in the total number of certain bacteria or algae clustered on individual leaves in a water pond [22,23].
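To make the clumping effect concrete before the formal model is introduced, the following Java fragment is a minimal, hypothetical sketch (it is not part of MESAT-1): it simulates totals built from a Poisson number of interruptions, each contributing a small random clump, and prints the variance-to-mean ratio, which exceeds 1 whenever clumping (overdispersion) is present.

import java.util.Random;

// Hypothetical demo: totals formed from Poisson-many clumps are overdispersed,
// i.e., their variance exceeds their mean, unlike a plain Poisson count.
public final class OverdispersionDemo {
    static final Random RNG = new Random(42);

    static int poisson(double lambda) {            // Knuth's multiplication method
        double limit = Math.exp(-lambda), prod = 1.0;
        int k = 0;
        do { k++; prod *= RNG.nextDouble(); } while (prod > limit);
        return k - 1;
    }

    public static void main(String[] args) {
        int reps = 100_000;
        double sum = 0.0, sumSq = 0.0;
        for (int r = 0; r < reps; r++) {
            int clumps = poisson(3.0);             // interruptions in one test interval
            int total = 0;
            for (int i = 0; i < clumps; i++) {
                total += 1 + RNG.nextInt(4);       // clump size w drawn from {1,...,4}
            }
            sum += total;
            sumSq += (double) total * total;
        }
        double mean = sum / reps;
        double var = sumSq / reps - mean * mean;
        System.out.printf("mean=%.3f var=%.3f var/mean=%.3f%n", mean, var, var / mean);
    }
}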
If an interruption during testing of a software program is assumed to be due to one or more software failures (or branch coverage items) in a clump, and if the distribution of the total number of interruptions or test cases is Poisson, the distribution of the total number of experienced failures or covered branches is a compound Poisson [6,24–31]. The empirical Bayesian stopping rule therefore uses the mathematical principles of a Poisson counting process applied to the count of test cases, with a logarithmic series distribution (LSD) applied to the cluster size of software failures or branch coverage generated by each test case. It applies satisfactorily to a time-continuous, compounded, and nonhomogeneous Poisson process as well as to time-independent effort (or test case)–based testing, such as a sequentially discrete Bernoulli process. That is, the Poisson process is a time-parameter version of the counting process for Bernoulli trials [32, p. 72]. It is imperative to recall that the often-used binomial process is the sum of identically Bernoulli-distributed random variables. However, the Bernoulli random variables attached to the individual test cases here are nonidentical, with unequal "arrival" success probabilities, as reported in 1990 [33]. The model proposed assumes randomization of test cases in the spirit of an independently incremented Poisson counting process, since the coverage sizes do not necessarily follow a definite trend unless the test cases are arranged in order of merit, a practice that is impossible to attain perfectly prior to actual experimentation. Some sources claim that the independent-increment Poisson arrival model is applicable only for the first "surprise" execution against a test suite: on second and subsequent executions, the "arrival" (or discovery) of failures (or branches) is no longer random unless the software development process is chaotic or parallel-distributed. Evidently, the applicability of such an independent-increment counting process, and hence of the proposed stopping rule, varies with the maturity of the software testing activity. This is why a regression testing technique for observing said maturity is of relevance here in terms of mainstream software engineering [34]. Also, some authors support the concept of a probability distribution function p(t), an interruption correlation function for the occurrence of interruptions, a rather hazy and nebulous concept [35]. First, the total number of observations would always have to be known in advance to model the probability of interruptions, something testers are unable to master. Therefore, p(t) represents unrealistic guesswork, clearly varies from one set of data to another, and cannot be generalized. It is more rational to randomize the interruption activity statistically, which is much more natural, as unprecedented test cases may act surprisingly differently at random times. The randomization phenomenon is also in the spirit of a Poisson process with independent increments, on which the MESAT tool is structured. The unpredictability of fault arrival or branch coverage is therefore best addressed by a nonhomogeneous Poisson process whose rate of arrival is adjusted, in this case diminished, with the advance of time or the number of test cases. This nonstationarity of the Poisson process takes care of the no-longer-independent Poisson arrival times, a phenomenon best displayed by the NHPP [32, pp. 94–101].
When a new computer software package has been written and compiled, and all obvious software failures have been removed from the input sets, a testing program is usually initiated to eliminate the remaining failures. The common procedure is to use the software package on a set of problems, and whenever testing is interrupted because of one or more programming failures, the code is corrected, the software is recompiled, and computation is restarted. This type of testing can continue for several time units (e.g., hours, days, weeks), with the number of failures per unit time decreasing. The same is true, for instance, when discretely applied test cases replace test weeks and branch coverage records replace those of failures. Finally, one reaches a point of optimal economic return in time or effort at which testing is stopped and the software is released. However, one is never certain that all software faults due to failures have been removed or, similarly, that all branches have been tested (covered). Although a small number of failures may remain in the software, the chances of finding them within a reasonable time may be so small that it is not economically feasible to continue testing [6,36,37]. The objective is to find a cost-effective stopping rule to terminate testing. One can add the dimension of a preconceived confidence level, 0 < CL < 1, to ensure minimal coverage reliability. Stopping-rule problems have been studied extensively by statisticians [38–44] and engineers. In this chapter, however, a cost-effective stopping rule is presented with respect to a popularly used one-step-ahead economic criterion when an alternative underlying p.d.f. is assumed for the clump size of the failures or branches observed. The total number of failures or covered branches discovered is a Poisson counting process compounded with a logarithmic series distribution at each Poisson arrival. That is, the number of incidents over time is distributed as Poisson, whereas the number of failures that occur as a clump at each interruption time or incident is distributed according to a discrete logarithmic series distribution (LSD). The failures within a clump are positively correlated with each other. This phenomenon is represented by a parameter 0 < θ < 1 in the LSD for the clump-size random variable. A Poisson distribution compounded by a discrete logarithmic-series distribution is denoted Poisson∧LSD (i.e., a negative binomial distribution), pending a certain mathematical identity, as in equation (4). The algorithm is applied in the effort domain, where test cases are used in five example experiments on embedded chips [8–10,45].

4.1.3 Notation, Compound Poisson Distribution, and Empirical Bayes Estimation

CL             confidence level; a minimal percentage of branches or failures to cover
NBD            negative binomial distribution
N(t)           random variable for the number of Poisson events until and including time t
X(t)           total number of failures distributed with respect to Poisson∧LSD until time unit t
wi             random variable of failure clump size, distributed with LSD at each Poisson event i
θ              LSD parameter denoting the positive correlation constant for the LSD random variable w
a              LSD constant, a = −1/ln(1 − θ) > 0
k              NBD parameter (calculated recursively at each Poisson epoch)
λ              Poisson rate or parameter, where λ = −k ln(1 − θ) = k ln q holds
θ1             lower limit of θ
θ2             upper limit of θ
c.f., Φ_X(t)(u)  characteristic function of X(t)
dif(θ)         range for the LSD parameter (the correlation coefficient): θ2 − θ1
q              reciprocal of (1 − θ); when θ = 0, there is no compounding phenomenon and the process is purely Poisson, with q = 1 (if q > 1, there is overdispersion)
p              related parameter, p = q − 1; no compounding (pure Poisson) when p = 0
f(X|θ)         discrete negative binomial conditional probability distribution of X given θ
h(θ)           prior distribution of the positive correlation parameter θ
h(X)           marginal distribution of X following the Bayesian analysis
α, β           positive shape and scale parameters of the beta prior
Beta(α, β)     beta distribution for the LSD variable θ
h(θ|X)         posterior conditional distribution of θ given the failure vector X
E(θ|X)         Bayes estimator with respect to the squared-error loss function; expected value of the conditional posterior random variable Θ ∼ h(θ|X)
E(X) = kp      expected value of the conditional X ∼ NBD, whose only parameter is k, based on a single random variable
S(·) = s       stopping rule: S gives the number of failures (or coverage) s at which to stop after (·) discrete time units (days, weeks, etc.) or test cases
C(n, k)        combination notation for the number of different unordered combinations of size k out of a sample of n: C(n, k) = n!/[k!(n − k)!]
DR1–5          effort-based, time-independent (test case) coverage data sets 1 to 5
A nonstationary compound Poisson arrival process is given as [6,25–31,36; 32, pp. 90–101]
$$X(t) = \sum_{i=1}^{N(t)} w_i, \qquad t \ge 0 \tag{1}$$
where N(t) > 1 and the compounding clump sizes w1, w2, ... are i.i.d., each f(wi) distributed with the LSD [6] as follows:
$$f(w) = a\,\frac{\theta^{w}}{w}, \qquad 0 < \theta < 1,\; w = 1, 2, \ldots \tag{2}$$
$$a = -\frac{1}{\ln(1-\theta)} > 0 \tag{3}$$
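As a quick numerical handle on equations (2) and (3), the short Java sketch below (hypothetical code, with θ = 0.8 borrowed from the Appendix 4A diagnostics) evaluates the LSD mass function and draws one clump size by inversion of the cumulative sum; the partial pmf sums should approach 1.

import java.util.Random;

// Sketch of the LSD of eqs. (2)-(3): f(w) = a * theta^w / w, a = -1/ln(1 - theta).
public final class LsdSketch {
    static double pmf(int w, double theta) {
        double a = -1.0 / Math.log(1.0 - theta);
        return a * Math.pow(theta, w) / w;
    }

    // Inversion sampling: accumulate the CDF until it passes a uniform draw.
    static int sample(double theta, Random rng) {
        double u = rng.nextDouble(), cdf = 0.0;
        int w = 0;
        while (cdf < u) { w++; cdf += pmf(w, theta); }
        return w;
    }

    public static void main(String[] args) {
        double theta = 0.8;
        double total = 0.0;
        for (int w = 1; w <= 200; w++) total += pmf(w, theta);
        System.out.println("pmf sum over w = 1..200: " + total);   // close to 1.0
        System.out.println("sampled clump size: " + sample(theta, new Random(1)));
    }
}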
Then X(t), t ≥ 0, has a Poisson∧LSD p.d.f. when N(t) ∼ Poisson(λ) and wi ∼ LSD(θ) for i = 1, 2, ... [6,26,27,29]. However, if for k > 0 we let
$$\lambda = -k\ln(1-\theta) = k\ln q \tag{4}$$
where
$$q = \frac{1}{1-\theta} \tag{5}$$
then X ∼ Poisson∧LSD is a random variable with a negative binomial distribution (NBD), and E(X) = kp is the number of failures expected within the next time or effort unit. Since
$$f(w) = \frac{1}{\ln q}\,\frac{1}{w}\left(\frac{q-1}{q}\right)^{w} = \frac{1}{\ln q}\,\frac{1}{w}\left(\frac{p}{q}\right)^{w} \tag{6}$$
where
$$p = q - 1, \qquad q = p + 1 > 1 \tag{7}$$
its characteristic function (c.f.) is derived as follows:
$$\Phi_{X(t)}(u) = \exp[\lambda(\phi_w(u) - 1)] \tag{8}$$
where φw(u) is the c.f. of the LSD, which is given by
$$\phi_w(u) = 1 - \frac{\ln(q - pe^{iu})}{\ln q} \tag{9}$$
Then
$$\Phi_{X(t)}(u) = \exp\left\{k\ln q\left[1 - \frac{\ln(q - pe^{iu})}{\ln q} - 1\right]\right\} = \exp\left[-k\ln(q - pe^{iu})\right] = (q - pe^{iu})^{-k} \tag{10}$$
Note that Φ_X(t)(u) is the c.f. of the NBD. Now the probability distribution function of X is
$$f(X) = C^{\,k+X-1}_{\,k-1}\,\frac{p^{X}}{q^{\,k+X}} \tag{11}$$
where C denotes a combination operator, and from equation (7),
$$p = q - 1 = \frac{1}{1-\theta} - 1 = \frac{\theta}{1-\theta} \tag{12}$$
where q = 1/(1 − θ). Thus, reorganizing (11) and (12), we obtain
$$f(X \mid \theta) = C^{\,k+X-1}_{\,k-1}\left(\frac{\theta}{1-\theta}\right)^{X}(1-\theta)^{\,k+X} \tag{13}$$
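Equation (13) is straightforward to code once the combination operator is generalized to real-valued k (recall that k is later obtained numerically, so it need not be an integer); this hypothetical sketch uses a running product, avoiding any gamma-function library.

// Sketch of the Poisson^LSD (negative binomial) mass function of eq. (13).
// For real k, C(k+X-1, k-1) = product over j = 0..X-1 of (k+j)/(j+1).
public final class NbdPmf {
    static double pmf(int X, double k, double theta) {
        double coeff = 1.0;
        for (int j = 0; j < X; j++) coeff *= (k + j) / (j + 1.0);
        return coeff * Math.pow(theta, X) * Math.pow(1.0 - theta, k);
    }

    public static void main(String[] args) {
        double total = 0.0;                  // the pmf should sum to ~1 over X >= 0
        for (int X = 0; X <= 500; X++) total += pmf(X, 2.5, 0.8);
        System.out.println(total);
    }
}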
Since the positive autocorrelation among the failures or branches in a cluster is not constant and varies from one cluster to another, it can well be treated as a random variable, denoted by θ, that ranges from 0 to 1. Hence, among continuous distributions with a range between 0 and 1, the beta distribution can be considered a conjugate prior distribution for θ. Since 0 < θ < 1, we let the prior p.d.f. of the random variable Θ = θ be a Beta(α, β) p.d.f.:
$$h(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}, \qquad 0 < \theta < 1,\; \alpha, \beta > 0 \tag{14}$$
$$f(X \mid \theta) = C^{\,k+X-1}_{\,k-1}\,\theta^{X}(1-\theta)^{k} \tag{15}$$
Then the joint p.d.f. of X = Σ_{i=1}^{N(t)} wi and Θ is given as
$$h(\theta, X) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}\,C^{\,k+X-1}_{\,k-1}\,\theta^{X}(1-\theta)^{k} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,C^{\,k+X-1}_{\,k-1}\,\theta^{\alpha+X-1}(1-\theta)^{\beta+k-1} \tag{16}$$
and the marginal distribution of X is given as
$$h(X) = C^{\,k+X-1}_{\,k-1}\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \int_{0}^{1} \theta^{\alpha+X-1}(1-\theta)^{\beta+k-1}\,d\theta = C^{\,k+X-1}_{\,k-1}\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\frac{\Gamma(\alpha+X)\Gamma(\beta+k)}{\Gamma(\alpha+\beta+X+k)} \tag{17}$$
Now, by using Bayes' theorem [38,46], where h(θ | X) = f(X | θ)h(θ)/h(X), the posterior distribution of (θ | X) is derived as follows:
$$h(\theta \mid X) = \frac{\Gamma(\alpha+\beta+X+k)}{\Gamma(\alpha+X)\Gamma(\beta+k)}\,\theta^{\alpha+X-1}(1-\theta)^{\beta+k-1} \tag{18}$$
This is the well-known beta distribution:
$$h(\theta \mid X) = \mathrm{Beta}(\alpha+X,\;\beta+k) \tag{19}$$
With respect to the squared-error loss function definition [46], its expected value is given as
$$E(\theta \mid X) = \frac{\alpha+X}{\alpha+\beta+X+k} \tag{20}$$
which is defined to be the Bayes estimator. The expected value of the random variable X, which has a negative binomial distribution, is obtained by substituting the Bayes posterior p.d.f. of θ from equation (19) into (21) and then using (12) for p together with E(X) = kp:
$$E(X) = k \int_{0}^{1} \frac{\theta}{1-\theta}\,h(\theta \mid X)\,d\theta \tag{21}$$
$$E(X) = k\,\frac{\alpha+X}{\beta-1+k} \tag{22}$$
Therefore, λ = −k ln(1 − θ) = k ln q, and thus k = λ/ln q from equation (4) can be approximated recursively as in (23) when the posterior Bayes estimator of θ from (20) is entered for θ in (4):
$$e^{\lambda/k} = \frac{\alpha+\beta+X+k}{\beta+k} \tag{23}$$
which is a nonlinear equation that can be solved readily by the Newton–Raphson method employing an initial value k(0). Since α and β are given constants, at each discrete step we use the accumulated X (the total failures or branch coverage) and calculate the constant k for the next step. However, using the generalized (incomplete) beta prior [47] instead of the standard beta prior can be more reasonable and realistic, since the former includes an expert opinion (sometimes called an "educated guess") about the feasible range of the parameter 0 < θ < 1. Therefore, θ can be entered by the analyst as a range, or difference,
$$\mathrm{dif}(\theta) = \theta_2\,(\text{upper}) - \theta_1\,(\text{lower}) \tag{24}$$
to reflect a range of prior belief about the positive correlation among the software failures or branches covered in a clump. Finally, we derive a more general equation (whose details are beyond the scope of this section) for the generalized beta to replace equation (23), which was derived for the standard beta prior. Equation (23) transforms into (25) for the generalized beta; for example, when θ1 = 0 and θ2 = 0.6:
$$e^{\lambda/k} = \frac{\alpha+\beta+X+k}{(1-\theta_2+\theta_1)(\alpha+X)+\beta+k} = \frac{\alpha+\beta+X+k}{0.4(\alpha+X)+\beta+k} \tag{25}$$
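Equation (23) is easy to solve in code. The sketch below (hypothetical, not the MESAT-1 source) applies Newton–Raphson to g(k) = e^{λ/k}(β + k) − (α + β + X + k); the values α = 8 and β = 2 match the Table 4.2 examples, while λ, X, and the starting point k(0) in main are assumed inputs.

// Newton-Raphson for eq. (23): exp(lambda/k) * (beta + k) = alpha + beta + X + k.
public final class KSolver {
    static double nextK(double lambda, double alpha, double beta, double X, double k0) {
        double k = k0;
        for (int it = 0; it < 200; it++) {
            double e = Math.exp(lambda / k);
            double g = e * (beta + k) - (alpha + beta + X + k);
            // g'(k), using d/dk exp(lambda/k) = -(lambda/k^2) * exp(lambda/k)
            double gp = e * (1.0 - lambda * (beta + k) / (k * k)) - 1.0;
            double kNew = k - g / gp;
            if (Math.abs(kNew - k) < 1e-10) return kNew;
            k = kNew;
        }
        return k;   // best iterate if the tolerance was not met
    }

    public static void main(String[] args) {
        System.out.println(nextK(5.0, 8.0, 2.0, 38.0, 1.0));
    }
}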
One should emphasize that X is an input datum denoting the experienced value of the number of failures discovered or branches covered as a realization of the compound Poisson (CP) process. Consequently, E(X) is the expected value of software failures or branch coverage in the next unit of time or discrete effort (test case). If E(X) is multiplied by the time units or efforts (test cases) remaining, we can predict the expected number of remaining failures or uncovered branches.

4.1.4 Stopping Rule Proposed for Use in Software Testing

If the incremental difference expected between sequential steps i = 1, 2, ..., where i denotes the testing interval in terms of days or weeks in the time domain or test cases in the effort domain, exceeds a given economic criterion d, testing is continued. Otherwise, testing is stopped. Following is the one-step-ahead formula, whose utility is maximized (or loss minimized), as shown earlier by Randolph and Sahinoglu [36]:
$$e(X) = E(X_{i+1}) - E(X_i) \le d \tag{26}$$
which can be rearranged in the form
$$e(X) = k_{i+1}\,\frac{\alpha+X_{i+1}}{\beta-1+k_{i+1}} - k_i\,\frac{\alpha+X_i}{\beta-1+k_i} \le d \tag{27}$$
by utilizing equation (22). However, incorporating the generalized beta prior yields
$$e(X) = k_{i+1}\,\frac{(\theta_2-\theta_1)(\alpha+X_{i+1})}{(\alpha+\beta-1+X_{i+1}+k_{i+1}) - (\theta_2-\theta_1)(\alpha+X_{i+1})} - k_i\,\frac{(\theta_2-\theta_1)(\alpha+X_i)}{(\alpha+\beta-1+X_i+k_i) - (\theta_2-\theta_1)(\alpha+X_i)} \le d \tag{28}$$
where d = c/(a − b), and α, β, ki, Xi, θ2, and θ1 are input values at each discrete step i. Note that equation (28) defaults to (27) for θ1 = 0 and θ2 = 1 [i.e., dif(θ) = θ2 − θ1 = 1], when neither an expert judgment nor an educated guess exists on the bounds of the correlation strength for failure clumps. If we were to stop at a discrete interval i, we would assume that the failures or branch coverage items discovered in the field will accrue a cost of a per failure or branch after the fact, that is, following release of the software. Thus, there is an expected cost over the interval {i, i + 1} of aE(Xi) for stopping at time t = ti or test case i. If we continue testing over the interval, we assume that there is a fixed cost c for testing and a variable cost b for fixing each failure found during testing before the fact, that is, preceding release of the software. Note that a is almost always larger than b, since it should be considerably more expensive to fix a failure (or recover an undiscovered branch) in the field than to observe and fix it
while testing in house. Thus, the expected cost of continuing testing for the next time interval or test case is bE(Xi) + c. This cost model is inspired by, though not identical to, the criterion expressed in reference 48. Opportunity (shadow) cost is not considered here, since such an additional or implied cost may be included within the more expensive, remedial after-release cost coefficient a. Some researchers are not content with these fixed costs. However, the MESAT-1 tool employed here can treat that problem through a variable-costing, data-driven approach as needed by the testing analyst. That is, a separate value can be entered at will in the MESAT-1 Java program for a, b, or c at each test case if these cost parameters are defined to vary from case to case. Therefore, an alternative cost model similar to that of Dallal and Mallows [48], revised by Randolph and Sahinoglu [36], is used. If, for the ith unit interval beginning at time t or for the ith test case, the expected cost of stopping is greater than or equal to the expected cost of continuing,
$$aE(X_{i+1}) \ge bE(X_i) + c \tag{29}$$
it is economical to continue testing through the interval or effort. On the other hand, if the expected cost of stopping is less than the expected cost of continuing (when the inequality sign is reversed), it is more economical and cost-effective to stop testing:
$$aE(X_{i+1}) < bE(X_i) + c \tag{30}$$
The decision-theoretic justification for this stopping rule is simple. When E(X_{i+1}) and E(X_i) are almost identical, at the point of equality or equilibrium, the decision to stop has the most utility (lowest loss) because of the negligible difference between the old and new information; we stop at a balance point between under- and overtesting. Then (31) follows from (29) and (30):
$$E(X_{i+1}) - E(X_i) = \frac{c}{a-b} = d \tag{31}$$
However, in this chapter we also contend that a one-step-ahead decision is not the only way. A multistrategy approach such as two-stage decision making is shown to be superior, as discussed by Sahinoglu et al. [7]. This is equivalent to using the same stopping rule for the latent data following the decision made for the earlier stopping rules, as described by McDaid and Wilson [42], based on Singpurwalla's and Wilson's taxonomy in their most recent book [49, Chap. 6]. Equation (28) here is neither a fixed-time look-ahead nor a one-bug look-ahead plan as outlined by Singpurwalla and Wilson [49]. Rather, it is one-stage look-ahead testing, fortified by second- and third-stage testing if needed, called a multistrategy testing plan in this chapter and supported in some recent publications [6–10,45]. Appendix 4C shows a practical application of these multistrategy rules using the proposed look-ahead equation (28) under the newly proposed negative binomial distribution probability model, which is a compounded NHPP.
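Assembled from equations (22), (26), (27), and (31), one decision step of the standard-beta rule can be sketched as follows (hypothetical method names; a sketch, not the MESAT-1 implementation):

// One-step-ahead check under the standard beta prior.
public final class OneStepAhead {
    static double expectedNextUnit(double k, double alpha, double beta, double X) {
        return k * (alpha + X) / (beta - 1.0 + k);              // E(X), eq. (22)
    }

    static boolean stopTesting(double kCur, double xCur, double kNext, double xNext,
                               double alpha, double beta, double a, double b, double c) {
        double d = c / (a - b);                                  // divergence criterion, eq. (31)
        double e = expectedNextUnit(kNext, alpha, beta, xNext)
                 - expectedNextUnit(kCur, alpha, beta, xCur);    // e(X), eqs. (26)-(27)
        return e <= d;                                           // stop when e(X) <= d
    }

    public static void main(String[] args) {
        // Hypothetical step values: k and X before and after one more test case.
        System.out.println(stopTesting(2.0, 30.0, 2.1, 31.0, 8.0, 2.0, 1200.0, 200.0, 100.0));
    }
}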
The stopping rule outlined through equations (26)–(29) essentially states that if the number of failures (or branch coverage items) expected to be found in the software in the next unit of time or effort is sufficiently small with respect to a given criterion, we should stop testing and release the software package to the end user. If the number of failures (branch coverage) expected is relatively large, we should continue testing to cover more ground. The stopping rule depends on an up-to-date expression for a Poisson∧LSD or negative binomial distribution, provided that a special assumption holds. Therefore, we need accurate estimates of θ to update stepwise. However, such estimates depend on the history of testing, which implies the use of empirical Bayes decision procedures as described above, such as in the "statistician's reward" or "secretary" problem, where a fixed cost c per observation is considered [6–10,38,39]. Moreover, the divergence factor d = c/(a − b) in equation (31) signifies the ratio of the cost c of performing a test to the difference between the higher cost a of catching a failure after the fact and the lower cost b of catching a failure before release. Given that the numerator c is constant, intuitively, a large difference between a and b, hence a smaller d, will delay the stopping moment, as it is costlier to stop prematurely by leaving uncorrected failures or undetected branches. Also, given that the denominator a − b is constant, a lower testing cost per test case c, yielding a smaller d, will also delay the stopping moment, as it is cheaper to experiment more. Moreover, α, β, ki, Xi, θ2, and θ1 are input constants at each discrete step i, where α and β are prior parameters for the LSD(θ) in the Bayesian analysis, and 0 < θ < 1 denotes the positive-correlation-coefficient-like parameter θ of the LSD. In equations (4) and (20), k is an unknown quantity. Note that θ and k together define λ, which is an important parameter of the model. A complete Bayesian analysis requires an inference on k as well. Even though such an analysis does not yield analytically tractable results, it can readily be done using MCMC (Markov chain Monte Carlo) methods. Since k is not described probabilistically but is estimated using data, the approach followed is not fully Bayesian, but empirical Bayesian [50]. MCMC is beyond the scope of this chapter, which does not use a fully Bayesian approach. θ2 and θ1 are upper and lower constraints for θ if the default situation θ2 − θ1 = 1 is not selected. Now, let RF be the number of faults or coverage items remaining after the stopping action and RT the number of test cases remaining after the stopping action. Then, for the stopping-rule algorithm to be cost-efficient, the following inequality (in dollars) should hold:
$$(\mathrm{RF})a \le (\mathrm{RF})b + (\mathrm{RT})c \tag{32}$$
from which the bounds for a ≤, b ≥, and c ≥ can be derived using simple algebra:
$$a \le \frac{(\mathrm{RF})b + (\mathrm{RT})c}{\mathrm{RF}} \tag{33}$$
$$b \ge \frac{(\mathrm{RF})a - (\mathrm{RT})c}{\mathrm{RF}} \tag{34}$$
$$c \ge \frac{(\mathrm{RF})a - (\mathrm{RF})b}{\mathrm{RT}} \tag{35}$$
$$\text{test expense} = b \times \text{no. of failures repaired} + c \times \text{no. of test cases covered} \tag{36}$$
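In code, the bookkeeping of equations (32) and (36) reduces to a few arithmetic lines; this hypothetical sketch returns the dollar gain of stopping, which is positive exactly when inequality (32) holds.

// Cost-efficiency check of eqs. (32)-(36).
// RF = faults or coverage remaining, RT = test cases remaining after stopping.
public final class StoppingGain {
    static double gain(int RF, int RT, double a, double b, double c) {
        return RF * b + RT * c - RF * a;         // right side minus left side of (32)
    }

    static double testExpense(int failuresRepaired, int testCasesRun, double b, double c) {
        return b * failuresRepaired + c * testCasesRun;   // eq. (36)
    }

    public static void main(String[] args) {
        // Hypothetical inputs: 8 items remaining, 2076 test cases remaining.
        System.out.println(gain(8, 2076, 1200.0, 200.0, 100.0));
    }
}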
4.1.5 Applications and Results

The right-hand side of equation (32) is the dollar amount saved by the stopping action: the remaining test cases are not executed, and the remaining faults (or branches) are not corrected or detected in house. The left-hand side of equation (32) is the dollar amount of the potential loss if those remaining faults or coverage items had to be corrected after release. If the right-hand side of (32) exceeds the left-hand side, there is a positive gain; otherwise, there is a loss. Let TC be the number of test cases, NC the coverage number, and MC the minimum coverage required, which is equal to CL × NC [6,51]. Listed in Table 4.1 are the six varying cost scenarios for Table 4.2, which also indicates the subtle effect of additional constraint information on the range of θ. There are five quadruplets in Table 4.2, each signifying one data set. Each row in a quadruplet pertains to one of the four sensitivity studies 3 to 6 in Table 4.1. The first row in each quadruplet demonstrates a test environment where the value of TC is not available and therefore no confidence level (CL) is specified; thus, testing halts as soon as the one-step-ahead formula (28) first holds after at least two test cases with nonzero failures or branch coverage have been experienced. The second row in each quadruplet again has no specified CL, but testing is allowed to continue until or past a given minimal number of test cases specified by the analyst. In this example, 50% of the total test cases is taken as the minimum, and the test stops as soon as (28) is verified. The test can also be halted by exceeding the expense criterion of equation (36), which is based on the funds budgeted. The third and fourth rows show S(·) when the dollar gain is positive under the savings column. The optimal a ≤ values for definitions (3) and (4) in Table 4.1 are the a costs that render the stopping rule lucrative (i.e., cost-efficient).

TABLE 4.1  Six Scenarios and Their Sensitivity Studies for Table 4.2

1. The stopping rule S(·) for the default intracorrelation with a range of unity, θ2 − θ1 = 1.0.
2. The stopping rule S(·) for the intracorrelation with a range of one-half, θ2 − θ1 = 0.5.
3. a ≤ : Given c = $100 and b = $1000, what is the optimal a ≤ to render MESAT-1 cost-efficient?
4. a ≤ : Given c = $100 and b = $200, what is the optimal a ≤ to render MESAT-1 cost-efficient?
5. Savings with respect to equation (32) using the input cost parameters and coverage level.
6. Expense criterion calculated from equation (36) until stopping, for cases when the CL is unknown.
TABLE 4.2  Single-Stage Stopping Rules S(·) = X*

Data   TC     CL (MC)      (1) θ2−θ1 = 1    (2) θ2−θ1 = 0.5   (3) a ≤    (4) a ≤    (5) Savings   (6) Expense
DR1    —      0            S(4) = 38        S(2) = 36
       100    0.5          S(100) = 94      S(100) = 94                                            $28,800
       200    0.8 (107)    S(126) = 108     S(125) = 108      $1,284     $485       $1,500
       200    0.9 (121)    S(169) = 132     S(167) = 126      $2,550     $1,750     $1,500
DR2    —      0            S(3) = 23        S(2) = 23
       92     0.5          S(92) = 52       S(92) = 52                                             $19,600
       185    0.8 (74)     S(153) = 90      S(153) = 90       $1,300     $500       $1,500
       185    0.9 (83)     S(153) = 90      S(153) = 90       $2,550     $1,750     $1,500
DR3    —      0            S(5) = 4         S(5) = 4
       50     0.5          S(50) = 27       S(50) = 27                                             $10,400
       100    0.8 (35)     S(85) = 43       S(84) = 43        $1,375     $575       $700
       100    0.9 (40)     S(85) = 43       S(84) = 43        $1,633     $833       $700
DR4    —      0            S(4) = 19        S(4) = 19
       100    0.5          S(101) = 54      S(101) = 54                                            $20,900
       200    0.8 (50)     S(95) = 51       S(95) = 51        $1,875     $1,075     $900
       200    0.9 (57)     S(171) = 57      S(171) = 57       $1,483     $683.3     $600
DR5    —      0            S(2) = 4         S(2) = 4
       1094   0.5          S(1094) = 40     S(1094) = 40                                           $117,400
       2176   0.8 (37)     S(100) = 38      S(100) = 38       $27,088    $26,288    $202,300
       2176   0.9 (41)     S(2042) = 42     S(2042) = 42      $4,625     $3,825     $11,300

*DR1 (NC = 134 in TC = 200), rows 1–4; DR2 (NC = 92 in TC = 185), rows 5–8; DR3 (NC = 44 in TC = 100), rows 9–12; DR4 (NC = 63 in TC = 200), rows 13–16; DR5 (NC = 46 in TC = 2176), rows 17–20; all with α = 8 and β = 2, with respect to criteria (1) to (6) in Table 4.1.
Looking at an example from Table 4.2 for DR5: on its first row, stop at the second test case after covering four branches, when equation (28) is first verified. For DR5's second row, CL did not apply because the final number of failures or branches was unknown; a prescribed minimum of 1094 test cases (50% of the total) was allowed to run, at which point the decisive equation (28) was also verified. When the stopping rule is applied, there is an expense amount in dollars accumulated from equation (36). The third and fourth rows in each quadruplet behave with respect to a confidence level of 0.8 (80%) and 0.9 (90%), respectively. Testing may halt on or after ensuring this specified minimal confidence level of coverage, as long as equation (28) holds and the gain is positive in equation (32), since the total number of failures or branches available is known. The TC values in rows 3 and 4 simply display the total known number of test cases for each data set. For DR5's third row, testing stops at the 100th test case for CL = 0.8 after covering 38 branches, exceeding the minimum coverage MC = 37, which is found as MC = CL × NC = (0.8)(46) = 36.8 ≈ 37. Also, to render the stopping rule cost-effective, the cost a per undiscovered fault should be at most $27,088, according to scenario (3) in Table 4.1. Total savings is $202,300 under scenario (5) with the assumed cost parameters c = $100, b = $200, and a = $1200.
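The third-row DR5 arithmetic just cited can be replayed in a few lines (a hypothetical snippet using the Table 4.2 inputs):

// Minimum-coverage check for DR5, row 3 (NC = 46 from the Table 4.2 footnote).
public final class Dr5Check {
    public static void main(String[] args) {
        int NC = 46;                          // total branches known for DR5
        double CL = 0.8;                      // required confidence level
        int MC = (int) Math.ceil(CL * NC);    // 0.8 * 46 = 36.8, so MC = 37
        int covered = 38;                     // branches covered at S(100) = 38
        System.out.println("MC = " + MC + ", met: " + (covered >= MC));
    }
}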
For DR5's fourth row, with CL = 0.9, we stop at the 2042nd test case, covering 42 branches, to save $11,300 when c = $100, b = $200, and a = $1200. The difference between scenarios (1) and (2) is very subtle, but in general the shift from θ2 − θ1 = 1 to θ2 − θ1 = 0.5 in Table 4.1 generates an earlier stopping rule, as expected, implying a less influential prior range. The body of test cases is essentially randomized, as in the major assumption of Poisson or Bernoulli counting processes. Savings, as a prerequisite to a favorable stopping rule, is definitely a function of the cost parameters involved in each scenario, as equation (32) dictates. Essentially, if the cost of redeeming coverage (failure or branch) is high, it is disadvantageous to stop prematurely with respect to a stopping-rule algorithm such as MESAT-1. If the cost parameters are not known, a sensitivity analysis can be conducted to observe a range of losses or savings. MESAT-1 enjoys the benefit of setting a confidence level at will, subject to the availability of budget resources, in addition to a one-step-ahead criterion (28) controlled by the divergence criterion d. Moreover, the MESAT-1 algorithm accounts effectively for clumping of the coverage as well as the positive autocorrelation among the observations in an aggregate. MESAT-1 is also flexible when the final coverage number is not known, as illustrated in Table 4.2, where we allow a minimal number of test cases to run. The method is also flexible in employing variable cost values a, b, or c at different test cases, some test cases perhaps having more weight than others. Note that in Table 4.2, dif(θ) = 1 implies use of the default standard beta prior, whereas dif(θ) < 1 implies implementation of the generalized beta prior. It is clear that as the economic stopping criterion d varies from a liberal (higher) to a conservative (lower) threshold, the stopping rule is shifted and postponed to a later test case. By a conservative setup, we mean a scenario where the stopping rule is trying not to miss any failures, and testing activity is likely to stop later rather than sooner. The correlation behavior within each clump is represented by our choice of α and β in the light of previous engineering judgment. Note that for α > β, as in α = 8 and β = 2, imposed in the empirical Bayesian sense in the examples of Table 4.2, the posterior of the random variable θ displays distinctly left-skewed behavior; it has been observed that stopping occurs earlier in this scenario. However, for α = β, such as α = 5 and β = 5, where the beta distribution is evenly symmetrical rather than skewed, the correlation within the coverage numbers in each test case is not as strong, and it has been observed that the stopping rule is then delayed somewhat, if not considerably. Therefore, a choice of α > β, as in the goodness-of-fit tests in Appendix 4A, is statistically feasible and acceptable. As for the range of the LSD correlation coefficient, dif(θ) = θ2 − θ1, first taking a range of 1.0 (an uneducated guess) and then gradually dropping to 0.5 generally, if not always, has a subtle savings effect.
This is why a generalized beta prior [47] was chosen: to incorporate expert opinion on the range of θ, and to recognize the infeasibility of a very low imposed θ, lending versatility rather than assuming the default case of θ2 − θ1 = 1, in which anything may happen, and thereby avoiding statistically unrealistic autocorrelation values of θ.
Note in Appendix 4A that the goodness-of-fit chi-square tests do not involve counts of zero for the underlying logarithmic-series distribution tested, since the LSD random variable w takes on only nonzero values, w = 1, 2, 3, ..., as shown in equation (2), where the constant a is given by equation (3). Therefore, the blocks show the frequencies of nonzero entities; the zero count can be found by subtracting from the total number of test cases for each data set. Figure 4C.1 in Appendix 4C displays a menu of the aforementioned parameters and solutions for multistrategy testing. Variable cost data (such as DR5vd.txt, where 'vd' denotes "variable data") can also be used by forcing data for the cost parameters a, b, and c, respectively, for each test case entered. Figures 4C.2 to 4C.8 show various applications of MESAT-1.

4.1.6 Discussion and Conclusions

The contribution of the proposed methodology lies in an empirical Bayesian approach to determining an economically efficient stopping rule in a compound Poisson setting that takes into account the accumulation of failure clumps at each step in a software-failure (or branch coverage) counting process. This chapter is a follow-up summary to previous research done on Poisson∧LSD as applied to computer software or hardware testing [6,27,29,37]. In this chapter we also present an alternative to those earlier publications, in which the compounding distribution was assumed to be geometric (hence, Poisson∧geometric), owing to the forgetfulness or independence property of the clumped failures, and where, additionally, the stochastic time index was assumed to be in terms of CPU seconds [27,28,36]. We also address the effort-domain problem, where the unit tests per calendar week are now replaced by test cases, or test vectors as they are sometimes called, in embedded-chip testing. However, in this chapter the compounding density is a logarithmic-series distribution (LSD), where failures are interdependent and assumed to affect each other adversely, in terms of test cases as opposed to a continuous-time domain in terms of CPU seconds, hours, or weeks. Recall that the dual of a time-dependent Poisson process is a time-independent Bernoulli process, whose theory is sufficiently strong to handle the unit test case phenomenon replacing the unit test week as a stochastic index, where the response variable is the success or failure of a fault or branch [32,53]. This is in line with test case–based testing activity, where the limiting distribution of the sum of the nonhomogeneous Bernoulli variables is approximately a compound Poisson process with λt = Σ_{i=1}^{n} p_i, where n represents the number of Bernoulli trials and p_i the probability of detecting a failure or covering a branch at each step i = 1, ..., n [25]. Sahinoglu [25] has studied a sequence y_1, y_2, ..., y_n denoting a set of Bernoulli random variables, with p_i being the success probability (e.g., the software successfully passing the reliability test asserted by the test case at the ith step). Further, assume that the nonindependent and nonidentical y_i form a nonhomogeneous Markov Bernoulli sequence as described in the matrix on page 47 of reference 25. Then the author has proven that S_n = Σ_{i=1}^{n} y_i has an asymptotic or limiting (as n → ∞) Poisson∧geometric, a compound Poisson
distribution with E(S_n) = nP, where P = (1/n) Σ_{i=1}^{n} p_i and Q = 1 − P. The variance is derived as
$$\mathrm{Var}(S_n) = nPQ + \frac{2nPQ\pi}{1-\pi} - \frac{2PQ\pi(1-\pi^{n})}{(1-\pi)^{2}} \tag{37}$$
where π (defined as the autocorrelation coefficient) denotes the degree of interdependence, such that π = 0 implies a completely s-independent Bernoulli sequence and π = 1 indicates complete s-dependence, where the Markov process remains absorbed in its initial state forever. The overdispersion index
$$q(S_n) = \frac{\mathrm{Var}(S_n)}{E(S_n)} \tag{38}$$
is necessary for using the Poisson∧geometric p.d.f. (see Chapter 1) in the book's CD-ROM to conduct statistical inference for time-independent success–failure test schemes.

The stopping rule has been applied to six effort-domain test data sets: DR1 to DR5, compiled at Colorado State University [8–10], and a business-related data set, DR6 [7]. This stopping-rule method is a new derivative of the original publications on the compound Poisson reliability model [6,26–30]. The number of failures or branches covered is independent from test case to test case. Test cases are randomized and thus have no specific order. However, the total number of contributions or coverage items at each one-step-ahead check ensures that the testing activity will stop, owing to a specified criterion d for a set of specified cost parameters, and to the prior parameters α and β imposed on the data set itself, obtained from similar earlier activity or from subjective guesswork. The software analyst can apply a subsequent testing strategy after stopping due to the saturation effect with respect to an economic criterion, provided that there is a desired confidence level. The same algorithm can be used in a follow-up strategy to judge where to stop. Hence, a mixed sequence of strategies can be employed for best efficiency to save time and effort (i.e., overall resources). This is sometimes called mixed-strategy testing [6–10]. McDaid and Wilson [42] have shown that two-stage sampling is superior to single-stage sampling, as illustrated in the examples in Appendix 4C. It is very likely that by sacrificing only a small percentage of failure or branch coverage accuracy, one can avoid wasting testing resources on persisting with the same futile testing strategy, a journey into the unknown. Tables 4B.2 and 4B.3 illustrate the results of mixed-strategy testing activity. Also, as d gets smaller, stopping is commonly delayed for fine-tuning. The saving of testing resources can be very important in colossal testing problems. The stopping-rule method is therefore based on a Bayesian approach to updating historical information for use in future decision making. It assumes a Poisson∧LSD (negative binomial in a special case) model in which the contributed failures clumped in a test case are positively correlated. This implies that the occurrence of a failure or the detection of a branch is likely to invite another failure or branch. For further research, a variety of informative priors can be considered as alternatives to the conjugate generalized Beta(α, β) prior for θ [6,27,47].
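Assuming the (1 − π^n) reading of equation (37), the two quantities can be sketched in Java as follows (hypothetical code, not from the book's CD-ROM):

// Eqs. (37)-(38): variance and overdispersion index q(Sn) of the Markov-
// correlated Bernoulli sum Sn; pi is the autocorrelation coefficient.
public final class MarkovBernoulli {
    static double varSn(int n, double P, double pi) {
        double Q = 1.0 - P;
        return n * P * Q
             + 2.0 * n * P * Q * pi / (1.0 - pi)
             - 2.0 * P * Q * pi * (1.0 - Math.pow(pi, n)) / ((1.0 - pi) * (1.0 - pi));
    }

    static double qSn(int n, double P, double pi) {
        return varSn(n, P, pi) / (n * P);          // E(Sn) = nP, eq. (38)
    }

    public static void main(String[] args) {
        System.out.println(qSn(100, 0.2, 0.0));    // 0.8 = Q: the s-independent case
        System.out.println(qSn(100, 0.2, 0.5));    // > Q: dependence inflates variance
    }
}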
Further, to provide readers with fundamental information about what sorts of methods currently exist for a variety of projects, as listed in Tables 4.1 and 4.2, for the stopping-rule problem, and to provide evidence that the method proposed herein is a substantial improvement, a list of comparisons with other existing methods is presented in Appendix 4B. In summary, the proposed MESAT-1 is progressive and more data-friendly in terms of its exploratory data analysis (EDA) than other methods, which do not attempt diagnostic study. MESAT-1 is suitable for those data sets that satisfy the goodness-of-fit criterion for their clump-size distribution with respect to a hypothesized LSD. This property of MESAT-1 is therefore discriminative rather than one-size-fits-all. This is why five of the five data sets proved positive for the assumed LSD, and hence good fits are declared for the NBD as a natural consequence of equations (1) to (13); other data sets may not fit. MESAT-1's only seemingly subtle disadvantage is the assumption of independent and randomized test cases, which may or may not hold in actual testing. This assumption is actually a requirement for the independent-increments property of the Poisson process, the major underlying counting distribution in this research. However, as explained in Section 4.1, the randomization assumption is a practical reality in testing practice. Even if otherwise suspected, there is no universally accepted solution to modeling the correlation of test cases for each testing activity, whose results are not known in advance, by the nature of the surprise factor in software testing. In Table 4.2's second and third rows, it is assumed that you know the end of the data set in terms of how many total test cases and total coverage items exist. Also, you should not stop unless you exceed a minimal coverage criterion, such as 70% or 80%, and have a resulting positive profit in the "gain" column with respect to equation (32). A positive profit means that the right side of (32) is greater than the left side. The profit criterion is honored in Table 4.2's stopping rules S(·) = X for the third and fourth rows, given together with the minimal criteria. Then we can optimize a, b, and c, where one target is entered as 0 and the other two are kept constant; do not change d while doing so. On the other hand, if one does not know the total number of test cases ahead of time, decide on a minimal number of test cases to try, such as 100 of an estimated 200. Also decide on an initial budget (e.g., $15,000) for testing and in-house repair. You need to do this before you release your reliable product following cyber-testing, or before you conclude that the product is secure following security testing. We can prioritize the expense account to dictate a stopping point rather than prioritizing the minimal number of test cases. If no budget expense account is listed ($0), use the minimal number of test cases to dictate a stopping point. The coverage percentage in this scenario will not be meaningful, owing to the nonavailability of the final count of errors or coverage items. One additional feature available in MESAT-1 is the "allowed" column. This feature roughly estimates the number of errors expected in the near future by using the ratio RF (= number of remaining failures) = expense/after-release cost a, to justify the spending. In Figure 4C.9, when the final number of test cases and failures is unknown, the total expense of $25,000 divided by the cost a = $1000 generates roughly RF = 25 failures remaining, a convenient estimate.
APPENDIX 4A: ANALYSIS TABLES
FIGURE 4A.1  Frequency distribution of cluster sizes of data sets DR1 to DR5 (bar chart; cluster sizes 1 to >11 on the horizontal axis, frequency counts on the vertical axis).

TABLE 4A.1  Diagnostic Checks for Experimental Data Sets

Cluster Size    DR1    DR2    DR3    DR4    DR5
1                11      9      9     11     13
2                 9      8      5      3      3
3                 4      3      3      3      5
4                 1      1      0      0      3
5                 0      0      0      2      0
6                 1      2      0      2      0
7                 1      1      0      1      0
8                 2      2      2      1      0
9                 0      0      0      0      0
10                0      0      0      0      0
>11               2      1      0      0      0

TABLE 4A.2  Goodness-of-Fit Tests for Data Sets DR1 to DR5 with p-Values

Data set DR1: n = 31, p = 0.149, a = 0.62, α = 0.05, θ = 0.8; p > α: good fit

X      P           E           O     Chi-Square
1      0.497064    15.90605    11    1.513217
2      0.198826     6.36242     9    1.093426
3      0.106040     3.39329     4    0.108478
4      0.063624     2.03597     1    0.527140
5      0.040719     1.30302     0    1.303023
6      0.027146     0.86868     1    0.019851
7      0.018615     0.59567     1    0.274456
8      0.013030     0.41697     2    6.010041
9      0.009266     0.29651     0    0.296510
10     0.006671     0.21349     0    0.213487
>11    0.018998     0.60793     2    3.187638
Total                                14.54727
Data set DR2: n = 27, p = 0.117, a = 0.62, α = 0.05, θ = 0.8; p > α: good fit

X      P           E           O     Chi-Square
1      0.497064    13.42073     9    1.456168
2      0.198826     5.38291     8    1.290148
3      0.106040     2.86309     3    0.006547
4      0.063624     1.71785     1    0.299975
5      0.040719     1.09943     0    1.099426
6      0.027146     0.73295     2    2.190344
7      0.018615     0.50260     1    0.492269
8      0.013030     0.35182     2    7.721385
9      0.009266     0.25018     0    0.250181
10     0.006671     0.18013     0    0.180130
>11    0.018998     0.51294     1    0.462484
Total                                15.44906

Data set DR3: n = 19, p = 0.078, a = 0.62, α = 0.05, θ = 0.8; p > α: good fit

X      P           E           O     Chi-Square
1      0.497064     9.44422     9    0.020894
2      0.198826     3.77769     5    0.395494
3      0.106040     2.01477     3    0.481786
4      0.063624     1.20886     0    1.208860
5      0.040719     0.77367     0    0.773670
6      0.027146     0.51578     0    0.515780
7      0.018615     0.35368     0    0.353678
8      0.013030     0.24757     2    12.404330
9      0.009266     0.17605     0    0.176053
10     0.006671     0.12676     0    0.126758
>11    0.018998     0.36096     0    0.360958
Total                                16.81826

Data set DR4: n = 23, p = 0.477, a = 0.62, α = 0.05, θ = 0.8; p > α: good fit

X      P           E           O     Chi-Square
1      0.497064    11.43247    11    0.016360
2      0.198826     4.57299     3    1.541067
3      0.106040     2.43893     3    0.129074
4      0.063624     1.46336     0    1.463356
5      0.040719     0.93655     2    1.207551
6      0.027146     0.62437     2    3.030870
7      0.018615     0.42814     1    0.763841
8      0.013030     0.29970     1    1.636417
9      0.009266     0.21312     0    0.213117
10     0.006671     0.15344     0    0.153444
>11    0.018998     0.43695     0    0.436949
Total                                9.592047
Data set DR5: n = 24, p = 0.651, a = 0.62, α = 0.05, θ = 0.8; p > α: good fit

X      P           E           O     Chi-Square
1      0.497064    11.92954    13    1.096055
2      0.198826     4.77181     3    0.657889
3      0.106040     2.54497     5    2.368275
4      0.063624     1.52698     3    1.420965
5      0.040719     0.97727     0    0.977268
6      0.027146     0.65151     0    0.651512
7      0.018615     0.44675     0    0.446751
8      0.013030     0.31273     0    0.312726
9      0.009266     0.22238     0    0.222383
10     0.006671     0.16012     0    0.160116
>11    0.018998     0.45595     0    0.455947
Total                                7.769886
APPENDIX 4B: COMPARISON OF THE PROPOSED CP RULE WITH OTHER STOPPING RULES

Almost all of the existing statistical models used to determine stopping points stem from research results in software engineering [6,7,40–45]. Many models have been proposed for assessing the reliability of software systems, to help designers evaluate, predict, and improve their quality [54–61]. However, software reliability models aim at estimating the faults remaining in a given software program, which makes direct use of such models unhelpful for estimating the number of uncovered branches remaining in a behavioral model, since the remaining uncovered branches are known. Instead, the estimation process can be modified slightly to focus on the number of faults, or coverage items in the case of behavioral model verification, that are expected within the next unit of testing time. Unfortunately, all the existing software reliability models assume that failures occur one at a time, except for the proposed MESAT approach, which uses a compound Poisson (CP) model and does not make that assumption. Based on this assumption, expectations of the time between failures are determined. In observing new coverage items in a behavioral model, branches are typically covered in clumps. In the proposed MESAT tool, the positive correlation within a clump is taken into account. The confidence-based modeling approach takes advantage of hypothesis testing in determining the saturation of software failure [58,59]. A null hypothesis H0 is formed and later examined experimentally, based on an assumed probability distribution for the number of failures in a given software product. Suppose that a failure has a probability of occurring of at most B; then we are at least 1 − B confident that H0 is true. Similarly, if the failures for the next period of testing time have the same probability of at most B of occurring, then for the next N testing cycles we have a confidence of at least C that no failures will happen,
where
$$C = 1 - (1 - B)^{N} \tag{4B.1}$$
$$N = \frac{\ln(1-C)}{\ln(1-B)} \tag{4B.2}$$
If C = 0.95 and B = 0.03, then by using equation (4B.2), N ≈ 100. This is a single-equation stopping-rule method, which can be likened to a parallel system of N independent components whose reliabilities are identical, R = 1 − B each, satisfying an overall network reliability of C [62, p. 265]. To apply Howden's model to the process of HDL verification, we first need to cast failures as interruptions, where an interruption is an incident in which one or more new parts of the model are exercised. Using branch coverage as a test criterion, an interruption therefore indicates that one or more new branches are covered. We set a probability for the interruption rate B and choose an upper-bound level of confidence C. Experimentally, we do not examine the hypothesis unless the interruption rate becomes smaller than the preset value B. When it does, we calculate the number of test patterns needed to have at least confidence C of not covering any new branch in the next N patterns, and run them. If an interruption occurs, we continue examining the hypothesis until we prove it, and then stop. In this approach we assume that coverage items, or interruptions, are independent and have equal probabilities of being covered. If the rate of interruption is decreasing and we assume that no interruptions will occur in the next N test cases, then the expected probability of interruption becomes [58,59]
$$B_t = \frac{B}{t+T} \tag{4B.3}$$
where T is the last point checked in testing; this leads to the reformulation of equation (4B.1) as
$$C = 1 - \prod_{t=1}^{N}\left(1 - \frac{B}{t+T}\right) \tag{4B.4}$$
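A hypothetical sketch of Howden's two calculations, equations (4B.2) and (4B.4), follows; the value of T in main is an assumed illustration.

// Confidence-based (Howden-style) quantities, eqs. (4B.2) and (4B.4).
public final class HowdenRule {
    static long patternsNeeded(double C, double B) {
        return (long) Math.ceil(Math.log(1.0 - C) / Math.log(1.0 - B));  // eq. (4B.2)
    }

    static double confidence(int N, double B, double T) {
        double prod = 1.0;
        for (int t = 1; t <= N; t++) prod *= 1.0 - B / (t + T);          // eq. (4B.4)
        return 1.0 - prod;
    }

    public static void main(String[] args) {
        System.out.println(patternsNeeded(0.95, 0.03));  // about 100, as in the text
        System.out.println(confidence(100, 0.03, 50.0)); // T = 50 is hypothetical
    }
}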
In Howden's model, the assumption that failures or interruptions occur independently, each with a given probability B, is not error free. As we know, branches in an HDL model are strongly dependent on each other; in fact, in some cases it is impossible to cover the lower-level branches without covering their dominating branches. Moreover, the clump sizes caused by the interruptions are not modeled in this study, making the decision to continue or stop the testing process inaccurate. Finally, this work does not incorporate the cost of testing or of releasing the product, and the goal of testing in the first place is not only having a high-quality product but also minimizing the testing costs [40]. Dallal and Mallows [48] assumed that the total number of software failures is a random variable with unknown mean and that the number of failures that occur
during testing is a nonhomogeneous Poisson process with increments λg(t). The time needed for a single failure to occur is distributed as g(t), which can be assumed exponential. This model describes the failure process better than do the models discussed previously, such as the Howden and modified Howden methods. However, it still suffers from the restriction of not allowing more than one interruption at a time, which reduces the efficiency of the model when applied to branch coverage estimation [40,48,58–60]. Finally, the present author applied a compound Poisson method that models the branch coverage process of VHDL circuits, utilizing the benefits of the one-step-ahead econometric model by reformulating it [6,36,48] and resolving the clumping phenomenon of branches covered in the testing process. This model uses empirical Bayesian principles for the compound Poisson counting process. It was introduced in 1992 as a software reliability model for estimating the remaining number of failures [27] and later modified [6,36] to incorporate a different version of the cost modeling proposed in 1995 by Dallal and Mallows [48]. More recently, it has been formulated to model the branch coverage process in behavioral models [6–10]. The idea is to compound two potential probability distributions: one for the number of interruptions and one for the size of each interruption. The resulting compound distribution is assumed to be the probability distribution function of the total number of failures, or coverage items, at a certain testing time point. The parameters of the distributions are also assumed to be random variables, based on empirical Bayesian estimation. For modeling the branch coverage process for behavioral models, it is assumed that the number of interruptions over time, N(t), is a Poisson process with mean λ, and the size of each given interruption, wi, is distributed as a logarithmic-series distribution (LSD; see the diagnostics of Appendix 4A for a justification of the LSD of clump sizes). The resulting compound distribution for the total number of failures, which is the sum of the sizes, is also known as a negative binomial distribution if the Poisson parameter λ is set to −k ln(1 − θ). The compound Poisson model takes the clumps of the coverage items into account in a statistical manner by updating the assumed probability distribution parameters in every test case based on the testing history. However, interruptions in the testing process are assumed to be independent, due primarily to the independent-increments property of the anchoring Poisson process. The proposed MESAT-1 also incorporates a minimal confidence rule in addition to using the one-step-ahead formula (28) for assessing whether to stop or continue economically. All the stopping rules discussed previously assume that failures or interruptions are random processes following a given probability distribution. A sequential sampling technique that involves no assumptions regarding the probability distributions for the failure process was presented by Musa [56]. Recently, the technique has been applied to VHDL models to determine stopping points for a given testing history of branch coverage [61]. The model evaluates the stopping decision based on three key factors: the discrimination ratio, γ; the supplier risk, α; and the consumer risk, β. If the cumulative coverage at time
t is X(t), the testing process should be stopped at
$$X(t) = \frac{\ln[(1-\beta)/\alpha] - \ln\gamma}{1-\gamma} \tag{4B.5}$$
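Equation (4B.5) amounts to a single line of code; in this hypothetical sketch the inputs echo the values cited below (Musa's α = 0.10 and γ on the order of 5 or 10).

// Sequential-sampling stopping threshold on cumulative coverage, eq. (4B.5).
public final class SequentialSampling {
    static double stopThreshold(double alpha, double beta, double gamma) {
        return (Math.log((1.0 - beta) / alpha) - Math.log(gamma)) / (1.0 - gamma);
    }

    public static void main(String[] args) {
        System.out.println(stopThreshold(0.10, 0.10, 5.0));
    }
}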
The stopping decision depends on the value of γ much more than on α and β. The decision does not incorporate a cost model of the testing process. In [56], the variable was modified with respect to testing strategies such that if higher coverage was achieved in the previous test strategy, the value of γ is increased in the current test strategy in order to decrease the expectation of achieving more coverage in the current strategy. The new value of γ therefore becomes γ′ = γ ln Δ, where Δ denotes the coverage increase achieved in the previous test strategy. The value of γ remains the same, however, if Δ < e. This type of statistical modeling does not use any prior probability distribution for the data provided, which is one reason why sequential sampling models are used widely in many testing areas [35,40]. However, the cost of testing is not modeled in making the stopping decision. Moreover, in the opinion of this author, the stopping point determined by the sequential sampling model is very sensitive to the γ value chosen during the testing process. Equation (4B.5) is open to abuse for purposes of experimental validation. Authors of this approach [61] have earlier suggested values of γ up to 250, whereas Musa's paper [56] uses γ only on the order of 5 or 10. Excessive values of γ pose a contradiction and a threat to Wald's SPRT theory for sequential testing in terms of type I (whose probability is α) and type II (whose probability is β) errors. The same holds true for α, which various authors have suggested to be 0.50, a relatively exaggerated value compared to Musa's 0.10. Singpurwalla et al. [41,44,49], McDaid and Wilson [42], and Ross [43] have developed their own stopping rules with differing statistical assumptions in one- or two-stage testing schemes. However, because these techniques have not been subjected to hardware or silicon testing with respect to branch coverage, no comparative results are available in the engineering literature. The arguments above suggest that the proposed MESAT-1, which employs both a minimal confidence rule and a one-step-ahead formula within a single- or multistage testing scenario to justify a decision on whether to continue or stop testing, has the imminent advantages of recognizing the clumping effect in coverage testing as well as incorporating economic criteria, in addition to its data-discriminative traits of conducting exploratory data analysis through diagnostic checks. It is imperative that a diagnostic check, such as in Appendix 4A, be undertaken if similar exhaustive test results are available. This is necessary to justify use of the LSD model for the clump sizes, a model that eventually leads to an NBD assumption for the total amount of coverage by default, in the wake of the expression λ = −k ln(1 − θ) = k ln q, assumed to hold true. For a more thorough comparative case study, research done by Hajjar and Chen was utilized [63,64], in which nine stopping rules, shown in Table 4B.1, were applied to 14 different VHDL models [45]. The results of the stopping-rule determinations are shown in Table 4B.2, including results obtained without
TABLE 4B.1 Stopping Rules Used in the Case Study

Orig.  Original (without a stopping rule)
SS1    Sequential sampling, fixed
SS2    Sequential sampling, variable
HW1    Howden's first formula
HW2    Howden's second formula
BM     Binary Markov model
DL     Dalal–Mallows model
CP     Compound Poisson rule
SB     Static Bayesian rule
DB     Dynamic Bayesian rule
CDB    Confidence-based dynamic Bayesian rule
Source: [64].
This stopping-rule comparison portrays the compound Poisson (CP) method as having one of the lowest efficiencies based on a naive coverage-per-testing-pattern index, defined as the number of branches covered divided by the total number of test patterns used. Despite that index rating, CP found the most faults for 10 of the 14 VHDL models, while ranking second in B15, third in B01, and fourth in B04 (Tables 4B.2 and 4B.3). Furthermore, no economic analysis had been undertaken to illustrate the monetary gain or loss associated with the various stopping rules. We now use the cost-benefit criterion of equation (32), where RF is the remaining number of failures uncovered and RT is the number of test patterns still unused when stopped. In our example we use c = $1, b = $230, and a = $2300, since the cost of after-market redemption is taken to be 10 times greater than that before release. Using the Sys7 data with CP, we get, by equation (32),

$2300(568 − 547) = $2300(21) = $48,300 < $230(21) + $1(54,283 − 6,287) = $52,826        (4B.6)

thus showing CP to be cost-effective by $52,826 − $48,300 = $4,526. Comparing Sys7 with DB,

$2300(568 − 536) = $2300(32) = $73,600 > $230(32) + $1(54,283 − 563) = $61,080        (4B.7)

showing DB not to be cost-effective, by $61,080 − $73,600 = −$12,520. Why is a ratio of 10 used between before- and after-release costs? The reason is that, unlike software testing, silicon testing is more expensive for uncovered branches or failures. Although access to the VHDL model data used in Hajjar and Chen's research [63,64] was not available, cost analysis could still be applied to their results.
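The arithmetic of (4B.6) and (4B.7) is easily scripted. The following Python sketch (the function name is ours; the unit costs are those quoted above) reproduces the $4,526 gain for CP and the $12,520 loss for DB on the Sys7 data:

    def stopping_saving(total_cov, stop_cov, total_pat, stop_pat, a=2300, b=230, c=1):
        # Saving from stopping early: (b*RF + c*RT) - a*RF, where RF is the number
        # of coverage items left uncovered and RT the test patterns left unused.
        rf = total_cov - stop_cov
        rt = total_pat - stop_pat
        return (b * rf + c * rt) - a * rf  # positive means stopping was cost-effective

    print(stopping_saving(568, 547, 54283, 6287))  # CP on Sys7: 4526
    print(stopping_saving(568, 536, 54283, 563))   # DB on Sys7: -12520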
TABLE 4B.2 Results of Stopping-Rule Coverage versus Number of Test Patterns for the Static Case Study^a (entries: branches covered/test patterns used)

Model  Orig.     SS1       SS2       HW1       HW2       BM        DL        CP        SB        DB        CDB
Sys7   568/5428  536/1039  538/1858  536/927   536/969   536/1025  536/1235  547/6287  535/661   536/563   535/569
8251   161/8150  73/3259   73/3812   79/2769   79/2906   81/3033   75/5712   112/9600  74/2275   73/2239   67/2091
B01    200/8000  177/8169  142/3352  128/1010  155/1108  128/1211  128/1211  135/4200  128/1854  128/914   128/897
B04    223/8000  220/1028  218/5047  206/1894  214/2468  214/2557  219/1175  217/1710  199/631   202/674   202/742
B05    259/8000  234/1079  251/7122  251/2092  251/2343  251/2431  252/5318  253/1080  232/808   233/745   233/744
B06    210/8000  192/8725  192/4618  192/1240  192/1407  192/1439  204/7110  204/4500  192/673   192/708   192/708
B07    210/8000  196/8963  198/4660  196/1322  204/3904  196/1621  196/1132  204/4500  195/789   195/704   195/731
B08    274/8000  268/1244  268/6122  263/1392  263/1405  273/2283  273/8427  273/9600  273/2249  273/2033  273/1829
B09    260/8000  234/1079  251/7122  234/1512  251/2053  251/2470  252/5324  253/7800  232/809   233/734   233/735
B10    210/8000  197/9068  198/4660  204/1488  204/1711  204/1781  196/915   208/4200  208/2181  208/1488  208/1240
B11    223/8000  220/1028  218/5047  206/1894  214/2468  214/2557  219/1175  217/1710  199/631   202/674   202/742
B12    259/8000  234/1079  251/7122  234/1545  251/2085  251/2462  252/5318  253/6900  232/808   233/745   233/744
B14    257/8000  248/1136  248/5712  244/1892  244/1900  244/1991  248/1618  253/2100  245/1982  245/735   245/748
B15    415/8000  351/1619  351/7906  350/1892  350/1900  350/1991  418/8000  383/9000  364/2080  364/2298  364/2010

^a It is assumed that the coverage per testing pattern can be calculated (i.e., coverage/patterns) without using cost factors.
In a case study using the cost criterion of equation (32), in which a cost index was applied to the data with cost values of a = $5000, b = $500, and c = $1, the CP stopping rule was clearly more beneficial. As can be seen in Tables 4B.1, 4B.2, and 4B.3, of the nine stopping rules used in that study, the compound Poisson stopping rule ranked very high with regard to savings on many of the VHDL data sets: CP scored six first, three second, and two fourth places in Table 4B.3. The low cost of testing, in conjunction with the high postrelease repair cost, renders the CP stopping rule superior to many of the other stopping rules in the study. The incentive behind mixed-strategy testing is that a bug undetected in a silicon-embedded chip is much more costly than a bug in software, and therefore the stopping rule needs to be very conservative.
TABLE 4B.3 Comparisons of Costs in Dynamic Case Study for a = $5000, b = $500, and c = $1 (rules ranked by savings/benefit, high to low, left to right)

Sys7: CP −46504; SS2 −82575; DB −90280; HW1 −90644; HW2 −90686; BM −90742; SS1 −90756; DL −90952; CDB −94786; SB −94878
8251: CP −148600; BM −281533; HW1 −290269; HW2 −290406; DL −311212; SB −312275; DB −316739; SS1 −317759; SS2 −318312; CDB −343591
B01:  SS1 −31669; HW2 −133584; SS2 −184352; CP −216700; CDB −244897; DB −244914; HW1 −245010; BM −245211; DL −245211; SB −245854
B04:  SS1 56218; SS2 52453; DL 50245; HW2 37032; BM 36943; CP 35900; HW1 1606; DB −15174; CDB −15242; SB −28631
B05:  DL 43182; CP 42200; HW1 41908; HW2 41657; BM 41569; SS2 36878; CDB −37744; DB −37745; SB −42308; SS1 −43295
B06:  CP −2240; DL −2407; SB −2439; DB −5618; CDB −9725; HW1 48500; HW2 45890; BM −1673; SS2 −1708; SS1 −1708
B07:  HW2 49096; CP 48500; SS2 21340; DL 15868; HW1 15678; BM 15379; DB 11796; CDB 11769; SB 11711; SS1 8037
B08:  CDB 73671; DB 73467; SB 73251; BM 73217; DL 67073; CP 65900; SS2 46878; SS1 40553; HW1 29108; HW2 29095
B09:  CP 40700; DL 38676; HW2 37447; BM 37030; SS2 32378; HW1 −38512; DB −42234; CDB −42235; SB −46809; SS1 −47795
B10:  CDB 69760; DB 69512; SB 68819; CP 66800; HW1 51512; HW2 51289; BM 51219; SS2 21340; DL 16085; SS1 12432
B11:  SS1 56218; SS2 52453; DL 50245; HW2 37032; BM 36943; CP 35900; HW1 1606; DB −15174; CDB −15242; SB −28631
B12:  CP 46100; DL 43182; HW2 41915; BM 41538; SS2 36878; HW1 −34045; CDB −37744; DB −37745; SB −42308; SS1 −43295
B14:  CP 41000; DL 37882; SS2 33788; SS1 28133; DB 25265; CDB 25252; SB 24018; HW1 19608; HW2 19600; BM 19509
B15:  DL 13498; CP −73000; CDB −151510; SB −151580; DB −151798; HW1 −214392; HW2 −214400; BM −214491; SS2 −215906; SS1 −224190
TABLE 4B.4 Specifications of Two of the Models Listed in Table 4B.3^a

                      Sys7   Intel 8251
LOC                   3785   3113
Branches              591    207
Input control bits    7      11
Input data bits       62     8
Process blocks        92     3
Levels of hierarchy   5      1

^a The Sys7 model is a two-dimensional real-time object classification chip, and the Intel 8251 model is a microcontroller chip.
At the other end of the spectrum, because the cost of testing is much less than the cost of a bug in silicon, it seems that a nonconservative stopping rule is worse than some of the other rules. Another angle can be extracted from Table 4B.2, where the branch coverage of SB (Hajjar and Chen's proposed rule) is more than 10% less than the original
(no stopping rule). This is probably not acceptable in hardware. For comparison, ATPG typically aims for fault coverage higher than 90%, and a user would probably pursue even a 1% increase in coverage points if it were achievable in a reasonable amount of computation. The rule proposed is therefore probably better suited to switching testing strategies than to stopping the testing process.

APPENDIX 4C: MESAT-1 OUTPUT SCREENSHOTS AND GRAPHS [6]
FIGURE 4C.1 MESAT-1 multistrategy testing for data set DR5 with cost results.
FIGURE 4C.2 Plot of multistrategy stopping rule for DR5 in Figure 4C.1 at a minimal 80% confidence level.

FIGURE 4C.3 Plot of multistrategy stopping rule for DR5 at a minimal 90% confidence level.
Week  Lambda   k        w  X   E(X)   e(X)   Percentage
1     1.0      0.57711  4  4   4.391  N/A    8.7
5     0.4      0.20036  2  6   2.337  0.936  13.04
6     0.5      0.2266   4  10  3.325  0.988  21.74
7     0.57143  0.24445  3  13  4.125  0.8    28.26
8     0.625    0.26348  1  14  4.588  0.463  30.43
9     0.66667  0.26808  3  17  5.285  0.697  36.96
10    0.7      0.27024  3  20  5.957  0.672  43.48
34    0.23529  0.08512  3  23  2.432  0.372  50.0
35    0.25714  0.0894   4  27  2.872  0.44   58.7
43    0.23256  0.07859  3  30  2.769  0.348  65.22
44    0.25     0.08385  1  31  3.017  0.248  67.39
52    0.23077  0.0767   1  32  2.849  0.223  69.57
66    0.19697  0.06435  2  34  2.539  0.232  73.91
76    0.18421  0.0597   1  35  2.423  0.174  76.09
91    0.16484  0.05299  1  36  2.214  0.153  78.26
99    0.16162  0.05124  2  38  2.242  0.18   82.61
100   0.16     0.05073  0  38  2.221  0.00   82.61
Stop at X(100) = 38.0 Coverage = 82.6086956521739 %
Cost Analysis:
• Cost of correcting all 46 errors by exhaustive testing would have been $46,000.00. Cost of correcting 38 prerelease errors using MESAT is $38,000.00. Savings for not correcting the remaining 8 by using MESAT: $8,000.00.
• Cost of executing all 2176 test cases by exhaustive testing would have been $1,088,000.00. Cost of executing 100 test cases by using MESAT is $50,000.00. Savings for not executing the remaining (2176 − 100) = 2076 test cases: $1,038,000.00.
• Results of using MESAT: the $8,000.00 saved for not correcting the remaining 8 errors, plus the $1,038,000.00 saved for not executing the remaining 2076 test cases, equals a total savings of $1,046,000.00; minus the $16,000.00 post-release cost of correcting the 8 errors not covered (8 × $2,000.00), the total savings for using MESAT is $1,030,000.00.

Strategy: 1  Stop at X(100) = 38.0  Coverage = 82.0%  Total Coverage = 83%  Total Covered = 38
Strategy: 2  Stop at X(1959) = 7.0  Coverage = 88.0%  Total Coverage = 98%  Total Covered = 45
Strategy 1 Cost Analysis Summary: Total savings for using MESAT is $1,030,000.00
Strategy 2 Cost Analysis Summary: Total savings for using MESAT is $63,500.00
Insufficient data for Strategy Number: 3
FIGURE 4C.4 Results of DR5 mixed strategy stopping rule at a minimal 80% confidence level.
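The totals reported in Figure 4C.4 follow from simple bookkeeping with unit costs of $500 per test case, $1,000 per prerelease error, and $2,000 per postrelease error, as read off the figure. A minimal Python check (the function name is ours):

    def mesat_total_saving(errors_total, errors_found, tests_total, tests_run,
                           cost_test=500, cost_pre=1000, cost_post=2000):
        # Total saving versus exhaustive testing, per the accounting in Figure 4C.4.
        missed = errors_total - errors_found
        saved_corrections = cost_pre * missed                # errors not corrected prerelease
        saved_tests = cost_test * (tests_total - tests_run)  # test cases not executed
        return saved_corrections + saved_tests - cost_post * missed

    print(mesat_total_saving(46, 38, 2176, 100))  # 1030000, matching the figure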
Week  Lambda   k         Exact X  Cum. X  E(X)   e(X)    %      Expense     Gain
1     1.0      0.057711  4        4       4.391  N/A     8.7    $4500.0     $1051000.0
2     0.5      0.27209   0        4       2.567  0.00    8.7    $5000.0     $1050500.0
3     0.33333  0.17794   0        4       1.813  0.00    8.7    $5500.0     $1050000.0
4     0.25     0.13218   0        4       1.401  0.00    8.7    $6000.0     $1049500.0
5     0.4      0.20036   2        6       2.337  0.936   13.04  $8500.0     $1051000.0
2042  0.00979  0.00301   1        42      0.15   0.0090  91.3   $1063000.0  $68500.0
strategy: 1 Stop at X(2042) = 42.0 Coverage = 91.0% Total Coverage = 91.0 % Total Covered = 42 strategy: 2 Stop at X(86) = 4.0 Coverage = 100.0% Total Coverage = 100.0 % Total Covered = 46 Strategy 1 Cost Analysis Summary: Total savings for using MESAT-1 is $685000.00 Strategy 2 Cost Analysis Summary: Total savings for using MESAT-1 is $29500.00
FIGURE 4C.5 Results of DR5 mixed strategy stopping rule at a minimal 90% confidence level.
FIGURE 4C.6 MESAT-1 mixed strategy testing results for DR4.
FIGURE 4C.7 MESAT-1 results summary on when to stop, and the economic plot for DR4 for a minimal 80% confidence level.
FIGURE 4C.8 MESAT-1 results summary of DR4 when the number of failures is not known in advance. Mixed strategy is not conducted for such scenarios. Budget = $20,000, minimal number of cases = 100, coverage criterion = 0, number of coverages = 0.
4.2 STOPPING RULE FOR HIGH-ASSURANCE SOFTWARE TESTING IN BUSINESS

Nutshell 4.2  In this application-oriented section we argue that cost-effective testing can be less thorough yet more efficient if applied in a well-managed, empirical manner across the entire software development life cycle (SDLC). To ensure success, testing must be planned and executed within an earned value management (EVM) paradigm. A specific example of empirical software testing is given: the empirical Bayesian stopping rule, i.e., the MESAT-1 algorithm covered in Section 4.1. The stopping rule is applied to an actual case of business software development to show the potential gains relative to the archaic testing methods used earlier. The result is that a percentage of the particular testing effort could have been saved under normal circumstances had the testing been planned and executed under EVM with the empirical Bayesian stopping rule (MESAT-1).

4.2.1 Introduction

Across the manufacturing world and the general software industry, there is a drastic disparity in SDLC test planning and management. Businesses waste tremendous amounts of resources by not planning, developing, or testing software in an efficient, scientific manner. EVM is misunderstood and misused, planning is not comprehensive, and testing is not pervasive throughout the SDLC. There are methods of efficiently managing an SDLC project whereby intensified planning and oversight will not cause a negative return on investment. These methods include sufficient planning within an EVM methodology, pervasive SDLC testing, and application of scientific rules for life-cycle testing [7].

4.2.2 EVM Methodology

Within an SDLC, software project managers strive to reduce risk while reducing the time that it takes to develop a product and perform the tests. In a large project, planning can take a considerable amount of time. EVM suffers most often because planning must take place long before requirements are specified. Additionally, testing is either not planned sufficiently or at least not planned discretely. More often than not, the method used to test software units is to throw data at them and view the output. This is black-box testing, and it is inefficient because it can be a trial-and-error process.

To apply EVM methods, all software products must be planned, scheduled, resourced, and budgeted. What is a software product? It is any artifact of the SDLC. These products include the individual requirement, the requirement specification, the module design, the interface specification, the software unit, the test plan, and the test script, among others. In short, every activity in a development
effort is associated with some sort of product. Consequently, all of those products can be tested; thus, all of the items listed above can be tested for accuracy and feasibility. When products undergo this level of planning, the actual testing of those products becomes a part of production.

4.2.3 Typical SDLC Testing Management

When software undergoes archaic testing methods, where testing "just happens" at a particular phase near the end of the SDLC, it can never be planned and budgeted efficiently. Nevertheless, within most testing methods, such as build all units followed by test all units, a set of prescribed-use cases with data is input (e.g., a black box). When a failure or group of failures occurs, testing halts. The programmer then corrects the condition that caused the failure, and the program is recompiled for further testing. This is time-domain sequential software testing. In this archaic manner of test execution, the use of EVM is precluded because testing cannot be planned discretely or managed efficiently.

A common archaic approach to testing software is the "shotgun" or "testing-to-death" approach, in which every conceivable functional procedure is performed on a pass-or-fail basis, in no particular order. Testing might begin with a random module, without consideration of sequence. The case presented in this chapter is an example of shotgun testing. A seemingly beneficial aspect of shotgun testing is full coverage of functional scenarios. Unfortunately, it is exceedingly expensive and redundant. In addition, a project manager can never be sure that all functionality is tested, no matter how long the testing lasts. Shotgun approaches also do not account for the validity of the end product. There is not a high assurance of testing success in these practices.

4.2.4 New View of Testing

In test planning, software units and objects can be viewed as pass-or-fail trees and branches and can be predicted and mapped. When design products are planned, parallel testing products are also planned. Test cases are one example of a design-phase product; during the testing phase, individual test case reports are examples of products. The planning and mapping exercise may entail much effort, but it will reveal the redundancies and the statistical likelihood of branching. The fact that test planning and analysis are performed while the code is being constructed does not add extra calendar time to the project. Earned value management requires this.

Another way to incorporate testing into everyday operations efficiently is through a mixed testing strategy [6], as described in Section 4.1. This strategy allows a manager to increase testing accuracy and efficiency while keeping costs down. The mixed nature implies that the empirical rules can be applied in varying detail, depending on the significance of the tests. One begins, for example, with a functional testing strategy (least sophisticated) and moves to a more discriminatory testing strategy (more sophisticated): hence a mixed testing strategy. In practice, testers switch strategies when the testing yield saturates. They must determine the right time to abandon the current technique and switch to a new one,
as well as how to sequence testing techniques efficiently. There are no hard-and-fast rules.

Empirical Bayesian Stopping Rule  The empirical Bayesian stopping rule (EBSR), or MESAT-1, uses the mathematical principles of the Poisson counting process applied to the number of test cases, with a logarithmic-series distribution (LSD) applied to the clump size of faults or coverage for each test case. It applies well to time-domain sequential software testing as well as to effort-based testing, such as the case presented here [6,7,36]. The project manager should set up the test plan with this testing method in mind. A thorough case history of similar projects and programs should be used to arrange the test cases logically to fit the model. The testing employs a convergence factor that can be set as high as necessary. The engineer derives the convergence factor from how well the cases are organized and from how similar the case history is to the current project. This factor is a function of cost constraints [6,7].

The phenomenon of clustered test case failures is observed in software testing practice. Programmers often call it the domino effect. Effectively, a series of failures can often be attributed to cause and effect. If the distribution of the total number of clumped failures fits the compound Poisson behavioral model, the empirical Bayesian stopping rule can be derived by updating the prior parameters as the field data are collected. Based on the case histories, the manager sets an economic criterion d to signify the convergence level desired to establish that a sufficient level of testing has occurred. If, for the ith unit interval beginning at time t or for test group i, the expected cost of stopping is greater than or equal to the expected cost of continuing, it is economical to continue testing for the next group of test inputs. This convergence threshold can be represented as

d = c/(a − b)        (37)

where d signifies the ratio of the cost (c) of performing a test to the cost of catching a failure after release of the product (a) minus the cost of catching a failure before release of the product (b). On the other hand, if the expected cost of stopping is less than the expected cost of continuing, it is more economical to stop testing; the criterion for continuing is aE(Xi+1) > bE(Xi) + c. If we were to stop at interval or test group i, we assume that the cost of coverage items as yet uncovered is a per coverage item. Thus, there is an expected cost over the interval {i, i + 1} of aE{Xi}. If we were to continue testing over the interval, we assume that there is a fixed cost of c for testing, a variable cost of b related to the elements covered, and a variable cost of a related to the uncovered elements discovered after testing. Note that a is usually larger than b. As studied in Section 4.1, the one-step-ahead formula in its simplest form can be rearranged as [6]

e(x) = ki+1(α + Xi+1) − ki(α + Xi)
where α and β are prior parameters for LSD(θ) in the empirical Bayesian analysis, where 0 < θ < 1, and k is a constant to be computed at each step. The λ used in Tables 4C.4 and 4.4 ahead can be described as the ratio of test cases with any activity (nonzero error count) to the number of test cases experienced in the past.

New Test Planning Concept  The software project manager and test manager plan the test units and establish a cost c for each unit test. Then they establish the cost b of catching and correcting an error prior to release versus the cost a of catching and correcting an error after release. Then they set their threshold, d. After these variables are input, the tests are run and failures counted. With this arrangement, unit testing can be ordered so that the most important or most questionable units are in front and tested first. This ordering is important to the success of managing testing scientifically.

4.2.5 Case Study

An Air Force munitions tracking system recently underwent a hardware/operating system re-platforming effort. During the unit and integration testing (UIT) phase, the developers tested 31 modules to validate the functionality on the new platform. The 31 test cases were arranged in order of functionality and were not correlated within the thresholds of cause and effect. It was a brute-force, shotgun approach, which required four full-time programmers to perform a range of tests across the 31 modules [7].

Quantified Account of Testing Effort  Throughout the UIT phase, errors occurred within certain units and were fixed before testing continued; those were identified with cost factor b. Following the testing effort, errors were found that were quarantined, fixed, and recycled through extra testing; these were identified with cost factor a. The average cost c of testing a module initially equals two cost units (CU) each, and there were a total of 31 units. The average cost b of catching and fixing an error before release was 10 CU each, and 10 such errors were corrected by exhaustive brute-force testing. The costs so far incurred can be summarized as

31c = 2 CU × 31 modules = 62 CU        (39)
10b = 10 CU × 10 errors = 100 CU        (40)
The total cost of the conventional unit testing without applying a stopping rule was 162 CU. See Figures 4.1 to 4.7 for the entire case study.

Application of the EBSR Algorithm  To calculate the potential savings, the software units or test cases are ordered with respect to their interdependencies, and the more critical units are placed first. This cause-effect relationship ensures that the ordering better fits the requirements for use of the empirical Bayesian stopping rule, MESAT-1. After this ordering, the actual error counts encountered during real testing are added (Table 4.3).
TABLE 4.3 Modular Test Strategy 1 Results, Ordered

Module  Number of Errors  Cumulative Errors    Module  Number of Errors  Cumulative Errors
1       1                 1                    17      0                 9
2       3                 4                    18      0                 9
3       1                 5                    19      0                 9
4       1                 6                    20      0                 9
5       0                 6                    21      0                 9
6       0                 6                    22      0                 9
7       1                 7                    23      0                 9
8       1                 7                    24      0                 9
9       0                 8                    25      0                 9
10      0                 8                    26      0                 9
11      1                 9                    27      0                 9
12      0                 9                    28      0                 9
13      0                 9                    29      0                 9
14      0                 9                    30      1                 10
15      0                 9                    31      0                 10
16      0                 9

FIGURE 4.1 Goodness-of-fit test for the business example.
TABLE 4.4 Analysis and Decision Table for Table 4.3's Strategy 1 Testing

Module  λ    k        x  X = Cumulative x  E(X)  e(X)  Coverage (%)  Decision
1       1.0  0.67922  1  1                 3.64  N/A   10            Continue
2       1.0  0.57711  3  4                 4.39  0.75  40            Continue
3       1.0  0.55345  1  5                 4.63  0.24  50            Continue
4       1.0  0.53307  1  6                 4.87  0.24  60            Continue
5       0.8  0.41763  0  6                 4.12  0.00  60            Stop
It is obvious from the clustering of errors that they correlate to a hypothetical ordering. The EBSR one-step-ahead equation indicates that testing should have stopped after the fifth unit was tested, without cost considerations, due to (27) only. The stopping rule could thus have guided the project manager while the UIT phase was still ongoing. Next, applying a mixed testing strategy, and therefore using the next strategy, which is more sensitive than the initial (functional) testing strategy, the EBSR could be applied to the remainder of the test cases. Beginning with unit 8 (now the first test case of the second testing strategy), the code tells us that testing should stop after unit 11 (now the fourth test case of the second testing strategy) was tested. That would leave a single unit in error that would be caught after the testing effort was concluded (Table 4.5).
FIGURE 4.2 Using the 50% minimal coverage criterion when the a cost is not known.
TABLE 4.5 Modular Test Strategy 2 Results, Ordered

Module  Number of Errors  Cumulative Errors    Module  Number of Errors  Cumulative Errors
1       0                 1                    14      0                 3
2       1                 1                    15      0                 3
3       1                 2                    16      0                 3
4       0                 2                    17      0                 3
5       0                 2                    18      0                 3
6       1                 3                    19      0                 3
7       0                 3                    20      0                 3
8       0                 3                    21      0                 3
9       0                 3                    22      0                 3
10      0                 3                    23      0                 3
11      0                 3                    24      0                 3
12      0                 3                    25      1                 4
13      0                 3                    26      0                 4

TABLE 4.6 Analysis and Decision Table for Table 4.5's Strategy 2 Testing

Module  λ      k       x  X = Cumulative x  E(X)  e(X)  Coverage (%)  Decision
1       0.0    0.0     0  0                 0.00  N/A   0             Continue
2       0.5    0.3151  1  1                 2.16  2.16  25            Continue
3       0.667  0.4065  1  2                 2.89  0.74  50            Continue
4       0.5    0.2981  0  2                 2.29  0.00  50            Stop
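The Decision columns of Tables 4.4 and 4.6 follow mechanically from the one-step-ahead values: testing continues while e(X) exceeds the chosen criterion, which is effectively zero here because costs are not yet considered. A minimal Python sketch of that scan (the function name and the list literal are ours):

    def first_stop(e_values, d=0.0):
        # Return the 1-based index of the first test group whose one-step-ahead
        # expected coverage e(X) has fallen to the criterion d; None marks the
        # undefined first entry (N/A in the tables).
        for i, e in enumerate(e_values, start=1):
            if e is not None and e <= d:
                return i
        return None

    print(first_stop([None, 2.16, 0.74, 0.00]))  # 4: stop at the fourth module, as in Table 4.6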
Cost–Benefit Analysis via the EBSR (MESAT-1)  A total of nine units or modules (test cases), 5 + 4 = 9 in two sequential testing strategies, would have been tested during the UIT (9c = 18 CU). During this testing activity, eight of the 10 errors (6 + 2 = 8) would have been detected and corrected (8b = 80 CU). Summing the two costs, a total cost of 98 CU would have been incurred by applying the EBSR, as opposed to 162 CU without it. The two remaining errors would have been detected after testing and corrected after the fact; the value of a, the cost of correcting an after-release error, must therefore be bounded for the EBSR to be judged efficient. The simple savings is 31 − 9 = 22 units or modules, more than a 70% savings, at the price of only two additional undiscovered errors. Considering that these additional errors would be caught during posttesting, an upper bound on a can be solved for that renders the EBSR algorithm economical and efficient. For this argument to hold, the following must be true:

98 CU + 2a < 162 CU        (41)
a < (0.5)(162 CU − 98 CU) = 32 CU        (42)

This implies that it makes economic sense to apply the stopping rule when the cost of catching an error post facto is less than a certain upper bound.
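That upper bound is simply the break-even value of a implied by inequality (41); a two-line check in Python (variable names ours):

    ebsr_cost, exhaustive_cost, errors_missed = 98, 162, 2   # cost units (CU)
    a_max = (exhaustive_cost - ebsr_cost) / errors_missed    # (41) solved for a
    print(a_max)  # 32.0 CU: the EBSR pays off whenever a postrelease fix costs less than this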
Week  Lambda   k        Exact x  Cum. x  E(x)   e(x)   %
1     0.0      0.0      0        0       0.0    N/A    0.0
2     0.5      0.31512  1        1       2.157  2.157  25.0
3     0.66667  0.40649  1        2       2.89   0.734  50.0
4     0.5      0.29809  0        2       2.296  0.00   50.0
Stop at X(4) = 2.0 Coverage = 50.0 %
Cost Analysis:
• Cost of correcting all 4 errors by exhaustive testing would have been $40.00. Cost of correcting 2 prerelease errors using MESAT is $20.00. Savings for not correcting the remaining 2 by using MESAT: $20.00.
• Cost of executing all 26 test cases by exhaustive testing would have been $52.00. Cost of executing 4 test cases by using MESAT is $8.00. Savings for not executing the remaining (26 − 4) = 22 test cases: $44.00.
• Total savings by using MESAT: $20.00 + $44.00 = $64.00. Total number of errors left undiscovered: 2. MESAT is cost-effective if the post-release error cost is less than ($64.00 divided by 2) = $32.00.

Strategy: 1  Stop at X(5) = 6.0  Coverage = 60.0%  Total Coverage = 60.0%  Total Covered = 6
Strategy: 2  Stop at X(4) = 2.0  Coverage = 50.0%  Total Coverage = 80.0%  Total Covered = 8
Strategy 1 Cost Analysis Summary: MESAT is cost-effective if the post-release error cost is less than ($92.00 divided by 4) = $23.00
Strategy 2 Cost Analysis Summary: MESAT is cost-effective if the post-release error cost is less than ($64.00 divided by 2) = $32.00

FIGURE 4.3 Mixed-strategy results for the business example when the a cost is not given.
FIGURE 4.4 Stopping rules (y axis: errors versus the x axis: test cases) when the a cost is not given: strategy 1 stops at 6.0 (60.0% coverage) and strategy 2 at 8.0 (80.0% coverage).
FIGURE 4.5 Stopping rule when all costs are given. The gain is +$4.0.
In this case, the upper bound is a = 32 CU. That is, in this example, for the EBSR to be effective, the cost of correcting an error after release cannot be more than 3.2 times the cost of correcting it before release of the same software (recall that b = 10 CU in this example).

4.2.6 Discussion and Conclusions

Migrating to new testing methods requires an initial investment of research and analysis by the software engineering agencies.
Week  Lambda   k        Exact x  Cum. x  E(x)   e(x)   %     Expense  Gain
1     0.0      0.0      0        0       0.0    N/A    0.0   $100.0   $2.0
2     0.5      0.31512  1        1       2.157  2.157  50.0  $14.0    $20.0
3     0.33333  0.20505  0        1       1.531  0.00   50.0  $16.0    $18.0

Stop at X(3) = 1.0 Coverage = 50.0 %

Cost Analysis:
• Cost of correcting all 2 errors by exhaustive testing would have been $20.00. Cost of correcting 1 prerelease error using MESAT is $10.00. Savings for not correcting the remaining 1 by using MESAT: $10.00.
• Cost of executing all 22 test cases by exhaustive testing would have been $44.00. Cost of executing 3 test cases by using MESAT is $6.00. Savings for not executing the remaining (22 − 3) = 19 test cases: $38.00.
• Results of using MESAT: the $10.00 saved for not correcting the remaining 1 error, plus the $38.00 saved for not executing the remaining 19 test cases, equals a total savings of $48.00; minus the $30.00 post-release cost of correcting the 1 error not covered (1 × $30.00), the total savings for using MESAT is $18.00.

Strategy: 1  Stop at X(9) = 8.0  Coverage = 80.0%  Total Coverage = 80.0%  Total Covered = 8
Strategy: 2  Stop at X(3) = 1.0  Coverage = 50.0%  Total Coverage = 90.0%  Total Covered = 9
Strategy 1 Cost Analysis Summary: Total savings for using MESAT-1 is $4.00
Strategy 2 Cost Analysis Summary: Total savings for using MESAT-1 is $18.00

FIGURE 4.6 Mixed-strategy results for the business example when the a cost is given.
Engineers must adhere to strict process and technical standards so that they can plan measurable execution criteria. It is a high yet manageable goal, and it requires EVM. The migration to this environment is a return on investment (ROI) that will save modest to enormous amounts of money in the end, not to mention potentially increasing the quality of the product [67].
FIGURE 4.7 Multistrategy stopping rules (y axis: errors versus the x axis: test cases) when the a cost is given for DR6: strategy 1 stops at 8.0 (80.0% coverage) and strategy 2 at 9.0 (90.0% coverage).
This holds true provided that unreasonably high and exaggerated cost factors (a) are not attributed to those errors post facto [6,7]. The commercial software industries must migrate to more scientific and quantitative methods of software testing. In fact, the U.S. Department of Defense is already mandated to follow principles of EVM [7,65]. Unfortunately, the requirements for EVM are not normally met. The efficiencies that could be introduced would be phenomenal, and they would take software engineering one more step closer to a true engineering discipline. Cost-effectiveness is a prime concern in this work [6]. In related work, a novel approach to space–time compaction in self-testing VLSI circuits in the event of nonexhaustive cost-effective test sets has been published in various stages [12–18]. As digital designs move to ever higher integration densities, it is desirable to create better and more effective methods of testing to ensure reliable system operation. Others have published on the use of sequential statistical analysis in solving the relation between random test length and good/bad circuit (or chip) ratios [66]. The sequential probability ratio test (SPRT) derived by Wald is described in most mathematical statistics textbooks (e.g., [46, Chap. 14]); however, the SPRT method is limited in its applications to binomial, normal, and a few other theoretically developed densities.

4.3 BAYESIAN STOPPING RULE FOR TESTING IN THE TIME DOMAIN

Nutshell 4.3  MESAT-2 is a cost-efficient optimal stopping-rule algorithm for the Poisson compounded with geometric distribution, Poisson∧geometric. It is developed and applied to the problem of sequential testing of computer software. At each checkpoint in time, either the software satisfies a desired economic criterion or testing is continued.

4.3.1 Introduction

There are many examples in which events occur according to a Poisson distribution and, furthermore, for each of these Poisson events, one or more other events can occur. For example, automobile accidents on a given highway might follow a Poisson distribution, but the number of injuries follows a compound
Poisson distribution [36]. In this chapter the application is software testing. If an interruption that occurs during testing of a software program is due to one or more software failures in a clump, and if the distribution of the number of interruptions is Poisson, the distribution of the number of clumped failures is compound Poisson [27].

When a new computer software program is written and all obvious software faults have been removed, a testing program is usually initiated to eliminate the remaining faults. The common procedure is to use the software package on a set of problems, and whenever the testing is interrupted because of one or more programming failures, the faults are corrected, the software is recompiled, and computation is restarted. This testing can continue for several days or weeks, with the number of failures per unit time becoming fewer and fewer. Finally, a point is reached where it seems that all the software faults surely have been removed, at which time the software can be released to the end user. However, one can never be completely certain that all software faults have been found. Most likely a very small number of faults will remain in the software, but the chances of finding them in a reasonable time may be so small that it is not economically feasible to continue testing. The point at which testing is stopped will be the break-even point for the cost of removing faults in the field versus the cost of continued testing. The objective is to find the optimal time to stop testing. Optimal stopping rule theory has been studied extensively, and good presentations have been published [38,68–71,76]. In this section we present optimal stopping rules using the Poisson∧geometric in time-domain software testing.

4.3.2 Review of the Compound Poisson Process

Let Y(t) be the random variable of the number of interruptions, and let X(t) be the random variable of the number of failures that occur up to time t. Then, following Sherbrooke [24], the Poisson with parameter λ compounded with a geometric ρ can be written as

P(X(t) = 0) = exp(−λt)        (43)

P(X(t) = x) = Σ (y = 1 to x) [(λt)^y exp(−λt)/y!] C(x − 1, y − 1) ρ^(x−y) (1 − ρ)^y,   x = 1, 2, . . . ; λ > 0; 0 < ρ < 1        (44)

where C(n, k) = n!/[k!(n − k)!] is the binomial coefficient. Using moment-generating functions, as shown in Table 1.5,

E(X) = λ/(1 − ρ)        (45)
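For concreteness, (43)–(45) can be evaluated numerically. The Python sketch below (the function name is ours) sums the compound probability over the number of interruptions y, reflecting our reading of (44) in which x failures arrive in y ≤ x clumps, and confirms the mean (45) empirically:

    import math

    def pg_pmf(x, lam_t, rho):
        # P(X(t) = x) for the Poisson^geometric model; lam_t is the Poisson
        # mean lambda*t over the window considered.
        if x == 0:
            return math.exp(-lam_t)
        return sum(
            lam_t ** y * math.exp(-lam_t) / math.factorial(y)  # y interruptions
            * math.comb(x - 1, y - 1)                          # ways to split x failures into y clumps
            * rho ** (x - y) * (1 - rho) ** y                  # geometric clump sizes
            for y in range(1, x + 1)
        )

    lam_t, rho = 1.5, 0.2
    mean = sum(x * pg_pmf(x, lam_t, rho) for x in range(200))
    print(round(mean, 6), lam_t / (1 - rho))  # both 1.875, agreeing with (45)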
4.3.3 Stopping Rule

Expression (45) for the expected value of X leads to a rule for determining when to stop software testing. Suppose that we are at time t. It is evident that whenever there is an interruption in the program during the testing interval [0, t), we remove all faults observed. As a result, the values for λ and ρ should be decreasing over time, since there should be fewer and fewer faults in the program. The gradual reduction of ρ is in line with the results of others, such as Musa and Okumoto [72], who used an exponential function to reduce λ over time, and Becker et al. [73], who reduced λ by a fixed amount over time.

Let Xt be the random variable of the number of failures that occur in [t, t + 1), the unit time interval starting at time t, and let λt and ρt be the values of the parameters at time t. Then the expected number of failures occurring during this unit time interval is

E(Xt) = λt/(1 − ρt)        (46)

If we were to stop at time t, we would assume that the faults that caused these failures would have to be fixed in the field at a cost of a per fault. Thus, there is an expected cost over the interval [t, t + 1) of aE(Xt) for stopping at time t. On the other hand, if we continue testing over the interval, we assume that there is a fixed cost c for testing and a variable cost b of fixing each fault found during testing. Note that a is larger than b, since it should be considerably more expensive to fix a fault in the field than to observe and fix it while testing. Thus, the cost expected for the continuation of testing for the next time interval is bE(Xt) + c. This cost model is similar to, but simpler than, that of Dallal and Mallows [48]. If for the unit interval beginning at time t the expected cost of stopping is greater than the expected cost of continuing, that is, if

aE(Xt) > bE(Xt) + c        (47)

it is economical to continue testing through the interval. On the other hand, if the expected cost of stopping is less than the expected cost of continuing, that is, if

aE(Xt) < bE(Xt) + c        (48)

it is more economical to stop. If we let d = c/(a − b), then relation (47) evolves as follows: if

E(Xt) = λt/(1 − ρt) > d        (49)

we would continue testing, and if

E(Xt) = λt/(1 − ρt) ≤ d        (50)

we would stop testing. With λt and ρt both being decreasing functions of time,

E(Xt) ≥ E(Xt+1)        (51)
so that if we should have stopped at time t but did not, we should certainly stop at time t + 1. This rule for stopping seems reasonable. It says essentially that if the number of faults that we can expect to find in the software per unit time is sufficiently small, we stop testing and release the software package to the end user. If the number of faults expected is large, we continue testing. This stopping rule depends on an up-to-date expression for the compound Poisson distribution (i.e., we need accurate estimates of λt and ρt). However, such estimates depend on the history of the testing, which implies the use of empirical Bayes decision procedures [74,75].

4.3.4 Bayes Analysis for the Poisson∧Geometric Model

We begin with the conjugate prior density function of the initial values of λ and ρ. The conjugate prior density function for the Poisson probability function is the gamma p.d.f., and the prior density for the geometric is the beta p.d.f. Thus, the initial prior joint density for λ and ρ is given by

f(λ, ρ) = [β(λβ)^(α−1) e^(−λβ)/Γ(α)] · [Γ(μ + ν)/(Γ(μ)Γ(ν))] ρ^(μ−1) (1 − ρ)^(ν−1)        (52)

where α > 0, β > 0, μ > 0, and ν > 0 are the parameters of the initial prior density function. Let X and Y be the random variables of the number of failures and interruptions, respectively, that will occur during the first unit time interval. The joint probability function of X and Y, given λ and ρ, is the Poisson∧geometric:

p(x, y | λ, ρ) = [λ^y exp(−λ)/y!] C(x − 1, y − 1) ρ^(x−y) (1 − ρ)^y        (53)

so that the joint distribution of X, Y, λ, and ρ after observing the process for one unit time period (i.e., when t = 1) is the product

g(x, y, λ, ρ) = [β^α λ^(α+y−1) exp[−λ(β + 1)]/(Γ(α) y!)] C(x − 1, y − 1) [Γ(μ + ν)/(Γ(μ)Γ(ν))] ρ^(x−y+μ−1) (1 − ρ)^(y+ν−1)        (54)

The marginal probability function of X and Y is then

p(x, y) = ∫₀^∞ ∫₀^1 g(x, y, λ, ρ) dρ dλ = [Γ(α + y) β^α/(Γ(α)(β + 1)^(α+y))] C(x − 1, y − 1) [Γ(μ + ν) Γ(μ + x − y) Γ(ν + y)]/[Γ(μ) Γ(ν) Γ(μ + ν + x)]        (55)

Therefore, the posterior joint density function of λ and ρ at time t = 1 is

f(λ, ρ | x, y) = {(β + 1)[(β + 1)λ]^(α+y−1) exp[−λ(β + 1)]/Γ(α + y)} · {Γ(μ + ν + x)/[Γ(μ + x − y) Γ(ν + y)]} ρ^(μ+x−y−1) (1 − ρ)^(ν+y−1)        (56)

This is the product of a gamma density with parameters α + y and β + 1 and a beta density with parameters μ + x − y and ν + y. It is well known that the posterior expectation minimizes the mean quadratic loss function [46] and thus can be used for the Bayes estimators of the parameters λ and ρ. We substitute these posterior means into the expression for E(Xt), getting a Bayes estimate for the number of failures in the next unit time period, t = 1:

E(X) = λ̂/(1 − ρ̂) = (α + y)(μ + ν + x)/[(β + 1)(ν + y)]        (57)

Now suppose that the process is at time t (i.e., the process has been observed for t time periods), with a total of xt failures over yt interruptions, where xt ≥ yt. The posterior estimates of λ and ρ at time t will be

λ̂t = αt/βt = (α0 + yt)/(β0 + t)        (58)
ρ̂t = μt/(μt + νt) = (μ0 + xt − yt)/(μ0 + ν0 + xt)        (59)

where the zero subscript denotes initial values at time 0. If the unit time period is sufficiently short, such as 1 second, there will be many time periods with no interruptions, so that yt will be less than t, making λt a decreasing function of t. Substituting the values of λt and ρt above into the formula for E(Xt) gives

E(Xt) = λt/(1 − ρt) = (α0 + yt)(μ0 + ν0 + xt)/[(β0 + t)(ν0 + yt)] = αt(μt + νt)/(βt νt)        (60)

Since the Poisson∧geometric is memoryless, the posterior expected number of failures at time t + 1 [i.e., E(Xt+1)] can be found as follows. If we denote the number of failures during the interval [t, t + 1) as x and the number of interruptions as y, then

E(Xt+1) = λt+1/(1 − ρt+1) = (αt + y)(μt + νt + x)/[(βt + 1)(νt + y)]        (61)
There are two cases to consider.

Case 1: x = y = 0. No interruptions, and thus no failures, are observed during the interval [t, t + 1). It is obvious that E(Xt+1) < E(Xt), since the denominator increases but the numerator remains constant. Essentially, the data are saying that because we have observed no failures during the interval, we should expect to see fewer failures in the near future.

Case 2: x ≥ y ≥ 1. There is at least one interruption during the interval [t, t + 1), with one or more failures. When an interruption occurs at time t, E(Xt+1) > E(Xt).

4.3.5 Empirical Bayesian Stopping Rule

Since the compound Poisson is memoryless, we can assume that the process will be the same starting at time t as starting at time 0, but with different parameters. That is, we need only observe the process from one time period to the next. Let xt be the cumulative number of failures to time t, and yt the cumulative number of interruptions. Thus, the expected value function is

e(t; xt, yt) = (αt + yt)(μt + νt + xt)/[(βt + t)(νt + yt)]        (62)

The empirical Bayesian stopping rule is then:

If e(t; xt, yt) = (αt + yt)(μt + νt + xt)/[(βt + t)(νt + yt)] > d, continue testing        (63a)

If e(t; xt, yt) = (αt + yt)(μt + νt + xt)/[(βt + t)(νt + yt)] ≤ d, stop testing        (63b)

and release the software. Stopping will occur at a time t greater than t′, where t′ is given by

t′ = (αt + yt)(μt + νt + xt)/[(νt + yt)d] − β        (64)

As software faults are removed, e(t; xt, yt) will approach zero, so testing will eventually stop.

4.3.6 Computational Example

The stopping rule was applied to a data set given by Musa et al. [76]. The data are listed in Tables 4D.1 and 4D.3. The failure times, which are cumulative and in seconds, indicate when failures occur. Note that failures sometimes occur in clumps, as for example at time t = 5089 seconds (the values in italic indicate clumped observations). This clumping is fairly typical in the testing of software, indicating the need for the compound Poisson. Table 4D.1 shows the T1 data
set of Musa et al. [76]. For illustrative purposes it was assumed that c = $0.01, a = $6.00, and b = $1.00; that is, the fixed cost of testing is $0.01 per second, the cost of fixing each fault while testing is $1.00, and the cost of fixing each fault in the field is $6.00. Hence, d = c/(a − b) = 0.002. Note that the values of a, b, and c are important, but it is the ratio d = c/(a − b) that is used. In this example, using the data set T1, the initial estimate of ρ was taken to be 0.02, based on the data, since at the end of the testing time ρ̂ = 1 − yt/xt = 1 − 133/136 ≈ 0.02. It was assumed that the confidence is equivalent to a sample size of 100, so that μ + ν = 100; the initial values are thus μ = (0.02)(100) = 2 and ν = 100 − 2 = 98, and μt and νt will vary with each varying pair of yt and xt. Similarly, it was estimated from the data set T1 that the mean time to failure (MTTF) would be 666 seconds, giving an initial λ = MTTF⁻¹ estimate of 0.0015. Assuming an initial gamma prior shape parameter α = 0.5 and since λ = α/β,

β = αλ⁻¹ = α · MTTF = (0.5)(666) = 333        (65)

is the initial value. At each time t, αt and βt will vary. As in Appendix 4D, the first time that

e(t; xt, yt) = (αt + yt)(μt + νt + xt)/[(βt + t)(νt + yt)] ≤ d        (66)

[where d = c/(a − b) = 0.01/(6 − 1) = 0.002, and αt > 0, βt > 0, μt > 0, and νt > 0 all vary throughout the process], testing is stopped and the software is released to the customer; this occurs at t = 60,852. We used equation (64), where e(t′; xt, yt) = 0.0019 < 0.002. This indicates that testing should be stopped sometime between t = 57,042 and t = 62,551 seconds.

4.3.7 Discussion and Conclusions

As a final remark, the contribution of this section is an empirical Bayesian approach to determining an economic stopping rule for a compound Poisson process; it is a follow-up to previous research by Sahinoglu [6,27] on the Poisson∧geometric as applied to software reliability modeling. Here, however, a stopping rule is developed for software testing in an environment where the software program is interrupted by one or more software failures in a clump at least once during the testing activity. The computational example illustrates that the rule proposed is practical and valid for software failure data with clumped failures, such as the data sets in Tables 4D.1 and 4D.3. More data with clumped software failures may be found in the literature [76] and on the book's CD-ROM.

APPENDIX 4D: MESAT-2 APPLICATIONS AND RESULTS

To use the applications and data files, click on "Stopping Rule" in TWC-Solver on the CD-ROM.
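For readers without the CD-ROM, the core computation is easy to reproduce. The Python sketch below (names ours) evaluates e(t; xt, yt) of equation (62) and the projected stopping time t′ of equation (64) with the T1 initial values quoted below, reproducing the first rows of Table 4D.2 and, up to rounding, the calculated stopping time of 60,852 seconds:

    ALPHA0, BETA0 = 0.5, 333.4    # gamma prior: shape 0.5, beta = 0.5 * MTTF
    MU0, NU0 = 2.2059, 97.7941    # beta prior for rho (mu + nu = 100)
    D = 0.002                     # economic criterion d = c/(a - b)

    def e_value(t, x, y):
        # Expected failures in the next unit interval, equation (62).
        return (ALPHA0 + y) * (MU0 + NU0 + x) / ((BETA0 + t) * (NU0 + y))

    def projected_stop(x, y):
        # Projected stopping time t' of equation (64).
        return (ALPHA0 + y) * (MU0 + NU0 + x) / ((NU0 + y) * D) - BETA0

    print(round(e_value(3, 1, 1), 6))       # 0.004559, the first row of Table 4D.2
    print(round(e_value(33, 2, 2), 6))      # 0.006974, the second row
    print(round(projected_stop(122, 119)))  # about 60851; the text reports 60,852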
TABLE 4D.1 Data Set T1

n   t      Cumulative     n   t      Cumulative     n    t      Cumulative
1   3      3              47  6      7,843          92   2,323  32,408
2   30     33             48  79     7,922          93   2,930  35,338
3   113    146            49  816    8,738          94   1,461  36,799
4   81     227            50  1,351  10,089         95   843    37,642
5   115    342            51  148    10,237         96   12     37,654
6   9      351            52  21     10,258         97   261    37,915
7   2      353            53  233    10,491         98   1,800  39,715
8   91     444            54  134    10,625         99   865    40,580
9   112    556            55  357    10,982         100  1,435  42,015
10  15     571            56  193    11,175         101  30     42,045
11  138    709            57  236    11,411         102  143    42,188
12  50     759            58  31     11,442         103  108    42,296
13  77     836            59  369    11,811         104  0      42,296
14  24     860            60  748    12,559         105  3,110  45,406
15  108    968            61  0      12,559         106  1,247  46,653
16  88     1,056          62  232    12,791         107  943    47,596
17  670    1,726          63  330    13,121         108  700    48,296
18  120    1,846          64  365    13,486         109  875    49,171
19  26     1,872          65  1,222  14,708         110  245    49,416
20  114    1,986          66  543    15,251         111  729    50,145
21  325    2,311          67  10     15,261         112  1,897  52,042
22  55     2,366          68  16     15,277         113  447    52,489
23  242    2,608          69  529    15,806         114  386    52,875
24  68     2,676          70  379    16,185         115  446    53,321
25  422    3,098          71  44     16,229         116  122    53,443
26  180    3,278          72  129    16,358         117  990    54,433
27  10     3,288          73  810    17,168         118  948    55,381
28  1,146  4,434          74  290    17,458         119  1,082  56,463
29  600    5,034          75  300    17,758         120  22     56,485
30  15     5,049          76  529    18,287         121  75     56,560
31  36     5,085          77  281    18,568         122  482    57,042
32  4      5,089          78  160    18,728         123  5,509  62,551
33  0      5,089          79  828    19,556         124  100    62,651
34  8      5,097          80  1,011  20,567         125  10     62,661
35  227    5,324          81  445    21,012         126  1,071  63,732
36  65     5,389          82  296    21,308         127  371    64,103
37  176    5,565          83  1,755  23,063         128  790    64,893
38  58     5,623          84  1,064  24,127         129  6,150  71,043
39  457    6,080          85  1,783  25,910         130  3,321  74,364
40  300    6,380          86  860    26,770         131  1,045  75,409
41  97     6,477          87  983    27,753         132  648    76,057
42  263    6,740          88  707    28,460         133  5,485  81,542
43  452    7,192          89  33     28,493         134  1,160  82,702
44  255    7,447          90  868    29,361         135  1,864  84,566
45  197    7,644          91  724    30,085         136  4,116  88,682
46  193    7,837
Goodness-of-Fit Analysis
Chi-square sum = 24.8763; right-tailed area (p-value) = 0.0056; α (type I error probability) = 0.005. Since the p-value 0.0056 > 0.005, do not reject H0: good fit.

Compound Poisson∧Geometric Stopping-Rule Analysis
α (gamma prior p.d.f. shape) = 0.5; d (stopping criterion) = 0.002; number of faults = 136; number of occurrences = 133; MTTF = 666.78. The initial values are λ = 0.0015, ρ = 0.022, β = 333.4, μ = 2.2059, and ν = 97.7941.
TABLE 4D.2 Analysis Results of Table 4D.1

x    y    t      Cumulative t  e
1    1    3      3             0.004559
2    2    30     33            0.006974
3    3    113    146           0.007461
...
100  98   1,435  42,015        0.002376
101  99   30     42,045        0.002398
102  100  143    42,188        0.002414
103  101  108    42,296        0.002431
104  101  0      42,296        0.002443
105  102  3,110  45,406        0.002299
...
119  116  1,082  56,463        0.002101
120  117  22     56,485        0.002118
121  118  75     56,560        0.002133
122  119  482    57,042        0.002143
123  120  5,509  62,551        0.001953
Calculated stopping time = 60,852 seconds; e = 0.002; stop after fault 122; fault coverage = 89.71%; time coverage = 68.62%.

Cost Analysis
Cost per corrected error (postrelease), a = $6.00; cost per corrected error (prerelease), b = $1.00; cost per test, c = $0.01. Total faults (tf) = 136; total cycles (time) (tt) = 88,682 seconds; stop fault (sf) = 122; stop cycle (st) = 57,042 seconds; remaining faults (rf) = 14; remaining cycles (time) (rt) = 31,640 seconds; cost of correcting all faults (exhaustive): $1022.82.
Is a(rf) < b(rf) + c(rt)? Is (6)(14) < (1)(14) + (0.01)(31,640)? Is 84.00 < 330.40? Yes. Savings using stopping rule: $246.40.
End of analysis
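The closing inequality of the cost analysis can be verified in a few lines of Python (variable names ours):

    a, b, c = 6.00, 1.00, 0.01
    rf, rt = 14, 31640              # remaining faults and remaining seconds for T1
    print(a * rf, b * rf + c * rt)  # 84.0 versus 330.4, so stopping was economical
    print(b * rf + c * rt - a * rf) # 246.4: the savings reported above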
TABLE 4D.3 Data Set T3

n   t      Cumulative     n   t       Cumulative
1   115    115            20  589     6,147
2   0      115            21  15      6,162
3   83     198            22  390     6,552
4   178    376            23  1,863   8,415
5   194    570            24  1,337   9,752
6   136    706            25  4,508   14,260
7   1,077  1,783          26  834     15,094
8   15     1,798          27  3,400   18,494
9   15     1,813          28  6       18,500
10  92     1,905          29  4,561   23,061
11  50     1,955          30  3,186   26,247
12  71     2,026          31  10,571  36,818
13  606    2,632          32  563     37,381
14  1,189  3,821          33  2,770   40,151
15  40     3,861          34  625     40,776
16  788    4,649          35  5,593   46,369
17  222    4,871          36  11,696  58,065
18  72     4,943          37  6,724   64,789
19  615    5,558          38  2,546   67,335
Goodness-of-Fit Analysis
Chi-square sum = 16.0925; right-tailed area (p-value) = 0.097; α (type I error probability) = 0.005. Since the p-value 0.097 > 0.005, do not reject H0: good fit.

Compound Poisson∧Geometric Stopping-Rule Analysis
α (gamma prior p.d.f. shape) = 0.5; d (stopping criterion) = 0.002; number of faults = 38; number of occurrences = 37; MTTF = 1819.86. The initial values are λ = 0.00055, ρ = 0.026, β = 909.9, μ = 2.63, and ν = 97.37.
TABLE 4D.4 Analysis Results of Table 4D.3

x   y   t      Cumulative t  e
1   1   115    115           0.001503
2   1   0      115           0.001518
3   2   83     198           0.002339
4   3   178    376           0.002820
5   4   194    570           0.003150
6   5   136    706           0.003524
7   6   1,077  1,783         0.002499
8   7   15     1,798         0.002866
9   8   15     1,813         0.003229
10  9   92     1,905         0.003490
...
20  19  589    6,147         0.002849
21  20  15     6,162         0.002988
22  21  390    6,552         0.002970
23  22  1,863  8,415         0.002486
24  23  1,337  9,752         0.002271
25  24  4,508  14,260        0.001650
Calculated stopping time = 11,195 seconds; e = 0.0020; stop after fault 24; fault coverage = 63.16%; time coverage = 16.63%.

Cost Analysis
Cost per corrected error (postrelease), a = $6.00; cost per corrected error (prerelease), b = $1.00; cost per test, c = $0.01. Total faults (tf) = 38; total cycles (time) (tt) = 67,335 seconds; stop fault (sf) = 24; stop cycle (st) = 9,752 seconds; remaining faults (rf) = 14; remaining cycles (time) (rt) = 57,583 seconds; cost of correcting all faults (exhaustive): $711.35.
Is a(rf) < b(rf) + c(rt)? Is (6)(14) < (1)(14) + (0.01)(57,583)? Is 84.00 < 589.83? Yes. Savings using stopping rule: $505.83.
End of analysis
REFERENCES

1. G. Parikh, Handbook of Software Maintenance, Wiley, New York, 1986.
2. W. Farr, Chap. 3 in M. R. Lyu (ed.), Handbook of Software Reliability Engineering, IEEE Computer Society Press/McGraw-Hill, New York, 1996.
3. J. Keyes, Software Engineering Handbook, Auerbach Publications, Boca Raton, FL, 2003, Chap. 16.
4. B. Marick, Craft of Software Testing, 2nd ed., Prentice Hall, Upper Saddle River, NJ, 1994.
5. C. Kaner, J. Falk, and H. Q. Nguyen, Testing Computer Software, Wiley, New York, 1999.
6. M. Sahinoglu, An Empirical Bayesian Stopping Rule in Testing and Verification of Behavioral Models, IEEE Trans. Instrum. Meas., 52, 1428–1443 (October 2003).
7. M. Sahinoglu, C. Bayrak, and T. Cummings, High Assurance Software Testing in Business and DoD, Trans. Soc. Des. Process Sci., 6(2), 107–114 (2002).
8. T. Chen, M. Sahinoglu, A. von Mayrhauser, A. Hajjar, and C. Anderson, How Much Testing Is Enough? Applying Stopping-Rules to Behavioral Model Testing, Proceedings of the 4th International High-Assurance Systems Engineering Symposium (HASE'99), November 17–19, 1999, pp. 249–256.
9. M. Sahinoglu, A. von Mayrhauser, A. Hajjar, T. Chen, and C. Anderson, On the Efficiency of a Compound Poisson Stopping-Rule for Mixed Strategy Testing, IEEE Aerospace Conference Proceedings, Snowmass at Aspen, CO, March 6–13, 1999.
10. T. Chen, M. Sahinoglu, A. von Mayrhauser, A. Hajjar, and C. Anderson, Achieving the Quality of Verification for Behavioral Models with Minimum Effort, Proceedings of the First International Symposium on Quality Electronic Design (IEEE/ISQED), San Jose, CA, March 20–22, 2000, pp. 234–239.
11. B. Barrera, Code Coverage Analysis: Essential to a Safe Design, Electron. Eng., pp. 41–44 (November 1998).
12. S. R. Das, C. V. Ramamoorthy, M. H. Assaf, E. M. Petriu, and W.-B. Jone, Fault Tolerance in Systems Design in VLSI Using Data Compression Under Constraints of Failure Probabilities, IEEE Trans. Instrum. Meas., 50(6), 1725–1745 (December 2001).
13. S. R. Das, M. Sudarma, M. H. Assaf, E. M. Petriu, W. Jone, K. Chakrabarty, and M. Sahinoglu, Parity Bit Signature in Response Data Compaction and Built-in Self-Testing of VLSI Circuits with Nonexhaustive Test Sets, IEEE Trans. Instrum. Meas., 52(5), 1363–1380 (October 2003).
14. S. R. Das, M. H. Assaf, E. M. Petriu, and M. Sahinoglu, Aliasing-Free Compaction in Testing Cores-Based System-on-Chip (SOC) Using Compatibility of Response Data Outputs, Trans. Soc. Des. Process Sci., 8(1), 1–17 (March 2004).
15. S. R. Das, C. V. Ramamoorthy, M. H. Assaf, E. M. Petriu, W. B. Jone, and M. Sahinoglu, Revisiting Response Compaction in Full-Scan Circuits with Nonexhaustive Test Sets Using Concept of Sequence Characterization, IEEE Trans. Instrum. Meas., Special Issue on VLSI Testing, 54(5), 1662–1677 (October 2005).
16. S. R. Das, C. V. Ramamoorthy, M. H. Assaf, E. M. Petriu, W. B. Jone, and M. Sahinoglu, Fault Simulation and Response Compaction in Full-Scan Circuits Using HOPE, IEEE Trans. Instrum. Meas., 54(6), 2310–2328 (December 2005).
17. S. R. Das, C. Jin, L. Jin, M. H. Assaf, E. M. Petriu, W. B. Jone, S. Biswas, and M. Sahinoglu, Implementation of a Testing Environment for Digital IP Cores, IEEE Trans. Instrum. Meas., 55(6) (December 2006).
18. S. R. Das, J. Zakizadeh, M. H. Assaf, E. M. Petriu, S. Biswas, and M. Sahinoglu, Testing Analog and Mixed-Signal Circuits with Built-in Hardware: New Approach, IEEE Trans. Instrum. Meas., 55(6) (December 2006).
19. B. Dickinson and S. Shaw, Software Techniques Applied to VHDL Design, New Electron., 9, 63–65 (May 1995).
20. M. Sahinoglu and E. H. Spafford, A Sequential Procedure for Approving Software Products, Proceedings of the 28th IEEE Annual Spring Reliability Seminar, May 1990, pp. 127–149.
21. M. Sahinoglu and E. H. Spafford, A Bayes Sequential Statistical Procedure for Approving Products in Mutation-Based Software Testing, in W. Ehrenberger (ed.), Proceedings of the IFIP Conference on Approving Software Products (ASP'90), Garmisch-Partenkirchen, Germany, Elsevier Science (North-Holland), September 17–19, 1990, pp. 43–56.
22. N. Johnson, S. Kotz, and J. Kemp, Univariate Discrete Distributions, 2nd ed., Wiley, New York, 1993.
23. S. Kotz et al., Encycl. Stat. Sci., 5, 111–113; 6, 169–176 (1988).
24. C. C. Sherbrooke, Discrete Compound Poisson Processes and Tables of the Geometric Poisson Distribution, Memorandum RM-4831-PR, Rand Corporation, Santa Monica, CA, July 1966.
25. M. Sahinoglu, The Limit of Sum of Markov Bernoulli Variables in System Reliability Evaluation, IEEE Trans. Reliab., 39, 46–50 (April 1990).
26. M. Sahinoglu, Negative Binomial Density of the Software Failure Count, Proceedings of the 5th International Symposium on Computer and Information Sciences (ISCIS), Vol. 1, October 1990, pp. 231–239.
27. M. Sahinoglu, Compound Poisson Software Reliability Model, IEEE Trans. Software Eng., 18, 624–630 (July 1992).
28. M. Sahinoglu and U. Can, Alternative Parameter Estimation Methods for the Compound Poisson Software Reliability Model with Clustered Failure Data, Software Test. Verification Reliab., 7, 35–57 (March 1997).
REFERENCES
227
29. M. Sahinoglu and A. S. Al-Khalidi, A Bayesian Stopping-Rule for Software Reliability, Proceedings of the 5th World Meeting of ISBA, Satellite Meeting to ISI-1997, Istanbul, Turkey, August 1997. 30. M. Sahinoglu, J. J. Deely, and S. Capar, Stochastic Bayes Measures to Compare Forecast Accuracy of Software-Reliability Models, IEEE Trans. Reliab., pp. 92–97 (March 2001). 31. N. Johnson, S. Kotz, and N. Balakrishnan, Continuous Univariate Distributions, 2nd ed., Vol. 2, Wiley, New York, 1995. 32. E. Cinlar, Introduction to Stochastic Processes, Prentice-Hall, Englewood Cliffs, NJ, 1975. 33. M. Sahinoglu, The Limit of Sum of Markov Bernoulli Variables in System Reliability Evaluation, IEEE Trans. Reliab., 39, 46–50 (April 1990). 34. K. Abdullah, J. Kimble, and L. White, Correcting for Unreliable Regression Integration Testing, Proceedings of the International Conference on Software Maintenance, Nice, France, October 1995, pp. 232–241. 35. A. Hajjar, T. Chen, and A. von Mayrhauser, On Statistical Behavior of Branch Coverage in Testing Behavioral VHDL Models, presented at the IEEE High Level Design Validation and Test Workshop, Berkeley, CA, November 2000. 36. P. Randolph and M. Sahinoglu, A Stopping-Rule for a Compound Poisson Random Variable, Appl. Stochastic Models Data Anal., 11, 135–143 (June 1995). 37. M. Sahinoglu and A. Al-Khalidi, A Stopping-Rule for Time-Domain Software Testing, Proceedings of the 10th International Symposium on Software Reliability Engineering (ISSRE’99), Boca Raton, FL, November 1–4, 1999. 38. M. H. DeGroot, Optimal Statistical Decisions, McGraw-Hill, New York, 1970. 39. S. Samuels, Secretary Problems, in B. K. Ghosh and P. K. Sen (eds.), Handbook of Sequential Analysis, Marcel Dekker, New York, 1991, pp. 381–405. 40. A. Hajjar, T. Chen, I. Munn, A. Andrews, and M. Bjorkman, Stopping Criteria Comparison: Towards High Quality Behavioral VeriÞcation, presented at the International Symposium on Quality in Electronic Design, San Jose, CA, March 2001. 41. E. H. Forman and N. D. Singpurwalla, An Empirical Stopping Rule for Debugging and Testing Computer Software, J. Am. Stat. Assoc., 72, 750–757 (1977). 42. K. McDaid and S. P. Wilson, Deciding How Long to Test Software, Statistician, 50, 117–134 (2001). 43. S. M. Ross, Software Reliability: The Stopping Rule Problem, IEEE Trans. Software Eng., 11, 1472–1476 (1985). 44. N. D. Singpurwalla, Determining an Optimal Time Interval for Testing and Debugging Software, IEEE Trans. Software Eng., 17, 313–319 (1991). 45. M. Hicks, A Stopping Rule Tool for Software Testing, M.S. thesis, Department of Computer and Information Science, Troy University, Montgomery, AL, December 2000. 46. G. G. Roussas, A First Course in Mathematical Statistics, Addison-Wesley, Reading, MA, 1973, p. 253. 47. T. G. Pham and N. Turkkan, Bayes Binomial Sampling by Attributes with a Generalized-Beta Prior Distribution, IEEE Trans. Reliab., 41(1), 310–316 (1992). 48. S. R. Dallal and C. L. Mallows, When Should One Stop Testing Software, J. Am. Stat. Assoc., 83, 872–679 (1988).
228
STOPPING RULES IN SOFTWARE TESTING
49. N. D. Singpurwalla and S. P. Wilson, Statistical Methods in Software Engineering, Springer-Verlag, New York, 1999. 50. W. Notz, personal communication, Department of Statistics, Ohio State University, Columbus, OH, August 2002. 51. M. Sahinoglu and S. Glover, Economic Analysis of a Stopping Rule in Branch Coverage Testing, Proceedings of the 3rd International Symposium on Quality Electronic Design, San Jose, CA, March 2002, pp. 341–346. 52. D. Anderson, D. J. Sweeney, and T. A. Williams, An Introduction to Management Science: Quantitative Approaches to Decision Making, 10th ed., Thomson-South Western, Mason, OH, 2002, pp. 735–743. 53. K. S. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, 2nd ed., Wiley, Hoboken, NJ, 2002. 54. S. Gokhale and K. Trivedi, Log-Logistic Software Reliability Growth Model, Proceedings of the 3rd IEEE International High Assurance Systems Engineering Symposium (HASE ’98), Washington, DC, November 1998, pp. 34–41. 55. A. Goel, Software Reliability Models: Assumptions, Limitations, and Applicability, Software Eng., 11(12), 1411–1423 (December 1985). 56. J. Musa, A Theory of Software Reliability and Its Application, Software Eng., 1(3), 312–327 (1975). 57. D. Mills, On the Statistical Validation of Computer Programs, Report FSC-72-6015, IBM Federal Systems Division, Gaithersburg, MD, 1972. 58. W. Howden, ConÞdence-Based Reliability and Statistical Coverage Estimation, Proceedings of the International Symposium on Software Reliability Engineering, 1997, pp. 283–291. 59. W. Howden, Systems Testing and Statistical Test Data Coverage, Proceedings of COMPSAC, IEEE Computer Society Press, Los Alamitos, CA, August 1997, pp. 500–505. 60. S. Chen and S. Mills, A Binary Markov Process Model for Random Testing, Trans. Software Eng., 22(3), 218–223 (1996). 61. T. Chen, I. Munn, A. von Mayrhauser, and A. Hajjar, EfÞcient VeriÞcation of Behavioral Models Using the Sequential Sampling Technique, presented at the Symposium on Very Large Scale Integration, S˜ao Paulo, Brazil, 1999. 62. E. E. Lewis, Introduction to Reliability Engineering, 2nd ed., Wiley, New York, 1996. 63. A. Hajjar and T. Chen, A New Stopping-Rule for Behavioral Model VeriÞcation Based on Statistical Bayesian Technique, IEEE-TCAD, Colorado State University, Fort Collins, CO 80526, 2001. 64. A. Hajjar and T. Chen, Improving the EfÞciency and Quality of Simulation-Based Behavioral Model VeriÞcation Using Dynamic Bayesian Criteria, Proceedings of the 3rd International Symposium on Quality Electronic Design, San Jose, CA, March 2002, pp. 304–309. 65. T. Cummings, A New ScientiÞc Business Engineering Paradigm for Software Agencies, M. S. thesis, Troy University, Montgomery, AL, 2000. 66. W. B. Jone and S. R. Das, An Improved Analysis on Random Test Length Estimation, Int. J. Comput. Aided VLSI Des., 3, 393–406 (1991).
EXERCISES
229
67. P. Nandakumar, S. M. Datar, and R. Akella, Models for Measuring and Accounting for Cost of Conformance Quality, Manage. Sci., 39, 1–16 (1993). 68. A. N. Shiryayev, Optimal Stopping Rules, Springer-Verlag, New York, 1978. 69. P. H. Randolph, Optimal Stopping Rules for Multinomial Observations, Metrika, 14, 48–61 (1969). 70. H. Robbins, Optimal Stopping, Amer. Math. Mon., 77, 333–343 (1963). 71. J. A. Yahov, On Optimal Stopping, Ann. Math. Stat., 34, 30–35 (1966). 72. J. D. Musa and K. Okumoto, A Logarithmic Poisson Execution Time Model for Software Reliability Measurement, Proceedings of the 7th International Conference on Software Engineering, 1984, pp. 230–238. 73. G. Becker and I. Camarinopoulos, A Bayesian Method for the Failure Rate of a Possibly Correct Program, Trans. Software Eng., 16, 1307–1316 (1970). 74. J. S. Maritz, Empirical Bayes Methods, Methuen, London, 1970. 75. S. J. Press, Bayesian Statistics: Principles, Models and Applications, Wiley, New York, 1989. 76. J. D. Musa, A. Iannino, and K. Okumoto, Software Reliability: Measurement, Prediction, Application, McGraw-Hill International, Singapore, 1987.
EXERCISES

To use the applications and data files, click on "MESAT Stopping-Rule" in TWC-Solver on the CD-ROM.

4.1 Clicking on "Internal Data," utilize the discrete effort-based (nontime) chip design test data (DR1.txt, DR2.txt, DR3.txt, DR4.txt, DR5.txt) and business data (DR6.txt). Apply MESAT-1 to find stopping rules where a = $1000, b = $200, and c = $100 by employing single- and mixed-strategy testing. Choose 70% for your minimal confidence level. Assume that you know the end of the data set in terms of how many total test cases exist. You cannot stop unless you have a profit.

4.2 Repeat Exercise 4.1, but assume that one does not know the total number of test cases before starting. Decide on a minimal number of test cases you wish to try and on an initial budget for testing before you release your product following cyber-testing or decide that the product is okay following security testing.

4.3 Repeat Exercise 4.1 on DR7.txt, DR8.txt, and DR9.txt in "Internal Data."

4.4 (a) Clicking on "Internal Data," and utilizing the continuous time-based ROME lab data t1.txt, t2.txt, t3.txt, t4.txt, t5.txt, t6.txt, and t7.txt, apply MESAT-2 to find a stopping rule where a = $600, b = $200, and c = $1. (b) Compare the results of part (a) with results using the simplistic Howden method.
4.5 Using MESAT-1 for the data sets DR1 to DR5, conduct a single-strategy stopping-rule study, and, using a = $700, b = $200, c = $100, show why it is cost-efficient "not to execute exhaustive testing all the way to the end." What are the optimal a (>), b (<), and c (<) values when the other two are kept constant at the cost values given above? Print out one page of graphical results, at most two pages of analytical results, and a one-page conclusion stating why it is cost-effective.

4.6 Repeat Exercise 4.5 using MESAT-2 for the continuous data sets t1 to t7.
Mystic is what they call me. Hate is my only enemy. I harbor a grudge against none. To me the “whole wide world” is one. —Yunus Emre, the legendary mystic folk poet (1238–1320)
5 AVAILABILITY MODELING USING THE SAHINOGLU–LIBBY PROBABILITY DISTRIBUTION FUNCTION

Nutshell 5.1 With the advances in pervasive computing and worldwide networks, quantitative measurement of component and network availability has become a challenging task. It is widely recognized that the forced outage ratio (FOR) of an embedded hardware component is defined as the failure rate divided by the sum of the failure and repair rates; equivalently, FOR is the nonoperating time divided by the total exposure time. However, it is also well documented that FOR is not a constant but a random variable. The probability density function (p.d.f.) of FOR is the Sahinoglu–Libby (SL) probability model, used if certain underlying assumptions hold. The SL p.d.f. is a generalized three-parameter beta distribution (G3B). The failure and repair rates are taken to be generalized gamma variables whose corresponding shape and scale parameters, respectively, are not equal. The SL model is shown to default to a standard two-parameter beta p.d.f. when the shape parameters are identical. Decision-theoretic solutions are sought to compute small-sample Bayesian estimators by using informative and noninformative priors for the failure and repair rates with respect to three definitions of loss functions. These estimators for component availability are then propagated to calculate the expected source–target availability of four fundamental simple networks. Application to complex networks follows in Chapter 6. The method proposed is superior to estimating availability by dividing total uptime by exposure time. Examples show the
validity of this method to avoid over- or underestimation of availability when only small samples or insufficient data exist for the historical life cycles of components.

5.1 NOMENCLATURE

G3B: generalized three-parameter beta random variable
SL: Sahinoglu–Libby (synonymous with the G3B random variable)
FOR: forced outage rate, or unavailability index, of a hardware or software component
R: availability, the probability that an item is up (operating) at any point in time
Q: unavailability, the random variable FOR, the probability that an item is inoperative at any point in time, where q = 1 − r
p.d.f.: probability density function of a given random variable
c.d.f.: cumulative distribution function of a given random variable
m.g.f.: moment-generating function
MLE: maximum likelihood estimate
a: number of occurrences of operative (up) times sampled
xT: total sampled up time for a occurrences
b: number of occurrences of debugging (down) times sampled
yT: total sampled debugging (down) time for b occurrences of debugging activity
c: shape parameter of gamma prior for component failure rate λ
ξ: inverse scale parameter of gamma prior for component failure rate λ
d: shape parameter of gamma prior for component recovery rate μ
η: inverse scale parameter of gamma prior for component recovery rate μ
q̂: estimator of random variable q using a specified estimation method
q*: expected unavailability (= FOR) estimator with informative prior using weighted squared-error loss
q**: expected unavailability (= FOR) estimator with noninformative prior when ξ = η = 0, c = d = 1, using weighted squared-error loss
q**_l-s: expected unavailability (= FOR) large-sample asymptotic estimator of q** if a, b → ∞, where a/b ≈ 1
E(q): expected unavailability (= FOR) estimator with informative prior using squared-error loss
qM: median or Bayes unavailability estimator with informative prior for an absolute-error loss function
r̂: estimator of random variable r (= 1 − q) using a specified estimation method
r*: expected availability (= 1 − FOR) estimator with informative prior using weighted squared-error loss
r**: expected availability (= 1 − FOR) estimator with noninformative prior when ξ = η = 0, c = d = 1, using weighted squared-error loss
r**_l-s: expected availability (= 1 − FOR) large-sample asymptotic estimator of r** if a, b → ∞, where a/b ≈ 1
E(r): expected availability (= 1 − FOR) estimator with informative prior using a squared-error loss function
rM: median or Bayes availability estimator with informative prior for an absolute-error loss function
5.2 INTRODUCTION AND MOTIVATION

In contrast to earlier research conducted by Sahinoglu and others on the reliability of software [1–4] and testing hardware [5–8], also cited in [9], in this chapter we concentrate on two main aspects: the theory and the application of the Sahinoglu–Libby (SL) p.d.f. to hardware components and networks. It is assumed that embedded hardware elements (or firmware) are components in a computer network whose failure and repair rates are distributed independently with two-parameter generalized Gamma(αi, βi) p.d.f.'s. The probability density function of FOR = λ/(λ + μ) [where λ is the failure rate (number of failures per unit time) and μ is the repair rate (number of repairs per unit time)], the ratio of λ ∼ Gamma(α1, β1) to the sum of λ ∼ Gamma(α1, β1) and μ ∼ Gamma(α2, β2), with α1 ≠ α2 and β1 ≠ β2, was originally documented by Sahinoglu and Libby in their respective published Ph.D. dissertations [10,11]. If the sampled historical up and down times xi and yi, respectively, are distributed exponentially, the reciprocals of the mean up or down times, λ and μ, are given gamma prior distributions on grounds of mathematical tractability, the conjugacy property, and versatility in representing or approximating a wide range of distributions. Further, applying Bayesian inference techniques, a posteriori distributions of these rates are obtained by merging the prior information with the system's sampled field data. Subsequent treatments of the same problem can be found in Sahinoglu et al. [10,12], Libby and Novick [13], Johnson and Kotz [14,15], and Pham-Gia and Duong [16] in the form G3B(α, β, L), where L = β1/β2, α = a + c, β = b + d. This estimation problem was presented in a classical reference book on statistical distributions by Johnson and Kotz, in which the SL (or G3B) did not occur in the 1970 edition, which contained only the default case of SL, the beta model [14, p. 182]. The 1995 edition made a reference to the said p.d.f. in the form of a G3B [15, p. 251]. The G3B or Sahinoglu–Libby p.d.f. first appeared in 1981 in the form of two separate Ph.D. dissertations derived independently [10,11]. For a multicomponent network more complex than a purely series or purely parallel arrangement, a closed-form probability distribution of availability is infeasible to attain. Therefore, assuming independent components, a function of a product of availabilities is necessary to estimate the desired network availability index (Rsys). Numerical examples on components and
various sample network configurations are given in subsequent sections. The goal is to calculate the expected component and network availability accurately despite the lack of large-sample historical data. This is accomplished by calculating component availabilities (i.e., Bayesian estimators) using selected informative and noninformative priors and three different loss (or penalty) functions.

5.3 SAHINOGLU–LIBBY PROBABILITY MODEL FORMULATION

Using the distribution function technique, the p.d.f. of FOR = q = λ/(λ + μ) is obtained by first deriving its c.d.f., G_Q(q) = P(Q ≤ q) = P(λ/(λ + μ) ≤ q), and then taking its derivative to obtain g_Q(q), as in equations (5A.1) to (5A.18) of Appendix 5A [10, pp. 26–32; 11; 12, p. 1487]:

g_Q(q) = \frac{\Gamma(a+b+c+d)}{\Gamma(a+c)\,\Gamma(b+d)}\,
         \frac{(\xi+x_T)^{a+c}(\eta+y_T)^{b+d}\,q^{a+c-1}(1-q)^{b+d-1}}
              {[\eta+y_T+q(\xi+x_T-\eta-y_T)]^{a+b+c+d}}
       = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,
         \frac{\beta_1^{\alpha}\,\beta_2^{\beta}\,q^{\alpha-1}(1-q)^{\beta-1}}
              {[\beta_2+q(\beta_1-\beta_2)]^{\alpha+\beta}}
       = B(\alpha,\beta)\,q^{\alpha-1}(1-q)^{\beta-1}\,L^{\alpha}
         \left[\frac{1}{1+q(L-1)}\right]^{\alpha+\beta}                                      (1)

where

B(\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}                 (2)

Note that g_Q(q) is the p.d.f. of the random variable Q = FOR, where α = a + c, β = b + d, β_1 = ξ + x_T, β_2 = η + y_T, and 0 ≤ q ≤ 1. If L = β_1/β_2 = 1 (i.e., β_1 = β_2), the conventional two-parameter beta p.d.f. is obtained. An alternative original derivation of the same p.d.f., termed the generalized multivariate beta distribution, is given by Libby [11, pp. 272–277]. The expression in equation (1) can also be reformulated in terms of SL(α = a + c, β = b + d, L = β_1/β_2) as follows:

g_Q(q) = \frac{L^{a+c}\,q^{a+c-1}\,(1-q)^{b+d-1}}
              {B(b+d,\,a+c)\,[1-(1-L)q]^{a+b+c+d}}                                           (3)

where

B(b+d,\,a+c) = \frac{\Gamma(a+c)\,\Gamma(b+d)}{\Gamma(a+b+c+d)}
\quad\text{and}\quad
L = \frac{\xi+x_T}{\eta+y_T}                                                                 (4)

Note that if L = 1, the Sahinoglu–Libby p.d.f. reduces to a Beta(α, β) p.d.f.
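As a one-line consistency check (ours, not in the original): setting L = 1 in equation (3) makes the bracketed factor unity, leaving

g_Q(q) = \frac{q^{a+c-1}(1-q)^{b+d-1}}{B(b+d,\,a+c)}
       = \frac{\Gamma(a+b+c+d)}{\Gamma(a+c)\,\Gamma(b+d)}\,q^{\alpha-1}(1-q)^{\beta-1},

which is exactly the Beta(α, β) density with α = a + c and β = b + d.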
Similarly, one obtains the p.d.f. of the random variable R (availability) = 1 − Q, which by reparametrization of SL(α, β, L) into SL(β, α, L^{-1}) leads to the following expression for the random variable r:

g_R(r) = \frac{L^{b+d}\,(1-r)^{a+c-1}\,r^{b+d-1}}
              {B(a+c,\,b+d)\,[1-(1-L)r]^{a+b+c+d}}                                           (5)

where

B(a+c,\,b+d) = \frac{\Gamma(a+c)\,\Gamma(b+d)}{\Gamma(a+b+c+d)}
\quad\text{and}\quad
L = \frac{\eta+y_T}{\xi+x_T}                                                                 (6)

Densities of SL (or G3B) distributions have been cited [12–14,16] for a variety of L or λ (according to the nomenclature selected) values. From a strictly mathematical point of view, the presence of the parameter L allows the SL p.d.f. to take a variety of shapes other than the standard Beta(α, β), where L = 1. For example, when α = β, the standard Beta(α, α) is symmetric with a mean at 0.5. However, the SL(α, α, L) distribution is not necessarily so, and can be skewed positively or negatively, depending on whether L > 1 or L < 1, respectively, because the mode, skewness, and kurtosis of the SL random variable now also depend on L. For 0 < L < 1, the SL p.d.f. stays below the plot of the corresponding standard beta near zero but crosses the latter to become the greater of the two p.d.f.'s at the point [14,16]

y_0 = [1 - L^{\alpha_1/(\alpha_1+\alpha_2)}]^{-1} - (1-L)^{-1}                               (7)
The reverse action holds true for L > 1, with the same crossing point y_0. The major drawback of the distribution is that there is no closed form for finite estimates of the moments: the moment-generating function for the univariate SL distribution is an infinite series [13]. This is why Bayes estimators are of practical use. The desired moments, as well as the median and quartiles, have been generated by a Java code written by the author; these values are listed in the example in Table 5.1 in Section 5.5. Bayes estimators in closed and numerically integrable forms are derived next. A trapezoidal formula is used for the numerical integration.

5.4 BAYES ESTIMATORS FOR VARIOUS INFORMATIVE PRIORS AND LOSS FUNCTIONS

Various studies have substantiated that finite moments do not exist in closed form for SL(α, β, L). Standard methods lead only to unfavorable recursive solutions, a situation that poses a dead end, as in references 13 and 16. However, an alternative way of finding meaningful and computable Bayes estimates for the unavailability random variable Q and the availability R = 1 − Q is achieved using Bayes estimation techniques with various loss functions [10]. Two popularly used variations of squared-error loss functions and one absolute-error loss function are examined as penalty functions.
5.4.1 Squared-Error Loss Function

Let q̂ denote an estimate of the random variable Q ≡ FOR. Hence L(q, q̂), the loss incurred in estimating the true but unknown q, can be defined at will. Usually, the loss penalty increases as the difference between q and q̂ increases. The squared-error loss function, L(q, q̂) = (q − q̂)², has therefore found favor, where the risk of taking a decision is

Risk(q, q̂) = E[L(q, q̂)] = E(q − q̂)²                                                        (8)

This is the variance of the estimator Q, penalizing larger differences more, as in classical least-squares theory [17,18]. The Bayes estimator of q in our problem with respect to the squared-error loss function is the first moment, or expected value, of the random variable q using its p.d.f. [10,17]. This follows from the fact that E(q − q̂)², if it exists, is a minimum when q̂ = E(q) [i.e., the mean of the conditional posterior distribution of q given x (vector of up times) and y (vector of down times)]. Then E(q) is the Bayes solution:

E(q) = E_Q[q \mid X = x, Y = y] = \int_0^1 q\,g_Q(q)\,dq                                     (9)

Similarly, the Bayes estimator of r (i.e., r̂ with respect to the squared-error loss function using an informative prior) is the first moment, or expected value, of the random variable R = r in equation (5):

E(r) = E_R[r \mid X = x, Y = y] = \int_0^1 r\,g_R(r)\,dr = 1 - E(q)                          (10)

5.4.2 Absolute-Error Loss Function

Similarly, according to Hogg and Craig [17, p. 262], the median of the random variable Q is the Bayes estimator using an informative prior when the loss function is given as L(q, q̂) = |q − q̂|. If E(|q − q̂|) exists, then q̂ = q_{0.5} minimizes the loss function [i.e., the median of the conditional posterior distribution of q given X = x (vector of up times) and Y = y (vector of down times)]. The median is very resistant to change. Then q_{0.5}, the median of q, denoted q_M, is the Bayes solution. That is, q_M is the 50th percentile (0.5 quantile, or second quartile) of q:

0.5 = \int_0^{q_{0.5}} g_Q(q)\,dq                                                            (11)

Similarly, r_M = 1 − q_M is the 50th percentile (0.5 quantile, or second quartile) of r:

0.5 = \int_0^{r_{0.5}} g_R(r)\,dr                                                            (12)
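To illustrate how such numerical integration can be carried out, the following small Java sketch (our own; class and method names are hypothetical, it is not the book's TWC-Solver code, and it assumes the Apache Commons Math library for logGamma) evaluates the SL density of equation (3) on the log scale and approximates the mean of equation (9) and the median of equation (11) with the trapezoidal rule:

import org.apache.commons.math3.special.Gamma;  // assumed third-party dependency

public final class SLDensity {
    final double alpha, beta, L;  // alpha = a + c, beta = b + d, L = (xi + xT)/(eta + yT)

    SLDensity(double a, double b, double xT, double yT,
              double c, double xi, double d, double eta) {
        this.alpha = a + c;
        this.beta  = b + d;
        this.L     = (xi + xT) / (eta + yT);
    }

    // g_Q(q) of equation (3), computed on the log scale for numerical stability
    double pdf(double q) {
        if (q <= 0.0 || q >= 1.0) return 0.0;
        double logB = Gamma.logGamma(alpha) + Gamma.logGamma(beta)
                    - Gamma.logGamma(alpha + beta);               // log B(b+d, a+c)
        return Math.exp(alpha * Math.log(L) + (alpha - 1) * Math.log(q)
                      + (beta - 1) * Math.log1p(-q)
                      - (alpha + beta) * Math.log1p(-(1.0 - L) * q) - logB);
    }

    // E(q) of equation (9) by the trapezoidal rule on n panels
    double mean(int n) {
        double h = 1.0 / n, sum = 0.0;
        for (int i = 1; i < n; i++) sum += (i * h) * pdf(i * h);  // endpoint terms vanish
        return h * sum;
    }

    // Median q_M of equation (11): first grid point where the running c.d.f. reaches 0.5
    double median(int n) {
        double h = 1.0 / n, cdf = 0.0, prev = pdf(0.0);
        for (int i = 1; i <= n; i++) {
            double cur = pdf(i * h);
            cdf += 0.5 * h * (prev + cur);                        // one trapezoid panel
            if (cdf >= 0.5) return i * h;
            prev = cur;
        }
        return 1.0;
    }

    public static void main(String[] args) {
        // Component 1 of Table 5.1: a=10, b=10, xT=1000, yT=111.11, c=0.02, xi=1, d=0.1, eta=1
        SLDensity q = new SLDensity(10, 10, 1000, 111.11, 0.02, 1, 0.1, 1);
        System.out.printf("E(q) = %.6f, q_M = %.4f%n", q.mean(200000), q.median(200000));
        // E(r) = 1 - E(q) should fall near 0.890985 and r_M = 1 - q_M near 0.8985 (Table 5.1)
    }
}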
5.4.3 Weighted Squared-Error Loss Function

Weighted squared-error loss is of considerable interest to statisticians and engineers [18] and has the attractive feature of allowing the squared error to be weighted by w(q), a function of q. This reflects the fact that a given error of estimation often varies in penalty according to the value of q. The weighted squared-error loss function selected in such cases is

L(q, q̂) = \frac{C(q - q̂)^2}{q(1-q)} = w(q)\,(q - q̂)^2                                      (13)

With this loss function, the Bayes estimator of q is given as [see equations (5B.1) to (5B.7) for details]

q^* = \frac{\int_Q q\,w(q)\,h(q \mid X = x, Y = y)\,dq}
           {\int_Q w(q)\,h(q \mid X = x, Y = y)\,dq}
    = \frac{E_Q[q\,w(q) \mid X = x, Y = y]}{E_Q[w(q) \mid X = x, Y = y]}                     (14)

Utilizing equation (5B.7) and taking the coefficient of the weight function w(q) to be C = 1,

q^* = \frac{E_Q[q/(q(1-q))]}{E_Q[1/(q(1-q))]}
    = \frac{E_Q[(1-q)^{-1}]}{E_Q[q^{-1}(1-q)^{-1}]}                                          (15)

where

(1-q)^{-1} = \frac{1}{1-\lambda/(\lambda+\mu)} = 1 + \frac{\lambda}{\mu}                     (16)

q^{-1}(1-q)^{-1} = \left(1+\frac{\mu}{\lambda}\right)\left(1+\frac{\lambda}{\mu}\right)
                 = 2 + \frac{\lambda}{\mu} + \frac{\mu}{\lambda}                             (17)

When these are substituted into equation (15), using the posterior gamma distributions of equations (5A.7) and (5A.8), one obtains

q^* = \frac{\int_{\lambda=0}^{\infty}\int_{\mu=0}^{\infty}
            \left[1 + \lambda/\mu\right] h_1(\lambda \mid x)\,h_2(\mu \mid y)\,d\lambda\,d\mu}
           {\int_{\lambda=0}^{\infty}\int_{\mu=0}^{\infty}
            \left[2 + \lambda/\mu + \mu/\lambda\right] h_1(\lambda \mid x)\,h_2(\mu \mid y)\,d\lambda\,d\mu}   (18)

Since

h_1(\lambda \mid x) = \frac{(x_T+\xi)^{a+c}}{\Gamma(a+c)}\,
                      \lambda^{a+c-1}\exp[-\lambda(x_T+\xi)]                                 (19)
is the Gamma[a + c, (x_T + ξ)^{-1}] density and

h_2(\mu \mid y) = \frac{(y_T+\eta)^{b+d}}{\Gamma(b+d)}\,
                  \mu^{b+d-1}\exp[-\mu(y_T+\eta)]                                            (20)

is the Gamma[b + d, (y_T + η)^{-1}] density, we have

\int_{\lambda}\int_{\mu} h_1(\lambda \mid x)\,h_2(\mu \mid y)\,d\lambda\,d\mu = 1.0          (21)

\int_{\lambda}\int_{\mu} \lambda\,h_1(\lambda \mid x)\,\mu^{-1} h_2(\mu \mid y)\,d\lambda\,d\mu
  = \frac{a+c}{\xi+x_T}\,\frac{\eta+y_T}{b+d-1}                                              (22)

where the expectation of a random variable distributed Gamma(α, β) [shape α, scale β] is αβ. Therefore, E(λ) = (a + c)/(ξ + x_T), using equation (5A.7). Using (5A.8), the expectation of the reciprocal of a random variable distributed Gamma(α, β) is 1/[β(α − 1)], as follows:

E\!\left(\frac{1}{\mu}\right) = \int_{\mu} \frac{1}{\mu}\,h_2(\mu \mid y)\,d\mu
  = \frac{\eta+y_T}{b+d-1}                                                                   (23)

Similarly, employing the same "expectation of the reciprocal of a gamma random variable" principle, and by (5A.7),

E\!\left(\frac{1}{\lambda}\right) = \int_{\lambda} \frac{1}{\lambda}\,h_1(\lambda \mid x)\,d\lambda
  = \frac{\xi+x_T}{a+c-1}                                                                    (24)

Now, putting it all together as dictated by equation (14) yields

q^* = \frac{1 + \dfrac{(a+c)(\eta+y_T)}{(\xi+x_T)(b+d-1)}}
           {2 + \dfrac{(b+d)(\xi+x_T)}{(\eta+y_T)(a+c-1)} + \dfrac{(a+c)(\eta+y_T)}{(\xi+x_T)(b+d-1)}}   (25)

which is the small-sample (before the sampled sums x_T and y_T predominate) Bayes estimator with respect to the weighted squared-error loss function given above, suggested for use in conventional studies to stress tail values such as q = 0.1 or q = 0.9, where the value of the weight function increases. This is a small-sample estimator, as opposed to an asymptotic estimator requiring large-sample data, thereby reflecting insufficient unit history. Here w(q) was conveniently taken to be [q(1 − q)]^{-1}. For the special case ξ = η = 0 (i.e., infinite scale parameters) and c = d = 1 in equation (25), for noninformative (flat) priors, q* becomes q**:

q^{**} = \frac{1 + (a+1)y_T/(x_T b)}
              {2 + (b+1)x_T/(y_T a) + (a+1)y_T/(x_T b)}
       = \frac{x_T y_T ab + y_T^2\,a(a+1)}
              {2 x_T y_T ab + y_T^2\,a(a+1) + x_T^2\,b(b+1)}                                 (26)
Finally, q** asymptotically approaches the q_{l-s} estimator, the same as the MLE obtained by conventional (non-Bayesian) methods, which occurs when the influence of the a priori parameters vanishes. This happens when the numbers of samples observed, a, b → ∞ in equation (25), such that a + 1 ≈ b and b + 1 ≈ a. Then (26) reduces to

q^{**}_{l-s} = q_{l-s} = \frac{1 + y_T/x_T}{2 + y_T/x_T + x_T/y_T}
  = \frac{y_T(x_T+y_T)}{(x_T+y_T)^2} = \frac{y_T}{x_T+y_T}                                   (27)

By a similar process, we can reparametrize for the random variable r = 1 − q. This reparametrization is achieved since, if Q ∼ SL(α, β, L), its complement 1 − Q is SL(β, α, L^{-1}), a characteristic similar to the one employed for the standard Beta(α, β) as in equation (4). Note that E(r) = 1 − E(q). Then

r^* = \frac{1 + \dfrac{(\xi+x_T)(b+d)}{(a+c-1)(\eta+y_T)}}
           {2 + \dfrac{(b+d)(\xi+x_T)}{(\eta+y_T)(a+c-1)} + \dfrac{(a+c)(\eta+y_T)}{(\xi+x_T)(b+d-1)}}
    = 1 - q^*                                                                                (28)

is the Bayes estimator of the availability R = 1 − Q with respect to weighted squared-error loss. Here, w(r) was similarly taken in equation (13) to be [r(1 − r)]^{-1}. For the special case ξ = η = 0, c = d = 1 (i.e., for noninformative or flat priors), r* becomes r**:

r^{**} = \frac{1 + (b+1)x_T/(y_T a)}
              {2 + (b+1)x_T/(y_T a) + (a+1)y_T/(x_T b)} = 1 - q^{**}                          (29)

If the sample sizes of up and down times, a and b, are so large that a/b → 1, then similarly r** approaches r_{l-s}, which is the MLE for a, b → ∞:

r^{**}_{l-s} = r_{l-s} = \frac{1 + x_T/y_T}{2 + x_T/y_T + y_T/x_T}
  = \frac{x_T(x_T+y_T)}{(x_T+y_T)^2} = \frac{x_T}{x_T+y_T}                                   (30)
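Because (25) to (30) are closed forms, they are easy to verify numerically. The short Java sketch below (our own illustration, not the book's code; the variable names mirror the nomenclature of Section 5.1) reproduces the component 1 column of Table 5.1:

public final class ClosedFormEstimators {
    public static void main(String[] args) {
        // Component 1 of Table 5.1
        double a = 10, b = 10, xT = 1000, yT = 111.11;
        double c = 0.02, xi = 1, d = 0.1, eta = 1;

        // q* of equation (25): informative prior, weighted squared-error loss
        double t1 = (a + c) * (eta + yT) / ((xi + xT) * (b + d - 1));
        double t2 = (b + d) * (xi + xT) / ((eta + yT) * (a + c - 1));
        double qStar = (1 + t1) / (2 + t1 + t2);

        // q** of equation (26): flat prior (xi = eta = 0, c = d = 1)
        double u1 = (a + 1) * yT / (xT * b);
        double u2 = (b + 1) * xT / (yT * a);
        double qStarStar = (1 + u1) / (2 + u1 + u2);

        // Large-sample limit (27)/(30), the conventional MLE
        double qMLE = yT / (xT + yT);

        System.out.printf("r*    = %.6f (Table 5.1: 0.907325)%n", 1 - qStar);
        System.out.printf("r**   = %.6f (Table 5.1: 0.906655)%n", 1 - qStarStar);
        System.out.printf("r_MLE = %.4f%n", 1 - qMLE);
    }
}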
5.5 AVAILABILITY CALCULATIONS FOR SIMPLE PARALLEL AND SERIES NETWORKS

Four fundamental topologies will be studied. In evaluating the availability or unavailability of the various networks, exact values are used, such as
TABLE 5.1 Input and Output Table for Component and Network Applications(a)

Input Data                          Component 1     Component 2    Component 3    Component 4
a (no. failures)                    10              5              10             100
b (no. repairs)                     10              5              10             100
xT (operating time)                 1000 h          25 h           1000 h         10,000 h
yT (repair time)                    111.11 h        5 h            111.11 h       1111.1 h
c (shape for λ)                     0.02            0.2            0.5            0.5
ξ (inverse scale for λ)             1               1              1              1
d (shape for μ)                     0.1             2              2              2
η (inverse scale for μ)             1               0.5            0.25           0.25

Case 1: single components
r* (see Figure 5.5)                 0.907325        0.882397       0.917648       0.902028
r**                                 0.906655        0.849515       0.906655       0.900714
E(r) = mean (see Figure 5.3)        0.890985        0.758064       0.879164       0.897920
rM = median                         0.898540        0.775410       0.886580       0.898650
r_l-s = MLE                         0.9             0.8333         0.9            0.9
Std. deviation (= √Bayes risk)      0.045 = √.00203 0.107 = √.0115 0.046 = √.0021 0.013 = √.00017
IQR (interquartile range;
  see Figure 5.4)                   0.06            0.14           0.06           0.015
Skewness                            −1.11           −0.901         −1.04          −0.339
Kurtosis                            2.11            0.985          1.846          0.2

Case 2: system with identical component 1 (see Figure 5.6)
                                    Config. I       Config. II     Config. III    Config. IV
R*                                  0.677723        0.999926       0.968721       0.907495
R**                                 0.675723        0.999924       0.968324       0.906723
E(R)                                0.630206        0.999868       0.957504       0.976315
R_M                                 0.652027        0.999981       0.963901       0.989010
R**_l-s                             0.656102        0.999900       0.963900       0.980100

Case 3: system with four different components (see Figure 5.7)
                                    Config. I       Config. II     Config. III    Config. IV
R*                                  0.662709        0.999920       0.967978       0.981846
R**                                 0.628987        0.999846       0.956336       0.976621
E(R)                                0.533192        0.999674       0.931650       0.961584
R_M                                 0.565567        0.999738       0.938425       0.965978
R**_l-s                             0.656100        0.999833       0.952501       0.973500

(a) Sample input parameters and computed estimators for case 1 of single components, and for cases 2 and 3 of the four system configurations (a) to (d) of Figure 5.1, are given. Case 1: single components (nonsystem); case 2: system study with all units identical to component 1; case 3: system study with all nonidentical components, using components 1 to 4 in sequence as needed. Gamma prior parameters are selected from the sample plots in Figure 5.2 to show degrees of skewness: for example, the pair d (shape) = 2, η (inverse scale) = 0.5 is almost symmetric, whereas the pair c (shape) = 0.5, ξ (inverse scale) = 1.0 is a hyperexponential. The scale parameters in Figure 5.2 are the reciprocals of the inverse scales in this table.
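As a quick hand check of the MLE row (a one-line verification of ours): by equation (30), the large-sample estimate for component 2 is

r_{l-s} = \frac{x_T}{x_T + y_T} = \frac{25}{25 + 5} = 0.8333,

and for components 1, 3, and 4 it is 1000/1111.11 = 10{,}000/11{,}111.1 = 0.9, in agreement with the r_l-s = MLE row of Table 5.1.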
\prod q_i or \prod r_i, where r_i = 1 − q_i and \prod denotes "the product of." Assume below that all units have identical reliabilities.

I. Series systems: R_sys = \prod_{i=1}^{m} r_i and Q_sys = 1 − R_sys, where m = number of series subsystems.
II. Active parallel systems: Q_sys = \prod_{i=1}^{n} q_i and R_sys = 1 − Q_sys, where n = number of parallel paths.
III. Series in active parallel: Q_sys = \prod_{1}^{n} [1 − \prod_{1}^{m} r_i] and R_sys = 1 − Q_sys.
IV. Active parallel in series: R_sys = \prod_{1}^{m} [1 − \prod_{1}^{n} q_i] and Q_sys = 1 − R_sys.

Three cases are tested and illustrated [10,19]. See Table 5.1 to observe the differences between the Bayesian estimators [20,21], along with the input data and results for cases 1, 2, and 3. In coding an algorithm in the Java program written specifically for this purpose, the postfix (*) is used to denote series systems and the postfix (+) to denote active parallel systems [22, p. 454]. The components in Figure 5.1 are treated at most two at a time [23, pp. 298–299]. Below are some examples of the four fundamental parallel–series networks.
FIGURE 5.1 (a) Series in parallel; (b) parallel in series; (c) parallel; (d) series.
FIGURE 5.2 Gamma density (α = shape, β = scale, ν (chi-square) = αβ).
Using a hand calculator, for r_i = 0.9: R_sys(I) = 0.6561, R_sys(II) = 0.9999, R_sys(III) = 0.9639, and R_sys(IV) = 0.9801. Let's encode each configuration using postfixes:

I. 1,2,*,3,4,*,* denotes all four components in series. For case 3 of nonidentical components 1 to 4, R_sys = r_1 r_2 r_3 r_4. For case 2, let all r_i's be identical.
II. 1,2,+,3,4,+,+ denotes all four components in active parallel. For case 3 with nonidentical components 1 to 4, R_sys = 1 − q_1 q_2 q_3 q_4. For case 2, let all q_i's be identical.
III. 1,2,*,3,4,*,+ denotes that components 1 and 2, first placed in series, are in active parallel with the series pair of components 3 and 4: R_sys = 1 − (1 − r_1 r_2)(1 − r_3 r_4).
IV. 1,2,+,3,4,+,* denotes that components 1 and 2, first placed in active parallel, are in series with components 3 and 4, also in active parallel: R_sys = (1 − q_1 q_2)(1 − q_3 q_4). A minimal sketch of such a postfix evaluator is given below.
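The following stack-based evaluator (our own illustrative Java, not the book's tool; the names PostfixReliability and eval are hypothetical) reads the comma-separated postfix encoding, pushing component reliabilities and combining the top two blocks on * (series) or + (active parallel). It reproduces the four hand-calculator results for r_i = 0.9:

import java.util.ArrayDeque;
import java.util.Deque;

public final class PostfixReliability {
    static double eval(String expr, double[] r) {   // r[i] = reliability of component i (1-based)
        Deque<Double> stack = new ArrayDeque<>();
        for (String tok : expr.split(",")) {
            if (tok.equals("*")) {                  // series: R = R1 * R2
                stack.push(stack.pop() * stack.pop());
            } else if (tok.equals("+")) {           // active parallel: R = 1 - Q1 * Q2
                double r2 = stack.pop(), r1 = stack.pop();
                stack.push(1 - (1 - r1) * (1 - r2));
            } else {
                stack.push(r[Integer.parseInt(tok)]);  // operand: push component reliability
            }
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        double[] r = {0, 0.9, 0.9, 0.9, 0.9};       // identical components, r_i = 0.9
        System.out.printf("I   = %.4f%n", eval("1,2,*,3,4,*,*", r));  // series             -> 0.6561
        System.out.printf("II  = %.4f%n", eval("1,2,+,3,4,+,+", r));  // parallel           -> 0.9999
        System.out.printf("III = %.4f%n", eval("1,2,*,3,4,*,+", r));  // series in parallel -> 0.9639
        System.out.printf("IV  = %.4f%n", eval("1,2,+,3,4,+,*", r));  // parallel in series -> 0.9801
    }
}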
5.6 DISCUSSION AND CONCLUSIONS

In this chapter we have studied the basic theory and application of the Sahinoglu–Libby (SL) p.d.f. as applied to hardware components and networks [24]. In the theory section of the chapter, a detailed derivation of the univariate SL(α, β, L) p.d.f., as noted originally in Sahinoglu's Ph.D. dissertation, is presented with reference to a Bayesian process for informative and noninformative priors using absolute, squared, and weighted squared-error loss functions. The SL(α, β, L) p.d.f. is thus the continuous probability density function of the random variable of unavailability (or availability, when reparametrized) of a component in a network whose lifetime can be decomposed into operating (up) and nonoperating (down) states in a dichotomous setting. This treatment does not lend itself to including a derated state, for which a multivariate SL p.d.f. must be derived, a task for future consideration. Up and down times are general gamma models where both shape and scale parameters differ from each other. Beta(α, β) is a special case of SL(α, β, L) in which L = 1 (i.e., the two gamma scale sums β_1 = ξ + x_T and β_2 = η + y_T are identical). Further, analytical difficulties in calculating the closed-form moments of the said random variable are outlined, suggesting Bayesian estimators using informative and flat (noninformative) priors with respect to various meaningful loss functions.

In the application section of the chapter, the reader is referred primarily to Table 5.1 and Figures 5.1 and 5.2 for input parameters and output estimators of the availability for the four components selected as examples. Case 1 is for single components only, without network consideration. Case 2 is for networks of different topology, with configuration I (series), configuration II (parallel), configuration III (series in parallel), and configuration IV (parallel in series), when component 1 input data are used invariably for all four components that make up the configuration. Case 3 is the same as case 2 except that the components are not identical and are selected in sequence from component 1 to component 4 as listed in Table 5.1. The variances of both q and r are identical, as expected, and so are their standard deviations. The random variable r is left-skewed (negative skewness) and q is right-skewed (positive skewness), at a mirror image. For component 1, the standard deviations are 0.045, the skewness values are −1.11 and 1.11, and the data-resistant medians are 0.8985 and 0.1015 for the two random variables, respectively. Both have positive kurtosis (= 2.11), which denotes leptokurtic distributions whose tail thickness is above that of a standard normal distribution with a reference of unity. The trapezoidal rule is used with very fine precision to obtain the moments of r and q, where P(0 < r < 1) = 1 and P(0 < q < 1) = 1; P(0.5 < r < 0.9) = 0.429 and P(0 < r < 0.9078) = 0.5 are examples. In the upper input part of Table 5.1, complying with the definitions in Section 5.1, the gamma priors for the failure and repair rates as indicating left or
right skewness or symmetry can be chosen by the analyst at will, with an educated guess or expert judgment, as in Figure 5.2 [15, p. 169]. For example, the prior inputs of the failure and repair rates for component 3 in Table 5.1 with c = 0.5 and ξ = 1 denote a peaked hyperexponential, such as at the very upper left in Figure 5.2, whereas d = 2, η = 0.25 resembles an almost-flattened moundlike shape rising at most to a probability of 0.1, as at the very bottom right. Keeping all other parameters constant, it may be observed that as the prior distributions of the failure and repair rates become more realistic (i.e., as the numbers of occurrences of sampled up and down times a and b get larger and larger, and correspondingly the total up and down times increase), the mean of the random variable q (= FOR) approaches the MLE used by conventional methods [12,24]. It may be observed for component 4 in Table 5.1, for example, that by taking a large number of failure and repair events, such as 10,000, the small-sample Bayesian estimator of the availability random variable R converges to the large-sample estimator of x_T divided by the sum of y_T and x_T, which is 0.9. When the total up and down times for component 4 were elevated to x_T = 100,000 hours and y_T = 11,111.11 hours for a = 100 and b = 100, the Bayesian estimators E(r) = 0.900623, r* = 0.902037, r** = 0.900714, and r_M = 0.901330 all came closer to the 0.9 mark, which is the large-sample estimate, or the conventional MLE, as expected. Further supporting this fact is a sequence of sensitivity analyses, such as a = 150, b = 150, x_T = 1,500,000 hours, and y_T = 166,666.65 hours, which yielded E(r) = 0.900417, r* = 0.901366, r** = 0.900499, and r_M = 0.900892. A final attempt at forcing the computational boundaries, a = 170 and b = 170 with the ratios kept the same (i.e., x_T = 1,700,000 hours, y_T = 188,888.87 hours), yielded E(r) = 0.900368, r* = 0.901207, r** = 0.900422, and r_M = 0.900788; that is, each time converging closer to the conventional MLE, which is r_{l-s} = 0.9 in this case.

In the event of insufficient data, it is therefore demonstrated that, depending on the types of priors and penalty functions, the Bayes estimators are statistically good alternatives when large-sample asymptotic estimators cannot be obtained [25]. A wise choice of prior parameters and penalty functions is an important requirement, since the more realistic these judgments are, the more accurate the results will be [10,24]. Otherwise, assuming large-sample estimators when large-sample data are not available may lead to erroneously over- or underestimated calculations of component availability, and thus an erroneously propagated network availability measure. Therefore, in an algorithmic sequence, when pursuing a similar project cycle on availability:

1. Decide on the prior functions for the components by considering a list of gamma plots, such as in Figure 5.2, for the failure and repair rates, as shown in Table 5.1.
2. Decide on your component's most realistic loss or penalty function, as given in Section 5.4.1, 5.4.2, or 5.4.3.
3. Decide for the components whether to use informative or flat priors, and choose the Bayesian estimator(s) r*, r**, r_{l-s}, E(r), or r_M, or a default brute-force (static or enforced) value, as listed in Table 5.1.

These rules then hold also for network applications [i.e., R*, R**, E(R), R_M, or default (static)], according to a given topology, a sample of which is shown in Figure 5.1. Moreover, these calculations are applicable to any complex (nonparallel–series) networks, which have not been illustrated here for lack of space. The use of the SL p.d.f. in complex networks is studied in Chapter 6, where other than default or enforced component values are employed owing to insufficient historical failure and repair data, as discussed in detail in the present chapter. Finally, Figures 5.3 to 5.7, respectively, illustrate some additional applications:

1. 90% confidence intervals for the population mean using the Bayes estimator for a single component, for the unavailability (q) and availability (r) random variables with component 1 input data as given in Table 5.1, are shown in Figure 5.3.
FIGURE 5.3 90% confidence intervals for the right-skewed unavailability (on the left) and the left-skewed availability (on the right) random variables.
FIGURE 5.4 Medians and interquartile ranges for unavailability (on the left) and availability (on the right) random variables.
FIGURE 5.5 Comparison of availability estimators studied for a single component 1 (D for default).
FIGURE 5.6 Comparison of availability estimators studied for a series network (four components).
4. Comparison of availability estimators studied for a sample network with four components in series, using component 1 data in Table 5.1, is shown in Figure 5.6.
5. Density plots for the right-skewed unavailability (q) and the left-skewed availability (r) random variables, side by side, using input data from the four components in Table 5.1, are shown in Figure 5.7.

APPENDIX 5A: DERIVATION OF THE SAHINOGLU–LIBBY p.d.f.

The results shown in many textbooks indicate that the residence times in the down state prior to the up state, or vice versa, are roughly exponentially distributed for most electronic hardware equipment. Let X_i and Y_j be the up and down times, respectively. It follows that

f(x_i) = \lambda \exp(-\lambda x_i), \quad i = 1, 2, \ldots, a, \quad \lambda > 0, \quad x_i > 0     (5A.1)

f(y_j) = \mu \exp(-\mu y_j), \quad j = 1, 2, \ldots, b, \quad \mu > 0, \quad y_j > 0                 (5A.2)

where a is the number of up times sampled and b is the number of down times sampled. Now let the generator failure rate, λ, and the repair rate, μ, have independent prior distributions from the gamma family:

\theta_1(\lambda) = \frac{\xi^c}{\Gamma(c)}\,\lambda^{c-1}\exp(-\lambda\xi), \quad \lambda > 0       (5A.3)
FIGURE 5.7 Density plots for the right-skewed unavailability (on the left) and the left-skewed availability (on the right) random variables.
where, for the λ prior, c is a shape parameter and ξ is an inverse scale parameter, and

\theta_2(\mu) = \frac{\eta^d}{\Gamma(d)}\,\mu^{d-1}\exp(-\mu\eta), \quad \mu > 0                     (5A.4)

where, for the μ prior, d is a shape parameter and η is an inverse scale parameter; all are estimated by means of a suitable prior estimation technique. The posterior distributions of λ and μ are obtained by mixing their priors with the data. Since the family of gamma prior distributions for the failure rate λ and repair rate μ is conjugate to the exponential distributions of the up and down data,
respectively, their posterior distributions will have the same gamma form, with shape and scale parameters built from the prior parameters and the current up- or down-time totals. Therefore, from equations (5A.1) through (5A.4), the joint likelihood of the up-time random variables is

f(x_1, x_2, \ldots, x_a \mid \lambda) = \lambda^a \exp(-x_T\lambda)                                  (5A.5)

where a is the number of occurrences of up times sampled and x_T is the total sampled up time for a occurrences. The joint distribution of data and prior becomes

k(x, \lambda) = f(x_1, \ldots, x_a \mid \lambda)\,\theta_1(\lambda)
             = \frac{\xi^c}{\Gamma(c)}\,\lambda^{a+c-1}\exp[-\lambda(x_T+\xi)]                        (5A.6)

Thus, the posterior distribution for λ is

h_1(\lambda \mid x) = \frac{k(x,\lambda)}{\int_\lambda k(x,\lambda)\,d\lambda}
 = \frac{(\xi^c/\Gamma(c))\,\lambda^{a+c-1}\exp[-\lambda(x_T+\xi)]}
        {(\xi^c/\Gamma(c))\,\Gamma(a+c)\,(x_T+\xi)^{-(a+c)}}
 = \frac{(x_T+\xi)^{a+c}}{\Gamma(a+c)}\,\lambda^{a+c-1}\exp[-\lambda(x_T+\xi)]                        (5A.7)

which is Gamma[a + c, (x_T + ξ)^{-1}], or Gamma(n', 1/b'), as suggested earlier, due to the mathematical conjugacy property. The same arguments hold for the repair rate, μ. That is,

h_2(\mu \mid y) = \frac{(y_T+\eta)^{b+d}}{\Gamma(b+d)}\,\mu^{b+d-1}\exp[-\mu(y_T+\eta)]               (5A.8)

is the Gamma[b + d, (y_T + η)^{-1}], or Gamma(m', 1/a'), posterior distribution for μ, where b is the number of occurrences of down times sampled and y_T is the total sampled down time for b occurrences; usually a = b or a ≈ b. Let Q be the random variable for the forced outage rate, FOR (unavailability) = q = λ/(λ + μ). Then derive its c.d.f.:

G_Q(q) = P(Q \le q) = P\left(\frac{\lambda}{\lambda+\mu} \le q\right) = \mathrm{Area}_1
\quad\text{for a given } 0 < q < 1                                                                   (5A.9)

Now use the property that a Gamma(n', 1/b') random variable has moment-generating function (1 − t/b')^{-n'}; hence 2b'λ has m.g.f. (1 − 2t)^{-n'}, the m.g.f. of a chi-square distribution with 2n' degrees of freedom, so that 2b'λ ∼ χ²_{2n'} and, likewise, 2a'μ ∼ χ²_{2m'}. It then follows that

\frac{(2a'\mu)/2m'}{(2b'\lambda)/2n'} \sim \frac{\chi^2_{2m'}/2m'}{\chi^2_{2n'}/2n'} = F_{2m',2n'}    (5A.10)

which is the F distribution with numerator df_1 = 2m' and denominator df_2 = 2n'.
From (5A.9), by taking reciprocals of both sides and switching the inequality sign, we obtain

G_Q(q) = P\left(\frac{\lambda+\mu}{\lambda} \ge \frac{1}{q}\right)
       = P\left(1 + \frac{\mu}{\lambda} \ge \frac{1}{q}\right)
       = P\left(\frac{\mu}{\lambda} \ge \frac{1}{q} - 1\right)                                       (5A.11)

Multiplying both sides of the inequality in (5A.11) by (2a'/2m')/(2b'/2n'), one obtains

G_Q(q) = P\left[\frac{(2a'/2m')\mu}{(2b'/2n')\lambda} > \frac{a'n'}{b'm'}(q^{-1}-1)\right]
       = P\left[F_{2m',2n'} > C_1 = \frac{a'n'}{b'm'}(q^{-1}-1)\right] = \mathrm{Area}_2              (5A.12)

In other words, we obtain an equivalent Area_2 for the solution of P(F_{2m',2n'} > C_1) in (5A.12), instead of attempting to calculate the unknown Area_1 of equation (5A.9), whose distributional form is not known or recognized. That is, Area_1 = Area_2. Now that we have an accurate representation of the c.d.f. of Q [i.e., G_Q(q)], let's find its mathematical expression by equating Area_1 to Area_2:

G_Q(q) = 1 - G_{F_{2m',2n'}}\left[\frac{a'n'}{b'm'}(q^{-1}-1)\right]                                  (5A.13)

Note that Snedecor's F density is given by [12, p. 23]

f(F) = \frac{\Gamma[(m+n)/2]}{\Gamma(m/2)\,\Gamma(n/2)}\left(\frac{m}{n}\right)^{m/2}
       \frac{F^{(m-2)/2}}{[1+(m/n)F]^{(m+n)/2}}, \quad F > 0                                          (5A.14)

where

\mu = E(F) = \frac{n}{n-2} \quad \text{for } n > 2                                                    (5A.15)

and

\sigma^2 = \mathrm{Var}(F) = \frac{2n^2(m+n-2)}{m(n-2)^2(n-4)} \quad \text{for } n > 4                (5A.16)

Since (5A.13) is differentiable, using (5A.14) and differentiating with respect to q by the chain rule leads to (note that m' = m/2 and n' = n/2)

g_Q(q) = -g_{F_{2m',2n'}}\left[\frac{a'n'}{b'm'}(q^{-1}-1)\right]
         \left(-\frac{a'n'}{b'm'}\,\frac{1}{q^2}\right)
       = \frac{a'n'}{b'm'}\,\frac{1}{q^2}\,\frac{\Gamma(m'+n')}{\Gamma(m')\,\Gamma(n')}
         \left(\frac{m'}{n'}\right)^{m'}
         \frac{[(a'n'/b'm')(1/q-1)]^{m'-1}}{[1+(m'a'n'/n'b'm')(1/q-1)]^{m'+n'}}                       (5A.17)
Simplifying and rearranging through a number of intermediate steps yields

g_Q(q) = \frac{\Gamma(m'+n')}{\Gamma(m')\,\Gamma(n')}\left(\frac{a'}{b'}\right)^{m'}
         \frac{(1-q)^{m'-1}}{q^{m'+1}}\left\{\frac{b'q}{b'q+a'(1-q)}\right\}^{m'+n'}                  (5A.18)

g_Q(q) = \frac{\Gamma(m'+n')}{\Gamma(m')\,\Gamma(n')}\,a'^{m'}\,b'^{n'}\,
         \frac{(1-q)^{m'-1}\,q^{n'-1}}{[a'+q(b'-a')]^{m'+n'}}                                         (5A.19)

Resubstituting n' = a + c, m' = b + d, b' = ξ + x_T, and a' = η + y_T, we obtain

g_Q(q) = \frac{\Gamma(a+b+c+d)}{\Gamma(a+c)\,\Gamma(b+d)}\,(\eta+y_T)^{b+d}(\xi+x_T)^{a+c}\,
         \frac{(1-q)^{b+d-1}\,q^{a+c-1}}{[\eta+y_T+q(\xi+x_T-\eta-y_T)]^{a+b+c+d}}                    (5A.20)
which is the p.d.f. of the random variable 0 < Q < 1 as defined above for the underlying distributional assumptions.

APPENDIX 5B: DERIVATION OF THE BAYES ESTIMATOR FOR WEIGHTED SQUARED-ERROR LOSS

Given a weighted squared-error loss function for an unknown parameter θ and estimator t = T(x), where the sample data vector is x = x_1, x_2, ..., x_n > 0, θ > 0, and the weight function is w(θ),

L(\theta, t) = w(\theta)(\theta - t)^2                                                               (5B.1)

Assuming that the prior of θ is λ(θ), the joint density of the prior and f(x) is given by

f(x \mid \theta)\,\lambda(\theta)                                                                    (5B.2)

Then the conditional (posterior) distribution of θ given the vector x is

k(\theta \mid x) = \frac{f(x \mid \theta)\,\lambda(\theta)}
                        {\int f(x \mid \theta)\,\lambda(\theta)\,d\theta}                             (5B.3)

and the expected loss is

E[L(\theta, t)] = \int w(\theta)(\theta - t)^2\,k(\theta \mid x)\,d\theta                             (5B.4)

The Bayes solution is the minimum of the Bayes risk E[L(θ, t)], for which

\frac{dE[L(\theta, t)]}{dt} = -\int 2\,w(\theta)(\theta - t)\,k(\theta \mid x)\,d\theta = 0           (5B.5)

and

\int \theta\,w(\theta)\,k(\theta \mid x)\,d\theta = \int t\,w(\theta)\,k(\theta \mid x)\,d\theta      (5B.6)

Finally, we obtain the Bayes estimator for the weighted squared-error loss function:

T = T(x) = \frac{\int \theta\,w(\theta)\,k(\theta \mid x)\,d\theta}
                {\int w(\theta)\,k(\theta \mid x)\,d\theta}
         = \frac{E[\theta\,w(\theta) \mid X = x]}{E[w(\theta) \mid X = x]}                            (5B.7)

REFERENCES

1. M. Sahinoglu, Compound Poisson Software Reliability Model, IEEE Trans. Software Eng., 18(7), 624–630 (1992).
2. M. Sahinoglu and U. Can, Alternative Parameter Estimation Methods for the Compound Poisson Software Reliability Model with Clustered Failure Data, J. Software Test. Verification Reliab., 7(1), 35–57 (1997).
3. M. Sahinoglu, J. Deely, and S. Capar, Stochastic Bayes Measures to Compare Forecast Accuracy of Software Reliability Models, IEEE Trans. Reliab., 50(1), 92–97 (2001).
4. M. Sahinoglu, An Empirical Bayesian Stopping Rule in Testing and Verification of Behavioral Models, IEEE Trans. Instrum. Meas., 52(5), 1428–1443 (October 2003).
5. S. R. Das, M. Sudarma, M. H. Assaf, E. M. Petriu, W. Jone, K. Chakrabarty, and M. Sahinoglu, Parity Bit Signature in Response Data Compaction and Built-in Self-Testing of VLSI Circuits with Nonexhaustive Test Sets, IEEE Trans. Instrum. Meas., 52(5), 1363–1380 (October 2003).
6. S. R. Das, M. H. Assaf, E. M. Petriu, and M. Sahinoglu, Aliasing-free Compaction in Testing Cores-Based System-on-Chip (SOC) Using Compatibility of Response Data Outputs, Trans. Soc. Des. Process Sci., 8(1), 1–17 (March 2004).
7. S. R. Das, C. V. Ramamoorthy, M. H. Assaf, E. M. Petriu, W. B. Jone, and M. Sahinoglu, Revisiting Response Compaction in Full-Scan Circuits with Nonexhaustive Test Sets Using Concept of Sequence Characterization, IEEE Trans. Instrum. Meas., Special Issue on VLSI Testing, 54(5), 1662–1677 (October 2005).
8. S. R. Das, C. V. Ramamoorthy, M. H. Assaf, E. M. Petriu, W. B. Jone, and M. Sahinoglu, Fault Simulation and Response Compaction in Full-Scan Circuits Using HOPE, IEEE Trans. Instrum. Meas., 54(6), 2310–2328 (December 2005).
9. M. Xie, Software Reliability Models: A Selected Annotated Bibliography, J. Software Test. Verification Reliab., 3, 3–28 (1993).
10. M. Sahinoglu, Statistical Inference on the Reliability Performance Index for Electric Power Generation Systems, Ph.D. dissertation, Institute of Statistics, Texas A&M University, College Station, TX, 1981.
11. D. L. Libby, Multivariate Fixed State Utility Assessment, Ph.D. dissertation, University of Iowa, Iowa City, IA, 1981.
12. M. Sahinoglu, M. T. Longnecker, L. J. Ringer, C. Singh, and A. K. Ayoub, Probability Distribution Function for Generation Reliability Indices: Analytical Approach, IEEE Trans. Power Apparatus Syst., 102, 1486–1493 (1983).
13. D. L. Libby and M. R. Novick, Multivariate Generalized Beta-Distributions with Applications to Utility Assessment, J. Educ. Stat., 7(4), 271–294 (1982).
14. N. L. Johnson and S. Kotz, Continuous Univariate Distributions, Vol. 2, Wiley, New York, 1970.
15. N. L. Johnson, S. Kotz, and N. Balakrishnan, Continuous Univariate Distributions, Vol. 2, 2nd ed., Wiley, Hoboken, NJ, 1995.
16. T. Pham-Gia and Q. P. Duong, The Generalized Beta and F Distributions in Statistical Modeling, Math. Comput. Model., 13, 1613–1625 (1985).
17. R. V. Hogg and A. T. Craig, Introduction to Mathematical Statistics, 3rd ed., Macmillan, New York, 1970.
18. S. D. Silvey, Statistical Inference, 2nd ed., Chapman & Hall, London, 1975.
19. M. Sahinoglu and E. Chow, Empirical-Bayesian Availability Index of Safety and Time Critical Software Systems with Corrective Maintenance, Proceedings of the Pacific Rim International Symposium on Dependable Computing, Hong Kong, 1999, pp. 84–91.
20. M. Sahinoglu, Reliability Index Evaluations of Integrated Software Systems for Insufficient Software Failure and Recovery Data, Springer-Verlag, Lecture Notes, Proceedings of the First International Conference (ADVIS-2000), Izmir, Turkey, October 2000, pp. 25–27.
21. M. Sahinoglu and W. Munns, Availability Indices of a Software Network, Proceedings of the 9th Brazilian Symposium on Fault Tolerant Computing, Florianopolis, Brazil, March 2001, pp. 123–131.
22. F. M. Carrano and W. Savitch, Data Structures and Abstractions with Java, Prentice Hall, Upper Saddle River, NJ, 2003.
23. F. M. Carrano and J. P. Prichard, Data Abstraction and Problem Solving with C++, 3rd ed., Addison-Wesley, Reading, MA, 2002.
24. M. Sahinoglu, D. Libby, and S. R. Das, Measuring Availability Indices with Small Samples for Component and Network Reliability Using the Sahinoglu–Libby Probability Model, IEEE Trans. Instrum. Meas., 54(3), 1283–1295 (June 2005).
25. G. G. Roussas, A First Course in Statistics, Addison-Wesley, Reading, MA, 1973.
EXERCISES

To use the applications and data files, click on "ERDBC" in TWC-Solver on the CD-ROM.

5.1 Programming a Monte Carlo (Static) Simulation of a Communication Network for n = 5000 Runs. See the network in Figure E5.1(a). Our goal is to simulate the reliability calculations for a source–target reliability. Explore (a) the textbook example (s = 1, t = 7) in Figure E5.1(a) and (b) the telecom network (s = 1, t = 19) in Figures E5.1(b) and E5.2, with each of two data sets: (1) all components and links bear a reliability of 0.9; (2) all components have 0.9, but links are unity (= 1.0). (c) The component reliabilities will be simulated with respect to the Sahinoglu–Libby p.d.f., given the input data for each component. That is, draw a random deviate, q, from the SL p.d.f. with the given historical data, as for components 1 to 4 (do the network for each component once). Lay out a network composed of these new r_i = 1 − q_i obtained from the SL p.d.f. Announce it to be a passage (success)
FIGURE E5.1(a) Seven-node network with s = 1, t = 7.
FIGURE E5.1(b) 19-node telephony network with s = 1, t = 19.
if what you draw from the uniform random generator satisfies 0 < u_i < r_i; if not, the component is out. Then generate a new network using the SL p.d.f. Once you do the uniform generation, say with 1000 runs for the network destination, calculate the ratio of successful arrivals from source to target. Then change the network using the SL p.d.f. again. Use 100 networks, each of which needs 1000 runs for successful arrivals. Compute the overall simulation average. The links will be simulated with respect to a Bernoulli p.d.f. with P (the probability of being operative) = 0.9 taken as constant. Generate a Bernoulli random deviate (i.e., draw a u_i); if 0 < u_i < p, it is a hit (success) for the link. Therefore you can, for example, advance from 1 to 2. If link reliability is perfect, do not generate anything; simply advance to the neighboring component. The number of times out of n trials that you can advance from s = 1 to t = 7 with all successful hits gives the Monte Carlo simulation probability for this network. Try this first on the simpler network, whose result is given in Figure E5.1(a), then on Figure E5.1(b). A minimal code sketch is given below.
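The following Java sketch is one possible reading of this exercise, not the book's solution (adjacency is supplied as a boolean matrix, class and method names are ours, and Apache Commons Math is assumed for the gamma sampling). Component reliabilities for part (c) are drawn constructively as q = λ/(λ + μ), with λ and μ sampled from the posterior gammas (5A.7) and (5A.8):

import java.util.ArrayDeque;
import java.util.Random;
import org.apache.commons.math3.distribution.GammaDistribution;  // assumed dependency

public final class Exercise51Sketch {
    // One SL-distributed unavailability deviate q = lambda/(lambda + mu)
    static double slDeviate(double[] p, Random rng) {  // p = {a, b, xT, yT, c, xi, d, eta}
        double lambda = new GammaDistribution(p[0] + p[4], 1.0 / (p[2] + p[5])).sample();
        double mu     = new GammaDistribution(p[1] + p[6], 1.0 / (p[3] + p[7])).sample();
        return lambda / (lambda + mu);
    }

    // One replication: sample component and link states, then test s-t connectivity by DFS
    static boolean stConnected(boolean[][] adj, double[] r, double pLink,
                               int s, int t, Random rng) {
        int n = r.length;
        boolean[] up = new boolean[n];
        for (int i = 0; i < n; i++) up[i] = rng.nextDouble() < r[i];   // Bernoulli component draw
        boolean[][] linkUp = new boolean[n][n];
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)                            // each link sampled once
                if (adj[i][j])
                    linkUp[i][j] = linkUp[j][i] = (pLink >= 1.0) || rng.nextDouble() < pLink;
        if (!up[s] || !up[t]) return false;
        boolean[] seen = new boolean[n];
        ArrayDeque<Integer> stack = new ArrayDeque<>();
        stack.push(s);
        seen[s] = true;
        while (!stack.isEmpty()) {
            int v = stack.pop();
            if (v == t) return true;
            for (int w = 0; w < n; w++)
                if (adj[v][w] && linkUp[v][w] && up[w] && !seen[w]) {
                    seen[w] = true;
                    stack.push(w);
                }
        }
        return false;
    }

    // Part (c): 100 networks drawn from the SL p.d.f., 1000 runs each, overall average
    static double simulate(boolean[][] adj, double[][] slParams, double pLink,
                           int s, int t, long seed) {
        Random rng = new Random(seed);
        double total = 0.0;
        for (int net = 0; net < 100; net++) {
            double[] r = new double[adj.length];
            for (int i = 0; i < r.length; i++)
                r[i] = 1.0 - slDeviate(slParams[i], rng);              // r_i = 1 - q_i
            int hits = 0;
            for (int run = 0; run < 1000; run++)
                if (stConnected(adj, r, pLink, s, t, rng)) hits++;
            total += hits / 1000.0;
        }
        return total / 100.0;
    }
}

For variants (a) and (b), replace the SL draw with the fixed r_i = 0.9 and set pLink to 0.9 or 1.0, respectively.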
FIGURE E5.2 Monte Carlo simulation result of the 19-node 32-link telephony network for s = 1, t = 19 with 100,000 runs timed.
5.2 Programming a Discrete Event (Dynamic) Simulation of a Communication Network. (Use the same networks as in Exercise 5.1.) Method: You may choose to do this project with a discrete event simulation technique using the data set of Exercise 5.1(b), where perfect links exist. In that case, you need to assume the mean sojourn times for each state, such that λ = (mean sojourn time)^{-1}. For the sake of convenience, let us assume that an up-state probability of 0.9 for a component denotes that nine of ten time units are operating and one time unit is not. Therefore, the reciprocals of the means yield the rates λ (failure rate) = 1/9 and μ (repair rate) = 1/1. Thus, FOR = λ/(λ + μ) = (1/9)/[(1/9) + (1/1)] = (1/9)/(10/9) = 0.1, which checks. Using these rates of sojourn (stay) to generate sojourn times, or times to failure, from the negative exponential p.d.f., one may go from state to state. If both states work (coincide or convolve) at the same time, it is a successful connection. The number of times out of the number of trials that you can reach, say, from s = 1 to t = 19 yields the simulated probability of success. Use the input and output data in Table E5.2 for the SL p.d.f.

5.3 Using the input data in Table E5.2 (upper half), verify the results in Table E5.2 (lower half) for the 7-, 8-, and 10-node networks only.

5.4 Calculate the source–target reliability for the 19-node problem with link reliability 0.9 and 1.0 using the data in Table E5.2. First assume all units to have 0.9; then enter the component inputs to verify the eight results for the expected value E(r), R*, and R** as given in Table E5.2. Obtain Figures 5.3 to 5.6 for components 1 to 4.
TABLE E5.2 Inputs/Outputs for Exercises 5.2 to 5.4 for Simulating 7-, 8-, 10-, and 19-Node Networks

Input/Output Data                    Component 1   Component 2   Component 3   Component 4
a (no. failure events)               10            5             10            100
b (no. repair events)                10            5             10            100
xT (operating time)                  1000 h        25 h          1000 h        10,000 h
yT (repair time)                     111.11 h      5 h           111.11 h      1111.11 h
c (shape for λ)                      0.02          0.2           0.5           0.5
ξ (inverse scale)                    1             1             1             1
d (shape for μ)                      0.1           2             2             2
η (inverse scale)                    1             0.5           0.25          0.25

Case 1: single
R*                                   0.907325      0.882397      0.917820      0.902027
R**                                  0.906655      0.849595      0.906655      0.900715
E(r) = mean                          0.892568      0.852496      0.905617      0.900613
7-node, 0.9 (0.7999), link = 1.0     0.7850        0.7053        0.8110        0.801
8-node, 0.9 (0.8082), link = 1.0     0.7945        0.7213        0.8186        0.8093
10-node, 0.9 (0.7986), link = 1.0    0.7835        0.7016        0.8100        0.7998
19-node, 0.9 (0.7845), link = 1.0    0.7676        0.6765        0.7971        0.7858
19-node, 0.9 (0.7299), link = 0.9    0.7017        0.6053        0.7339        0.7215
5.4 Calculate the source–target reliability for the 19-node problem with link reliabilities 0.9 and 1.0, using the data in Table E5.2. First assume all units to have 0.9; then enter the component inputs to verify the results for the expected value E(r), R∗, and R∗∗ as given in Table E5.2. Obtain Figures 5.3 to 5.6 for components 1 to 4.

5.5 Derive a new multivariate SL probability density function, including a derated state in addition to the up and down states, whose derivation is given in Appendix 5A.
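For orientation, here is a minimal Java sketch of the static Monte Carlo procedure of Exercise 5.1. The adjacency list and the single constant component reliability are hypothetical placeholders (they are not the actual Figure E5.1(a) topology, and the SL-sampled reliabilities of part (c) would replace the constant R in an outer replication loop); the class and method names are likewise illustrative, not part of TWC-Solver.

import java.util.*;

/** Static Monte Carlo source-target reliability (Exercise 5.1 sketch).
 *  Topology and reliabilities are hypothetical placeholders. */
public class StaticNetworkMC {
    // Assumed adjacency for a 7-node network; links taken as perfect here.
    static final int[][] ADJ = {
        {}, {2, 3}, {4}, {5, 6}, {7}, {7}, {7}, {}   // index 0 unused
    };
    static final double R = 0.9;                     // every component at 0.9

    public static void main(String[] args) {
        Random rng = new Random();
        int n = 5000, hits = 0;
        for (int t = 0; t < n; t++) {
            boolean[] up = new boolean[8];
            for (int i = 1; i <= 7; i++)
                up[i] = rng.nextDouble() < R;        // uniform draw per node: passage or out
            if (connected(1, 7, up)) hits++;         // successful s-t arrival
        }
        System.out.println("Simulated R(s=1,t=7) = " + (double) hits / n);
    }

    /** Depth-first search restricted to nodes that drew "up" in this trial. */
    static boolean connected(int s, int t, boolean[] up) {
        if (!up[s] || !up[t]) return false;
        Deque<Integer> stack = new ArrayDeque<>();
        boolean[] seen = new boolean[up.length];
        stack.push(s); seen[s] = true;
        while (!stack.isEmpty()) {
            int v = stack.pop();
            if (v == t) return true;
            for (int w : ADJ[v])
                if (up[w] && !seen[w]) { seen[w] = true; stack.push(w); }
        }
        return false;
    }
}

For the discrete-event variant of Exercise 5.2, the per-node Bernoulli draw would be replaced by alternating exponential up and down sojourn times with the rates λ = 1/9 and μ = 1 derived above.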
Don't look on anyone as worthless, no one is worthless;
It's not fair to seek people's defects and deficiencies.
Don't look down on anyone, never break a heart;
The mystic must love all seventy-two nations the same.
—Yunus Emre, the legendary mystic folk poet (1238–1320)
6 RELIABILITY BLOCK DIAGRAMMING IN COMPLEX SYSTEMS

Nutshell 6.0  A large amount of work is in progress on reliability block diagramming techniques. Another body of dynamic research concerns digital testing of embedded systems with VLSI (very large scale integrated) circuits. Every embedded system, whether simple or complex, can be decomposed into components (blocks) and interconnections or transmissions (links) within a source (ingress) to target (egress) topology. Four tools are proposed in this study. The first tool, using a novel compression algorithm, is capable of reducing any complicated series–parallel (not complex) system to a visibly easy sequence of series and parallel blocks in a reliability block diagram by finding all existing paths, compressing all redundant component duplications algorithmically, calculating an exact reliability, and creating an encoding of the topology. A second tool decodes and retrieves an already coded source–target dependency relationship, using postfix notation, for parallel–series or complex systems. A third tool is an approximate, fast upper-bound source–target reliability computing algorithm designed for parallel–series systems; assisted by the Polish encoding approach, it performs state enumeration in a hybrid form on nonsimple or complex systems to compute the exact source–target reliability. Various examples illustrate how these tools work satisfactorily in harmony. As the fourth tool, challenging overlap and multistate system reliability methods are presented algorithmically at the final stage, reducing the computation time considerably with no compromise in exact accuracy.
6.1 INTRODUCTION AND MOTIVATION

Reliability block diagramming (RBD) has been an active area of research for decades, even more so now with the advent of embedded systems [1–11]. In this chapter we describe and compute the source–target reliability in such embedded systems through an RBD approach. In doing so, we aim to integrate the preceding chapters on component analysis within a system concept [6,9,20,32]. It is assumed that the input data required by the RBD approach, in the form of static reliability or availability and including the security aspect of each component and link, are supplied correctly by improving VLSI testing techniques [26–31]. Earlier, simple or complicated parallel–series systems were studied to demonstrate that these networks can be encoded using a modified Polish notation employing postfixes [12,17,19–22]. Through a user-friendly and graphical Java application, the compression algorithm computes the reliability of any parallel–series network, no matter how large or complicated. Furthermore, the encoded topology can be transmitted remotely and then reverse-coded to reconstruct the original network diagram, for purposes of securing classified information and saving space.

Interest in considering reliability during the design of computer communications networks with a large number of nodes and connecting links, such as those found in hospitals, universities, electricity distribution, gas pipelines, the military, or the Internet, has increased in recent years. Due to geographical and physical constraints in such critical systems, designers at the initial or improvement stages usually base their decisions on approximate or upper-bound estimates to compute a given ingress (source) to egress (target) reliability. This practice may be deceptive, erroneous, and overly optimistic, owing to computational complexity, while reliability remains of crucial importance in terms of human life, safety, and health. System reliability can be defined as the probability that a system, with all its subsystems and constituent components, will complete successfully the task it is intended to perform under the conditions encountered for the period of time specified. System reliability analysis is the process of quantifying a system's ingress–egress serviceability by examining the dependency relationships between the components that comprise the system. Reliability analysis is essential whenever the probability or cost of failure is high. Modeling allows analysts to determine weak spots in systems so that a maintenance engineer can inventory a backup list of components.

In network computing, reliability analysis focuses on the computer network components and the connections between them to determine overall system reliability as well as the reliabilities between individual nodes in the network. Network reliability computations are similar to those developed for industrial applications, with a few exceptions. In industrial applications, all of the components in the system are considered critical to the overall functioning of the system. In network applications, however, the communication between two nodes may use only a select few components in the system, ignoring other noncritical nodes.
Currently, published educational materials cover methods for determining system reliabilities in networks that can be expressed as pure parallel–series systems or, by a substitution method, predominantly parallel–series systems except for several or more nodes that provide feedback. But as experience proves, these ready-to-cook networks rarely occur outside textbooks. The classical computations prove impossible or mathematically unwieldy when applied to real complex networks and are therefore useful only for teaching basic reliability concepts. The graphical screening ease and convenience of RBD algorithms is advantageous for planners and designers trying to improve system reliability, allowing the quick and efficient intervention that may be required at a dispatch center to observe routine operations and/or identify solution alternatives in case of a crisis. The Boolean decomposition and binary algorithms [13–16] are outside the scope of this work, although they are used in the present chapter to illustrate a new hybrid solution with the Polish notation. The algorithm proposed, through a user-friendly and graphical Java applet, computes the reliability of any complex parallel–series network. Furthermore, the coded topology can be transmitted remotely and then reverse-engineered to reconstruct the original network diagram for purposes of securing classified information and saving space. This, too, can be applied to security-related input for wired or wireless systems. All current exact computational algorithms for general networks are based on enumeration of states, minpaths, or mincuts [2,3]. Network reliability estimation has been applied successfully to small networks using neural networks and heuristic algorithms [7,8], as well as by a concurrent error detection approach in the coauthor's earlier research [18]. Other researchers have used Monte Carlo simulation [4,5]. Bounds such as Jan's upper bound, used to reduce the complexity of computations, are approximate [3]. A thorough analysis is given in [1]. Finally, the overlap algorithm is presented [24,33] for complex systems.

6.2 SIMPLE ILLUSTRATIVE EXAMPLE

In this chapter, some parallel–series examples are used to illustrate the algorithms proposed. As an example of this method, the Java applet in Figure 6.1 examines a slightly modified Example 6.3 of Figure 6.4 given on pp. 106–107 of Reliability Modelling by Linda Wolstenholme, published in 1999 [11]. The node failure probabilities are all q = 0.1, except for q1 = q3 = 2q = 0.2. Note that for simplicity, links have zero failure probability. Let T denote a tie set. If q = failure probability for all components, then P(T1 ∪ T2 ∪ T3) = P(164) + P(1234) + P(1254) − P(12364) − P(12654) − P(12354) + P(123654) = (1 − 2q)(1 − 2q + q²)(1 + q − 2q³) = (0.8)(0.81)(1.098) = 0.711504, which can be observed in entry {1, 4} of the solution matrix in Figure 6.1. The method used by Wolstenholme, exact reliability block diagram calculation (ERBDC), is an exact calculation of source–target reliability, yet one that remains tractable for large networks. The ingress–egress relationship is also tabulated in Figure 6.1 by the Polish, or postfix, notation used by Sahinoglu et al. in 2003 [6,9,12,32], where the postfixes * and + denote two-at-a-time series and parallel components, respectively.
FIGURE 6.1 RBD and matrix of source–target reliabilities of the network from reference 11.
The upper bound of system reliability is calculated by treating the three paths or tie sets in parallel [i.e., reliability upper bound = 1 − (1 − 0.648)(1 − 0.5184)(1 − 0.5832) = 0.92934], as shown in Figure 6.1. However, when the number of components grows much larger, it becomes tedious to arrange the network into a neat sequence of series and parallel subsystems. The compression algorithm does exactly that, and it also calculates the exact source–target reliability for simple parallel–series (not complex) but complicated-looking systems, as we show in Section 6.3.

6.3 COMPRESSION ALGORITHM AND VARIOUS APPLICATIONS

The algorithm to facilitate a simpler way to compute the source–target reliability is as follows. In a parallel set composed of paths i, j, k, l, m, . . . , at each item i, compress with each following item j. If i can combine with j, do so, and remove j; if not, keep it and compress again with the next, kth, path until all of the parallel paths have been exhausted. At the end, there remains a single compressed path RBD from the ingress to the egress node. A line between two nodes is treated as a series component between the two, as in Figure 6.1: line −1 connecting nodes 1 and 2 is expressed as 1, −1, *, 2, *. Two components in parallel are designated as 1, 2, + [6,12,17,19–22].
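Because the postfix encoding is stack-oriented, a short evaluator suffices to turn a parallel–series Polish string into a number. The sketch below assumes a simplified token format in which component reliabilities appear directly as numbers, so it illustrates the convention rather than reproducing the book's Java applet, whose tokens are node labels.

import java.util.ArrayDeque;
import java.util.Deque;

/** Evaluates a postfix (Polish) reliability expression such as "0.9 0.9 * 0.9 +".
 *  Tokens are assumed to be reliabilities or the operators * (series) and
 *  + (parallel); a sketch only, not the TWC-Solver encoding. */
public class PostfixReliability {
    static double evaluate(String expr) {
        Deque<Double> stack = new ArrayDeque<>();
        for (String tok : expr.trim().split("\\s+")) {
            switch (tok) {
                case "*": {                        // series: both must work
                    double b = stack.pop(), a = stack.pop();
                    stack.push(a * b);
                    break;
                }
                case "+": {                        // active parallel: at least one works
                    double b = stack.pop(), a = stack.pop();
                    stack.push(1.0 - (1.0 - a) * (1.0 - b));
                    break;
                }
                default:
                    stack.push(Double.parseDouble(tok));
            }
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        // The three tie-set reliabilities of Figure 6.1 treated in parallel.
        System.out.println(evaluate("0.648 0.5184 + 0.5832 +")); // ~0.92934 upper bound
    }
}

Evaluating the three tie sets this way reproduces the 0.92934 upper bound above; a single series path such as 0.9 0.9 * 0.9 * gives 0.729.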
FIGURE 6.2 Source–target reliability (s = 1, t = 8) for a parallel–series network, RBD.
BEGIN RBD

STEP 0: 1, 2, *, 5, *, 8, * to merge with 1, 3, *, 5, *, 8, *

STEP 1:
      | 2 |
 1 ---|   |--- 5 ------ 8        to merge with 1, 4, *, 6, *, 8, *
      | 3 |

STEP 2:
      | 2 |
 1 ---|   |--- 5 ------|
      | 3 |            |--- 8    to merge with 1, 4, *, 7, *, 8, *
      | 4 |--- 6 ------|

STEP 3:
      | 2 |
 1 ---|   |--- 5 ------|
      | 3 |            |--- 8
      |   |    | 6 |   |
      | 4 |----|   |---|
               | 7 |

Final tableau – END RBD
FIGURE 6.3 RBD compression algorithm for a simple parallel–series network.
Let's take the parallel–series example shown in Figure 6.2. The + postfixes at the end of each serial path denote that those paths will be combined in parallel to calculate the upper bound; otherwise, each path has all of its components connected in series, denoted by a succession of * postfixes. No more than two consecutive components are allowed for each postfix. We take each path in ascending or descending sequence, as convenient, to compare and contrast. The compression is executed as many times as there are paths (Figure 6.3).
FIGURE 6.4 Topology for Ding-Dong1 as reduced to the simpler network at the bottom.
Analytical Results: Ingress Node: 1, Egress Node: 13

Without transmission (link) reliabilities, exact reliability: 0.7938938796628517
Polish notation: 1, 2, 4, *, 5, *, 6, *, 7, *, 8, 9, 10, *, +, *, 11, *, 3, 14, *, 15, *, 16, *, 17, 18, +, *, 19, *, +, 12, +, *, 13, *

Path   Reliability   Polish Notation
1      0.38742048    1, 2, *, 4, *, 5, *, 6, *, 7, *, 8, *, 11, *, 13, *
2      0.34867844    1, 2, *, 4, *, 5, *, 6, *, 7, *, 9, *, 10, *, 11, *, 13, *, +
3      0.43046721    1, 3, *, 14, *, 15, *, 16, *, 17, *, 19, *, 13, *, +
4      0.43046721    1, 3, *, 14, *, 15, *, 16, *, 18, *, 19, *, 13, *, +
5      0.72900000    1, 12, *, 13, *

With transmission (link) reliabilities, exact reliability: 0.5513297153873634
Polish notation: 1, −1, 2, *, −2, *, 4, *, −3, *, 5, *, −4, *, 6, *, −5, *, 7, *, −6, 8, *, −10, *, −7, 9, *, −8, *, 10, *, −9, *, +, *, 11, *, −21, *, −11, 3, *, −12, *, 14, *, −13, *, 15, *, −14, *, 16, *, −15, 17, *, −17, *, −16, 18, *, −18, *, +, *, 19, *, −22, *, +, −19, 12, *, −20, *, +, *, 13, *
FIGURE 6.5 Exact reliabilities for the parallel–series Ding-Dong1 network.
The complicated, otherwise intractable parallel–series Ding-Dong1 network shown in Figure 6.4 will be studied next. Figure 6.4 depicts a simulated LAN operation consisting of 19 nodes and 22 links; all nodes have a reliability of 0.9 and all links a reliability of 0.8. Note that the lines are assigned negative prefixes, and that s = 1 and t = 13 are the ingress (source) and egress (target) nodes, respectively. The network can be translated into a Polish (dependency) notation, as in Figure 6.5, to calculate the source–target reliability. The algorithm offers a user-friendly graphical interface, speed, and accuracy, especially in the event of imperfect links, and supports secure transport over the net through the reverse-engineering process presented in Section 6.6. Another application of the compression algorithm, to an 11-node simple parallel–series network, is shown in Figure 6.6. Source–target reliability analysis using the new Polish notation approach is presented below. Considering path reliability, the paths or tie sets are as shown in Figure 6.7 when all links are assumed to operate with full reliability of 1. The negative digits denote node connections or links. The + signs at the end of each serial path denote that those paths will be combined in parallel to calculate the upper bound.
FIGURE 6.6 Eleven-node simple parallel–series network with s = 5, t = 9.
Without links:

Path   Reliability   Polish Notation
1      0.72900000    5, 1, *, 9, *
2      0.65610000    5, 1, *, 10, *, 9, *, +
3      0.47829690    5, 2, *, 3, *, 4, *, 11, *, 1, *, 9, *, +
4      0.43046721    5, 2, *, 3, *, 4, *, 11, *, 1, *, 10, *, 9, *, +
FIGURE 6.7 System reliability and path reliabilities when links are assumed to be perfect in Figure 6.6.
5, 1, *, 9, * merges with 5, 1, *, 10, *, 9, * to give:

5 --- 1 ------------------ 9
       \------ 10 -------/

to merge with 5, 2, *, 3, *, 4, *, 11, *, 1, *, 9, *, converging to:

5 ------------------------ 1 ------------------ 9
 \-- 2 --- 3 --- 4 --- 11 -/ \------ 10 ------/

which will finally merge with 5, 2, *, 3, *, 4, *, 11, *, 1, *, 10, *, 9, * to finally converge to (note the redundancy):

5 ------------------------ 1 ------------------ 9
 \-- 2 --- 3 --- 4 --- 11 -/ \------ 10 ------/
FIGURE 6.8 Compression algorithm on the simple parallel–series 11-node network in Figure 6.6.
Otherwise, each path has all its components connected in series, denoted by a succession of * postfixes. One takes each path in ascending or descending sequence, as convenient, to compare and contrast, as shown in Figure 6.8.
With links:

Path   Reliability   Polish Notation
1      0.72900000    5, −1, *, 1, *, −11, *, 9, *
2      0.65610000    5, −1, *, 1, *, −13, *, 10, *, −12, *, 9, *, +
3      0.47829690    5, −2, *, 2, *, −3, *, 3, *, −4, *, 4, *, −5, *, 11, *, −6, *, 1, *, −11, *, 9, *, +
4      0.43046721    5, −2, *, 2, *, −3, *, 3, *, −4, *, 4, *, −5, *, 11, *, −6, *, 1, *, −13, *, 10, *, −12, *, 9, *, +

FIGURE 6.9 Same as Figure 6.7 for path reliabilities, with links included.
Since the straight line (with a reliability of unity, as assumed) between nodes 5 and 1 and between nodes 1 and 9 dominates, the rest of the branches are ineffective. Therefore, the system reliability is the series connection of the three nodes 5, 1, and 9 (i.e., 5, 1, *, 9, * in Polish notation), giving 0.9³ = 0.729. If links have nonunity reliability, the links have to be multiplied in series as well; when the links can fail, operating with nonunity reliability, it is a different scenario. Figure 6.9 is utilized to compose the finalized single-path RBD and calculate the exact reliability. The algorithm works as follows:

1. Take paths 4 and 3 in reverse order, usually choosing from longest to shortest. Enumerate the common elements, shown in series, in the center, and branch out the legs that are not common, shown in parallel, so as to enable all paths from the source (5) to the target (9). That is, merge

5 −2 2 −3 3 −4 4 −5 11 −6 1 −13 10 −12 9, with
5 −2 2 −3 3 −4 4 −5 11 −6 1 −11 9, to converge to

5 −2 2 −3 3 −4 4 −5 11 −6 1 [ −13 10 −12 | −11 ] 9    (1)

where the bracketed legs are in parallel.

2. Then take the next path backward, 5 −1 1 −13 10 −12 9, and merge it with the RBD in step 1 by following the same rule of thumb, to converge to

5 [ −2 2 −3 3 −4 4 −5 11 −6 | −1 ] 1 [ −13 10 −12 | −11 ] 9    (2)

so as to enable passage for all paths from 5 to 9.

3. Finally, take the last path in reverse order,

5 −1 1 −11 9,    (3)

to merge with the RBD in step 2; it converges to the same tableau as above, since this path already exists in step 2:

5 [ −2 2 −3 3 −4 4 −5 11 −6 | −1 ] 1 [ −13 10 −12 | −11 ] 9    (4)

The final tableau can be expressed as the following reliability block diagram:

        |-- -2 - 2 - -3 - 3 - -4 - 4 - -5 - 11 - -6 --|       |-- -13 - 10 - -12 --|
5 ------|                                             |-- 1 --|                    |--- 9    (5)
        |--------------------- -1 --------------------|       |-------- -11 -------|
This results in an exact single-path reliability of 0.729, with the complete Polish notation 5, −1, −2, 2, *, −3, *, 3, *, −4, *, 4, *, −5, *, 11, *, −6, *, +, 1, *, −11, −13, 10, *, −12, *, +, *, *, 9, *, which describes the relationship above in the final RBD.

6.4 HYBRID TOOL TO COMPUTE RELIABILITY FOR COMPLEX SYSTEMS

Let's take the following non-parallel–series network, whose source–target reliability is known from Boolean decomposition to be 0.799, as in Figures E5.1(a) and 6.10 [23]. With the Boolean decomposition (keystone) method, the decomposed diagrams are twofold: either node 3 is out, or node 3 is shorted, with 1 in series with the parallel pair 5 and 6, and all in series with 7. The system reliability is computed as follows:

1. Node 3 bad: R(system | 3 bad) = (0.9)[1 − (1 − 0.81)(1 − 0.81)](0.9) = (0.9639)(0.81) = 0.7806.
2. Node 3 good: R(system | 3 good) = (0.9)[1 − (1 − 0.9)(1 − 0.9)](0.9) = (0.99)(0.81) = 0.8019.
FIGURE 6.10 RBD for seven-node network shows an exact source–target reliability for a non-parallel–series network with both hybrid and Boolean decomposition. R(s = 1, t = 7) = 0.7997.
3. Result: R(system) = R(3 bad) R(system | 3 bad) + R(3 good) R(system | 3 good) = (0.1)(0.7806) + (0.9)(0.8019) = 0.799.

The system reliability is also calculated to be 0.799 by the hybrid Polish-notated enumeration. After the Polish-notated paths are found, the remaining fictitious nodes are created to facilitate an enumeration approach. The 100+ node labels symbolize nonexistent bad nodes that denote the complement of a component [e.g., R(105) = 1 − R(5)]. This hybrid method is fast, for it avoids recalculating the guaranteed paths and computes only the probabilities of the remaining nodes' enumerated combinations; the technique also avoids repeating identical combination paths. Instead of 36 paths (2³ = 8 for each of the four 4-tuples and 2 for each of the two 6-tuples), we use only 18 paths, thus saving 50%. Otherwise, the enumeration needs 2⁷ = 128 paths. The exact reliability using Boolean decomposition with identical nodes is R²(4R² − 3R³ − R⁴ + R⁵), and the fast upper bound (FUB) employing the compression technique studied above and in references 12 and 19 gives R²(4R² − R³ − 5R⁴ + 2R⁶ − R⁷ + 2R⁵). The theoretical difference between the Boolean result and the FUB is R²(2R³ − 4R⁴ + R⁵ + 2R⁶ − R⁷) = (0.81)[(2)(0.729) − (4)(0.656) + 0.59 + (2)(0.531) − 0.48] = 0.007032.
Figure 6.11 shows that the difference (fast upper bound − hybrid form) = 0.806812 − 0.799780 = 0.007032, as expected from the theoretical difference. We compare the FUB method's result with that of the hybrid form by using the 10-node example in Figure 6.14, where the fast upper-bound reliability = 0.808879873 and the hybrid method reliability = 0.798590485; the difference = 0.8088798 − 0.7985905 ≈ 0.01. Note that Boolean decomposition becomes intractable when the networks get larger and more complex. This is why the hybrid form will replace the tedious Boolean decomposition method, giving identical results.
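To make the keystone arithmetic concrete, the following fragment reproduces steps 1 to 3 numerically; the two conditional reliabilities are hard-coded from the decomposed diagrams above, so this is a check of the example rather than a general decomposition routine.

/** Boolean (keystone) decomposition on node 3 of the seven-node network:
 *  R(sys) = q3 * R(sys | 3 bad) + r3 * R(sys | 3 good).  Sketch only. */
public class KeystoneDecomposition {
    public static void main(String[] args) {
        double r = 0.9, q3 = 0.1, r3 = 0.9;
        // Node 3 bad: two series branches (0.81 each) in parallel, 1 and 7 in series.
        double givenBad  = r * (1 - (1 - r * r) * (1 - r * r)) * r;   // ~0.7806
        // Node 3 good (shorted): 5 and 6 in simple parallel, 1 and 7 in series.
        double givenGood = r * (1 - (1 - r) * (1 - r)) * r;           // 0.8019
        double rsys = q3 * givenBad + r3 * givenGood;
        System.out.printf("R(s=1,t=7) = %.4f%n", rsys);               // ~0.799
    }
}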
FIGURE 6.11 Compression algorithm solution (FUB) for the 7-node complex network in Figure 6.10.
FIGURE 6.12 RBD for an eight-node network for s = 1, t = 8 (all node reliabilities 0.9, all link reliabilities 1.0).
6.5 MORE SUPPORTING EXAMPLES FOR THE HYBRID FORM

See the eight-node starlike example given in Figure 6.12. We compare the approximate FUB method's result with that of the exact hybrid form; the 10-node example in Figure 6.14 then follows. The comparative results for the hybrid and FUB methods are given in Figures 6.13 and 6.15. The larger the networks get, the smaller the differences between the FUB and the hybrid.
6.6 NEW POLISH DECODING (DECOMPRESSION) ALGORITHM

The objective is to generate a reverse-coded reliability block diagram from the Polish notation and so recreate the original topology produced by the RBD compression algorithm. The platform used is Java. The diagram helps in viewing complex network paths from an ingress to an egress node, and it ultimately calculates the system reliability for parallel–series reducible networks.
Ingress Node: 1, Egress Node: 8; Network Reliability (Fast Upper Bound) Method = 0.80895. (Perfect Links) Polish Notation: 1, −1, 2, *, −4, 5, *, −6, 3, *, −7, *, 7, *, −9, 4, *, −8, *, 6, *, − 11, *, −12, +, *, −10, +, *, −5, 6, *, −8, 4, *, −9, *, 7, *, −7, 3, *, −6, *, 5, *, −10, *, −12, +, *, − 11, +, *, +, *, −2, 3, *, −6, 5, *, −4, 2, *, −5, *, 6, *, −8, 4, *, −9, *, 7, *, −12, *, −11, +, *, −10, +, *, −7, 7, *, −9, 4, *, −8, *, 6, *, −5, 2, *, −4, *, 5, *, −10, *, −11, +, *, −12, +, *, +, *, +, −3, 4, *, −8, 6, *, −5, 2, *, −4, *, 5, *, −6, 3, *, −7, *, 7, *, −12, *, −10, +, *, −11, +, *, −9, 7, *, −7, 3, *, − 6, *, 5, *, −4, 2, *, −5, *, 6, *, −11, *, −10, +, *, −12, +, *, +, *, +, *, 8 * Exact hybrid (8-Nodes) = 0.80818398, Difference -(FUB fast upper bound – hybrid form) = 0.80895 − 0.80818 = 0.00077
FIGURE 6.13 Exact source–target reliability for the complex eight-node network in Figure 6.12.
FIGURE 6.14 RBD for the 10-node network for s = 1, t = 10.
Ingress Node: 1, Egress Node: 10; FUB (fast upper bound) Method = 0.80887 (Perfect Links). Polish Notation: 1, −1, 2, *, −2, 5, *, −8, 3, *, −7, *, 7, *, −5, 4, *, −6, *, 6, *, −11, 8, *, −15, *, − 12, 9, *, −14, *, +, *, −13, 9, *, −12, 6, *, −11, *, 8, *, −15, *, −14, +, *, +, *, −10, 8, *, −11, 6, *, −6, 4, *, −5, *, 7, *, −13, *, −12, +, *, 9, *, −14, *, −15, +, *, +, *, −9, 6, *, −6, 4, *, −5, *, 7, *, −7, 3, *, −8, *, 5, *, −10, *, 8, *, −15, *, −13, 9, *, −14, *, +, *, −11, 8, *, −10, 5, *, −8, *, 3, *, −7, *, 7, *, −13, *, 9,*, −14, *, −15, +, *, +, −12, 9, *, −13, 7, *, −7, *, 3, *, −8, *, 5, *, −10, *, 8, *, −15, *, −14, +, *, +, *, +, *, −3, 3, *, −7, 7, *, −5, 4, *, −6, *, 6, *, −9, 2, *, −2, *, 5, *, −10, *, −11, +, 8, *, −15, *, −12, 9, *, −14, *, +, *, −13, 9, *, −12, 6, *, −9, 2, *, −2, *, 5, *, −10, *, −11, +, *, 8, *, −15, *, −14, +, *, +, *, −8, 5, *, −2, 2, *, −9, *, 6, *, −6, 4, *, −5, *, 7, *, −13, *, −12, +, 9, *, −14, *, −11, 8, *, −15, *, +, *, −10, 8, *, −11, 6, *, −6, 4, *, −5, *, 7, *, −13, *, −12, +, *, 9, *, −14, *, − 15, +, *, +, *, +, *, +, −4, 4, *, −5, 7, *, −7, 3, *, −8, *, 5, *, −2, 2, *, −9, *, 6, *, −11, 8, *, −15, *, −12, 9, *, −14, *, +, *, −10, 8, *, −11, 6, *, −12, *, 9, *, −14, *, −15, +, *, +, *, −13, 9, *, −12, 6, *, −9, 2, *, −2, *, 5, *, −10, *, −11, +, *, 8, *, −15, *, −14, +, *, +, *, −6, 6, *, −9, 2, *, −2, *, 5, *, −8, 3, *, −7, *, 7, *, −13, *, 9, *, −14, *, −10, 8, *, −15, *, +, *, −11, 8, *, −10, 5, *, −8, *, 3, *, −7, *, 7, *, −13, *, 9, *, −14, *, −15, +, *, +, −12, 9, *, −13, 7, *, −7, *, 3, *, −8, *, 5, *, −10, *, 8, *, −15, *, −14, +, *, +, *, +, *, +, *, 10, * Exact hybrid (10-Nodes) = 0.79859; Difference (FUB fast upper bound – hybrid form) = 0.80887 − 0.79859 = 0.01
FIGURE 6.15 Exact source–target reliability for the complex 10-node network in Figure 6.14.
The following is the approach taken to recreate the RBD from a given Polish notation:

1. Accept the Polish notation from the user. The Polish notation consists of nodes (numbers) and operators (* or +).
2. Parse the Polish notation to identify the nodes and operations.
3. Identify the node pairs that connect. Use the existing Java components and the identified node pairs to draw the RBD.

A stack algorithm was employed to accomplish the above. The algorithm accepts the Polish notation and parses it using Java's StringTokenizer. To identify the node pairs that connect, the following logic was incorporated (a condensed sketch follows the list):

1. Push into the stack until an operator is encountered.
2. If the operator is a * (nodes in series):
   a. Pop the top two elements (nodes) of the stack.
   b. Form a node pair.
   c. Concatenate the nodes and node pairs.
   d. Push the concatenated string onto the top of the stack.
3. If the operator is a + (nodes in parallel):
   a. Pop the top two elements (nodes) of the stack.
   b. Concatenate the operator between the two nodes.
   c. Push the concatenated string onto the top of the stack.
4. Continue performing the foregoing steps until the end of the Polish notation.
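A condensed, hypothetical sketch of such a decoding stack is shown below; it assumes a whitespace-separated token layout, uses the first path of Figure 6.2 as sample input, and only prints the recovered series node pairs, leaving out the graphical FC oval and FC line rendering.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.StringTokenizer;

/** Recovers series connections from a Polish (postfix) topology string.
 *  Sketch under an assumed token layout: numbers are node labels,
 *  * chains two subexpressions in series, + places them in parallel. */
public class PolishDecoder {
    public static void main(String[] args) {
        String notation = "1 2 * 5 * 8 *";            // hypothetical sample input
        Deque<String> stack = new ArrayDeque<>();
        StringTokenizer st = new StringTokenizer(notation);
        while (st.hasMoreTokens()) {
            String tok = st.nextToken();
            if (tok.equals("*")) {                     // series: emit a node pair
                String b = stack.pop(), a = stack.pop();
                System.out.println("link " + last(a) + " -- " + first(b));
                stack.push(a + " " + b);               // concatenated working string
            } else if (tok.equals("+")) {              // parallel: record alternatives
                String b = stack.pop(), a = stack.pop();
                stack.push(a + " + " + b);
            } else {
                stack.push(tok);                       // a node label
            }
        }
        System.out.println("decoded expression: " + stack.pop());
    }
    static String first(String s) { return s.split("\\s+")[0]; }
    static String last(String s)  { String[] p = s.split("\\s+"); return p[p.length - 1]; }
}

Run on the sample string, the sketch prints the pairs 1–2, 2–5, and 5–8, which is exactly the series chain a drawing layer would render.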
After the node pairs are identified, the graphical Java components FC oval (nodes) and FC line (transmissions or connecting links) display the network. Networks utilizing links were deployed using the same algorithmic process. Negative digits, which designate transmission lines, are first represented as nodes. Once the initial diagram has been generated, a second process removes the oval object representing such a node, leaving the negative node name as the transmission line. The smallest node number is the ingress node; the largest is the egress node. A more complex non-parallel–series commercial telephony network (with 19 nodes and 32 links), whose Polish notation was previously coded, is reverse-coded or decoded in Figure 6.16 to reconstruct the original topology. Note that the hard-to-read "Polish Notation" box is a page-long postfix notation obtained previously and inserted using the compression algorithm. Although the Polish notation cannot calculate the exact source–target reliability for non-parallel–series networks (for which a specific hybrid technique is demonstrated in the preceding sections), it can successfully encode and decode any non-parallel–series or simple network for secure and economical transport. The Polish notation approach also prepares a base for calculating the exact reliability of any complex system by a hybrid enumeration approach. Figure 6.16 denotes all nodes and links invariably with a sample reliability of 0.90 for placeholding, and s = 1, t = 19. However, these postscripts (Polish notation) do not carry information on the node and link reliabilities. Therefore, converting the topology with all the attached input data into an XML file is an alternative solution, and this has been done on the CD-ROM.
FIGURE 6.16 Decoding (reverse engineering Polish notation or decompression) for 19-node network.
Exporting and then importing the same XML file as a means of transport will be efficient but not necessarily secure, as the XML files can be opened. If the topology is of prime interest, the decoding algorithm is of value for transporting very complex networks safely and discreetly, with extremely complicated Polish notations that are difficult to decipher, as shown in Figure 6.16.

6.7 OVERLAP TECHNIQUE

When we observe networks comprising large parallel–series systems, we break the system down into simple parallel–series subsystems. There are some networks for which we can achieve this, and others for which we cannot [24,33].

6.7.1 Overlap Ingress–Egress Reliability Method

Take the complex (non-parallel–series) system shown in Figure 6.17. Let subsystem A comprise components 1, 4, and 5, and subsystem B components 2, 3, and 6. The reason we cannot decompose this system into a purely parallel–series topology is that there is redundant (or surplus) feedback between subsystems A and B, as observable in Figure 6.17. The improvement comes when we express the system as a sum of unique paths from IN to OUT, as displayed in Figure 6.18. The problem posed by Figure 6.18 in enumerating all those paths from IN to OUT is that the more times a node is considered, the greater its virtual probability becomes, a fact that causes inflated reliability figures. Therefore, it became necessary to create an advanced algorithm to reduce the unique paths to a quasi parallel–series network, as in Figure 6.19. This new advanced algorithm is the overlap method, so named for the overlapping of the subsystems of nodes [24].
FIGURE 6.17 Network example with six intermediary nodes from ingress (IN) to egress (OUT).
FIGURE 6.18 Paths that follow from IN to OUT in Figure 6.17.
FIGURE 6.19 A step closer to the overlap methodology for ingress–egress RBD.
We first study this network example and then present an algorithm for executing the technique by hand on two complex topologies of no more than six or eight nodes in total. This enables the student to begin solving these formerly "unsolvable" complex systems by hand, although doing so takes considerable time; this is why a software implementation is essential for tedious networks whose hand calculations would exceed tens of pages. The problem with Figure 6.19 is that some nodes are still represented more than once, so the reliability figure would still be inflated by double counting, but we are getting closer to the target goal. Figure 6.20 outlines the efficient new technique, which we earlier called the overlap method.
FIGURE 6.20 Overlap method outlined with the combinations to be added and deleted.
Using Figure 6.20 to add (+) and delete (−) the node combinations that are disallowed, the resulting IN–OUT dependency relationship is established:

{1→4} + {1→5} + {2→5} + {2→6} + {3→6}
− {1→4→5} − {1→2→5} − {1→2→4→6} − {2→5→6} − {1→3→4→6} − {1→3→5→6} − {2→3→6}
+ {1→2→4→5→6} + {1→3→4→5→6} + {1→2→3→4→6} + {1→2→3→5→6}
− {1→2→3→4→5→6}    (6)
Example. For the network above, if all the reliabilities are assumed to be 0.9, then

IN–OUT = (5)(0.9²) − [(4)(0.9³) + (3)(0.9⁴)] + (4)(0.9⁵) − 0.9⁶
       = (5)(0.81) − (4)(0.729) − (3)(0.6561) + (4)(0.59049) − 0.531441
       = 4.05 − 4.8843 + 2.36196 − 0.531441 = 0.996219    (7)
FIGURE 6.21 Result (= 0.80694) for the sample network using the overlap technique.
and appending the ingress and egress reliabilities, both 0.9, the entire network reliability is then

[IN(7)](0.996219)[OUT(8)] = (0.9)(0.996219)(0.9) = (0.81)(0.996219) = 0.80694    (8)
This result is confirmed in Figure 6.21 using TWC-Solver on the CD-ROM.

6.7.2 Overlap Ingress–Egress Reliability Algorithm

The overlap exact reliability algorithm generates a minimum list of paths between the ingress and egress nodes. The list contains all paths between the ingress and egress nodes such that no path overlaps any other path. For example, path 7–3–6–2–5–8 contains all nodes present in path 7–3–6–8; therefore, path 7–3–6–8 is said to overlap path 7–3–6–2–5–8, and path 7–3–6–2–5–8 would not be included in the list of minimum paths. Specific logic must be included in the algorithm to determine whether a network from the ingress to the egress node is parallel–series. The logic specifies that, using any two minimal paths in a network, any divergent nodes from the middles of paths must always be followed by the node where they converge, and always be led by the node where they originally diverged, if the network is strictly parallel–series. By comparing the nodes as they are generated, building assertions, and checking each new node against the existing assertions, we can determine whether the network is parallel–series. The first step in determining the minimum paths for the network is to identify the ingress node (7) and the egress node (8).
Now a working path is created to hold the nodes that are currently being examined. The ingress node is the first node to be added to the working path. All the links from the ingress node to other nodes are added to the node in the working path. Processing of the network can now begin. While there are still nodes in the working path to be processed, the following steps are performed. If all the links from the last node in the working path have been processed, the node is removed from the working path. If there are links that have not been processed and none of them point to the egress node, pick a link to process. If the node referred to by the link does not short-circuit the working path, add the node to the working path and repeat the steps above. If any of the links point to the egress node, add the path to the list of paths, excluding the ingress and egress nodes, and remove the last node from the working path.

Now that there is a new path in the path list, the assertions must be checked to ensure that the network can still be considered a parallel–series network. For each node in the new path, the following steps are performed. Check to see whether there is a set of assertions for the node. If there are no assertions, add an assertion set for the node, and add a list of all nodes that follow the node, based on the order of the nodes in the path. If the assertion set already exists, remove any node that appears before the current node in the path. Now set the node order in the assertion set to equal the maximum position it has held in any path. The next steps are performed only if more than a single node exists in the path. Compare the last two paths added to the path list to determine where the two paths diverge and where they converge. If they diverge and then converge, add a rule to the assertion set for each of the paths. The rules must state that all the nodes between the divergence and convergence nodes always come between those nodes. For example, if the last path is 7–1–4–8 and the path prior to it is 7–1–5–8, an assertion rule for path 7–1–4–8 is added which states that 4 must follow 1 and 4 must be followed by 8. Then an assertion must be added for path 7–1–5–8 which states that node 5 must follow node 1 and node 5 must be followed by node 8. Now the assertions for the last path must be examined to determine whether the path is a valid path. Check each node against the rules in the assertion sets. If any node breaks the rules for the nodes that follow other nodes or lead other nodes, the network can no longer be considered a strictly parallel–series network.

Repeat the algorithm until there are no nodes left in the working path. Once all nodes have been removed from the working path, the path list will contain the minimum paths in the network. A step-by-step rigorous application is left to the reader. Refer to the appendixes to follow similar examples.
6.8 MULTISTATE SYSTEM RELIABILITY EVALUATION

It is sometimes inadequate to describe a node's states by only UP (fully operating) and DOWN (fully deficient); more states may be needed, such as DERATED (partially operating, close to fully up) or even MORE DERATED (partially operating, close to DOWN). It is imperative that these state probabilities add up to unity (1.0). Let's take the situation where there are three states, UP, DER, and DN, and study it for simple series and active parallel systems.

6.8.1 Simple Series System

A simple series system with fully operating and derated states is shown in Figure 6.22. Our goal is to calculate the reliability of the simplest series system, a primitive example in which each node has three states, with probabilities P(UP) = 0.7, P(DER) = 0.2, and P(DN) = 0.1 (Figure 6.22). We use two approaches.

Longer State-Enumeration Approach  There can be S^N = 3² = 9 combinations, where S is the number of states and N the number of nodes (S = 3 and N = 2), as follows:

P(UP and UP)   = 0.7²       = 0.49
P(UP and DER)  = (0.7)(0.2) = 0.14
P(UP and DN)   = (0.7)(0.1) = 0.07
P(DER and UP)  = (0.2)(0.7) = 0.14
P(DER and DER) = 0.2²       = 0.04
P(DER and DN)  = (0.2)(0.1) = 0.02
P(DN and UP)   = (0.1)(0.7) = 0.07
P(DN and DER)  = (0.1)(0.2) = 0.02
P(DN and DN)   = 0.1²       = 0.01
Sum of probabilities        = 1.00    (9)
Of these nine combinations, the one that yields a fully UP system is the first line, with P(UP and UP) = 0.7² = 0.49. The states indicating that the system is inoperative are those on the third and sixth to ninth lines, which contain at least one DOWN state and sum to 0.19. The DERATED
FIGURE 6.22 Simple series system with a derated state.
states on the second, fourth, and fifth lines add up to 0.32; equivalently, Psys(DER) = 1 − Psys(UP) − Psys(DN) = 1 − 0.49 − 0.19 = 0.32.

Shortcut Formulation Approach  Working on the same two-node simple series system, let's calculate the system state probabilities:

Psys(UP) = P1(UP)P2(UP) = (0.7)(0.7) = 0.49    (10)

Psys(DER) = P1(UP + DER)P2(UP + DER) − Psys(UP) = (0.7 + 0.2)² − 0.7² = 0.81 − 0.49 = 0.32    (11)

Psys(DN) = 1 − Psys(UP) − Psys(DER) = 1 − 0.49 − 0.32 = 0.19    (12)

6.8.2 Active Parallel System

The system with IN(1) and OUT(4) both perfect and with 2 and 3 derated is shown in Figure 6.23.

Longer State-Enumeration Approach  The system-UP scenario occurs when at least one of the two middle nodes is UP. This happens for the UP–UP, UP–DER, DER–UP, UP–DN, and DN–UP combinations, whose sum = 0.49 + 0.14 + 0.14 + 0.07 + 0.07 = 0.91. The system-DER scenario occurs when at least one of the states is DER (and none UP): the DER–DER, DER–DN, and DN–DER combinations, whose sum = 0.04 + 0.02 + 0.02 = 0.08. The only remaining combination is DN–DN, whose probability is 0.1² = 0.01, or is obtained by subtraction.
FIGURE 6.23 Active parallel system with nodes 1 and 4 fully reliable; 2 and 3 are derated.
Shortcut Formulation Approach

Psys(UP) = P1(UP){1 − [1 − P2(UP)][1 − P3(UP)]}P4(UP) = (1.0)[1 − (1 − 0.7)²](1.0) = (1.0)(0.91) = 0.91    (13)

Psys(DER) = P1(UP)P4(UP){1 − [1 − P2(UP + DER)][1 − P3(UP + DER)]} − Psys(UP) = (1)(1 − 0.1²) − 0.91 = 0.99 − 0.91 = 0.08    (14)

Psys(DN) = 1 − Psys(UP) − Psys(DER) = 1 − 0.91 − 0.08 = 0.01    (15)

6.8.3 Simple Parallel–Series System

A simple parallel–series system with full (0.7) and derated (0.2) states is shown in Figure 6.24.

Longer State-Enumeration Approach  The merged parallel and series topologies contain 3⁴ = 81 combinations, from UP–UP–UP–UP and DER–UP–UP–UP all the way to DN–DN–DN–DN. Enumerating them is a cumbersome and time-consuming way to distinguish the desirable states; the shortcut technique is faster.

Shortcut Formulation Approach

Psys(UP) = P1(UP)P4(UP){1 − [1 − P2(UP)][1 − P3(UP)]} = (0.7)(0.7)[1 − (1 − 0.7)²] = (0.49)(0.91) = 0.4459    (16)

Psys(DER) = P1(UP + DER)P4(UP + DER){1 − [1 − P2(UP + DER)][1 − P3(UP + DER)]} − Psys(UP) = (0.7 + 0.2)²(1 − 0.1²) − 0.4459 = (0.81)(0.99) − 0.4459 = 0.356    (17)

Psys(DN) = 1 − Psys(UP) − Psys(DER) = 1 − 0.4459 − 0.356 = 0.19    (18)
FIGURE 6.24 Simple parallel–series system with a single derated state for s = 1, t = 4.
FIGURE 6.25 Simple parallel–series system with two derated states for s = 1, t = 4.
6.8.4 Simple Parallel System

A simple parallel–series system with full (0.4), derated (0.3), degraded (0.2), and down (0.1) states for the middle nodes is shown in Figure 6.25. Using the shortcut formulation approach (the state-enumeration method requires S^N = 4⁴ = 256 combinations in general; here, since 1 and 4 are full states, 4² = 16, still cumbersome to work with), and the same logic as in Sections 6.8.1 to 6.8.3, we get

Psys(UP) = P1(UP)P4(UP){1 − [1 − P2(UP)][1 − P3(UP)]} = (1.0)[1 − (1 − 0.4)²] = 1 − 0.36 = 0.64    (19)

Psys(DER) = (1.0){1 − [1 − P2(UP + DER)][1 − P3(UP + DER)]} − Psys(UP) = (1.0)(1 − 0.3²) − 0.64 = 0.91 − 0.64 = 0.27    (20)

Psys(DEGR) = (1.0){1 − [1 − P2(UP + DER + DEGR)][1 − P3(UP + DER + DEGR)]} − Psys(UP) − Psys(DER) = (1.0)(1 − 0.1²) − 0.64 − 0.27 = 0.99 − 0.64 − 0.27 = 0.08    (21)

Psys(DN) = 1 − Psys(UP) − Psys(DER) − Psys(DEGR) = 1 − 0.64 − 0.27 − 0.08 = 0.01    (22)
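The cumulative-state trick used in equations (10) to (22) reduces to a few lines of code: apply the ordinary binary series or parallel formula to the cumulative probabilities P(UP), P(UP + DER), . . . , then subtract the better states already counted. Below is a sketch with the probabilities of Sections 6.8.1 and 6.8.2 hard-coded; the class and method names are illustrative only.

/** Multistate shortcut of Sections 6.8.1-6.8.2: evaluate the binary formula on
 *  cumulative state probabilities, then difference out the better states. */
public class MultistateShortcut {
    // Two identical three-state nodes: P(UP) = 0.7, P(DER) = 0.2, P(DN) = 0.1.
    static double seriesAtLeast(double p)   { return p * p; }                   // both nodes
    static double parallelAtLeast(double p) { return 1 - (1 - p) * (1 - p); }   // middle pair

    public static void main(String[] args) {
        double up = 0.7, der = 0.2;
        // Simple series (Section 6.8.1), equations (10)-(12).
        double sUp  = seriesAtLeast(up);                 // 0.49
        double sDer = seriesAtLeast(up + der) - sUp;     // 0.81 - 0.49 = 0.32
        double sDn  = 1 - sUp - sDer;                    // 0.19
        System.out.printf("series   UP=%.2f DER=%.2f DN=%.2f%n", sUp, sDer, sDn);
        // Active parallel with perfect end nodes (Section 6.8.2), equations (13)-(15).
        double pUp  = parallelAtLeast(up);               // 0.91
        double pDer = parallelAtLeast(up + der) - pUp;   // 0.99 - 0.91 = 0.08
        double pDn  = 1 - pUp - pDer;                    // 0.01
        System.out.printf("parallel UP=%.2f DER=%.2f DN=%.2f%n", pUp, pDer, pDn);
    }
}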
Now consider the following example, which is somewhat similar to the one given on page 71 of reference 25 but with an altogether different formulation and different input data.

6.8.5 Combined System

A hydroelectric power plant (Figure 6.26) can generate 100% (fully operating), 75% (derated 1), 50% (derated 2), 25% (derated 3), or 0% (fully down) of its rated electric power capacity, depending on the water storage level and thus the amount of water flow reaching the turbine. The corresponding system states are 1, 2, 3, 4, and 5.
FIGURE 6.26 Power plant example with four derated turbines (1 to 4) in active parallel and a transformer (egress node 5). Node 0 is used as a placeholder for an ingress node with full reliability.
The power plant consists of four turbines in active parallel and an output transformer, in series with the turbines, to facilitate distribution. The available turbine that has the maximal power output is always used. For any demand level w = 1, 2, 3, 4, 5, the combined system reliability function of states takes the following recursive form:

Rsys(w) = (∑_{j=1}^{w} R5j) [1 − (1 − ∑_{j=1}^{w} R1j)(1 − ∑_{j=1}^{w} R2j)(1 − ∑_{j=1}^{w} R3j)(1 − ∑_{j=1}^{w} R4j)] − ∑_{j=1}^{w} Rsys(j − 1)    (23)
where w = 1, 2, 3, 4, 5 and Rsys(0) = 0.0; the MSS elements are statistically independent. If every element i has the state probabilities Ri1 = 0.4, Ri2 = 0.3, Ri3 = 0.15, Ri4 = 0.1, and Ri5 = 0.05 for states w = 1, . . . , 5, as shown in Figure 6.26, then the system reliabilities Rsys(w) are

Rsys(1) = 0.4[1 − (1 − 0.4)⁴] = 0.34816    (24)
Rsys(2) = (0.4 + 0.3){1 − [1 − (0.4 + 0.3)]⁴} − Rsys(1) = 0.69433 − 0.34816 = 0.34617    (25)

Rsys(3) = (0.4 + 0.3 + 0.15){1 − [1 − (0.4 + 0.3 + 0.15)]⁴} − Rsys(1) − Rsys(2) = 0.84957 − 0.34617 − 0.34816 = 0.15524    (26)

Rsys(4) = (0.4 + 0.3 + 0.15 + 0.1){1 − [1 − (0.4 + 0.3 + 0.15 + 0.1)]⁴} − Rsys(1) − Rsys(2) − Rsys(3) = 0.94999 − 0.15524 − 0.34617 − 0.34816 = 0.10042    (27)

Rsys(5) = (0.4 + 0.3 + 0.15 + 0.1 + 0.05){1 − [1 − (0.4 + 0.3 + 0.15 + 0.1 + 0.05)]⁴} − Rsys(1) − Rsys(2) − Rsys(3) − Rsys(4) = 1 − 0.10042 − 0.15524 − 0.34617 − 0.34816 = 0.05001    (28)
These results agree with the software solution using TWC-Solver, as shown in Figure 6.26.
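A small loop-based evaluation of equation (23), with the state probabilities of Figure 6.26 hard-coded, reproduces the five values above; the class name and layout are illustrative only.

/** Recursive multistate evaluation of equation (23) for the power-plant example:
 *  four identical turbines in active parallel followed by the transformer. */
public class PowerPlantMSS {
    // State probabilities for demand levels w = 1..5, identical for every element.
    static final double[] P = {0.4, 0.3, 0.15, 0.1, 0.05};

    public static void main(String[] args) {
        double[] rsys = new double[6];                  // rsys[0] = 0.0 by definition
        for (int w = 1; w <= 5; w++) {
            double cum = 0.0;                           // cumulative element probability
            for (int j = 0; j < w; j++) cum += P[j];
            double parallel = 1 - Math.pow(1 - cum, 4); // four turbines in active parallel
            double val = cum * parallel;                // transformer in series
            for (int j = 1; j < w; j++) val -= rsys[j]; // subtract lower demand levels
            rsys[w] = val;
            System.out.printf("Rsys(%d) = %.5f%n", w, rsys[w]);
        }
    }
}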
6.9 DISCUSSION AND CONCLUSIONS

First, the compression technique proposed in the FUB method, although it gives an approximate fast upper bound rather than exact results, performs a special coding to encode and decode non-parallel–series (complex) networks. Second, the hybrid-enumeration algorithm proposed, although slower, calculates the exact source–target reliability index, starting from simpler and tractable complex networks such as those shown in Figures 6.10, 6.12, and 6.14, up to very complex ones, as in Figure 6.16, with 19 nodes and 32 links. This method illustrates the reconstruction of a complex topology by a special conversion, or Polish decoding, technique whose algorithm is given in Section 6.6. The decoding practice proposed can be useful for security and for time- and space-saving purposes. This package enables encoding and decoding for any network that demands the highest and most critical assurance. In conclusion, aside from calculating the source–target reliability of any complex system, it is shown that the Polish notation constructed from a graphical interface using postfixes to describe the topology of any complex network is useful for identifying a given topology. Furthermore, the output can then be transported, for reasons of security or to save storage space, to a remote analyst, who in turn can reverse-engineer the given Polish notation with the proposed decoding algorithm to reconstruct the topology. Both the forward (encoding) and reverse (decoding) algorithms work for simple parallel–series as well as non-parallel–series (i.e., complex) networks. Networks of various complexities are examined. The efforts to save time have progressed to a new algorithm for large networks exceeding 20 nodes. The hybrid-enumeration algorithm can accurately calculate the source–target reliability of a complex system such as the one shown in Figure 6.16, with 19 nodes and 32 links, in roughly 3000 seconds. However, a novel research project using the overlap technique will increase the computation speed for large complex networks on the order of 50- to
100-fold, from approximately 3000 seconds to nearly 60 seconds, without sacrificing any accuracy for a network such as one with 19 nodes and 32 links. The overlap method, with its extreme speed and additional advantages, such as multistate treatment of components, is also studied in this chapter through algorithms for hand calculation. The set of algorithms presented enables students and readers to work on formerly unsolvable complex networks, now feasible by hand, and provides a powerful alternative to commercial software. These algorithms form the subject matter of a new RBD trend.
APPENDIX 6A: OVERLAP ALGORITHM DESCRIBED

Create a list to hold the minimum paths. Create a list of nodes (a working path list). Determine the ingress node for the network. Add the ingress node to the working path list. Include an index in the node to denote the current link. Include an indexed list of links to all other nodes [24].

Current state:
  Paths:
  Ingress Node:    Egress Node:
  Working Path: Ingress Node (0); Links: (0) (node–node), (1) (node–node); Link Index = −1
While there are still nodes present in the working path, continue working.

Step I
1. If no nodes remain in the working path, the process is complete, so go to step II.
2. Increase the link index by 1 for the last node in the working path.
3. If all the links have been processed for the last node in the working path, remove the node and go to step I.
4. Get the node to which the next link points.
5. If it is the egress node, do the following:
   a. Add the egress node to the working nodes.
   b. Add the path contained in the working path to the list of paths.
   c. If the network is currently considered a parallel–series network and there is more than one node in the path, for each node in the path do the following:
      i. If the node is not in the assertions list:
         1. Add the node to the assertions list.
         2. Add all the nodes in the path that follow the node to the "always follows" list for the node in the assertions list.
      ii. If the node is in the assertions list, remove from its "always follows" list any node that precedes it in the path.
      iii. Set the order number for the node to the highest value it has held in any path.
      iv. Get the path added to the paths list prior to the current path.
      v. Walk through the paths from the start and determine where the paths diverge.
      vi. Walk through the paths from the end and determine where they converge.
      vii. For each node in the current path, starting after the divergence node and ending at the node prior to the convergence node:
         1. Add the node to the assertions list if it does not already exist.
         2. Add the divergence node to the "follow nodes" list if it is not in the list.
         3. Add the convergence node to the "lead nodes" list if it is not in the list.
      viii. For each node in the path prior to the current path, starting after the divergence node and ending at the node prior to the convergence node:
         1. Add the node to the assertions list if it does not already exist.
         2. Add the divergence node to the "follow nodes" list if it is not in the list.
         3. Add the convergence node to the "lead nodes" list if it is not in the list.
      ix. For each node in the current path, if any node in the "follow nodes" list precedes the node in the current path or any node in the "lead nodes" list follows the node in the current path, mark the network as complex.
   d. Remove the egress node from the working path.
   e. Go to step I.
6. If the node does not short-circuit the path (a node is considered to short-circuit the path if it links to any node already in the working path), add the node to the working path.
7. Go to step I.

Step II
If the network is not parallel–series, go to step IV.
1. Calculate the reliability of the entire network.
2. Get the list of nodes that always follow the ingress node from the assertions list generated in step I.
3. Make this the "always leads" list.
4. If no nodes follow the ingress node, the network reliability is the reliability of the ingress node, so go to step V.
5. Set the target node to the egress node.
Step III
1. If there is only one node in the "always leads" list:
   a. Get the "always leads" list for the node in the current "always leads" list.
   b. Recursively call step III with the "always leads" list from item a.
   c. Set the current reliability to the current reliability × the reliability from the recursive call in item b.
   d. Return the current reliability.
2. If there is more than one node in the "always leads" list:
   a. Find the node at which all nodes in the "always leads" list eventually reconverge.
   b. Calculate the system reliability as [1.0 − (recursively call step III with the node from item a as the target node)].
   c. Get the system reliability from the reconvergence node to the target node.
   d. Calculate the current reliability as the current reliability × (1 − system reliability) × the reliability from item c.
   e. Return the current reliability.

Step IV
The network type is a complex network.
1. Remove the paths that overlap.
   a. Test all the paths to determine which paths may be removed because they overlap another path.
   b. Test each path in the paths list against all the paths that follow it:
      i. If every node in the path at index j is in the path at index i, remove it.
      ii. Else, if every node in the path at index i is in the path at index j, remove it.
2. Create a list index and set it to 0.
3. For each path in the list:
   a. Get the current path from the list at the index.
   b. For each path in the paths list following the current path:
      i. Get the nodes that are in that path but not in the current path.
      ii. Create a new path with these nodes and add it to the "pass on paths" list.
   c. Calculate the reliability as reliability + (path reliability) × [1 − (repeat these steps for the "pass on paths" list)].

Step V
The algorithm is complete.
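At its core, the working-path bookkeeping above is a depth-first search that rejects any node already on the path. The sketch below strips out the parallel–series assertion tracking and uses a hypothetical directed adjacency list patterned on Figure 6.17; the book's links are undirected, so reverse links (and hence walks such as 7–3–6–2–5–8) are omitted here for brevity.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Enumerates ingress-egress paths by depth-first search, rejecting any node that
 *  would "short-circuit" the working path (i.e., is already on it).  Sketch of
 *  the working-path loop of Appendix 6A, without the assertion bookkeeping. */
public class PathEnumerator {
    // Hypothetical adjacency list for the six-intermediary-node example (7 = IN, 8 = OUT).
    static final int[][] ADJ = new int[9][];
    static {
        ADJ[7] = new int[]{1, 2, 3}; ADJ[1] = new int[]{4, 5};
        ADJ[2] = new int[]{5, 6};    ADJ[3] = new int[]{6};
        ADJ[4] = new int[]{8};       ADJ[5] = new int[]{8};
        ADJ[6] = new int[]{8};       ADJ[8] = new int[]{};
    }
    static final List<List<Integer>> paths = new ArrayList<>();

    public static void main(String[] args) {
        dfs(7, 8, new ArrayDeque<>());
        paths.forEach(System.out::println);   // 7-1-4-8, 7-1-5-8, 7-2-5-8, 7-2-6-8, 7-3-6-8
    }

    static void dfs(int node, int egress, Deque<Integer> working) {
        working.addLast(node);                       // push onto the working path
        if (node == egress) {
            paths.add(new ArrayList<>(working));     // record a complete path
        } else {
            for (int next : ADJ[node])
                if (!working.contains(next))         // short-circuit check
                    dfs(next, egress, working);
        }
        working.removeLast();                        // backtrack
    }
}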
APPENDIX 6B: OVERLAP INGRESS–EGRESS RELIABILITY ALGORITHM APPLIED, EXAMPLE 1

Using the network shown in Figure 6B.1, the following is an example of the overlap technique to determine the minimal paths for the network. The default assertion is that the network type is parallel–series. Create a list to hold the minimum paths. Determine the ingress node for the network. Determine the egress node for the network. Create an assertions list. Create a list of nodes (a working path) and add the ingress node to the working path list. Include a link index in the node to denote the current link, with an initial value of −1. Include an indexed list of links to all other nodes.

Paths:    Ingress Node: 1    Egress Node: 6
Working Path:
  Node 1 (Link Index: −1); Links: (0) 1–2, (1) 1–4
Increment the link index for node 1. Since node 1 has links that have not been processed, find the next node based on the link index from the links list. Since node 2 is not already in the working path, and thus does not short-circuit the path, add it to the working path. The working path should now look like this:

Paths:    Ingress Node: 1    Egress Node: 6
Working Path:
  Node 1 (Link Index: 0); Links: (0) 1–2, (1) 1–4
  Node 2 (Link Index: −1); Links: (0) 2–3, (1) 2–5
FIGURE 6B.1 Six-node complex network and its overlap reliability, s = 1, t = 6.
Increment the link index for node 2. Since node 2 has links that have not been processed, find the next node based on the link index from the links list. Since node 3 is not already in the working path, and thus does not short-circuit the path, add it to the working path. The working path should now look like this:

Paths:    Ingress Node: 1    Egress Node: 6
Working Path:
  Node 1 (Link Index: 0); Links: (0) 1–2, (1) 1–4
  Node 2 (Link Index: 0); Links: (0) 2–3, (1) 2–5
  Node 3 (Link Index: −1); Links: (0) 3–4, (1) 3–6
Increment the link index for node 3. Since node 3 has links that have not been processed, find the next node based on the link index from the links list. Since node 4 is not already in the working path, and thus does not short-circuit the path, add it to the working path. The working path should now look like this:

Paths:    Ingress Node: 1    Egress Node: 6
Working Path:
  Node 1 (Link Index: 0); Links: (0) 1–2, (1) 1–4
  Node 2 (Link Index: 0); Links: (0) 2–3, (1) 2–5
  Node 3 (Link Index: 0); Links: (0) 3–4, (1) 3–6
  Node 4 (Link Index: −1); Links: (0) 4–1, (1) 4–5
Increment the link index for node 4. Since node 4 has links that have not been processed, find the next node based on the link index from the links list. The first link from node 4 points to node 1; node 1 is already in the path, so it is ignored. Increment the link index for node 4 again. Since node 4 still has links that have not been processed, find the next node based on the link index. Since node 5 is not already in the working path, and thus does not short-circuit the path, add it to the working path. The working path should now look like this:

Paths:    Ingress Node: 1    Egress Node: 6
Working Path:
  Node 1 (Link Index: 0); Links: (0) 1–2, (1) 1–4
  Node 2 (Link Index: 0); Links: (0) 2–3, (1) 2–5
  Node 3 (Link Index: 0); Links: (0) 3–4, (1) 3–6
  Node 4 (Link Index: 1); Links: (0) 4–1, (1) 4–5
  Node 5 (Link Index: −1); Links: (0) 5–2, (1) 5–6
Increment the link index for node 5. Since node 5 has links that have not been processed, find the next node based on the link index from the links list.
Since node 2 is already in the path, ignore it. Increment the link index for node 5. Since node 5 has links that have not been processed, find the next node based on the link index. Node 6 is not already in the working path, so add it to the working path. The working path should now look like this:

Paths:    Ingress Node: 1    Egress Node: 6
Working Path:
  Node 1 (Link Index: 0); Links: (0) 1–2, (1) 1–4
  Node 2 (Link Index: 0); Links: (0) 2–3, (1) 2–5
  Node 3 (Link Index: 0); Links: (0) 3–4, (1) 3–6
  Node 4 (Link Index: 1); Links: (0) 4–1, (1) 4–5
  Node 5 (Link Index: 1); Links: (0) 5–2, (1) 5–6
  Node 6 (Link Index: −1); Links: (none)
Node 6 is the egress node. Add the path to the paths list. The network is considered a parallel–series network and there is more than one node in the path. There are no nodes in the assertions list, so add all the nodes in the path to the assertions list. As each node is added to the assertions list, add all the nodes that follow it to the always follows list. The assertions list looks like this:

Assertions: Network Type: Parallel–Series
Node   Always Follows    Order   Lead Nodes   Follow Nodes
1      2, 3, 4, 5, 6     1
2      3, 4, 5, 6        2
3      4, 5, 6           3
4      5, 6              4
5      6                 5
6                        6
Since there is only one path in the paths list, the current path cannot be compared to any other path. The network type is currently parallel–series. The nodes do not violate the assertion test (no node in a node's lead nodes list follows that node in the path, and no node in its follow nodes list precedes that node in the path), so the network type remains parallel–series. Now remove node 6 from the working path list. Increment the link index for node 5. Since node 5 does not have links that have not been processed, remove it from the working path list. Increment the link index for node 4. Since node 4 does not have links that have not been processed, remove it from the working path. Increment the link index for node 3. Since node 3 has links that have not been processed, find the next node based on the link index. Node 6 is not already in the working path and thus does not short-circuit the path, so add it to the working path. The working path should now look like this:
Paths: 1, 2, 3, 4, 5, 6
Working Path: Node 1 Link Index: 0 Links: 0–1, 2 1–1, 4
Node 2 Link Index: 0 Links: 0–2, 3 1–2, 5
Node 3 Link Index: 1 Links: 0–3, 4 1–3, 6
Node 6 Link Index: −1
Node 6 is the egress node. Add the path to the paths list. The network is considered a parallel–series network and there is more than one node in the path. Add any node that is in the path but not in the assertions list. As each node is added to the assertions list, add all the nodes that follow it to the always follows list. For each node already in the assertions list, remove any nodes in the always follows list that precede the node in the path. The assertions list looks like this:

Assertions: Network Type: Parallel–Series
Node   Always Follows    Order   Lead Nodes   Follow Nodes
1      2, 3, 4, 5, 6     1
2      3, 4, 5, 6        2
3      4, 5, 6           3
4      5, 6              4
5      6                 5
6                        6
Since there is more than one path in the paths list, compare the current path to the path added prior to it. The paths diverge at node 3 and converge at node 6. Add the divergence node to the lead nodes list and the convergence node to the follow nodes list for each node between the divergence and convergence nodes in the two paths. The assertions list looks like this:

Assertions: Network Type: Parallel–Series
Node   Always Follows    Order   Lead Nodes   Follow Nodes
1      2, 3, 4, 5, 6     1
2      3, 4, 5, 6        2
3      4, 5, 6           3
4      5, 6              4       3            6
5      6                 5       3            6
6                        6
Check to see if any node in the lead nodes list appears after node 4 or 5 in the last path, and if any node in the follow nodes list precedes node 4 or 5.
Since neither is the case, the network is still considered a parallel–series network. Now remove node 6 from the working path list. Increment the link index for node 3. Since node 3 does not have links that have not been processed, remove it from the working path list. Increment the link index for node 2. Since node 2 has links that have not been processed, find the next node based on the link index. Since node 5 is not already in the working path and thus does not short-circuit the path, add it to the working path. The working path should now look like this:
Paths:
1, 2, 3, 4, 5, 6
1, 2, 3, 6
Working Path: Node 1 Link Index: 0 Links: 0–1, 2 1–1, 4
Node 2 Link Index: 1 Links: 0–2, 3 1–2, 5
Node 5 Link Index: −1 Links: 0–5, 4 1–5, 6
Increment the link index for node 5. Since node 5 has links that have not been processed, find the next node based on the link index. Since node 4 is not already in the working path and thus does not short-circuit the path, add it to the working path. The working path should now look like this:
Paths:
1, 2, 3, 4, 5, 6
1, 2, 3, 6
Working Path: Node 1 Link Index: 0 Links: 0–1, 2 1–1, 4
Node 2 Link Index: 1 Links: 0–2, 3 1–2, 5
Node 5 Link Index: 0 Links: 0–5, 4 1–5, 6
Node 4 Link Index: −1 Links: 0–4, 1 1–4, 3
Increment the link index for node 4. Since node 4 has links that have not been processed, find the next node based on the link index. Node 1 is in the path, so increment the link index for node 4. Since node 4 has links that have not been processed, find the next node based on the link index. Since node 3 is not already in the working path and thus does not short-circuit the path, add it to the working path. The working path should now look like this:
Paths:
1, 2, 3, 4, 5, 6
1, 2, 3, 6
Working Path: Node 1 Link Index: 0 Links: 0–1, 2 1–1, 4
Node 2 Link Index: 1 Links: 0–2, 3 1–2, 5
Node 5 Link Index: 0 Links: 0–5, 4 1–5, 6
Node 4 Link Index: 1 Links: 0–4, 1 1–4, 3
Node 3 Link Index: −1 Links: 0–3, 2 1–3, 6
Increment the link index for node 3. Since node 3 has links that have not been processed, find the next node based on the link index. Node 2 is in the path, so increment the link index for node 3. Since node 3 has links that have not been processed, find the next node based on the link index. Since node 6 is not already in the working path and thus does not short-circuit the path, add it to the working path. The working path should now look like this:
Paths:
1, 2, 3, 4, 5, 6
1, 2, 3, 6
Working Path: Node 1 Link Index: 0 Links: 0–1, 2 1–1, 4
Node 2 Link Index: 1 Links: 0–2, 3 1–2, 5
Node 5 Link Index: 0 Links: 0–5, 4 1–5, 6
Node 4 Link Index: 1 Links: 0–4, 1 1–4, 3
Node 3 Link Index: 1 Links: 0–3, 2 1–3, 6
Node 6 Link Index: −1
Node 6 is the egress node. Add the path to the paths list. The network is considered a parallel–series network and there is more than one node in the path. Add any node that is in the path but not in the assertions list. As each node is added to the assertions list, add all the nodes that follow it to the always follows list. For each node already in the assertions list, remove any nodes in the always follows list that precede the node in the path. The assertions list looks like this:

Assertions: Network Type: Parallel–Series
Node   Always Follows    Order   Lead Nodes   Follow Nodes
1      2, 3, 4, 5, 6     1
2      3, 4, 5, 6        2
3      4, 5, 6           3
4      5, 6              4       3            6
5      6                 5       3            6
6                        6
Since there is more than one path in the paths list, compare the current path to the path added prior to it. The paths diverge at node 2 and converge at node 3. Add the divergence node to the lead nodes list and the convergence node to the follow nodes list for each node between the divergence and convergence nodes in the two paths. The assertions list looks like this:

Assertions: Network Type: Parallel–Series
Node   Always Follows    Order   Lead Nodes   Follow Nodes
1      2, 3, 4, 5, 6     1
2      3, 4, 5, 6        2
3      4, 5, 6           3
4      5, 6              4       3, 2         6, 3
5      6                 5       3, 2         6, 3
6                        6
Check to see if any node in the lead nodes list appears after node 4 or 5 in the last path, and if any node in the follow nodes list comes before node 4 or 5. Since node 3 is a lead node and appears after node 4 in the current path, the network is considered complex. Now remove node 6 from the working path list. Increment the link index for node 3. Since node 3 does not have links that have not been processed, remove it from the working path list. Increment the link index for node 4. Since node 4 does not have links that have not been processed, remove it from the working path list. Increment the link index for node 5. Since node 5 has links that have not been processed, find the next node based on the link index. Since node 6 is not already in the working path and thus does not short-circuit the path, add it to the working path. The working path should now look like this:
Paths:
1, 2, 3, 4, 5, 6
1, 2, 3, 6
1, 2, 5, 4, 3, 6
Working Path: Node 1 Link Index: 0 Links: 0–1, 2 1–1, 4
Node 2 Link Index: 1 Links: 0–2, 3 1–2, 5
Node 5 Link Index: 1 Links: 0–5, 4 1–5, 6
Node 6 Link Index: −1
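The lead/follow assertion test that was just violated can be expressed compactly. The sketch below (function and variable names are illustrative) returns False exactly when a lead node follows, or a follow node precedes, some node of the path, which is the condition that reclassified this network as complex:

# Sketch of the assertion test: the network stays parallel-series unless a
# node's lead node appears after it in the path, or a node's follow node
# appears before it.
def path_is_parallel_series(path, lead, follow):
    pos = {n: i for i, n in enumerate(path)}
    for n in path:
        if any(m in pos and pos[m] > pos[n] for m in lead.get(n, ())):
            return False          # a lead node follows n in the path
        if any(m in pos and pos[m] < pos[n] for m in follow.get(n, ())):
            return False          # a follow node precedes n in the path
    return True

# Assertions built above: nodes 4 and 5 have lead nodes {3, 2}, follow nodes {6, 3}.
lead = {4: {3, 2}, 5: {3, 2}}
follow = {4: {6, 3}, 5: {6, 3}}
print(path_is_parallel_series([1, 2, 5, 4, 3, 6], lead, follow))  # False: complex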
Node 6 is the egress node. Add the path to the paths list. The network is considered a complex network, so do not add or check the assertions. Remove node 6 from the working path. Increment the link index for node 5. Since node 5 does not have any links that have not been processed, remove it from the working path. Increment the link index for node 2. Since node 2 does not have any links that have not been processed, remove it from the working path. Increment the link index for node 1. Since node 1 has links that have not been processed, find the next node based on the link index. Since node 4 is not already in the working path and thus does not short-circuit the path, add it to the working path. The working path should now look like this:
Paths:
1, 2, 3, 4, 5, 6
1, 2, 3, 6
1, 2, 5, 4, 3, 6
1, 2, 5, 6
Working Path: Node 1 Link Index: 1 Links: 0–1, 2 1–1, 4
Node 4 Link Index: −1 Links: 0–4, 3 1–4, 5
Increment the link index for node 4. Since node 4 has links that have not been processed, find the next node based on the link index.
Node 3 is not already in the working path and thus does not short-circuit the path, so add it to the working path. The working path should now look like this:
Paths:
1, 2, 3, 4, 5, 6
1, 2, 3, 6
1, 2, 5, 4, 3, 6
1, 2, 5, 6
Working Path: Node 1 Link Index: 1 Links: 0–1, 2 1–1, 4
Node 4 Link Index: 0 Links: 0–4, 3 1–4, 5
Node 3 Link Index: −1 Links: 0–3, 2 1–3, 6
Increment the link index for node 3. Since node 3 has links that have not been processed, find the next node based on the link index. Since node 2 is not already in the working path and thus does not short-circuit the path, add it to the working path. The working path should now look like this:
Paths:
1, 2, 3, 4, 5, 6
1, 2, 3, 6
1, 2, 5, 4, 3, 6
1, 2, 5, 6
Working Path: Node 1 Link Index: 1 Links: 0–1, 2 1–1, 4
Node 4 Link Index: 0 Links: 0–4, 3 1–4, 5
Node 3 Link Index: 0 Links: 0–3, 2 1–3, 6
Node 2 Link Index: −1 Links: 0–2, 1 1–2, 5
Increment the link index for node 2. Since node 2 has links that have not been processed, find the next node based on the link index. Node 1 is already in the working path, so ignore it. Increment the link index for node 2. Since node 2 has links that have not been processed, find the next node based on the link index. Since node 5 is not already in the working path and thus does not short-circuit the path, add it to the working path. The working path should now look like this:
Paths:
1, 2, 3, 4, 5, 6
1, 2, 3, 6
1, 2, 5, 4, 3, 6
1, 2, 5, 6
Working Path: Node 1 Link Index: 1 Links: 0–1, 2 1–1, 4
Node 4 Link Index: 0 Links: 0–4, 3 1–4, 5
Node 3 Link Index: 0 Links: 0–3, 2 1–3, 6
Node 2 Link Index: 1 Links: 0–2, 1 1–2, 5
Node 5 Link Index: −1 Links: 0–5, 4 1–5, 6
Increment the link index for node 5. Since node 5 has links that have not been processed, find the next node based on the link index. Node 4 is already in the working path, so ignore it. Increment the link index for node 5. Since node 5 has links that have not been processed, find the next node based on the link index. Since node 6 is not already in the working path and thus does not short-circuit the path, add it to the working path. The working path should now look like this:
Paths:
1, 2, 3, 4, 5, 6
1, 2, 3, 6
1, 2, 5, 4, 3, 6
1, 2, 5, 6
Working Path: Node 1 Link Index: 1 Links: 0–1, 2 1–1, 4
Node 4 Link Index: 0 Links: 0–4, 3 1–4, 5
Node 3 Link Index: 0 Links: 0–3, 2 1–3, 6
Node 2 Link Index: 1 Links: 0–2, 1 1–2, 5
Node 5 Link Index: 1 Links: 0–5, 4 1–5, 6
Node 6 Link Index: −1
Node 6 is the egress node. Add the path to the paths list. The network is considered a complex network, so do not add or check the assertions. Remove node 6 from the working path. Increment the link index for node 5. Since node 5 does not have any links that have not been processed, remove it from the working path. Increment the link index for node 2. Since node 2 does not have any links that have not been processed, remove it from the working path. Increment the link index for node 3. Since node 3 has links that have not been processed, find the next node based on the link index. Since node 6 is not already in the working path and thus does not short-circuit the path, add it to the working path. The working path should now look like this:
Paths:
1, 2, 3, 4, 5, 6
1, 2, 3, 6
1, 2, 5, 4, 3, 6
1, 2, 5, 6
1, 4, 3, 2, 5, 6
Working Path: Node 1 Link Index: 1 Links: 0–1, 2 1–1, 4
Node 4 Link Index: 0 Links: 0–4, 3 1–4, 5
Node 3 Link Index: 1 Links: 0–3, 2 1–3, 6
Node 6 Link Index: −1
Node 6 is the egress node. Add the path to the paths list. The network is considered a complex network, so do not add or check the assertions. Remove node 6 from the working path. Increment the link index for node 3. Since node 3 does not have any links that have not been processed, remove it from the working path.
Increment the link index for node 4. Since node 4 has links that have not been processed, find the next node based on the link index. Since node 5 is not already in the working path and thus does not short-circuit the path, add it to the working path. The working path should now look like this:
Paths:
1, 2, 3, 4, 5, 6
1, 2, 3, 6
1, 2, 5, 4, 3, 6
1, 2, 5, 6
1, 4, 3, 2, 5, 6
1, 4, 3, 6
Working Path: Node 1 Link Index: 1 Links: 0–1, 2 1–1, 4
Node 4 Link Index: 1 Links: 0–4, 3 1–4, 5
Node 5 Link Index: −1 Links: 0–5, 2 1–5, 6
Increment the link index for node 5. Since node 5 has links that have not been processed, find the next node based on the link index. Since node 2 is not already in the working path and thus does not short-circuit the path, add it to the working path. The working path should now look like this:
Paths:
1, 2, 3, 4, 5, 6
1, 2, 3, 6
1, 2, 5, 4, 3, 6
1, 2, 5, 6
1, 4, 3, 2, 5, 6
1, 4, 3, 6
Working Path: Node 1 Link Index: 1 Links: 0–1, 2 1–1, 4
Node 4 Link Index: 1 Links: 0–4, 3 1–4, 5
Node 5 Link Index: 0 Links: 0–5, 2 1–5, 6
Node 2 Link Index: −1 Links: 0–2, 1 1–2, 3
Increment the link index for node 2. Since node 2 has links that have not been processed, find the next node based on the link index. Node 1 is in the working path, so ignore it. Increment the link index for node 2. Since node 3 is not already in the working path and thus does not short-circuit the path, add it to the working path. The working path should now look like this:
Paths:
1, 2, 3, 4, 5, 6
1, 2, 3, 6
1, 2, 5, 4, 3, 6
1, 2, 5, 6
1, 4, 3, 2, 5, 6
1, 4, 3, 6
Working Path: Node 1 Link Index: 1 Links: 0–1, 2 1–1, 4
Node 4 Link Index: 1 Links: 0–4, 3 1–4, 5
Node 5 Link Index: 0 Links: 0–5, 2 1–5, 6
Node 2 Link Index: 1 Links: 0–2, 1 1–2, 3
Node 3 Link Index: −1 Links: 0–3, 4 1–3, 6
Increment the link index for node 3. Since node 3 has links that have not been processed, find the next node based on the link index. Node 4 is in the working path, so ignore it. Increment the link index for node 3. Since node 6 is not already in the working path and thus does not short-circuit the path, add it to the working path. The working path should now look like this:
Paths:
1, 2, 3, 4, 5, 6
1, 2, 3, 6
1, 2, 5, 4, 3, 6
1, 2, 5, 6
1, 4, 3, 2, 5, 6
1, 4, 3, 6
Working Path: Node 1 Link Index: 1 Links: 0–1, 2 1–1, 4
Node 4 Link Index: 1 Links: 0–4, 3 1–4, 5
Node 5 Link Index: 0 Links: 0–5, 2 1–5, 6
Node 2 Link Index: 1 Links: 0–2, 1 1–2, 3
Node 3 Link Index: 1 Links: 0–3, 4 1–3, 6
Node 6 Link Index: −1
Node 6 is the egress node. Add the path to the paths list. The network is considered a complex network, so do not add or check the assertions. Remove node 6 from the working path. Increment the link index for node 3. Since node 3 does not have any links that have not been processed, remove it from the working path. Increment the link index for node 2. Since node 2 does not have any links that have not been processed, remove it. Increment the link index for node 5. Since node 5 has links that have not been processed, find the next node based on the link index. Since node 6 is not already in the working path and thus does not short-circuit the path, add it to the working path. The working path should now look like this:
Paths:
1, 2, 3, 4, 5, 6
1, 2, 3, 6
1, 2, 5, 4, 3, 6
1, 2, 5, 6
1, 4, 3, 2, 5, 6
1, 4, 3, 6
1, 4, 5, 2, 3, 6
Working Path: Node 1 Link Index: 1 Links: 0–1, 2 1–1, 4
Node 4 Link Index: 1 Links: 0–4, 3 1–4, 5
Node 5 Link Index: 1 Links: 0–5, 2 1–5, 6
Node 6 Link Index: −1 Links:
Node 6 is the egress node. Add the path to the paths list. The network is considered a complex network, so do not add or check the assertions. Remove node 6 from the working path. Increment the link index for node 5. Since node 5 does not have any links that have not been processed, remove it from the working path. Increment the link index for node 4. Since node 4 does not have any links that have not been processed, remove it from the working path. Increment the link index for node 1. Since node 1 does not have any links that have not been processed, remove it from the working path. Since there are no nodes left in the working path, the paths list contains all the paths for the network. The paths are:
1, 2, 3, 4, 5, 6
1, 2, 3, 6
1, 2, 5, 4, 3, 6
1, 2, 5, 6
1, 4, 3, 2, 5, 6
1, 4, 3, 6
1, 4, 5, 2, 3, 6
1, 4, 5, 6
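The whole enumeration just completed is a depth-first search that advances each node's link index and backtracks when a node's links are exhausted. A minimal Python sketch (names are illustrative; the adjacency lists are read off the trace above) reproduces the eight paths in the same order:

# Depth-first enumeration of all simple ingress-egress paths, as traced above.
def enumerate_paths(adj, ingress, egress):
    paths, working = [], [ingress]
    def visit(node):
        if node == egress:
            paths.append(list(working))
            return
        for nxt in adj[node]:            # advance the node's link index in order
            if nxt not in working:       # a repeated node would short-circuit
                working.append(nxt)
                visit(nxt)
                working.pop()            # backtrack: links of nxt are exhausted
    visit(ingress)
    return paths

# Adjacency of the six-node network of Figure 6B.1, read off the trace:
adj = {1: [2, 4], 2: [1, 3, 5], 3: [2, 4, 6], 4: [1, 3, 5], 5: [2, 4, 6], 6: [3, 5]}
for p in enumerate_paths(adj, 1, 6):
    print(p)                             # the eight paths above, in the same order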
Calculate the reliability of the network. Remove any paths that are overlapped by any other path. The remaining paths are:
1, 2, 3, 6
1, 2, 5, 6
1, 4, 3, 6
1, 4, 5, 6
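Here a path is overlapped when its node set contains another path's node set; such a path cannot be up unless the shorter path is also up, so it adds no new way for the network to work. A one-function sketch (names illustrative):

# Remove any path whose node set is a proper superset of another path's set.
def remove_overlapped(paths):
    sets = [set(p) for p in paths]
    return [p for p, s in zip(paths, sets) if not any(t < s for t in sets)]

paths = [[1, 2, 3, 4, 5, 6], [1, 2, 3, 6], [1, 2, 5, 4, 3, 6], [1, 2, 5, 6],
         [1, 4, 3, 2, 5, 6], [1, 4, 3, 6], [1, 4, 5, 2, 3, 6], [1, 4, 5, 6]]
print(remove_overlapped(paths))
# [[1, 2, 3, 6], [1, 2, 5, 6], [1, 4, 3, 6], [1, 4, 5, 6]]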
For each path in the paths list, calculate the network reliability. Take the first path and compare it to the paths that follow to get the nodes in each path that are not in the first path. Remove any overlapped paths in the pass on list. Now repeat the process for each path in the pass on list and any sub pass on lists. The following shows the calculations to get the system reliability using these steps for the first path in the original list of paths:
Original path list
1, 2, 3, 6
1, 2, 5, 6
1, 4, 3, 6
1, 4, 5, 6
Pass on list
5
4
4, 5
Pass on list with the overlapped paths removed
5
4
Path 5
Pass on list 2
4
Path (from the pass on list) 4
Pass on list 2 reliability = 0.9
Reliability = (current reliability for this level) + (path reliability)(1 − pass on list 2 reliability) = 0.0 + (0.9)(1 − 0.9) = (0.9)(0.1) = 0.09
Path 4: there are no pass on list paths, so the pass on list 2 reliability = 0.0
Pass on list reliability = 0.09 + (0.9)(1 − 0.0) = 0.09 + 0.9 = 0.99
Network reliability = 0.0 + [(0.9)(0.9)(0.9)(0.9)](1 − 0.99) = (0.6561)(0.01) = 0.006561
Now repeat the process for the rest of the paths in the original path list.
Original path list
1, 2, 3, 6
1, 2, 5, 6
1, 4, 3, 6
1, 4, 5, 6
Path 1, 2, 5, 6
Pass on list
4, 3
4
Pass on list with the overlapped path removed
4
Path 4
Reliability = 0.9
Network reliability = 0.006561 + [(0.9)(0.9)(0.9)(0.9)](1 − 0.9) = 0.006561 + (0.6561)(0.1) = 0.072171
Original path list
1, 2, 3, 6
1, 2, 5, 6
1, 4, 3, 6
1, 4, 5, 6
Path 1, 4, 3, 6
Pass on list
5
Path 5
Reliability = 0.9
Network reliability = 0.072171 + [(0.9)(0.9)(0.9)(0.9)](1 − 0.9) = 0.072171 + (0.6561)(0.1) = 0.137781
Original path list
1, 2, 3, 6
1, 2, 5, 6
1, 4, 3, 6
1, 4, 5, 6
Path 1, 4, 5, 6
Network reliability = 0.137781 + [(0.9)(0.9)(0.9)(0.9)](1 − 0.0) = 0.137781 + 0.6561 = 0.793881
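The pass-on recursion just traced generalizes to any list of minimal path sets. The sketch below (function name illustrative) computes the probability that at least one path has all of its nodes up, assuming independent nodes of equal reliability p; it reproduces 0.793881 here and 0.80919 for the example of Appendix 6C:

# Pass-on recursion from the trace: each path contributes its own reliability
# times the probability that none of the passed-on remainders (nodes the later
# paths need beyond this one) are fully up.
def union_reliability(node_sets, p=0.9):
    total = 0.0
    for i, s in enumerate(node_sets):
        remainders = [t - s for t in node_sets[i + 1:]]
        # drop overlapped remainders (proper supersets of another remainder)
        remainders = [r for r in remainders if not any(q < r for q in remainders)]
        total += (p ** len(s)) * (1.0 - union_reliability(remainders, p))
    return total

print(union_reliability([{1, 2, 3, 6}, {1, 2, 5, 6}, {1, 4, 3, 6}, {1, 4, 5, 6}]))
# approximately 0.793881, as obtained by hand above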
The network reliability is 0.793881, which is identical to its value in Figure 6B.1. The algorithm has been implemented successfully through hand calculations.

APPENDIX 6C: OVERLAP INGRESS–EGRESS RELIABILITY ALGORITHM APPLIED, EXAMPLE 2

Using the network shown in Figure 6C.1, the following is a simple example of the overlap technique to calculate the reliability by determining the minimal paths for the network. Determine the ingress and egress nodes for the network. Create a paths list, an assertions list, and a working path list. Add the ingress node to the working path list. Include an index in the node to denote the current link, with an initial value of −1. Include an indexed list of links to all other nodes. Continue adding nodes until the egress node is reached, ignoring any nodes that are already in the working path.
FIGURE 6C.1 Five-node example and its overlap reliability for s = 1, t = 5.
Paths:
Ingress Node 1
Egress Node 5
Working Path: Node 1 Link Index: 0 Links: 0–1, 2 1–1, 3 2–1, 4
Node 2 Link Index: 1 Links: 0–2, 1 1–2, 4 2–2, 5
Node 4 Link Index: 1 Links: 0–4, 1 1–4, 3 2–4, 5
Node 3 Link Index: 2 Links: 0–3, 1 1–3, 4 2–3, 5
Node 5 Link Index: −1
Add the path to the paths list and create the assertions list.

Assertions: Network Type: Parallel–Series
Node   Always Follows   Order   Lead Nodes   Follow Nodes
1      2, 3, 4, 5       1
2      3, 4, 5          2
3      5                4
4      3, 5             3
5                       5
Inspect the link nodes and find the next path.
Paths: 1, 2, 4, 3, 5
Ingress Node 1
Egress Node 5
Working Path: Node 1 Link Index: 0 Links: 0–1, 2 1–1, 3 2–1, 4
Node 2 Link Index: 1 Links: 0–2, 1 1–2, 4 2–2, 5
Node 4 Link Index: 2 Links: 0–4, 1 1–4, 3 2–4, 5
Node 5 Link Index: −1
Add the path to the paths list and update the assertions list.

Assertions: Network Type: Parallel–Series
Node   Always Follows   Order   Lead Nodes   Follow Nodes
1      2, 3, 4, 5       1
2      3, 4, 5          2
3      5                4       4            5
4      3, 5             3
5                       5
The network is still a parallel–series network. Inspect the link nodes and find the next path.
Paths:
1, 2, 4, 3, 5
1, 2, 4, 5
Ingress Node 1
Egress Node 5
Working Path: Node 1 Link Index: 0 Links: 0–1, 2 1–1, 3 2–1, 4
Node 2 Link Index: 2 Links: 0–2, 1 1–2, 4 2–2, 5
Node 5 Link Index: −1
Add the path to the paths list and update the assertions list.

Assertions: Network Type: Parallel–Series
Node   Always Follows   Order   Lead Nodes   Follow Nodes
1      2, 3, 4, 5       1
2      3, 4, 5          2
3      5                4       4            5
4      3, 5             3       2            5
5                       5
The network is still a parallel–series network. Inspect the link nodes and find the next path.
Paths:
1, 2, 4, 3, 5
1, 2, 4, 5
1, 2, 5
Ingress Node 1
Egress Node 5
Working Path: Node 1 Link Index: 1 Links: 0–1, 2 1–1, 3 2–1, 4
Node 3 Link Index: 1 Links: 0–3, 1 1–3, 4 2–3, 5
Node 4 Link Index: 1 Links: 0–4, 1 1–4, 2 2–4, 5
Node 2 Link Index: 1 Links: 0–2, 1 1–2, 5
Node 5 Link Index: −1
Add the path to the paths list. Since node 4 precedes node 2 in the current path and node 2 precedes node 4 in the first path in the paths list, the network is now considered complex. Inspect the link nodes and find the next path.
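The order criterion just applied, two nodes appearing in opposite orders in two paths, can be checked directly; the following is a minimal sketch (names illustrative) of that test:

# The network stops being parallel-series when two nodes occur in opposite
# orders in two recorded paths.
def opposite_orders(path_a, path_b):
    pos_a = {n: i for i, n in enumerate(path_a)}
    pos_b = {n: i for i, n in enumerate(path_b)}
    common = [n for n in path_a if n in pos_b]
    return any((pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
               for x in common for y in common)

print(opposite_orders([1, 3, 4, 2, 5], [1, 2, 4, 3, 5]))  # True: 2 and 4 swap order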
Paths:
1, 2, 4, 3, 5
1, 2, 4, 5
1, 2, 5
1, 3, 4, 2, 5
Ingress Node 1
Egress Node 5
Working Path: Node 1 Link Index: 1 Links: 0–1, 2 1–1, 3 2–1, 4
Node 3 Link Index: 1 Links: 0–3, 1 1–3, 4 2–3, 5
Node 4 Link Index: 2 Links: 0–4, 1 1–4, 2 2–4, 5
Node 5 Link Index: −1
Inspect the link nodes and find the next path.
Paths:
1, 2, 4, 3, 5
1, 2, 4, 5
1, 2, 5
1, 3, 4, 2, 5
1, 3, 4, 5
Ingress Node 1
Egress Node 5
Working Path: Node 1 Link Index: 1 Links: 0–1, 2 1–1, 3 2–1, 4
Node 3 Link Index: 2 Links: 0–3, 1 1–3, 4 2–3, 5
Node 5 Link Index: −1
Inspect the link nodes and find the next path.
Paths:
1, 2, 4, 3, 5
1, 2, 4, 5
1, 2, 5
1, 3, 4, 2, 5
1, 3, 4, 5
1, 3, 5
Ingress Node 1
Egress Node 5
Working Path: Node 1 Link Index: 2 Links: 0–1, 2 1–1, 3 2–1, 4
Node 4 Link Index: 1 Links: 0–4, 1 1–4, 3 2–4, 5
Node 3 Link Index: 1 Links: 0–3, 1 1–3, 5
Node 5 Link Index: −1
Inspect the link nodes and find the next path.
Paths:
1, 2, 4, 3, 5
1, 2, 4, 5
1, 2, 5
1, 3, 4, 2, 5
1, 3, 4, 5
1, 3, 5
1, 4, 3, 5
Ingress Node 1
Egress Node 5
Working Path: Node 1 Link Index: 2 Links: 0–1, 2 1–1, 3 2–1, 4
Node 4 Link Index: 2 Links: 0–4, 1 1–4, 3 2–4, 5
Node 5 Link Index: −1
Inspect the link nodes and find that all the links have been followed. Thus, the following is the path list:
1, 2, 4, 3, 5
1, 2, 4, 5
1, 2, 5
1, 3, 4, 2, 5
1, 3, 4, 5
1, 3, 5
1, 4, 3, 5
1, 4, 5
Calculate the reliability of the network. Remove any paths that are overlapped by any other path. The remaining paths are:
1, 2, 5
1, 3, 5
1, 4, 5
For each path in the path list, calculate the network reliability. Take the first path and compare it to the paths that follow to get the nodes in each path that are not in the first path. Remove any overlapped paths in the pass on list. Now repeat the process for each path in the pass on list and any sub pass on lists. The following shows the calculations to get the system reliability using these steps for the first path in the original list of paths:
Original path list
1, 2, 5
1, 3, 5
1, 4, 5
Pass on list
3
4
Pass on list with the overlapped paths removed
3
4
Path 3
Pass on list 2
4
Path (from the pass on list) 4
Pass on list 2 reliability = 0.9
Reliability = (current reliability for this level) + (path reliability)(1.0 − pass on list 2 reliability) = 0.0 + (0.9)(1 − 0.9) = (0.9)(0.1) = 0.09
Path 4
Pass on list reliability = 0.09 + (0.9)(1.0 − 0.0) = 0.09 + 0.9 = 0.99
Network reliability = 0.0 + [(0.9)(0.9)(0.9)](1.0 − 0.99) = (0.729)(0.01) = 0.00729
Original path list
1, 2, 5
1, 3, 5
1, 4, 5
Pass on list
4
Pass on list with the overlapped paths removed
4
Path 4
Pass on list reliability = 0.0 + (0.9)(1.0 − 0.0) = 0.9
Network reliability = 0.00729 + [(0.9)(0.9)(0.9)](1.0 − 0.9) = 0.00729 + (0.729)(0.1) = 0.08019
Original path list
1, 2, 5
1, 3, 5
1, 4, 5
Network reliability = 0.08019 + [(0.9)(0.9)(0.9)](1 − 0.0) = 0.08019 + (0.729)(1.0) = 0.80919
The network reliability is 0.80919, as verified by Figure 6C.1. The algorithm has been implemented successfully through hand calculations.
EXERCISES

To use the applications and data files, click on "ERBDC" in TWC-Solver on the CD-ROM.

6.1 Assuming that the nodes shown in Figure E5.1(b) have a value of 0.9 and links a perfect availability of 1.0, calculate the s = 1, t = 19 availability using the faster overlap method and Monte Carlo simulation. Compare the execution times and results.

6.2 Using the same topology as in Exercise 6.1, and assuming that the nodes still have 0.9 and links have 0.9 availability, calculate the system's s = 1, t = 19 availability using the overlap method only.

6.3 Assuming that the nodes have 0.9 availability for the seven-node topology shown in Figure 6.10 and links have a perfect availability of 1.0, calculate the s = 1, t = 7 availability using the faster overlap method and the slower hybrid-enumeration technique. Compare the execution times and results.

6.4 Repeat Exercise 6.3 for the eight-node topology in Figure 6.12 for s = 1, t = 8.

6.5 Repeat Exercise 6.3 for the 10-node topology shown in Figure 6.14 for s = 1 and t = 10.

6.6 Using MSS reliability principles, supposing that you have fully up (= 0.7), fully down (= 0.2), and derated (= 0.1) states for any node, and with any fastest method you like, calculate P(UP), P(DER), and P(DOWN) for the seven-node topology in Figure 6.10. Repeat this exercise when fully up (= 0.6), derated (= 0.2), degraded (= 0.15), and fully down (= 0.05).

6.7 Repeat Exercise 6.6 using the eight-node topology in Figure 6.12.

6.8 Repeat Exercise 6.6 using the 10-node topology in Figure 6.14.

6.9 Repeat Exercise 6.6 using the 32-node topology in Figure E6.1.

6.10 Using the overlap algorithmic method, calculate analytically by hand, obeying the overlap rules, the source–target availability for the network shown in Figure E6.10, where s = 1, t = 4.
FIGURE E6.10 Four-node complex network and its overlap reliability with s = 1, t = 4.
6.11 Write a code to simulate the {s, t} network reliability problem. Using a node availability of 0.9 and assuming a perfect link in Figures E6.1 and E6.10, calculate the s–t reliability using your simulation program. You may choose either Monte Carlo or discrete event simulation. Then repeat the exercise by assuming links to have 0.9 availability. (A minimal starting sketch is given after the exercises.)
1, −1, 2, *, −2, 4, *, −3, 3, *, −4, 18, *, −5, 17, *, −6, −29, 6, *, −27, *, +, 7, *, −10, 8, *, −11, *, 12, *, −15, 11, *, −31, *, −16, +, 13, *, − 17, 14, *, −18, *, −23, +, *, *, −12, 9, *, −13, *, 10, *, −14, *, −24, +, 11, *, −15, 12, *, −16, *, −31, +, 13, *, −17, 14, *, −18, *, −23, +, *, *, +, *, *, −7, 7, *, −10, 8, *, −11, *, 12, *, −15, 11, *, −31, *, −16, +, 13, *, −17, 14, *, −18, *, −23, +, *, *, −12, 9, *, −13, *, 10, *, − 14, *, −24, +, 11, *, −15, 12, *, −16, *, −31, +, 13, *, −17, 14, *, −18, *, −23, +, *, *, +, *, +, −22, +, 15, *, −19, *, 16, *, −20, *, −21, +, *, −28, 15, *, −18, 14, *, −17, *, −23, +, 13, *, −16, 12, *, −11, 8, *, −10, *, −15, 11, *, −14, 10, *, −13, *, 9, *, −12, *, −24, +, *, +, 7, *, −6, −27, 6, *, −29, *, +, 17, *, −5, *, −7, +, *, *, −31, 11, *, −14, 10, *, −13, *, 9, *, −12, *, −15, 12, *, −11, *, 8, *, −10, *, +, −24, +, 7, *, −6, −27, 6, *, −29, *, +, 17, *, −5, *, −7, +, *, *, +, *, −22, +, 18, *, −21, *, −19, 16, *, −20, *, +, *, +, *, −8, −26, 5, *, −25, *, +, 6, *, −27, 7, *, −6, 17, *, −5, *, −7, +, 18, *, −4, 3, *, −28, *, −22, +, 15, *, −19, *, 16, *, −20, *, −21, +, *, −10, 8, *, −11, *, 12, *, −15, 11, *, −31, *, −16, +, 13, *, −17, 14, *, −18, *, −23, +, 15, *, −19, 16, *, −20, *, −22, −28, 3, *, −4, *, +, 18, *, −21, *, +, *, *, *, +, −12, 9, *, − 13, *, 10, *, −14, *, −24, +, 11, *, −15, 12, *, −16, *, −31, +, 13, *, −17, 14, *, −18, *, −23, +, 15, *, −19, 16, *, −20, *, −22, −28, 3, *, − 4, *, +, 18, *, −21, *, +, *, *, *, +, *, −29, 17, *, −5, 18, *, −4, 3, *, −28, *, −7, 7, *, −10, 8, *, −11, *, 12, *, −15, 11, *, −31, *, −16, +, 13, *, −17, 14, *, −18, *, -23, +, *, *, −12, 9, *, −13, *, 10, *, −14, *, −24, +, 11, *, −15, 12, *, −16, *, −31, +, 13, *, −17, 14, *, −18, *, −23, +, *, *, +, *, +, −22, +, 15, *, −19, *, 16, *, −20, *, −21, +, *, −6, 7, *, −7, 18, *, −4, 3, *, −28, *, −22, +, 15, *, −19, *, 16, *, −20, *, −21, +, *, −10, 8, *, −11, *, 12, *, −15, 11, *, −31, *, −16, +, 13, *, −17, 14, *, −18, *, −23, +, 15, *, −19, 16, *, −20, *, −22, −28, 3, *, −4, *, +, 18, *, −21, *, +, *, *, *, +, −12, 9, *, −13, *, 10, *, −14, *, −24, +, 11, *, −15, 12, *, −16, *, −31, +, 13, *, −17, 14, *, −18, *, − 23, +, 15, *, −19, 16, *, −20, *, −22, −28, 3, *, −4, *, +, 18, *, −21, *, +, *, *, *, +, *, +, *, +, *, +, *, −9, 7, *, −6, 17, *, −5, 18, *, −4, 3, *, −28, *, −22, +, 15, *, −19, *, 16, *, −20, *, −21, +, *, −29, 6, *, −8, −25, 5, *, −26, *, +, 4, *, −3, *, 3, *, −4, 18, *, −21, −22, 15, *, − 19, *, 16, *, −20, *, +, *, −28, 15, *, −19, 16, *, −20, *, −22, 18, *, −21, *, +, *, +, *, *, +, *, −7, 18, *, −4, −5, 17, *, −29, *, 6, *, −8, − 25, 5, *, −26, *, +, *, 4, *, −3, *, +, 3, *, −28, *, −22, +, 15, *, −19, *, 16, *, −20, *, −21, +, *, +, −10, 8, *, −11, *, 12, *, −15, 11, *, − 31, *, −16, +, 13, *, −17, 14, *, −18, *, −23, +, 15, *, −19, 16, *, −20, *, −22, −28, 3, *, −3, 4, *, −8, −26, 5, *, −25, *, +, *, 6, *, −29, *, 17, *, −5, *, −4, +, *, +, 18, *, −21, *, +, *, *, *, +, −12, 9, *, −13, *, 10, *, −14, *, −24, +, 11, *, −15, 12, *, −16, *, −31, +, 13, *, −17, 14, *, −18, *, −23, +, 15, *, −19, 16, *, −20, *, −22, −28, 3, *, −3, 4, *, −8, −26, 5, *, −25, *, +, *, 6, *, −29, *, 17, *, −5, *, −4, +, *, +, 18, *, −21, *, +, *, *, *, +, −27, 6, *, −8, -25, 5, *, −26, *, +, 4, *, −3, *, 3, *, −4, 18, *, −21, −22, 15, *, −19, *, 16, *, −20, *, +, *, −28, 15, *, −19, 16, *, −20, *, −22, 18, *, −21, *, +, *, +, *, −29, 17, *, −5, *, 18, *, −4, 3, *, −28, *, −22, +, 15, *, −19, *, 16, *, −20, *, −21, +, *, +, *, +, *, +, *, −30, 3, 
*, −3, 4, *, −2, 2, *, −9, *, 7, *, −6, −27, 6, *, −29, *, +, 17, *, −5, *, −7, +, 18, *, −21, −22, 15, *, −19, *, 16, *, −20, *, +, *, −10, 8, *, −11, *, 12, *, −15, 11, *, −31, *, −16, +, 13, *, −17, 14, *, −18, *, −23, +, 15, *, −19, 16, *, −20, *, −22, 18, *, −21, *, +, *, *, *, +, −12, 9, *, −13, *, 10, *, −14, *, −24, +, 11, *, −15, 12, *, −16, *, −31, +, 13, *, −17, 14, *, −18, *, −23, +, 15, *, − 19, 16, *, −20, *, −22, 18, *, −21, *, +, *, *, *, +, *, −8, −26, 5, *, −25, *, +, 6, *, −27, 7, *, −6, 17, *, −5, *, −7, +, 18, *, −21, −22, 15, *, −19, *, 16, *, −20, *, +, *, −10, 8, *, −11, *, 12, *, −15, 11, *, −31, *, −16, +, 13, *, −17, 14, *, −18, *, −23, +, 15, *, −19, 16, *, −20, *, −22, 18, *, −21, *, +, *, *, *, +, −12, 9, *, −13, *, 10, *, −14, *, −24, +, 11, *, −15, 12, *, −16, *, −31, +, 13, *, −17, 14, *, −18, *, − 23, +, 15, *, −19, 16, *, −20, *, −22, 18, *, −21, *, +, *, *, *, +, *, −29, 17, *, −5, 18, *, −7, 7, *, −10, 8, *, −11, *, 12, *, −15, 11, *, − 31, *, −16, +, 13, *, −17, 14, *, −18, *, −23, +, *, *, −12, 9, *, −13, *, 10, *, −14, *, −24, +, 11, *, −15, 12, *, −16, *, −31, +, 13, *, −17, 14, *, −18, *, −23, +, *, *, +, *, −22, +, 15, *, −19, *, 16, *, −20, *, −21, +, *, −6, 7, *, −7, 18, *, −21, −22, 15, *, −19, *, 16, *, −20, *, +, *, −10, 8, *, −11, *, 12, *, −15, 11, *, −31, *, −16, +, 13, *, −17, 14, *, −18, *, −23, +, 15, *, −19, 16, *, −20, *, −22, 18, *, −21, *, +, *, *, *, +, −12, 9, *, −13, *, 10, *, −14, *, −24, +, 11, *, −15, 12, *, −16, *, −31, +, 13, *, −17, 14, *, −18, *, −23, +, 15, *, −19, 16, *, − 20, *, −22, 18, *, −21, *, +, *, *, *, +, *, +, *, +, *, +, *, −4, 18, *, −5, 17, *, −6, −29, 6, *, −8, −25, 5, *, −26, *, +, 4, *, −2, *, 2, *, −9, *, −27, +, *, +, 7, *, −10, 8, *, −11, *, 12, *, −15, 11, *, −31, *, −16, +, 13, *, −17, 14, *, −18, *, −23, +, *, *, −12, 9, *, −13, *, 10, *, − 14, *, −24, +, 11, *, −15, 12, *, −16, *, −31, +, 13, *, −17, 14, *, −18, *, −23, +, *, *, +, *, *, −7, 7, *, −10, 8, *, −11, *, 12, *, −15, 11, *, −31, *, −16, +, 13, *, −17, 14, *, −18, *, −23, +, *, *, −12, 9, *, −13, *, 10, *, −14, *, −24, +, 11, *, −15, 12, *, −16, *, −31, +, 13, *, − 17, 14, *, −18, *, −23, +, *, *, +, *, +, −22, +, 15, *, −19, *, 16, *, −20, *, −21, +, *, +, −28, 15, *, −18, 14, *, −17, *, −23, +, 13, *, −16, 12, *, −11, 8, *, −10, *, −15, 11, *, −14, 10, *, −13, *, 9, *, −12, *, −24, +, *, +, 7, *, −6, −9, 2, *, −2, *, 4, *, −8, −26, 5, *, −25, *, +, *, − 27, +, 6, *, −29, *, +, 17, *, −5, *, −7, +, *, *, −31, 11, *, −14, 10, *, −13, *, 9, *, −12, *, −15, 12, *, −11, *, 8, *, −10, *, +, −24, +, 7, *, −6, −9, 2, *, −2, *, 4, *, −8, −26, 5, *, −25, *, +, *, −27, +, 6, *, −29, *, +, 17, *, −5, *, −7, +, *, *, +, *, −22, +, 18, *, −21, *, −19, 16, *, −20, *, +, *, +, *, +, *, 19, *
FIGURE E6.12 The s = 1, t = 19 Polish notation necessary to encode the topology, then to decode to retrieve the same 19-node network.
6.12 Repeat Exercise 6.1 using the compression algorithm. Retrieving only the Polish notation for s = 1, t = 9, reverse-engineer or decode the 32-node topology in Figure E6.1 by employing the decoding algorithm. Observe the {s = 1, t = 19} Polish notation necessary to decode the 19-node network (Figure E6.12).

6.13 Using Figure E6.13, and assuming s = 1, t = 52, apply Monte Carlo simulation. See Figure E6.14 to verify.
FIGURE E6.13 52-node example with s = 1, t = 52.
6.14 Using Figure 6.21, apply the overlap algorithm on page 273 analytically, step by step, to solve the {s = 7, t = 8} ingress–egress reliability.
FIGURE E6.14 Monte Carlo simulation result for 52-node 78-link telephony network in Figure E6.13 for s = 1, t = 52 with 100,000 runs timed.
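For Exercise 6.11, the following Monte Carlo sketch may serve as a starting point. It treats only the node-failure case with perfect links; the adjacency map shown is the six-node network traced in Appendix 6B (the exercise figures are not reproduced here), so the printed estimate should approach 0.793881:

import random

# Monte Carlo estimate of s-t reliability with independently failing nodes
# and perfect links.
def st_reliability_mc(adj, s, t, p_node=0.9, runs=100_000, seed=1):
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        up = {n for n in adj if rng.random() < p_node}
        if s not in up or t not in up:
            continue
        stack, seen = [s], {s}          # depth-first search over surviving nodes
        while stack:
            n = stack.pop()
            if n == t:
                hits += 1
                break
            for m in adj[n]:
                if m in up and m not in seen:
                    seen.add(m)
                    stack.append(m)
        # if the search exhausts without reaching t, the run counts as a failure
    return hits / runs

adj = {1: [2, 4], 2: [1, 3, 5], 3: [2, 4, 6], 4: [1, 3, 5], 5: [2, 4, 6], 6: [3, 5]}
print(st_reliability_mc(adj, 1, 6))     # close to the exact value 0.793881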
INDEX
Absolute, 47, 97, 140, 151, 243, 271 error loss, 232, 233, 235, 236 penalty error, 112, 113 relative error, 79, 94, 99, 100–102 Accelerated, 116 Acceptance, 112 Accumulated, 64, 80, 82, 93, 117, 181 Accuracy, 48, 78, 80, 98, 99, 100, 110, 112, 113, 118, 129, 143, 155, 162, 189, 206, 257, 262, 253. See also Forecast, accuracy Adaptive maintenance, 131 Aircraft, 45 Algorithm, 18, 92, 170, 177, 215, 241, 244, 275, 284, 303 binary, 259 compression, 257, 258, 260, 264, 267, 269, 270, 258, 306 decoding, 268–271 encryption, 156 enumeration, 281 EBSR, 208, 211 genetic, 303 L-M, 90, 91, 114, 116 MESAT, 187, 189, 205 MLE, 93 NLR, 93 overlap, 282, 283, 307 public key, 158 RBD, 259 reliability, 274, 282, 285, 287, 289, 291, 293, 295, 297–299, 301, 303 SPSS, 82, 93 stack, 269
stopping rule, 171, 173, 174, 175, 184, 187, 215 Alternative, 6, 19, 22, 23, 67, 80, 82, 98, 99, 100, 114, 115, 119–121, 138, 170, 173, 174, 175, 177, 183, 188, 234, 270, 282 Analysis ANOVA, 17 Army Materials Systems, 64 Bayes(ian), 115, 120, 137, 178, 184, 208, 218 component, 258 cost, 173, 197, 202, 203, 212, 214, 223 cost beneÞt, 166, 173, 211 cost of quality, 172 coverage, 225 crypto, 156 data, 153 decision, 165, 166, 210 economic, 197, 228 exploratory data, 79, 96, 196 fault tree, 69 goodness-of-Þt, 223, 224 likelihood, 152 mathematical-statistical, 63 nonparametric, 38 of error processes, 70, 113 posterior, 55 regression, 79, 80, 114 reliability, 71, 258, 262, 304 risk, 122, 124, 132, 137, 162, 166 security, 159 security meter, 120, 135, 137 sensitivity, 187
Trustworthy Computing: Analytical and Quantitative Engineering Evaluation, By M. Sahinoglu Copyright 2007 John Wiley & Sons, Inc.
309
310 Analysis (continued ) sequential, 227 statistical, 62, 67, 163, 174, 215 stopping rule, 223, 224, 229 system, 40 track, 144 tree, 161, 162, 164 vulnerability, 165 Anonymity, 155, 156 Application(s), 7, 8, 17, 46, 106, 125, 130, 131, 138, 156, 167, 174, 183, 204, 205, 215, 216, 221, 229, 233, 243, 245, 253, 258, 260–263, 275, 305 Approach(es) analytical, 67, 251 Bayes(ian), 18, 47, 55, 99, 100, 118, 115, 135, 184, 188, 189, 221 decision-tree (diagram), 122, 138, 142 enumeration, 266, 270, 276, 278 evolutionary, 304 frequency, 143, 144 informative, 112 integral, 25 inverse-transform, 20 K-S, 94 large sample, 103 MESAT, 193 NBD, 153 noninformative, 111, 118 nonsystematic, 120 numerical, 121 Poisson-geometric, 152 prior distribution, 104 privacy meter, 150 probabilistic, 138 qualitative, 121, 155 quantitative, 66, 155, 228 RBD, 258 security meter, 132, 137, 168 shotgun, 206, 208 statistical, 66, 303 testing to death, 173–175, 206 Approximation(s), 45, 71, 116 Arithmetic, 35, 100–102, 110, 112, 113, 158 Arrival, 14, 20, 58, 59, 61, 63, 79, 80, 82, 93, 176–178 Assumption(s), 33, 40, 47, 48, 51, 57, 58, 62, 63, 80, 138, 142, 147, 174, 184, 187, 190, 193, 194, 196, 233 Attack trees, 120, 137 Attribute(d), 75, 113, 120, 207, 215 Audit, 153, 155, 156, 162 Authentication, 143, 155–157, 163
INDEX Availability average, 25 Bayes(ian), 232, 233, 253 calculations, 239, 241 component, 231, 244 expected, 232, 233 long run (term), 25, 75 modeling, 231, 234 network, 231, 233, 234, 239, 244 non, 190 source target, 231 system, 64 Background, 80 Backward, 90, 264 Bathtub curve, 7, 23, 24, 27, 52 Bayes(ian) model(s), 47, 55 Behavior, 47, 48, 53, 79, 99, 174, 187 Bernoulli, 20, 21, 23, 32, 58, 67, 114, 176, 187, 188, 189, 226, 227, 254 Best estimate, 47, 169 Binomial density, 17 distribution, 20, 46, 60, 71, 177 multi, 21 negative, 22, 23, 57, 58, 78, 116, 152, 177–179, 181, 183, 184, 189, 195, 226 processes, 176 quadri, 21 random variable, 21 sampling, 227 setting, 48 type model, 48, 49, 51, 53, 64 Block(s), 72, 158, 188, 199, 257–260, 265, 268 Boundary, 111 Bug(s), 122, 160, 183, 198, 199 Calendar, 48, 52, 61, 80–86, 87, 92, 188, 206 Capability-based attack trees, 120, 154 Catastrophic, 120 Categories, 159 Categorize, 159 Category, 47 Cause(s), 25, 45, 135, 137, 144, 271 Central, 12, 26, 52, 156 limit theory, 67 Change, 39, 40, 44, 49, 55, 190, 236, 254 Characterization, 78, 82 Checkpoint, 101, 174, 215 Class, 35, 49, 55, 63, 66, 97, 130 ClassiÞcation(s), 48, 49, 121 Clock, 52 Clumping, 196
INDEX Cluster(ing), 79, 94, 96, 151–153, 176, 180, 191 Code(s), 51, 82, 123, 137, 151, 156, 170, 173, 177, 206, 210, 235, 242, 272, 308 CoefÞcient, 178, 183, 184, 187, 189, 237 Cold standby, 15 Combination, 62, 156, 160, 161, 178, 180, 266, 276, 277 Comparative, 131 Comparison, 16, 48, 98, 103, 104, 105, 112, 113, 132, 173, 197, 198, 200 Complex(ity), 100, 172, 174, 175, 233, 245, 257, 259, 260, 267, 272, 280–286, 291, 293, 295, 296, 300 measures, 47 metrics, 69 networks, 231, 245, 259, 271, 281, 282, 284, 285, 291, 293, 295 systems, 257–262, 264, 270–272, 304, 305 topology, 272, 281 Computer systems, 68, 304 Conclusion(s), 96, 98, 110, 121, 129, 137, 142, 149, 154, 188, 213, 221, 230, 281 Conditional, 4, 38, 39, 49, 52, 54, 104, 105, 130, 138, 141, 178, 236, 251 ConÞdence, 16, 17, 36, 37, 76, 117, 177, 185, 186, 187, 193, 194, 195, 196, 221, 229 ConÞdentiality, 155 Consensus, 151 Consistency, 81, 98 Consistent, 121, 162 Constant, 9, 17, 20, 23, 51, 65, 71, 79, 97, 99, 100, 106, 110, 111, 122, 123, 130, 147, 168, 178, 180, 181, 184, 188, 190, 208, 220, 230, 244, 255 criticality, 147 deterministic, 130 failure (rate), 24, 37, 43, 69, 72–74, 77, 151 hazard (rate), 7, 34, 49, 52 measures, 99 proportionality, 49 Continuous, 1, 2, 3, 21, 60, 73, 80, 121, 161, 176, 180, 188, 229, 230, 243 Control, 17, 20, 122, 157, 172, 173, 174, 199 Corrective, 131, 252 Correlation, 58, 153, 176, 178, 181, 182, 184, 187, 190, 193 function, 176 Cost analysis, 173, 197, 212, 223, 225 average, 149 beneÞts, 166, 173, 197 budget, 132
311 capital (investment), 123–125, 127, 168–171, 183, 223, 225 coefÞcient, 183 criterion, 198 effective, 132, 172–174, 177, 183, 186, 197, 205, 212, 215, 230 efÞcient, 173, 184 expected, 124, 127, 132, 135, 137, 150, 207, 217 factors, 198 Þxed, 182, 184, 217, 221 improvement, 154 implied, 183 index, 198 maintenance, 131 maximum, 121 model, 183, 195, 196 opportunity, 130 output, 132 parameters, 183, 187–189 projected, 122 redemption, 120 scenarios, 185 shadow, 127 software, 131 testing, 194, 196, 198, 199 utility, 121, 132 variable, 182, 183, 217 Countermeasure (CM), 119, 121–127, 129, 132, 135, 137, 138, 142, 144, 148, 150, 153, 154, 155, 161, 165, 168, 170 lack of (LCM), 121–124, 143, 170 Counting, 47, 57, 58, 150, 173, 176, 177, 187, 188, 195, 207, 272 Coverage, 51, 223, 225 branch(ing), 174–177, 181–184, 188, 195, 196 cost of, 207 data sets, 178, 228 decision, 175 detection, 174 estimation, 228 failure, 173 fault, 173, 200 imperfect, 304 level, 185 minimal, 185, 190, 210 number, 185, 187 reliability, 173, 177 statement, 175 total, 190 white-box, 172 CPU, 12, 39, 40, 48, 52, 53, 54, 55, 73, 75, 77, 79, 80, 81, 188
312 Criteria, 115, 227, 228 Crypto, 156 Data bank, 77 censored, 36, 38, 39, 76 clustered, 80, 82 compaction, 305 complete, 33, 35, 36 correlated, 116 cost, 188 countermeasure, 142 coverage, 178 current, 47 domain, 77 effort-based, 173 empirical, 6, 33, 146, 148, 168, 173 error, 113, 115 exploratory, 79, 96, 196 failure, 33, 39, 48, 67, 76–80, 92–96, 98–100, 152, 252 Þeld, 207, 234 grouped, 34, 35, 39, 40, 73, 77, 80, 95, 114 historical, 234 hybrid, 169 incomplete, 36 insufÞcient, 244 large sample, 238, 244 latent, 183 models, 67, 114 Musa’s sets, 57 nonreplacement, 36 output, 92 qualitative, 127, 169 quantitative, 120, 122, 127 recovery, 253, 304 reliability, 67, 71 repair, 245 repository, 119 resistant, 243 simulated, 79 software, 92 statistical, 143 symmetric, 147 system design, 120 test case based, 174 test, 69 ungrouped, 33–35, 38, 76 VHDL, 198 weekly sets, 82 Death, 36, 39, 46, 173, 174, 175, 206 process, 68 Debugging, 46, 57, 68, 71, 232 Decision, 66, 101, 102, 115, 123, 164, 166, 210, 211, 228
INDEX Decryption, 156 Defective, 10, 22, 173 Defect(s), 7, 48, 54, 55, 57, 62, 63, 257 Degrees of freedom, 15, 16, 17, 98 Delayed, 62, 63, 187, 189 Density, 1, 2, 3, 4, 6, 8, 12, 13, 15, 17, 23, 25, 34, 36, 49, 51, 53, 54, 56, 58, 78, 80, 125, 137, 147, 151, 188, 218, 219, 231–233, 242, 243, 250, 251, 256 Density function, 2, 3, 4, 6, 8, 25, 34, 51, 80, 147, 218, 219, 232, 243, 256 Dependent, 21, 47, 58, 63, 130, 139, 140, 141, 142, 150, 154, 188, 194 Design, 40, 46, 64, 75, 119, 120, 121, 122, 130, 131, 135, 142, 143, 148, 150, 162, 168, 169, 175, 205, 206, 215, 229, 258 Detection, 68, 69, 113, 166, 305 Determination, 161, 196 Device, 11, 14, 73, 77, 122, 143, 144 Diagnostic(s), 98, 190, 195, 196 Discovery, 83, 84, 85, 86, 87 Discrete, 2, 3, 20, 21, 48, 58, 60, 79, 80, 142, 150, 155, 157, 168, 176, 177, 178, 181, 182, 184, 229, 255, 308 Discrimination ratio, 195 Discriminative, 190, 196 Disjoint, 39, 122, 127, 138, 139, 140, 141, 142 Disjointness, 141, 142 Disk, 74 Distribution(s) asymptotic, 31 beta, 17, 18, 180, 181, 187, 234, 252 binomial, 20, 46, 58, 181, 184 compound Poisson, 177 compound(ing), 58, 59, 71, 79, 80, 188, 195 conditional, 104, 105, 178, 236, 251 cumulative, 3, 6, 98 discrete, 226 double, 12 empirical, 94, 98 Erlang(ian), 13–15 extreme value, 24, failure, 4, 5, 24, 33, 35, 49, 55, 62 frequency, 79, 82, 92, 96, 191 function, 6, 18, 176, 180, 195, 233, 234 gamma, 13, 14, 18, 237 geometric, 58, 80, 215 half-normal, 113 hyperexponential, 12 hypergeometric, 70 hyperprior, 104 joint, 56, 59, 218, 249 limiting, 188
INDEX logarithmic-series (LSD), 58, 173 176, 177, 187, 188, 195, 207 log-normal, 27 marginal, 56, 59, 178, 180 mixture, 12 multinomial, 21 negative binomial (NBD), 60, 71, 174–177, 179, 184, 195 negative exponential, 11–13, 20, 21, 50, 80, 115, 168 noninformative, 100 nonparametric, 33 normal, 25–28, 30, 104, 105, 243 Pareto, 56 Poisson, 20, 58, 71, 147, 168, 177, 216 Poisson-geometric, 58, 70, 76, 82 Poisson-logarithmic, 76 posterior, 47, 55, 56, 180, 233, 236, 249 power-function, 13 power-law, 12 prior, 47, 55, 56, 102, 104, 113, 178, 180, 227, 247 probability, 52, 66, 67, 98, 150, 151, 176, 180, 193–196, 252 skewed, 187 statistical, 1, 26, 32, 48, 120, 150, 233 survival, 116 three-parameter beta (SL), 233, 235 truncated, 25 type-I, 30, 31 type-II, 30, 31 type-III, 30, 31 uniform, 53, 105 univariate, 227, 253 Weibull, 23, 24, 31 Down times, 233, 236, 239, 243, 244, 247, 249 Duane (model), 47, 64, 68, 69 Duration, 76, 77 Dynamic, 130, 131, 150, 174, 197, 257 Economic, 123, 172, 174, 177, 182, 187, 189, 196, 197, 204, 207, 211, 215, 221 Effect(ive), 24, 40, 52, 55, 78, 79, 99, 134, 142, 185, 187, 196 avalanche, 146 cause and, 208 domino, 207 logarithmic, 97 ripple, 123, 127, 130 saturation, 189 side, 123, 131 Effort, 137, 179, 183, 184, 205 based, 48, 172, 173, 176, 178, 229
313 discrete, 182 domain, 46, 177, 182, 188, 189 errors, 208 testing, 206, 208, 210 unit, 179 Empirical, 120, 142, 184, 187, 188, 195, 205, 218, 221 Bayes(ian), 47, 67, 69, 114, 116, 165, 168, 174–177, 184, 187, 188, 195, 205–208, 218, 220, 225, 229, 252 data, 33, 146, 148, 173 distribution, 94, 95, 98 rule(s), 206, 227 software testing, 205 Encryption, 143, 156, 166, 170 Ceasar’s, 157 El-Gamal, 158 public key, 157, 158 RSA, 157 Error, 11, 93–103, 117, 135, 136, 150, 167, 194, 205–208, 210–213, 223–225, 235, 243, 251 common mode, 75 detection, 63, 68, 113, 259, 323 detection rate, 3 loss, 232, 233, 235, 236 measurement, 26 mode, 75 post facto, 211 predictive, 97, 103 relative, 79, 80, 83–85, 100, 102 round off, 117 sampling, 11 squared, 47, 79, 80, 94, 98, 178, 181, 232, 233, 236–239, 243, 251 standard, 91, 106 type, 1, 103 vector, 89 Estimation, 3, 4, 38, 47, 55, 63, 94, 113, 149, 150, 154, 193, 232, 233 Bayes(ian), 17, 47, 177, 195, 235 coverage, 195, 228 density, 71 error of, 237 interval, 96 lack of privacy, 150 least-squares, 90, 91 max likelihood, 79, 80, 96 method(s), 33, 36, 101 model, 63 nonlinear, 91 of failures, 57, 78 of hazard, 34
314 Estimation (continued ) parameter, 47, 67, 79, 80, 82, 92, 93, 96–99, 114, 142, 149, 226, 252 prior, 248 procedures, 33, 100 process, 193 reliability, 33, 56, 67, 79, 113–115, 259 regression, 82 risk, 154 statistical, 17, 143, 174 under, 121 Estimator asymptotic, 233, 238, 244 availability, 232, 233, 246 Bayes(ian), 178, 181, 219, 231, 234–239, 241, 243, 244, 251, 252 Kaplan–Meier, 39 large sample, 244 maximum likelihood, 50, 81 of an attack, 143 nonparametric, 37 small sample, 238 type I, 37, type II, 37, unavailability, 232 unbiased, 37 Evaluation, 1, 68, 78, 114, 115, 119, 172, 226, 227, 231, 257, 304 Event(s), 7, 20, 45, 78–80, 93, 122, 123, 127, 132–134, 142, 143, 148–152, 155, 156, 162, 168, 175–178, 215, 244, 255, 262, 308 Expectation, 49, 173, 196, 219, 238 Expected, 46, 47, 51–53, 57, 76, 82, 121, 127, 154, 175, 183, 184, 187, 190, 193, 194, 207, 232–234, 243, 244, 256, 267 availability (un), 232–234 errors (number of), 190 estimator, 183, 207, 217 failures (faults), 50, 63, 73, 79, 81, 179, 182, 217–219 life, 6 load (loss of), 55 loss (cost of), 124, 125, 127, 131, 132, 135, 137, 142, 147, 149, 150, 168, 169 output, 125 repair (cost of), 124 risk (residual), 125, 137 time, 50 value, 20, 22, 59, 77, 81, 125, 178, 181, 182, 216, 220, 236, 256 Experiment(al), 22, 39, 76, 150, 152, 175, 184, 196 Explanatory, 146
INDEX Exploratory, 79, 96, 196 Exponential, 1, 11–15, 20, 21, 24, 31, 36, 37, 49, 50–52, 55, 60–64, 73, 77, 79–82, 92, 97, 147, 167, 168, 195, 217, 248, 256 Exposure, 151, 162 Extended, 1, 47, 63 Factor(s), 27, 45, 46, 63, 105, 123–126, 127, 130, 137, 158, 159, 176, 184, 190, 195, 198, 207, 208, 215 Failure application, 131 chance, 120, 135, 137 clumping of, 51, 78 common-mode, 44, 45, 72 component, 18, 33 constant, 24 coverage, 174 count(ing), 93, 100, 116, 226, 47, 50, 59, 62–64, 78, 79 cumulative, 64 data, 100, 114, 115, 226, 245, 251, 252, 33, 39, 67, 76–80, 92, 93, 98, 99 date, 48 density, 49, 53, 54 detection, 63 distribution, 4, 5, 33, 35, 49, 62 epoch, 57 events, 256 Þnite, 52 grouped, 95 hardware, 170 index, 57 information, 35 injection, 47 instantaneous failure, 6, 45 intensity, 46, 49, 50, 51, 53, 54, 57, 61, 63–65, 69, 73, 76 intentional, 120 malicious, 123, 135, 137 mean time to, 5, 33, 77, 151, 221 model, 30, 49, 50, 63, 64 modes of, 24 number of, 21, 22, 37, 46, 48, 50, 52, 53, 57, 62, 64, 73, 78, 82, 88, 89, 177, 100, 179, 182, 184, 186, 189, 194, 197, 216, 219, 220 power, 170 probability, 20, 43, 74, 151, 259 process, 93, 195 random, 173 rate(s), 5, 6, 24, 37, 42–45, 50, 55, 56, 58, 63, 64, 71–75 remaining, 182
INDEX residual, 98 size, 174 software, 47, 57, 71, 78, 79, 94, 96, 99, 122, 165, 176, 177, 182, 184, 193, 221 system, 14, 43, 45 time(s), 6, 33, 34, 39, 49, 52, 53, 55–58, 62, 64, 72, 77, 220 time to, 6, 14, 15, 25, 31, 33, 49, 57 Fatigue, 24, 27 Fault(s) arrival, 176 content, 80 correction, 55 coverage, 175, 200, 223 detection, 51, 52, 55, 73 expected number of, 63 hazard rate, 64 injection, 47 number of faults, 51, 184, 193, 223 remaining, 185, 193, 216, 223 removal, 70 seeding, 47 size, 207 simulation, 226 spreading model, 47 software, 69, 177, 216 tolerant(ce), 47, 69, 70, 156, 225, 253, 305 tree(s), 47, 69 Files, 116, 229 Final, 149, 171, 261 Finite, 5, 26, 31, 48, 49, 52, 53, 62, 63, 152, 235 Forecast accuracy, 67, 99, 115, 227, 252 quality, 79, 91, 94, 98, 113 Function(s) beta, 18 characteristic, 178, 179 correlation, 176 criterion, 104 cumulative density, 3, 4, 8 cumulative distribution, 6, 51, 98 decreasing, 27, 53, 217, 219 discrete, 155 distribution, 17, 18, 67, 234 expected value, 220 exponential, 217 failure rate, 64 hazard (rate), 2, 6, 23, 28, 34, 40, 47, 49, 52, 54, 64, 73, 76, 77 intensity, 6, 49, 50, 51, 54, 57, 63, 64, 65 joint density, 4 likelihood, 4, 81 linear, 64
  logarithmic, 53
  loss, 47, 56, 116, 178, 181, 219, 233–238, 251, 252
  mathematical-statistical, 1
  mean value, 46, 49–54, 62–65, 73
  moment generating, 232, 235, 249
  Pareto, 11, 13
  penalty, 234, 235, 245
  power, 11, 13
  prior, 244
  probability density, 1–3, 6, 8, 25, 34, 51, 80, 147, 218, 232, 243, 256, 304
  probability distribution, 52, 98, 151, 176, 180, 195, 233, 252
  probability mass, 8
  reliability, 2, 4, 11, 15, 23, 25, 26, 34, 38, 76, 280
  security meter, 130
  statistical, 4
  survival, 38, 77, 151
  type-I, 31
  type-III, 31
  weight, 237, 238, 251
G3B, 231, 232, 235
Gamma, 1, 17–19, 61, 219, 233–235, 238, 249
  density, 13, 15, 219, 242, 243
  distribution, 13, 14, 18, 237
  family, 56, 247, 249
  models, 243, 244
  multiplier, 110
  pdf, 15, 62, 150, 218, 233, 234
  plots, 244, 245
  posterior, 55
  prior, 18, 55–57, 221–224, 232, 240, 241, 248
  variables, 231, 232
Gauss, 91
Generalized, 52, 57, 73, 79, 92, 97, 121, 140, 142, 176, 181, 182, 187, 189, 232, 233, 234
  beta, 181, 182, 187, 189, 252, 253
  compound, 57, 79
  exponential, 92, 97
  gamma, 232, 233
  Goel–Okumoto NHPP, 73
  multivariate beta distribution, 234
  Poisson, 71, 92, 97
Geometric, 21–23, 50, 55, 57–60, 76–82, 93, 152, 188, 189, 215, 216, 218, 219, 221–224, 226
Goodness-of-fit, 46, 71, 80, 94, 98, 99, 174, 187–191, 200, 203, 223, 224
Graphical(ly), 161, 230, 258, 259, 262, 270, 281
Graphs, 98, 162
Hardware, 1, 22, 46, 64, 143, 149, 173, 174, 188, 196, 200, 208, 231–233, 243, 247
Hazard, 2, 5, 6, 7, 11, 23, 25, 28, 30, 34, 47, 48, 49, 50, 52, 53, 54, 55, 64, 72, 73, 75–77, 170
  rate, 5–7, 11, 23, 30, 49, 50, 52, 54, 55, 64, 72
Histogram, 35, 91
Homogeneous, 48, 62–64, 65, 151
Hyperexponential, 12, 62–64, 240, 244
Hypothesis, 16, 99, 103, 110–112, 193, 194
Identical, 14, 21, 31, 40, 42–44, 48, 55, 56, 60, 67, 73, 75, 95, 137, 149, 152, 153, 156, 176, 183, 194, 240–243, 266, 267, 298
IID, 14, 30, 31, 152
  N-, 21
  non, 26, 176, 188, 240, 242
  variables, 26, 67
Improper, 105
Independence, 138
Independent, 4, 11, 14, 15, 16, 21, 26, 28–31, 40, 45, 46, 48, 55, 57–59, 67, 73, 75, 82, 138–140, 150, 152, 175–178, 188–190, 194, 195, 233
  exponential, 14, 55, 152
  failure, 59
  increments, 176, 190, 195
  N-, 4, 15, 26, 29–31
  non, 67
  Poisson process, 175, 176
  S, 189
  statistically, 40
  threats, 40
  time, 46, 150, 176, 178, 188, 189
Infeasibility, 187
Inference, 150, 151, 153, 184, 189, 233
Infinite, 25, 50, 53, 54, 64, 76, 111, 235, 238
Informative, 98, 99, 112, 113, 232–236, 243, 244
Initial value, 98, 221, 285, 298
Integration, 236
Integrity, 155
Intensity, 6, 46, 49–51, 53, 54, 57, 61, 63–65, 73, 76, 80
Interfailure, 52, 53, 55, 62, 64
Interruptions, 176, 194, 195, 216, 218, 219, 220
Interval, 2, 3, 6, 8, 35, 36, 37, 39, 46, 58, 59, 63, 76, 80, 81, 82, 93, 96, 152, 174, 182, 183, 207, 217–220
  confidence, 16, 17, 36, 65, 93, 245
  CPU seconds, 73
  i(th), 12, 39, 182, 183, 207
  k, 35
  M-O Poisson, 96
  n-, 100
  of integration, 3
  1000-second, 40
  t, 2, 6, 217, 219, 220
  testing, 182
  time, 35, 39, 42, 46, 48, 58, 59, 81, 100, 183, 217, 218, 227
Inverse, 10, 11, 13, 18, 20, 22, 24, 26, 232, 240, 248, 255
Iterative, 98
Jelinski–Moranda, 46, 47, 49, 68
Job, 121, 126
Joint probability, 218
Key, 46, 156–159, 195
Knowledge, 104, 105, 162
Kolmogorov–Smirnov, 79, 80
  K-S, 80, 94, 96–98
Latent, 183
Law, 12, 26, 123, 130, 144, 150
  of large numbers, 26, 28
Learning, 63
Least, 42, 57, 75, 77, 82, 90, 91, 96, 98, 118, 119–121, 137, 143, 149, 151, 156, 168, 175, 185, 186, 193, 194, 205, 206, 220, 221, 237, 278, 279
  squares, 57, 82, 90, 91, 237
Life cycle, 147, 167, 205
Likelihood, 4, 8, 50, 79, 80, 81, 96, 151, 152, 154, 162, 206, 208, 233, 250
Limitations, 44, 45, 70, 115, 228
Linear, 34, 57, 64, 82, 116
Link(s), 254, 271, 275, 276, 282, 283, 286–288, 290–297, 299, 306, 307, 308
  and nodes, 301–303
  connecting, 260, 272
  imperfect, 262
  index, 283, 285–302
  perfect, 256
  reliability, 254, 256, 257
  weakest, 40
List, 1, 160, 190, 244, 260, 276, 277, 284–291, 293, 295–302, 304
Littlewood, 47, 48, 55, 57, 68–70, 115
Load, 55
Logistic, 28, 29, 228, 229
Long-term, 75
Loss, 47, 55, 56, 123–127, 131–137, 142, 147–150, 162, 167, 169, 178, 181–185, 197, 219, 232–239, 243, 244, 252, 253. See also Function
Marginal, 4, 56, 57, 59, 178, 180, 218
Markov, 21, 58, 60, 67, 80, 114, 184, 188, 189, 197, 226, 227, 228
Matrix, 91, 188, 259
Maximum likelihood, 81
Mean(s), 2, 5, 9, 14, 17, 20, 25–27, 30, 33, 35, 40, 45–54, 58, 60, 62–65, 73, 75, 77, 79, 80, 90, 94–106, 110–113, 121, 135, 137, 147, 151, 167, 168, 187, 190, 194, 195, 219, 221, 233, 235, 236, 240, 244, 245, 247, 256, 257, 271. See also Arithmetic
  squared error, 79, 80, 94, 98, 99
  time to crash, 147, 167
  time to failure, 5, 33, 46, 57, 221
  time to repair, 25, 75
  value function, 46, 49–54, 62–65, 73
Measure, 24, 33, 41, 46, 55, 65, 94, 97, 98, 103, 112, 121–123, 126, 131, 175, 244
Measurement(s), 26, 68, 94, 100, 113, 114, 116, 120, 162, 193, 229
Median, 2, 5, 10, 13, 16, 27, 28, 47, 232–236, 240
Memoryless, 21, 219, 220
Metrics, 68, 116
Minimal, 177, 185–187, 190, 195, 196, 210, 229, 251, 274, 285, 298
Minimal coverage, 177
Minimum, 5, 10, 30, 162, 174, 185, 201–203, 236, 251, 274, 275, 282, 285
Mode, 2, 10, 13, 15, 17, 28, 29, 43, 45, 72, 75, 235
Model(s), 12, 17, 27, 54, 55, 56, 73, 76, 80, 92, 93, 94, 98, 99, 105, 121, 122–124, 128, 129, 131, 138, 146–148, 151, 161, 168, 172–175, 176, 183, 184, 189, 193, 196, 197, 199, 207
  AMSAA, 64, 65
  Bayesian, 47, 103, 104
  Bell–LaPadula, 155
  Biba, 155
  binomial, 48
  Chinese wall, 155
  Clark–Wilson, 155
  compound Poisson, 78, 79, 195
  decision tree, 119
  Duane's, 64
  failure, 31, 53
  failure-counting, 47
  Goel–Okumoto nonhomogeneous Poisson, 50, 52
  Harrison–Ruzzo–Ullman, 155
  Howden's, 194
  Jelinski–Moranda de-eutrophication, 49, 80
  Littlewood–Verrall Bayesian, 55
  Modified exponential software reliability, 64
  Moranda's geometric, 50, 80
  Musa's basic execution time, 49, 52
  Musa–Okumoto logarithmic Poisson execution time, 53, 96, 97
  Poisson, 48
  Poisson geometric, 81, 82
  power, 64
  quantitative security meter, 142–145, 150, 155
  Rayleigh, 52
  reliability, 46–48, 62, 100
  Sahinoglu's compound Poisson geometric, 58
  Sahinoglu's compound Poisson logarithmic series, 57
  Sahinoglu–Libby probability, 234
  Schick–Wolverton, 64
  Schneidewind, 51
  static, 47
  time between failures, 46
  time-domain, 48
  TTD, 120, 162, 163
  Weibull, 52
  Yamada's delayed and Ohba's inflection S and hyperexponential, 62, 63
Moment(s), 5, 29, 60, 184, 216, 232, 235, 236, 243, 249
Mortality, 6, 7
MTBF, 50
MTTF, 5, 25, 33, 37, 42, 44, 45, 57, 65, 72–77, 221, 223, 224
MTTR, 25, 75
Multiplication rule, 138, 139
Musa, 47, 52, 53, 57, 68, 69, 70, 71, 73, 76, 79, 80, 93, 96, 101, 113–115, 195, 196, 217, 220, 221, 228, 229
Musa–Okumoto, 47, 53, 73, 76, 79, 80, 93, 96, 101
Mutation testing, 165, 208, 226
Network, 67, 165, 166, 240, 253, 255, 258, 259, 268, 271, 287, 288, 290, 297–300, 303–305
Neural networks, 258
Node, 160, 161, 254–256, 258, 259, 260, 262–264, 266–276, 280–296, 298–302, 305–308
  follow, 283
  ingress, 262, 270, 274
Node (continued)
  lead, 283
  root, 160, 161
Nondisjointness, 120, 169
Nonhomogeneous, 47, 48, 52, 53, 62, 79, 80, 151, 152, 173, 176, 188, 195
Nonparametric, 33, 37, 38, 40, 72, 73, 76, 94, 96
Nonrepudiation, 155
Null hypothesis, 193
Numerical, 51, 94, 105, 120, 128, 155, 233, 235
Occurrence(s), 2, 22, 63, 148, 168, 174, 176, 189, 223, 224
One step, 173
Operating, 11, 14, 33, 38, 52, 55, 64, 72, 208, 232, 243, 256, 263, 275, 279
  system, 72
Operation(s), 46, 64, 73–75, 121, 127, 156, 157, 161, 170, 215, 259, 261, 269
  modulus, 157
Optimal, 51, 170, 177, 185, 211, 215, 216, 230
Optimization, 165, 304
Optimum, 70
Order statistics, 34
Output(s), 41, 98, 122, 125, 132, 205, 243, 256, 280, 281
Package, 116
Parameter estimation, 47, 67, 79, 80, 89, 96, 98, 114, 226, 252
Performance, 40, 96, 97, 98, 131
Phase, 12, 25, 63, 82, 175, 206, 208, 210
Poisson, 20, 21, 32, 47, 48, 49, 50, 52, 53, 54, 55, 62–69, 70, 71, 73, 76–82, 92, 93, 96, 97, 101, 114, 115, 147, 148, 151–154, 168, 184, 187–190, 195, 197, 198, 206, 207, 215–229, 252
  compound, 47, 57–60, 78, 79, 91, 93, 96, 97, 152, 193, 216, 223
  distribution, 20, 59, 168, 177
  geometric, 57, 59, 80, 81, 152, 188, 189, 218, 223
  Musa–Okumoto logarithmic series, 47, 53, 80
  nonhomogeneous process (NHPP), 47–50, 53, 78, 80, 151, 152, 173, 176
  random numbers, 20
  Sahinoglu's compound geometric, 57
Population, 29, 151, 245
Prediction(s), 48, 49, 63, 70, 78–80, 94, 98, 99, 103, 113, 115, 118, 147
Predictive, 46, 48, 78, 80, 97, 99, 100, 103, 110, 112, 113, 118
Privacy, 143, 150–157, 164, 165, 167
Probability density, 6, 232
Probability mass, 8
Process(es), 10, 12, 14, 25, 27, 48, 49, 50, 52, 53, 55–62, 63–65, 71, 78–82, 164, 172–178, 188, 193–196, 200, 205, 213, 218–221, 239, 243, 258, 262, 270, 275, 282, 296, 297, 302
  Bernoulli, 176
  homogeneous (HPP), 151
  Markovian birth and death, 46
  random, 49, 52
  verification, 174
Product(s), 10, 12, 27, 40, 51, 59, 71–73, 82, 125, 130, 172–175, 190, 193, 194, 205–207, 214, 218, 219, 229, 233, 241
  limit, 38, 39
Program, 48, 53–55, 63, 76, 93, 106, 117, 156, 170, 176, 177, 183, 193, 206, 216, 217, 221, 241, 308
Qualitative, 113, 119–121, 127–129, 138, 155, 168
Quality, 17, 20, 40, 73, 79, 82, 94, 98–100, 113, 172–175, 193, 194, 214
Quantitative, 103, 104, 115, 119–122, 127–129, 132, 138, 142, 150, 154, 155, 162, 169, 173
Random, 8, 10, 18, 24, 28, 49, 57, 58, 59, 60, 77, 99, 101, 105, 120, 122, 125, 129, 130, 137, 147, 148, 150, 151, 158, 159, 167, 168, 173, 178, 179, 180, 194, 195, 206, 215, 216, 217, 218, 243, 245, 246, 247, 249, 253–255
  deviate(s), 11, 14, 23
  number generation(s), 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 66
  number(s), 8, 10, 11, 13–32
  sampling, 4
  variable, 4, 10, 14–17, 19–22, 25–27, 29, 33, 55, 56, 79, 81, 97, 100, 103, 123, 174, 176–181, 187, 188, 194, 216, 217, 232, 234–236, 238, 239, 244, 249, 251
Rank deficient, 91
Rate, 5, 6, 7, 9, 12, 18, 20, 23, 24, 27, 37, 42, 43, 44, 45, 48–59, 63, 64, 65, 71–75, 77, 151, 167, 176, 178, 194, 232, 233, 247, 249, 256
  defect, 51, 63
  detection, 51, 52, 63, 68, 73, 113
  failure, 56, 64, 71–75, 77, 151, 229, 231–233, 247, 255
  outage, 18, 55, 232, 249
  Poisson, 20, 178
  recovery, 18, 232
  repair, 55, 231, 233, 243, 244, 247, 249, 256
  time-dependent error-detection, 68, 113
Rayleigh, 23, 32, 52
Recovery, 18, 232
Recursive, 38, 235, 280, 284
Reduction, 53, 92, 217
Redundancy, 1, 15, 40–42, 72, 75, 76, 156, 263
Regression, 82, 93, 114, 116, 227
Regression testing technique, 176
Relative, 2, 41, 79, 80, 83–87, 92–94, 98–102, 130, 144, 150
  frequency, 2, 144, 150
Reliability, 1, 2, 4, 7, 8, 11, 15, 16, 23, 24, 25, 26, 28, 31, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 62, 63, 64, 65, 68, 69, 71, 72, 73, 75, 76, 77, 78, 79, 80, 99, 100, 101, 113, 118, 120, 151, 169, 172, 173, 174, 177, 188, 189, 193, 194, 195, 221, 253, 256–263, 265–275, 277, 280, 281, 283, 284, 296, 297, 298, 302, 303, 308
  Bayes(ian), 241, 252
  block diagramming, 258, 265
  engineering, 2, 8, 47
  equation, 7, 44
  function, 2, 4, 5, 8, 11, 15, 23, 26, 34, 38, 47, 115
  hardware, 45, 46
  management, 121, 122, 150, 152, 166
  residual, 122, 124–126, 130, 132, 137, 146–150, 170, 171
  scenarios, 120
  security, 119, 121, 123, 128, 137, 138, 148, 149, 154, 155
  series, 40
  software, 45, 46, 48, 49, 55, 62, 64, 78–80, 97, 99, 101, 113, 118, 151, 172, 193, 195, 221
Renewal, 152
Residual, 46, 57, 62, 78, 82, 92, 98, 122–127, 129, 130, 132, 137, 146–150, 169–171
Response, 151, 188, 226, 251, 304
Reward, 184
Risk, 119–138, 141, 143, 146–150, 153–155, 161, 162, 164–171, 195, 205, 236, 240, 251
Run, 25, 72, 76, 121, 130, 186, 187, 194, 208
Safety, 68, 123, 155, 170, 253, 258, 304
Sample, 4, 6, 21, 26, 29–33, 46, 82, 95–97, 100, 101, 103, 110–112, 117, 122, 123, 125, 127, 150, 164, 178, 221, 232–234, 238–240, 244–247, 251, 270, 274
Sampling, 4, 70, 79, 80, 102, 103, 148, 165, 189, 195, 196, 197, 227, 228
  error, 11, 150
  plans, 11, 143
Security, 1, 119–123, 128–132, 135, 137, 138, 142–144, 146–151, 153–155, 160–164, 167–169, 173, 190, 229, 258–259, 281
Semiquantitative, 119
Simulation, 1, 11, 105, 106, 120, 123–125, 127, 129, 130, 132–142, 147–150, 167, 168, 255, 259, 308
  defect, 48
  discrete event, 147, 148, 150
  Monte Carlo, 105, 120, 123–125, 127, 129, 130, 134, 137, 138, 142, 147, 148, 150, 168, 184, 254, 255, 256, 259, 304, 306, 308
  program, 310
Software, 1, 10, 40, 45–49, 51–57, 62–64, 68–73, 76, 77–82, 92–96, 99–101, 113, 118, 120–123, 130–137, 143, 149–153, 167, 169–177, 181–184, 188–190, 193–198, 205–207, 213–221, 232, 233, 272, 281, 282
  engineering, 68, 114, 165, 166, 172, 176, 193, 213, 215, 225, 228, 229, 304
  environment, 52
  failure(s), 47, 48, 57, 71, 78, 79, 81, 93, 96, 100, 101, 122, 174, 176, 177, 181, 182, 194, 216, 221
  fault(s), 47, 49, 69, 131, 177, 216, 220
  maintenance, 120, 121, 131–133, 135, 137, 164, 165, 168, 225, 227
  module, 72, 73
  reliability, 1, 46, 99, 120, 193
Standard deviation, 95, 96, 97, 103, 163
Standby, 14, 15, 16, 43, 44
Statistical, 1, 2, 4, 6, 17, 25, 26, 32, 33, 43, 46, 48, 49, 62, 63, 66–68, 70, 98–100, 112–115, 120, 121, 129, 138, 142–145, 150, 151, 154, 162, 172–174, 189, 193, 195, 196, 206, 215, 226–228, 234, 253, 254, 306
Statistics, 11, 17, 34, 46, 66, 67, 79, 94–98, 114–116, 122, 132, 162, 165, 168, 215, 227–229, 252–254, 305
Stochastic, 48, 67, 70, 78, 98–103, 105, 111, 113–115, 151, 167, 188, 253
Stochastic processes, 71, 227
Storage, 281, 283
Sum of squares, 15, 35
System, 1, 2, 4, 6, 8, 10, 12, 14–16, 18, 20, 22, 24–27, 30, 31, 40–45, 55, 56, 64, 67, 68, 72, 75, 76, 114, 120, 122, 123, 130–136, 143, 156, 159, 161, 163, 166, 168–170, 175, 194, 208, 226, 227, 240, 241, 253, 259, 260–262, 265–268, 271–273, 277–279, 280–283, 286, 298, 304, 305
System analysis, 40
System reliability, 2, 4, 6, 40, 41, 43–45, 258–260, 266, 269, 275, 280, 284, 296
Systems, 21, 31, 41, 42, 44, 45, 64, 65, 67–72, 74, 114, 121, 131, 149, 156, 160, 162, 164, 166, 168, 193, 215, 225, 228, 242, 253, 254, 259–264, 266–274, 276, 278, 280, 282, 284, 286, 288, 290, 292, 294, 296, 298, 300, 302, 304, 305
Test, 33, 36, 37, 38, 48, 63, 64, 67–69, 72, 75–77, 83–87, 94, 103, 112–118, 172–178, 182–190, 194–198, 202, 205–215, 223, 225–229, 252, 253, 286, 287
  goodness-of-fit, 209
  Kolmogorov–Smirnov (K-S), 80, 95
  white-box, 175
Testing, 1, 10, 16, 33, 35–37, 39, 40, 48, 58, 62–70, 75, 78–80, 82, 97, 102, 110, 113–115, 120, 121, 142, 149, 164, 168, 172–177, 182–190, 193–200, 202, 205–210, 224–230, 233, 252, 257, 258, 305
Theorem, 16, 26, 57, 58, 180
Threat, 121–129, 132, 135–146, 149, 150, 165, 167, 168, 170, 171, 196
Time to defeat (model), 120, 154, 162–164
Time to failure, 5, 6, 14, 15, 25, 31, 33, 49, 57, 221
Transform (inverse), 10, 11, 13, 20, 22, 24, 28, 31, 128
Tree diagram, 121, 122, 123–128, 132, 133, 138, 139–143, 146, 147, 167, 168
Trend(s), 50, 51, 98, 100, 111, 112, 151, 152, 172, 176, 282
Unavailability, 233, 236, 239, 243, 245–249
Unavailable, 36
Unbiased, 37, 67
Unpredictable, 7
Unreliability, 75
Uptime, 18, 231
User, 113, 123, 130, 156, 158, 184, 200, 216, 218, 259, 261, 264, 269
Utilization, 1, 173
Validation, 69, 98, 196, 227, 228
Variance, 16, 17, 20, 21, 26, 27, 29, 35, 58, 60, 80, 95–97, 104, 105, 111, 112, 152, 189, 236
Variation(s), 34, 50, 55, 100, 236
Venn diagram, 140, 141
Vulnerability(ies), 121, 124–130, 132, 134–138, 139, 141, 142, 143, 144, 145, 146, 150, 164, 165, 167, 168, 169–171
Web, 12
Wireless, 259
Yamada, 47, 69, 113