ffirs.qxd
3/3/2009
6:01 PM
Page i
PRACTICAL SYSTEM RELIABILITY
ffirs.qxd
3/3/2009
6:01 PM
Page ii
IEEE Press 445 Hoes Lane Piscataway, NJ 08855 IEEE Press Editorial Board Lajos Hanzo, Editor in Chief R. Abari J. Anderson S. Basu A. Chatterjee
T. Chen T. G. Croda M. El-Hawary S. Farshchi
B. M. Hammerli O. Malik S. Nahavandi W. Reeve
Kenneth Moore, Director of IEEE Book and Information Services (BIS) Jeanne Audino, Project Editor Technical Reviewers Robert Hanmer, Alcatel-Lucent Kime Tracy, Northeastern Illinois University Paul Franklin, 2nd Avenue Subway Project Simon Wilson, Trinity College, Ireland
ffirs.qxd
3/3/2009
6:01 PM
Page iii
PRACTICAL SYSTEM RELIABILITY
Eric Bauer Xuemei Zhang Douglas A. Kimber
IEEE Press
A JOHN WILEY & SONS, INC., PUBLICATION
ffirs.qxd
3/3/2009
6:01 PM
Page iv
Copyright © 2009 by the Institute of Electrical and Electronics Engineers, Inc. Published by John Wiley & Sons, Inc., Hoboken, New Jersey. All rights reserved. Published simultaneously in Canada. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission. Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representation or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. For general information on our other products and services please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002. Wiley also publishes its books in a variety of electronic formats. Some content that appears in print, however, may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com. Library of Congress Cataloging-in-Publication Data is available. ISBN 978-0470-40860-5 Printed in the United States of America. 10 9 8 7 6 5 4 3 2 1
ffirs.qxd
3/3/2009
6:01 PM
Page v
For our families, who have supported us in the writing of this book, and in all our endeavors
ftoc.qxd
3/3/2009
6:04 PM
Page vii
CONTENTS
Preface Acknowledgments 1 Introduction
xi xiii 1
2 System Availability 2.1 Availability, Service and Elements 2.2 Classical View 2.3 Customers’ View 2.4 Standards View
5 6 8 9 10
3 Conceptual Model of Reliability and Availability 3.1 Concept of Highly Available Systems 3.2 Conceptual Model of System Availability 3.3 Failures 3.4 Outage Resolution 3.5 Downtime Budgets
15 15 17 19 23 26
4 Why Availability Varies Between Customers 4.1 Causes of Variation in Outage Event Reporting 4.2 Causes of Variation in Outage Duration
31 31 33
5 Modeling Availability 5.1 Overview of Modeling Techniques 5.2 Modeling Definitions 5.3 Practical Modeling 5.4 Widget Example 5.5 Alignment with Industry Standards
37 38 58 69 78 89
6 Estimating Parameters and Availability from Field Data 6.1 Self-Maintaining Customers 6.2 Analyzing Field Outage Data 6.3 Analyzing Performance and Alarm Data
95 96 96 106 vii
ftoc.qxd
3/3/2009
viii
6:04 PM
Page viii
CONTENTS
6.4 6.5 6.6
Coverage Factor and Failure Rate Uncovered Failure Recovery Time Covered Failure Detection and Recovery Time
107 108 109
7 Estimating Input Parameters from Lab Data 7.1 Hardware Failure Rate 7.2 Software Failure Rate 7.3 Coverage Factors 7.4 Timing Parameters 7.5 System-Level Parameters
111 111 114 129 130 132
8 Estimating Input Parameters in the Architecture/Design Stage 8.1 Hardware Parameters 8.2 System-Level Parameters 8.3 Sensitivity Analysis
137 138 146 149
9 Prediction Accuracy 9.1 How Much Field Data Is Enough? 9.2 How Does One Measure Sampling and Prediction Errors? 9.3 What Causes Prediction Errors?
167 168 172 173
10 Connecting the Dots 10.1 Set Availability Requirements 10.2 Incorporate Architectural and Design Techniques 10.3 Modeling to Verify Feasibility 10.4 Testing 10.5 Update Availability Prediction 10.6 Periodic Field Validation and Model Update 10.7 Building an Availability Roadmap 10.8 Reliability Report
177 179 179 206 208 208 208 209 210
11 Summary
213
Appendix A System Reliability Report outline 1 Executive Summary 2 Reliability Requirements 3 Unplanned Downtime Model and Results Annex A Reliability Definitions Annex B References Annex C Markov Model State-Transition Diagrams
216 215 217 217 219 219 220
Appendix B Reliability and Availability Theory 1 Reliability and Availability Definitions 2 Probability Distributions in Reliability Evaluation 3 Estimation of Confidence Intervals
221 221 228 237
ftoc.qxd
3/3/2009
6:04 PM
Page ix
CONTENTS
ix
Appendix C Software Reliability Growth Models 1 Software Characteristic Models 2 Nonhomogeneous Poisson Process Models
245 245 246
Appendix D Acronyms and Abbreviations
263
Appendix E Bibliography
265
Index
279
About the Authors
285
fpref.qxd
3/3/2009
6:06 PM
Page xi
PREFACE
T
HE RISE OF THE INTERNET,
sophisticated computing and communications technologies, and globalization have raised customers’ expectations of powerful “always on” services. A crucial characteristic of these “always on” services is that they are highly available; if the customer cannot get a search result, or order a product or service, or complete a transaction instantly, then another service provider is often just one click away. As a result, highly available (HA) services are essential to many modern businesses, such as telecommunications and cable service providers, Web-based businesses, information technology (IT) operations, and so on. Poor service availability or reliability often represents real operating expenses to service providers via costs associated with: 앫 Loss of brand reputation and customer good will. Verizon Wireless proudly claims to be “America’s most reliable wireless network” (based on low ineffective attempt and cutoff transaction rates), whereas Cingular proudly claims “Fewest dropped calls of any network.” Poor service availability can lead to subscriber churn, a tarnished brand reputation, and loss of customer good will. 앫 Direct loss of customers and business. Failure of an online provisioning system or order entry system can cause customers to be turned away because their purchase or order cannot be completed. For instance, if a retail website is unavailable or malfunctioning, many customers will simply go to a competitor’s website rather than bothering to postpone their purchase and retrying later. 앫 Higher maintenance-related operating expenses. Lower reliability systems often require more maintenance actions and raise xi
fpref.qxd
3/3/2009
xii
6:06 PM
Page xii
PREFACE
more alarms. More frequent failures often mean more maintenance staff must be available to address the higher volume of maintenance events and alarms. Repairs to equipment in unstaffed locations (e.g., outdoor base stations) require additional time and mileage expenses to get technicians and spare parts to those locations. 앫 Financial penalties or liquidated damages due to subscribers/ customers for failing to meet service availability or “uptime” contractual requirements or service level agreements (SLAs). This practical guide explains what system availability (including both hardware and software downtime) and software reliability are for modern server, information technology or telecommunications systems, and how to understand, model, predict and manage system availability throughout the development cycle. This book focuses on unplanned downtime, which is caused by product-attributable failures, rather than planned downtime caused by scheduled maintenance actions such as software upgrades and preventive maintenance. It should be noted that this book focuses on reliability of mission-critical systems; human-lifecritical systems such as medical electronics, nuclear power operations, and avionics demand much higher levels of reliability and availability, and additional techniques beyond what is presented in this book may be appropriate. This book provides valuable insight into system availability for anyone working on a system that needs to provide high availability. Product managers, system engineers, system architects, developers, and system testers will all see how the work they perform contributes to the ultimate availability of the systems they build. ERIC BAUER XUEMEI ZHANG DOUGLAS A. KIMBER Freehold, New Jersey Morganville, New Jersey Batavia, Illinois February 2009
flast.qxd
3/4/2009
8:54 AM
Page xiii
ACKNOWLEDGMENTS
We thank Abhaya Asthana, James Clark, Randee Adams, Paul Franklin, Bob Hanmer, Jack Olivieri, Meena Sharma, Frank Gruber, and Marc Benowitz for their support in developing, organizing and documenting the software reliability and system availability material included in this book. We also thank Russ Harwood, Ben Benison, and Steve Nicholls for the valuable insights they provided from their practical experience with system availability. E.B. X.Z. D.K.
xiii
c01.qxd
2/8/2009
5:19 PM
CHAPTER
Page 1
1
INTRODUCTION
Meeting customers’ availability expectations for an individual product is best achieved through a process of continuous improvement, as shown in Figure 1.1. The heart of the process is an architecture-based, mathematical availability model that captures the complex relationships between hardware and software failures and the system’s failure detection, isolation, and recovery mechanisms to predict unplanned, product-attributable downtime (covered in Chapter 5). In the architecture or high-level design phase of a product release, parameters for the model are roughly estimated based on planned features, producing an initial availability estimate to assess the feasibility of meeting the release’s availability requirements (covered in Chapter 8). In the system test phase, updated modeling parameters (such as hardware failure rate calculations, software failure rate estimations from lab data, and measured system parameters) can be used in the model to produce a revised availability estimate for the product (covered in Chapter 7). After the product is deployed in commercial service, outage data can be analyzed to calculate actual rate of outage-inducing software and hardware failures, outage durations, and so on; these actual values can be used to better calibrate modeling parameters and the model itself (covered in Chapter 6). If there is a gap between the actual field availability and the product’s requirements, then a roadmap of availabilityimproving features can be constructed, and the availability prediction for the next release is produced by revising modeling parameters (and the model itself, if significant architectural changes are made) to verify feasibility of meeting the next release’s availability requirements with the planned feature set, thus closing the loop (covered in Chapter 10). Practical System Reliability. By Eric Bauer, Xuemei Zhang, and Douglas A. Kimber Copyright © 2009 the Institute of Electrical and Electronics Engineers, Inc.
1
c01.qxd
2/8/2009
2
5:19 PM
Page 2
INTRODUCTION
Construct Road map of Availability— Improving Features to Close Any Gap
Estimate Availability from Lab Data and Analysis
Figure 1.1. Managing system availability.
The body of this book is organized as follows: 앫 Chapter 2, System Availability, explains the classical, service providers’ and TL 9000 views of availability. 앫 Chapter 3, Conceptual Availability Model, explains the relationship between service-impacting failure rates, outage durations and availability. 앫 Chapter 4, Why Availability Varies between Customers, explains why the same product with identical failure rates can be perceived to have different availability by different customers. 앫 Chapter 5, Modeling Availability, explains how mathematical models of system availability are constructed; an example is given. 앫 Chapter 6, Estimating Parameters from Field Data, explains how system availability and reliability parameters can be estimated from field outage data. 앫 Chapter 7, Estimating Input Parameters from Lab Data, explains how modeling input parameters can be estimated from
c01.qxd
2/8/2009
5:19 PM
Page 3
INTRODUCTION
앫
앫 앫
앫 앫
앫 앫 앫 앫 앫
3
lab data to support an improved availability estimate before a product is deployed to customers (or before field outage data is available). Chapter 8, Estimating Input Parameters in Architecture/Design Stage, explains how modeling input parameters can be estimated early in a product’s development, before software is written or hardware schematics are complete. Good modeling at this stage enables one to verify the feasibility of meeting availability requirements with a particular architecture and high-level design. Chapter 9, Prediction Accuracy, discusses how much field data is enough to validate predictions and how accurate availability predictions should be. Chapter 10, Connecting the Dots, discusses how to integrate practical software reliability and system availability modeling into a product’s development lifecycle to meet the market’s availability expectations. Chapter 11, Summary, summarizes the key concepts presented in this book, and the practical ways those concepts may be leveraged in the design and analysis of real systems. Appendix A, Sample Reliability Report Outline, gives an outline for a typical written reliability report. This explains the information that should be included in a reliability report and provides examples. Appendix B, Reliability and Availability Theory Appendix C, Software Reliability Growth Models Appendix D, Abbreviations References Index
c02.qxd
2/8/2009
5:20 PM
CHAPTER
Page 5
2
SYSTEM AVAILABILITY
There is a long history of so-called “Five-9’s” systems. Five-9’s is shorthand for 99.999% service availability which translates to 5.26 down-minutes per system per year. Telecommunications was one of the first areas to achieve Five-9’s availability, but this Five9’s expectation is now common for telecommunications, missioncritical servers, and computing equipment; in some cases, customers expect some individual elements to exceed 99.999%. Many telecommunications Web servers, and other information technology systems routinely exceed 99.999% service availability in actual production. The telecommunications industry, both service providers and equipment manufacturers, tailored the ISO 9000 quality standard to create the TL 9000 standard. More specifically, TL 9000 was created by the Quality Excellence for Suppliers of Telecommunications (QuEST) Forum. The QuEST Forum is a consortium of telecommunications service providers, suppliers, and liaisons* that is dedicated to advancing “the quality, reliability, and performance of telecom products and services around the world.” TL 9000 gives clear and formal rules for measuring the reliability and availability of servers and equipment that supports the Internet Protocol (IP) and a wide variety of telecommunications and computing center equipment. TL 9000 defines a number of metrics and the associated math and counting rules that enable tracking of very specific quality, reliability, and performance aspects of a wide variety of products. The metric names consist of a few letters that define the area being measured, along with a number to distinguish between similar metrics within that area. For example, *At the time of this writing, the QuEST forum membership included more than 25 service providers, more than 80 suppliers, and over 40 liaisons. Practical System Reliability. By Eric Bauer, Xuemei Zhang, and Douglas A. Kimber Copyright © 2009 the Institute of Electrical and Electronics Engineers, Inc.
5
c02.qxd
2/8/2009
6
5:20 PM
Page 6
SYSTEM AVAILABILITY
there is a metric called “SO4,” which is the fourth type of metric that deals with “system outages.” Because both equipment manufacturers and users (i.e., telecommunications service providers) defined TL 9000, it offers a rigorous and balanced scheme for defining and measuring reliability and availability. TL 9000 is explicitly applicable to many product categories used with all IP-based solutions and services, including: 앫 IP-based Multimedia Services (Category 1.2.7), including video over IP, instant messaging, voice features, multimedia communications server, media gateway 앫 Enhanced Services Platforms and Intelligent Peripherals (Category 6.1) like unified/universal messaging 앫 Network Management Systems, both online critical (Category 4.2.1), such as network traffic management systems, and online noncritical (Category 4.2.2) such as provisioning, dispatch, and maintenance 앫 Business Support Systems (Category 4.2.3), such as the inventory, billing records, and service creation platforms 앫 General Purpose Computers (Category 5.2) such as terminals, PCs, workstations, and mini-, mid-, and mainframes 앫 All networking equipment such as network security devices (Category 6.6), routers (Categories 1.2.9 and 6.2.7), PBX’s (Category 6.4), and virtually all equipment used by communications service providers TL 9000 reliability measurements are broadly applicable to most IP- and Web-based services and applications, and, thus, this book will use TL 9000 principles as appropriate. There are two parts to the TL 9000 Standard: a Requirements Handbook and a Measurements Handbook. The Measurements Handbook is the standard that is most applicable to the topics covered in this book because it defines the metrics for how to measure system availability and software reliability. Applicable TL 9000 principles will be explained, so no previous knowledge of TL 9000 is necessary.
2.1
AVAILABILITY, SERVICE, AND ELEMENTS
TL 9000’s Quality Measurement Systems Handbook, v4.0, defines availability as “the ability of a unit to be in a state ready to per-
c02.qxd
2/8/2009
5:20 PM
Page 7
2.1
AVAILABILITY, SERVICE, AND ELEMENTS
7
form a required function at a given instant in time or for any period within a given time interval, assuming that the external resources, if required, are provided.” Thus, availability is a probability, the probability that a unit will be ready to perform its function, and, like all probabilities, it is dimensionless. Practically, availability considers two factors: how often does the system fail, and how quickly can the system be restored to service following a failure. Operationally, all hardware elements eventually fail because of manufacturing defects, wear out, or other phenomena; software elements fail because of residual defects or unforeseen operational scenarios (including human errors). Reliability is typically defined as the ability to perform a specified or required function under specific conditions for a stated period of time. Like availability, reliability may be expressed as a percentage. It is the probability that a unit will be able to perform its specified function for a stated period of time. Availability and reliability are often confused, partly because the term reliability tends to be used when availability is what was really intended. One classic example that helps distinguish reliability from availability is that of an airplane. If you want to fly from Chicago to Los Angeles, then you want to get on a very reliable plane, one that has an extremely high probability of being able to fly for the 4 to 5 hours the trip will take. That same plane could have a very low availability. If the plane requires 4 hours worth of maintenance prior to each 4 hour flight, then the plane’s availability would only be 50%. High-availability systems are designed to automatically detect, isolate, alarm, and recover from inevitable failures (often by rapidly switching or failing over to redundant elements) to maintain high service availability. A typical design principle of socalled “high availability” systems is that no single failure should cause a loss of service. This design principle is often referred to as “no single point of failure.” As complex systems may be comprised of multiple similar or identical elements, it is often useful to distinguish between service availability and element availability. Service is generally the primary functionality of the system or set of elements. Some services are delivered by a single, stand-alone element; other services are delivered by a set of elements. A simple example of these complementary definitions is a modern commercial airliner with two jet engines. If a single jet engine fails (an element failure), then
c02.qxd
2/8/2009
8
5:20 PM
Page 8
SYSTEM AVAILABILITY
propulsion service remains available, albeit possibly with a capacity loss, so the event is not catastrophic; nevertheless, this element failure is certainly an important event to manage. If the second jet engine fails before the first jet engine has been repaired, then a catastrophic loss of propulsion service occurs. Conceptually, element availability is the percentage of time that a particular element (e.g., the jet engine) is operational; service availability is the percentage of time that the service offered by one or more elements (e.g., propulsion) is operational. As clustered or redundant architectures are very common in high availability systems and services, clearly differentiating service availability from element availability is very useful. TL 9000 explicitly differentiates these two concepts as network element outages (e.g., product-attributable network element downtime tracked by the TL 9000 NEO4 metric) versus product-attributable service downtime (tracked by the SO4 metric). Unless otherwise stated, this book focuses on service availability.
2.2
CLASSICAL VIEW
Traditionally, systems were viewed as having two distinct states: up and down. This simplifying assumption enabled the following simple mathematical definition of availability: Uptime MTTF Availability = ᎏᎏᎏ = ᎏᎏ Uptime + Downtime MTTF + MTTR
(2.1)
Mean time to failure (MTTF) often considered only hardware failures and was calculated using well-known hardware prediction methods like those described in the military standard MIL-HDBK-STD-217F or the Telcordia telecommunications standard BR-SR-332 (also known as Reliability Prediction Procedure, or RPP). Section 1 in Appendix B illustrates the definition of MTTF in mathematical format, and shows its relationship with the reliability function. Mean time to repair (MTTR) was often assumed to be 4 hours. Although this calculation did not purport to accurately model actual system availability, it did represent a useful comparison value, much like Environmental Protection Agency (EPA) standard gas mileage in the United States. An added benefit is that this definition is very generic and can easily
c02.qxd
2/8/2009
5:20 PM
Page 9
2.3
CUSTOMER’S VIEW
9
be applied across many product categories, from military/aerospace to commercial/industrial and other fields. This classical view has the following limitations: 앫 Hardware redundancy and rapid software recovery mechanisms are not considered yet are designed into many modern high-availability systems so that many or most failures are recovered so rapidly that noticeable outages do not occur. 앫 Service repair times vary dramatically for different failures. For instance, automatic switchovers are much faster than manual repairs, and recovering catastrophic backplane failures often takes significantly longer than recovering from circuit pack failures. 앫 Many complex systems degrade partially, rather than having simple 100% up and completely down states. For instance, in a digital subscriber line (DSL) access element, a single port on a multi-line DSL card can fail (affecting perhaps < 1% of capacity), or one of several multiline DSL cards can completely fail (affecting perhaps 10% of capacity), or the aggregation/backhaul capability can completely fail (affecting perhaps 100% of capacity). Clearly, loss of an entire (multiline) DSL access element is much more severe than the loss of a single access line. Thus, sophisticated customers generally take a more refined view of availability. 2.3
CUSTOMERS’ VIEW
Sophisticated customers often take a more pragmatic view of availability that explicitly considers actual capacity loss (or capacity affected) for all service disruptions. As sophisticated customers will typically generate trouble tickets that capture the percentage of users (or the actual number of users) that are impacted and the actual duration for service disruptions, they will often calculate availability via the following formula: Availability = In-service time – ⌺Outage events Capacity loss × Outage duration ᎏᎏᎏᎏᎏᎏᎏᎏ In-service time (2.2)
c02.qxd
2/8/2009
10
5:20 PM
Page 10
SYSTEM AVAILABILITY
In-service time is the amount of time the equipment was supposed to be providing service. It is often expressed in system minutes. Capacity loss is the percentage of provisioned users that are impacted by the outage (or, alternatively, the number of users). Outage duration is typically measured in seconds or minutes. Equation 2.2 prorates the duration of each outage by the percentage of capacity lost for that outage, and then adds all the outages together before converting the outage information to availability. As an example, consider a home location register (HLR) database system that stores wireless subscriber information on a pair of databases. The subscriber information is evenly allocated between the two servers for capacity reasons. If one of the database servers incurs a 10 minute outage, then half of the subscribers will be unable to originate a call during that 10 minute interval. If that was the only outage the HLR incurred during the year, then the annual availability of the HLR is: Availability = 1 year – (50% capacity loss × 10 min downtime) ᎏᎏᎏᎏᎏᎏ 1 year
(2.3)
525960 – 5 = ᎏᎏ = 99.999% 525960 This works out to be 99.999%. Notice that in this example the availability was calculated for an entire year.* Other periods could be used, but it is customary to use a full year. This is primarily because downtime, which is the inverse of the availability, is typically expressed in minutes per year.
2.4
STANDARDS VIEW
The QuEST Forum has standardized definitions and measurements for outages and related concepts in the TL 9000 Quality Management System Measurements Handbook. Key concepts from *This book uses 525,960 minutes per year because when leap years are considered, the average year has 365.25 days, and 365.25 days times 24 hours per day times 60 minutes per hour gives 525,960 minutes. It is acceptable to use 525,600 minutes per year, thus ignoring leap years. The important thing is to be consistent—always use the same number of minutes per year.
c02.qxd
2/8/2009
5:20 PM
Page 11
2.4
STANDARDS VIEW
11
the TL 9000 v4.0 Measurement Handbook relevant to software reliability and system availability are reviewed in this chapter. 2.4.1
Outage Attributability
TL 9000 explicitly differentiates product-attributable outages from customer-attributable or other outages. Product-attributable outage. An outage primarily triggered by a) The system design, hardware, software, components or other parts of the system b) Scheduled outage necessitated by the design of the system c) Support activities performed or prescribed by an organization, including documentation, training, engineering, ordering, installation, maintenance, technical assistance, software or hardware change actions, and so on d) Procedural error caused by the organization e) The system failing to provide the necessary information to conduct a conclusive root cause determination f) One or more of the above Customer-attributable outage. An outage that is primarily attributable to the customer’s equipment or support activities triggered by a) Customer procedural errors b) Office environment, for example power, grounding, temperature, humidity, or security problems c) One or more of the above d) Outages are also considered customer attributable if the customer refuses or neglects to provide access to the necessary information for the organization to conduct root cause determination.
As used above, the term “organization” refers to the supplier of the product and its support personnel (including subcontracted support personnel). This book focuses on product-attributable outages. 2.4.2
Outage Duration and Capacity Loss
TL 9000 explicitly combines outage duration and capacity loss into a single parameter:
c02.qxd
2/8/2009
12
5:20 PM
Page 12
SYSTEM AVAILABILITY
Outage Downtime—The sum, over a given period, of the weighted minutes a given population of a system, network element (NE), or service entity was unavailable, divided by the average in-service population of systems, network elements, or service entities.
Crucially, TL 9000 explicitly uses weighted minutes to prorate downtime by capacity lost. 2.4.3
Service Versus Element Outages
As many systems are deployed in redundant configurations to assure high availability, TL 9000 explicitly differentiates service-impacting outage from network-element-impacting outage: Service Impact Outage—A failure in which end-user service is directly impacted. End user service includes but is not limited to one or more of the following: fixed-line voice service, wireless voice service, wireless data service, high-speed fixed access (DSL, cable, fixed wireless), broadband access circuits (OC-3+), narrowband access circuits (T1/E1, T3/E3). Network Element Impact Outage—A failure in which a certain portion of a network element functionality/capability is lost, down, or out of service for a specified period of time.
The Service Impact Outage measurements are designed to assess the impact of outages on end-user service. As such, they look at the availability of the primary function (or service) of the product. The Network Element Impact Outage measurements are designed to assist the service provider in understanding the maintenance costs associated with a particular network element. They include outages that are visible to the end user as well as failure events such as loss of redundancy, which the end user will not see. 2.4.4
Outage Exclusion and Counting Rules
Outages often have variable durations and impact variable portions of system capacity. Thus, as a practical matter it becomes important to precisely agree on which events are significant enough to be counted as “outages” and which events are so transient or so small as to be excluded from consideration as “outages.” Naturally, equipment suppliers often prefer more generous outage exclu-
c02.qxd
2/8/2009
5:20 PM
Page 13
2.4
STANDARDS VIEW
13
sion rules to give a more flattering view of product-attributable service availability, whereas customers may prefer to take a more inclusive view and count “everything.” The TL 9000 measurements handbook 4.0 provides the following compromise for typical systems: All outages shall be counted that result in a complete loss of primary functionality for all or part of the system for a duration greater than 15 seconds during the operational window, regardless of whether the outage was unscheduled or scheduled.
Different services have different tolerances for short service disruptions. Thus, it is important for suppliers and customers to agree on how brief a service disruption is acceptable for automatic failure detection, isolation, and recovery. Often, this maximum acceptable service disruption duration is measured in seconds, but it could be hundreds of milliseconds or less. Service disruptions that are shorter than this maximum acceptable threshold can then be excluded from downtime calculations. Generally, capacity losses of less than 10% are excluded from availability calculations as well. TL 9000 sets specific counting rules by product category. Setting clear agreements on what outages will be counted in availability calculations and what events can be excluded is generally a good idea. 2.4.5
Normalization Factors
Another crucial factor in availability calculations is the so-called “normalization unit.” Whereas “system” and “network element” seem fairly straightforward in general, modern bladed and clustered architectures can be interpreted differently. For example, if a single chassis contains several pairs of blades, each hosting a different application, then should service availability be normalized against just the blades hosting a particular application or against the entire chassis? How should calculations change if a pair of chassis, either collocated or geographically redundant, is used? Since system availability modeling and predictions are often done assuming one or more “typical” configurations (rather than all possible, supported configurations), one should explicitly define this typical configuration(s) and consider what normalization factors are appropriate for modeling and predictions.
c02.qxd
2/8/2009
14
5:20 PM
Page 14
SYSTEM AVAILABILITY
2.4.6
Problem Severities
Different failures and problems generally have different severities. Often, problems are categorized into three severities: critical (sometimes called “severity 1”), major (sometimes called “severity 2”), and minor (sometimes called “severity 3”). TL 9000’s severity definitions are broadly consistent with those used by many, as follows. Critical Critical conditions are those that severely affect the primary functionality of the product and, because of the business impact to the customer, require nonstop immediate corrective action, regardless of time of day or day of the week, as viewed by a customer upon discussion with the organization. They include 1. Product inoperability (total or partial outage) 2. Reduction in capacity capability, that is, traffic/data handling capability, such that expected loads cannot be handled 3. Any loss of emergency capability (for example, emergency 911 calls) 4. Safety hazard or risk of security breach Major Major severity means that the product is usable, but a condition exists that seriously degrades the product operation, maintenance, administration, and so on, and requires attention during predefined standard hours to resolve the situation. The urgency is less than in critical situations because of a lesser immediate or impending effect on problem performance, customers, and the customer’s operation and revenue. Major problems include: 1. Reduction in the product’s capacity (but the product is still able to handle the expected load) 2. Any loss of administrative or maintenance visibility of the product and/or diagnostic capability 3. Repeated degradation of an essential component or function 4. Degradation of the product’s ability to provide any required notification of malfunction Minor Minor problems are other problems of a lesser severity than “critical” or “major,” such as conditions that result in little or no impairment of the function of the system.
c03.qxd
2/8/2009
5:21 PM
CHAPTER
Page 15
3
CONCEPTUAL MODEL OF RELIABILITY AND AVAILABILITY
3.1
CONCEPT OF HIGHLY AVAILABLE SYSTEMS
All systems eventually experience failures because no large software product is ever “bug-free” and all hardware fails eventually. Whereas normal systems may crash or experience degraded service in response to these inevitable failures, highly available systems are designed so that no single failure should cause a loss of service. At the most basic level, this means that all critical hardware is redundant so that there are no single points of failure. Figure 3.1 presents the design principle of highly available systems. The infinite number of potential failures is logically represented on the left side as triggers or inputs to the highly available system. Highly available systems include a suite of failure detectors, typically both hardware mechanisms (e.g., parity detectors and hardware checksums) and software mechanisms (e.g., timers). When a failure detector triggers, then system logic must isolate the failure to a specific software module or hardware mechanism and activate an appropriate recovery scheme. Well-designed highavailability systems will feature several layers of failure detection and recovery so that if the initial recovery was unsuccessful, perhaps because the failure diagnosis was wrong, then the system will automatically escalate to a more effective recovery mechanism. For instance, if restarting a single process does not resolve an apparent software failure, then the system may automatically restart the processor hosting the failed process and, perhaps, evenPractical System Reliability. By Eric Bauer, Xuemei Zhang, and Douglas A. Kimber Copyright © 2009 the Institute of Electrical and Electronics Engineers, Inc.
15
c03.qxd
2/8/2009
16
5:21 PM
Page 16
CONCEPTUAL MODEL OF RELIABILITY AND AVAILABILITY
Figure 3.1. Model of high-availability systems.
tually restart the software on the entire system. Undoubtedly, a human operator is ultimately responsible for any system, and if the system is not automatically recovering successfully or fast enough, the human operator will intervene and initiate manual recovery actions. Figure 3.2 illustrates the logical progression from failure to recovery of a highly available system. The smiley face on the left side of the figure represents normal system operation. The lightning bolt represents the inevitable occurrence of a stability-impacting failure. These major failures often fall into two broad categories: 1. Subacute failures that do not instantaneously impact system performance (shown as “Service Impaired”), such as memory or resource leaks, or “hanging” of required software processes or tasks. Obviously, a resource leak or hung/stuck process will eventually escalate to impact service if it is not corrected. 2. Acute failures that “instantaneously” and profoundly impact service (shown as “Service Impacted”), such as the catastrophic failure of a crucial hardware element like a processor or networking component. An acute failure will impact delivery of at least some primary functionality until the system recovers from
c03.qxd
2/8/2009
5:21 PM
Page 17
3.2
CONCEPTUAL MODEL OF SYSTEM AVAILABILITY
17
Some failures do not immediately impact service, like resource exhaustion (e.g., memory leaks) System Impaired
Normal Operation
Failure
Some failures immediately impact service, like hardware failures of crucial components Some failures cascade or eventually lead to service impact, like process failures when requested required resources are not available (e.g., uncorrected memory leaks eventually cause service impact)
Normal Operation
Service Impacted If service (or “primary functionality”) is impacted for longer than 15 seconds, then event is technically a TL 9000 Service Outage, and thus counts against SO metrics
Figure 3.2. Generic availability-state transition diagram.
the failure (often by switching to a redundant hardware unit or recovering failed software). Highly available systems should detect both acute and subacute failures as quickly as possible and automatically trigger proper recovery actions so that the duration and extent of any service impact is so short and small as to be imperceptible to most or all system users. Different applications with different customers may have different quantitative expectations as to how fast service must be restored following an acute failure for the interruption to be considered acceptable rather than a service outage. Obviously, systems should be architected and designed to automatically recover from failures in less than the customers’ maximum acceptable target time.
3.2
CONCEPTUAL MODEL OF SYSTEM AVAILABILITY
System availability is concerned with failures that produce system outages. Generally speaking, outages follow the high-level flow
c03.qxd
2/8/2009
18
5:21 PM
Page 18
CONCEPTUAL MODEL OF RELIABILITY AND AVAILABILITY
shown in Figure 3.3. A service-impacting failure event occurs, and then the system automatically detects and correctly isolates the failure, raises a critical alarm, and successfully completes automatic recovery action(s). As most high availability systems feature some redundancy, a failure of a redundant component will generally be automatically recovered by switching the service load to a redundant element. Then service is restored, the alarm is cleared when the failed component is repaired or replaced, and the system returns to normal. However, the system could fail to automatically detect the failure fast enough, prompting manual failure detection; and/or the system could fail to indicate the failed unit, prompting manual diagnostics and fault isolation; and/or the system’s automatic recovery action could fail, prompting manual recovery actions. A failure of both a redundant element and automatic failure detection, isolation, and recovery mechanisms so that service is not automatically restored is sometimes called a “double failure.” Outages have three fundamental characteristics: 1. Attributable Cause—The primary failure that causes the outage. Flaws in diagnostics, procedures, customer’s actions, or other causes may slow outage resolution, but prolonging factors are not the attributable cause of the outage itself. 2. Outage Extent—Some percentage of the system is deemed to be unavailable. Operationally, outage extent is generally quan-
Figure 3.3. Typical outage flow.
c03.qxd
2/8/2009
5:21 PM
Page 19
3.3
FAILURES
19
tized as a single user (e.g., a line card on an access element) at a field-replaceable unit (FRU) level (e.g., “10 %”), or the entire system (e.g., “100 %”). Other capacity loss levels are certainly possible, depending on the system’s architecture, design, and deployed configuration. 3. Outage Duration—After the primary failure occurs, the event must be detected, isolated, and recovered before service is restored. In addition to the activities shown in Figure 3.3, logistical delays (such as delays acquiring a replacement part or delays scheduling an appropriately trained engineer to diagnose and/or repair a system) can add significant latency to outage durations. Chapter 4 reviews why outage durations may vary from customer to customer. The following sections provide additional details for the different pieces of the conceptual model.
3.3
FAILURES
Failures generally produce one or more “critical” (by TL 9000 definition) alarms. Because most systems have multiple layers of failure detection, a single failure can eventually be detected by multiple mechanisms, often leading to multiple alarms. On highavailability systems, many of these critical failures will be rapidly and automatically recovered. Service disruptions caused by many alarmed failures may be so brief or affect so little capacity that they may not even be recorded by the customer as an outage event. By analogy, if the lights in your home flicker but do not go out during a thunderstorm, most would agree there was an electricity service disruption, but very few would call that a power outage. Thus, it is useful to differentiate outage-inducing failures (which cause recorded outages) from other failures (which often raise alarms, but may not lead to recorded outages). Failures that produce system outages can generally be categorized by root cause into one of the following: 앫 (Product-Attributable) Hardware—for events resolved by replacing or repairing hardware 앫 (Product-Attributable) Software (Includes Firmware)—for software/firmware outages that are cleared by module, processor, board, or system reset/restart, power cycling, and so on
c03.qxd
2/8/2009
20
5:21 PM
Page 20
CONCEPTUAL MODEL OF RELIABILITY AND AVAILABILITY
앫 Nonproduct-Attributable—for procedural mistakes (e.g., work not authorized, configuration/work errors), natural causes (e.g., lightning, fires, floods, power failures), and man-made causes (e.g., fiber cuts, security attacks) Readers with some familiarity with the subject may be wondering why “procedural error” is not listed as a source of outageinducing failures. It is true that procedural errors, both during the initial installation and during routine system operation, may result in outages. Many factors may contribute to these procedural outages, such as poor training, flawed documentation, limited technician experience, technician workload, customer policies, etc. Additionally, because procedural outages are a result of human interaction, and humans learn and teach others, the procedural outages for a given system typically decrease over time, often dramatically. For these reasons, installation, documentation, and training failures that result in downtime are beyond the scope of this document and will not be addressed. It is often insightful to add second-tier outage classifications identifying the functionality impacted by the outage, such as: 앫 Loss of Service—Primary end-user functionality is unavailable. This is, by definition, a “service outage.” 앫 Loss of Connectivity—Many systems require real-time communications with other elements to access required information such as user credentials or system, network, user or other configuration data. Inability to access this information may prevent service from being delivered to authorized users. Thus, an element could be capable of providing primary functionality, but be unable to authorize or configure that service because of connectivity problems with other elements. 앫 Loss of Redundancy—Many high-availability systems rely on redundancy either within the system or across a cluster of elements to enable high service availability. A failure of a standby or redundant element may not affect service, but it may create a simplex exposure situation, meaning that a second failure could result in service loss. 앫 Loss of Management Visibility—Some customers consider alarm visibility and management controllability of an element to be primary functionality, and, thus, consider loss of visibility to be an outage, albeit not a service outage. After all, if one loses alarm visibility to a network element, then one does not really know if it is providing service.
c03.qxd
2/8/2009
5:21 PM
Page 21
3.3
FAILURES
21
앫 Loss of Provisioning—A system may be fully operational but incapable of adding new subscribers or provisioning changes to existing subscribers. Beyond categorizing failures by root cause, one should also consider the extent of system capacity affected by the failure. On real, in-service elements the extent of failure is often resolved to the number of impacted users or other service units. 3.3.1
Hardware Failures
Hardware, such as components and electrical connections, fails for well-known physical reasons including wearing out and electrical or thermal overstress. Hardware failures are generally resolved by replacing the field-replaceable unit (FRU) containing the failed hardware component, or repairing a mechanical issue (e.g., tightening a loose mechanical connection). Firmware and software failures should be categorized separately because those failures can be resolved simply by restarting some or all of the processors on the affected element. Sometimes, outages are resolved by reseating a circuit pack; while it is possible that the reseating action clears a connector-related issue, the root cause of the failure is often software or firmware. Thus, product knowledge should be applied when classifying failures. Note that the failure mitigation, such as a rapid switchover to a redundant element, is different from the failure cause, such as hardware or software failure. For example, one cannot simply assume that just because service was switched from active element to standby element, that the hardware on the active element failed; a memory or resource leak could have triggered the switchover event and some software failures, like resource leaks, can be recovered by switching to a standby element. Hardware failure rate prediction is addressed by several standards. A detailed discussion of these standards is provided in Chapter 5, Section 5.5.1. Additional information on calculating hardware failure rates is provided in Chapter 7, Section 7.1. 3.3.2
Software Failures
Software and firmware failures are typically resolved by restarting a process, processor, circuit pack, or entire element. Also, sometimes running system diagnostics as part of troubleshooting may happen to clear a software/firmware failure because it may force a
c03.qxd
2/8/2009
22
5:21 PM
Page 22
CONCEPTUAL MODEL OF RELIABILITY AND AVAILABILITY
device or software back to a known operational state. Note that operators typically “resolve” software or firmware failures via restart/reboot, rather than “fix” them by installing patched or upgraded software; eventually, software failures will be “fixed” by installing new software, changing system configuration or changing a procedure. Thus, whereas a residual defect may be in version X.01 of a product’s software (and fixed in version X.02), the duration of an outage resulting from this defect will last only from the failure event until the software is restarted (nominally minutes or seconds), not until version X.02 is installed (nominally weeks or months). Occasionally, software failures are addressed by correcting a system configuration error or changing system configuration or operational procedures to work around (avoid) a known failure mode. Software outages in the field are caused by residual software defects that often trigger one of multiple types of events: 1. Control Flow Error—The program does not take the correct flow of control path. Some examples are executing the “if” statement instead of the “else,” selecting the wrong case in a switch statement, or performing the wrong number of iterations in a loop. 2. Data Error—Due to a software fault, the data becomes corrupted (e.g., “wild write” into heap or stack). This type of error typically will not cause an outage until the data is used at some later point in time. 3. Interface/Interworking Error—Communications between two different components fail due to misalignment of inputs/outputs/behaviors across an interface. The interface in question could be between software objects or modules, software drivers and hardware, different network elements, different interpretations of protocols or data formats, and so on. 4. Configuration Error—The system becomes configured incorrectly due to a software fault or the system does not behave properly in a particular configuration. Examples of this type of error include incorrectly setting an IP address, specifying a resource index that is beyond the number of resources supported by the system, and so on. Predicting software failure rates is more difficult than estimating hardware failure rates because:
c03.qxd
2/8/2009
5:21 PM
Page 23
3.4
OUTAGE RESOLUTION
23
1. Impact of residual software defects varies. Some residual defects trigger failures with catastrophic results; others produce minor anomalies or are automatically recovered by the system. 2. Residual defects only trigger failures when they are executed. Since execution of software (binaries) is highly nonuniform (e.g., some parts get executed all the time, whereas some hardly ever get executed), there is wide variation in how often any particular defect might be executed. 3. Software failures are sometimes state-dependent. Modern protocols, hardware components, and applications often support a bewildering number of modes, states, variables, and commands, many of which interact in complex ways; some failures only occur when specific settings of modes, states, variables, and commands combine. For example, software failure rates for some systems may increase as the system runs longer between restarts; this phenomenon prompts many personal computer (PC) users to periodically perform prophylactic reboots.
3.4
OUTAGE RESOLUTION
At the highest level, outage recoveries can be classified into three categories, as follows. 3.4.1
Automatically Recovered
Many failures will be automatically recovered by the system by switching over to a redundant unit or restarting a software module. Automatically recovered outages often have duration of seconds or less, and generally have duration of less than 3 minutes. Although customers are likely to write a trouble ticket for automatically recovered hardware outages because the failed hardware element must be promptly replaced, customer policy may not require automatically recovered software outages to be recorded via trouble tickets. Thus, performance counters of automatic switchover events, module restarts, etc, may give more complete records of the frequency of automatically recovered software outages. Trouble tickets for automatically recovered outages are often recorded by customer staff as “recovered without intervention” or “recovered automatically.” Automatically recovered outages are said to be “covered” because the system successfully detected, iso-
c03.qxd
2/8/2009
24
5:21 PM
Page 24
CONCEPTUAL MODEL OF RELIABILITY AND AVAILABILITY
lated, and recovered the failure; in other words, the system was designed to “cover” that type of failure. 3.4.2
Manual, Emergency Recovered
Some failures will require manual recovery actions, such as to replace a nonredundant hardware unit. Alternately, manual recovery may have been required because: 앫 The system did not automatically detect the failure fast enough. 앫 The system did not correctly isolate the failure (e.g., the system indicted the wrong hardware unit, or misdiagnosed hardware failure as a software failure). 앫 The system did not support automatic recovery from this type of failure (e.g., processor, board, or system does not automatically restart following all software failures). 앫 The system’s automatic recovery action did not succeed (e.g., switchover failed). Although the maintenance staff should promptly diagnose any outage that is not automatically recovered, customer policy may direct that not all outages be fixed immediately on an “emergency” basis. Although large capacity-loss outages of core elements will typically be fixed immediately, recovery from smaller capacity-loss events may be postponed to a scheduled maintenance window. For example, if a single port on a multiport unit fails and repair will require nonaffected subscribers to be briefly out of service, a customer may opt to schedule the repair into an off-hours maintenance window to minimize overall subscriber impact. Manually recovered outages are generally trouble ticketed by the customer and are often recorded as “replaced,” “repaired,” “reseated circuit pack,” and so on. Outages manually recovered on an emergency basis are usually less than an hour for equipment in staffed locations or for software outages that can be resolved remotely. Generally, there will be minimal or no logistics delays in resolving emergency outages because spare hardware elements will be available on site and appropriately trained staff will be available on site or on call. 3.4.3
Manual, Nonemergency Recovered
Customers may opt to recover some outages during scheduled maintenance windows rather than on an emergency basis immedi-
c03.qxd
2/8/2009
5:21 PM
Page 25
3.4
OUTAGE RESOLUTION
25
ately following the failure event, to minimize overall service disruption to end users. Likewise, to minimize operating expense, customers may opt to postpone recovery from outages that occur in off hours until normal business hours to avoid the overtime expenses. Also, logistical considerations will force some repairs to be scheduled; for example, failed equipment could be located in a private facility (e.g., a shopping mall or commercial building) that cannot be accessed for repair at all times, or because certain spares are stored off-site. Customers often mark trouble tickets that are addressed on a nonemergency basis as being “parked,” “scheduled,” or “planned.” As a practical matter, equipment suppliers should not be accountable for excess downtime because outages were resolved on a nonemergency basis. Interestingly, outage durations generally improve (i.e., get shorter) over time because: 앫 Maintenance staff becomes more efficient. As staff gains experience diagnosing, debugging and resolving outages on equipment, they will get more efficient and, hence, outage durations will decrease. 앫 Automatic recovery mechanisms become more effective. As systems gain experience in the field, system software is generally improved to correctly detect, isolate, alarm, and automatically recover more and more types of failures, and, thus, some failures that would require manual recovery in early product releases will be automatically recovered in later releases. Likewise, some failure events that are initially detected by slower secondary or tertiary failure-detection mechanisms are likely to be detected by improved primary and secondary mechanisms, thus shortening failure detection for some events. Also, recovery procedures may be streamlined and improved, thus shortening outage durations. The overall effect is that in later releases, both a larger portion of failure events are likely to be automatically recovered than in earlier releases, and the outage durations for at least some of those events is likely to be shorter. The combined effect of improved automatic recovery mechanisms and customer experience are that outage durations generally shorten over time. Software failure rates of existing software also tend to decrease (i.e., improve) from release to release as more residual defects are found and fixed. The combined effect of these trends is a general growth in field availability as system software
matures and is upgraded. This availability growth trend is shown in Figure 3.4.
Figure 3.4. Availability growth over releases. (Drivers: automatic recovery mechanisms cover more failures previously requiring manual action and shorten failure detection, isolation, and recovery times; service-provider learning and process/procedure improvements shorten manual outage detection and recovery latency.)

3.5 DOWNTIME BUDGETS

Just as financial budgets can be created and quantitatively managed, so can downtime budgets. The first challenge is to create a complete and correct set of downtime categories. TL 9000's Standard Outage Template System (http://tl9000.org/tl_sots.htm) offers an excellent starting point for a downtime budget. The italicized text in the outline below is quoted directly from the Standard Outage Template System documentation. TL 9000 begins by factoring downtime contributors into three broad categories based on the attributable party:

1. Customer Attributable—Outages attributable primarily to actions of the customer, including:
• Procedural—"Outages due to a procedural error or action by an employee of the customer or service provider." "Actions" include decisions by a customer not to accept available redundancy offered by the product supplier.
• Power Failure, Battery or Generator—"[power failures] from the building entry into the element."
• Internal Environment—"Outages due to internal environmental conditions that exceed the design limitations of the Vendor system's technical specifications."
• Traffic Overload—"Outages due to high traffic or processor load that exceeds the capacity of a properly designed and engineered system."
• Planned Event (customer-attributable)—"Planned events not covered by other categories, e.g. equipment moves but not corrective actions."

2. Product Attributable—Outages attributable primarily to design and implementation of the product itself, or actions of the supplier in support of installation, configuration, or operation of that product, including:
• Hardware Failure—"Outages due to a random hardware or component failure not related to design (MTBF)."
• Design, Hardware—"Outages due to a design deficiency or error in the system hardware."
• Design, Software—"Outages due to faulty or ineffective software design."
• Procedural—"Outages due to a procedural error or action by an employee or agent of the system or equipment supplier."
• Planned Event—"Scheduled event attributable to the supplier that does not fit into one of the other outage classifications."

3. Third-Party Attributable—Outages attributable primarily to actions of others, including:
• Facility Related—"Outages due to the loss of [communications] facilities that isolate a network node from the remainder of the communications network."
• Power Failure, Commercial—"Outages due to power failures external to the equipment, from the building entry out."
• External Environment—"Outages due to external environmental conditions that exceed the design limitations of the Vendor system's technical specifications. Includes natural disasters, vandalism, vehicular accidents, fire, and so on."

Focusing only on product-attributable downtime allows one to use a simple downtime budget with three major categories:
1. Hardware—Covers hardware-triggered downtime. In simplex systems, hardware failures are often a substantial contributor to downtime; in duplex systems, hardware-triggered downtime generally results from the latency for the system to switch over to redundant elements. Naturally, hardware-triggered downtime is highly dependent on the system's ability to automatically detect hardware failures and rapidly switch over to redundant elements; one hardware failure that requires manual intervention to detect, isolate, and/or recover will probably accrue more product-attributable downtime than many automatically recovered hardware failures.

2. Software—Covers failures triggered by poor software or system design, or by activation of residual software defects. As with hardware-triggered downtime, events that fail to be automatically detected, isolated, and successfully recovered by the system typically accrue much more downtime than automatically and successfully recovered events.

3. Planned and Procedural—Covers downtime associated with both successful and unsuccessful software upgrades, updates, retrofits, and patch installation, as well as hardware growth. Downtime attributed to poorly written, misleading, or wrong product documentation and maintenance procedures can be included in this category.

One can, of course, use different taxonomies for downtime, or resolve downtime into more, smaller categories. For example, a software downtime budget could be split into downtime for the software application and downtime for the software platform; hardware downtime could be budgeted across the major hardware elements.

Since "five 9s" or 99.999% service availability maps to 5.26 downtime minutes per system per year, a "five 9s downtime budget" must allocate that downtime across the selected downtime categories. The downtime allocation will vary based on the system's redundancy and recovery architecture, complexity of the hardware, maturity of the software, training and experience of the support organization, and other factors, including precise definitions and interpretations of the downtime budget categories themselves. Often a 20%:60%:20% downtime budget allocation across hardware/software/planned and procedural is a reasonable starting point for a mature system, or:
• Hardware—1 downtime minute per system per year
• Software—3.26 downtime minutes per system per year
• Planned and Procedural—1 downtime minute per system per year
• Total budget for product-attributable service downtime—5.26 downtime minutes per system per year, or 99.999% service availability

Having set a downtime budget, one can now estimate and predict the likely system performance compared to that budget. If the budget and prediction are misaligned, then one can adjust the system architecture (e.g., add more redundancy, make failure detection faster and more effective, make automatic failure recovery faster and more reliable), improve software and hardware quality to reduce failure rates, increase robustness testing to assure fast and reliable operation of automatic failure detection and recovery mechanisms, and so on.
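The arithmetic behind such a budget is simple enough to script. The following Python sketch is an illustration only, not a tool from this text: it converts an availability target into annual downtime minutes and allocates them across the three categories using the 20%:60%:20% split discussed above (note that the budget above rounds the hardware and planned/procedural allocations to 1 minute each and assigns the remainder to software).

MINUTES_PER_YEAR = 525_960  # 365.25 days/year, consistent with the text

def downtime_budget(availability, split):
    """Allocate the annual downtime implied by an availability target.

    availability: e.g., 0.99999 for "five 9s"
    split: dict of category -> fraction of the budget (fractions sum to 1)
    """
    total_minutes = (1.0 - availability) * MINUTES_PER_YEAR
    return total_minutes, {cat: frac * total_minutes for cat, frac in split.items()}

total, budget = downtime_budget(0.99999,
                                {"hardware": 0.20, "software": 0.60,
                                 "planned/procedural": 0.20})
print(f"Total budget: {total:.2f} min/system/year")   # ~5.26
for category, minutes in budget.items():
    print(f"  {category}: {minutes:.2f} min/system/year")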
CHAPTER 4

WHY AVAILABILITY VARIES BETWEEN CUSTOMERS
Question: Can a product with identical failures (rates and events) have different perceived or measured availability for different customers? Answer: Yes, because customers differ both on what events they record as outages and on how long it takes them to resolve those events. The factors that cause variations in what events are reported and how long those events take to be resolved are detailed in this chapter. These factors also contribute to why observed availability varies from predicted availability. Note that customers using the same product in different configurations, leveraging different product features, and/or using those features in different ways may observe different failure rates; those variations in operational profiles are not considered in this chapter.
4.1 CAUSES OF VARIATION IN OUTAGE EVENT REPORTING

There are several causes of variation in how customers report outage events. These include:

• Definition of "primary functionality"
• How scheduled events are treated
• Customer's policies and procedures
• Compensation policies

The following sections elaborate on these causes.

4.1.1 Definition of "Primary Functionality"
Only failures that cause loss of "primary functionality" are deemed by customers to be outages. Although total or profound loss of service is unquestionably an outage, failures causing less than total service loss may be viewed differently by different customers. For example, some customers consider alarm visibility and management controllability of a network element to be primary functionality of the element, and thus deem any failure that causes them to lose management visibility to that element to be an outage event; other customers consider only disruptions of the end-user services offered by the element to be outages.

4.1.2 Treatment of Scheduled Events
Planned/scheduled events, such as applying software patches and upgrades, are typically more frequent than failure-caused outage events for high-reliability systems. Customers often have different processes in place to manage planned/scheduled events and unplanned/unscheduled events, which can include different databases to track, measure, and manage those events; they may even record different data for planned and unplanned outages. As customers may have different metrics and targets for planned/scheduled and unplanned/unscheduled events, the data may be categorized, analyzed, measured, and reported differently, thus causing differences in perceived availability.

4.1.3 Customer's Policies and Procedures
Customer staff, including staff in network operations centers (NOCs), are often very busy and, unless policy requires it, they may not record brief outage events. For instance, one customer may ticket all critical alarms, including those that automatically recover or are quickly resolved without replacing hardware, whereas another customer may only require tickets for alarms that are standing for more than a particular time (e.g., 15 minutes) or require special action such as replacing hardware. Often, formal outage notification policies will be in place that dictate how quickly an executive must be notified after a high-value element has been out of service. Obviously, if executives are notified once or more of a product-attributed outage for any element, they are likely to be more cautious or have a negative bias regarding the quality and reliability of that element.

4.1.4 Compensation Policies
Some customers tie aspects of the compensation for maintenance engineers to key performance indicators of quality and reliability. For instance, one sophisticated customer counts equipment "touches" by maintenance engineers that were not approved in advance (more "touches" is bad), on the hypothesis that well-maintained equipment should not have to be "touched" on an emergency, unapproved basis. Some customers might tie some aspect(s) of trouble ticket resolution (e.g., resolution time) to compensation, or perhaps even tie service availability of selected high-value elements to compensation. Because most metrics that are tied to compensation are both tracked carefully and actively managed by the affected staff, including availability-related metrics in compensation calculations is likely to impact the availability metrics themselves.
4.2 CAUSES OF VARIATION IN OUTAGE DURATION
Outage duration varies from customer to customer due to several factors:

• Efficiency of the customer staff in detecting and resolving outages
• How "parked" outages are treated
• Externally attributable causes

These factors are discussed in more detail in the following sections.

4.2.1 Outage Detection and Resolution Efficiency
Latency to detect, isolate, and resolve outage events is impacted by customer policies including:
• Training and Experience of Staff—Better trained, more experienced staff can diagnose failures and execute recovery actions more effectively and faster than inexperienced and poorly trained staff.

• Sparing Strategy (e.g., On-Site Spares)—Obviously, if a hardware element fails but the spare is not located on-site, then an additional logistics delay may be added to the outage duration.

• Operational Procedures (a.k.a. Methods of Procedure, or MOPs) and Tools—Better operational procedures can both streamline execution times for activities such as debugging or recovering from a failure and reduce the likelihood of errors that can prolong the outage. Likewise, better monitoring, management, and operational support tools can both accelerate and improve the accuracy of fault detection and isolation.

• Alarm Escalation and Clearance Policies (e.g., No Standing Alarms)—Some customers strive for no standing alarms (a.k.a. "clean boards"), whereas others tolerate standing alarms (a.k.a. "dirty boards"). Standing alarms may slow detection and isolation of major failure events, as maintenance engineers have to sift through stale alarms to identify the cause of the major failure.

• Support Contracts—If a customer has already purchased a support contract, then they may contact the supporting organization sooner for assistance in resolving a "hard" outage, thus shortening the outage duration. Without a support contract in place, the customer's staff may naturally spend more time trying to resolve the outage rather than working through the administrative process or approvals to engage an external support organization on the fly, thus potentially prolonging the outage.

• Management Metrics and Bonus Compensation Formulas—Many businesses use performance-based incentive bonuses to encourage desirable actions and behaviors of staff. For instance, if bonuses are tied to speed of outage resolution on selected types of network elements (e.g., high-value or high-impact elements), one would expect staff to preempt outage resolution of nonselected network elements to more rapidly restore the bonus-bearing elements to service. Likewise, if a customer has a policy that any outage affecting, say, 50 or more subscribers and lasting for more than, say, 90 minutes must be reported to a customer executive, then one might expect staff to work a bit
faster on larger outages to avoid having to call an executive (perhaps in the middle of the night).

• Government Oversight (e.g., Mandatory FCC Outage Reporting Rules)—Governments have reporting rules for failures of some critical infrastructure elements, and affected customers will strive to avoid the expense and attention that come with these filings. For example, the United States Federal Communications Commission (FCC) has established reporting rules for outage events impacting 900,000 user minutes and lasting 30 minutes or more. Naturally, customers strive to minimize the number of outage events they must report to the FCC and, thus, strive to resolve large events in less than 30 minutes.

• Sophistication/Expectations of Customer Base—Customers in different parts of the world have different expectations for service availability, and end users will respond to those local expectations. Thus, leading customers are likely to have more rigorous policies and procedures in place to resolve outages in markets where end users are more demanding, compared with the more relaxed policies and procedures that might suffice in less-demanding markets.

• "Golden" Elements—Some network elements directly support critical services or infrastructure (e.g., E911 call centers, hospitals, airports) or critical end users; these elements are sometimes referred to as "golden." Given the increased importance of these golden elements, restoring service to any of them is likely to preempt other customer activities, including restoring service to nongolden elements. Thus, one would expect outage durations on golden elements to be somewhat shorter than those on ordinary (nongolden) network elements.

4.2.2 Treatment of "Parking" Duration
Manual recovery of minor outages is sometimes deferred to a maintenance window or some later time, rather than resolving the outage immediately. This is sometimes referred to as "parking" an outage. Although all customers will precisely track the outage start and outage resolution times, they may not record precisely when the decision was made to defer outage recovery and exactly when the deferred recovery actually began; thus, it is often hard to determine how much of a parked outage's duration should be attributed to the product versus how much should be attributed to the customer. Beyond simply deferring a well-defined action (e.g., a software restart) to a maintenance window, the delay could be necessitated by logistical or other real-world situations such as:

• Spare parts not being available locally
• Appropriately trained staff not being immediately available
• Recovery being preempted, postponed, or queued behind a higher-priority recovery action
• Delays in physically accessing equipment, perhaps because it is located on private property, in a secured facility, at a remote location, and so on

Because minutes count in availability calculations of mission-critical and high-value systems, rounding parking times to 15-minute or 1-hour increments, or not explicitly tracking parking times at all, can significantly distort calculations of product-attributable downtime and availability. As a simplifying assumption, one might cap the maximum product-attributable downtime per outage to mitigate this uncertainty; outage duration beyond the cap would then be allocated to the customer rather than the product.
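A minimal sketch of this capping convention is shown below. The 60-minute cap and the helper name are illustrative assumptions, not values from TL 9000 or this text; the sketch simply splits each outage into a product-attributable portion and a customer-attributable remainder.

def split_capped_downtime(outage_minutes, cap_minutes=60.0):
    """Split each outage into product-attributable and customer-attributable minutes.

    cap_minutes is an illustrative policy choice; real allocations should follow
    the attribution rules agreed with the customer (e.g., TL 9000 categories).
    """
    product = sum(min(m, cap_minutes) for m in outage_minutes)
    customer = sum(max(m - cap_minutes, 0.0) for m in outage_minutes)
    return product, customer

# Example: three outages, one of which was "parked" for several hours.
product_min, customer_min = split_capped_downtime([12.0, 45.0, 240.0])
print(product_min, customer_min)   # 117.0 product-attributable, 180.0 customer-attributable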
4.2.3 Externally Attributable Outages and Factors
Occasionally, extraordinary events occur that can prolong outage resolution times, such as:

• Force majeure (e.g., hurricanes, fires, floods, malicious acts); TL 9000 classifies outages associated with these types of events as "externally attributable outages"
• Unfortunate timing of failures (e.g., New Year's Eve, Christmas Day, national holidays, or during software/system upgrades/retrofits)
• Worker strikes at the customer or its logistics suppliers
CHAPTER 5

MODELING AVAILABILITY
“All models are wrong, some are useful.” —George Box, industrial statistician An accurate, architecture-based model of system availability is useful to: 1. Assess the feasibility of meeting a particular availability target. One can predict availability of a product from system test results, or even as early as the architecture phase, before a single circuit has been designed or line of code has been written. This is useful in selecting the hardware architecture (e.g., how much hardware redundancy is necessary in a system), determining the appropriate investment in reliability-improving features (e.g., how hard software has to work to rapidly detect, isolate, and recover from failures), setting hardware and software failure-rate targets, and, ultimately, setting availability expectations for a particular product release. 2. Understand where system downtime is likely to come from and how sensitive downtime is to various changes in system characteristics. Given an architecture-based availability model, it is easy to estimate the availability benefit of, say, reducing the hardware failure rate of a particular circuit pack; improving the effectiveness of automatic detection, isolation, and recovery from hardware or software failures; shortening system reboot time, and so on. 3. Respond to customer requests (e.g., request for proposals, or RFPs) for system availability predictions, especially because modeling is recommended by standards (such as Telcordia’s SR-TSY-001171). Practical System Reliability. By Eric Bauer, Xuemei Zhang, and Douglas A. Kimber Copyright © 2009 the Institute of Electrical and Electronics Engineers, Inc.
In this chapter, we discuss the reasons for building availability models, cover reliability block diagrams (RBDs) and Markov models in detail, and define the terms used in creating availability models. We then tie it all together with an example model for a hypothetical system called the "Widget System," and wrap up with a set of pointers to additional information on how to create availability models.
5.1 OVERVIEW OF MODELING TECHNIQUES
There are many different kinds of reliability/availability models, including:

• Reliability block diagrams (RBDs) graphically depict simple redundancy strategies; these are the easiest models to construct.
• Markov models use state transition diagrams to model the time spent in each operational and nonoperational state, from which the probability of the system being operational and/or down can be calculated.
• Fault tree models construct a tree from individual fault types and identify the various combinations and sequences of faults that lead to failure.
• Minimal cut-set method. A cut set is a set of components which, when failed, causes failure of the system. This method identifies minimal cut sets for the network/system and evaluates the system reliability (or unreliability) by combining the minimal cut sets.
• A Petri net is a graphical modeling language that allows the user to represent the actual system functions. Blockages or failures can be studied while monitoring the performance and reliability levels. The models can be complicated and may be difficult to solve.
• Monte Carlo simulations generate large numbers of pseudorandom input values and then calculate the system outputs, such as service availability. They are useful where no specific assumptions have to be made about input parameters, such as failure or repair rates.

Each of these techniques is reviewed below. Of these model types, RBDs and Markov models are the two most frequently used methods, and they will be covered in the most detail. In practice, the two methods are often used together because they complement each other well: RBDs display the redundancy structure within the system, whereas Markov models address more detailed reliability evaluations. The advantages and drawbacks of the other modeling techniques are discussed in their respective sections.

5.1.1 Reliability Block Diagrams
Reliability block diagrams (RBDs) are one of the most common forms of reliability models. They are easy to understand and easy to evaluate, and they provide a clear graphical representation of the redundancy inherent within a system. RBDs can be used to describe both hardware and software components and topologies at a high level. Figure 5.1 shows several RBDs: one for a serial system, one for a parallel system, and one for a hybrid system.

Figure 5.1. Generic reliability block diagrams.

The basic idea behind an RBD is that the system is operational if there is at least one path from the input to the output (from left to right, by convention). In the serial example, failure of any one of components A, B, or C will break the path, and, thus, the system will be unavailable. In the parallel system, both components D and E must fail for there to be no path from input to output. Finally, in the hybrid system, if either component A or C fails, the system is down, but both components D and E must fail before the system goes down. The Widget Example in Section 5.4 gives sample RBDs that represent a typical system.

Despite their clarity and ease of use, RBDs do have some drawbacks. First, since each block can be in only one of two states (up or down), it is hard to represent some common configurations, such as load sharing and redundant units that have to perform failovers before their mate can provide service. Markov models do not have these limitations. Additional information on RBDs is available in chapter 3 of reference [AT&T90].

Complicated systems can often be represented as a network in which system components are connected in series or parallel, are meshed, or a combination of these. Network reliability models address system reliability evaluation based on component reliability and the topologies through which the components are connected. For simple systems, the components can be connected in series, parallel, or a combination of both, and system reliability can then be evaluated.

5.1.1.1 Series RBD Systems

A series system is one with all of its components connected in series; all must work for the system to be successful. If we assume that each component's reliability is given by Ri, then the system reliability is given by Equation 5.1, where the reliabilities R and Ri are expressed as percentages. Equation 5.1 also applies to availabilities.

R = ∏i Ri    (5.1)
Because the reliability of a series system is the product of the individual component reliabilities, the system reliability is always worse than the reliability of the worst component; series systems are weaker than their weakest link.

5.1.1.2 Parallel RBD Systems

A parallel system is one with all of its components connected in parallel; only one needs to work for the system to be successful. If we assume that each component's reliability is given by Ri, then the system reliability is given by Equation 5.2, where the reliabilities R and Ri are expressed as percentages. Equation 5.2 also applies to availabilities.

R = 1 – ∏i (1 – Ri)    (5.2)
For parallel systems, the resultant system reliability is greater than the reliability of any individual component.

5.1.1.3 N-out-of-M RBD Systems

Another common type of system that may be analyzed using RBDs is the N-out-of-M system. In an N-out-of-M system, there are M components, of which N must be operational for the system to be operational. The block diagram itself looks like a parallel system diagram but, typically, there is some indication that the components are in an N-out-of-M configuration:

Rs = Σ (from i = 0 to m – n) [m! / (i!(m – i)!)] R^(m–i) (1 – R)^i    (5.3)
Equation 5.3 models the system reliability based on the number of failed and working components, which can be analyzed mathematically by a binomial distribution (for details, see Section 2.1 in Appendix B—Reliability and Availability Theory). A classic example of an N-out-of-M system is a two-out-of-four set of power supplies. In this configuration, two supplies are powered from one source and two more from a separate source, with all four outputs connected together to power the system. In this configuration, failure of either power source will not cause a system outage, and the failure of any two power supplies will still leave the system operational. Another example of an N-out-of-M system is a multicylinder internal combustion engine. After failure of some number of cylinders, the engine will no longer have enough power to provide service. For example, consider an eight-cylinder engine in a car or airplane. If the engine must have at least four cylinders running to continue to move the car or keep the airplane airborne, then ignition system failures such as spark plug failures, plug wire failures, and ignition coil failures (for engines with one coil per cylinder) could each be modeled as a four-out-of-eight system.
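As a rough illustration of Equations 5.1 through 5.3, the short Python sketch below evaluates series, parallel, and N-out-of-M availability from per-component availabilities. The function names and sample values are illustrative, not from the text.

from math import comb, prod

def series_availability(components):
    """Equation 5.1: product of component availabilities."""
    return prod(components)

def parallel_availability(components):
    """Equation 5.2: 1 minus the product of component unavailabilities."""
    return 1.0 - prod(1.0 - a for a in components)

def n_out_of_m_availability(a, n, m):
    """Equation 5.3: at most m - n of the m identical components may fail."""
    return sum(comb(m, i) * a**(m - i) * (1.0 - a)**i for i in range(m - n + 1))

# Hybrid system of Figure 5.1: A and C in series with the parallel pair D/E.
A = C = 0.999
D = E = 0.99
hybrid = series_availability([A, parallel_availability([D, E]), C])
print(f"hybrid availability: {hybrid:.6f}")
print(f"2-out-of-4 power supplies at 99% each: {n_out_of_m_availability(0.99, 2, 4):.8f}")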
The methods derived from the basic series, parallel, and N-out-of-M models can be used to evaluate systems with a combination of the different types of configurations. More complex systems require more sophisticated methods to evaluate reliability of the entire system. Discussion of these methods is provided in subsequent sections.

RBDs help us to understand the system better and also enable us to decompose the system into pieces, which we can then analyze independently. For example, consider the three systems shown in Figure 5.1. In the RBD for the serial system, we see three separate components, A, B, and C. Because they are in series, we can model each independently (typically by using a Markov model, which will be discussed in a later section) and then add the downtimes for A, B, and C together to get the total downtime for the series system. This is essentially what Equation 5.1 says, but some people find it easier to relate to the summation of the individual downtimes than to the product of the availabilities given in Equation 5.1. The RBD for the parallel system of Figure 5.1 would typically be analyzed as a single entity, in this case an active/standby pair; that analysis would typically entail a Markov model. Finally, the hybrid system shown in Figure 5.1 would be analyzed as three separate models, one for each of the series elements: a model for component A, a model for the D and E pair of components, and a model for component C. The resultant downtimes from the three models would then be added to arrive at the downtime for the hybrid system, just as for the simple series system.

If availability were desired, then the resultant system downtime could easily be converted to availability. This would be done by subtracting the downtime from the amount of time in a year, and then dividing that result by a full year. For instance, if the model predicted 30 minutes of system downtime, we would subtract the 30 minutes from the 525,960 minutes in a year, leaving 525,930 minutes, and divide that by the number of minutes in the year (525,960), yielding a result of 99.9943%.
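The downtime-to-availability conversion above is a one-line calculation; a minimal Python sketch (the helper names are mine, and 525,960 minutes/year follows the text's 365.25-day year) is:

MINUTES_PER_YEAR = 525_960  # 365.25 days x 24 hours x 60 minutes

def availability_from_downtime(downtime_minutes_per_year):
    return (MINUTES_PER_YEAR - downtime_minutes_per_year) / MINUTES_PER_YEAR

def downtime_from_availability(availability):
    return (1.0 - availability) * MINUTES_PER_YEAR

print(f"{availability_from_downtime(30):.4%}")        # 99.9943%
print(f"{downtime_from_availability(0.99999):.2f}")   # 5.26 minutes/year ("five 9s")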
5.1.2 Markov Models
When evaluating the comprehensive system reliability/availability of a real system, the system structure, topology, and operating logic as well as the underlying probability distribution associated
with the components of the system need to be incorporated. RBDs and other simple network models are often insufficient to provide a comprehensive understanding of system availability and how to improve it. In particular, for repairable systems, they assume that the repair process is instantaneous or negligible compared with the operating time. This is an inherent restriction, and additional techniques are required if this assumption is not valid. One very important technique that overcomes this problem, and which has received considerable attention and use in industry, is known as the Markov approach or Markov modeling. Several texts [Feller68, Shooman68, Sandler63, Kemeny60, and Pukite98] are available on the application of Markov chains to reliability analysis.

Markov models describe the random behavior of systems that vary discretely or continuously with respect to time and space. The Markov approach can model the memoryless behavior of a system, that is, the property that the future random behavior of the system is independent of all past states except the immediately preceding one. In addition, the process must be stationary, which means the behavior of the system must be the same at all points of time, irrespective of the point in time being considered. Typically, system failure and the failure recovery process can be described by a probability distribution that is characterized by a constant failure or recovery rate, which implies that the probability of making a transition between two states remains constant at all points in time. This makes the Markov approach straightforward for industrial practitioners to adopt when they model system failure and the failure recovery process. Appendix B documents widely used probability distributions in reliability engineering.

Markov models are a bottom-up method that allows the analysis of complex systems and repair strategies. The method is based on the theory of Markov chains and represents system operations, failures, and repairs at specific points in time with state machines. The advantage of this method is that system behavior can be analyzed thoroughly: it can incorporate details such as partial failures, capacity loss, and repair strategies, and sensitivity analysis of all the potential features to improve overall availability can be explored. [Trivedi02] provides a good introduction to and summary of stochastic modeling techniques and tools that can be applied to computer and engineering applications. Markov models are relatively easy to solve, even for complex systems,
using commonly available tools, such as a spreadsheet like Microsoft Excel.

Markov models can be applied to model system, subsystem, and component availability. For complicated systems, a relatively standard procedure for evaluating the reliability/availability of a system is to decompose the system into its constituent components and individually estimate the reliability/availability of each of these components. The component reliability/availability results can then be combined to estimate the reliability/availability of the complete system.

The foundation for Markov modeling is the state transition diagram. The state transition diagram (or state diagram, for short) describes the possible states of the system, the events that cause it to transition from one state to another, and the rates at which these transitions occur. The basic concepts of Markov modeling can be illustrated by the state diagram shown in Figure 5.2, which is the state transition diagram for a simplex system.

Figure 5.2. Markov model for a simplex system (states: 1 Active, 2 Down Covered, 3 Down Uncovered).

The states in Figure 5.2 represent the operating modes that the system can be in, and the arcs between states represent the transitions and the rates at which they occur. State 1 represents the "active" state, in which the system is fully operational. State 2 represents the "down covered" state, in which a failure has been recognized by the system and recovery is initiated. State 3 represents the "down uncovered" state, in which the system has failed but the failure has not yet been
recognized, so recovery actions have not yet been initiated. By convention, failure rates are represented as λ, repair rates are represented as μ, and coverage factors are represented as C. This is a discrete Markov model, since the system is stationary and the movement between states occurs in discrete steps.

Consider the first time interval and assume that the system is initially in State 1, the state in which the system is operating normally. If the system fails and the failure is detected, the system moves into State 2 with rate λC; the system then transitions back to State 1 with rate μR after the repair is done. On the other hand, if the system fails and the failure is uncovered, then the system transitions from State 1 to State 3 with rate λ(1 – C). After the failure is eventually detected, the system transitions from State 3 to State 2 with rate μSFD. After solving Equations 5.4 below, the steady-state probabilities of the system being in each state can be calculated. The downtime can be calculated by adding up the time spent in the down states (State 2 and State 3 in this example).

Now that we know how the system operates, how do we solve the model to determine the time spent in each state? Because we are interested in the long-term, steady-state time in each state, we know that the input and output flows for each state must be equal. So, we can write an equation for each state that says the inputs minus the outputs equal zero. If we let Pi represent the probability of being in state i, then we get the three equations of Equation 5.4.

State 1 (active)—Normal operation:
μR P2 – λC P1 – λ(1 – C) P1 = 0

State 2 (down covered)—Detected failure:

μSFD P3 + λC P1 – μR P2 = 0    (5.4)

State 3 (down uncovered)—Undetected failure:

λ(1 – C) P1 – μSFD P3 = 0

When we try to solve these three equations for P1, P2, and P3, we discover we have only two independent equations (go ahead and convince yourself!) but three unknowns. The final equation
we require to solve this set of simultaneous equations is the one stating that the sum of the probabilities must be 1:

P1 + P2 + P3 = 1    (5.5)
With three independent equations in three unknowns, we may solve for the steady-state probability of being in each of the three states. That probability may then be multiplied by a time period to find out how much of that time period is spent in each state. For example, if the probability of being in state 1 is 2%, then in a year the system will spend 2% of its time in state 1, or approximately 175 hours (8766 hours in a year × 2% ≈ 175 hours). This method works for state transition diagrams of any size, although the larger the diagram, the more cumbersome the math becomes. Fortunately, it is easy to automate the math using computers, so system complexity is not an insurmountable barrier to good modeling. Matrix algebra and matrix techniques are typically used in solving Markov models.

We will work through an example of the simplex model above using the parameter values given in Table 5.1. This says that the failure rate is 10,000 FITs (a FIT is one failure in 10⁹ hours), which is a reasonable estimate for something like a server. It also says that 90% of the faults are automatically detected and alarmed, that it takes 4 hours to repair the unit, and that it takes an hour to detect that the unit has failed if the failure was uncovered (and, hence, unalarmed). Notice that all the times have been converted to rates and the rates are expressed as per hour.

TIP: We strongly recommend converting all rates to per hour before using them in a model. This will avoid erroneous results due to unit mismatch.

We will use the equations for states 2 and 3 above, along with the equation that says the probabilities of being in each state must sum to 1. We will rearrange the equations so that the coefficients
Table 5.1. Input parameters for modeling example

Parameter                               Symbol    Value      Units
Failure rate                            λ         1.00E-05   failures/hour
Coverage                                C         90.00%     %
Repair rate                             μR        0.25       per hour
Detection rate for uncovered faults     μSFD      1          per hour
are in order from P1 through P3. Doing that yields the following three equations:

λC P1 – μR P2 + μSFD P3 = 0
λ(1 – C) P1 + 0 P2 – μSFD P3 = 0    (5.6)
P1 + P2 + P3 = 1

To solve this set of equations for each Pi, we will define three matrices: A, P, and R. We define a matrix A that represents the coefficients from the equations such that, for the above example, we have the coefficient matrix shown in Equation 5.7.
A = [ λC           –μR     μSFD  ]
    [ λ(1 – C)      0     –μSFD  ]
    [ 1             1      1     ]    (5.7)
Using the equations for states 2 and 3 has an added advantage—it is very easy to fill in the values for the first two rows of matrix A. Table 5.2 shows how to easily fill in the values. Across the top we write the "from" state number and along the rows we write the "to" state number, and then we fill in the rates from state to state. When we hit an entry where the from and to states are the same, we write everything that leaves that state, using a minus (–) sign to indicate that the rate is an outgoing rate. So, for example, in the second row of column 1, we enter the rate from state 1 to state 3. Referring back to the state transition diagram, we see that this is λ(1 – C).

Next, we define P as the vector of probabilities that we are in each state. Thus, in the above example, with three states, we have the probability vector
P = [ P1 ]
    [ P2 ]
    [ P3 ]    (5.8)
Table 5.2. State transition information for example model

              From state 1    From state 2    From state 3
To state 2    λC              –μR             μSFD
To state 3    λ(1 – C)        0               –μSFD
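A quick way to sanity-check the table is to build the coefficient matrix in code. The Python/NumPy sketch below is mine (the variable and function names are assumptions); it fills in the first two rows exactly as described above and adds the all-ones row for the probability constraint.

import numpy as np

# Transition rates from the state diagram: (from_state, to_state) -> rate per hour.
# Values are the Table 5.1 example; lam = failure rate, C = coverage.
lam, C, mu_R, mu_SFD = 1.0e-5, 0.90, 0.25, 1.0
rates = {(1, 2): lam * C, (1, 3): lam * (1 - C), (2, 1): mu_R, (3, 2): mu_SFD}

def coefficient_matrix(rates, n_states, balance_rows=(2, 3)):
    """Build A as in Table 5.2: one balance row per chosen state, plus sum(P) = 1."""
    A = np.zeros((n_states, n_states))
    for row, to_state in enumerate(balance_rows):
        for (src, dst), rate in rates.items():
            if dst == to_state:
                A[row, src - 1] += rate      # inflow into to_state from src
            if src == to_state:
                A[row, src - 1] -= rate      # outflow leaving to_state
    A[-1, :] = 1.0                           # P1 + P2 + ... + Pn = 1
    return A

print(coefficient_matrix(rates, 3))          # matches Equation 5.7 numerically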
Finally, we define a vector R that represents the right-hand side of the above equations. All the equations have a right-hand side of zero, except the one that says the probabilities must sum to one. Thus, for our example, we get the results vector
R = [ 0 ]
    [ 0 ]
    [ 1 ]    (5.9)
Then, using matrix notation, we can express the above set of equations as

A P = R    (5.10)
What we really want to know is the values of the elements of P. We already know the values for A and R, so solving for P we get

P = A⁻¹ R    (5.11)
where A⁻¹ is the inverse of the matrix A. Going back to our example and filling in the actual numbers, we get
A = [ 0.000009    –0.25     1 ]
    [ 0.000001     0       –1 ]
    [ 1            1        1 ]    (5.12)
Inverting A results in

A⁻¹ = [ 3.99983601      4.999795    0.999959  ]
      [ –3.99984        –3.9998     4 × 10⁻⁵  ]
      [ 3.998 × 10⁻⁶    –1          1 × 10⁻⁶  ]    (5.13)
Multiplying R by A⁻¹ yields our answer:
P = [ 0.999959        ]
    [ 3.99984 × 10⁻⁵  ]
    [ 9.99959 × 10⁻⁷  ]    (5.14)
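For readers who prefer code to spreadsheets, the following Python/NumPy sketch (not from the text; it simply reproduces the worked example) builds A and R as defined above, solves A P = R, and converts the state probabilities to minutes per year.

import numpy as np

lam, C, mu_R, mu_SFD = 1.0e-5, 0.90, 0.25, 1.0   # Table 5.1 values, per hour
MINUTES_PER_YEAR = 525_960

A = np.array([[lam * C,        -mu_R,  mu_SFD],   # state 2 balance equation
              [lam * (1 - C),   0.0,  -mu_SFD],   # state 3 balance equation
              [1.0,             1.0,   1.0   ]])  # probabilities sum to 1
R = np.array([0.0, 0.0, 1.0])

P = np.linalg.solve(A, R)          # steady-state probabilities P1, P2, P3
minutes = P * MINUTES_PER_YEAR
print(P)                           # ~[0.999959, 4.0e-05, 1.0e-06]
print(minutes)                     # ~[525938.4, 21.0, 0.5] minutes/year
print(f"availability = {P[0]:.6%}, downtime = {minutes[1] + minutes[2]:.2f} min/yr")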
Thus, the probability of being in state 1 is 99.9959%, the probability of being in state 2 is 0.00399984%, and the probability of being in state 3 is 0.0000999959%. At first glance, these appear to be very tiny probabilities for states 2 and 3. To get to something that is easier to relate to, we will convert these to the number of minutes per year spent in each state. This is easy to do; we simply multiply each probability by the number of minutes in a year. There are 525,600 minutes per nonleap year, or 525,960 if you consider each year as having 365.25 days to accommodate leap years. Using the 525,960 value and multiplying by the probabilities shown above, we discover that the simplex system of our example spends the number of minutes per year in each of the three states shown in Table 5.3. States 2 and 3 are both down states. Thus, our example system is down for 21.57 minutes per year—the sum of the time spent in states 2 and 3. Therefore, the availability of this system is the probability of being in the active state (state 1), or 99.9959%.

There are many available tools that can be used to solve for the probabilities, from spreadsheets to custom software designed specifically for availability analysis. By understanding the basics, each modeler is free to pick a tool that meets their budget and format needs. Virtually everyone in the business world has access to a spreadsheet, so it is appropriate to mention how to perform the above calculations using a spreadsheet.

The first step is to enter the array data A into a matrix in the spreadsheet. You can do this as mentioned above, using the "from state" and "to state" numbers as a guide. Then, at the bottom of the matrix, add a row of all ones to cover the equation stating that the sum of the probabilities must be 1. You will now have matrix A filled in. For our example it would look like Table 5.4.
Table 5.3. Example simplex system modeling results

State    Probability    Minutes/year
1        0.999959       525938.44
2        3.9998E-05     21.04
3        9.9996E-07     0.53
Table 5.4. Coefficient matrix in spreadsheet form

              From state 1    From state 2    From state 3
To state 2    0.000009        –0.25           1
To state 3    0.000001        0               –1
ΣPi = 1       1               1               1
Next, we need to create the inverse of matrix A. To do this, we select a range of cells somewhere else in the spreadsheet that is the same size as matrix A (in this case, 3 by 3), then select the matrix inversion function and select the values from matrix A as the input. If you are using Microsoft Excel, you can do this by highlighting a new 3 by 3 area, selecting the "MINVERSE" function from the list of functions shown under the "Insert|Function..." menu item, and clicking OK. You will now be presented with a dialog that asks you to identify the input matrix. You can enter the range of cells for matrix A directly into the dialog (such as G24:I26) or click the button on the right end of the edit box and then highlight matrix A. Once you have selected matrix A, do not click OK in the dialog. You really want the entire matrix inversion, so hit Ctrl-Shift-Enter (hold down the Control and Shift keys, then hit the Enter key while holding them down). This will populate the entire inverse of matrix A. You should have something that looks like Table 5.5.
Table 5.5. Inverted coefficient matrix in spreadsheet form

3.999836007    4.9998     0.99996
–3.99984001    –3.9998    4E-05
3.99984E-06    –1         1E-06
Next, we enter the vector R for the right-hand side of the equations as a single column; all entries are 0 except the last, which is 1 (for the equation stating that the probabilities sum to 1). For our example, vector R should look like Table 5.6.

Table 5.6. Results vector in spreadsheet form

0
0
1
The final step is to perform the matrix multiplication P = A⁻¹R to solve for the probabilities. In Microsoft Excel, we do the matrix multiplication with the MMULT function by highlighting the output matrix (a single column with three rows for our example) and selecting "Insert|Function..." from the menu. This will open a dialog box asking us to select the two matrices to be multiplied. Select the inverse of matrix A for the first matrix and select vector R as the second matrix. You can select them either by entering their cell designations directly or by clicking the buttons on the right side of the edit boxes and highlighting the matrix in the spreadsheet. Once both matrices have been selected, you need to hit Ctrl-Shift-Enter as in the matrix inverse case. This will populate the probability matrix.

Typically, we look at downtime on an annual basis, so normally the probabilities would be multiplied by the number of minutes in a year. Table 5.7 shows the result of the matrix multiplication in the middle column, with the state number in the first column and the probabilities multiplied by 525,960 minutes/year in the third column. This table was generated using Microsoft Excel exactly as described here. It shows that the example system would be down for about 21.5 minutes/year (the sum of the downtime incurred in states 2 and 3), and in normal operation the rest of the year.

So far, we have described the basic Markov chain models and the transition probabilities in these models. The Markov approach implies that the behavior of the system must be stationary (homogeneous) and memoryless. This means that the Markov approach is applicable to those systems whose behavior can be described by a probability distribution that is characterized by a constant failure and recovery rate.
Table 5.7. Probability solution and downtime in spreadsheet form

State    Probability    Minutes/Year
1        0.999959       525938.44
2        3.9998E-05     21.04
3        9.9996E-07     0.53
Only if the failure and recovery rates are constant does the probability of making a transition between two states remain constant at all points in time. If the probability is a function of time or of the number of discrete steps, then the process is nonstationary and designated as non-Markovian. Appendix B documents widely used probability functions, among which the Poisson and exponential distributions have a constant failure rate. Reference [Crowder94] documents more details on probability models and how the statistical analysis is developed from those models.

5.1.3 Fault Tree Models
Fault tree analysis was originally developed in the 1960s by Bell Telephone Laboratories for use with the Minuteman missile system. Since then it has been used extensively in both aerospace and safety applications. Fault tree analysis, like reliability block diagrams, represents the system graphically. The graphical representation is called a "fault tree diagram." Rather than connecting "components" as in an RBD, a fault tree diagram connects "events." Events are connected using logical constructs, such as AND and OR. Figure 5.3 shows the fault tree diagram for the hybrid system shown in Figure 5.1.

In the system of Figure 5.3, there are two simplex components, A and C, and a pair of redundant elements, D and E. The events depicted in the fault tree diagram are the failures of each of the components A, C, D, and E. These events are shown in the labeled boxes along the bottom of the diagram. The diagram clearly shows that the top-level event, which is system failure, results if A or C fails, or if both D and E fail. Other gates besides the AND and OR gates shown in Figure 5.3 are also possible, including a Voting-OR gate (i.e., N of M things must have faults before the system fails), a Priority AND gate (events must happen in a specific sequence to propagate a failure), and a few others.

In a fault tree diagram, the path from an event to a fault condition is a "cut set." In more complex systems, a single event may appear as an input multiple times. The shortest path from a given event to a system fault is called a "minimal cut set." Cut sets are discussed further in Section 5.1.4.

One of the drawbacks to the use of fault tree analysis is its limited ability to incorporate the concepts of repair and maintenance, and the time-varying distributions associated with them.
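As a rough illustration of the fault tree logic in Figure 5.3 (this code is mine, not from the text), the top event can be evaluated directly from the component failure states with Boolean AND/OR gates:

def system_failed(a_failed, c_failed, d_failed, e_failed):
    """Top event of Figure 5.3: (A OR C) OR (D AND E)."""
    return a_failed or c_failed or (d_failed and e_failed)

# A few spot checks against the hybrid RBD of Figure 5.1.
print(system_failed(False, False, True, False))   # False: D alone fails but E still provides service
print(system_failed(False, False, True, True))    # True: both D and E failed
print(system_failed(True, False, False, False))   # True: simplex component A failed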
Figure 5.3. Fault tree diagram for the hybrid system of Figure 5.1 (the top event "System Failure" is the logical OR of inputs A, C, and the output of the AND gate over D and E).
Markov models are capable of including repair and maintenance actions, and are thus preferred for redundant systems, and especially for systems that are repaired while still providing service. It is generally fairly straightforward to convert from a fault tree diagram to an RBD (with a few exceptions), but the converse is not always true. Fault trees work in the "failure space," dealing with the events that cause a system failure, whereas RBDs work in the "success space," dealing with the situations in which the system operates correctly. Additional information on fault tree analysis can be found in both the U.S. Nuclear Regulatory Commission Fault Tree Handbook [NUREG-0492] and NASA's Fault Tree Analysis with Aerospace Applications [NASA2002], which is an update to NUREG-0492 with an aerospace focus.

5.1.4 Minimal Cut-Set Method
A cut set is defined as a set of system components that, when failed, causes failure of the system. In other words, it is a set of components that, when cut from the system, results in a system failure. A minimal cut set is a set of system components that, when failed, causes failure of the system but when any one component of the set has not failed, does not cause system failure (in
this sense the components in each cut set are put in parallel). In this method, minimal cut sets are identified for the network/system and the system reliability (or unreliability) is evaluated by combining the minimal cut sets (the minimal cut sets are then drawn in series). However, the concept of series from Section 5.1.1.1 cannot be used. Assume Ci is the ith cut set. The unreliability of the system is then given by Fs = P(C1 ∪ C2 ∪ ... ∪ Cn). The reliability of the system is complementary to the unreliability.

To clarify this concept, consider the fault tree diagram shown in Figure 5.3. The cut sets are {A}, {C}, {A, C}, {D, E}, {A, E}, {A, D}, {C, D}, {C, E}, {C, D, E}, {A, D, E}, and {A, C, D, E}, because all these combinations of failures in components A, C, D, and E will result in a system failure. However, the minimal cut sets are {A}, {C}, and {D, E}; the other cut sets are not minimal because at least one component can be removed from each of them and the remaining components still cause system failure. The unreliability of this example is then

FS = P(CA ∪ CC ∪ CDE)
   = P(CA) + P(CC) + P(CDE) – P(CA ∩ CC) – P(CA ∩ CDE) – P(CC ∩ CDE) + P(CA ∩ CC ∩ CDE)

where

P(CA) = 1 – RA
P(CC) = 1 – RC
P(CDE) = (1 – RD)(1 – RE)
P(CA ∩ CC) = P(CA)P(CC) = (1 – RA)(1 – RC)
P(CA ∩ CDE) = P(CA)P(CDE) = (1 – RA)(1 – RD)(1 – RE)
P(CC ∩ CDE) = P(CC)P(CDE) = (1 – RC)(1 – RD)(1 – RE)
P(CA ∩ CC ∩ CDE) = P(CA)P(CC)P(CDE) = (1 – RA)(1 – RC)(1 – RD)(1 – RE)

and Ri is the reliability of component i. Therefore,

FS = (1 – RA) + (1 – RC) + (1 – RD)(1 – RE) – (1 – RA)(1 – RC) – (1 – RA)(1 – RD)(1 – RE)
     – (1 – RC)(1 – RD)(1 – RE) + (1 – RA)(1 – RC)(1 – RD)(1 – RE)
   = 1 – RARCRD – RARCRE + RARCRDRE
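The inclusion-exclusion arithmetic above is easy to check numerically. The short Python sketch below is mine, with arbitrary sample reliabilities; it computes FS from the three minimal cut sets and compares it against the closed-form result 1 – RARC[1 – (1 – RD)(1 – RE)].

from itertools import combinations
from math import prod

R = {"A": 0.999, "C": 0.999, "D": 0.99, "E": 0.99}   # illustrative component reliabilities
minimal_cut_sets = [{"A"}, {"C"}, {"D", "E"}]

def p_cut(cut):
    """Probability that every component in the cut set has failed (independence assumed)."""
    return prod(1.0 - R[c] for c in cut)

def unreliability(cut_sets):
    """Inclusion-exclusion over the union of the cut-set failure events."""
    total = 0.0
    for k in range(1, len(cut_sets) + 1):
        for combo in combinations(cut_sets, k):
            union = set().union(*combo)          # joint event: all components in the union fail
            total += (-1) ** (k + 1) * p_cut(union)
    return total

fs = unreliability(minimal_cut_sets)
closed_form = 1.0 - R["A"] * R["C"] * (1.0 - (1.0 - R["D"]) * (1.0 - R["E"]))
print(fs, closed_form)    # the two values agree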
From the RBD method, we can derive the unreliability function as follows:

FS = 1 – RS = 1 – RARC[1 – (1 – RD)(1 – RE)] = 1 – RARCRD – RARCRE + RARCRDRE

The system unreliability derived from the cut-set method therefore agrees with the value calculated from the RBD method. Cut sets can be useful for identifying the events that cause a system to become inoperative. For example, any event that occurs in every minimal cut set is clearly an event whose impact should be mitigated, whether through redundancy or other mechanisms. The mathematics behind cut sets can become cumbersome and will be different for every system. Additionally, cut-set analysis does not easily lend itself to the inclusion of time-dependent distributions or of maintenance and repairs.

5.1.5 Petri-Net Models
Invented in 1962 by Carl A. Petri, the Petri-net method (also known as a place/transition net or P/T net) is a bottom-up method that utilizes a symbolic language. A Petri-net structure consists of places, transitions, and directed arcs. The Petri-net graph allows the user to represent the actual system functions and use markings to assign tokens to the net. Blockages or failures can be studied while monitoring the performance and reliability levels. A Petri net consists of four basic parts that allow construction of a Petri-net graph:

1. A set of places, represented by circles
2. A set of transitions, represented by vertical bars
3. One or more input functions
4. One or more output functions
Figure 5.4 shows a simple example of using a Petri net to describe the failure and recovery mechanism of a simplex system or component. Places and transitions are shown in Figure 5.4. The input functions in this example are the assumptions about the failure and recovery transitions (in this case, exponential) and the parameter values for these distributions. The output function is the availability result, in particular, availability as a function of the failure and recovery distributions. Places contain tokens, represented by dots. Tokens can be considered resources. The Petri-net marking is defined by the number of tokens at each place, which is also designated as the state of the Petri net.

Figure 5.4. A Petri-net example.

Petri nets are a promising tool for describing systems that are characterized as being concurrent, asynchronous, distributed, parallel, nondeterministic, and/or stochastic. There are several variations of Petri nets used in reliability analysis. A stochastic Petri net (SPN) is obtained by associating a firing time with each transition; the firing of a transition causes a change of state. A generalized stochastic Petri net (GSPN) allows timed transitions with both zero firing times and exponentially distributed firing times. An extended stochastic Petri net (ESPN) allows the firing times to belong to an arbitrary statistical distribution. In a timed Petri net, time values are associated with transitions; tokens reside in places and control the execution of the transitions of the Petri net. Timed Petri nets are used to study performance and reliability issues of complex systems, such as finding the expected delay in a complex set of actions, the average throughput capacity of parallel computers, or the average failure rate of fault-tolerant designs.
Petri nets can be combined with Markov models to analyze the stochastic processes of failure and recovery, although, in general, they are more difficult to solve than Markov models.

5.1.6 Monte Carlo Simulation
All the modeling methods we have discussed so far are analytical methods. A calculation method is analytical when the results are computed by solving a formula or set of formulas. However, analytical methods are not always possible or practical. Numerical methods, which are an alternative to analytical methods, are sometimes used to evaluate the reliability and availability of complex systems. One commonly used numerical method is Monte Carlo simulation.

Monte Carlo simulation uses random numbers to represent the various system inputs, such as failure rates, repair rates, and so on. One advantage of the Monte Carlo method is that the inputs can utilize distributions that make analytical methods difficult or even impossible to solve. In reliability engineering, Monte Carlo simulations repeatedly evaluate the system reliability/availability using a logical model of the system, such as an RBD. The input parameter values are regenerated randomly prior to each analysis. Although the parameters are regenerated each time, each parameter is constrained by the distribution function specified for that parameter. Monte Carlo analysis does not require the solution of a large or complex set of equations; all that is required is a logical model of the system and the specification of the input parameter distributions.

To help explain how Monte Carlo analysis works, we will assess the availability of the system shown in the RBD in Figure 5.5. This system is a very simple simplex system consisting of a DC motor. Additionally, from analysis of field data it has been determined that the failures follow a Weibull distribution with a characteristic life of 1000 hours and a shape factor of 2.0. Further, let us assume that the repair times are represented by a lognormal distribution with a shape parameter of 5.0 and a scale parameter of 2.0.

Figure 5.5. A Monte Carlo example.

Using the above RBD and the failure and repair distribution information, we can run a Monte Carlo analysis of this system. If we perform a Monte Carlo analysis with 1000 iterations, starting with a random number seed of 1, we obtain a system availability of 73.1%. What happens if we rerun the analysis, but start with a random number seed of 2? In that case, we get a system availability of 72.6%, a difference of half a percentage point. If we continue to experiment and reanalyze using 10,000,000 iterations, we get a system availability of 73.5914% using 1 as the seed and 73.6025% using 2 as the seed. Now the difference is 0.0111%, a much smaller value. As one would expect, increasing the number of iterations increases the accuracy of the results. The trade-off is that additional iterations require additional compute time.
5.2
MODELING DEFINITIONS
The following definitions apply to the models and systems described in this document. They are aligned with the customer’s view of system availability (described in Chapter 3, Section 3.3) and the TL 9000 Measurements Handbook. 5.2.1
Downtime and Availability-Related Definitions
The following chapters define the modeling outputs. In other words, these terms define the quantity whose value the models seek to determine. Typically, these terms are normalized to a single system and express the annual value the average single system will experience. 5.2.1.1 Primary Functionality Downtime and availability are defined in terms of the primary functionality [TL 9000 Measurement Handbook] of the system. Typically, the primary functionality of a system is to process transactions (or perhaps more accurately, the ability to generate revenue for the customer). Sometimes, it is appropriate to rede-
c05.qxd
2/8/2009
5:39 PM
Page 59
5.2
MODELING DEFINITIONS
59
fine the primary functionality of a particular system to understand the downtime associated with different functionalities. For example, a customer may require 99.999% transaction processing availability and 99.995% management visibility from a single system. In this case, two models could be constructed: one using transaction processing as the defined primary functionality, and the other using management visibility as the defined primary functionality. 5.2.1.2 Downtime Downtime is the amount of time that the primary functionality of the system is unavailable. Downtime is typically expressed in minutes per year per system. Unless specifically stated otherwise, downtime includes both total and prorated partial loss of primary functionality. Occasionally, it is of interest to exclude partial outages (i.e., consider total capacity loss events, only). In these cases, it should be clearly stated that the downtime consists of total outages only. 5.2.1.3 Unavailability Unavailability is the percentage of time the primary functionality of the system is unavailable. Mathematically, this is the downtime divided by the in-service time. For example, if a system experienced 5 minutes of downtime in a year, then unavailability is 5/525960 = 0.00095%. (525,960 is the number of minutes in a year, including leap years.) 5.2.1.4 Availability Availability is the percentage of time the primary functionality of the system is available. This is 100% minus unavailability, or 100% minus the downtime divided by the in-service time. Because downtime includes prorated partial outages, this is the same as the customer’s view of availability described in Chapter 2, Section 2.3. 5.2.1.5 Product-Attributable Downtime Product-attributable downtime is the downtime triggered by the system design, hardware, software, or other parts of the system. Earlier versions of the TL 9000 Measurement Handbook referred to product-attributable downtime as “supplier-attributable” downtime.
c05.qxd
2/8/2009
60
5:39 PM
Page 60
MODELING AVAILABILITY
5.2.1.6 Customer-Attributable Downtime Customer-attributable downtime is downtime that is primarily attributable to the customer’s equipment, support activities, or policies. It is typically triggered by such things as procedural errors, the office environment (such as power, grounding, temperature, humidity, or security problems). Historically, this has been referred to as “service-provider attributable” downtime. 5.2.1.7 Total Versus Partial Outages and Partial Outage Prorating The impact of an outage can be described by the system capacity loss due to the outage. Total outages typically refer to outages that cause 90% or more capacity loss of a given system, whereas outages that cause less than 90% of the system capacity loss are typically referred to as partial outages. The system downtime, hence, is calculated by weighting the capacity loss of any given outages. For example, the system downtime due to a total outage of 10 minutes is 10 minutes, whereas the system downtime due to a partial outage that causes 50% of the system capacity to be lost for 10 minutes is counted as 5 minutes. The telecom standards [GR-929] and [GR-1929] document measurement definitions for total versus partial outages, and TL 9000 provides a detailed definition of total versus partial outages for all product categories. Using prorating on partial outages has a side effect that may be counterintuitive to many readers: it means the downtime due to multiple sets of components is the same as the downtime for a single set. For example, consider a single pair of database servers in an active standby configuration. Let us assume that this set of servers has an annual downtime of 10 minutes per year (derived either by field performance or modeling). What happens if we grow the system to include an additional pair of database servers so we can double the system capacity? If the second pair is identical to the first, then it too will be down 10 minutes per year. Now the system will see 20 minutes of outage each year, 10 minutes from the original database pair and 10 minutes from the pair we just added. But now the loss of a single database pair reduces system capacity by 50%, so we discount the downtime for each by 50% , meaning we now have two separate outages that count as 5 minutes each, for a total annual downtime of 10 minutes. This is the same downtime we had with a single database pair! And, no matter how many pairs we add we will still have 10
c05.qxd
2/8/2009
5:39 PM
Page 61
5.2
MODELING DEFINITIONS
61
minutes per year of downtime if we prorate the downtime based on capacity! This fact can be very handy when we have to build models of systems that can vary in size. The above example demonstrates that we do not need to build a model for each possible number of database server pairs; we build a model for the simplest case of a single pair and we can reuse that downtime number for any system configurations that use a different number of database pairs. 5.2.1.8 Counting and Exclusion Rules Counting rules are the rules that determine which outages are included in the system downtime, and which outages (if any) may be excluded. Typically, these come from the purchasers of the system (or possibly a group of purchasers) because they understand the financial implications of the different types of outages. For example, for telecom equipment, TL 9000 specifies the counting rules based on the specific category of equipment. In TL 9000, most equipment categories may exclude outages of less than 15 seconds, and outages that affect less than 10% of the system capacity. The counting rules clearly specify what problems or outages are considered too small to consider. Because there is a cost associated with counting outages—reports have to be created and tracked—there is a crossover point at where the cost of reporting the outage exceeds the actual revenue lost by the outage. The counting rules (or exclusion rules) define this point in a clear manner and make outage counting a much more soluble problem. 5.2.2
Unplanned Downtime Modeling Parameter Definitions
The definitions in this chapter apply to the input parameters used to model unplanned downtime. They are subdivided into four major categories: 1. 2. 3. 4.
Failure rates Recovery times Coverage Failovers
The techniques for estimating each of these parameters from field, lab, and design/architecture data are described in Chapters 6, 7,
c05.qxd
2/8/2009
62
5:39 PM
Page 62
MODELING AVAILABILITY
and 8, respectively. Additionally, the nominal ranges for these parameters are provided in Chapters 7 and 8. System availability is more sensitive to changes in some parameters than in other parameters. The parameters to which the availability is more sensitive we have labeled influential parameters. The sections that follow specify which parameters are influential and which are less so. 5.2.2.1 Failure Rate Definitions Each of the different components that make up the system can have its own failure rate. Within the models, these failure rates are usually expressed in terms of failures per hour. The failure rates may be provided in many different forms, such as FIT rate (failures in 109 hours), mean time between failure (MTBF) in hours or years, and failures per year. Each of these different forms of failure rate must be converted to the same units (preferably failures per hour) prior to their use in the models. 5.2.2.1.1 Hardware Failure Rate. Hardware failure rate is the steady-state rate of hardware incidents that require hardware maintenance actions (typically FRU replacement). Typically, the hardware failure rate includes both service-affecting and nonservice-affecting hardware faults. This is typically an influential parameter. Hardware failure rates vary over the hardware’s lifetime. There are often some early life failures when components with manufacturing defects or other weaknesses fail; this is often referred to as “infant mortality.” This infant mortality period may last for several months. After these early life failures have occurred, then the hardware failure rate stabilizes to a (relatively) steady failure rate. Eventually, the hardware reaches the end of its designed service life and wearout failures begin to increase the hardware failure rate. This canonical hardware failure rate as a function of time is often referred to as the “bathtub curve” and is shown in Figure 5.6. Hardware failure rate predictions and calculations (for example, TL 9000’s Yearly Return Rate or YRR) are for the so-called constant failure rate period—the bottom of the “bathtub.” This rate is depicted as “FR” in Figure 5.6. As discussed earlier, the exponential distribution has a memoryless property and a constant failure (or hazard) rate (see Appendix B for the mathematical de-
c05.qxd
2/8/2009
5:39 PM
Page 63
5.2
MODELING DEFINITIONS
63
Figure 5.6. Bathtub curve.
tails). Hence, it has been widely used to describe the failure phenomena, in particular, hardware failure processes. After some period of time, the failure rate begins to increase due to wear-out mechanisms. This signals the end of service life for the component. There is no concrete definition of exactly when end of service life occurs; values from 125% to 200% of the steady-state failure rate are frequently used, but the exact definition may vary from component to component and manufacturer to manufacturer. End of service life can vary dramatically depending on the wear-out mechanisms of the individual component. Electromechanical components, such as fans or disk drives, tend to wear out more quickly than components like semiconductors with no moving parts. Among the probability distributions (Appendix B), the Weibull distribution has a very important property, that is, the distribution has no specific characteristic shape. In fact, if the values of the three parameters in the Weibull distribution are properly chosen, it can be shaped to represent all three phases of the bathtub curve. Hence, the Weibull distribution is the most widely used distribution function used to analyze experimental data. This makes Weibull (and a few other distribution functions such as gamma and lognormal which are discussed in Appendix B) a very important function in experimental data analysis. 5.2.2.1.2 Software Failure Rate. The software failure rate used in modeling is the rate of software incidents that require a module, process, application restart, or a reboot to recover. Module/
c05.qxd
2/8/2009
64
5:39 PM
Page 64
MODELING AVAILABILITY
process/application restart or reboot is analogous to “hardware maintenance action” in the hardware failure-rate definition. It should be noted that the software failure rate is not necessarily the same as the software defect rate. This is because there are many possible software defects that will not affect the system availability. For example, software that generates a misspelled message or paints the screen the wrong color clearly has a defect, but these defects are not likely to result in a system outage. Software failure rate is typically an influential parameter. Chapter 7 and Chapter 8 discuss how to estimate software failure rates from the test data and outage rate from field data, respectively. 5.2.2.2 Recovery Time Definitions Note that whereas hardware repair time is fairly straightforward, there are many potential variables that factor into the software recovery time, including the escalation strategy, the probability of success at each escalation level, and the time spent to successfully execute each escalation level. The definitions here assume a threetiered automatic software escalation strategy. The first tier detects failures within a single task or process and restarts the task or process. The recovery escalates to the second tier if restarting an individual task or process fails to restore primary functionality. At the second tier, the entire application is restarted. If this fails to restore primary functionality, escalation proceeds to the third tier, where a reboot occurs. If the reboot fails, then a manual recovery will be required. Not all systems will map directly to the threetiered approach described and defined here, but the concepts and principles will apply to most systems, and can easily be modified to fit any specific system and software recovery strategy. The individual recovery time parameters are described in the following sections. Successful detection times of the covered failures are included in the models in this book. 5.2.2.2.1 Hardware FRU Repair Time. Hardware FRU repair time is the average amount of time required to repair a failed hardware FRU. This includes both the dispatch and actual repair time. This is an influential parameter for simplex systems, but is not very influential for redundant systems. 5.2.2.2.2 Covered Fault Detection Time. Detection time for alarmed and/or covered failures is the amount of time it takes to
c05.qxd
2/8/2009
5:39 PM
Page 65
5.2
MODELING DEFINITIONS
65
recognize that the system has failed (in a detected manner) before automatic recovery takes place. Although this time duration is typically very short, this is included in the models in this book. 5.2.2.2.3 Uncovered Fault Detection Time. Uncovered fault detection time is the amount of time it takes to recognize that the system has failed when it was not automatically detected. This often requires a technician to recognize that there is a problem (possibly via troubleshooting alarms on adjacent systems or because performance measures deviate from expectations). The value used for this parameter comes from analyzing field outage data. This parameter does not include the recovery time; it is just the time required for a person to detect that the system has failed. Uncovered fault detection time is typically an influential parameter. 5.2.2.2.4 Single-Process Restart Time. Single-process restart time is the amount of time required to automatically recognize that a process has failed and to restart it. This parameter applies to systems that monitor the individual processes required to provide primary functionality, and is the average time required to detect a failure and restart one of those processes. It also applies to systems that use software tasks instead of processes if those tasks are monitored and are restartable. This parameter can be somewhat influential. 5.2.2.2.5 Full-Application Restart Time. Full-application restart time is the amount of time required to fully initialize the application. A full application restart does not include a reboot or a restart of the operating system. Full application restart time applies to systems in which full application restart is one of the recovery levels, and it does not include restarting lower levels of software such as the operating system or platform software. This is typically not an influential parameter. 5.2.2.2.6 Reboot Time. Reboot time is the amount of time required to reboot and initialize an entire server, including the operating system and the application. This can be a somewhat influential parameter for simplex software systems, but is typically not very influential in redundant systems. There also tends to be a wide variation on this parameter, from fairly quick reboots for real-time
c05.qxd
2/8/2009
66
5:39 PM
Page 66
MODELING AVAILABILITY
operating systems to tens of minutes for non-real-time systems with large databases that need to be synchronized during reboot. 5.2.2.2.7 Single-Process Restart Success Probability. Singleprocess restart success probability is the probability that restarting a failed process will restore primary functionality on the server. This parameter has an effect on the average software recovery time due to the weighting it gives to the single-process restart time, but this is typically not an influential parameter. 5.2.2.2.8 Full-Application Restart Success Probability. Full-application restart success probability is the probability that restarting the full application will restore primary functionality on the server. This parameter has an effect on the average software recovery time due to the weighting it gives to the full-application restart time, but this is typically not an influential parameter. 5.2.2.2.9 Reboot Success Probability. Reboot success probability is the probability that rebooting the server will restore primary functionality on the server. This parameter has an effect on the average software recovery time due to the weighting it gives to the reboot time, and the weighting it gives to a typically much slower manual software recovery via the unsuccessful percentage, but reboot success probability is typically not an influential parameter because of the relatively low probability of needing a reboot (both the single-process restarts and the full-application restarts have to fail before a full reboot is necessary). 5.2.2.3 Coverage Definitions Coverage is a probability; therefore, it is expressed as a percentage. The following sections describe the different types of coverage. These parameters are not necessarily correlated, although in actual systems there is probably some correlation between them. Part of this is because the mechanisms for improving coverage in one area frequently provide some level of coverage in another area. An example of this is a bus watchdog timer. The bus watchdog timer can detect accesses to invalid addresses that result from a hardware fault in the address decoder, but it can also detect an invalid address access due to an invalid pointer dereference in software. Coverage can be difficult to measure, and the correlation between different coverage types is even more difficult to measure. Because
c05.qxd
2/8/2009
5:39 PM
Page 67
5.2
MODELING DEFINITIONS
67
of this, the models use independent values for each of the different coverage types defined in the following chapters. Any correlation that is known may be incorporated into the actual coverage values used in the model. 5.2.2.3.1 Hardware Fault Coverage. Hardware fault coverage is the probability that a hardware fault is detected by the system and an automatic recovery (such as a failover) is initiated. This parameter represents the percentage of hardware failures that are automatically detected by the system. The detection mechanism may be hardware, software, or a combination of both. The important aspect is that the fault was detected automatically by the system. As an example, consider a hardware fault that occurs in an I/O device such as a serial port or an Ethernet controller. This fault might be detected in hardware by using parity on the data bus, or it could be detected in software by using a CRC mechanism on the serial data. The important thing is that it can be detected automatically, not whether the detection mechanism was hardware or software. Hardware fault coverage is an influential parameter in the downtime calculation of redundant systems (such as active/standby or N+K), but it is not very influential for simplex systems. 5.2.2.3.2 Software Fault Coverage. Software fault coverage is the probability that a software fault is detected by the system and an automatic recovery (such as a failover) is initiated. This parameter represents the percentage of software failures that are automatically detected by the system. The detection mechanism may be hardware, software, or a combination of both. The important aspect is that the fault was detected automatically by the system. For example, consider a software fault that incorrectly populates the port number of an incoming port to a value greater than the number of ports in the system. This error could be detected by a software audit of the port data structures, or it could be detected by the hardware when an access to an out-of-range port is attempted. The important thing is that it can be detected automatically, not whether the detection mechanism was hardware or software. Hardware fault coverage is an influential parameter in the downtime calculation of redundant systems (such as active/standby or N+K), but it is not very influential for simplex systems. This is because the automatic recovery time for covered faults is usually much shorter than the detection time for uncovered faults.
c05.qxd
2/8/2009
68
5:39 PM
Page 68
MODELING AVAILABILITY
5.2.2.3.3 Failed-Restart Detection Probability. Failed-restart detection probability is the probability that a failed reboot will be detected automatically by the system. Some systems will attempt to reboot multiple times if a reboot fails to restore primary functionality, whereas others will simply raise an alarm and give up. In either case, if the reboot has failed and the failure goes unnoticed, the failed unit will remain out of service until a technician notices the unit has failed. If the failed reboot is detected automatically, then either an alarm will be raised informing the technician that he or she needs to take action, or the system will make another attempt at rebooting the failed unit. This is typically a noninfluential parameter, although it is more influential in simplex systems than in redundant systems. 5.2.2.4 Failover Definitions The failover definitions all apply to redundant systems; a simplex system has nothing to failover to. 5.2.2.4.1 Automatic Failover Time. Automatic failover time is the amount of time it takes to automatically fail primary functionality over to a redundant unit. This is a moderately influential parameter, but it is not necessarily continuous. TL 9000 allows an outage exclusion for most product categories; outages less than a certain threshold (15 seconds in TL 9000 Release 4) do not need to be counted. This means that most systems should strive to detect and failover faulty units within the customer’s maximum acceptable time (e.g., 15 seconds for most TL 9000 product categories). 5.2.2.4.2 Manual Failover Time (Detection and Failover). Manual failover time is the amount of time it takes a technician to detect the need for a manual failover and to manually force a failover. Manual failovers are only required after an automatic failover has failed. Manual failover time is typically a noninfluential parameter. 5.2.2.4.3 Automatic Failover Success Probability. Automatic failover success probability is the probability that an automatic failover will be successful. This is typically a noninfluential parameter. 5.2.2.4.4 Manual Failover Success Probability. Manual failover success probability is the probability that a manual failover will
c05.qxd
2/8/2009
5:39 PM
Page 69
5.3
PRACTICAL MODELING
69
be successful. An unsuccessful manual failover typically leads to a duplex failure in which the recovery can be much longer. In that case, both the active and the standby units/instances are repaired or rebooted. This is typically a noninfluential parameter.
5.3
PRACTICAL MODELING
Real systems and solutions are made up of various interlinked hardware and software modules that work together to provide service to customers or users. It is best to start by creating a reliability block diagram that clearly shows which major modules must be operational to provide service, and highlights the redundancy arrangements of those critical modules. Each redundancy group or simplex element that appears in series in the critical path through the reliability block diagram can be separately and individually modeled, thus further simplifying the modeling. For example, consider the sample in Figure 5.7, which shows components A and C as critical simplex elements, and component B is protected via a redundancy scheme. Thus, system downtime can be modeled by summing the downtime for simplex components A and C, and separately considering the downtime for the cluster of Component Bs. The remainder of this section will present sample Markov availability models of common redundancy schemes. These sample models can be used or modified as building blocks for systemor solution-level availability models. 5.3.1
Simplex Component
Figure 5.8 models a simplex component with imperfect coverage in three states:
Figure 5.7. Simple reliability block diagram example.
c05.qxd
2/8/2009
70
5:39 PM
Page 70
MODELING AVAILABILITY
앫 State 1—The component is active and fully operational 앫 State 2—The component is down (nonoperational) and system and/or maintenance staff are aware of the failure so recovery activities can be initiated 앫 State 3—The component is down (nonoperational) but neither the system itself nor maintenance staff is aware of the failure, so recovery activities cannot yet be initiated. State 3 is often referred to as a “silent failure” state. This model is frequently used for things like backplanes when there is only a single backplane for the entire system. Typically, highly available systems employ redundancy schemes to increase availability, so this model should not see abundant use in a highly available system. Because backplanes in general are quite reliable, and have fully redundant connections within a single backplane, it is usually acceptable for them to be simplex. In the rare cases in which failure detection is instantaneous and perfect, this model degenerates into the simpler model shown in Figure 5.9. 5.3.2
Active–Active Model
The model shown in Figure 5.10 is used for duplex systems that split the load between a pair of units. When one unit goes down,
1 Working Uncovered Failure
(1-C))λ
Covered Failure
3 Down Uncovered
Cλ
Uncovered Failure is Detected
μSFD
μR Repair
2 Down Covered
States 2 and 3 are down states Figure 5.8. Simplex component Markov model.
c05.qxd
2/8/2009
5:39 PM
Page 71
5.3
PRACTICAL MODELING
1 Working Uncovered Failure
(1-C))λ )λ
Covered Failure
3 Down Uncovered
Cλ λ
μR Repair
Uncovered Failure is Detected
μSFD
2 Down Covered
The full model automatically changes to this model when the coverage is 100%.
Figure 5.9. Simplex, perfect coverage component Markov model.
Uncovered Failure
1 Duplex
2(1-CA) λ
Covered Failure
Uncovered Failure Detected μSFDTA
4 One Fails Covered
μ
μ
2CAλ
3 One Fails Uncovered
Repair
Repair
2nd Failure
2 Simplex Manual Failover
Successful Failover
FMμ FOM
Fμ FO
5 Failover Failed
Failed Failover
(1-F)μFO
λ
6 Duplex Failure
2nd Failure
λ
Failed Manual Failover
(1-FM)μFOM
2nd Failure
λ
States 3, 4 and 5 are 50% down. State 6 is 100% down.
Figure 5.10. Full active–active component Markov model.
71
c05.qxd
2/8/2009
72
5:39 PM
Page 72
MODELING AVAILABILITY
there is a 50% capacity loss until the lost traffic can be reestablished on the other unit. A good example of this type of equipment is a redundant hub or router. The model degenerates to that shown in Figure 5.11 when the failovers are perfect, that is, every failover attempt works perfectly. For those systems in which failovers occur relatively instantaneously, the model further degenerates to the one shown in Figure 5.12. And, for those rare cases where every failure is properly detected and a perfect instantaneous failover initiated, the model degenerates to the one shown in Figure 5.13. 5.3.3
Active–Standby Model
The active–standby model is used for duplex configurations that have one unit actively providing service and a second unit on standby just in case the first unit fails. A good example of this type of system is a redundant database server. One server actively an-
Uncovered Failure
1 Duplex
2(1-CA) λ
Covered Failure
Uncovered Failure Detected μ SFDTA
4 One Fails Covered
μ
μ
2CAλ
3 One Fails Uncovered
Repair
Repair
2nd Failure
2 Simplex Manual Failover
Successful Failover
FMμ FOM
Fμ FO
5 Failover Failed
Failed Failover
(1-F))μFO
λ
6 Duplex Failure
2nd Failure
λ
Failed Manual Failover
(1-FM)μFOM M
2nd Failure
λ
The full model automatically goes to this model when F = 100%.
Figure 5.11. Active–active with 100% failover success.
c05.qxd
2/8/2009
5:39 PM
Page 73
5.3
Uncovered Failure
Covered Failure
μ SFDTA
4 One Fails Covered
Repair
μ
Repair
μ
2CAλ
Uncovered Failure Detected
73
1 Duplex
2(1-CA) λ
3 One Fails Uncovered
PRACTICAL MODELING
2nd Failure
2 Simplex Manual Failover
Successful Failover
FMμFOM
FμFO
5 Failover Failed
Failed Failover
(1-F))μFO
λ
6 Duplex Failure
2nd Failure
λ
Failed Manual Failover
(1-FM)μFOM M
2nd Failure
λ
The full model automatically goes to this model when the failover time is very short.
Figure 5.12. Active–active with perfect, instantaneous failover.
swers queries while the other is in standby mode. If the first database fails, then queries are redirected to the standby database and service resumes. Figure 5.14 shows the state transition diagram for the full active–standby model. The full active–standby model degenerates to that shown in Figure 5.15 when the failovers are perfect, that is, every failover attempt works perfectly. For those systems in which failovers occur relatively instantaneously, the active–standby model further degenerates to the one shown in Figure 5.16. And, for those rare cases in which every failure is properly detected and a perfect instantaneous failover initiated, the active–standby model degenerates to the one shown in Figure 5.17. 5.3.4
N+K Redundancy
There are two different types of N+K redundancy. The first is true N+K, where there are N active units and K spare units. In true
c05.qxd
2/8/2009
74
5:39 PM
Page 74
MODELING AVAILABILITY
Uncovered Failure
1 Duplex
(1-CA) λ
Covered Failure
Uncovered Failure Detected
μSFDTA
4 One Fails Covered
μ
μ
CA2 λ
3 One Fails Uncovered
Repair
Repair
2nd Failure
2 Simplex Manual Failover
Successful Failover
FMμFOM
FμFO
5 Failover Failed
Failed Failover
(1-F))μFO
2nd Failure
λ
λ
6 Duplex Failure
2nd Failure
λ
Failed Manual Failover
(1-FM)μFOM M
The full model automatically goes to this model when C = 100%.
Figure 5.13. Active–active with perfect coverage and instant failover.
N+K redundancy the K spares do not perform any work until one of the N units fails, at which point traffic is routed to one of the K units and that unit assumes the work of the failed unit. The second type of N+K redundancy is called N+K Load Shared. In this configuration, the load is split across all N+K units, each handling roughly the same amount of load. The redundancy comes from the fact that only N units are required to support the needed capacity of the system. In N+K Load Shared systems there is a partial outage whenever any unit fails. This is because it takes a finite amount of time to redistribute that traffic to the remaining units. In true N+K systems, a partial outage occurs whenever one of the N units fails, but no outage occurs when one of the K units fails. Note that with equal N and K, the portion of partial outage in a true N+K system is greater than in an N+K Load Shared system, but that it is slightly less likely to occur. In the end, the choice becomes a matter of preference; both are very reasonable ways to construct a highly available system.
c05.qxd
2/8/2009
5:39 PM
Page 75
5.3
Uncovered Failure on the Active
1 Duplex
(1-CA)λ
Uncovered Failure Detected
μSFA
5 Active Down Covered
Uncovered Failure on the Standby
(1-CS) λ
Covered Failure on the Standby
Repair
μ
Covered Failure on the Active
4 Active Down Uncovered
75
PRACTICAL MODELING
CS λ
CA λ
2 Simplex
Uncovered Failure Detected μ SFS
Manual Failover
FMμFOM
2nd Failure 2nd
Successful Failover
FμFO (1-F))μFO
λ
Repair
Failure
μ
λ
2nd Failure
7 Failover Failed
Failed Failover
3 Standby Down Uncovered
λ
6 Duplex Failed
Failed Manual Failover
(1-FM)μFOM
2nd Failure
λ
States 4, 5, 6, and 7 are down states.
Figure 5.14. Full active–standby model.
An example of an N+K system is a cluster of servers, or server farm, used to create a website for access by a large number of users. Figure 5.18 shows the state transition diagram for an N+K Load Shared system. 5.3.5
N-out-of-M Redundancy
Section 5.1.1.3 discussed the RBD for an N-out-of-M system. The RBD method calculates the reliability of the system based on a binomial formula. Compared to the RBD method, the Markov model discussed here is richer since it allows detailed modeling of failure detection probabilities and failure recovery hierarchies. N-out-of-M redundancy is similar to N+K but it is used when failovers are essentially instantaneous. Practically speaking, it is typically used with power supplies and cooling fans. Power supplies typically have their outputs wired together, so failure of any
2/8/2009
76
5:39 PM
Page 76
MODELING AVAILABILITY
1 Duplex
Uncovered Failure on the Active
(1-CA) λ
(1-CS) λ
Covered Failure on the Standby
Repair
μ
Covered Failure on the Active
CS λ
CAλ
2 Simplex
4 Active Down Uncovered Uncovered Failure Detected
μSFA
5 Active Down Covered
Uncovered Failure on the Standby
Uncovered Failure Detected μ SFS
Manual 왖 Failover
FMμFOM M
Repair
Failure
μ
λ
Fμ FO
2nd Failure
7 Failover Failed
(1-F))μ FO
3 Standby Down Uncovered 2nd Failure
2nd
Successful Failover
Failed Failover
왖
c05.qxd
λ
λ
6 Duplex Failed
Failed Manual Failover
( M)μFOM (1-F M
2nd Failure
λ
The full model automatically goes to this model when F = 100%.
Figure 5.15. Active–standby with 100% failover success.
single supply is instantly covered by the other supplies supplying more current. A similar thing occurs with fans—when a single fan fails the remaining fans in the system can go to a higher speed and keep the system cool. Figure 5.19 shows the Markov model for the N-out-of M redundancy scheme. 5.3.6
Practical Modeling Assumptions
In modeling the availability of a real system that consists of various hardware and software components, the first item that needs to be addressed is that the system needs to be decomposed into hardware and software subsystems. One of the simplifications is to separate the hardware and software and model them with two separate sets of models. To do this, first we have to demonstrate that two separate models (one for hardware and one for software) do not yield significantly different downtime results as compared to an integrated hardware and software model. Figure 5.20 shows an integrated hardware and software model for an active–warm
c05.qxd
2/8/2009
5:39 PM
Page 77
5.3
Uncovered Failure on the Active
1 Duplex
(1-CA) λ
PRACTICAL MODELING
Uncovered Failure on the Standby
(1-CS) λ
Repair
μ
Covered Failure on the Active
2 Simplex
4 Active Down Uncovered Uncovered Failure Detected
μ SFA
5 Active Down Covered
Covered Failure on the Standby
CS λ
CAλ
Uncovered Failure Detected μ SFS
Manual Failover
μ
M FOM M Successful Failover
7 Failover Failed
(1-F))μFO
3 Standby Down Uncovered 2nd Failure
2nd
FμFO Failed Failover
77
Repair
Failure
μ
λ
2nd Failure
λ
λ
6 Duplex Failed
Failed Manual Failover
(1-FM)μFOM
2nd Failure
λ
The full model automatically goes to this model when when the failover time is very short.
Figure 5.16. Active–standby with perfect, instantaneous failover.
standby design. Table 5.8 lists the state definitions of the model in Figure 5.20. The predicted downtime based on the integrated model and two separate hardware and software active–standby models are within 5% of each other. This suggests that the separate models can be used to simplify the downtime modeling. The reasons that we recommend using separate hardware and software models for practical applications are: 1. The separate models produce downtime predictions that are within the acceptable range of precision. 2. Simpler models are easier to manage, which prevents unnecessary human errors. 3. Simpler models require fewer input parameters to be estimated. The uncertainties in the parameter estimations, in turn, might dwarf the precision offered by the more complicated models.
c05.qxd
2/8/2009
78
5:39 PM
Page 78
MODELING AVAILABILITY
1 Duplex
Uncovered Failure on the Active
(1-CA) λ
Uncovered Failure on the Standby
(1-CS) λ
Repair
μ
Covered Failure on the Active
CS 2 λ Uncovered Failure
CA λ
μ SFA
5 Active Down Covered
Manual Failover
FMμFOM M
2nd Failure 2nd
Repair
Failure
μ
Successful Failover
Fμ FO
7 Failover Failed
Failed Failover
(1-F)) μFFO
3 Standby Down Uncovered
Detected μ SFS
2 Simplex
4 Active Down Uncovered Uncovered Failure Detected
Covered Failure on the Standby
2nd Failure
λ
λ
6 Duplex Failed
Failed Manual Failover
(1-FM)μFOM
2nd Failure
λ
The full model automatically goes to this model when when coverage = 100%.
Figure 5.17. Active–standby with perfect coverage and instant failover.
Caution does need to be taken, though, in developing these models. The most important thing is that the interactions between the separated components need to be considered in the separated models. Failure states and the capacity losses need to be correctly reflected.
5.4
WIDGET EXAMPLE
This section ties the information from the preceding chapters together with an example for a hypothetical product called the “Widget System.” The hypothetical Widget System supports transaction processing functionality in a redundant architecture that can be tailored to a variety of different application needs. The Widget System is built on a scalable blade server chassis supporting application-specific boards. Internally, the widget system will be able to steer traffic to the appropriate application blade.
2/8/2009 5:39 PM
Figure 5.18. N+K Load Shared redundancy model.
c05.qxd Page 79
79
Mλ( 1-C)
M Active μ
MλC M-1 Active
2
μ
3
μ
(M-2)λ C N-1 Active
4
μ
(N-1) λC
…
11 2 Active
5 2 Active
Figure 5.19. N-out-of-M redundancy model.
M-2 Active
(M-1)λC
…
(N-1)λ(1-C)
N-1 Active
10
2λ(1-C)
μ
2λC
1 Active
6
1 Active
12
λ (1-C)
μ
λC
Cλ
1
μ SFDT
M-2 Active
μSFDT
M-1 Active
- C) ) λ(1
9
μ SFDT
8
μSFDT
(M-2)λ(1-C)
M 2 )λ(1 - C)
( M-1) λC (N 1 ) λ(1-C )
( M- 2) λC -C)
(N-1) λC
(M1
2 λC 2 λ(1
(M-1)λ(1-C)
μSFDT
80
13
0 Active
7
0 Active
5:39 PM
λ(1-C )
2/8/2009
μ SFDT
c05.qxd Page 80
25
29
23
11
10
2
9
5
4
7
Figure 5.20. Integrated active–warm standby hardware and software Markov model.
28
26
22
21
18
17
5:39 PM
20
19
14
2/8/2009
16
15
c05.qxd Page 81
81
c05.qxd
2/8/2009
82
5:39 PM
Page 82
MODELING AVAILABILITY
Table 5.8. State definitions for the integrated model State 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
ACTIVE pack Working Detected s1 failure. Restart process Detected s2 failure or failed restart of s1 failure. Initiate failover Working Failover to standby fails or manual detection of silent SW failure. Reboot & rebuild Silent SW failure Working Detected s2 failure or failed restart of s1 failure. Initiate failover Detected s1 failure. Restart process Detected HW failure. Initiate failover Silent HW failure Failover fails after detected HW failure or manual detection of silent HW failure Working Detected s1 failure. Restart process Detected s2 failure or failed restart of s1 failure. Initiate failover Silent SW failure. Detection of silent SW failure. Reboot and rebuild
20
Detected HW failure. Initiate failover Silent HW failure Failover fails after detected HW failure or manual detection of silent HW failure HW failure. Replace pack
21
Reboot & rebuild new pack
22 23 24
Silent HW failure Detected HW failure. Initiate failover Working
25 26
Detection of silent HW failure or failover to standby fails Silent HW failure
27
Detected s1 failure. Restart process
28 29
Detected s2 failure or failed restart of s1 failure or detection of silent SW failure. Reboot & rebuild Silent SW failure.
30
Working
STANDBY pack Working Working Working Reboot & copy data Reboot & rebuild Working Silent SW failure Silent SW failure Silent SW failure Silent SW failure Silent HW failure Reboot & rebuild Silent HW failure Silent HW failure Silent HW failure Silent HW failure Attempt reboot; detect HW failure Silent HW failure Silent HW failure Attempt reboot; detect HW failure HW failure. Replace pack Reboot & rebuild new pack Working Working Detected HW failure Reboot & rebuild Detected HW failure Detected HW failure Detected HW failure Detected HW failure Reboot & copy data
c05.qxd
2/8/2009
5:39 PM
Page 83
5.4
WIDGET EXAMPLE
83
The Widget System is representative of a variety of different potential products. It could be a telecom system in which the interface cards connect to the subscribers’ lines (in which case they would be called line cards), and the control boards manage the setup and teardown of the subscriber calls. It could also be used on a factory floor to control a robotic assembler. In this case, the interface cards would read the robot’s current position and pass that information to the control boards. The control boards would then calculate the correct control signal based on the current position, the desired position, the velocity, and the acceleration. The control boards would send the control signal information back to the interface cards, which in turn would translate it into the actual signal that drives the robots’ motors. The Widget System is implemented on a small footprint, industry standard platform. The platform includes a single chassis, redundant power supplies, fan assemblies, Ethernet external interfaces, and blades providing bearer path services. The ability to include a combination of CPU and network processor equipped blades enables exceptional performance, rapid introduction of new features, and a smooth evolution path. Figure 5.21 shows the Widget System hardware architecture. There is a redundant pair of control boards that provide the operations, administration, and maintenance (OAM) interfaces, centralized fault recovery control, initialization sequencing, and the monitoring and control for the chassis. Additionally, they provide
Additional slots for growth
Figure 5.21. Widget System hardware architecture.
c05.qxd
2/8/2009
84
5:39 PM
Page 84
MODELING AVAILABILITY
the communications interconnections between the remaining cards in the chassis. A pair of interface cards is shown. These cards provide the interface to the ingress and egress communications links for the Widget System. They may optionally include a network processor to enable handling of very high speed links. Redundant power converters are provided. In a typical installation, one converter is connected to each power bus, so that the system power remains operational in the event one of the power buses or one of the converters fails. Cooling is provided by a set of fan assemblies. Each fan assembly contains a fan controller and two fans. The fan controller controls the fan speed and provides status back to the control board. The fan controllers are designed so that the failure of a controller will result in both fans going to full speed. The system is designed so that it may operate indefinitely in the case of a single fan failure (the remaining fans will all go to full speed). Figure 5.22 shows the widget system hardware reliability block diagram. The backplane, power, fan assembly and other serial elements as well as the control board and interface cards are included. Figure 5.23 shows the software architecture that resides on each blade. Both the control boards and the interface cards contain this software architecture, with the exception that the OAM software task resides on the control boards only. The software on the interface cards has internal monitors, as well as having the control board software as a higher-level overseer. The connecting lines in the figure represent that the health of software processes Task 1, Task 2, . . . , and OAM SW is monitored by the monitoring software, MonSW.
Figure 5.22. Widget System hardware reliability block diagram.
c05.qxd
2/8/2009
5:39 PM
Page 85
5.4
WIDGET EXAMPLE
85
Operating System Other Platform SW
MonSW Task1 .
OAM SW Task2.
…
Task6 .
Figure 5.23. Widget software architecture.
A detailed software RBD for the control board is shown in Figure 5.24. The software RBD for the interface card is similar to the control board RBD. The interface card software is typically simpler than the control board software, but the same method can be used to model and estimate downtime. In this study, it is assumed that the interface cards operate in an active–standby mode. They can operate in other modes as well, such as simplex. A simplex interface card architecture results in a higher interface card downtime.
Figure 5.24. Widget System control board software reliability block diagram.
c05.qxd
2/8/2009
86
5:39 PM
Page 86
MODELING AVAILABILITY
The system software is composed of numerous modules [such as the operating system (OS), database software, platform software, etc.]. All of these software modules have different impacts on transaction processing downtime. For instance, failures of the OS and the transaction processing modules directly cause transaction-processing downtime, whereas failures of the OAM software and the MonSW cause transaction processing downtime only when a second failure (of the OS or transaction-processing software) occurs before the OAM and MonSW fully recover; hence, the OAM software and MonSW only have an indirect impact on transaction-processing downtime. The model considers these different failure impacts and it also incorporates the multiple layers of the failure escalation and recovery hierarchy. The next section documents the model input parameters, the assumptions of the parameter values and the methods of estimating these values. The Markov model in Figure 5.25 depicts the failure and failure-recovery scenarios for the active–standby control board (CB), and can be used to calculate the CB hardware downtime. Table 5.9
1 Duplex
Uncovered Failure on the Active
(1-CS)
(1-CA)λ
Uncovered Failure Detected
μSFA
5 Active Down Covered
Covered Failure on the Standby
Repair
μ
Covered Failure on the Active
4 Active Down Uncovered
Uncovered Failure on the Standby
CS λ
CA λ
2 Simplex
Uncovered Failure Detected μ SFS
3 Standby Down Uncovered
Manual Failover
FMμFOM
2nd Failure 2nd
Successful Failover
Fμ FO
7 Failover Failed
Failed Failover
(1-F))μ FO
Repair
Failure
μ
λ
2nd Failure
λ
λ
6 Duplex Failed
Failed Manual Failover
(1-FM) μFOM
2nd Failure
λ
States 4, 5, 6, and 7 are down states.
Figure 5.25. Active–standby Markov model for control board hardware.
c05.qxd
2/8/2009
5:39 PM
Page 87
5.4
WIDGET EXAMPLE
87
summarizes the parameters in the Markov model of Figure 5.25 while Table 5.10 shows the parameter values. The Markov model is solved for the steady state percentage of time spent in each state. Then, the time spent in states where the system is unavailable is summed to get the total downtime for the Control Board Hardware.
Table 5.9. Active–standby Markov model parameters Parameter Failure rate (failures/hour)
Symbol
Definition Failure rates of the unit/software
Coverage factor for the active mode
CA
Probability of detecting a failure on the active unit/software by system diagnostics. (1 – CA) denotes the probability that the system misses detecting a failure on the active unit/software. (1 – CA) is the undetected failure probability).
Coverage factor for the standby mode
CS
Probability of detecting a failure on the standby unit/software by system diagnostics. (1 – CS) denotes the probability that the system misses detecting a failure on the standby unit. CS might equal CA.
Failover success probability
F
Success probability of automatically promoting the standby instance to active
Failover duration (hours)
1/FO
Manual failover success probability Manual failover time (hours) Repair/reboot time (hours)
Automatic failover duration
FM
Success probability for manually forcing traffic from a failed unit to an operational standby unit
1/FM
Mean time to reestablish a standby instance by manually forcing traffic from a failed unit to an operational standby unit
1/
Mean time to repair a failed hardware unit (or reboot the software), which typically includes system initialization time
Uncovered failure detection time on the active unit (hours)
1/SFA
Mean time to detect an uncovered failure on the active unit. This detection typically involves human interaction, and, thus, is slower than automatic detections.
Uncovered failure detection time on the standby unit (hours)
1/SFS
Mean time to detect an uncovered failure on the standby unit. This detection typically involves human interaction and, thus, is slower than automatic detections, and may take longer than detecting uncovered failures on the active unit, since no service has been lost.
c05.qxd
2/8/2009
88
5:39 PM
Page 88
MODELING AVAILABILITY
Table 5.10. Assumed parameter values for widget control board hardware Parameter
Value
Failure rate (failures/hour) Coverage factor for the active mode Coverage factor for the standby mode Failover success probability Failover duration (min) Standby recovery time (min) Manual failover success probability Manual failover time (min) Repair/reboot time (min) Uncovered failure detection time (min)
0.0000042 90% 90% 99% 0.1666667 240 99% 30 240 60
A similar Markov model is built for the interface cards, the chassis, the power converters, the fans, and the software that resides on the control board and the interface cards. Each Markov model is then solved independently, and the results added together to obtain the downtime of the entire system. Table 5.11 shows CB hardware downtime while Table 5.12 shows the resultant system downtime and availability. The predicted downtime of approximately 46 minutes is not all that great. In Chapter 8, Section 8.3, where we discuss sensitivity, we will show how to improve this system to reduce the downtime and make the system more reliable. The Widget System model has demonstrated the application of RBDs and Markov models to a small system. The methods used in the example all apply to modeling larger, more complex systems. For additional information on how to model system availability, the references in Appendix E should be consulted.
Table 5.11. Control board hardware downtime per state State
Downtime (min/yr)
1 2 3 4 5 6 7 Total Downtime
525936.7883 17.6715 5.3009 0.2209 0.0061 0.0013 0.0110 0.2393
c05.qxd
2/8/2009
5:39 PM
Page 89
5.5
ALIGNMENT WITH INDUSTRY STANDARDS
89
Table 5.12. Widget system downtime and availability Component Hardware Infrastructure (Fans, Backplane, Power Entry, etc.) Control board Interface card Power converter Software Control board Interface card Total
5.5
Downtime (min/yr)
Predicted availability
0.57
99.9999%
0.24 0.03 0.03
100.0000% 100.0000% 100.0000%
43.26 1.62 45.74
99.9918% 99.9997% 99.9913%
ALIGNMENT WITH INDUSTRY STANDARDS
There are many industry standards that might apply to a particular system or piece of equipment, as evidenced by the long list of standards in the references in the Appendix. In fact, Telcordia alone has so many standards that refer to reliability that they have issued a “Roadmap to Reliability Documents” [Telcordia08], which lists three pages worth of Telcordia reliability related standards! GR-874-CORE, “An Introduction to the Reliability and Quality Generic Requirements,” also provides a good overview of the various Telcordia standards that are reliability related. The authors consider the TL 9000 Measurements Handbook to be the overarching standard that takes precedence over the other standards in the cases where there is a conflict. There are multiple reasons to consider the TL 9000 Measurements Handbook first: 1. It was created by a large consortium of service and equipment providers, so it represents a well-balanced viewpoint. 2. It defines how to measure the actual reliability performance of a product in the field. 3. It measures reliability performance as it relates to revenue generation, which is the driving force for customers. 4. It is updated regularly, so it stays current with the equipment types and practices in actual use. In addition to aligning with TL 9000, there are a number of other potentially applicable standards. Alignment with them is discussed in the following sections. In the following sections, the
c05.qxd
2/8/2009
90
5:39 PM
Page 90
MODELING AVAILABILITY
shorthand reference for each standard is used to make it easier to read. For the complete reference name and issue information see the references in Appendix E. 5.5.1
Hardware Failure Rate Prediction Standard Alignment
There are two primary standards for predicting hardware failure rates: MIL-HDBK-217F and SR-332. The models described above can accept hardware failure predictions based on either of these standards. In addition, some equipment suppliers have proprietary methods of predicting hardware failure rates, which may also be used as input to the models. Each of these different prediction methods typically produces a different failure rate prediction. MIL-HDBK-217F tends to be quite pessimistic (i.e., it predicts a failure rate greater than what will actually be observed in the field), whereas release 1 of SR-332 is less pessimistic than MIL-HDBK-217F, but still pessimistic when compared with observed values. Release 2 of SR-332 attempts to address this pessimism, although, due to the relative newness of Release 2, it is still too early to tell how successful the attempt has been. To obtain accurate results from the models, it is often appropriate to scale the predictions with a scaling factor. This factor should be based on comparisons of actual failure rates with those of the prediction method used. Additionally, the scaling factor should consider the type of equipment and the environment in which it is operating, as well as the specific release of the prediction standard in use. For example, using SR-332 Release 1, the scaling factor for CPU boards in a controlled environment (such as an airconditioned equipment room) might be one-third (meaning that the observed failure rate is one-third the predicted failure rate), whereas the scaling factor for a power supply in a controlled environment might be one-half. If there is insufficient field data available to determine an appropriate scaling factor, 1 should be used as the scaling factor until enough field data becomes available. See Chapter 7, Section 7.1 for additional discussion of hardware failure rates. [MIL-STD-690C], “Failure Rate Sampling Plans and Procedures,” provides procedures for failure rate qualification, sampling plans for establishing and maintaining failure rate levels at selected confidence levels, and lot conformance inspection procedures associated with failure rate testing for the purpose
c05.qxd
2/8/2009
5:39 PM
Page 91
5.5
ALIGNMENT WITH INDUSTRY STANDARDS
91
of direct reference in appropriate military electronic parts established reliability (ER) specifications. [GR-357] also discusses hardware component reliability, but is primarily focused on device qualification and manufacturer qualification. GR-357 also defines the device quality levels used in SR-332, and because SR-332 is an acceptable (preferred) method for generating hardware failure rates, the modeling described above is aligned with GR-357 as well. [SR-TSY-000385] provides an overview of the terms and the mathematics involved in predicting hardware failure rates, as well discussing the typical failure modes for a variety of device types. SR-TSY-000385 also touches on system availability modeling, but this is covered primarily by using RBDs, rather than by using the preferred method of Markov modeling. Additional standards that relate to specific component types include: 앫 [GR-326-CORE]—Covers single-mode optical connectors and jumpers 앫 [GR-468-CORE]—Covers optoelectronic devices 앫 [GR-910-CORE]—Covers fiber optic attenuators 앫 [GR-1221-CORE]—Covers passive optical components 앫 [GR-1274-CORE]—Covers printed wiring assemblies exposed to hygroscopic dust 앫 [GR-1312-CORE]—Covers optical fiber amplifiers and dense wavelength-division multiplexed systems 앫 [GR-2853-CORE]—Covers AM/digital video laser transmitters and optical fiber amplifiers and receivers 앫 [GR-3020-CORE]—Covers nickel–cadmium batteries These standards should be consulted for a more in-depth understanding of the reliability of the individual component types, but typically do not play a significant role in generating a system level availability model. 5.5.2
5.5.2 Software Failure Rate Prediction Standard Alignment
Telcordia [GR-2813-CORE] gives “Generic Requirements for Software Reliability Prediction.” Chapter 3.4 of the standard begins, “Since no single software reliability prediction model is accepted universally, these requirements describe the attributes that such a
model should have for use in predicting the software reliability of telecommunications systems." The standard talks about using the prediction models to "determine the number of faults remaining in code" and to "determine the effort necessary to reach a reliability objective" in assessing whether the software is ready to be released. These are exactly the tasks we use software reliability growth models (SRGMs) to complete. We also use SRGMs to compare and calibrate the software failure rate estimated in the testing environment against the software failure rate of early releases observed in the field environment, to predict the field failure rate prior to general availability (GA). This method is discussed in GR-2813-CORE. Telcordia GR-2813-CORE also discusses how to "correlate software faults with the characteristics of the software," such as software code size and complexity. Another Telcordia standard, SR-1547, "The Analysis and Use of Software Reliability and Quality Data," describes methods for analyzing failure counts as a function of other explanatory factors such as complexity and code size. Our software-metrics approach to estimating failure rates is consistent with both GR-2813-CORE and [SR-1547].
5.5.3 Modeling Standards Alignment
The primary standard that relates to modeling is SR-TSY-001171, "Methods and Procedures for System Reliability Analysis." This standard covers Markov modeling and encourages the use of coverage factors when modeling. The modeling described above is based on Markov techniques, includes the use of coverage factors, and is strongly aligned with the methods described in SR-TSY-001171. There are multiple industry standards that suggest ranges or limits on the specific input values to be used but, unfortunately, these standards sometimes conflict with each other. In these cases, the most recent release of the TL 9000 Measurements Handbook should be used as the highest-priority standard. The National Electronics Systems Assistance Center (NESAC), a consortium of North American telecommunications service providers, also issues targets for many of the TL 9000 metrics. The NESAC guidelines and targets should be the second-highest priority. Both TL 9000 and the NESAC guidelines are kept current, reflect the views of a consortium of service providers, and view availability from the perspective of revenue generation for the service provider, which is their ultimate business goal.
This implies that the ultimate consideration is availability of the primary functionality of the system, and that outages of the primary functionality for all product-attributable reasons (typically meaning both hardware and software causes) must be considered.

[MIL-STD-756B], "Reliability Modeling and Prediction," establishes uniform procedures and ground rules for generating mission reliability and basic reliability models and predictions for electronic, electrical, electromechanical, mechanical, and ordnance systems and equipment.

[IEC 60300-3-1] provides a good overview of the different modeling methods (which it refers to as "dependability analysis methods"). It includes a description of each method along with the benefits, limitations, and an example for each method.
5.5.4 Equipment-Specific Standards Alignment
There are a number of standards that relate to a specific category of telecommunications equipment. Among them are:

• [GR-63-CORE]—Covers spatial and environmental criteria for telecom network equipment
• [GR-82-CORE]—Covers signaling transfer points
• [GR-418-CORE]—Covers fiber optic transport systems
• [GR-449-CORE]—Covers fiber distribution frames
• [GR-508-CORE]—Covers automatic message accounting systems
• [GR-512-CORE]—Covers switching systems
• [GR-929-CORE]—Covers wireline systems
• [GR-1110-CORE]—Covers broadband switching systems
• [GR-1241-CORE]—Supplement for service control points
• [GR-1280-CORE]—Covers service control points
• [GR-1339-CORE]—Covers digital cross-connect systems
• [GR-1929-CORE]—Covers wireless systems
• [GR-2841-CORE]—Covers operations systems

These standards should be consulted to obtain equipment-specific information, such as downtime requirements and objectives, downtime budgets, and so on.
5.5.5 Other Reliability-Related Standards
There are a number of other reliability-related standards that do not apply directly to modeling. Many of these are related to the reliability and quality processes that help ensure a robust product.
The appendices provide a comprehensive list of standards that may be consulted by the reader interested in obtaining a broader background in quality and reliability.

[MIL-HDBK-338], "Electronic Reliability Design Handbook," provides procuring activities and contractors with an understanding of the concepts, principles, and methodologies covering all aspects of electronic systems reliability engineering and cost analysis as they relate to the design, acquisition, and deployment of equipment or systems.

[IEC 60300-3-6] (1997-11), "Dependability Management—Part 3: Application Guide—Section 6: Software Aspects of Dependability," describes dependability elements and tasks for systems or products containing software. This document was withdrawn in 2004 and replaced by IEC 60300-2.

[IEC 61713] (2000-06), "Software Dependability Through the Software Life-Cycle Process—Application Guide," describes activities to achieve dependable software in support of IEC 60300-3-6 (1997-11) (replaced by IEC 60300-2); the guide is still useful to acquire.
CHAPTER 6
ESTIMATING PARAMETERS AND AVAILABILITY FROM FIELD DATA
"You can't manage what you don't measure." Customers generally keep maintenance records, often called trouble tickets, for all manual emergency and nonemergency recoveries, and often at least some automatic recoveries. Records often capture the following data:

• Date and time of outage event
• Equipment identifier, such as model, location, and serial number
• Outage extent, such as number of impacted subscribers or percentage of capacity lost
• Outage duration, typically resolved to minutes or seconds
• Summary of failure/impairment
• Actual fix, such as "replaced hardware," "reset software," or "recovered without intervention"
• Believed root cause, such as hardware, software, planned activity, or procedural error

These records may include details on other relevant items, including:

• Emergency versus nonemergency recovery, such as "PLANNED=y," "SCHEDULED=y," or a nonzero value for "PARKING_DURATION"
• Fault owner, such as equipment supplier, customer error, or power company
Given outage trouble tickets and the number of systems deployed by a service provider, it is straightforward for customers to calculate availability and failure rates for the elements they have in service. Customers will often calculate availability and outage rates for high-value and/or broadly deployed elements on a monthly basis. These results may be shared with suppliers on a monthly, quarterly, or annual basis. This chapter explains how this customer outage data can be analyzed to estimate important input parameters to system-level availability models and to compute actual product-attributable availability. Given properly estimated input parameters and actual availability, one can calibrate the availability model, thus improving prediction accuracy for releases not yet deployed or developed.
6.1 SELF-MAINTAINING CUSTOMERS
Customers generally write outage trouble tickets for all manually recovered outages and, possibly, some automatically recovered outages. Customers generally only escalate an outage to the supplier (e.g., a supplier's Customer Technical Assistance Center) when they are no longer making acceptable progress addressing the outage themselves. For example, if the system raises a critical alarm that a hardware element has failed, the customer's staff will often replace the hardware element and return the failed pack to the supplier or a third-party repair house without engaging the equipment supplier in real time. Alternately, the first time a hard-to-debug outage occurs, the customer may contact the supplier's Customer Technical Assistance Center for assistance, thus creating a formal assistance request. As leading customers often have efficient knowledge management schemes in place, subsequent occurrences of that (initially) hard-to-debug outage are likely to be quickly debugged and promptly resolved by following the procedure used to resolve the first occurrence of the outage.
6.2 ANALYZING FIELD OUTAGE DATA
Customers' outage records can be analyzed to understand the actual reliability and availability of a product deployed by a particular customer, as well as to estimate modeling parameters and validate an availability model. The basic analysis steps are:
1. Select target customer(s)
2. Acquire the customer's outage data
3. Scrub the outage data
4. Categorize the outages
5. Normalize capacity loss
6. Calculate exposure time
7. Compute outage frequency
8. Compute availability

Each of these steps is reviewed below.
6.2.1 Select Target Customer(s)
As detailed in Chapter 4, both reported outage rates and outage durations may vary significantly between customers operating identical equipment. Thus, it is generally better to solicit and analyze a homogeneous dataset from a single customer than to aggregate data from different customers with different policies and procedures. Although the perceived reliability and availability from that single customer will not be identical to the perception of all customers, one can better characterize the operational policies and factors that determined the reliability/availability of the selected customer and, thus, intelligently extrapolate those results to other customers that might have somewhat different operational policies and factors. The criteria for selecting the target customer(s) to acquire data from include those that:

1. Have significant deployments of the target element. Realize that deployment includes both number of elements and months in service to produce overall element-years of service.
2. Are willing and able to provide the data. This depends on two core factors: the willingness of the customer to share this information with the supplier, and the ability of the customer's data systems to actually generate the report(s) necessary for an adequate analysis. Data systems at some customers may be regionally organized, or may segment outage-related data in such a way as to make it awkward or inconvenient to consolidate the data into a single dataset that can be efficiently analyzed.

It is often most convenient to arrange for the participation in the analysis of an engineer who is fluent in the spoken language of the customer to minimize practical issues associated with:
• Euphemisms. "Bumping," "bouncing," "sleeping," and "dreaming" are all euphemisms for specific element failure modes that some English-speaking customers use. Other failure-related euphemisms may be language-, country-, and perhaps even customer-specific.
• Abbreviations. "NTF" for "No Trouble Found" or "CWT" for "Cleared While Testing" may be common in English; however, non-English-speaking customers may use acronyms that are unfamiliar to nonnative speakers of those languages.
• Language subtleties. Customers in some countries may separately track outages attributed to "masculine other" (e.g., "otros" in Spanish) versus "feminine other" (e.g., "otras" in Spanish). Implications of these different classifications may not be obvious to engineers not intimately familiar with regional language usage patterns.

As a practical matter, in-country supplier technical support staff are typically well equipped to clarify any issues involved in translation/interpretation of a customer's outage trouble tickets.
6.2.2 Acquire Customer's Outage Data
Work with the targeted customer(s) to acquire:

1. The customer's outage tickets for the product of interest. Data is typically provided as a Microsoft Excel spreadsheet with one row per outage event, with columns containing various event-related parameters, including those enumerated at the beginning of this chapter. Twelve months of data is generally optimal for analysis because it gives visibility over an entire year, thus capturing any seasonal variations and annual maintenance/upgrade cycles. Some customers routinely offer this data to equipment suppliers as monthly, quarterly, or annual "vendor report cards," "vendor availability reports," and so on.
2. The number of network elements in service (by month) in the window of interest. The number of network elements in service is an important factor in determining the service exposure time. Chapter 9, Section 9.1 answers the question of "how much data is enough" in doing the field availability analysis.
3. The name of the equipment provider's customer team engineer who can answer general questions about the customer's
deployment of the target product (e.g., what software version is running and when it was upgraded, how the elements are configured).

The analysis produced from this data is intended strictly for the equipment supplier's internal use, and it is not suggested that the results be shared with the customer. The primary reasons for not generally offering to share the results of the analysis are:

1. Availability may be worse than the customer had realized.
2. The customer's operational definitions of availability may differ from the equipment supplier's definitions, potentially opening up an awkward subject (see Chapter 4, Section 4.1.1).
3. The analysis may reveal that the customer's policies, procedures, and other factors are better or worse than their competitors'. This is obviously important for the equipment supplier team to understand, but generally inappropriate to reveal to customers (see Chapter 4, Section 4.2).
6.2.3 Scrub Outage Data
Upon receipt of the outage data, one should review and scrub the data to address any gaps or issues. Specifically, one should check that:

• No outages for elements other than the target element are included.
• No outages before or after the time window of interest are included.
• There are no duplicate records.
• Incomplete or corrupt records have been repaired (by adding nominal or "unknown" values). In the worst case, events are omitted from the analysis.
• Key data fields (e.g., outage duration, outage impact/capacity loss, actual fix) are nonblank; repair with nominal values, if necessary.

Although TL 9000 metrics calculations may exclude brief and small-capacity-loss events, if records for those events are available, then they should be included in the reliability analysis to better understand the system's behavior for both "covered" failures and small-capacity-loss events.
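The scrubbing checks above lend themselves to a short script. The sketch below uses pandas and is illustrative only: the file name and the column names ("element", "start", "duration_min", "capacity_loss_pct", "actual_fix") are hypothetical and must be mapped to whatever fields the customer actually provides.

    # Illustrative scrub of a customer outage spreadsheet (hypothetical columns).
    import pandas as pd

    df = pd.read_excel("vendor_report_card.xlsx")   # one row per outage event

    # Keep only the target element and the analysis window of interest.
    df = df[df["element"] == "TARGET-NE"]
    df["start"] = pd.to_datetime(df["start"])
    df = df[(df["start"] >= "2008-01-01") & (df["start"] < "2009-01-01")]

    # Remove exact duplicate records.
    df = df.drop_duplicates()

    # Repair key fields with nominal or "unknown" values rather than dropping rows.
    df["duration_min"] = pd.to_numeric(df["duration_min"], errors="coerce")
    df["duration_min"] = df["duration_min"].fillna(df["duration_min"].median())
    df["capacity_loss_pct"] = df["capacity_loss_pct"].fillna(100.0)
    df["actual_fix"] = df["actual_fix"].fillna("unknown")

    print(f"{len(df)} outage records remain after scrubbing")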
6.2.4 Categorize Outages
While many customers explicitly categorize the root cause and/or actual fix of each outage, one should map each outage into standard categories to enable consistent analysis. At the highest level, outages should be classified into appropriate orthogonal categories, such as:

• (Product-attributable) Software/firmware—for software/firmware outages that are cleared by reset, restart, power cycling, etc.
• (Product-attributable) Hardware—for events resolved by replacing or repairing hardware
• Customer Attributable—for customer mistakes (e.g., work not authorized, work/configuration errors)
• External Attributable—for events caused by natural disasters (e.g., fires, floods) or caused by third parties (e.g., commercial power failures, fiber cuts, security attacks)
• Planned/Procedural/Other Product Attributable. Occasionally, product-attributable causes other than hardware or software/firmware will cause outages; thus, these events can be separately categorized to avoid compromising the hardware and software/firmware categories without attributing a product-related failure to the customer.

Outage recoveries should be categorized based on actual fix, duration, and so on, as:

• Automatic recoveries—for "unplanned" events, such as those listed as "recovered without intervention" or "recovered automatically." These outages are typically 3 minutes or less.
• Manual emergency recoveries—for "unplanned" events recovered by reseating or replacing circuit packs, manually restarting software, and so on. These outages are typically longer than 3 minutes.
• Planned/scheduled recoveries. When manual outage recovery is intentionally deferred to a maintenance window or off hours. This is typically flagged as "planned activity."

Optionally, second-tier classifications can also be added if desired, such as:
• Transaction processing—core service was lost
• Management visibility—alarms; management visibility was lost
• Connectivity—connectivity to gateways and supporting elements lost
• Provisioning, if reported by the customer
• Loss of redundancy. If the customer reports loss of redundancy or simplex exposure events, then those should be explicitly considered.
6.2.5 Normalize Capacity Loss
Customers often precisely quantize the capacity loss of each outage as a discrete number of impacted subscribers, lines, ports, trunks, etc. For availability calculations, these capacity losses should be normalized into percentages, e.g., a 100% outage if (essentially) all lines, ports, trunks, subscribers, etc., on a particular element were impacted. Operationally, round the capacity loss to the percentage loss from failure of the most likely failed component (e.g., line card [perhaps 10%], shelf/side [perhaps 50%], or the entire system [100%]).
6.2.6 Calculate Exposure Time
Exposure time of systems in service is measured in NE years. Operationally, one typically calculates this on a monthly basis by summing, across the months in the window, the number of elements in service in a particular month times the number of days in that month. Mathematically,

$$\text{NE years of service} = \frac{\sum_{\text{month}} \left( \text{Number of elements} \times \text{Days in month} \right)}{365.25} \tag{6.1}$$
6.2.7 Compute Outage Rates
Outage rates are not prorated by capacity loss, and are calculated via a simple formula like Equation (6.2), which calculates the hardware outage rate for a given network element:

$$\text{Outage rate}_{\text{Hardware}} = \frac{\sum \text{Product-attributable hardware outages}}{\text{NE years of service}} \tag{6.2}$$
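A minimal sketch of Equations (6.1) and (6.2) follows. The monthly in-service counts and the outage count are invented illustration values, not data from any real deployment.

    # NE-years of service (Eq. 6.1) and hardware outage rate (Eq. 6.2); values are illustrative.
    days_in_month = {"2008-01": 31, "2008-02": 29, "2008-03": 31}
    elements_in_service = {"2008-01": 400, "2008-02": 410, "2008-03": 425}

    ne_years = sum(elements_in_service[m] * days_in_month[m]
                   for m in elements_in_service) / 365.25          # Eq. (6.1)

    hardware_outages = 7          # product-attributable hardware outages in the window
    outage_rate_hw = hardware_outages / ne_years                   # Eq. (6.2)

    print(f"Exposure: {ne_years:.1f} NE-years")
    print(f"Hardware outage rate: {outage_rate_hw:.3f} outages per NE-year")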
One should separately compute hardware and software outage rates; optionally, one can compute outage rates for secondary categories (e.g., transaction processing, management visibility, connectivity, and provisioning). Similar calculations can be completed for the software outage rate or for other outage classifications. It is often insightful to compare outage rates for hardware, software, and procedural causes: what percentage of outages comes from each category?

It is straightforward to see that the reciprocal of the outage rate (say, λ) measures the average time until an outage occurs, or the mean time to outage (MTTO). We discussed in Chapter 5 that the single-parameter exponential distribution is often used to model the random variable of time to failure. In this case, once the outage rate is estimated from the failure data, the probability density function of time to outage is determined. Such estimators are known as parametric estimators in statistical estimation theory; parametric estimators are built from knowledge, or an assumption, of the probability density function for the data and for the quantities to be estimated. In the exponential case, since the true value of the parameter is unknown and the estimate is made from a set of noisy observations, it is desirable to evaluate the noise in the observations and/or the error associated with the estimation process. One way of getting an indication of the estimation confidence is to estimate confidence bounds or confidence intervals, say [L, U], for the outage rate, where L is the lower bound and U is the upper bound. The confidence interval associates the point estimate with an error or confidence level. For example, if an interval estimator is [L, U] with a given probability 1 − α, then L and U are called 100(1 − α)% confidence limits. This means that the true failure rate is between L and U with a probability of 100(1 − α)%. The chi-squared distribution can be used to derive the confidence limits for the failure rate estimate, which can be summarized as follows: for a sample with n failures during a total of T units of operation, the random interval between the two limits in Equation (6.3) will contain the true failure rate with a probability of 100(1 − α)%:

$$\hat{\lambda}_L = \frac{\chi^2_{1-\alpha/2,\,2n}}{2T} \qquad \text{and} \qquad \hat{\lambda}_U = \frac{\chi^2_{\alpha/2,\,2n}}{2T} \tag{6.3}$$
Section 3.1 in Appendix B documents the theoretical development of these limits. A numerical example follows. Assume that after T = 50,000 hours of testing, n = 60 failures are observed; the point estimate of the failure rate is

$$\hat{\lambda} = \frac{60}{50{,}000} = 0.0012 \text{ outages/hour} \tag{6.4}$$

For a confidence level of 90%, that is, α = 1 − 0.9 = 0.1, we calculate the confidence limits for the failure rate as

$$\hat{\lambda}_L = \frac{\chi^2_{1-\alpha/2,\,2n}}{2T} = \frac{\chi^2_{0.95,\,120}}{2 \times 50{,}000} = \frac{95.703}{100{,}000} = 0.000957 \text{ outages/hour} \tag{6.5}$$

and

$$\hat{\lambda}_U = \frac{\chi^2_{\alpha/2,\,2n}}{2T} = \frac{\chi^2_{0.05,\,120}}{2 \times 50{,}000} = \frac{146.568}{100{,}000} = 0.001465 \text{ outages/hour} \tag{6.6}$$
In summary, the point estimate of the outage rate is 0.0012 outages/hour, and with 90% probability the true outage rate lies within the interval (0.000957, 0.001465) outages/hour. This method can be applied to failure rate estimation, for both point and interval estimates, when analyzing failure data.
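The worked example above can be checked with a few lines of code. The sketch below uses scipy's chi-squared quantiles in place of printed tables; note that the book's convention χ²(q, 2n) (the value exceeded with probability q) corresponds to scipy's chi2.ppf(1 − q, 2n).

    # Point and 90% interval estimates of the failure rate (Eqs. 6.4-6.6), using scipy.
    from scipy.stats import chi2

    n, T, alpha = 60, 50_000.0, 0.10        # failures, hours of operation, 1 - confidence

    rate_hat = n / T                                          # Eq. (6.4)
    rate_lo = chi2.ppf(alpha / 2, 2 * n) / (2 * T)            # Eq. (6.5), ~0.000957
    rate_hi = chi2.ppf(1 - alpha / 2, 2 * n) / (2 * T)        # Eq. (6.6), ~0.001466

    print(f"point estimate: {rate_hat:.6f} outages/hour")
    print(f"90% interval  : ({rate_lo:.6f}, {rate_hi:.6f}) outages/hour")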
6.2.8 Compute Availability
Annualized downtime can be derived by dividing prorated outage durations by total in-service time; mathematically, this is

$$\text{Annualized downtime} = \frac{\sum_{\text{Product-attributable events}} \text{Capacity loss} \times \text{Outage duration}}{\text{NE years of service}} \tag{6.7}$$

Annualized downtime is generally the easiest availability-related number to work with because it is easy to understand, budget, and run "what-if" scenarios with. One should calculate annualized downtime for hardware and software separately, as well as overall product-attributable downtime.
Note that product requirements and/or TL 9000 outage measurement rules may support omitting some small-capacity-loss events from the annualized downtime calculation. For example, a total-capacity-loss-only calculation might exclude all events that impact less than 90% of capacity. Judgment and/or TL 9000 outage measurement rules may also support capping the maximum product-attributable outage duration to avoid factoring excess logistical or policy delays in outage resolution into the calculation. For example, it may be appropriate to cap product-attributable outage durations for failures in staffed offices to 1 hour, and to 2 or 4 hours in unstaffed locations. Although there may be a few longer-duration outages for events that are escalated through the customer's organization and/or to the equipment supplier in early deployments, customers will generally integrate this knowledge rapidly and generally have much shorter outage durations on subsequent failures.

As customers (and equipment-supplier decision makers) often consider availability as a percentage compared to "99.999%," one should also compute availability as a percentage using the following formula, where annualized downtime (in minutes per NE per year) is calculated from Equation (6.7) and 525,960 is the number of minutes in an average 365.25-day year:

$$\text{Availability} = \frac{525{,}960 - \text{Annualized downtime}}{525{,}960} \tag{6.8}$$
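A minimal sketch of Equations (6.7) and (6.8) follows; the outage records and exposure are invented illustration values. Each event is represented as a (capacity-loss fraction, prorated outage duration in minutes) pair.

    # Prorated annualized downtime (Eq. 6.7) and availability (Eq. 6.8); values are illustrative.
    events = [(1.00, 120.0), (0.50, 60.0), (0.10, 480.0)]  # (capacity loss, duration in minutes)
    ne_years = 100.0                                       # exposure from Equation (6.1)

    annualized_downtime = sum(loss * minutes for loss, minutes in events) / ne_years  # Eq. (6.7)
    availability = (525_960 - annualized_downtime) / 525_960                          # Eq. (6.8)

    print(f"Annualized downtime: {annualized_downtime:.3f} minutes per NE per year")
    print(f"Availability: {availability:.6%}")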
As we discussed in Chapters 2 and 5, Equation (6.8) is formulated based on the assumption that components have two states, up and down, and that the uptimes and downtimes are exponentially distributed. The uptimes and downtimes are estimated from recorded data and, hence, a confidence statement about the unavailability estimate can also be made from the same set of data. In fact, Baldwin and coworkers introduced the approach of estimating confidence limits of unavailability for power generating equipment in 1954 [Baldwin54]. The unavailability can be calculated from the availability equation in Equation (2.1):

$$\text{Unavailability} = 1 - A = 1 - \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}} \qquad \text{or} \qquad U = \frac{\lambda}{\lambda + \mu} \tag{6.9}$$
where λ and μ are the failure and repair rates, respectively, and U is the unavailability. Note that MTTF = 1/λ and MTTR = 1/μ. The average uptime duration m and the average downtime duration r can be estimated from the recorded data. Using these two values, a single point estimate of the unavailability can be evaluated from Equation (6.10):

$$\hat{U} = \frac{r}{r + m} \tag{6.10}$$
A confidence interval can also be derived from the same set of recorded data. Based on [Baldwin54], the F-distribution can be used to derive the confidence intervals for the unavailability. Section 3.2 in Appendix B describes the details; the end results are shown in Equation (6.11):

$$\text{Upper limit: } U_U = \frac{r}{r + \lambda'' m}, \qquad \text{Lower limit: } U_L = \frac{r}{r + \lambda' m} \tag{6.11}$$

where λ′ and λ″ are constants depending upon the chosen confidence level, which can be found from the F-distribution table. The example below shows the estimation process. Let a = the number of consecutive or randomly chosen downtime durations, and let b = the number of consecutive or randomly chosen uptime durations. Consider a component that operates in the field for which the following data is collected: a = b = 10, r = 5 hours, m = 2,000 days = 48,000 hours. Evaluate (1) the single-point estimate of unavailability and (2) the limits of unavailability that give 90% confidence of enclosing the true value. The point estimate of unavailability is Û = 5/(48,000 + 5) = 0.000104. From the conditions given, we have α = 0.90 (here α denotes the confidence level) and (1 − α)/2 = 0.05. Using the F-distribution tables, Pr[F_{20,20} ≥ λ′] = 0.05, hence λ′ = 2.12; and Pr[F_{20,20} ≥ 1/λ″] = 0.05, hence λ″ = 1/λ′ = 0.471. Therefore, the upper and lower limits can be derived from
Equation (6.11). Section 3.2 in Appendix B documents the theoretical development of these limits. The upper limit for U is U_U = 5/[5 + (0.471 × 48,000)] = 0.000221, and the lower limit for U is U_L = 5/[5 + (2.12 × 48,000)] = 0.0000491. From this example the following statements can be made:

• The single-point estimate of unavailability is 0.000104, or the availability is 99.9896%.
• There is a 90% probability that the true unavailability lies between 0.0000491 and 0.000221, or the availability is between 99.9779% and 99.99509%.
• There is a 95% probability that the true unavailability is less than 0.000221, or the availability is greater than 99.9779%.
• There is a 95% probability that the true unavailability is greater than 0.0000491, or the availability is less than 99.99509%.
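The F-distribution limits in this example can likewise be verified with scipy, as sketched below. Because a = b in this example, the two tail constants are reciprocals of each other, as in the text; the degrees of freedom are 2a and 2b.

    # Unavailability point estimate (Eq. 6.10) and 90% limits (Eq. 6.11) for the example above.
    from scipy.stats import f

    a = b = 10                  # number of downtime / uptime durations observed
    r, m = 5.0, 48_000.0        # average downtime and uptime, in hours
    conf = 0.90

    u_hat = r / (r + m)                                # Eq. (6.10), ~0.000104

    lam_p = f.ppf(1 - (1 - conf) / 2, 2 * a, 2 * b)    # lambda', ~2.12
    lam_pp = 1.0 / lam_p                               # lambda'' = 1/lambda' since a == b, ~0.471

    u_upper = r / (r + lam_pp * m)                     # Eq. (6.11), ~0.000221
    u_lower = r / (r + lam_p * m)                      # Eq. (6.11), ~0.0000491

    print(f"point estimate: {u_hat:.6f}")
    print(f"90% limits    : ({u_lower:.7f}, {u_upper:.6f})")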
6.3 ANALYZING PERFORMANCE AND ALARM DATA
Some systems directly record reliability-related parameters, such as software restarts or switchovers, as part of the system's performance counters. Depending on exactly how these metrics are defined and organized, one may be able to aggregate this data across multiple elements for a sufficiently long time to estimate the software failure rate and other parameters. Naturally, these techniques will be product specific, based on the precise performance and alarm counters that are available. Customers may deploy service assurance or management products to archive critical or all alarms. It may be possible to extract failure rates directly from the critical alarm data, but one must be careful to discard redundant and second-order alarms before making any calculations. Likewise, one must also exclude alarms from:

• Intermediate or adjacent elements, such as critical alarms raised by a base station because the backhaul connection was disrupted, or service failures because a supporting system (e.g., authentication server) could not be reached.
• Planned activities, such as applying software upgrades. Although service may not be disrupted by a software upgrade, a
software restart is often required, and the software restart is likely to cause a brief loss of management visibility. Regardless of whether or not service was impacted, this event does not contribute to a failure rate because the alarm was triggered by a planned upgrade rather than by a hardware or software failure. Also be aware that some restarts may be triggered by the upgrade of adjacent elements; for example, base stations might have to be restarted to resynchronize with upgraded software on a radio network controller.

Extracting reliability/availability parameter estimates from alarm data requires a deep understanding of a system's alarms and general behavior.
6.4 COVERAGE FACTOR AND FAILURE RATE
While all service-impacting failures are likely to generate one or more critical events (e.g., process restarts or switchovers), only a fraction of those alarmed events are likely to result in outage trouble tickets. A lower bound on the coverage factor can be established by Equation (6.12) (given for hardware; a similar formula can be used for software):

$$\text{Coverage factor}_{\text{hardware}} \ge \frac{\sum_{\text{hardware}} \text{Automatically recovered events}}{\sum_{\text{hardware}} \text{All critical events}} \tag{6.12}$$

Ideally, one would use correlated critical alarm data along with outage data via Equation (6.13) (given for software; a similar formula can be used for hardware):

$$\text{Coverage factor}_{\text{software}} \approx \frac{\sum_{\text{software}} \text{Unique critical alarms} - \sum_{\text{software}} \text{Manually recovered critical events}}{\sum_{\text{software}} \text{Unique critical alarms}} \tag{6.13}$$

Here, unique critical alarms serve as a proxy for software failures.
All manual emergency and scheduled recoveries are, by definition, nonautomatic and, hence, are uncovered. The uncovered outage rate should be equal to the failure rate times (1 − coverage):

$$\frac{\sum \text{Manual recoveries}}{\text{NE years of service}} \approx \text{Failure rate} \times (1 - \text{Coverage}) \tag{6.14}$$
The coverage factors for hardware and software should be estimated separately. After the coverage factor and the outage rate are estimated, the overall failure rate (as opposed to outage rate) can be derived from Equation (6.14).
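A small sketch of Equations (6.12) and (6.14) follows, solving Equation (6.14) for the overall failure rate; all event counts and the exposure are invented illustration values.

    # Hardware coverage (Eq. 6.12, lower bound) and total failure rate via Eq. (6.14); values illustrative.
    auto_recovered_hw_events = 45     # automatically recovered critical hardware events
    all_critical_hw_events = 50       # all critical hardware events
    manual_hw_recoveries = 5          # manually recovered (uncovered) hardware outages
    ne_years = 350.0                  # exposure from Equation (6.1)

    coverage_hw = auto_recovered_hw_events / all_critical_hw_events      # Eq. (6.12)
    uncovered_outage_rate = manual_hw_recoveries / ne_years              # left-hand side of Eq. (6.14)
    failure_rate_hw = uncovered_outage_rate / (1.0 - coverage_hw)        # Eq. (6.14) solved for failure rate

    print(f"Coverage factor (lower bound): {coverage_hw:.0%}")
    print(f"Estimated hardware failure rate: {failure_rate_hw:.3f} failures per NE-year")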
6.5 UNCOVERED FAILURE RECOVERY TIME
All manually recovered outage events and scheduled recoveries are, by definition, nonautomatic and, hence, uncovered. Durations of uncovered outage events inherently include uncovered failure detection time and manual recovery/repair time. Manually recovered outage durations will typically have a distribution from, perhaps, 5 to 15 minutes for some software events to hours for a few extraordinary cases. Rather than averaging the brief, typical events with the extraordinary cases (thus producing an excessively pessimistic view), the authors recommend using the more robust median value of manually recovered outage duration. Mathematically, the median represents the midpoint of the distribution; half the values are above and half the values are below. In contrast, the mean value (mathematical average) of manually recovered outage durations is often rather pessimistic because some portion of the long duration outages may have been deliberately parked and resolved on a nonemergency basis. Uncovered failure recovery times are generally different for hardware and software failures, and, thus, should be estimated separately. Over months, quarters, and years, uncovered failure recovery times are likely to shrink somewhat as the customers’ staff become more efficient at executing manual recovery procedures.
6.6 COVERED FAILURE DETECTION AND RECOVERY TIME
The median outage duration of automatically recovered outages is generally a good estimate of covered failure recovery time. Covered hardware and software failure recovery times may be different. Depending on the architecture of the system and the failure detection and recovery architecture, it might be appropriate to characterize several distinct covered failure recovery times. A good example of this is a system with several different redundancy schemes, such as an N+K front end and a duplex back-end database. Recovery for the front end may be as quick as routing all new requests away from a failed unit, whereas recovery for the back-end database may require synchronizing outstanding transactions, something that is likely to take longer than the front-end rerouting.
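Both of the preceding sections reduce the recovery-time parameters to medians over the appropriate subsets of outage durations. The sketch below illustrates this with invented durations; it also prints the mean to show how a single parked, long-duration outage can skew it.

    # Median (vs. mean) recovery-time estimates by failure class and recovery type; durations invented.
    from statistics import mean, median

    durations_min = {
        ("software", "manual"):    [8, 12, 15, 25, 240],   # one parked outage inflates the mean
        ("software", "automatic"): [0.5, 1.0, 2.0, 3.0],
        ("hardware", "manual"):    [30, 45, 60, 120],
        ("hardware", "automatic"): [1.0, 2.0, 2.5],
    }

    for (failure_class, recovery), values in durations_min.items():
        label = "uncovered" if recovery == "manual" else "covered"
        print(f"{failure_class:8s} {label:9s} recovery: "
              f"median {median(values):6.1f} min, mean {mean(values):6.1f} min")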
CHAPTER 7
ESTIMATING INPUT PARAMETERS FROM LABORATORY DATA
Results from some system testing and verification activities can be used to refine reliability parameter estimates to predict system availability as a product is released to customers; this chapter details how this can be done.
7.1 HARDWARE FAILURE RATE
Hardware failure rates can be predicted using well-known methodologies, such as those in Telcordia's SR-332 Issue 2 or MIL-HDBK-217F, which predict the failure rate for an entire field replaceable unit (FRU) by combining the predicted failure rates of the individual parts and other factors. Whereas MIL-HDBK-217F tends to be very conservative and overestimates the hardware failure rate, Telcordia SR-332 Issue 2 is anticipated to yield predictions closer to the actual observed hardware failure rates (Issue 2 is relatively new, so its accuracy is not fully determined yet). Commercial tools such as Relex are available to simplify producing these hardware failure rate predictions. There are also companies that provide failure rate estimation as a service.

Interestingly, the rate at which hardware field replaceable units are returned to equipment suppliers is quite different from the "actual" or "confirmed" hardware failure rate. Hardware that is returned by customers to a repair center and retested with no failure discovered is generally referred to as "no trouble found"
(NTF) or, sometimes, "no fault found" (NFF). Thus, the hardware return percentage is actually

$$\text{Returns \%} \approx \frac{\sum_{\text{Time period}} \left( \text{Confirmed hardware failures} + \text{No trouble found} \right)}{\text{Number of installed packs}} \tag{7.1}$$
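A minimal illustration of Equation (7.1) follows; the counts are invented, but they show how NTF returns can make the raw return rate substantially higher than the confirmed failure rate.

    # Hardware return rate (Eq. 7.1) versus confirmed-failure rate; counts are illustrative.
    confirmed_failures = 120
    no_trouble_found = 150          # NTF/NFF returns over the same time period
    installed_packs = 40_000

    return_pct = 100.0 * (confirmed_failures + no_trouble_found) / installed_packs   # Eq. (7.1)
    confirmed_pct = 100.0 * confirmed_failures / installed_packs

    print(f"Raw return rate       : {return_pct:.2f}% of installed packs")
    print(f"Confirmed-failure rate: {confirmed_pct:.2f}% of installed packs")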
"No trouble found" hardware occurs for a number of reasons, including:

• Poor diagnostic tools and troubleshooting procedures. If diagnostics, troubleshooting, training, or support are inadequate, then maintenance engineers may "shotgun" the repair by simply replacing packs until the problem is corrected. If the customer's policy prohibits packs that were removed during troubleshooting from being reinserted into production systems before being retested, then all of those packs will be returned. Obviously, a circuit pack that was replaced but did not resolve a problem is unlikely to have failed and, thus, is likely to be labeled "no trouble found" at the repair center. Note that some software failures may be misdiagnosed as hardware failures. For instance, a memory or resource leak that apparently causes a board to fail will appear to be "repaired" by replacing the board and, thus, may result in the original board being sent back for repair.
• Intermittent or transient problems. Some failures are intermittent or caused by power, signal, or other transient events. Often, the situations that trigger these intermittent or transient problems will be absent at the repair center and, thus, the hardware will be labeled "no trouble found."
• Stale firmware or obsolete patch version. Hardware design flaws, component quality issues, and firmware or device configuration bugs are occasionally discovered in production hardware. As repair centers will often apply all recommended hardware and firmware changes to a returned circuit pack before retesting it, it is possible that one of the recommended changes corrected the original failure and, thus, no failure is found when the pack is retested after completion of the recommended changes. Depending on warranty repair policy, customers may even be motivated to represent packs as having failed to get them updated to the latest hardware and firmware revisions if they would normally be charged for those updates.
No trouble found rates for complex hardware can run quite high; NTF packs can even outnumber confirmed hardware failures. Although NTFs clearly represent a quality problem, one should be careful using actual hardware return rates in place of predicted or confirmed hardware failure rates. TL 9000 actually measures return rates based on the length of time individual packs have been in service:

• The early return indicator (ERI) captures the rate of hardware returns for the first six months after shipment of a hardware component.
• The yearly return rate (YRR) captures the rate of hardware returns for the year following the ERI period.
• The long-term return rate (LTR) captures the rate of hardware returns for all time after the YRR time period.

It should be noted that the hardware failure rate for a particular component may vary significantly by customer, the locality in which it is deployed, and various other factors. Variables such as temperature, humidity, electrical stress, and contaminants in the air can all affect hardware reliability. For example, a component deployed in a city with high levels of pollution and poor (or no) air conditioning is likely to have significantly higher failure rates than the same component in a rural area in an air-conditioned equipment room.

Ideally, the actual hardware failure rate would be used in all calculations. Sadly, the actual hardware failure rate is not as easy to calculate as it would seem. One of the more difficult inputs to determine is the exposure time; that is, for how many hours has the component been installed and powered in the field? Although this sounds simple on the surface, numerous factors conspire to make it difficult to determine. Most manufacturers know when a component was shipped, but do not know how long the interval from shipping to installation is. Additionally, many customers order spares, which sit in a spares cabinet until needed. These spare components, if included in the failure rate calculation, would make the failure rate look better than it really is. To accurately calculate hardware failure rates, these factors must be considered. The good news, if there is any, is that hardware failures typically contribute a relatively small portion of system downtime in the typical high-availability system, thus
reducing the impact of errors in calculating the hardware failure rate.
7.2 SOFTWARE FAILURE RATE
As system testing is generally designed to mimic most or all aspects of typical customers' operational profiles, one would expect system test experience to represent a highly accelerated test period, vaguely analogous to accelerated hardware testing. Although hardware and software failures are fundamentally different, the rate of encountering new, stability-impacting defects in the system test environment should be correlated with the rate of encountering stability-impacting defects in field operation. Software reliability growth modeling (SRGM) is a technique that can be used to model the defect detection process. It analyzes defects found during the system test period to estimate the rate of encountering new, stability-impacting defects. By comparing actual software failure rates from field data (described in Chapter 6, Section 6.3) with the testing failure rate analyzed by SRGM for the corresponding releases, one can estimate the acceleration or calibration factor to convert the rate of new, stability-impacting defects per hour of system testing to software failures per year in the field.

Other techniques for software failure rate prediction use software metrics. We propose a mapping method for early software failure rate prediction (say, in the design phase). In this method, software metrics such as code size, complexity, and maturity (or reuse rate) are assessed with objective or subjective rankings. Software failure rates are then predicted by mapping a combination of the metric settings to software failure rate data. This method relies on historical data of software metrics and software failure rates; see Chapter 8, Section 8.1.5 for details. Other metrics, such as defect density and function points, are also used in software size, effort, and failure rate prediction. The function point is a standard metric for the relative size and complexity of a software system, originally developed by Alan Albrecht of IBM in the late 1970s [FP08]. These are static methods that correlate the defect density level (say, the number of defects per thousand lines of code, KLOC) or function point level to software failure rates. The problem with the density approach is that the relationship between defects and failures is unknown, and this kind of approach oversimplifies it. We should be very cautious in
attempting to equate fault density to failure rate [Jones1991, Stalhane1992]. Another example shows defect density in terms of function points for different development phases. Table 7.1 [Jones1991] reports a benchmarking study based on a large amount of data from commercial sources. This kind of benchmark helps with defect density prediction at a high level, but caution should always be taken when applying it to a specific application or project. Another constraint is that the function points would have to be measured first, and function point measurements are not available in every project.

The next section reviews the theory of SRGM, explains the steps needed to complete SRGM, and reviews how to convert the results of SRGM into parameters to be used in an availability model. References [Lyu96, Musa98, Pham2000] provide both detailed background information about SRGM and good summaries of the most widely used SRGMs.
7.2.1 Theory of Software Reliability Growth Modeling (SRGM)
The most widely used class of SRGMs assumes that the fault discovery process follows a nonhomogeneous Poisson process (NHPP). The software debugging process is modeled as a defect counting process, which can be modeled by a Poisson distribution. The word "nonhomogeneous" means that the mean of the Poisson distribution is not a constant, but rather a time-dependent function. In this book, we use the term fault to mean a bug in the software and failure to mean the realization by a user of a fault. The mean value function of the NHPP is denoted by m(t), which represents the expected number of defects detected by time t. Quite often, it is defined as a parametric function depending on two or more unknown parameters.
Table 7.1. Sample defects per function point

Defect origins      Defects per function point
Requirement         1
Design              1.25
Coding              1.75
Documentation       0.6
Bad fixes           0.4
Total               5
The most common models use a continuous, nondecreasing, differentiable, bounded mean value function, which implies that the failure rate of the software, λ(t) = m′(t), monotonically goes to zero as t goes to infinity. The time index associated with NHPP SRGMs can represent cumulative test time (during the testing intervals) or cumulative exposure time among users (during the field operation phases). In the former case, the application of the model centers on being able to determine when the failure rate is sufficiently small that the software can be released to users in field environments. In the latter case, the application of the model centers on estimating the failure rate of the software throughout the early portion of its life cycle, and also on collecting valuable field statistics that can be folded back into test environment applications of subsequent releases.

One of the earliest choices for a mean value function is m(t) = a(1 − e^{−bt}), proposed by Goel and Okumoto [Goel79a]. Here, a denotes the expected number of faults in the software at t = 0 and b represents the average failure rate of an individual fault. The Goel–Okumoto (GO) model remains popular today. Recent applications and discussions of the GO model can be found in [Wood96, Zhang02, Jeske05a, Jeske05b, and Zhang06]. The mean value function of the GO model is concave and, therefore, does not allow a "learning" effect to occur in test environment applications. Learning refers to the experience level of system testers, which ramps up in the early stages of the test environment as the testers and test cases become more proficient at discovering faults. An alternative mean value function that has the potential to capture the learning phenomenon is the S-shaped mean value function [Yamada83, Ohba84, Pham97, Goel79b, Kremer83].

The most important application of SRGMs is to use them to predict the initial field failure rate of software and the number of residual defects as the software exits a test interval. Prediction of the initial field failure rate proceeds by collecting failure data (usually grouped) during the test interval and, typically, using maximum likelihood estimation (MLE) to estimate the parameters of m(t). Using SRGMs to dictate how much additional test time is required to ensure that the failure rate is below a specified threshold when the software is released is less common. The basic steps of applying the SRGM method are:

1. Use SRGM (one or more models) to fit the testing data
2. Select the model that gives the best "goodness of fit"
3. Use statistical methods to estimate the parameters in the SRGMs to obtain the software failure rate during testing
4. Calibrate the software failure rate during testing to predict the software failure rate in the field

Appendix C documents the methods for calculating the maximum likelihood estimates of the SRGMs and the criteria for selecting the best-fit model(s). Figure 7.1 shows a typical example of applying SRGM to predict the software failure rate. The X-axis gives the cumulative testing effort, generally expressed in hours of test exposure. Ideally, this represents actual tester hours of test exposure, excluding administrative and non-testing tasks, test setup time, defect reporting time, and even tester vacations, holidays, and so on. The Y-axis gives the cumulative number of nonduplicate, stability-impacting defects. If the project does not explicitly identify stability-impacting defects, then nonduplicate, high-severity defects are often a good proxy. The smooth curve represents a curve fitted through the data that asymptotically approaches discovering the "last bug" in the system. The vertical dotted line shows the point in time when the data ended (i.e., how much cumulative testing had been completed).
Figure 7.1. Software reliability growth modeling example. (X-axis: cumulative testing effort; Y-axis: cumulative defects; annotation: estimated SW failure rate = number of residual defects × per-fault failure rate.)
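The curve in Figure 7.1 can be produced by fitting the Goel–Okumoto mean value function m(t) = a(1 − e^{−bt}) to cumulative counts of stability-impacting defects. The sketch below uses a simple least-squares fit for brevity, whereas the text (and Appendix C) uses maximum likelihood estimation; the defect data and the lab-to-field calibration factor are invented illustration values.

    # Fit the Goel-Okumoto model to cumulative stability-impacting defect counts (illustrative data).
    import numpy as np
    from scipy.optimize import curve_fit

    def go_mean(t, a, b):
        """GO mean value function: expected cumulative defects by cumulative test effort t."""
        return a * (1.0 - np.exp(-b * t))

    test_hours = np.array([100, 200, 400, 600, 800, 1000, 1200], dtype=float)
    cum_defects = np.array([14, 25, 40, 49, 55, 58, 60], dtype=float)

    (a_hat, b_hat), _ = curve_fit(go_mean, test_hours, cum_defects, p0=(80.0, 0.001))

    residual_defects = a_hat - cum_defects[-1]   # estimated residual defects at end of test
    lab_rate = residual_defects * b_hat          # new stability-impacting defects per test hour
    calibration = 0.01                           # hypothetical lab-to-field calibration factor
    field_rate = lab_rate * calibration          # predicted field software failure rate

    print(f"a = {a_hat:.1f} total defects, b = {b_hat:.5f} per test hour")
    print(f"residual defects ~ {residual_defects:.1f}, lab rate ~ {lab_rate:.4f}/test hour")
    print(f"calibrated field rate ~ {field_rate:.6f} failures/hour (illustrative)")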
The gap between the cumulative number of defects discovered at the vertical dotted line and the cumulative-defects asymptote estimates the number of residual stability-impacting defects. For example, if the GO model m(t) = a(1 − e^{−bt}) is used here, then the total number of defects is a and the average failure rate of a fault is b [Zhang02]. From the graph, the number of residual defects is the vertical distance from the asymptote to the cumulative number of defects by time T. By multiplying the per-fault failure rate by the estimated number of residual defects, one can estimate the rate of discovering new, stability-impacting defects at the end of system testing.

Typically, although testers try to mimic the user's environment, the test environment and the field environment do not match up completely. The reasons for the mismatch of the two environments are:

1. During the testing phase, testers intentionally try to break the software and find the defects. In this sense, software testing is more aggressive and, hence, yields a much higher defect detection rate.
2. During field operation, the users are using the software for its designed purpose and, hence, field operation is much less aggressive.
3. Most of the software defects are detected and removed during the testing interval, and the remaining defects are significantly less likely to trigger failures in the field.

Hence, when we predict the software field failure rate from the software failure rate estimated in the testing environment, we need to adjust for the mismatch between the testing and field environments by using calibration factors. [Zhang02] and [Zhang06] document details of why and how to address this practical issue when using SRGMs. To adjust for the mismatch, the testing-environment rate should be correlated with the software failure rate observed in the field by using a calibration factor. This calibration factor is best estimated by comparing the lab failure rate of a previous release against the field failure rate for that same release. Assuming that testing strategies, operational profiles, and other general development processes remain relatively consistent, the calibration factor should be relatively constant. References [Zhang02] and
[Jeske05a] discuss more details on calibrating the software failure rate estimated in the testing environment to predict the software failure rate in the field. [Jeske05b] and [Zhang06] also discuss two other practical issues: noninstantaneous defect removal time and deferral of defect fixes. Most SRGMs focus on defect detection and assume that fault removal is instantaneous, so that software reliability growth is achieved as soon as software defects are detected and all detected defects are fixed before the software is released. In practice, it takes a significant amount of time to remove defects, and fixes of some defects might be deferred to the next release for various reasons, for example, if the fix is part of a new feature. Other practical issues include imperfect debugging and imperfect fault removal. Fortunately, these behaviors are often fairly consistent from release to release and, thus, can be roughly addressed via proper calibration against field data.

In addition to quantitatively estimating software failure rates, SRGM offers a qualitative, easy-to-understand comparison of releases by overlaying multiple releases onto a single chart. For example, consider Figure 7.2, below. The triangles give the fitted SRGM curve for the first major release of a particular product; the diamonds give the fitted curve of the second major release. Even though the second release was tested more than the first release, the testers clearly had to work harder to find stability-impacting
Figure 7.2. Comparing releases with SRGM. (Y-axis: cumulative defects.)
defects; this suggests significant software reliability growth from the first release to the second release.

The key features of implementing the SRGM approach to estimate software reliability include:

• Normalize the test exposure against the "real" effort, rather than calendar time. System testing effort is often nonuniform over time because blocking defects can be encountered at any time, thus slowing progress. Likewise, work schedules undoubtedly vary across the test interval with holidays, vacations, meetings, administrative activities, and so on, as well as periods of very long hours (often toward the end of the test interval). Plotting defects against calendar time homogenizes all these factors, making it hard to understand what is really happening. The authors recommend normalizing to "hours of testing effort" to remove the impact of these factors.
• Focus on stability-impacting defects. It is not uncommon for reported rates and trends for minor defects and enhancement requests to vary over the test cycle. Early in the cycle, testers may report more minor events, especially if they are blocked from their primary testing; later in the test cycle, testers may be too busy to file minor defect or enhancement reports. Some projects even report that severity definitions vary over the course of the development cycle, in that if a severe defect is discovered in the first half of testing, it may be categorized as severity 2 (major), but if that same defect is discovered in the last third of the test cycle, it might be categorized as severity 1 (critical). Limiting the scope to stability-impacting defects avoids these reporting variations. One can compare characteristics of stability-impacting defects with those of other defect severities to check data validity, but stability-impacting defects will drive the outage-inducing software failure rate. The best situation is one in which the data can be scrubbed carefully to identify service-impacting defects. If this cannot be achieved, SRGM can be applied to data sets of different severities. The trends for different groups of severity levels can be analyzed and compared, as shown in Figure 7.3. The ideal situation is for the defect tracking system to provide a field for "stability impacting" and have the system testers fill this in. This eliminates the need to scrub the data and avoids some of the other issues associated with using severity levels.
Figure 7.3. SRGM example by severity. (Y-axis: cumulative defects.)
SRGM makes the following assumptions:

1. System test cases mimic the customers' operational profile. This assumption is consistent with typical system test practice.
2. System testers recognize the difference between a severe outage-inducing software failure (also known as a "crash") and a minor problem. If system testers cannot recognize the difference between a critical, stability-impacting defect and a minor defect, then all system test data is suspect.
3. Severe defects discovered prior to general product availability are fixed. If, for some reason, the fix of some detected defects will be deferred, these defects should be counted as residual defects.
4. System test cases are executed in a random/uniform manner. "Hard" test cases are distributed across the entire test interval, rather than being clustered (e.g., pushed to the end of the test cycle).
7.2.2 Implementing SRGM
Below are the steps used to implement SRGM:

1. Select the test activities to monitor
2. Identify stability-impacting defects
3. Compute the test exposure time
4. Combine and plot the data
5. Analyze the data to estimate the per-fault failure rate and the number of residual service-impacting defects

7.2.2.1 Select the Test Activities to Monitor

Product defects are reported throughout the development cycle and out to trial and commercial deployment. Because SRGM relies on normalizing defect discovery against the testing effort required to discover those defects, it is critical to understand what test activities will be monitored. Typically, one will focus on defects and exposure time for the following test activities:
System feature testing, including regression testing Stability testing Soak testing Network-level or cluster-level testing
Regression tests are important since regression tests ensure that the detected defects are removed and, hence, that software reliability growth really takes place. (Traditional SRGMs assume reliability growth using the detection times of the defects; that is, defects are removed as soon as they are detected. For most applications this is reasonable, since regression tests ensure that the defects are removed.) Another reason is that, for a given operational profile, the regression tests typically indicate that software stability is achieved. Defects generated from the following activities should be excluded from consideration when estimating the software failure rate in the testing environment:

• Developer unit testing and coding activities
• Unit/system integration testing
• Systems engineering/controlling document changes
• Design, document, and code reviews
• Trial/field problems
The reasons for not including these defects are (1) the goal is to estimate system-level software failure rates, and the defects from the early or later development phases are not representative of the system software failure rate; and (2) normalizing exposure times during these phases is difficult. Stability-impacting defects discovered during nonincluded activities are not "lost"; rather, they represent uncertainty about
where the "0 defects" horizontal axis on the SRGM plot should have been. Fortunately, the location of the "0 defects" X-axis is unimportant in SRGM because the gap between the defects discovered and the defect asymptote is what really matters.

7.2.2.2 Identify Stability-Impacting Defects
Having selected the test activities to focus on, one must then set a policy for determining which defects from those activities will "count." The options, in descending order of preference, are:

1. Include an explicit "stability-impacting" flag in the defect tracking system and instruct system testers to assert this flag as appropriate. A product-specific definition of "stability-impacting" defect should assure consistent use of this flag.
2. Manually scrub severity 1 and severity 2 defects to identify those that really are stability-impacting, and only consider these identified/approved events.
3. Use severity 1 and severity 2 defects "raw," without any scrubbing (beyond filtering out duplicates).

Note. Duplicate defects must be removed from the dataset. For spawned/duplicate defects, only include the parent defects. Nondefects (typically no-change modification/change requests), user errors, and documentation defects should be removed from the dataset. Defects detected during the customer-based testing period are typically analyzed separately. We use them to proxy the software failure rates in the field environment, which can then be used to calibrate the test and field software failure rates to improve future predictions.

7.2.2.3 Compute the Test Exposure Time
Test exposure time represents the actual time that the system is exposed to testing, typically expressed in hours, measured on a weekly basis (or more frequently). On some projects, this information is directly recorded in a test management and tracking tool. If test hours are not directly recorded, then they can generally be estimated as tester half-days of test effort per week. The weakest (but still acceptable) measure of test exposure is normalizing against test cases attempted or completed per week. Test-cases-per-week data are tracked on most projects, so this data should be readily available. Unfortunately, there can be a wide variation in the time required to complete each individual test case,
making it a weak proxy for exposure time. Nevertheless, it is superior to calendar time and should be used when tester-hours or tester half-days-per-week data is not available.

7.2.2.4 Combine and Plot the Data
Cumulative test effort and cumulative stability-impacting defects are recorded on at least a weekly basis. Data can be recorded in a simple table, as shown in Table 7.2, and plotted. This type of tabular data can often be easily imported into an SRGM spreadsheet tool for analysis.

7.2.2.5 Analyze the Data
The most important characteristic to assess from the SRGM plot is whether the defect discovery rate is linear, or whether the curve is bending over and asymptotically approaching the predicted number of stability-impacting defects for the particular operational profile. Several different curve-fitting model strategies can be used to predict the asymptote and slope, but a simple visual analysis will often show whether the defect discovery rate is slowing down. A curve gently approaching an asymptote suggests that new stability-impacting defects are becoming harder and harder for system testers to find and, thus, that acceptable product quality/reliability is imminent or has been reached. A linear defect-discovery rate indicates that there are still lots of stability defects being discovered during testing and, thus, that the product is not sufficiently stable to ship to customers. Figure 7.4 shows the three stages of the software debugging process. A highly effective visualization technique is to simply overlay the SRGM curves for all known releases, as shown in Figure 7.2, making it very easy to see where the current release is at any point in time compared to previous releases. Typically, the cumulative defect data show either a concave or an S-shaped pattern.
Table 7.2. Sample SRGM data table

Week index    Cumulative test effort (machine-hours)    Cumulative stability-impacting defects
Week 1                        255                                         8
Week 2                        597                                        17
Week 3                        971                                        28
...                           ...                                       ...
Figure 7.4. Software debugging stages.
The concave curve depicts the natural defect debugging process; that is, the total number of detected defects increases as testing continues. The total number of detected defects grows at a slower and slower slope and eventually approaches the asymptote as the software becomes stable. The S-shaped curve, on the other hand, indicates a slower defect detection rate at the beginning of the debugging process, which could be attributed to a learning-curve phenomenon [Ohba84, Yamada92], to software process fluctuations [Rivers98], or to a combination of both. Then the detected defects accumulate at a steeper slope as testing continues and, eventually, the cumulative defect curve approaches an asymptote. Accordingly, the SRGM models can be classified into two groups: concave and S-shaped models, as shown in Figure 7.5. There are more than 80 software reliability growth models in the literature (see Appendix C for details). In practice, a few of the early models find wide application in real projects. There are two reasons for this:
Figure 7.5. Concave versus S-shaped SRGMs.
1. Most of the models published later were derived from the early models. They typically introduce more parameters to incorporate different assumptions about the debugging process, such as debugging effort, fault introduction, and imperfections in the debugging process, but they are not fundamentally different from the early, simpler models.
2. Models with more parameters require larger datasets, which can be a realistic limitation.

A few frequently used SRGM models are described below. We previously discussed the GO model, which is one of the earliest but most widely used models. Its mean value function, m(t) = a(1 − e^(−bt)), has a concave shape, where a represents the total number of software defects and b represents the average failure rate of a fault. So the GO model assumes that there is a fixed number of total defects in the software and that, on average, each of these defects causes a failure at a rate b. The mean value function m(t) represents the number of defects expected to be found by time t. Another concave SRGM is the Yamada exponential model [Yamada86]. Similar to the GO model, this model assumes a constant a for the total number of defects, but its fault detection function incorporates a time-dependent exponential testing-effort function; its mean value function is m(t) = a(1 − e^(−rα[1 − e^(−βt)])). Figure 7.6 is an example demonstrating the close fit of both the GO and Yamada exponential models to a given (concave) dataset.
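To make the curve-fitting step concrete, the sketch below fits the GO model m(t) = a(1 − e^(−bt)) to cumulative-defect data in the format of Table 7.2 using nonlinear least squares. The weekly figures are hypothetical, and the use of SciPy's curve_fit is an illustrative assumption rather than the authors' toolchain; the fitted a and b then give the residual-defect and failure-rate estimates described in the text.

```python
# Sketch: fitting the GO model m(t) = a*(1 - exp(-b*t)) to SRGM data.
# Assumes SciPy is available; the data values below are hypothetical.
import numpy as np
from scipy.optimize import curve_fit

def go_model(t, a, b):
    """GO mean value function: expected cumulative defects after t hours of test effort."""
    return a * (1.0 - np.exp(-b * t))

# Cumulative test effort (hours) and cumulative stability-impacting defects per week
effort = np.array([255.0, 597.0, 971.0, 1340.0, 1725.0, 2100.0, 2480.0])
defects = np.array([8.0, 17.0, 28.0, 34.0, 38.0, 41.0, 42.0])

# Initial guesses: asymptote a somewhat above the last observation,
# per-fault failure rate b roughly from the early slope.
(a_hat, b_hat), _ = curve_fit(go_model, effort, defects, p0=[2 * defects[-1], 1e-3])

residual = a_hat - defects[-1]       # estimated residual stability-impacting defects
failure_rate = residual * b_hat      # rough failure rate per hour of test exposure
print(f"a = {a_hat:.1f}, b = {b_hat:.2e}, residual defects = {residual:.1f}")
print(f"predicted software failure rate = {failure_rate:.2e} failures/hour")
```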
Figure 7.6. Concave SRGM model examples.
The delayed S-shaped model [Yamada83] and the inflexion S-shaped model [Hossain93] are two representative S-shaped models. The delayed S-shaped model is derived from the GO model, with modifications made to make it S-shaped. The inflexion S-shaped model is also extended from the GO model; the defect detection function is modified to make it S-shaped. The mean value functions of the two models are m(t) = a[1 − (1 + bt)e^(−bt)] and m(t) = a(1 − e^(−bt))/(1 + βe^(−bt)), respectively; the latter reduces to the GO model if β = 0. Some models [Pham99] can be either concave or S-shaped; that is, these models can take different shapes when fitted to different data. Typically, these models have more parameters. They often have greater goodness-of-fit power but, on the other hand, more parameters need to be estimated. Figure 7.7 shows these models fitted to an S-shaped dataset. When applying these models to the data, the first criterion is that the model pattern should match the data pattern. Then the parameters in the model need to be estimated (see Parameter Estimation in Appendix C for details), and the model that provides the best fit is selected (see Model Selection in Appendix C for details). For some data, several SRGMs might provide relatively close results, which provides useful confirmation. In this situation, one of these models (typically the one with the fewest parameters) can be used from a practical point of view. Sometimes the curves might look close, but it is good practice to check the goodness-of-fit readings (for details, see Section 2 in Appendix C).
Figure 7.7. S-shaped SRGM model examples.
Commercial tools and self-developed programs are typically used to estimate the parameters. The number of residual defects and the average failure rate of a fault are obtained, from which the software failure rate can be predicted. Typically, the cumulative defects and cumulative testing time are input to these software tools, and the tools (1) produce estimates of the parameters in the models and (2) compare the fitted curve with the raw data to show the goodness of fit. The software failure rate and the number of residual defects can then be calculated. In addition to the mathematical curve-fitting technique, if one has both validated the field software failure rate and completed SRGM plots for one or more previous releases, then one can "eyeball" the curves and may be able to visually estimate the overall software failure rate.

7.2.2.6 Factor the Software Failure Rate
Typically, architecture-based system availability models will require finer grained software failure rate inputs than a single overall failure rate estimate. For example, failure rates might be estimated per FRU or even per processor, at the application versus platform level, or even down to the module/subsystem level. Thus, it is often necessary to factor the overall software failure rate into constituent failure rates, which can be input to the availability model. Several strategies for factoring the overall failure rate are:

• By defect attribution. Defects are typically assigned against a specific software module (or perhaps FRU), thus making it very
easy to examine the distribution of stability-impacting defects by FRU.

• By software size. Many projects track or estimate lines of new/changed code by software module, thus making it easy to examine the distribution of new/changed code across a release.

• Other qualitative factors, such as module complexity, reuse rate, and frequency of execution, can also be considered, as can the expert opinion of software architects, developers, and system testers.

One or more of these factors can be used to allocate the software failure rates to software modules, which are direct inputs for system availability models. Once the failure rates of software modules are determined, two objectives can be achieved: (1) rank the software modules according to their failure rates and identify the high-risk software modules, and (2) feed the software failure rates and other statistics analyzed from the testing data, such as fault coverage and failure recovery times, back into the architecture-based models and update the availability prediction.
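As a rough illustration of the allocation step, the sketch below splits a hypothetical overall failure rate across modules in proportion to new/changed lines of code. The module names, code sizes, and the choice of purely proportional allocation are illustrative assumptions; in practice the qualitative factors above would also be weighed.

```python
# Sketch: allocating an overall software failure rate to modules by a size weight.
# The overall rate and the new/changed-KLOC figures below are hypothetical.
overall_failure_rate = 6.0  # failures per system per year (e.g., estimated via SRGM)

new_changed_kloc = {        # new/changed thousands of lines of code per module
    "OAM": 40.0,
    "call_processing": 75.0,
    "platform_glue": 25.0,
    "drivers": 10.0,
}

total_kloc = sum(new_changed_kloc.values())
module_failure_rates = {
    module: overall_failure_rate * kloc / total_kloc
    for module, kloc in new_changed_kloc.items()
}

# Rank modules by allocated failure rate (objective 1 in the text)
for module, rate in sorted(module_failure_rates.items(), key=lambda kv: -kv[1]):
    print(f"{module:16s} {rate:5.2f} failures/year")
```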
7.3 COVERAGE FACTORS
Fault-insertion testing provides one of the best mechanisms for estimating the coverage factor, and it is recommended in standards such as GR-282-CORE, Software Reliability and Quality Acceptance Criteria (SRQAC), objective O3-11 [7], which states: "It is desirable that the system be tested under abnormal conditions utilizing software fault insertion." Fault-insertion testing is one of the best practices for creating abnormal conditions, so in many cases fault-insertion testing can be used to achieve multiple goals. The coverage factor can be estimated approximately from the results of fault-insertion testing by averaging the first and final test pass rates. Assuming that system testers choose appropriate (i.e., nonredundant) faults to insert into the system and set correct pass criteria, the probability of the system correctly addressing an arbitrary fault occurring in the field is expected to fall between the first-pass test-pass rate (p1) for fault-insertion tests and the final-pass test-pass rate (p2). We use the first-pass test-pass rate to proxy the coverage for untested faults and the final-pass test-pass rate to proxy the coverage for tested faults. Assume that f represents the percentage of the entire fault population that is not covered by the
selected/inserted faults. Mathematically, the coverage of the system can be estimated as

Coverage factor ≈ f × p1 + (1 − f) × p2        (7.2)

where f represents the fraction of the fault population that is not tested by the selected faults, p1 is the first-pass test-pass rate, and p2 is the final-pass test-pass rate. As a starting point, one can set f to 50%, and the coverage factor estimate then simplifies to

Coverage factor ≈ (p1 + p2) / 2        (7.3)
Coverage factors should be estimated separately for hardware and software by considering only software fault-insertion test cases and results when calculating the software coverage factor, and only hardware cases and results when calculating the hardware coverage factor. Because best practice is to attempt several dozen hardware fault-insertion tests against complex boards, one can often estimate the hardware coverage factor for at least some of the major boards in a system; when possible, those board-specific hardware coverage factors should be used.
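A minimal worked example of Equations 7.2 and 7.3, with hypothetical first-pass and final-pass rates:

```python
# Sketch: coverage factor from fault-insertion results (Equations 7.2 and 7.3).
# The pass rates below are hypothetical.
def coverage_factor(p1: float, p2: float, f: float = 0.5) -> float:
    """f weights untested faults (proxied by first-pass rate p1);
    (1 - f) weights tested faults (proxied by final-pass rate p2)."""
    return f * p1 + (1.0 - f) * p2

first_pass = 0.82   # 82% of fault-insertion tests passed on the first attempt
final_pass = 0.97   # 97% passed on the final attempt, after fixes
print(f"Estimated coverage factor: {coverage_factor(first_pass, final_pass):.1%}")
```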
7.4 TIMING PARAMETERS

7.4.1 Covered Failure Detection and Recovery Time
Recovery time for covered hardware failures is measured during appropriate fault-insertion testing. The measured time runs from the start of service impairment (often the same as the fault-insertion time) to the service restoration time. Note that this should include the failure detection duration, although the fault detection duration might be very short compared to the recovery duration. Best practice is to execute each test case several times and use the median value of the measured detection-plus-recovery latencies for modeling. Covered software failures are typically recovered by process restart; task, process, or processor failover; or processor/board reboot. One-second resolution is best for most systems, but 6 seconds (0.1 minute) or 15 seconds (0.25 minute) is also acceptable.
7.4.2 Uncovered Failure Detection and Recovery Time
Uncovered failure recovery time is often estimated from the field performance of similar products deployed in similar situations and with similar customers. Results of serviceability studies can reveal differences that might shorten or lengthen uncovered failure recovery time relative to similar products. Typical uncovered failure recovery times for equipment in staffed locations are:

• Uncovered failure detection time on an active unit. Thirty minutes is generally a reasonable estimate of the time to detect an uncovered fault on an active unit. Elements that are not closely monitored by the customer and/or elements that are not frequently used might use longer uncovered failure detection times; uncovered failures on some closely monitored systems might be detected even faster than 30 minutes.

• Uncovered failure detection time on a standby unit. Twenty-four hours is often assumed for uncovered failure detection time on standby units. For example, best practice is for customers to perform routine switchovers onto standby units periodically (e.g., every week or every day) to verify both that all standby hardware and configurations are correct and that staff are well practiced on emergency recovery procedures. Elements with standby redundancy that are more closely monitored and execute routine switchovers may use shorter uncovered failure detection times on standby units; customers with less rigorous maintenance policies (e.g., not frequently exercising standby units) might have longer uncovered failure detection times on standby units.
7.4.3 Automatic Failover Time
Automatic failover times are measured in the laboratory during switchover tests. Best practice is to repeat the switchover tests several times and use the median value in availability modeling.
7.4.4 Manual Failover Time
If the manual failover time for a previous release or similar product is available, then use that value. That value could be refined based on the results of a serviceability assessment. Thirty minutes is a typical value for equipment in staffed offices.
7.5 SYSTEM-LEVEL PARAMETERS

7.5.1 Automatic Failover Success Rate
Failover success is estimated as the percentage of automatic switchover tests that succeed. Fortunately, the automatic failover success rate is fairly easy to estimate from the switchover testing that is routinely performed on highly available systems. Binomial statistics can be used to calculate the number of tests that must be made to establish an automatic failover success rate with reasonable (60%) or high (90%) statistical confidence. Assume that we need N tests to demonstrate a failover success probability of p. The risk of having no more than n failures (here n = 0, 1, 2, etc.) can be calculated from the binomial distribution:

Pr(n) = Σ_{i=0}^{n} C(N, i) (1 − p)^i p^(N−i)

where C(N, i) is the binomial coefficient.
N can be calculated by associating the risk with the confidence level. Table 7.3 and Table 7.4 summarize the number of test attempts N needed for different numbers of failures for 60% and 90% confidence levels. The left-most column shows the target success rate parameter; the remaining columns show how many tests must be completed to demonstrate that success rate with
Table 7.3. Test case iterations for 60% confidence

                                Number of test iterations to demonstrate 60% confidence, assuming
Failover success probability        0 failures      1 failure      2 failures
90%                                     11              20              31
91%                                     13              22              34
92%                                     14              25              38
93%                                     17              29              44
94%                                     19              34              51
95%                                     23              40              62
96%                                     29              51              78
97%                                     40              67             103
98%                                     60             101             155
99%                                    120             202             310
99.5%                                  240             404             621
99.9%                                1,203           2,025           3,107
Table 7.4. Test case iterations for 90% confidence

                                Number of test iterations to demonstrate 90% confidence, assuming
Failover success probability        0 failures      1 failure      2 failures
90%                                     22              38              52
91%                                     24              42              58
92%                                     28              48              65
93%                                     32              55              75
94%                                     37              64              88
95%                                     45              77             105
96%                                     56              96             132
97%                                     76             129             176
98%                                    114             194             265
99%                                    229             388             531
99.5%                                  459             777           1,065
99.9%                                2,301           3,890           5,325
zero, one, or two failures. Ideally, the automatic failover success rate is specified in a product requirements document, and this drives the system test team to set the number of test iterations needed to demonstrate that success rate at the appropriate confidence level. On the other hand, for a given test plan and execution, the failover success probability can be estimated from the test results. If N failover tests are attempted, of which n tests pass, then the failover success probability can be estimated as

p̂ = n/N        (7.4)
A 100(1 − α)% confidence interval for p is given by

( p̂ − zα/2 √(p̂(1 − p̂)/N),  p̂ + zα/2 √(p̂(1 − p̂)/N) )

where zα/2 is the upper α/2 percentage point of the standard normal distribution; it takes the values 0.84 and 1.64 for the 60% and 90% confidence levels, respectively. This method is based on the normal approximation to the binomial distribution; to be reasonably conservative, it requires that N(1 − p̂) be greater than 5. If N is large and (1 − p̂) is small (<0.05), the Poisson approximation can be used instead.
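The sketch below restates Equation 7.4 and the normal-approximation interval in code; the pass/fail counts are hypothetical.

```python
# Sketch: point estimate and normal-approximation confidence interval for the
# failover success probability (Equation 7.4). Test counts are hypothetical.
import math

def failover_ci(n_pass: int, n_tests: int, z: float = 1.64):
    """Return (p_hat, lower, upper); z = 0.84 for ~60% confidence, 1.64 for ~90%."""
    p_hat = n_pass / n_tests
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n_tests)
    return p_hat, p_hat - half_width, p_hat + half_width

# 192 successful failovers out of 200 attempts (so N*(1 - p_hat) = 8 > 5)
p_hat, lo, hi = failover_ci(n_pass=192, n_tests=200, z=1.64)
print(f"p_hat = {p_hat:.3f}, 90% CI = ({lo:.3f}, {hi:.3f})")
```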
Figure 7.8. Testing iterations versus failover success probabilities for 60% confidence.

Figure 7.9. Testing iterations versus failover success probabilities for 90% confidence.
Figures 7.8 and 7.9 depict the relationship between the failover success probability and the number of test attempts for 0 failures, 1 failure, and 2 failures; Figure 7.8 corresponds to the 60% confidence level and Figure 7.9 to the 90% confidence level. Testers can use these graphs to get an idea of the failover success probability demonstrated by any given set of test attempts and results. From the two charts, we can clearly see that it gets harder to demonstrate higher failover success probabilities. For example, no failures in the first 1000 test iterations corresponds to only a 99.77% failover success probability at 90% confidence. As it is generally inappropriate to assume a 100% automatic failover success rate, one should consider limiting this parameter to 99.5% or 99.9%, even if all automatic failover tests passed during system testing.
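The test-count tables above can be approximated with a short script: for a target success probability p and a confidence level, find the smallest N for which the chance of seeing no more than n failures (if the true success probability were only p) drops to the allowed risk. This sketch assumes SciPy is available; it reproduces the 90%-confidence values in Table 7.4, while the exact rounding conventions behind the published tables are assumed rather than known.

```python
# Sketch: smallest N such that observing <= n_failures in N trials demonstrates
# a success probability of at least p at the given confidence (binomial tail).
from scipy.stats import binom

def tests_needed(p: float, n_failures: int, confidence: float) -> int:
    """Smallest N with P(failures <= n_failures | success prob exactly p) <= 1 - confidence."""
    risk = 1.0 - confidence
    N = n_failures + 1
    while binom.cdf(n_failures, N, 1.0 - p) > risk:
        N += 1
    return N

# Matches Table 7.4: 22 tests for 90%/0 failures, 38 for 90%/1 failure, 45 for 95%/0 failures
for p, n in [(0.90, 0), (0.90, 1), (0.95, 0)]:
    print(f"p = {p}, failures = {n}: N = {tests_needed(p, n, confidence=0.90)}")
```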
7.5.2 Manual Failover Success Rate
If the manual failover success rate for a previous release or similar product is available, then use that value. That value could be refined based on the results of a serviceability assessment.
CHAPTER 8

ESTIMATING INPUT PARAMETERS IN THE ARCHITECTURE/DESIGN STAGE
For both new products and new releases of existing products, it is important to understand whether the product is capable of achieving the service availability specified in the product's requirements. If the proposed architecture and high-level design are not likely to reach the required service availability, then it is useful to understand what potential architectural and design changes will make it more likely that the service availability requirements will be met. Fortunately, architecture-based availability modeling can be completed at the architecture or high-level design phase by constructing a mathematical availability model (see Chapter 5) and using estimated availability parameters to predict system availability. Although the quantitative availability prediction itself is likely to be soft, this early analysis will often reveal where system downtime is likely to come from, indicate the most sensitive/influential aspects of the design, and help set targets and budgets for failure rates, recovery and switchover times, and coverage factors. This modeling also offers quantitative answers to architectural questions regarding system downtime, such as:

• What if we make (or do not make) a particular hardware unit redundant?
• What if we only support board-level software restart instead of processor-level or process-level software restart?
• What if we shorten system (or board) restart?
• What if we change some redundancy scheme to have faster recovery (e.g., move from active/cold-standby to active/warm-standby or active/active)?
• What if we made a bad estimate for one of the input parameters for the availability model?

Intelligent use of availability models in the architecture stage can enable estimation of downtime improvement (or decay) for major features, thus enabling one to identify the most cost-effective availability-improving features.
8.1 HARDWARE PARAMETERS

8.1.1 Hardware Failure Rate
Each hardware unit generally follows one of three scenarios:

1. Commercial off-the-shelf (COTS) hardware element. Hardware failure rate/mean time between failures data should be available from the OEM supplier. This failure rate estimate may come from their field data or it may be predicted according to an industry standard, such as SR-332 Issue 2 or MIL-HDBK-217F. Actual field data is best, but predicted failure data is acceptable.

2. Reuse of an equipment supplier hardware element. Hardware failure rate data for existing hardware elements should be available from the equipment supplier's data systems. In this case, field return rates [of the entire platform or of the individual field replaceable units (FRUs), when available] are used for hardware failure rates, with an appropriate adjustment to account for NTFs. Otherwise, use lab-observed failure rates or the failure rate predicted using the industry standard methods.

3. New or modified design. If the hardware design is sufficiently complete, then failure rates can be predicted by industry standards. If the design is not sufficiently complete (which is typical for the architecture stage), then the hardware failure rate can be estimated by similarity, as described below.

8.1.1.1 By-Similarity Hardware Failure Rate Estimations
In predicting hardware failure rates using similarity, the following aspects should be considered:
• The size and function of each FRU
• Optional components, such as mezzanines (e.g., PMC cards), disks, or network ports
• The thermal/cooling mechanisms of the system
• Power supplies

By comparing these attributes of the new design with the attributes of existing, well-understood designs, one can make a rough estimate of hardware failure rates. For example, a new controller board that is the same physical size as an existing controller board is likely to have a failure rate that is within ±20% of the failure rate of the existing board, assuming that the other factors (optional components, cooling, power supplies, etc.) are equivalent. In the extreme case, perhaps very early in the architecture phase when details are still sketchy, the failure rate of another FRU of the same physical board area may be used until a better estimate is available. This works because hardware designers rarely leave empty space on a board; any extra space is used to add features that differentiate their product. Thus, a pair of identically sized boards (with the same caveats about optional mezzanines, etc.) will typically be filled with roughly the same number of components and will have failure rates that are similar enough to be used in the very early modeling stages.
8.1.2 Hardware Coverage Factor
Hardware coverage factors typically range from 80% to 95%, except for DC power supplies, where they are typically 99% (i.e., it is fairly easy to detect when power fails). Some very critical network elements may have hardware coverage from 95% to 99%. Hardware coverage factors grow from both rigorous hardware fault-insertion testing campaigns and field exposure that eventually exposes virtually all hardware failure modes. Depending on the similarity of the hardware and platform software to previous or other product releases, the hardware coverage factor may be estimated to be as high as for previous or similar releases. Newness of hardware or platform software would suggest lower coverage factors; plans for diligent hardware fault-insertion testing could suggest a higher coverage factor.
8.1.3 Covered Hardware Failure Detection and Recovery Time
Switchover times for successfully detected hardware failures may be estimated from previous or similar products and/or from system requirements.
8.1.4 Hardware MTTR
Hardware MTTR can be estimated from historic data for previous or similar products that are deployed with similar staffing and sparing arrangements. For staffed sites with on-site spares, 2 hours is typically used (Telcordia recommends 3 hours for these sites: 2 hours for dispatch and 1 hour to do the repair). Four hours is typically used for unstaffed sites and remote terminals.
8.1.5 Software Failure Rate
In the architecture/design stage, rough estimates such as "1 to 2 software failures per blade (or processor) per month," "1 to 2 software failures per blade per year," or "1 to 2 software failures per system per year" are often sufficient to support architecture and design decisions. These rough estimates in the architecture/design stage will be refined with test results as the product completes system testing. Two approaches, similarity and software metrics, are used to estimate the software failure rate in the architecture phase, before most software has been written; best practice is to use both techniques and triangulate an estimated failure rate based on both results.

8.1.5.1 Similarity Approach
This approach uses the actual software failure rate of a similar or previous system as a baseline and scales the failure rates up or down, as appropriate. Factors to consider when scaling software failure rates include:

• Differences in functionality. How similar is the functionality offered by the new product/release? Deleting functionality implies a lower failure rate; adding functionality implies a higher failure rate.

• Differences in configuration. How similar are the hardware and software configurations to the baseline system? New configurations imply higher failure rates.

• Differences in development processes and team. Is the same team developing the new product/release? Are they following the same processes and practices? New development teams and processes imply higher failure rates. Quality-improving process changes and activities can, of course, reduce software failure rates.
• Differences in architecture. Are reliability-improving features included in the software architecture? Are significant reliability-impacting deficiencies of the previous product/release addressed in this release? Architectural and design changes imply higher failure rates; "reliability-improving" features generally improve coverage factors, recovery times, or success probabilities rather than software failure rates.

8.1.5.2 Software Metrics Approach
Software failure rates can be roughly estimated on a module (or processor or blade) basis by considering size, complexity, maturity, and reuse. Most importantly, these module-level estimates give insight into which modules are likely to have the highest software failure rates and, hence, where to focus availability-improving investments such as efficient failure-detection, isolation, and recovery mechanisms. Likewise, higher failure rate modules may be architecturally shifted in the design to be out of the critical service delivery path; for example, complex OAM software might be run on a different processor so that recovering from failures in OAM modules will not impact delivery of primary user services. The four per-module attributes to consider are:

1. Size. How big is the module likely to be (e.g., how many lines of noncommented source code)? Note that all code (new, reused, and third party) is considered here. Often, relatively small portions of application/product-specific code are combined with huge (and very mature) operating systems, databases, protocol stacks, middleware, and other enabling software. One can often use simple categories like small, medium, large, and huge (e.g., for operating systems).

2. Reuse. How much of the code in a module is likely to be reused from previous releases or other projects? We note that "modified" code should not be considered reused code. Reused code has been tested and presumably used in the field and, therefore, should have fewer residual defects. For this method, the extent of reuse is high, medium, or low.

3. Maturity. Reused code, by definition, has some maturity to it. This maturity can range from simply having completed system testing in another context with little or no operational exposure, all the way to very, very mature software that is broadly deployed with millions of hours of operational exposure (e.g.,
Linux and commercial operating systems). Software "maturity" characterizes how much operational execution time the software has experienced in field service. Over time, more and more software defects are activated and, hopefully, debugged and corrected in patches or updates. For instance, "maturity" is the fundamental difference between the initial release of a software product like Microsoft Windows and its Service Pack 1 or 2, and so on. Field exposure time can be a good metric for a subjective assessment. For example, an operating system such as Linux may be very large and complex, but its maturity is very high and, thus, it should be associated with a low software failure rate. The Linux drivers for a new hardware design (sometimes called a Linux Support Package) are small, of low maturity, and possibly of high complexity; hence, they might have higher software failure rates.

4. Complexity. How complicated, delicate, or tricky is the new code going to be to write and debug? Generally, highly complex software includes many operators and operands, complicated logic structures, intensive memory management, and a large number of classes and subroutines. Highly complex software is more fault-prone. "High" or "low" are often sufficient for estimates, but finer granularity such as McCabe's measure [McCabe76] and Halstead's measure [Halstead77] can be used when available. Some of the measurement tools require the projects to hook the external tool into the code under study.

It is widely accepted that software size and complexity are highly correlated with the total number of defects in a given software module. So large software modules with high complexity are likely to have more inherent defects, and if these modules are not tested thoroughly they have the potential to cause high failure rates. On the other hand, if the software modules are well tested, then the number of residual defects can be significantly reduced, and so will the failure rate. Software reuse and maturity are related in that "new" code is inherently immature. Most products and systems reuse huge amounts of software, including operating systems, drivers, databases, middleware, protocols, and application/business logic. For each of these major modules, one can consider how mature the original piece was and how much of the module in the target
system is actually reused (versus changed). For instance, although many commercial and open-source operating systems are very large, they are often very widely deployed and very mature and, thus, often represent a very small portion of likely software failures. During the design phase, the overall software failure rate for a product might be obtained by comparing historical data of similar products and triangulating reused software from other products. However, the overall software failure rate might not be directly actionable in identifying where the failures come from and how recovery mechanisms work when these failures occur. Software metrics can be helpful in providing actionable suggestions on which modules the development/testing efforts should concentrate on. The software metrics provide a mapping approach that correlates the settings of the software metrics to software failure rates. Using this approach for a few releases establishes a framework for refining software failure rate estimation during early development phases. An example for the Widget System described in Chapter 5, Section 5.4 is given in Table 8.1. Let us describe briefly how we arrived at the failure rate ranges shown in Table 8.1.
Table 8.1. Widget software failure rate estimate by metrics

Process on control board                                                              Maturity      Failure rate      Percentage of
Name               Description        Size    Reuse rate   Complexity   (high/low)    estimate (per year)   failure rates
OS                 Operating system   Huge    High         High         Very mature        0.03               0.4%
MonSW              HA monitoring SW   Large   High         Low          High               0.18               2.7%
Other platform SW  Platform SW        Large   High         High         High               0.42               6.3%
OAM SW             OAM SW             Large   Medium       High         High               0.91              13.6%
Task1              Application SW     Large   Low          High         Low                1.00              14.9%
Task2              Application SW     Large   Medium       High         Low                0.91              13.6%
Task3              Application SW     Large   Low          Low          Low                0.83              12.4%
Task4              Application SW     Small   Low          High         High               0.75              11.2%
Task5              Application SW     Large   Medium       High         Low                0.91              13.6%
Task6              Application SW     Small   Low          High         High               0.75              11.2%
Total                                                                                       6.69               100%
The goal here is not to come up with a perfect mapping table, since different applications have different failure rates, but rather to demonstrate some basic ideas of how to use this kind of mapping approach. In our example, as in most software-based products, the operating system is vastly larger and more mature than the application-specific software. Best practice is to estimate the software failure rate for commercial operating systems and other mature software based on field data. For the application-specific software, we used two levels for absolute size, three levels for reuse, and two levels for complexity; hence, there are a total of 12 combinations of the first three metrics. For the four low-reuse-rate combinations, the maturity is also low. For the other eight combinations, for which the reuse rate is high or medium, the maturity can be high or low, giving another 16 combinations. In total, there are 20 combinations of the software metrics. Below are the assumptions used to come up with the failure rates shown in the Widget model:

1. Assume that the initial failure rates for Widget software modules might range from 0.1 failures/year to 1.0 failures/year. Thus, we use 0.1 and 1.0 as the respective best-case and worst-case midpoint failure rate values. The best-case value is assigned to modules with small absolute size, high reuse, low complexity, and high maturity, and the worst-case value is assigned to modules with large absolute size, low reuse, high complexity, and low maturity.

2. The expected failure rate is influenced by the amount of new code, which is determined by size and reuse rate. The new code size ranks, in descending order of the combination of original size and reuse rate, are: large, low; large, medium; small, low; small, medium; large, high; and small, high.

3. When the new code size and complexity levels for two different software modules are the same or only one level apart, the module with the lower complexity has the lower failure rate. When the new code and complexity levels are two or more levels apart, then, regardless of the complexity level, the module with the most new code has the higher failure rate.

4. When the other metrics for two different software modules are at the same level or only one level apart, the module with the lower maturity has the higher failure rate. When the new code levels are two or more levels apart, then, regardless of the maturity level, the module with the most new code has the higher failure rate.
5. To account for variation, prediction intervals can be used. The failure rate intervals can be obtained by going up and down 20% from the assigned midpoint value.

Richer historical data should enable better calibration of the model, and more mature development processes should yield more consistent results, thus enabling more accurate software-metrics predictions. Software development teams with more mature processes are likely to have more consistent software quality and reliability, making initial predictions from this method better. Software development teams with less mature processes are inherently likely to have less consistent quality and reliability, making similarity estimates weaker.
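One possible (and deliberately simplistic) encoding of this mapping is sketched below: it interpolates between the 0.1 and 1.0 failures/year endpoints of assumption 1 using a weighted score in which the amount of new code dominates. The scoring weights are invented for illustration and are not the calibration behind Table 8.1.

```python
# Sketch: a simple metric-to-failure-rate mapping. Only the 0.1 and 1.0
# failures/year endpoints come from assumption 1; the weights are illustrative.
BEST, WORST = 0.1, 1.0   # failures/year midpoints (assumption 1)

SIZE = {"Small": 0, "Large": 1}
REUSE = {"High": 0, "Medium": 1, "Low": 2}
COMPLEXITY = {"Low": 0, "High": 1}
MATURITY = {"High": 0, "Low": 1}

def estimated_failure_rate(size, reuse, complexity, maturity):
    """Interpolate between BEST and WORST using a weighted score in which the
    new-code amount (size + reuse) dominates, loosely following assumptions 2-4."""
    score = (2.0 * (SIZE[size] + REUSE[reuse]) / 3.0    # new-code contribution, 0..2
             + COMPLEXITY[complexity]                    # 0..1
             + MATURITY[maturity])                       # 0..1
    return BEST + (WORST - BEST) * score / 4.0

worst = estimated_failure_rate("Large", "Low", "High", "Low")   # 1.00, like Task1
best = estimated_failure_rate("Small", "High", "Low", "High")   # 0.10, best case
interval = (0.8 * worst, 1.2 * worst)                           # +/-20% (assumption 5)
print(f"worst-case module: {worst:.2f} failures/year, interval {interval}")
print(f"best-case module:  {best:.2f} failures/year")
```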
8.1.6 Software Coverage Factor
Typical software coverage values used for initial availability estimation in the early design phases range from 80% to 95%. Products with rigorous software fault-insertion test programs or well-designed fault detection mechanisms can have software coverage factors in the 95% to 99% range. For new products with basic/simple fault monitoring capabilities or minimal/limited software fault-insertion testing, values between 80% and 90% should be used.
8.1.7 Covered Software Failure Detection and Recovery Time
Latency for automatically detecting, isolating, and recovering software failures can be estimated from similar products or from product requirements. It is often useful to construct a simple, two-part budget and consider the failure detection and isolation time separately from the recovery time. TL 9000's 15-second exclusion rule generally represents a reasonable upper bound on target values for automatic software failure detection, isolation, and recovery latency. For example, budgeting 5 seconds to detect a failure, 2 seconds to clean up, and 8 seconds to restart and resynchronize a software process/task is a reasonable budget. Note that different mechanisms may have different detection and recovery latencies. The primary detection mechanism may work in seconds, but secondary or tertiary mechanisms (e.g., watchdog/sanity timers) may take minutes to activate.
In a fine-grained system model in which these mechanisms are modeled separately, the discrete values may be used. In a coarser-grained model, an aggregated or lumped value may be used. One way to estimate a lumped value is to probabilistically combine the individual values. For example, consider a system that employs a three-level automatic escalation hierarchy and has 95% fault coverage:

1. A single process restart that takes 2 minutes and succeeds 95% of the time
2. A full application restart that takes 10 minutes and succeeds 95% of the time
3. A complete reboot that takes 20 minutes and succeeds 95% of the time
4. A manual recovery is required when the preceding levels fail, and manual recovery takes 30 minutes
5. Uncovered faults take an hour for manual detection and recovery

Here "success" is defined as restoring the ability to provide service. (If a failover to a redundant unit has occurred, then the unit going through escalation might not be required to provide service once it is capable; it could become the standby unit.) In the above scenario, the weighted average of the success probabilities and the recovery times is approximately 5.3 minutes. This software recovery hierarchy is shown pictorially in Figure 8.1. Table 8.2 shows how the mean software recovery time is calculated for the above example. The "%" column shows the percentage of faults covered by each level of recovery, the "Time" column shows how long each recovery level takes, and the "Weighted" column weights the time by the percentage. The percentages are summed to make sure no errors were made (they should add up to 100%!), and the weighted minutes are summed to get the typical software recovery time.
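The weighted-average arithmetic behind Table 8.2 can be reproduced with a few lines of code; the sketch below simply restates the escalation example above (95% coverage, 95% success per level) and is not a general model.

```python
# Sketch: lumped software recovery time for a three-level escalation hierarchy
# with 95% fault coverage and 95% success at each level, as in the example above.
coverage = 0.95
success = 0.95
durations = [2.0, 10.0, 20.0]   # minutes: process restart, application restart, reboot
manual_after_escalation = 30.0  # minutes, if all three automatic levels fail
uncovered_recovery = 60.0       # minutes for uncovered faults

weights_times = [(1.0 - coverage, uncovered_recovery)]
remaining = coverage
for duration in durations:
    weights_times.append((remaining * success, duration))  # recovered at this level
    remaining *= (1.0 - success)                            # escalate the rest
weights_times.append((remaining, manual_after_escalation))

total_weight = sum(w for w, _ in weights_times)       # sanity check: should be 100%
mean_recovery = sum(w * t for w, t in weights_times)
print(f"check: {total_weight:.4%}, mean recovery = {mean_recovery:.2f} minutes")
```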
8.2 SYSTEM-LEVEL PARAMETERS

8.2.1 Uncovered Failure Detection and Recovery Times
This includes the time required to manually detect the uncovered failure and isolate the failed FRU or software module that is
Figure 8.1. Software recovery hierarchy.
faulty. Historical data from similar products or earlier releases of the same product provide the best estimate. System requirements should be used if no historical data is directly available. Typically, the median outage duration time for manual, emergency-recovered failures for previous releases or similar products should provide a reasonable estimate for this parameter.
Table 8.2. Calculating typical software recovery time

Parameter                                           %            Time (minutes)    Weighted (minutes)
Uncovered faults                                    5.0000%          60.00               3.00
Covered and recovered by process initialization     90.2500%          2.00               1.81
Covered, recovered by application initialization     4.5125%         10.00               0.45
Covered, recovered by reboot                          0.2256%         20.00               0.05
Covered, but required manual recovery anyway          0.0119%         30.00               0.00
Total                                               100.0000%                             5.30
The following list enumerates the typical values used for specific uncovered failure detection times:

• Uncovered failure detection time on an active hardware unit. It typically takes 30 minutes to an hour to detect an uncovered fault on an active unit. This is derived from field data from numerous equipment suppliers' products.

• Uncovered failure detection time on a standby hardware unit. Twenty-four hours is typically assumed to be needed to detect an uncovered fault on a standby unit.

• Uncovered failure detection time on an active software instance or a hot/warm standby software instance. It typically takes 30 minutes to an hour to detect an uncovered software fault on an active unit. This is derived from field data from numerous equipment suppliers' products.

• Uncovered failure detection time on a cold standby software unit. Twenty-four hours is typically assumed to be needed to detect an uncovered fault on a cold standby software unit.

It should be noted that true standby units are becoming less common. This is because many systems are being designed with standby units that run their operating system and fault detection software constantly and only turn the application on and off. Other newer systems run in a load-shared manner in which all redundant units share part of the load. In systems of this type, the value for active units should be used on all units, as all units are essentially active.
8.2.2 Automatic Failover Time
Automatic failover time is generally specified in system requirements. The specified value is an appropriate initial estimate. In cases in which the automatic failover time is unspecified, the values from previous releases or similar products should be used. If there are counting rule exclusions (such as an exclusion that says outages of less than 15 seconds do not need to be counted), then the target for detection and failover should be less than the exclusion value. This allows most covered faults to recover within the exclusion time and, thus, not count against downtime.
8.2.3 Automatic Failover Success Rate
Historical data measured from the laboratory provides a good estimate of the automatic failover success rate. Typically, this value will be 99% or greater, but it is unrealistic to use a 100% success rate.
8.2.4 Manual Failover Time
Manual failover times for previous releases or similar products can provide a good initial estimate for this value. Often, a value of 30 minutes is used.
8.2.5 Manual Failover Success Rate
Historical data measured from the lab provides a good estimate of the manual failover success rate. Like the automatic failover success rate, this value will typically be greater than 99%, but not 100%. In most well-designed systems, the automatic and manual failovers will result in the same set of events occurring because, although they have different triggers, once triggered both failovers will execute the same software to perform the failover.
8.3 SENSITIVITY ANALYSIS
It is common for the reliability models of real systems to have dozens of different input parameters, from the failure rates of the individual hardware components to failover times for each component to the different software failure rates and recovery values for each software component. With so many different parameters, how do you figure out which ones are the most important or the most influential? And, once you know which ones are the most influential, how do you use that information to make the system better? The answer to the first question is sensitivity analysis. This section will demonstrate how to perform a sensitivity analysis using the previously discussed Widget System as an example. Once the analysis is complete, we will show how the results can be used to guide improvements to the system.
The steps in performing a sensitivity analysis are:

1. Identify the parameters of interest
2. Determine the default or nominal value for each parameter
3. Rank the parameters in order of influence
4. Examine the most influential parameters individually in greater detail
Step 1 (identify the parameters of interest) is straightforward. Initially, it is recommended that all the parameters in the model be considered. A more experienced modeler may choose to eliminate some parameters that previous experience has shown are not likely to be influential, or that they know the system developers will not be able to change. Eliminating some parameters may save a little work for a modeler who is performing the sensitivity analysis by hand.

Step 2 (determine the default or nominal value for each parameter) has typically been done as part of the initial availability model. To perform step 3, it is important that each parameter be set back to its default value to provide the baseline downtime against which parameter changes are compared.

Step 3 (rank the parameters in order of influence) is where we start to learn about our system. It is also a step that can take a significant amount of work to perform. In this step, we modify each parameter individually until the baseline downtime has changed by a fixed percentage. We typically adjust the input parameter until we see a 5% change in system downtime, and then determine the amount (percentage) by which we had to change the parameter to achieve the 5% downtime change. As a practical matter, we usually look for an increase in system downtime because some parameters may be too close to their limits (especially in well-designed systems) to decrease downtime by an additional 5%. Typically, there are some parameters that cannot change enough to make downtime increase by 5%. We typically limit the range to something like 500% for parameters that increase in value to increase downtime, and 1% for those that decrease in value to increase downtime. In other words, if we have increased a parameter by a factor of 5 and the downtime has increased, but not yet by 5%, we quit looking at this parameter and mark it as noninfluential. Likewise, if a parameter has been decreased to 1% of its default value and downtime has increased, but not yet by 5%, we quit and mark the
parameter as noninfluential. These limits are arbitrary; other limits could reasonably be used, but we have found these to work well. Once we have determined the amount each parameter has to change to increase system downtime by 5%, we rank the parameters. The one that changed by the smallest percentage is first, followed by the second smallest change, and so on. The noninfluential parameters, as determined above, come last.

Step 4 (examine the most influential parameters individually in greater detail) consists of detailed analysis of the more influential parameters, those at the top of the list we created in Step 3. For this, we usually create charts showing the range of downtime we can expect over what is considered the likely range of the input parameter. This helps determine the potential downtime gain from a change in the particular parameter. This is helpful because the system developers may be able to change the value of one parameter significantly more than they can a different, potentially more influential, parameter.

To see how this all works in a practical system, the following paragraphs demonstrate steps 1 through 4 on the Widget System that was described in Chapter 5, Section 5.4.

Step 1: Identify the Parameters of Interest
For this example, we chose to include all the input parameters in the model. The hardware-related parameters are listed in Table 8.3, whereas the software-related parameters are listed in Table 8.4. These tables also identify the default value for each parameter, as required for Step 2.

Step 2: Determine the Default Value for Each Parameter
In this step, we use the default values that we used for the original Widget System model presented earlier. These values are listed in Table 8.3 and Table 8.4 along with the parameter descriptions.

Step 3: Rank the Parameters in Order of Influence
Steps 1 and 2 identified the parameters and set the default values for each of them. In this step we rank the parameters in order of influence by modifying each until the downtime increases by 5%, and then rank them based on how much (percentage-wise) each parameter had to change to obtain the 5% downtime change.
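A sketch of the Step 3 ranking loop is shown below. The downtime() function is a toy stand-in, not the Widget System model, and the parameter names and search step sizes are arbitrary; the point is only to illustrate probing each parameter up to 500% and down to 1% of its default value until downtime rises by 5%.

```python
# Sketch: ranking availability-model parameters by influence (Step 3).
# downtime() is a toy stand-in for an architecture-based availability model.
def downtime(params):
    """Toy model: annual downtime in minutes as a function of three parameters."""
    return (50.0 * (1.0 - params["sw_coverage"])
            + 200.0 * (1.0 - params["failover_success"])
            + 0.5 * params["uncovered_detect_hours"])

def pct_change_for(name, params, baseline, target=0.05):
    """Percentage change in one parameter needed to raise downtime by `target`,
    searching upward (to 500%) and downward (to 1%); None means noninfluential."""
    for step in (1.001, 0.999):
        factor = 1.0
        while 0.01 <= factor <= 5.0:
            factor *= step
            trial = dict(params, **{name: params[name] * factor})
            if downtime(trial) >= baseline * (1.0 + target):
                return abs(factor - 1.0) * 100.0
    return None

defaults = {"sw_coverage": 0.90, "failover_success": 0.99, "uncovered_detect_hours": 1.0}
baseline = downtime(defaults)
ranking = {name: pct_change_for(name, defaults, baseline) for name in defaults}
for name, pct in sorted(ranking.items(),
                        key=lambda kv: float("inf") if kv[1] is None else kv[1]):
    label = "noninfluential" if pct is None else f"{pct:.1f}% change for +5% downtime"
    print(f"{name:24s} {label}")
```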
Table 8.3. Widget System hardware-modeling parameters

Description                                      Default
Failure rates
  Backplane noncatastrophic FIT rate             400 FITs
  Backplane catastrophic FIT rate                10 FITs
  Power entry module FIT rate                    64 FITs
  Fan tray FIT rate                              3400 FITs
  Power converter FIT rate                       2800 FITs
  Control board FIT rate                         4200 FITs
  Interface card FIT rate                        500 FITs
Recovery parameters
  HW FRU repair time                             4 hours
  Backplane noncatastrophic repair time          2 hours
  Backplane catastrophic repair time             24 hours
  Hardware coverage                              90%
  Power converter coverage                       99%
  Uncovered HW fault detection time—active       1 hour
  Uncovered HW fault detection time—standby      24 hours
  HW failover success percentage                 99%
  HW failover time—default                       10 seconds
  Manual failover success percentage—HW          99%
  Manual failover time                           1/2 hour
Table 8.4. Widget System software-modeling parameters

Description                                      Default
Failure rates
  Control board software failure rate            6.69 failures/year
  Interface card software failure rate           0.25 failures/year
Recovery parameters
  SW process failure detection time              10 seconds
  SW process restart time                        5 seconds
  Full application initialization time           2 minutes
  Reboot time                                    5 minutes
  Software coverage                              90%
  Uncovered SW fault detection time—active       1 hour
  Uncovered SW fault detection time—standby      24 hours
  Single process recovery success percentage     95%
  Full application init success percentage       95%
  Reboot success percentage                      95%
  SW failover success percentage                 99%
  SW failover time                               10 seconds
  Manual failover success percentage—SW          99%
  Manual failover time—SW                        1/2 hour
Table 8.5. Widget System influential parameters

Sensitivity analysis matrix
Inputs                                                                    Outputs
Description                                    Parameter              To Δ 5%    % Change
Software coverage                              C_sw                   0.89           1
SW failover success percentage                 F_sw                   0.97           2
Uncovered SW fault detection time—active       MTTR_sfdta_sw          1.06           6
Control board software failure rate            FR_cb_sw               7.09           6
Hardware coverage                              C_hw                   0.04          95
Manual failover time—SW                        MTTR_fom_sw            1.06         111
Interface card software failure rate           FR_lc_sw               0.61         142
SW failover time                               MTTR_fo_sw             0.50         199
Table 8.6 shows the parameters that were determined to be noninfluential. The dividing line between influential and noninfluential was 500% for increasing parameters and 1% for decreasing parameters, as discussed previously. A careful examination of Table 8.5 and Table 8.6 shows that there are eight influential parameters and 26 noninfluential parameters. What we want to do now is focus on the influential parameters. By doing the sensitivity ranking, we have already been able to eliminate three-quarters of the parameters from further consideration. This saves time and energy, since we do not need to perform in-depth analysis of the 26 noninfluential parameters.

The reader may wonder why Table 8.6, which lists the noninfluential parameters, has "NA" in the "To Δ 5%" and "% Change" columns. This is because we stopped searching once a parameter crossed the threshold we set for claiming that it was influential. In reality, the two tables are a single table within a spreadsheet: if we hit a threshold when calculating the sensitivity, we quit trying and filled in "NA." We then sorted the table according to the "% Change" column and broke it into two separate tables so we could concentrate on the influential one.

Step 4—Examine the Influential Parameters in Detail
The first thing we do in this step is plot the downtime against the influential parameter values. This produces a series of charts we can then examine to get a feel for how much downtime (minutes/year) we might be able to save by changing the individual parameters.
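The Step 4 charts can be produced the same way: sweep each influential parameter across its practically achievable range and record the modeled downtime at each point. A minimal sketch, again assuming the hypothetical downtime(params) model function used above:

```python
# Step 4 sketch: sweep one influential parameter over its practical range and
# record the modeled downtime, producing the points behind charts like Figure 8.2.
# `downtime(params)` is the same hypothetical model function assumed above.

def sweep_parameter(params, name, values, downtime):
    """Return (value, annual downtime in minutes) pairs for a sensitivity chart."""
    points = []
    for v in values:
        trial = dict(params)
        trial[name] = v
        points.append((v, downtime(trial)))
    return points

# For example, software coverage over its practical range of roughly 85% to 98%:
# curve = sweep_parameter(defaults, "C_sw", [0.85 + 0.01 * i for i in range(14)], downtime)
```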
Table 8.6. Widget System noninfluential parameters (sensitivity analysis matrix)

Description                                   Parameter               To Δ 5%   % Change
HW failover success percentage                F_hw                    NA        NA
Power converter coverage                      C_pwr_hw                NA        NA
Backplane noncatastrophic FIT rate            FIT_bkpl_noncat_hw      NA        NA
Backplane noncatastrophic repair time         MTTR_bkpl_noncat_hw     NA        NA
Manual failover success percentage—HW         Fm_hw                   NA        NA
Uncovered HW fault detection time—active      MTTR_sfdta_hw           NA        NA
Control board FIT rate                        FIT_cb_hw               NA        NA
Backplane catastrophic FIT rate               FIT_bkpl_cat_hw         NA        NA
Backplane catastrophic repair time            MTTR_bkpl_cat_hw        NA        NA
Manual failover success percentage—SW         Fm_sw                   NA        NA
HW failover time—default                      MTTR_fo_hw              NA        NA
Single process recovery success percentage    R_proc                  NA        NA
Full application initialization success       R_app_init              NA        NA
  percentage
Reboot success percentage                     R_reboot                NA        NA
Manual failover time                          MTTR_fom_hw             NA        NA
SW process failure detection time             MTTR_detect_sw          NA        NA
SW process restart time                       MTTR_proc_sw            NA        NA
HW FRU repair time                            MTTR_hw                 NA        NA
Full application initialization time          MTTR_app_init_sw        NA        NA
Power converter FIT rate                      FIT_pwr_hw              NA        NA
Reboot time                                   MTTR_reboot_sw          NA        NA
Uncovered HW fault detection time—standby     MTTR_sfdts_hw           NA        NA
Uncovered SW fault detection time—standby     MTTR_sfdts_sw           NA        NA
Interface card FIT rate                       FIT_lc_hw               NA        NA
Power entry module FIT rate                   FIT_pem_hw              NA        NA
Fan tray FIT rate                             FIT_fan_hw              NA        NA
In Step 3 we identified eight influential parameters we wanted to look at more closely. The graphs of Figure 8.2 through Figure 8.9 show how the total system downtime (as opposed to just downtime due to software causes, for example) varies with changes in each of those parameters. As we examine these charts we should keep in mind the nominal value of the system downtime, which for this example is 45.74 minutes per year. Knowing this helps pinpoint our starting point on each graph.

Now that we have all these detailed graphs, what do we do with them? First we take a high-level look and see what patterns or interesting observations emerge. Naturally, from the first to the last chart, we see a decreasing change in the range of system downtime. This is to be expected, since the charts are in order of parameter influence. Some of the key observations are made near the graphs so the reader can read the text while looking at the sensitivity graph.

In Figure 8.2, we see that the downtime ranges from over 108 minutes/year down to just over 4 minutes per year as the software coverage goes from 75% to 100%. Our initial estimate of software coverage was 90%, so a value of 75% is pretty far off. Practically speaking, we would expect the coverage to range from around 85% for a new, relatively untested system up to 97 or 98% for a mature system with a significant investment in high-availability hardware, software, and robustness testing.
Figure 8.2. Widget System software coverage sensitivity.
This means that changes in the software coverage alone might be able to alter the system downtime across a range of about 55 minutes per year (from 66.56 down to something just above 12 minutes per year), or improve it by 33.3 minutes if we assume our initial estimate of 90% coverage is correct. If we are designing a five-9's system, which must have a total downtime of 5.26 minutes per year or less, this is a significant range. From this, we can conclude that anything we can do to improve the software coverage is likely to improve the system availability and is probably worthwhile to pursue. We can also conclude that improving coverage alone is not sufficient to bring this system above 99.999% availability, since achieving 100% coverage is practically impossible.

In Figure 8.3, we see that downtime varies by about 10 minutes as the software failover success percentage goes from 95% to 100%. Originally, we estimated the software failover success rate to be 99%. A system being designed for high availability should have a software failover success rate of at least 99% for several reasons. First, the software failover success rate is an influential parameter and should be as close to perfect (100%) as possible. Second, unlike coverage, it is relatively straightforward to test and measure: software failover success percentage may be measured by performing many failovers and counting the percentage that work correctly.
Figure 8.3. Widget System SW failover success sensitivity.
A large number of failovers will need to be performed to get a statistically accurate measure of the success rate, but performing these tests is usually fairly simple, although it may be time-consuming (reducing failover time also reduces test time as well as system downtime, which is something to keep in mind when considering system improvement options). Based on this discussion, we could expect improvements in software failover success to give us at most about 2.08 minutes of downtime improvement per year (by moving from 99% to 100% failover success) but, in reality, we will probably never get to 100% failover success, so the gain would be something less than 2.08 minutes.

In Figure 8.4, we see the sensitivity to the uncovered software fault detection time. This is the amount of time it takes a human to discover that the system is not providing the service it is supposed to. The default value used was 1 hour. It may be difficult for the system designer to address this parameter, primarily because it is driven by the owner of the system, not the builder. If, once the system is installed, the owner continually monitors it, then the uncovered fault detection time is likely to be low. If, on the other hand, the owner ignores the system after it is installed, and only attends to it when one of the clients complains about the service (or lack of it), then the uncovered fault detection time is likely to be longer. Despite the difficulty, there are some things the system designer/developer can do. For example, adding a feature that enables the system to be monitored remotely may encourage the system owner to watch the system more closely.
Figure 8.4. Widget System uncovered SW fault detection time sensitivity (x-axis: uncovered SW fault detection time, in hours).
If we can decrease the uncovered software fault detection time from 1 hour to 45 minutes, we can save 10.4 minutes per year of downtime.

The software failure rate sensitivity for the control board software is shown in Figure 8.5. Our initial estimate of the control board software failure rate was 6.69 failures per year. From the graph, we see that the downtime changes about 6.5 minutes per year for each failure per year in the control board software. We also see that if we can get the control board software failure rate down below one failure per year, we can save around 37 minutes per year of downtime (by going from 45.74 down to below 9 minutes per year). This is a pretty wide range, and because it is difficult to predict the software failure rate with a high degree of accuracy, especially for a new system, further investigation of the control board software failure rate is warranted. Typically, high-availability systems that achieve 99.999% availability will have software failure rates in the area of one failure per year and below. Since the control board software failure rate is currently predicted to be well above that, and a considerable amount of downtime improvement will be seen if we can improve it, this parameter deserves further attention.

Figure 8.6 shows the sensitivity chart for the hardware coverage. We initially assumed that the hardware coverage was 90%. If we had "perfect" coverage, we could save 0.27 minutes of downtime as compared to our initial estimate of 90%.
Figure 8.5. Widget System control board software failure rate sensitivity (x-axis: control board software failure rate, in failures per year).
Figure 8.6. Widget System hardware coverage sensitivity.
Likewise, if the real hardware coverage turned out to be 85%, the annual downtime would increase by 0.14 minutes per year. For very high availability systems this amount of downtime will start to become important, but when we compare it with some of the other influential parameters, the amount of downtime we can gain by changing the hardware coverage within a practical range is small. Therefore, we can conclude that improving hardware coverage is of lower priority than improving some of the other parameters.

Figure 8.7 shows the sensitivity to the manual failover time for software failures. Manual failover time is the amount of time it takes a person to isolate a problem to a specific unit (for example, a control board) and then initiate a failover manually. We see from the chart that each additional 15 minutes of manual failover time adds around one minute of annual downtime to the system. This is a moderate amount of downtime, although there is a limit to how much quicker manual failovers could be—humans can only move so fast! One possibility for improving the manual failover time is improved diagnostics to help isolate the problem to a specific unit more quickly.

The interface card software failure rate sensitivity is shown in Figure 8.8. Our initial estimate of the interface card software failure rate was 0.25 failures per year, which corresponded to the nominal value of 45.74 minutes per year of system downtime.
Figure 8.7. Widget System manual SW failover time sensitivity (x-axis: manual failover time—SW, in hours).
If it turned out that the software were twice as buggy as we predicted, then the downtime would be 47.36 minutes per year, an increase of 1.62 minutes per year over our initial estimate. Likewise, if the software were twice as good as we initially predicted, the downtime would improve by 0.81 minutes per year. Currently, this is a small amount of the total system downtime, but as we approach 99.999% availability it will start to become more significant.
Figure 8.8. Widget System interface card software failure rate sensitivity (x-axis: interface card software failure rate, in failures per year).
Additionally, because it is difficult to predict the software failure rate with a high degree of accuracy, especially for a new system, further investigation of the interface card software failure rate is warranted.

Figure 8.9 shows the sensitivity to software failover time. We initially estimated the software failover time at 10 seconds (0.16667 minutes). If we can cut this in half, we can realize a downtime savings of 0.57 minutes per year. This is a pretty small amount of improvement. With the original estimate of 10 seconds already below the 15-second threshold under which outages typically are not counted, this parameter can probably be ignored during the first stage of system improvements.

Now that we have analyzed our top eight parameters, what have we learned? Which parameters are the ones we should try hardest to improve? If we go back through the analysis we did along with each graph, we can easily list them in order of how much downtime they could save relative to our original estimate of 45.74 minutes per year for the system. The results are shown in Table 8.7. From this, we can see that of the eight parameters that are likely to improve system availability within practical bounds, three really stand out as having a large potential: (1) control board software failure rate, (2) software coverage, and (3) uncovered software fault detection time. These three are the ones we should concentrate on first.
Figure 8.9. Widget System software failover time sensitivity (x-axis: SW failover time, in minutes).
Table 8.7. Widget System downtime improvement by parameter

Parameter                                    Downtime savings (minutes/year)
Control board software failure rate          37.00
Software coverage                            33.30
Uncovered SW fault detection time—active     10.40
SW failover success percentage                2.08
Manual failover time—SW                       1.00
Interface card software failure rate          0.81
SW failover time                              0.57
Hardware coverage                             0.27
The reader may be wondering why the relative order of the parameters has changed since the first sensitivity ranking shown in Table 8.5. The reason for the reordering is the range across which the parameters may reasonably vary. For example, the software failure rate can vary across a large range, creating a larger downtime effect than something with a smaller range, such as the failover success percentage, which is already near the end of its range with a nominal value of 99%. We also need to point out that the downtime improvements predicted in Table 8.7 are not cumulative; in other words, if we improved every one of the parameters, we would not reduce downtime by the sum of the savings shown for each individual parameter.

Now that we know where to focus our efforts, we need to figure out how to do it. Most products have a list of features that have been proposed but are not currently committed for development. This list usually comes from a variety of sources: customer input, product management, architects and developers who see weaknesses in one area, competitive intelligence, and so on. It is the job of product management to prioritize the list of proposed features and to select those features that will be implemented in each future release. Frequently, the reliability-related features pose a significant challenge for the product managers. This is because most of the reliability features sound good and qualitatively seem like good things to do, and the product managers seldom know how to compare one feature against another. Sensitivity analysis can help immensely with this challenge. The following example demonstrates how.

Suppose the Widget System has a list of 50 proposed new features, of which the following are expected to decrease system downtime due to unplanned events:
1. Add SNMP-based remote system monitoring
2. Redesign the cooling system to employ larger fans
3. Add process monitoring capability to all software processes
4. Buy static-code analysis tools and check CB and IC software
5. Redesign the CB and IC to add parity to the address buses
6. Add heartbeats between the IC and CB to allow the CB to monitor the IC
7. Add data auditing to the CB software to find data inconsistencies
8. Test-only feature that adds numerous robustness tests
9. Design a system-wide exception hierarchy for software exceptions

One good way to visually assess these features is to build a mapping table that shows how the features map to the five parameters we wish to improve. Table 8.8 shows the mapping for the proposed Widget System features. It is easy to see that one of the features does not address any of the parameters we are most interested in. This feature, "redesign the cooling system to employ larger fans," should thus be low priority; the product manager should defer it to a future release unless there are other, non-reliability-related reasons to include it in this release. Additionally, two of the parameters (manual failover time for software and software failover time) are unaddressed by any of the features. Depending on the results of the rest of the feature analysis, it may be appropriate to consider separate features that directly address these parameters.

We have easily reduced the feature list by one. Now it gets a little more difficult, but because we have a model, we can estimate the downtime improvement of each of the proposed features. We do this by plugging the estimated parameter improvements back into the model. We should note that the estimates for each of the features are not necessarily cumulative. For example, if we look at software coverage, there are five features that improve it, and the total estimated improvement would be 6.5% if we added all five together. Typically, there is some overlap among the features; that is, some software faults are likely to be detected by more than one of the new features. Thus, care must be taken when creating a table like this to account for overlap and count the overlapping portions in only one of the features. We have done that in Table 8.8. Regardless, using the estimates as if we were implementing the features one at a time is a reasonable way to make comparisons between the features.
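A mapping like Table 8.8 can also be kept alongside the model as a simple data structure, which makes the screening questions above (which features touch no influential parameter, and which parameters are touched by no feature) easy to answer mechanically. In the sketch below the parameter keys follow Table 8.5, but the feature-to-parameter assignments shown are illustrative placeholders rather than the book's Table 8.8 entries:

```python
# Illustrative mapping of proposed features to the model parameters they are
# judged to improve. Parameter keys follow Table 8.5; the assignments below are
# examples only, not the book's Table 8.8 entries.
feature_map = {
    "Add SNMP-based remote system monitoring":                      {"MTTR_sfdta_sw"},
    "Redesign the cooling system to employ larger fans":            set(),
    "Add process monitoring capability to all software processes":  {"C_sw"},
    # ... the remaining six features would be filled in the same way
}

target_params = {"C_sw", "F_sw", "MTTR_sfdta_sw", "FR_cb_sw",
                 "C_hw", "MTTR_fom_sw", "FR_lc_sw", "MTTR_fo_sw"}

# Features that touch none of the influential parameters are low-priority candidates.
low_priority = [f for f, touched in feature_map.items() if not (touched & target_params)]

# Influential parameters that no feature addresses may warrant features of their own.
unaddressed = target_params - set().union(*feature_map.values())
```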
Table 8.8. Feature mapping to reliability parameters

Parameters (columns): CB SW failure rate; Software coverage; Uncovered SW fault detection time; SW failover success percentage; Manual failover time—SW; IC SW failure rate; Software failover time; Hardware coverage.

Features (rows): Add SNMP-based remote system monitoring; Redesign the cooling system to employ larger fans; Add process monitoring capability to all software processes; Buy static-code analysis tools and check CB and IC software; Redesign the CB and IC to add parity to the address buses; Add heartbeats between the IC and CB to allow CB to monitor IC; Add data auditing to the CB software to find data inconsistencies; Test-only feature that adds numerous robustness tests; Design a system-wide exception hierarchy for software exceptions.

(Each cell of the table gives the estimated improvement in that parameter expected from the corresponding feature.)
Table 8.9. Widget System per-feature downtime improvement

Feature description                                                  Downtime improvement (min/year)
Test-only feature that adds numerous robustness tests                14.94
Buy static-code analysis tools and check CB and IC software          13.25
Add process monitoring capability to all software processes          12.49
Add SNMP-based remote system monitoring                              10.47
Add heartbeats between the IC and CB to allow CB to monitor IC        8.32
Add data auditing to the CB software to find data inconsistencies     2.08
Design a system-wide exception hierarchy for software exceptions      2.08
Redesign the CB and IC to add parity to the address buses             0.08
When we substitute the modified parameters for each feature back into our model, we get the results shown in Table 8.9. The product manager now has the information needed to make intelligent decisions about which features to include in each release. This information can be used along with the estimated cost to develop each feature to calculate the cost per downtime minute for each feature, and then to invest first in the features with the lowest cost per downtime minute. Additionally, if it is known roughly how much can be invested in reliability-improving features in each release, then a reliability road map showing the system availability on a per-release basis can easily be created.

For example, assume that the product manager has assessed the cost of deploying each of the eight features listed in Table 8.9, and has determined that the development budget will only support implementing the first three in the current release. Using that information, along with the estimated parameter improvements from Table 8.8, the system downtime for this release may be calculated. In this case, the new release will have a downtime of 13.16 minutes per year. This is a significant step toward 99.999%, but still does not quite make it. The next step is to repeat the process for additional releases until the system is capable of meeting its availability target.
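Turning Table 8.9 into a release plan is then straightforward arithmetic. The sketch below ranks features by cost per minute of downtime saved, using the Table 8.9 improvements with abbreviated feature names; the cost figures are invented purely for illustration:

```python
# Table 8.9 improvements (min/year), with abbreviated feature names; the cost
# figures are hypothetical placeholders, expressed in arbitrary units.
features = {
    "Robustness tests":       {"saving": 14.94, "cost": 300},
    "Static-code analysis":   {"saving": 13.25, "cost": 120},
    "Process monitoring":     {"saving": 12.49, "cost": 200},
    "SNMP remote monitoring": {"saving": 10.47, "cost": 150},
    "IC/CB heartbeats":       {"saving": 8.32,  "cost": 180},
    "CB data auditing":       {"saving": 2.08,  "cost": 90},
    "Exception hierarchy":    {"saving": 2.08,  "cost": 250},
    "Address-bus parity":     {"saving": 0.08,  "cost": 400},
}

# Invest first in the features with the lowest cost per minute of downtime saved.
ranked = sorted(features, key=lambda f: features[f]["cost"] / features[f]["saving"])
for name in ranked:
    f = features[name]
    print(f"{name}: {f['cost'] / f['saving']:.1f} cost units per saved minute/year")
```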
CHAPTER 9

PREDICTION ACCURACY
Software reliability and system availability predictions are inherently "softer" than hardware reliability predictions because software and systems exhibit more construction variability and overall complexity. Nevertheless, good and well-controlled development processes and practices can consistently produce high-quality, high-reliability products that can be reasonably well predicted. By recognizing system availability predictions as tools, just like market forecasts, that are useful in guiding business decisions, one can focus on using predictions to make the best business decisions rather than worrying about quantifying the confidence intervals. After all, just because a market forecast is not likely to be highly accurate does not mean it is not essential to diligently create the best estimate that one possibly can to drive the best business decisions.

We have discussed analyzing data from both laboratory testing and field operation and using the statistics generated from these data in the prediction of the unknown reliability characteristics of a new product. In doing so, the statistical analysis we perform frequently involves making inferences from a sample to a population. The noise in the data we sample will be carried into the prediction and, hence, is one factor that can influence the prediction accuracy. In Chapter 6, we discuss estimation error and how to associate confidence intervals with a point estimate to show the confidence level. In this chapter, we focus on reducing the uncertainties related to sampling and discuss how to improve prediction accuracy. This chapter addresses three basic questions:

1. How much field data is enough? Or how large a sample should be drawn?
2. How does one measure sampling and prediction errors?
3. What causes prediction errors?

9.1 HOW MUCH FIELD DATA IS ENOUGH?

9.1.1 Why Study a Sample? Why Not the Entire Population?
In many applications, the population is too large to study, that is, hundreds of thousands or millions of cases. In these applications, samples are taken and studied instead of the entire population. The following practical factors make studying an entire population difficult:

1. Time, money, and patience limit studying an entire population.
2. Depending upon the time it takes to gather the data, the population may change by the time the generalization is made, making the generalization invalid.

In reliability engineering, reliability characteristics such as time to failure, availability, and outage duration change over time. Moreover, real-world operational characteristics introduce some amount of variation in these reliability characteristics. Hence, samples are taken and analyzed to study the reliability characteristics and their trend over time.

The methods used in sampling theory include probability techniques and nonprobability techniques. Examples of the former include simple random sampling, stratified sampling, cluster sampling, multistage sampling, and systematic sampling. Examples of the latter include quota sampling, accidental sampling, purposive sampling, and snowball sampling. Chaudhuri documented details on these techniques and the statistical analyses associated with them [Chaudhuri2005].

For example, we use cluster sampling to analyze data from different customers. Generally, the best field data to analyze comes from a single customer for a contiguous time window. Ideally, it would consist of a single software release running on a set of uniform hardware configurations, operated by staff with consistent training, policies, and procedures. Although these constraints will generally reduce the overall number of events to analyze, those events should be much more consistent because many of the factors that affect outage event and duration data should be homogeneous.
Unfortunately, using a smaller, homogeneous dataset raises the question of how much field data is enough to draw meaningful conclusions.

Another example is that we use multistage sampling when analyzing the characteristics of failures during the burn-in, normal operation, and wear-out phases. Since the failure phenomena are quite different during different phases of the product life cycle, as shown in Figure 5.6, different samples are drawn at different stages and analyzed separately.

Once the sampling method(s) are determined, the next question is the size of the samples; in other words, how large a sample should be drawn? The goal is to acquire a sample that is representative of the population, that is, a proportional miniature of the population. The following factors should be considered in determining an adequate sample size:

1. The variability of the trait in the population. The more variable the trait, the larger the sample must be.
2. The sampling method used. Some methods are more efficient than others in securing a representative sample.
3. The power of the statistic (1 – β) used to analyze the data. Some statistics are more powerful than others. The statistic (1 – β) measures one type of sampling error; see the discussion later in this section for more detail on this measurement.
4. The required level of accuracy of the generalization. Up to a certain limit, the larger the sample, the more accurate the generalization.

Theoretically, one never knows for sure if the sample is representative of the population unless the sample is the population. It is true, though, that the larger the sample, the more likely it is to be representative. The smaller the sample, the greater the underrepresentation of the population variance. There are methods to check the representativeness of a sample. One way is to draw a second sample of the same size, by the same method, and compare the statistics derived from the first and second samples. If the two sets of sample statistics are comparable, confidence that the first sample is representative increases.

Based on our experience in reliability applications, two key characteristics "size" an outage dataset:
1. Exposure time—the number of element years of service covered by a particular dataset. As products can have widely differing failure rates, it is often useful to consider how many predicted hardware system MTBF periods are covered by the dataset. Although one may have significant uncertainty about the software failure rate of a system, the predicted hardware MTBF of a system is straightforward to calculate. For example, if the predicted hardware MTBF for a hardware element is 2 years and 100 element years of service are covered (for example, 25 hardware elements for 4 years), then 50 hardware MTBF periods are covered.

2. Number of outage events by category—the number of categorized outage events is clearly a significant parameter. As more events are recorded in each category, richer information becomes available and deeper analysis is possible. At the highest level, one should count the number of product-attributable hardware and product-attributable software events to characterize the "size" of the dataset. As one analyzes deeper characteristics of a system—say, software failures on a particular FRU, or even automatically recovered software failures on a particular FRU—one may be able to draw conclusions from larger datasets that contain sufficient events of each particular category to draw meaningful insights.

Thus, all analyses of field data should always explicitly state both of these size characteristics to better calibrate the strength of any observations or conclusions drawn from the dataset. As a crude "bigger-than-a-breadbox" metric, the authors have used the general buckets described in Table 9.1 for sizing outage datasets.

There are two basic risks associated with using a given dataset to predict the actual long-term availability. The first risk is that the dataset will indicate that the system is much less reliable than it really will be in the long term (due to a high number of failures during the data collection interval).
Table 9.1. Dataset characterization

Standard   Probability   Note
Gold       90%           Inside the range 9 of 10 times
Silver     80%           Inside the range 4 of 5 times
Bronze     50%           Inside the range 1 of 2 times
The second risk is that the dataset will indicate the system is much more reliable than it really will be in the long term. The first risk is called the "producer's risk," or alpha (α) risk. This is because, based on the dataset, the producer could conclude the system is not reliable enough to ship, and then expend extra effort and expense to improve the system reliability. The second risk is called the "consumer's risk," or beta (β) risk. This is because the consumer could receive a system that is actually less reliable than indicated by the dataset. In assigning the dataset ratings (gold, silver, or bronze), we assumed that the producer and the consumer shared the risk equally, that is, α = β.

The gold, silver, and bronze standards were computed such that the estimated field hardware MTBF during the dataset interval will be between 67% and 150% of the true hardware MTBF. (The hardware MTBF is used because measuring and estimating hardware failure rates is a well-understood process, whereas the equivalent process for software is less mature.) However, there is still a finite probability that the true MTBF will lie outside this range. The gold, silver, and bronze standards are defined by the probability that the true hardware MTBF will fall between 67% and 150% of the hardware MTBF estimated by the dataset. Table 9.2 shows the probability that the true hardware MTBF will fall within the range for each characterization level.

It should be noted that "real-world" factors will increase the errors associated with these analyses. Additionally, sub-bronze-standard datasets require extra care, especially for small datasets with no observed failures.

Table 9.2. Dataset characterization levels
General characterization        Exposure time      Nominal exposure time (NE years)   Product-attributable outage events
Gold "very good" dataset        >57×HW-MTBF        >57                                >100
Silver "good" dataset           >35×HW-MTBF        >35                                >50
Bronze "acceptable" dataset     >12×HW-MTBF        >12                                >20
Inconclusive "poor" dataset     <12×HW-MTBF        <12                                <20
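The characterization in Table 9.2 is easy to apply mechanically. The sketch below assumes that both the exposure and the event-count thresholds must be met for a given level (the text does not spell out how the two columns combine), and the helper name is ours:

```python
def characterize_dataset(element_years, hw_mtbf_years, outage_events):
    """Classify an outage dataset against the Table 9.2 thresholds.
    element_years: total element-years of service in the dataset
    hw_mtbf_years: predicted hardware MTBF of one element, in years
    outage_events: product-attributable outage events observed"""
    mtbf_periods = element_years / hw_mtbf_years    # exposure in HW-MTBF periods
    if mtbf_periods > 57 and outage_events > 100:
        level = "gold"
    elif mtbf_periods > 35 and outage_events > 50:
        level = "silver"
    elif mtbf_periods > 12 and outage_events > 20:
        level = "bronze"
    else:
        level = "inconclusive"
    return mtbf_periods, level

# Example from the text: 25 elements for 4 years with a 2-year predicted HW MTBF
# gives 100 element-years, or 50 HW-MTBF periods of exposure.
print(characterize_dataset(100, 2, 60))   # -> (50.0, 'silver')
```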
9.2 HOW DOES ONE MEASURE SAMPLING AND PREDICTION ERRORS?

9.2.1 Sampling Error
Analysis of field outage data should yield downtime/availability estimates, and larger datasets should yield stronger estimates. As a practical matter, how close should the predicted failure rate and availability estimates be to actual field data? Or, in other words, how does one know if the generalization from the sample to the population is correct? In statistical inference, the correctness of a generalization from the sample to the population can be measured by estimating the margin of error associated with sample statistics, which is known as the standard error of the statistic. Standard errors estimated from the sampled data can be used to construct a confidence interval with a certain confidence level. In Chapter 6, we discussed calculating confidence intervals for two important reliability metrics: failure rate and unavailability (and hence availability).

We have discussed taking multiple samples of the same size and by the same method to verify whether a random sample is representative. Suppose we collect n samples. As the sample size (n) becomes larger, the sampling distribution of means becomes approximately normal, regardless of the shape of the variable in the population. This is known as the central limit theorem (CLT). When multiple samples are taken, the mean of the sample means is used to estimate the true population mean, and the standard deviation of the sampling distribution, which is known as its standard error, is used to estimate the standard deviation of the population. Once the mean and standard deviation are determined, confidence intervals can be constructed to associate the point estimate with a confidence level. Section 3.3 in Appendix B provides a more detailed discussion.
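As a concrete illustration of the machinery described above, the following sketch computes a point estimate, its standard error, and a normal-approximation confidence interval from a single sample; this is the generic textbook calculation rather than a procedure specific to this book, and the data values are invented:

```python
import math

def mean_se_ci(sample, z=1.96):
    """Point estimate, standard error, and approximate 95% confidence interval
    (z = 1.96) for the population mean, computed from a single random sample."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))   # sample std dev
    se = s / math.sqrt(n)                                           # standard error
    return mean, se, (mean - z * se, mean + z * se)

# e.g., observed outage durations in minutes from a sampled field dataset:
print(mean_se_ci([12.0, 45.0, 8.0, 30.0, 22.0, 17.0, 60.0, 25.0]))
```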
9.2.2 Prediction Error
First, one must define a method for characterizing the prediction error. The authors suggest characterizing prediction error by taking the absolute value of the difference between the predicted downtime and the actual downtime, and dividing the result by the predicted downtime. Mathematically, the prediction error PE is given by
PE = |DTpred – DTact| / DTpred          (9.1)
where DTpred is the predicted downtime and DTact is the actual downtime. For example, if 5 minutes of downtime is predicted and 30 actually occurs, then prediction error is 500% [(30 – 5)/5]. Alternatively, if 30 minutes is predicted and 5 minutes actually occurs, then prediction error is 83% [(30 – 5)/30]. Naturally, the units on the prediction and the actual downtime should be identical. For instance, if the prediction is for unplanned product-attributable hardware and software downtime, then the actual downtime against which the prediction is compared should be unplanned product-attributable hardware and software downtime.
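Equation 9.1 is trivial to encode; the helper below (the name is ours) reproduces the two worked examples in the preceding paragraph:

```python
def prediction_error(dt_pred, dt_act):
    """Prediction error per Equation 9.1: |predicted - actual| / predicted."""
    return abs(dt_pred - dt_act) / dt_pred

print(f"{prediction_error(5, 30):.0%}")    # predicted 5 min, observed 30 -> 500%
print(f"{prediction_error(30, 5):.0%}")    # predicted 30 min, observed 5 -> 83%
```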
9.3 WHAT CAUSES PREDICTION ERRORS?
Given tightly controlled development processes, a correctly calibrated model with accurate input parameters, and a large deployment of systems, one might expect prediction errors of perhaps ±25% relative to normal, steady-state field operation. "Normal" steady-state field operation explicitly excludes:

• Exceptional or epidemic quality problems that could cause failure rates to substantially differ from predictions. For example, a manufacturing defect in a batch of parts used in the production of some systems can cause hardware failures that are higher than predicted.
• Differences in operational profile. Products are architected, designed, and tested to perform a particular function in a particular operational context or profile. Obviously, development teams strive to align the design and test operational profiles to match those of their customers and end users. Differences between the design and test operational profile and the end users' operational profile represent gaps that the product may not have been designed to cover and may not have been tested against. For example, whereas it is possible to haul loads of bricks in the trunk of a four-door sedan, since most sedans are not designed or tested for hauling bricks, they might not be as reliable hauling bricks as they would be carrying people.
• Customer training and experience. Sometimes outages on new, unfamiliar equipment take customers longer to resolve manually. This may be because of limited training and experience or issues with procedures or policies.
• Other customer-specific factors. Chapter 4 enumerated many specific reasons why perceived failure rates and outage durations vary between customers. Variations in reported failure rates and actual outage durations directly impact measured availability.

Should prediction error substantially exceed this tolerance, and should neither exceptional quality problems nor customer-specific factors such as training and experience explain the gap, then consider the following:

1. Incorrectly estimated input parameters. Failure rates, coverage factors, and outage durations are highly influential parameters, and initial estimates could differ from actual field performance.
2. Flawed model. The model might not reflect all the important service-impacting failure modes, or might not properly model failure detection and isolation performance, or might not adequately model the system's recovery strategies.
3. Incorrect operational profile. Perhaps customers are using the product differently from how it is being tested and, thus, system test results may be poor indicators of how the system will perform in the customers' networks.
4. Substantial human- or procedure-attributed downtime. This book has focused on product-attributed downtime, which is typically dominated by software or hardware failures. Systems can also experience downtime from human errors, vague or incorrectly documented procedures, poor training or instructions, and so on. If substantial downtime from human- or procedure-attributed events is assigned by the customer against product downtime, then this may cause a difference from a typical downtime prediction. The simplest solution for this is to include a downtime allocation for other product-attributed downtime to cover miscellaneous human- or procedure-attributed downtime that is deemed by customers to be product- or supplier-attributable. The allocation for "other" product-attributed downtime will vary from product to product depending on the nature and extent of human operations, administration, maintenance, provisioning, and other interactions with the system itself.
"Other" product-attributed downtime will often represent 5% to 30% of overall product-attributed downtime.
CHAPTER 10

CONNECTING THE DOTS
This chapter presents a typical strategy for applying the practical software reliability and system availability modeling techniques presented in this book to develop a high-availability system that meets the market's availability expectations. Figure 10.1 illustrates the general steps in the development process that are typically necessary to deliver high-availability systems. The major development-phase activities are:

• Set availability targets and requirements. If clear requirements and targets are not set, then it is unlikely that customer expectations will be met.
• Select appropriate architectural reliability techniques. Architects and designers must create an appropriate system architecture and include the necessary reliability techniques in both the high-level and low-level designs to meet those requirements.
• Model to assure feasibility. Mathematical modeling should be used both to assure that the architectural reliability techniques chosen are sufficient to meet the requirements and to assure that quantitative system performance characteristics such as switchover and recovery times, failure rates, and coverage factors are appropriate.
• Design and develop product. Product design and development proceeds to address all functional and nonfunctional product requirements, including reliability and availability requirements. Particulars of the design and development process are not covered in this book.
• Testing. Appropriate robustness/resiliency, stability/endurance, overload, and general system testing is essential.
• Update availability prediction. Test results are analyzed and availability predictions are updated to give the project team and decision makers insight into what the system reliability and availability are likely to be. These updated predictions are used both to adjust project and test plans before testing completes, and to set appropriate availability expectations with customers.
Figure 10.1. Overview of reliability activities in the product life cycle.
After the product is released to the field, the following activities are appropriate:

• Periodic validation and calibration with field data. Analyze field data to identify gaps relative to targets and requirements, and calibrate models and prediction techniques to use for future product releases.
• Reliability road mapping. If there is a gap between actual field performance and requirements, then a release-by-release road map of reliability-/availability-improving feature and testing investments can be constructed and executed.

Best practice is to pull together material on a product or solution's reliability into a written reliability report to clearly align reliability expectations across product management, development, sales and marketing, and other areas.
The report gives reliability targets, reliability architecture to support those targets, modeling and budgeting to assure those targets are met, and other relevant information. An outline for a comprehensive reliability report is provided in Appendix A—System Reliability Report Outline. Included within the outline are guidelines for completing each section. Each of these reliability-related activities is reviewed below.
10.1 SET AVAILABILITY REQUIREMENTS
The first step in architecting a highly reliable system is to set an availability goal for the system. The market's availability expectation is often triangulated from customer requirements (e.g., a written request for proposal, or RFP), the availability of competitors' products, previous product releases, and recommendations from applicable industry standards. Typically, this target is quantitatively defined for product-attributable service downtime; often, the value for mission-critical systems is 99.999%, or "five 9's," which is 5.26 down-minutes per year.

10.2 INCORPORATE ARCHITECTURAL AND DESIGN TECHNIQUES

Highly available systems are created by combining an appropriate redundancy architecture with mechanisms to detect, isolate, and recover from faults, and a thorough quality-management system. Highly available systems never happen by accident; in fact, they are extremely difficult to create through the normal product-evolution process. The most successful highly available systems include specific availability-enhancing techniques in the first release and then expand on them in every release. The remainder of this section enumerates many of the generally recognized best practices in the areas of physical design, system hardware, system software, and the procedures designed to maintain the system.
10.2.1 Physical Design Techniques
The physical design availability-enhancing techniques address the areas of system power, cooling, maintenance, and alarming. Implementing these techniques helps ensure that the system can survive single failures in the power or cooling systems and also makes system maintenance easier and less error prone.
The physical design techniques are:

• Redundant power feeds
• Field-replaceable power converters/supplies
• Field-replaceable fans
• Fan alarms
• Fail-safe fan controllers
• Independently replaceable fan controllers
• N+1 fans
• Temperature monitoring
• Field-replaceable circuit breakers
• Individually fused cooling unit power feeds
• Connector reliability evaluation
• Individually fused fans
• Independent redundant power fuse blocks
• Power switch protection
• Put all electronics in field-replaceable units
• Provide visual status indicators
Each of the physical design techniques is discussed in further detail in the following subsections.

10.2.1.1 Redundant Power Feeds
This technique provides dual independent power inputs to the system. In a DC system, these are typically called the "A-bus" and the "B-bus." Frequently, these buses are backed up by batteries, and the batteries are kept charged via the commercial AC power source. AC power is more challenging to supply in a truly independent dual manner because the phases may need to be synchronized and, typically, the local electric utility runs only a single feed to a specific location. However, it is common to have backup generators that can supply AC power if the commercial power fails. This type of AC redundancy is common in hospitals and other locations where failure of the AC power can have catastrophic consequences. Using redundant power feeds allows the system to continue operating when there is a failure on one of the supplies, whether it is from battery failure, blown circuit breakers or fuses, lightning strikes, and so on, and also allows supply maintenance (such as battery replacement). Several of the other reliability techniques rely on this technique being implemented.
10.2.1.2 Field-Replaceable Power Supplies
The main power supplies or power converters for a high-availability system should be designed into a FRU that can be swapped out quickly. They should not be an integral part of the system housing or backplane because power supply failure would then lead to the need for a system shutdown and rebuild. Power supplies and converters typically are a little less reliable than many other electronic components because they tend to generate a significant amount of heat (which can increase the failure rate of all the components in the supply), and they often contain fans (and, thus, moving parts, which wear out faster than other electronic components). Because power supplies and converters are somewhat less reliable than other electronic components, it is especially important to make them easy to replace without having to shut the system down.

10.2.1.3 Field-Replaceable Fans
This technique builds the cooling fans into field-replaceable units (FRUs) so that they may be quickly and easily replaced. Fans are among the system components with the shortest expected lifetime, and failure of a fan can cause other components to fail from overheating. Thus, it is especially important to make fans so they can be replaced quickly, before the lack of cooling causes a failure in any other components. Not implementing this technique may mean that the entire system must be shut down while the failed fan is replaced.

10.2.1.4 Fan Alarms
This technique generates an alarm whenever a fan or fan controller fails so that the technicians know it needs replacing. This results in the fans being replaced more quickly and ultimately reduces the exposure of other parts of the system to higher than normal temperatures; exposure to elevated temperatures stresses the hardware components and can lead to premature hardware failure. For this technique to be most effective, the previous technique, field-replaceable fans, should also be implemented.

10.2.1.5 Fail-safe Fan Controllers
Many systems use variable-speed fans for their cooling needs and in normal operation run the fans only as fast as required to keep the temperature within design limits (often, this is less than 100% of full speed). The lower speed reduces noise levels and also consumes less power.
It is the job of the fan controller to determine the optimum speed at which to run the fans. This fail-safe fan controller technique provides fan controller circuitry designed so that a controller failure will result in the fans automatically going to high speed. This keeps the system from overheating before a technician can travel to the site and replace the failed controller. Elevated temperatures accelerate hardware failures, and if components exceed their maximum operating temperatures they will often fail quickly. Some failures could lead to all fans stopping; if all fans stop and the equipment remains powered on, then maximum operating temperatures of components can be exceeded and hardware failures may occur, or an automatic overtemperature system may shut the system down. Such failures represent unprotected series elements in the system and increase system downtime significantly; they should be eliminated if possible.

10.2.1.6 Independently Replaceable Fan Controllers
This technique keeps the fan controller circuitry in a FRU separate from other circuitry. This ensures that the fans do not stop while circuitry unrelated to the fans is being serviced. A reasonable alternative in some cases is redundant fan controllers. As an example, consider the Widget System of Figure 5.21. It might have been possible to include the fan controller circuitry on the control boards rather than within each fan assembly. Had that been done, the replacement of a failed control board would have caused the fan controller to be removed at the same time as the faulty control board; and since the control boards are highly complex circuits, densely populated with electronics, they have a relatively high failure rate compared to the fan controller circuitry. Perhaps this could have been mitigated by providing redundant fan controllers, one on each control board, but in the end that would have added a significant amount of additional complexity compared to the chosen implementation.

10.2.1.7 N+1 Fans
This technique allows the system to run indefinitely with one fan out of service. This is necessary to accommodate single fan failures and allow the system to continue to run until the failed fan can be replaced. It is especially important to implement this technique in systems in which technicians are not present or service is not possible until a later point in time. Implementing this technique allows the system to continue operation until technicians can replace the faulty fan.
10.2.1.8 Temperature Monitoring
This technique monitors the temperature at key points throughout the system, and generates alarms and/or automatically increases the fan speed when temperatures exceed the limits. (Note: there may be cases when it is inappropriate to turn up the fans; for example, when the temperature is so high that combustion is likely. In this case, the fans should be turned off so as not to feed the fire.) Some components, such as CPUs, now come with built-in temperature sensors. These sensors should be monitored and included in the overall temperature monitoring hierarchy. Even in cases in which the internal chip sensors shut down an individual chip (so it does not destroy itself), monitoring of the temperature provides an indication of why the system shut down and, in some cases, may be the only indicator of what the problem was.

10.2.1.9 Field-Replaceable Circuit Breakers
Providing field-replaceable circuit breakers ensures that circuit breakers may be quickly and easily replaced, without removing any other system element. Circuit breakers typically have mechanical components and, thus, they can have a higher probability of failure than nonmechanical components, so it makes sense to plan for the occasional failure. If the "redundant power feed" technique has been implemented, then replacement of an individual circuit breaker should be possible while the system is operating.

10.2.1.10 Individually Fused Cooling Unit Power Feeds
This technique depends upon implementation of the "redundant power feed" technique. If redundant power feeds have been implemented, then this technique employs a separate fuse or circuit breaker in each of the connections between the main power feed and the cooling unit. This ensures that the cooling units have power in the event of a single power feed failure, or if there is a blown fuse or circuit breaker in one of the power feeds.

10.2.1.11 Connector Reliability Evaluation
The idea behind this technique is to ensure that all the connectors used in the system provide the required reliability. Connectors can be a major source of failures, both hard and intermittent, so evaluating each connector carefully is warranted.
Typically, this evaluation is done by a component control organization with expertise in reliability. The evaluation should consider various aspects of the connectors chosen, including such things as keying (can it be connected improperly?), likelihood of pin bending, type (for example, tin versus gold), and thickness of the plating on the contacts.

10.2.1.12 Individually Fused Fans
If fans are not fused individually, a shorted fan can cause the main power fuses to fail and lead to failure of the entire cooling unit. This risk can be avoided by providing individual fuses or circuit breakers for each fan. Then, a shorted fan will blow only its own fuse, allowing the remaining fans to continue operation.

10.2.1.13 Independent Redundant Power Fuse Blocks
This technique depends upon implementing the "redundant power feed" technique. When redundant power feeds are implemented, this technique provides an independent fuse/breaker block for each feed. Providing separate blocks for each feed allows the individual blocks to be repaired or replaced independent of each other and, thus, enables the system to maintain operation while a repair or replacement is being made.

10.2.1.14 Power Switch Protection
The concept behind the power switch protection technique is simple: make it mechanically difficult or impossible to turn the power off by hitting the switch accidentally. There are many interesting stories about technicians who powered down critical equipment by bumping a power switch with an elbow or knee. This technique puts physical guards, covers, or interlocks on power control switches so that technicians do not accidentally bump them and inadvertently power down equipment.

10.2.1.15 Put All Electronics in Field-Replaceable Units
This technique addresses repair and replacement of electronic equipment. All electronic and electromechanical devices should be contained within field-replaceable units (FRUs) so they may be easily replaced. This includes visual alarm LEDs (light-emitting diode indicators). LEDs should not be an integral part of a system housing, frame, or backplane; if they were, LED failure would lead to the need for system shutdown and rebuild.
Electronic components should not be an integral part of a backplane, as their failure would require replacement of the entire backplane.

10.2.1.16 Provide Visual Status Indicators
This technique adds status indicators (typically LEDs) to FRUs so that service technicians can easily determine when it is safe and appropriate to perform maintenance actions on each FRU. Typically, these indicators will show when power is on or off, and when it is safe to remove equipment for repair and/or replacement.

10.2.1.17 Physical Design Technique Example
The schematic diagram shown in Figure 10.2 includes many of the techniques described in the preceding sections. It shows the cooling system of an arbitrary system. This system uses push/pull fans, with four fans pushing cooling air through the system and four fans pulling air through. Each set of four fans is controlled by an independent controller. Power is provided through a pair of redundant power feeds, each individually fused. The power is converted to 12 volt DC through a pair of field-replaceable converters, and then fed to the fan controllers through a pair of power OR-ing diodes, which allow redundant power supplies to be used in parallel.
Figure 10.2. Example system with physical design techniques.
diodes which allow redundant power supplies to be used in parallel. Each controller is fail-safe, in that if a fault occurs within the controller the fans it controls will all go to full speed. The controllers also provide status indicators for each of the four independently replaceable fans and an alarm output that can be sent to a system monitor. Additionally, the fan capacity is such that the cooling will be adequate with the loss of any individual fan. There are many possible variations on such a cooling design, but this example should give the reader a good idea of some of the things that should be considered.

10.2.2 Hardware Techniques
The hardware availability enhancing techniques primarily provide increased fault detection coverage (for both hardware and software faults), although there are some techniques that help reduce the impact of a fault and some that aid in diagnosing faults. The hardware availability enhancing techniques are:
• Hot swap
• Hardware redundancy
• CPU watchdog timers
• Hardware fault-injection testing
• Power supply monitoring
• Bus parity on parallel buses
• CRC on serial buses
• Bus watchdog timers
• Error checking and correcting on memory
• Soft error detection
• Clock failure detectors
• System activity recorder
The following subsections provide additional details on each of the above hardware availability enhancing techniques. 10.2.2.1 Hot Swap This technique provides the ability to remove and restore a “live” circuit from service so it may be replaced. Hot swap allows FRUs to be replaced without having to shut the power down to other FRUs, providing improved resiliency to faults. Depending on the type and design of the individual FRU, power to the FRU itself may or may not need to be removed to allow replacement. The im-
portant point is that power to other FRUs is allowed to stay on while the FRU of interest is removed. This technique is most commonly associated with backplane-based systems, but applies equally well to other configurations. 10.2.2.2 Hardware Redundancy This technique is a redundancy strategy in which multiple units are pooled. It can take the form of N+K, with N units providing service and K units as backups or spares, or N-out-of-M, where all M units share the load, but only N are required to provide full service. Although it can be used with any component, one of the more common scenarios is to use it with power supplies, since the power supplies tend to have higher failure rates (and a greater sensitivity to heat) than some of the other electronics. It is also used frequently with CPUs. Another example is the jet engines on an airplane. In these cases, some capacity may be lost, but the entire service does not go down. This method tends to be less expensive than a 1+1 or an active/standby type arrangement because there are fewer total spares (one spare for each active element in an active/standby arrangement versus K spares in an N+K system, or M – N spares in an N-out-of-M system, with K (or M – N) typically much less than the total number of units needed to provide service). The reliability may also be slightly less but, typically, the difference is small, assuming that faults are independent. 10.2.2.3 CPU Watchdog Timers This technique provides timers that require the software to routinely “check in” to prove software sanity. If the CPU fails to “check in” with the watchdog timer in the predetermined time, then the watchdog hardware typically triggers a hardware reset and/or raises an alarm. The technique is good for detecting both hardware and software faults. It detects insanity and leads to faster recovery by limiting the amount of time it takes to recover from an insane CPU. 10.2.2.4 Hardware Fault-Injection Testing This testing technique simulates hardware failures to make sure the system properly recognizes and responds to the faults. The cause of downtime is often due to incorrect handling of a fault after it was detected properly. Fault injection provides a mechanism to test that the faults are handled properly. Some of the larger integrated circuits provide support for fault injection through
their JTAG pins. These pins may be used to tri-state the chip outputs, making it look like the chip has been removed from the circuit. Some of the more sophisticated chips support bit-level control of individual input/output pins, making it possible to simulate very specific faults. 10.2.2.5 Power Supply Monitoring This technique monitors the voltage, current, and temperature of each power supply (or power converter), and generates an alarm when the outputs go out of range. This allows system software to react to the out-of-range signal and turn components off in a controlled manner, thus preserving as much service as possible and avoiding damage to the electronic components. 10.2.2.6 Bus Parity on Parallel Buses This technique adds parity to parallel buses (typically at least the address and data buses), usually one parity bit per byte of the bus. This technique detects errors induced by electrical noise, marginal timing, marginal components, and so on on the circuit board. Similar to “error checking and correcting on memory,” without this technique the system may get corrupted data or instructions and, not knowing it is corrupted, put the system into an erroneous state. This type of fault may take a long time to manifest itself and, thus, may propagate throughout the system. Using the bus parity technique allows the corrupt data to be blocked at the source before it propagates. 10.2.2.7 CRC on Serial Buses This technique adds cyclic redundancy checking (CRC) to serial buses. This technique detects errors induced by electrical noise, marginal timing, marginal components, and so on on the communication link. Although this technique is listed as a hardware technique, it is acceptable if CRC is added by software. The objective is to detect corrupt messages on the link, with the expectation that CRC will detect messages that get corrupted while in transit. It is also advisable to apply a higher level of message scrutiny in software—a level that makes sure all messages actually make sense. It should be noted that some protocols, such as Ethernet, mandate the use of CRC on serial links, and using those protocols provides the level of fault detection this technique aims to provide.
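As an illustrative sketch (not taken from the book), the following C code shows one common way to protect a proprietary serial message with a CRC; the CRC-16/CCITT polynomial and the simple frame layout are assumptions chosen for the example.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Bitwise CRC-16/CCITT (polynomial 0x1021, initial value 0xFFFF). */
static uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)data[i] << 8;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Sender: append the CRC to the payload before transmission. */
size_t frame_message(const uint8_t *payload, size_t len, uint8_t *frame)
{
    memcpy(frame, payload, len);
    uint16_t crc = crc16_ccitt(payload, len);
    frame[len]     = (uint8_t)(crc >> 8);
    frame[len + 1] = (uint8_t)(crc & 0xFF);
    return len + 2;
}

/* Receiver: recompute the CRC and reject corrupted frames. */
int frame_is_valid(const uint8_t *frame, size_t frame_len)
{
    if (frame_len < 2)
        return 0;
    uint16_t received = ((uint16_t)frame[frame_len - 2] << 8) | frame[frame_len - 1];
    return crc16_ccitt(frame, frame_len - 2) == received;
}
```

In a real design the receiver would also apply the higher-level message scrutiny described above, since a CRC only detects corruption in transit, not messages that are well formed but semantically wrong.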
10.2.2.8 Bus Watchdog Timers This technique detects bus accesses to nonexistent and nonresponsive peripherals, which would otherwise result in bus timeouts waiting for the peripheral to acknowledge the bus cycle. The most probable cause of this type of access is a software bug, but there are also some hardware faults that may be detected with this mechanism. When a bus cycle is started, the timer also starts. The timer period is set such that it is longer than the longest legitimate bus cycle. Thus, if the timer fires it must be due to some type of error in the bus cycle. Typically, the timer output is connected to the CPU on a “bus error” input, but in some cases it may be used to terminate the bus cycle and also generate some form of very high-priority (or even a nonmaskable) interrupt. 10.2.2.9 Error Checking and Correcting on Memory This technique provides checksum bits for every word of memory. Single-bit errors are then corrected as they occur, and double(and some multiple-) bit errors are detected, but not corrected. This technique provides protection against soft errors, which are typically induced by cosmic radiation, as well as errors induced by marginal timing or electrical noise on the circuit board. Without this technique, the system may retrieve corrupt data or instructions from memory and, not knowing it is corrupt, put the system into an erroneous state. This type of fault may take a long time to manifest itself and, thus, may propagate throughout the system. Use of this technique allows the system to prevent propagation of the corrupt information. 10.2.2.10 Soft Error Detection “Soft errors” are errors in which one or more bits of data change state, but not due to a hard failure of a component. They typically occur in some form of memory. When the memory is hit by cosmic radiation (typically neutrons), the charge stored in the memory device may get dispersed, changing the state of the memory bit. A subsequent rewrite of the bit can restore the change, but a read prior to the rewrite will obtain an incorrect value. Soft errors may also be induced by alpha particles in the packaging and a number of other sources, although newer manufacturing and screening techniques have reduced these sources of soft errors. This technique detects soft errors and may also provide correction on the fly. This technique is distinct from the ECC on memory in that it
provides detection for other critical components such as CPUs, FPGAs, and CPLDs. These devices may also be susceptible to soft errors, and as integrated circuit design rules have become increasingly smaller, there is a concern that soft error susceptibility may increase. This is because smaller amounts of charge are stored in each memory cell and are thus easier to disperse. Custom designs for FPGAs and CPLDs can have detection designed in to some extent. Detection within a CPU depends on what the CPU vendor implemented, although there are some vendors that are addressing this concern. 10.2.2.11 Clock Failure Detectors This technique checks for transitions on the clock signal(s) and generates an error signal if the clock fails to transition. Depending on the FRU design, the error signal may need to go off-FRU, since loss of clock may eliminate the ability for the FRU to report the error in the same manner it reports other errors. For example, a clock detector on a CPU clock cannot tell the CPU to report the error, since loss of the CPU clock will prevent the CPU from reporting anything. 10.2.2.12 System Activity Recorder This technique records key usage parameters (such as power-on hours, maximum temperature, maximum voltage, etc.) in nonvolatile storage. The data may then be extracted at the repair center to help analyze failures of returned hardware units, identify use out of specification, and so on. In some systems, it may even be possible to reliably record key data from critical processor exceptions like the value of the program counter, stack pointer, and other registers in nonvolatile memory to aid debugging of critical software failures. Disk drive vendors have adopted this technique in the form of the SMART (self-monitoring, analysis and reporting technology) feature, which records things like head flying height, retries, temperature, spin-up time, and errors. 10.2.2.13 Hardware Technique Example The CPU circuit board shown in Figure 10.3 includes many of the techniques described in the preceding sections. The circuit is a general purpose CPU board that could be used for anything from embedded control to serving Web pages. It includes on-board power converters to generate the different supply voltages needed
Figure 10.3. Example circuit board with hardware techniques.
by the various components on the board. The converter is monitored for voltage and current anomalies and an alarm is generated if any go out of range. There is an on-board clock generator that runs all the digital circuitry on the board. This clock is monitored by a clock failure detector that generates an alarm if the clock fails. The CPU talks to the remaining components over the system bus, which contains both address and data, and includes parity generation and checking to monitor for transient bus errors. A memory controller sits between the dynamic memory and the bus. The memory controller generates and checks the error checking and correcting codes for the memory. This enables soft errors within the dynamic memory to be detected. All the control logic to connect the various components together resides in a field-programmable gate array (FPGA). In addition to the interconnection logic, the FPGA contains a CPU watchdog timer that the CPU must strobe periodically to indicate sanity, a bus watchdog timer that makes sure bus transactions complete, and internal circuitry to detect soft errors within the FPGA itself. Finally, there is a FLASH memory device that the system uses to store the system activity information. Because FLASH memory retains information after the power has been removed, this information is available for use in postmortem debugging and root-cause analysis. Although the
techniques presented for this example CPU board will necessarily be different from the techniques appropriate for other applications, they provide a good example of the types of things that should be considered to create a reliable hardware design.

10.2.3 Software Techniques
The software availability enhancement techniques listed here enhance availability by increasing software fault detection coverage, reducing recovery latency, and increasing fault isolation. The software availability enhancing techniques are:
• Memory protection
• Null pointer access detection
• Overload detection and control
• Heartbeating
• Process/task monitoring
• Memory leak detection
• Asserts
• Return code checking
• Parameter validation
• Timeouts
• Runtime consistency checking of data structures
• Parity/CRC on messages
• Message validation
• Tight loop detection
• Rolling updates
• Robust software updates
• Collect postmortem data
• N+K protection
• Checksums over critical data
• Critical failure-mode monitoring
• Minimize reboot time
• Parallel reboots
• Power-on self-tests
• Camp-on diagnostics
• Routine diagnostics
• Run-time diagnostics
The following sections elaborate on each of the software availability enhancing techniques listed above.
10.2.3.1 Memory Protection Memory protection partitions memory into regions and enforces specific access rights to each region. For example, program text should not be writable once it has been downloaded, and device registers should not be executable. This mechanism catches a variety of program errors and some hardware errors. Many of the software errors may be caught during system test and, thus, introduction of this technique can reduce the SW faults reaching the field. Use of this technique is most prevalent in embedded systems using either a proprietary or a real-time operating system. Systems that use a full scale operating system get some of this protection as part of the operating system; for example, one process is not allowed to write to another processes’ memory. 10.2.3.2 Null Pointer Access Detection Null pointer refers to a reference to memory location 0, and this is almost always caused by software defects that cause dereferencing of an uninitialized pointer value. This technique monitors accesses to address 0 (or a block of addresses starting at 0) to detect accesses to null pointers and null objects. This technique detects many errors early in the development cycle and ultimately results in fewer faults making it to the field. It also provides a detection mechanism for those faults that do make it to the field. This mechanism catches a variety of program errors and some hardware errors. 10.2.3.3 Overload Detection and Control This technique monitors critical system resources and raises an indication when resources become too low. When possible, action is also taken to reduce, in a controlled manner, the load on the specific resource (such as throttling incoming traffic). In typical information systems, the things that are monitored frequently include disk space, memory, CPU utilization, and various types of software control blocks (sometimes referred to as “handles”). Different resources could be monitored in other applications. For example, a Mars explorer may need to preserve enough battery power to make it through the night until the sun starts to recharge its batteries. If it was deemed that the battery was approaching the minimum level needed to remain online overnight, then other systems, such as locomotion, could be shut down.
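A minimal C sketch of the overload detection and control idea follows; the thresholds and the platform hooks (get_free_memory_bytes, get_cpu_utilization_pct, raise_alarm, throttle_incoming_traffic) are assumptions that a real system would supply.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative thresholds, not recommendations. */
#define MIN_FREE_MEMORY_BYTES   (16u * 1024u * 1024u)
#define MAX_CPU_UTILIZATION_PCT 90u

/* Hypothetical platform hooks; a real system supplies these. */
extern uint32_t get_free_memory_bytes(void);
extern uint32_t get_cpu_utilization_pct(void);
extern void raise_alarm(const char *what);
extern void throttle_incoming_traffic(bool enable);

/* Called periodically, e.g., from a monitoring task. */
void check_overload(void)
{
    bool overloaded = false;

    if (get_free_memory_bytes() < MIN_FREE_MEMORY_BYTES) {
        raise_alarm("low memory");
        overloaded = true;
    }
    if (get_cpu_utilization_pct() > MAX_CPU_UTILIZATION_PCT) {
        raise_alarm("high CPU utilization");
        overloaded = true;
    }

    /* Shed load in a controlled manner while a resource is scarce. */
    throttle_incoming_traffic(overloaded);
}
```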
10.2.3.4 Heartbeating This technique provides for heartbeat messages to be exchanged between two (or more) independent entities. Loss of heartbeat may be used to indicate that a fault has occurred, and then trigger a recovery action. This detection can be significantly faster than waiting for some external entity (such as a customer) to provide notification that service is out. The faster the heartbeats, the quicker they detect failures, but the more system resources needed to generate and monitor the heartbeats. Heartbeats are typically tuned to meet the needs of each individual type of system, and the heartbeat messages themselves may include status information. For example, a circuit board that has determined that it has a fault could inform other entities within the system by changing the status indicators in the heartbeats it sends out. 10.2.3.5 Process/Task Monitoring This technique monitors processes or tasks. Processes or tasks that die may then be restarted as necessary. More advanced forms of this technique classify each task/process into different categories, with varying degrees of recovery (such as ignore, restart the task/process, reboot the CPU) assigned to each category. Highly sophisticated HA systems even take interdependencies among the processes or tasks into account when recovering failed processes or tasks. The simpler schemes simply watch to make sure the process or task is alive. More sophisticated systems require the process or task to periodically check in, usually via some type of heartbeat mechanism, to prove it is running correctly. 10.2.3.6 Memory Leak Detection A memory leak occurs when a software process or task requests memory from the system and then, typically through a logic error, fails to return the memory when it is no longer needed. In most operating systems terminating the process will reclaim the memory. However, in many embedded applications, processes are never terminated and, thus, a process with a memory leak can continue to use memory until the system does not have enough available memory to function. Additionally, leaks may occur in the operating system itself, or in drivers written to control a specific component or function. This technique monitors memory usage and detects when memory leaks have occurred. There are two basic flavors of memory leak detectors: those that are run strictly as part
of testing the system, and those that are built into the system and run the entire time the system is operational. There are many commercially available memory leak detectors that run during the testing interval. Built-in detectors tend to be proprietary, but have the advantage of providing better fault coverage during operation. Both types of detectors will reduce the number of software faults that reach the field by catching the offending leakers. 10.2.3.7 Asserts Asserts are an inline defensive check mechanism. They are included within the program at key points when it is important to make sure that a particular condition is true, or some specific data is valid. As an example, consider the software in an automated teller machine (ATM). In most ATMs, the smallest bills that are stocked are $20 bills. Immediately prior to dispensing cash, the ATM might assert that the amount is positive and that it is a multiple of $20. In theory, the software that collects the request from the customer should guarantee that the amount is a multiple of $20, so performing the assertion immediately prior to dispensing cash would help detect any errors in the input functions and also could protect against other potential glitches such as power spikes that might create a transient error of some type. To create an assert, the programmer places an assert call at a key location, and if the assertion is true, the program continues. Failure of the assert results in a predefined recovery action and logging of the error. Asserts can detect and avert field faults, and, additionally, may discover a number of problems during system test, reducing the number of software faults that make it to the field. Many programming languages provide some form of support for asserts, although the support varies somewhat depending on the specific language. In more sophisticated HA systems, it is common for developers to write several different assert types specific to the system. For example, one assert might simply log the error and continue, whereas a second type would log the error and kill the process in which it occurred. Other possibilities exist, and with multiple assert types the programmer can choose the most appropriate action for a given error condition. 10.2.3.8 Return Code Checking This technique checks the return codes from function/method calls. Many functions return indications of whether they were suc-
cessful in performing the requested operation and, if not, it seldom makes sense for the calling routine to continue. For example, consider a system call to read data from a file into a buffer in memory. If the read fails, then it makes no sense to process the data in the buffer. Thus, it is critical that the return code from the system call be checked prior to trying to process the buffer data. Proper return code checking also helps to isolate faults so they do not propagate. In our read example, if the data in the buffer were part of a service request to another unit within the system, an invalid request could be sent if the return code were not checked, thus propagating the error to the second unit. This technique can also detect software faults prior to system test completion and, thus, can reduce the failure rate of the software. 10.2.3.9 Parameter Validation This technique checks the input parameters to a function or method. It is often used in conjunction with asserts to handle parameters that are invalid or out of range. This technique can also detect software faults prior to system test completion and, thus, can reduce the failure rate of the software. Whenever inputs come from an external source, such as a database, a disk file, or a message from another unit, they should be checked for validity. Doing so helps prevent fault propagation. As an example of parameter validation, consider the oxygen sensor on an automobile. The oxygen sensor is used to determine whether the engine is running too lean (too little fuel relative to the amount of air) or too rich (too much fuel relative to the amount of air). A typical sensor will output between 0.2 volts and 0.8 volts, with 0.2 volts indicating a lean condition, 0.8 volts indicating a rich condition, and 0.45 volts an optimum air/fuel ratio. Assume there is a software module in the automobile’s electronic control unit (ECU) that samples the vehicle sensors and another module that uses sensor data to adjust the air/fuel mixture. Suppose the sensor breaks and outputs zero volts. The sensor sampling module would then pass a value of 0 volts to the air/fuel mixture module. If the air/fuel module does not make sure the sensor value is within range, it will interpret the 0 volts as a lean condition. It will then continue to try to richen the mixture, resulting in a significant degradation in fuel mileage and additional pollutants in the exhaust. A better solution is to validate the parameter, realize it is out of range, use a default setting for air/fuel ratio
to keep the engine running, and light the “check engine” light so the driver knows there is an issue. 10.2.3.10 Timeouts This technique places a timer on asynchronous operations that could potentially never return a result, or return a result past the point in time at which that result is useful. This keeps the software from waiting forever. Sometimes, it is appropriate to retry the operation; other times, it makes more sense to cancel the operation. Most readers will have seen this in their Web browser. At some point they have tried to access a website only to get the “Error—Cannot Display the Web Page” message. A refresh of the browser may then display the originally requested page. The original error message may have been due to a timeout in the Internet protocol used to access the Web page, particularly if the requested website was overloaded with traffic. 10.2.3.11 Runtime Consistency Checking of Data Structures Also known as audits, these checks make sure that data structures are intact, data values are in range, lists are properly linked, and multiple copies of data are all in sync. This technique detects inconsistencies while the software is running in the field, but may also uncover data corruption during the system test interval, thus reducing the number of software faults that make it into the production software. 10.2.3.12 Parity/CRC on Messages This technique puts parity/CRC on proprietary messages so that errors may be detected. This method detects transmission, message construction, and delivery errors. This prevents the recipient of a corrupted or erroneous message from taking an improper action based upon the contents of the corrupted message. Some protocols, such as Ethernet, require this and may actually have the CRC generation and checking built into the chips that drive the communication link. However, whenever a proprietary communication mechanism is employed, adding parity or CRC can significantly improve the detection of message corruption. 10.2.3.13 Message Validation This technique checks messages for validity before acting on them. Preferably, multiple fields are checked, such as a message
type and source ID, with each being more than a single bit. As an example, consider the Widget System from earlier. An interface card may issue a service request to a control board, and within that request may be the link number on the interface card. The control board software should validate that the link number is between 1 and the maximum number of links supported by the interface card. This would guarantee that if the control board uses the link number as an index into an array of link data, it will not index past the end of the array. It would also prevent an interface card software bug from propagating to the control board. 10.2.3.14 Tight Loop Detection One type of software bug is the tight loop or infinite loop, in which the software keeps iterating over the same set of program instructions without ever completing it. This can result in a system lockup or create a severe slowdown, since most of the CPU capacity is being used to execute the tight loop instructions. This technique monitors the program counter at periodic intervals and makes sure the software is not stuck in a tight loop. Typically, the checking is done during the handling of a high-priority timer interrupt to guarantee that the checking software gets to run. This technique also provides a way to break out of the loop if a tight loop is detected, typically by killing the offending process. 10.2.3.15 Rolling Updates This technique allows software updates to be applied without incurring any loss of service. This is accomplished by taking a spare unit offline, applying the update to it, and then restoring it to service. The service from an active unit is then transferred to the newly updated unit, and the active unit is then taken offline, updated, and restored to service. This continues automatically until all the units are upgraded. 10.2.3.16 Robust Software Updates This technique provides a software update framework that supports both a soak and backout mechanism so that updates may be tested before committing to the point of no return. In this scenario, the new software is loaded onto the machine and then put into a soak interval. During the soak, the new software is running but the old version is still available. This lets the user test the new soft-
ware for a while before completely committing to it. If there is a problem with the new software the user can execute a backout and revert to the previous version. Alternately, if the user is happy with the new version they can accept it and make it the new official version. 10.2.3.17 Collect Postmortem Data This technique collects data that can be used after the fact to determine what the root cause of a failure was. It includes things like operating system core dumps, crash dumps, and log entries. This technique captures failure data when application software fails but the operating system is still functioning. The “system activity recorder” hardware technique captures more profound cases, including information the operating system is likely to be incapable of capturing. An enhancement to this technique is analysis tools that are able to read the postmortem data and help with the rootcause analysis. 10.2.3.18 N+K Protection This technique is a redundancy strategy whereby multiple copies of a process/object are supported, along with a failover strategy to move work between them. It is similar to the N+K protection used for hardware except that the redundant entities are software instead of hardware. 10.2.3.19 Checksums over Critical Data This technique stores a checksum over blocks of critical data (such as program code, configuration information, etc). It provides a method for ensuring that the code was downloaded correctly, and may also be routinely checked to verify nothing has corrupted the program text. It may also be used to validate the contents of EPROM or FLASH. 10.2.3.20 Critical Failure-Mode Monitoring In this technique, system software monitors, alarms, and initiates a failover on all critical failure modes that could lead to system outage. For any given system, there are many possible failures that could lead to an outage, although they are definitely system specific. Some of the common types of failures that should be monitored include disk failures, communications link failures, clocking failures, etc.
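The "checksums over critical data" technique described above can be sketched in C as follows; the additive checksum and the config_block layout are assumptions chosen for illustration (many systems use a CRC instead).

```c
#include <stdint.h>
#include <stddef.h>

/* Simple 32-bit additive checksum; real systems often use a CRC instead. */
static uint32_t checksum32(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}

struct config_block {
    uint32_t version;
    uint32_t param_a;
    uint32_t param_b;
    uint32_t checksum;   /* checksum over all preceding fields */
};

/* Recompute and store the checksum after the block is written. */
void config_seal(struct config_block *cfg)
{
    cfg->checksum = checksum32((const uint8_t *)cfg,
                               offsetof(struct config_block, checksum));
}

/* Verify the block at boot time and during routine audits. */
int config_is_valid(const struct config_block *cfg)
{
    return cfg->checksum == checksum32((const uint8_t *)cfg,
                                       offsetof(struct config_block, checksum));
}
```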
10.2.3.21 Minimize Reboot Time The shorter the boot/reboot time, the more quickly a standby may be brought into service, even if it is a cold standby. Also, in the event of a total system outage (such as occurs when the power fails), the shorter the boot time, the shorter the outage. Thus, reducing boot and reboot time increases the availability of the system. System initialization should only include those things that are necessary to get the system to the point where it can provide service. Deferring nonessential initializations (such as initializing the audit system, for example) is one good way to help reduce boot/reboot times. 10.2.3.22 Parallel Reboots This technique boots multiple units in parallel. The parallel boots thus reduce the time necessary to restore the entire system following a catastrophic event such as a power failure. For example, consider a server farm for an Internet website. If something happens to the power to the entire building where the server farm is located, then the entire website will be down. When power is restored the servers could be brought up one at a time, but this would severely limit the capacity of the website as it was being brought up. Bringing all the servers up in parallel will minimize the duration during which the site is available at less than full capacity. 10.2.3.23 Power-on Self-Tests These are tests that are run each time the unit powers up. They provide a level of basic sanity checking and increase confidence that the unit will function properly once it is placed in service. If a unit is faulty, the fault may be detected more quickly, resulting in a quicker replacement. 10.2.3.24 Camp-on Diagnostics This technique provides a method to “camp on” (wait for) a component until it is not in use, and then run diagnostics on that component. This only works with components whose use is dependent upon the current load in the system. This technique provides a good alternative to taking components completely out of service in order to diagnose them. 10.2.3.25 Routine Diagnostics This technique runs diagnostics on standby units at a predefined regular interval. It is very good at detecting latent faults in standby
units before the standby unit is needed to recover from a fault in the active unit. This technique is becoming slightly less popular for units that include software, due to the trend toward having standby units actively running software to keep themselves up to date, as opposed to the previous paradigm of having the standby completely offline. In cases in which the standby is actively running software, this technique is of significantly less value. 10.2.3.26 Run-time Diagnostics These are diagnostics that are run on an active unit. They help provide coverage for faults that occur during normal operation. Since the automatic detection and recovery of faults is a key factor in maintaining high availability, this technique can add significantly to system availability. It can sometimes be challenging to perform comprehensive diagnostics without disrupting service; in many cases it is not possible to test a very large set of possible inputs. In these cases, it often makes sense to attempt to initiate a “dummy” service request and see if the unit being diagnosed is capable of honoring the dummy request. 10.2.3.27 Software Technique Example The code snippet in Figure 10.4 shows several of the above techniques in actual use. The example calculates an employee’s withholding taxes based on employee information, such as pay scale and hours worked, that is maintained in a central employee database. The function returns a FAIL or SUCCESS indication to let the calling routine know if it was successful in calculating the withhold taxes. The first thing that is done is to validate the input parameter, which in this case is the employee’s ID. Validation is done by using asserts—one that asserts that the ID is greater than the minimum allowed ID, and another that asserts it is below the maximum. If the asserts fail, execution will not continue here, but will resume in the assert handler, which typically will point out where the error occurred and provide debugging information to help isolate the root cause. After validating the employee ID, the database is read to obtain the employee information. Because it is possible for the database read to fail (for example, if there is no entry for this employee ID), the return code is checked. If the database read failed, then an audit of the database is requested in the hopes the audit can detect and correct the problem. Finally, if everything worked so far we get to the point where the actual withholding tax is calculated.
Figure 10.4. Code snippet showing software techniques.
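A minimal sketch along the lines described above might look like the following C code; it is an illustration rather than the book's actual listing, and the function, type, and constant names (calculate_withholding, db_read_employee, request_db_audit, the employee ID limits) are assumptions.

```c
#include <assert.h>

#define MIN_EMPLOYEE_ID 1
#define MAX_EMPLOYEE_ID 99999      /* illustrative limit */

enum result { FAIL = 0, SUCCESS = 1 };

struct employee_info {
    double pay_rate;
    double hours_worked;
    /* ... other fields ... */
};

/* Hypothetical services provided elsewhere in the system. */
extern int db_read_employee(int employee_id, struct employee_info *info);
extern void request_db_audit(void);
extern double compute_withholding(const struct employee_info *info);

enum result calculate_withholding(int employee_id, double *withholding)
{
    /* Parameter validation via asserts. A production HA system would
       typically use project-specific assert macros with defined recovery
       actions rather than the standard abort()-based assert(). */
    assert(employee_id >= MIN_EMPLOYEE_ID);
    assert(employee_id <= MAX_EMPLOYEE_ID);

    struct employee_info info;

    /* Return code checking: never process the buffer if the read failed. */
    if (db_read_employee(employee_id, &info) != 0) {
        request_db_audit();   /* ask the audit to detect and correct the problem */
        return FAIL;
    }

    *withholding = compute_withholding(&info);
    return SUCCESS;
}
```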
This example has shown how to include several of the software techniques in an actual part of a program. An example that combines all, or even most, of the listed techniques would necessarily be quite large, and will not be attempted here. However, the reader should be able to see how the other techniques could be employed in other situations.

10.2.4 Procedural Techniques
The procedural availability enhancement techniques improve availability by making the system easier and less error prone to repair and maintain. Any time a technician has to interact with the system, there is the potential for an outage-inducing procedural
error. The procedural availability enhancement techniques reduce this potential by addressing three primary areas:
1. Minimizing human interactions
2. Helping the humans
3. Minimizing the impact of human errors
The following sections enumerate the individual techniques that reduce the probability of procedure-related outages by addressing the above three areas.

10.2.4.1 Minimize Human Interactions
The primary ways of minimizing human interactions are to automate the procedures and to minimize the number of tasks and decisions a person has to make. The following paragraphs elaborate on each of these techniques.
Automate Procedures. This technique combines the tasks for a procedure into a single task that may be run automatically. This reduces the number of opportunities for mistakes, such as bad input parameters, and also completes the procedure more quickly. This is important, since often procedures are run on offline units, and speeding up the procedure reduces the system's simplex exposure time.
Minimize the Number of Tasks and Decision Points Per Procedure. Each additional task and decision point in a given procedure is another opportunity for a mistake. Making each procedure have a minimal set of tasks and decision points reduces this risk. It also speeds up the procedure, potentially resulting in less downtime or reduced simplex exposure (depending on the procedure).

10.2.4.2 "Help the Humans" Techniques
There are many ways to "help the humans" maintain and repair a system. Below are listed many of the best practices for helping the humans.
Create, Document, and Test All Procedures. This technique documents all the system procedures, and has each of them tested as part of system test. This results in procedures that are executable,
correct, and proven, resulting in fewer procedural errors. This includes documentation of all procedures needed for system operation, maintenance, and recovery. Make Procedures Simple. Every procedure should be as simple as possible, but no simpler. Things that may help simplify procedures are automation, minimizing the number of tasks per procedure, and making procedures similar to other procedures. Make Procedures Clear. This technique helps ensure that the technician understands the procedure. This includes use of simple, clear language, list formats, and white space. Things like step-bystep instructions and pointers to key things to watch for, such as the state of a particular LED, also help contribute to this technique. Make Procedures Intuitive. This may be accomplished through software, user interface, and physical design, labeling, and marking. It includes polarizing connectors; keying circuit-pack connectors; labeling; marking; using color codes, size, or shape to differentiate system elements such as fuses; and designing uniform circuit-pack craft interfaces (LEDs, switches, marking). All these are aimed at guiding the technician toward the proper action. Make Procedures Similar to Each Other. This technique builds on a catalog of standardized steps and reduces the procedural failure rate by letting technicians become more familiar with the procedures more quickly. Examples include common power-up and power-down procedures for FRUs (preferably hot plug-in), and common location of controls and indicators on FRUs. Provide a Progress Indicator. This technique provides a way to ensure that progress is being made. It is often implemented as a counter, or a group of counters, with each increment indicating a specific progress point in a progression. This technique is especially useful for long, processing-intensive items, such as initialization, for which it may not be obvious that progress is being made. Without a progress indicator, long latency items may make it appear as if nothing is happening, and an impatient technician may decide the system is hung up and attempt a restart. This technique reassures the technician and helps to avoid the impatient restart.
Automate Fault Detection and Automate Response Procedure Display. This technique detects faults automatically, and in those cases in which the system cannot completely recover from the fault (such as a hardware fault), displays the proper procedure to the technician. This results in faster repairs, thus reducing unavailability. Prominently Display Reminders. This includes display of outage escalation guidelines and contact information at customer sites, and reminders about routine maintenance, such as replacement of fan filters. Provide Training and Certification. This technique helps ensure that the technicians know how to perform procedures and what to do/who to call when things go wrong. Document Procedures Electronically. This technique provides all procedural documentation in electronic form. This makes it easier to ensure that the documentation is up to date and is available with the system (it is less likely to get lost than a manual), and also allows enhancements such as interactive help. Identify “Safe Points.” This technique clearly identifies points within the procedure at which it is safe to interrupt or stop the procedure temporarily. For example, each point in a multistep procedure at which the system is duplex might be a safe point. This allows technicians to leave the system in a known safe state when they have to respond to higher priority issues. 10.2.4.3 Minimize the Impact of Human Error Techniques The objective of the previous sections was to minimize the probability of human errors. Nonetheless, errors will eventually occur. Here are some best-practice techniques that will help reduce the impact of these errors. Provide Input-Checking Capability. This technique provides input checking to minimize entry of inappropriate data and commands by performing validity checks on the parameters and commands that the technicians enter during a procedure. It includes things like asking the “are you sure” question when the requested action may affect availability. Denying invalid requests when they
are asked, rather than part way through a procedure, can reduce the probability of the system ending up in an unknown state. Provide “Backout” or “Undo.” This technique allows technicians to undo a mistake without having to find/improvise a different procedure to accomplish the same effect. A complete backout plan should be identified at each major step in a multistep procedure. Concise Reporting of System State. All alarms should be collected and displayed together. Alarm messages must be clear, simple, and consistent. Alarm-reporting algorithms must make it easy to understand the system state under various conditions (e.g., report only on change of state). Clear communication of the system state is very important because this information becomes the basis for further action.
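To make the input-checking technique above concrete, here is a hedged C sketch of a maintenance command that validates its parameters before acting and asks the "are you sure" question for a service-affecting action; the command, slot limit, and the take_unit_out_of_service hook are assumptions.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_SLOT 16   /* illustrative system limit */

/* Hypothetical hook that actually removes the unit from service. */
extern void take_unit_out_of_service(int slot);

static bool confirm(const char *prompt)
{
    char answer[8] = {0};
    printf("%s This may affect service. Are you sure? (yes/no): ", prompt);
    if (fgets(answer, sizeof answer, stdin) == NULL)
        return false;
    return strncmp(answer, "yes", 3) == 0;
}

/* Validate the request completely before acting on any part of it. */
void cmd_remove_unit(int slot)
{
    if (slot < 1 || slot > MAX_SLOT) {
        printf("Invalid slot %d (valid range is 1-%d); request denied.\n",
               slot, MAX_SLOT);
        return;
    }
    if (!confirm("Removing an in-service unit."))
        return;
    take_unit_out_of_service(slot);
}
```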
10.3 MODELING TO VERIFY FEASIBILITY
To verify the feasibility of meeting the availability target with the proposed system architecture, an availability model is created with reasonable values for all parameters (detailed in Chapter 5 and Chapter 8). If there is a gap between the downtime predicted by this model and the availability target, then one can test the sensitivity of the prediction to parameter and model changes to identify the most cost-effective way to reach the availability target. A sample sensitivity analysis for the software failure rate is shown in Figure 10.5. When refining the availability architecture, one typically considers, among other things:
• Distributing/arranging functionality to minimize the size of a failure group (e.g., minimize single points of failure)
• Supporting finer-grained software recovery options. Rather than requiring an entire processor, FRU, or network element to be rebooted to clear a software problem, enable restartable processes, tasks, or transactions.
• Adding additional redundancy
• Reducing failure rate targets
• Shortening recovery time targets
• Improving coverage factors
Figure 10.5. Sample software failure rate sensitivity chart. (Horizontal axis: control board software failure rate, in failures per year.)
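The kind of sensitivity sweep illustrated in Figure 10.5 can be sketched with a deliberately simplified model, shown below; the coverage factor and outage durations are placeholder values, and a real analysis would use the Markov models detailed in Chapters 5 and 8.

```c
#include <stdio.h>

/* Extremely simplified annualized downtime estimate:
 *   downtime/year = failure_rate * [c * t_covered + (1 - c) * t_uncovered]
 * where c is the coverage factor. All numbers below are placeholders. */
int main(void)
{
    const double coverage        = 0.95;  /* fraction of failures recovered automatically */
    const double t_covered_min   = 0.5;   /* outage minutes for a covered failure */
    const double t_uncovered_min = 30.0;  /* outage minutes for an uncovered failure */

    printf("SW failures/yr   downtime (min/yr)\n");
    for (double failures_per_year = 1.0; failures_per_year <= 10.0; failures_per_year += 1.0) {
        double avg_outage = coverage * t_covered_min +
                            (1.0 - coverage) * t_uncovered_min;
        double downtime = failures_per_year * avg_outage;
        printf("%14.1f   %17.2f\n", failures_per_year, downtime);
    }
    return 0;
}
```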
Once an architecture is reached that meets the availability target when feasible parameter values are assumed, those assumed parameters can then be baselined into the product's requirements. Comprehensive reliability requirements would explicitly specify:
• Product-attributable unplanned downtime for stand-alone element and multielement configurations (e.g., N+K, geo-redundant), if applicable
• Maximum hardware failure rate (typically expressed in FITs)
• Maximum software failure (outage) rate
• Hardware and software coverage factors
• Software recovery times, such as system, board, and other appropriate recovery times
• If applicable, maximum switchover time and switchover success probabilities
Through the development process, engineers periodically verify that hardware and software failure rates and reliability parameters are consistent with budgets (detailed in Chapter 7). If gaps against those budgets appear likely, then appropriate mitigations are considered, such as:
• Additional testing for software modules with elevated failure rates
• Appropriate hardware/thermal changes to address elevated hardware failure rates
• Further optimizations to shorten latencies
• Additional fault insertion/adversarial testing to improve coverage factors
10.4 TESTING
It is essential to thoroughly test failure detection, isolation, and recovery mechanisms to assure that they work reliably as designed and that no single failure in a highly available system causes an outage. Robustness testing (sometimes called negative, adversarial, breakage, rainy-day, fault insertion, or similar) confronts the system with plausible failures to assure that high-availability mechanisms work as designed. Appropriate endurance testing (sometimes called stability testing), overload testing, and stress testing are also essential to assure that the system operates correctly under extended mixed and varying loads (including overload). Planning and analyzing robustness, endurance, overload, and stress testing is a broad and important topic, but is beyond the scope of this book.
10.5 UPDATE AVAILABILITY PREDICTION
Best practice is to revise availability predictions during the second half of system testing based on actual lab testing results. Software reliability growth modeling offers insights into how effectively system testing is finding residual defects and, thus, can be used to plan how much system testing is likely to be required to achieve the system’s reliability/availability goals. As system testing completes, a “final” system availability prediction can be made from final laboratory test results.
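One commonly cited software reliability growth model, offered here only as a hedged illustration (the text does not prescribe a specific model in this passage), is the Goel-Okumoto model, in which the expected cumulative number of defects found by test time $t$ is

$$\mu(t) = a\left(1 - e^{-bt}\right)$$

where $a$ is the expected total number of defects and $b$ is the per-defect detection rate. Fitting $a$ and $b$ to the defect-discovery curve during system test gives an estimate of the residual defects, $a - \mu(t)$, and hence of how much additional testing may be needed.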
10.6 PERIODIC FIELD VALIDATION AND MODEL UPDATE
Ideally, field data should be analyzed at the start of each major release to characterize how the most recent release(s) are performing in
the field (detailed in Chapter 6). This analysis should show actual field availability, including the mix of hardware-attributed and software-attributed failure rates and downtime. Parameter estimates from field data should be compared with parameter estimates from laboratory data, as should the estimated system availability, and adjustments to the modeling input values should be considered. Chapter 9, Section 9.2 discusses how close predictions are expected to be to actual field data; if predictions are too far off, revising the availability model should be considered.
10.7 BUILDING AN AVAILABILITY ROAD MAP
If a gap relative to the system's availability requirement is expected in the current release, then increase investment in reliability-/availability-improving features (including testing) and, if necessary, build a road map to close the gap in a future release. Key elements of a reliability road map are:
1. Explicit specification of the "ultimate" quantitative system availability goal(s) (often 99.999%) and definition (e.g., product-attributable total plus prorated partial service availability for one or more specific system configurations)
2. Availability estimate of the current release and previous releases, when applicable
3. Per-release availability targets. As a practical matter, one often starts by setting decreasing linear downtime targets from the availability estimated for the current release to the target downtime in the target release. Actual reliability growth often follows an exponential curve rather than a linear trajectory, but striving to have less downtime per release than the linear targets is generally an acceptable starting point.
4. Specific availability-improving features enumerated for at least some releases
5. Availability predictions for future releases, considering the availability impact of planned availability-improving features, other product features (recognizing that complex, new features often initially increase failure rates and may slow failure detection and recovery), and the expected reliability growth that occurs as defects are found, debugged, and removed from the product
6. Per-release availability budgets to plausibly close the gap between current release performance and specific availability goals in the target release
As field data for additional releases become available, failure rates and other reliability parameters are recalibrated, predictions for future releases are recalculated, and business leaders decide if any changes in the reliability road map are appropriate. A reliability road map can be visualized in a chart like that in Figure 10.6, together with a list of the planned reliability/availability-improving features for each release. Note that downtime may increase from release to release if major new features are introduced (thus adding lots of new code with residual defects), and actual field performance will undoubtedly be somewhat different from predicted values. Nevertheless, by taking a multirelease view of reliability growth, it is easier both to plan appropriately and to set appropriate expectations with customers.
10.8 RELIABILITY REPORT
A written reliability report is a useful mechanism for aligning expectations across a cross-functional project team with sales and
Figure 10.6. Sample reliability road map. (Chart: annualized unplanned downtime per product release, showing actual downtime, predicted downtime, linear targets, and the 99.99% and 99.999% availability levels.)
marketing, and even customers. The report collects appropriate high-level information from reliability activities to answer the following questions:
1. What are the high-level reliability goals and targets of this product or solution?
2. What architecture, design, and features are implemented to assure that those goals are met?
3. What modeling assures that those goals are met?
4. What laboratory and field data suggest or demonstrate that reliability goals are met?
Typically, a written reliability report will include the following sections:
• Architecture overview—establishes basic context and background information for the product or solution
• Reliability and availability features—reviews the redundancy architecture (e.g., N+K, active–standby) and the set of availability enhancing techniques that are included in the product or solution
• Reliability requirements
• Unplanned downtime modeling and results
• Appendices for definitions, references, and perhaps even Markov transition diagrams
Additional information can be added, such as failure mode and effect analysis, or any other reliability-, availability-, or stability-related information that is relevant to the particular product or solution. Appendix A gives an outline of a sample reliability report that can be used as a starting point.
CHAPTER 11
SUMMARY
We have described in detail the system reliability activities that take place in the various phases of product design and development. We have discussed requirements, reliability improving techniques, modeling, testing and the measurements that should be taken during testing, along with reliability road-mapping and field validation. We have reviewed how customers measure service availability, and why measured and perceived availability often varies between customers. Each of these topics has been covered in a practical way that enables the reader to quickly understand the topic and begin applying the concepts to their own products. Many examples and anecdotes were given to help the reader grasp the concepts, along with detailed tutorials on some of the more math-oriented subjects. The reader should now be able to model their existing products to understand the availability their customers are likely to experience. They can then analyze various ways of improving the system, and by using the methods presented here they can choose the improvements that offer the most for their development expense. Finally, they can proactively manage a product’s availability over its life cycle, rather than the more painful alternative of addressing system reliability one critical customer problem at a time.
APPENDIX A
SYSTEM RELIABILITY REPORT OUTLINE
This appendix contains the outline for a reliability report. It explains what should be included in each section, and gives some examples. Following the outline provided here ensures that all aspects of a system's reliability have been appropriately considered, and that the results can be clearly presented to the desired audience.
Note: Text within angle brackets (< >) summarizes the intention of each section and would be deleted from an actual reliability report.
<Title identifies product(s) covered by this report>
Reliability Report for the Widget System
Version 1.0, January, 2009
Contact: John Smith, Reliability Engineer, Widgets'r'Us (John.[email protected]; +1-212-555-1234).
1 EXECUTIVE SUMMARY
1.1 Architectural Overview
1.1.1 Hardware Platform
1.1.2 Software Architecture
1.2 Reliability and Availability Features
1.2.1 Reliability and Availability Features Provided in the System Hardware
1.2.2 Reliability and Availability Features Provided in the System Software
2 RELIABILITY REQUIREMENTS
2.1 Availability Objective
2.2 Downtime Requirements
2.3 Hardware Failure Rate Requirements
2.4 Service Life Requirements
3 UNPLANNED DOWNTIME MODEL AND RESULTS
3.1 Unplanned Downtime Model Methodology
3.2 Reliability Block Diagrams
3.3 Standard Modeling Assumptions
3.4 Product-Specific Assumptions, Predictions, and Field or Test Data
3.4.1 Hardware Failure Rates
3.4.2 Software Failure Rates
3.4.3 Failover and Software Recovery Times
in the laboratory. The estimation method and values used should be explained here.>
3.4.4 Coverage Factors
3.5 System Stability Testing
3.6 Unplanned Downtime Model Results
ANNEX A—RELIABILITY DEFINITIONS
ANNEX B—REFERENCES
ANNEX C—MARKOV MODEL STATE-TRANSITION DIAGRAMS
APPENDIX B
RELIABILITY AND AVAILABILITY THEORY
Reliability and availability evaluation of a system (hardware and software) can help answer questions like "How reliable will the system be during its operating life?" and "What is the probability that the system will be operating as compared to out of service?" System failures occur in a random manner, and failure phenomena can be described in probabilistic terms. Fundamental reliability and availability evaluations therefore depend on probability theory. This appendix describes the fundamental concepts and definitions of reliability and availability of a system.

B.1 RELIABILITY AND AVAILABILITY DEFINITIONS
Reliability is defined as “the probability of a device performing its purpose adequately for the period of time intended under the operating conditions encountered” [Bagowsky61]. The probability is the most significant index of reliability but there are many parameters used and calculated. The term reliability is frequently used as a generic term describing the other indices. These indices are related to each other and there is no single all-purpose reliability formula or technique to cover the evaluation. The following are examples of these other indices: 앫 The expected number of failures that will occur in a specific period of time 앫 The average time between failures 앫 The expected loss of service capacity due to failure 앫 The average outage duration or downtime of a system 앫 The steady-state availability of a system Practical System Reliability. By Eric Bauer, Xuemei Zhang, and Douglas A. Kimber Copyright © 2009 the Institute of Electrical and Electronics Engineers, Inc.
221
bappb.qxd
2/8/2009
222
5:55 PM
Page 222
RELIABILITY AND AVAILABILITY THEORY
The approaches taken and the resulting formula should always be connected with an understanding of the assumptions made in the area of reliability evaluation. Attention must be paid to the validation of the reliability analysis and prediction to avoid significant errors or omissions. B.1.1
Reliability
Mathematically, reliability, often denoted as R(t), is the probability that a system will be successfully operating during the mission time t: tⱖ0
R(t) = P(T > t),
(1)
where T is a random variable denoting the time to failure. In other words, reliability is the probability that the value of the random variable T is greater than the mission time t. Probability of failure, F(t), is defined as the probability that the system will fail by time t: F(t) = P(T ⱕ t),
tⱖ0
(2)
In other words, F(t) is the failure distribution function, which is often called the cumulative failure distribution function. The reliability function is also known as the survival function. Hence, R(t) = 1 – F(t)
(3)
The derivative of F(t), therefore, gives a function that is equivalent to the probability density function, and this is called the failure density function, f(t), where dF(t) dR(t) f(t) = ᎏ = – ᎏ dt dt
(4)
Or, if we integrate both sides of Equation (4),
冕 f(t)dt t
F(t) =
(5)
0
and
冕 f(t)dt = 冕
⬁
t
R(t) = 1 –
0
t
f(t)dt
(6)
bappb.qxd
2/8/2009
5:55 PM
Page 223
B.1
RELIABILITY AND AVAILABILITY DEFINITIONS
223
In the case of discrete random variables, the integrals in Equations (5) and (6) can be replaced by summations. A hypothetical failure density function is shown in Figure B.1, where the values of F(t) and R(t) are illustrated by the two appropriately shaded areas. F(t) and R(t) are the areas under their respective portions of the curve. Some readers may find it more intuitive to start with the failure density function shown in Figure B.1 and go from there. The “bell curve” of the normal distribution is a probability density function that most readers are probably familiar with, although it typically does not have time as the horizontal axis. The failure density function shows the probability of the system failing at any given point in time. Because the sum of all the probabilities must be 1 (or 100%), we know the area under the curve must be 1. The probability of failing by time t is thus the sum of the probabilities of failing from t = 0 until time t, which is the integral of f(t) evaluated between 0 and t. Reliability is the probability that the system did not fail by time t and is, thus, the remainder of the area under the curve, or the area from time t to infinity. This is the same as the integral of f(t) evaluated from t to infinity. B.1.2
System Mean Time to Failure (MTTF)
Mean time to failure (MTTF) is the expected (average) time that the system is likely to operate successfully before a failure occurs.
Figure B.1. Hypothetical failure density function. F(t) = probability of failure by time t, R(t) = probability of survival by time t.
bappb.qxd
2/8/2009
224
5:55 PM
Page 224
RELIABILITY AND AVAILABILITY THEORY
By definition, the mean or expected value of a random variable is the integral from negative infinity to infinity of the product of the random variable and its probability density function. Thus, to calculate mean time to failure we can use Equation (7), where f(t) is the probability density function and t is time. We can limit the integral to values of t that are zero or greater, since no failures can occur prior to starting the system at time t = 0.
冕
⬁
MTTF =
tf(t)dt
(7)
0
Substituting for f(t) using Equation (4), dR(t) f(t) = – ᎏ dt Equation (7) then becomes
冕
⬁
MTTF = –
tdR(t)dt
0
冨
⬁
冕
(8)
⬁
= –tR(t) 0 +
R(t)dt
0
The first term in Equation (8) equals zero at both limits. It is zero when t is zero precisely because t is zero, and it is zero when t is infinite because the probability of the component continuing to work (i.e., surviving) forever is zero. This leaves the MTTF function as
冕
⬁
MTTF =
R(t)dt
(9)
0
B.1.3
Failure Rate Function (or Hazard Rate Function)
In terms of failure, the hazard rate is a measure of the rate at which failures occur. It is defined as the probability that a failure occurs in a time interval [t1, t2], given that no failure has occurred prior to t1, the beginning of the interval. The probability that a system fails in a given time interval [t1, t2] can be expressed in terms of the reliability function as
bappb.qxd
2/8/2009
5:55 PM
Page 225
B.1
P(t1 < T ⱕ t2) =
冕
t2
冕
⬁
f(t)dt =
冕
⬁
f(t)dt –
t1
t1
225
RELIABILITY AND AVAILABILITY DEFINITIONS
t2
f(t)dt = R(t1) – R(t2)
where f(t) is again the failure density function. Thus, the failure rate can be derived as R(t1) – R(t2) ᎏᎏ (t2 – t1)R(t1)
(10)
If we redefine the interval as [t, t + ⌬t], Equation (10) becomes R(t) – R(t + ⌬t) ᎏᎏ ⌬tR(t) The hazard function is defined as the limit of the failure rate as the interval approaches zero. Thus, the hazard function h(t) is the instantaneous failure rate, and is defined by R(t) – R(t + ⌬t) h(t) = lim ᎏᎏ ⌬t씮0 ⌬tR(t)
冤
1 –dR(t) =ᎏ ᎏ R(t) dt
冥
(11)
–dR(t) =ᎏ R(t)dt Integrating both sides and noticing the right side is the definition of the natural logarithm, ln, of R(t) yields
冕 h(t) = –冕 t
t
0
0
dR(t) ᎏ R(t)dt
冕 h(t) = –ln[R(t)] t
(12)
0
冤 冕 h(t)dt冥 t
R(t) = exp –
0
For the special case where h(t) is a constant and independent of time, Equation (12) simplifies to R(t) = e–ht
(13)
bappb.qxd
2/8/2009
226
5:55 PM
Page 226
RELIABILITY AND AVAILABILITY THEORY
This special case is known as the exponential failure distribution. It is customary in this case to use to represent the constant failure rate, yielding the equation R(t) = e–t
(14)
Figure B.2 shows the hazard rate curve, also known as a bathtub curve (discussed in more detail in Chapter 5, Section 5.2.2.1.1), which characterizes many physical components. B.1.4
Availability
Reliability is a measure of successful system operation over a period of time or during a mission. During the mission time, no failure is allowed. Availability is a measure that allows for a system to be repaired when failures occur. Availability is defined as the probability that the system is in normal operation. Availability (A) is a measure of successful operation for repairable systems. Mathematically, System uptime A = ᎏᎏᎏᎏᎏ System uptime + System downtime Or, because the system is “up” between failures, MTTF A = ᎏᎏ MTTF + MTTR where MTTR stands for mean time to repair.
Figure B.2. Bathtub curve.
(15)
bappb.qxd
2/8/2009
5:55 PM
Page 227
B.1
RELIABILITY AND AVAILABILITY DEFINITIONS
227
Another frequently used term is mean time between failures (MTBF). Like MTTF and MTTR, MTBF is an expected value of the random variable time between failures. Mathematically, MTBF = MTTF + MTTR If MTTR can be reduced, availability will increase. A system in which failures are rapidly diagnosed and recovered is more desirable than a system that has a lower failure rate but the failures take a longer time to be detected, isolated, and recovered. Figure B.3 shows pictorially the relationship between MTBF, MTTR, and MTTF. From the figure it is easy to see that MTBF is the sum of MTTF and MTTR. B.1.5
Downtime
Downtime is an index associated with the service unavailability. Downtime is typically measured in minutes per service year: Downtime = 525,960 × (1 – Availability) min/yr
(16)
where 525,960 is number of minutes in a year. Other indices are also taken into consideration during reliability evaluations; the above are the major ones.
Figure B.3. MTTR, MTBF, and MTTF.
bappb.qxd
2/8/2009
5:55 PM
228
Page 228
RELIABILITY AND AVAILABILITY THEORY
B.2 PROBABILITY DISTRIBUTIONS IN RELIABILITY EVALUATION This section presents some of the common distribution functions and their related hazard functions that have applications in reliability evaluation. A number of standard distribution functions are widely used: binomial, Poisson, normal, lognormal, exponential, gamma, Weibull, and Rayleigh. Textbooks such as [Bagowsky61], [Shooman68], [Pukite98], and [Crowder 94] provide detailed documentation on probability models and how statistical analyses and inferences about those models are developed. B.2.1
Binomial Distribution
Consider an example of tossing a coin. There are two independent outcomes: heads or tails. The probability of getting one of the two outcomes at each time the coin is tossed is identical (we assume the coin is “fair”). Let us assume that the probability of getting a head is p, and the probability of getting a tail is q. Since there are only two outcomes, we have p + q = 1. For a given number of trials, say n, the probability of getting all of them as heads is pn, the probability of getting (n – 1) heads and one tail is npn–1q, the probability of getting (n – 2) heads and two tails is [n(n – 1)/2!]pn–2q2. This goes on, and the probability of getting all tails is qn. Hence, the general probability of getting all the possible outcomes can be summarized as n(n – 1) n(n – 1) . . . n(n – r + 1) pn + npn–1q + ᎏ pn–2q2 + ᎏᎏᎏ pn–rqr 2! r! + . . . + qn = (p + q)n In reliability evaluation, the outcomes can be modified to success or failure. Consider n trials with the outcome of r successes and (n – r) failures. The probability of this outcome can be evaluated as follows: Pr = C nr prq(n–r) n! = ᎏᎏ prqn–r r!(n – r)!
(17)
where C nr denotes the combination of r successes from total n trials.
bappb.qxd
2/8/2009
5:55 PM
Page 229
B.2
PROBABILITY DISTRIBUTIONS IN RELIABILITY EVALUATION
229
For all possible outcomes, we have n
(p + q)n = 冱 C nr prqn–r = 1 r=0
Example: For a given manufacturing process, it is known that the product defect rate is 1%. If an average customer purchases 100 of these products selected at random, what is the probability that he/she receives two or less defective products? In this example, n = 100, p = 0.01, q = 0.99, and r = 0,1,2, therefore: Pr(2 or less defects) = Pr(2 defects) + Pr(1 defect) + Pr(0 defects) 2
r (0.01)r(0.99)100–r = 冱 C100 r=0
= 0.1849 + 0.3697 + 0.3660 = 0.9206 It can be proven that the mean and the variance for binomial distribution are E(X) = np and V(X) = npq
B.2.2
Poisson Distribution
Like binomial distributions, Poisson distributions are used to describe discrete random events. The major difference between the two is that in a Poisson distribution, only the occurrence of an event is counted, and its nonoccurrence is not counted, whereas a binomial distribution counts both the occurrence and the nonoccurrence of events. Examples of a Poisson distribution are: 앫 The number of people coming to a bus stop 앫 The number of failures of a system 앫 The number of calls in a given period Assume that the average failure rate of a system is and the number of failures by time t is x. Then the probability of having x failures by time t is given by
bappb.qxd
2/8/2009
230
5:55 PM
Page 230
RELIABILITY AND AVAILABILITY THEORY
(t)xe–t Pr(X = x) = ᎏ x!
for x = 0, 1, 2 . . .
(18)
The mean and the variance of a Poisson distribution are given by E(X) = t and V(X) = t B.2.2.1 Relationship with the Binomial Distribution It can be shown that for large sample size n (n Ⰷ r) and small p (p ⱕ 0.05), the Poisson and binomial distributions are identical. That is, np = t and r=x B.2.3
Exponential Distribution
The exponential distribution is one of the most widely used probability distributions in reliability engineering. The most important characteristic of the exponential distribution is that the hazard rate is constant, in which case it is defined as the failure rate. The failure density function is given by f(t) = e–t
t > 0, f(t) 0 otherwise
(19)
and the reliability function is R(t) = e–t Figure B.4 shows the exponential reliability functions. It can be proven that the mean and variance of the exponential distribution are: E(T) = 1/
bappb.qxd
2/8/2009
5:55 PM
Page 231
B.2
PROBABILITY DISTRIBUTIONS IN RELIABILITY EVALUATION
231
Figure B.4. Exponential reliability functions.
and V(t) = 1/2 We can see that the mean time to failure (MTTF) for the exponential distribution is the reciprocal of the failure rate . Another property of the exponential distribution is known as the memoryless property, that is, the conditional reliability function for a component that has survived to time s is identical to that of a new component. Mathematically, we have Pr(T ⱖ t) = Pr(T ⱖ t + s/T ⱖ s),
for t > 0, s > 0
The exponential distribution is used extensively in the analysis of repairable systems in which components cycle between upstates and downstates. For example, in Markov models, the memoryless property is the fundamental assumption that characterizes failure and recovery distributions. B.2.4
Weibull Distribution
The exponential distribution is limited in its application due to the memoryless property. The Weibull distribution, on the other hand, is a generalization of the exponential distribution. It has a very important property—the distribution has no specific characteristic shape. In fact, depending on what the values of the parameters are in its reliability function, it can be shaped to represent many different distributions and it can be shaped to fit to experimental data that cannot be characterized as a particular distribution. This makes the Weibull (and a few other distribution func-
bappb.qxd
2/8/2009
232
5:55 PM
Page 232
RELIABILITY AND AVAILABILITY THEORY
tions such as gamma and lognormal, which will be discussed later) a very important function in experimental data analysis. The three-parameter probability density function of the Weibull distribution is given by
(t)–1 –(t/) f(t) = ᎏ e 
for t ⱖ 0
(20)
where is known as the scale parameter and  is the shape parameter. The reliability function is 
for t > 0,  > 0, > 0
R(t) = e–(t/)
The hazard function is
(t)–1 (t) = ᎏ 
for t > 0,  > 0, > 0
The mean and variance of the Weibull distribution are
冢
冣
1 E(X) = ⌫ ᎏ + 1  and
冤冢
冣
冢
2 1 V(X) = 2 ⌫ 1 + ᎏ – ⌫2 1 + ᎏ  
冣冥
where ⌫ represents the gamma function. There are two special cases of the Weibull distribution that deserve mention. When  = 1, the failure density function and the hazard function reduce to 1 f(t) = ᎏ e–(t/) and 1 (t) = ᎏ The failure density is identical to the exponential distribution.
bappb.qxd
2/8/2009
5:55 PM
Page 233
B.2
PROBABILITY DISTRIBUTIONS IN RELIABILITY EVALUATION
233
When  = 2, the failure density function and the hazard function reduce to 2t –(t2/2 f(t) = ᎏ e ) 2 and 2t (t) = ᎏ 2 The failure density function is identical to the Rayleigh distribution which is discussed next. It can be shown that
 < 1 represents a decreasing hazard rate of the burn-in period  = 1 represents a constant hazard rate of the normal life period  > 1 represents an increasing hazard rate of the wear-out period So, in this sense, the hazard rate function of the Weibull distribution can be connected to the bathtub curve we discussed in the main text. B.2.5
Rayleigh Distribution
The Rayleigh distribution is a special case of the Weibull distribution and has only one parameter. Besides its use in reliability engineering, the Rayleigh distribution is used to analyze noise problems associated with communications systems. It has also been used in some software reliability growth models. The failure density function is 2t –(t2/2) f(t) = ᎏ e 2
(21)
A more general form of the Raleigh density function is f(t) = kte–(kt
2/2)
where k is the only parameter. When k = 2/2, the Rayleigh distribution is equivalent to the Weibull distribution for  = 2. The Rayleigh distribution is a singleparameter function and k is both the scale and shape parameter.
bappb.qxd
2/8/2009
234
5:55 PM
Page 234
RELIABILITY AND AVAILABILITY THEORY
The Rayleigh reliability function is 2/2)
R(t) = e–(kt and the hazard function is
(t) = kt which is a linearly increasing hazard rate with time. This characteristic gives the Rayleigh distribution its importance in reliability evaluation. B.2.6
The Gamma Distribution
Similar to the Weibull distribution, the gamma distribution has a shape parameter ( is conventionally used for the gamma distribution shape parameter) and a scale parameter (␣ for the gamma distribution). By varying these parameters, the gamma distribution can be used to fit a wide range of experimental data. The failure density function is given by t–1 –(t/␣) f(t) = ᎏ e ␣⌫()
for t ⱖ 0, ␣ > 0,  > 0
(22)
and ⌫() is defined as ⌫() =
冕
⬁
t␥–1e–t dt
0
Note that for integer values of ␥, ⌫() reduces to ⌫() = (␥ – 1)! The reliability function is given by
冕
⬁
R(t) =
t
t–1 –(t/␣) ᎏ e dt ␣⌫()
There are two special case of gamma distribution. They are when  = 1 and when  is an integer. When  = 1 the failure density reduces to 1 f(t) = ᎏ e–(t/␣) ␣ Again, this is identical to the exponential distribution.
bappb.qxd
2/8/2009
5:55 PM
Page 235
B.2
PROBABILITY DISTRIBUTIONS IN RELIABILITY EVALUATION
235
When  is an integer, the failure density reduces to t–1 f(t) = ᎏᎏ e–(t/␣) ␣( – 1)! This density function is known as the special Erlangian distribution, which can be shown as –1 t R(t) = e–(t/␣) 冱 ᎏ i=0 
冢 冣
i
1 ᎏ j!
It can be shown that the mean and variance for the gamma distribution function is: E(t) = ␣ and V(t) = ␣2 B.2.7
The Normal Distribution
The normal distribution, also known as the Gaussian distribution, is probably the most important and widely used distribution in statistics and probability. It is of less significance in the reliability field. In the reliability field, it has applications in measurements of product susceptibility and external stress. Taking the wellknown bell shape, the normal distribution is perfectly symmetrical about its mean and the spread is measured by its variance. The failure density function is given by 1 2 f(t) = ᎏ e–1/2[(t–)/] 兹2 苶 苶
(23)
where is the mean value and is the standard deviation. The larger is, the flatter the distribution. The reliability function is
冕
t
R(t) = 1 –
–⬁
1 2 ᎏ e–1/2[(s–)/] ds 兹2 苶 苶
The mean value and the variance of the normal distribution are given by
bappb.qxd
2/8/2009
236
5:55 PM
Page 236
RELIABILITY AND AVAILABILITY THEORY
E(t) = and V(t) = 2 B.2.8
The Lognormal Distribution
Like the normal distribution, the lognormal distribution also has two parameters. It has not been considered as an important distribution to model component lifetimes, but it can be a good fit to the distribution of component repair times in modeling repairable systems. The density function is given by 1 2 2 f(t) = ᎏ e–(ln t–) /2 t兹2 苶 苶
(24)
where and are the parameters in the distribution function, but they are not the mean and variance of the distribution. Instead, they are the mean and variance of the natural logarithm of the random variable. is a shape parameter and is a scale parameter. The mean and variance of the distribution are 2/2)
E(t) = e+( and 2
2
V(t) = e2+ [e – 1] The cumulative distribution function for the lognormal distribution is 1 冕ᎏ e s兹2 苶 苶 t
F(t) =
–(ln s–)2/22
ds
(25)
0
and this can be related to the standard normal derivate Z by ln t – F(t) = P[T ⱕ t] = P[ln T ⱕln t] = P Z ⱕ ᎏ
冤
冥
bappb.qxd
2/8/2009
5:55 PM
Page 237
B.3
ESTIMATION OF CONFIDENCE INTERVALS
237
Therefore, the reliability function is given by ln t – R(t) = p Z > ᎏ
冤
冥
(26)
and the hazard function is ln t – ᎏᎏ f(t) (t) = ᎏ = ᎏᎏ R(t) tR(t)
冢
B.2.9
冣
(27)
Summary and Conclusions
This appendix has presented the most important probability distributions that are likely to be encountered in reliability evaluations. Some readers might be familiar with the concepts and distributions described here. In this case, this material can be used as a reference. For those who have not been previously exposed to this area, it is intended to provide some basic understanding of the fundamental distributions.
B.3
ESTIMATION OF CONFIDENCE INTERVALS
In estimation theory, it is assumed that the desired information is embedded in a noisy signal. Noise adds uncertainty and if there were no uncertainty then there would be no need for estimation. An estimator attempts to approximate the unknown parameters using the measurements. This is known as parametric estimation. On the other hand, methods that analyze the observed data and arrive at an estimate without assuming any underlying parametric function are called nonparametric estimation. In reliability engineering, the parametric approach is widely used since the physical failure processes can be well captured by the parametric distributions. Throughout this book, we discussed obtaining the point estimators for the reliability parameters. However, getting a point estimator is typically not sufficient for understanding the uncertainties or confidence levels of the estimation. Confidence interval estimation addresses this problem by associating the probability of covering the true value with upper and lower bounds. In this
bappb.qxd
2/8/2009
238
5:55 PM
Page 238
RELIABILITY AND AVAILABILITY THEORY
section, we will focus on the development of confidence intervals for two important reliability metrics: failure rate and unavailability. B.3.1
Confidence Intervals for Failure Rates
We have discussed using probability distributions to model the random variable of time to failure. Once some failure data are recorded after a certain time of operation, they can be used to obtain the estimates of the parameters in the distributions. In this section, we use exponential distribution as an example to explain how to obtain the upper and lower bounds of the failure rate parameter (failure rate is in the exponential case) for a given confidence level. Assume that we arrived at an estimate for , say ˆ , which is used as an estimated value for the true failure rate. Then we need to calculate confidence bounds or a confidence interval, say [L, U] for the failure rate, where L is the lower bound and U is the upper bound. The confidence intervals associate the point estimator with the error or confidence level. For example, having interval estimators [L, U] with a given probability 1 – ␣ means that with a 100(1 – ␣)% probability, the true failure rate lies in between L and U. Here, L and U will be called 100(1 – ␣)% confidence limits. Let us derive the confidence intervals for the failure rate using the exponential distribution as an example. We begin by obtaining a point estimate—the maximum likelihood estimator (MLE) of the failure rate . Assume that we observed n failures and xi denotes the time when the ith failure occurred. Let X1, X2, . . . , Xn be a random sample from the exponential distribution with pdf f(x; ) = e–x
x > 0, > 0
The joint pdf of X1, X2, . . . , Xn is given by n
L(X, ) = ne–⌺i=1xi
(28)
Function L(X, ) is called the likelihood function, which is the function of the unknown parameter and the real data, xi and n in this case. The parameter value that maximizes the likelihood func-
bappb.qxd
2/8/2009
5:55 PM
Page 239
B.3
ESTIMATION OF CONFIDENCE INTERVALS
239
tion is called the maximum likelihood estimator. The MLE can be interpreted as the parameter value that is most likely to explain the dataset. The logarithm of the likelihood function is called the log-likelihood function. The parameter value that maximizes the log-likelihood function will maximize the likelihood function. The log-likelihood function is n
ln L(X, ) = n ln – 冱 xi
(29)
i=1
The function ln(L) can be maximized by setting the first derivative of ln L with respect to , equal to zero, and solving the resulting equation for . Therefore, n ⭸ ln L n ᎏ = ᎏ – 冱 xi = 0 ⭸ i=1
This implies that n ˆ = ᎏ n 冱 xi
(30)
i=1
The observed value of ˆ is the maximum likelihood estimator of , that is, the total number of failures divided by the total operating time. It can be proven that 2n(/ˆ ) = 2T follows a chisquared (2) distribution. T is the total accrued time on all units. Knowing the distribution of 2T allows us to obtain the confidence limits on the parameters as follows: 2 2 P[ 1–( ␣/2),2n < 2T < (␣/2),2n] = 1 – ␣
(31)
or, equivalently, that 2 1–( 2(␣/2),2n ␣/2),2n P ᎏᎏ < 2T < ᎏ =1–␣ 2T 2T
冤
冥
This means that in (1 – ␣)% of samples with a given size n, the 2 2 ˆ random interval between ˆ L = ( 1–( ␣/2),2n/2T) and U = ( (␣/2),2n/2T) will contain the true failure rate.
bappb.qxd
2/8/2009
240
5:55 PM
Page 240
RELIABILITY AND AVAILABILITY THEORY
For the example shown in Chapter 6, Section 6.2.7, if after testing for T = 50,000 hours, n = 60 failures are observed, the point estimate of the failure rate is 60 ˆ = ᎏ = 0.0012 failures/hour 50,000 For a confidence level of 90%, that is, ␣ = 1 – 0.9 = 0.1, we calculate the confidence intervals for the failure rate as 2 2 95.703 1–( 0.95,120 ␣/2),2n ˆ L = ᎏᎏ = ᎏᎏ = ᎏ = 0.000957 failures/hour 2 × 50,000 100,000 2T
and 2 2(␣/2),2n 0.05,120 146.568 ˆ U = ᎏ = ᎏᎏ = ᎏ = 0.001465 failures/hour 2 × 50,000 100,000 2T
Let us discuss an example of associating confidence level with the failure rate bounds based on the recorded data—estimating the failure rate bounds for a given confidence level (say 95%) if zero failures have occurred in time t. Assume that a Poisson distribution is used to model the failure process. The probability of x failures or less in a total time t is x (t/m)ke–t/m Px = 冱 ᎏᎏ k! k=0
where m is the mean time to failure or the reciprocal of failure rate, that is, m = 1/; k is the index of the number of observed failures. Now let us investigate the probability of zero failures, that is, k = 0: 0 (t/m)0e–t/m Px=0 = 冱 ᎏᎏ = e–t/m 0! k=0
Next we estimate the one-sided confidence limit for , given that zero failures occurred by time t. Assume a value of , say ⬘, that satisfies ⬘ > , and the probability of actually getting zero failures is 1 – ␣ = 5%, where ␣ = 95% is the confidence level. Then,
bappb.qxd
2/8/2009
5:55 PM
Page 241
B.3
ESTIMATION OF CONFIDENCE INTERVALS
241
1 – 0.95 = e–t
⬘t = 3.0 3.0 ⬘ = ᎏ t or m⬘ = 0.33t. This implies that if zero failures have occurred in time t, then there is a 95% confidence that the failure rate is less than 3/t and that the MTTF is greater than 0.33t. B.3.2
Confidence Intervals for Unavailability
The unavailability can be calculated from the availability (A) equation in Equation (2.1): MTTF Unavailability = 1 – A = 1 – ᎏᎏ MTTF + MTTR
or
U = ᎏ (32) +
where and are failure and repair rates, respectively, and U is unavailability. Note that MTTF = 1/ and MTTR = 1/. The average uptime duration m and the average downtime duration r estimated can be evaluated from the recorded data. Using these two values, a single-point estimate of the unavailability can be evaluated from Equation (22): r Uˆ = ᎏ r+m
(33)
The confidence level can also be made from the same set of recorded data. It was shown [Baldwin54] that r r Pr[⬘⬘ a,b ⱕ F2a,2b ⱕ ⬘ a,b] = Pr ᎏ ⱕ ᎏ ⱕ ᎏᎏ r + ⬘m + r + ⬘⬘m
冤
冥
(34)
where
⬘a,b and ⬘⬘ a,b are constants depending upon the chosen confidence level F2a,2b = F-statistic with 2a degrees of freedom in the numerator and 2b in the denominator
bappb.qxd
2/8/2009
242
5:55 PM
Page 242
RELIABILITY AND AVAILABILITY THEORY
a = number of consecutive or randomly chosen downtime durations b = number of consecutive or randomly chosen uptime durations The values of ⬘a,b and ⬘⬘ a,b are determined for a specific probability ␣. ⬘a,b is obtained from 1–␣ Pr[F2a,2b ⱖ ⬘] = ᎏ 2
(35)
1–␣ Pr[F2a,2b ⱕ ⬘⬘] = ᎏ 2
(36)
and ⬘⬘ a,b from
Since the upper tails of the F-distribution are usually tabulated [Odeh77], it is more convenient to express the equation above as 1 1–␣ 1 Pr ᎏ ⱖ ᎏ = ᎏ ⬘ 2 F2a,2b
冤
冥
or 1 1–␣ Pr F2b,2a ⱖ ᎏ = ᎏ ⬘⬘ 2
冤
冥
(37)
Once the values of ⬘ and ⬘⬘ are evaluated from the F-distribution with the chosen confidence level, ␣, they can be used to derive the following limits enclosing the true values of U: r Upper limit, UU = ᎏᎏ r + ⬘⬘m (38) r Lower limit, UL = ᎏ r + ⬘m B.3.3
Confidence Intervals for Large Samples
We have discussed taking multiple samples of the same size and by the same method to verify if a random sample is representative in Chapter 9, Section 9.2.1. Suppose we collect n samples. As the
bappb.qxd
2/8/2009
5:55 PM
Page 243
B.3
ESTIMATION OF CONFIDENCE INTERVALS
243
sample size (n) becomes larger, the sampling distribution of means becomes approximately normal, regardless of the shape of the variable in the population according to the central limit theorem (CLT). Assume that we estimated the sample means from all of them, say X 苶i. Then the mean of all these sample means, say X 苶i, is 苶 = ⌺ni=1X the best estimate of the population true mean, say . According to the CLT, the sampling distribution will be centered around the population mean , that is, X 苶 ⬵ . The standard deviation of the sampling distribution (X), which is called its standard error, will approach the standard deviation of the population () divided by 苶. (n1/2), that is, X = /兹n The table for the normal distribution indicates 95% of the area under the curve lies between a Z score of ±1.96. Therefore, we are 95% confident that the population mean lies between X 苶 ± 1.96X. Similarly, the table for the normal distribution indicates 99% of the area under the curve lies between a Z score of ±2.58. Therefore, we are 99% confident that the population mean lies between X 苶 ± 2.58X. When the sample size n is less than 30, the t-distribution is used to calculate the sample mean and standard error. The mathematics of the t-distribution were developed by W. C. Gossett and were published in 1908 [Gossett1908]. Reference [Chaudhuri2005] documents more discussion on estimating sampling errors.
bappc.qxd
2/10/2009
APPENDIX
2:40 PM
Page 245
C
SOFTWARE RELIABILITY GROWTH MODELS
C.1
SOFTWARE CHARACTERISTIC MODELS
Research activities in software reliability engineering have been conducted over the past 35 years and many models have been proposed for the estimation of software reliability. There exist some classification systems of software reliability models; for example, the classification theme according to the nature of the debugging strategy presented by Bastani and Ramamoorthy [Bastani86]. In addition, Goel [Goel85], Musa [Musa84], and Mellor [Mellor87] presented their classification systems. In general, model classifications are helpful for identifying similarity between different models and to provide ideas when selecting an appropriate model. One of the most widely used classification methods classified software reliability models into two types: the deterministic and the probabilistic [Pham1999]. The deterministic models are used to study: (1) the elements of a program by counting the number of operators, operands, and instructions; (2) the control flow of a program by counting the branches and tracing the execution paths; and (3) the data flow of a program by studying the data sharing and passing. In general, these models estimate and predict software performance using regression of performance measures on program complexity or other metrics. Halstead’s software metric and McCabe’s cyclomatic complexity metric are two known models of this type [Halstead77 and McCabe76]. In general, these models can be used to analyze the program attributes and produce the software performance measures without involving any random event. Software complexity models have also been studied in [Ottenstein81] and [Schneiderwind81], Practical System Reliability. By Eric Bauer, Xuemei Zhang, and Douglas A. Kimber Copyright © 2009 the Institute of Electrical and Electronics Engineers, Inc.
245
bappc.qxd
2/10/2009
246
2:40 PM
Page 246
SOFTWARE RELIABILITY GROWTH MODELS
who presented some empirical models using some complexity metrics. Lipow [Lipow82] presented some models to estimate the number of faults per line of code, which is an important complexity metric. A common feature of these models is that they estimate software reliability or the number of remaining faults by regression analysis. In other words, they determine the performance of the software according to its complexity or other metrics. The probabilistic models represent the failure occurrences and fault removals as probabilistic events. This type of software reliability model [Xie91, Lyu96, Musa87, and Pham2000] can be further classified into the following categories: fault seeding models, failure rate models, curve fitting models, program structure models, input domain models, Bayesian models, Markov models, reliability growth models, and nonhomogeneous Poisson process (NHPP) models. Among these models, NHPP models are straightforward to implement in real-world applications. This family of models has received the most attention from both research and industry communities. In the next section, NHPP theory and widely used models are discussed. C.2 C.2.1
NONHOMOGENEOUS POISSON PROCESS MODELS Summary of SRGMs
C.2.1.1 Basic NHPP Models Research activities in software reliability engineering have been conducted and a number of NHPP software reliability growth models (SRGMs) have been proposed to assess the reliability of software [Goel79a, 79b; Hossain93; Lyu96; Miller86; Musa83; Musa87; Ohba84a, 84b, 84c; Ohtera90a; Pham93, 96; Yamada83; Yamada92; Wood96]. One of the first NHPP models is suggested by Schneidewind (1975). The first and well-known NHPP model is given by Goel and Okumoto [Goel79a]) which has been further generalized in [Goel85] and modified by many researchers. This model essentially agrees with the model that John Musa [Musa84, 98] proposed. Moranda [Moranda81] described a variant of the JelinskiMoranda model. In this paper, the variable (growing) size of a developing program is accommodated so that the quality of a program can be estimated by analyzing an initial segment of the written code. Two parameters, mean time to failure (MTTF) and fault content of a program, are estimated.
bappc.qxd
2/10/2009
2:40 PM
Page 247
C.2
NONHOMOGENEOUS POISSON PROCESS MODELS
247
C.2.1.2 S-Shaped Models Ohba and coworkers [Ohba82] presented a NHPP model with an S-shaped mean value function. Some interesting results using an NHPP model are also presented by Yamada and coworkers [Yamada83]. Ohba [Ohba84] discussed several methods to improve some traditional software reliability analysis models. Selection of appropriate models was addressed. Ohba and coworkers [Ohba82] suggested the so-called S-shaped models. Based on experience, it is observed that the curve of the cumulative number of faults is often S-shaped with regard to the mean value function, which reflects the fact that faults are neither independent nor of the same size. At the beginning, some faults are hidden, so removing a fault has a small effect on the reduction of the failure intensity rate. Yamada [Yamada84] proposed another reason that the mean value function shows an S-shaped curve. The software testing usually involves a learning process by which people become familiar with the software and the testing tools and, therefore, can do a better job after a certain period of time. Yamada and Osaki [Yamada85] presented a general description of a discrete software reliability growth model that adopted the number of test runs or the number of executed test cases as the unit of fault detection period. Yamada and coworkers [Yamada86] proposed a testing-effort-dependent reliability growth model for which the software fault detection process is modeled by an NHPP model. They used exponential and Rayleigh distributions to model the testing expenditure functions. Since 1990, research activities have increased in the area of software reliability modeling. Yamada and Ohtera [Yamada90] incorporated the testing-effort expenditures into software reliability growth models. They conducted research on NHPP models and provided many modifications reflecting such issues as testing effort, delayed S-shaped, and learning process consideration. C.2.1.3 Imperfect Debugging Singpurwalla [Singpurwalla91] described an approach addressing optimal time interval for testing and debugging under uncertainty. He suggested two plausible forms for the utility function, one based on cost alone and the other involving the realized reliability of the software. Yamada [Yamada91a] described a software fault detection process during the testing phase of software development. Yamada et al. [Yamada91b] proposed two software reliabili-
bappc.qxd
2/10/2009
248
2:40 PM
Page 248
SOFTWARE RELIABILITY GROWTH MODELS
ty assessment models with imperfect debugging by assuming that new faults are sometimes introduced when faults originally latent in a software system are corrected and removed during the testing phase. Pham [Pham91] proposed software reliability models for critical applications. C.2.1.4 Fault Detection and Correction Processes Xie and Zhao [Xie92] investigated the Schneidewind NHPP model [Schneidewind75] and suggested that several NHPP models can be derived from it to model the fault detection process and fault correction process. Pham [Pham93] studied the imperfect debugging and multiple failure types in software development. Hossain and coworkers [Hossain93] suggested a modification of the Goel–Okumoto model. They also presented a necessary and sufficient condition for the likelihood estimates to be finite, positive, and unique. Pham and Zhang [Pham97] summarized the existing NHPP models and presented a new model incorporating the timedependent behavior of the fault detection function and the fault content function. Littlewood [Littlewood2000] studied different fault-finding procedures and showed that the effects these procedures have on reliability are not statistically independent. Wu [Wu2007] investigated fault detection and fault correction processes and proposed an approach to incorporate time delays due to fault detection and correction into software reliability models. Huang [Huang2004] proposed methods to incorporate fault dependency and time-dependent delay function into software reliability growth modeling. Gokhale [Gokhale98] proposes a method to incorporate debugging activities using rate-based simulation techniques. Various debugging policies are presented and the effects of these policies on the number of residual defects are analyzed. C.2.1.5 Testing Coverage Research has been conducted in testing coverage and its relationship with software reliability. Levendel [Levendel89] introduced the concepts of time-varying test coverage and time-varying defect repair density and related them to software reliability evaluation. Malaiya [Malaiya94, Malaiya2002] proposed a logarithmic model that relates the testing effort to test coverage and defect coverage. Testing coverage is measured in terms of blocks, branches, computation uses, predicate uses, and so on that are covered. Lyu
bappc.qxd
2/10/2009
2:40 PM
Page 249
C.2
NONHOMOGENEOUS POISSON PROCESS MODELS
249
[Lyu2003] documented an empirical study on testing and fault tolerance for software reliability engineering. Faults were inserted into the software and the nature, manifestation, detection, and correlation of these faults was carefully studied. This study shows that coverage testing is an effective means of detecting software faults, but the effectiveness of testing coverage is not equivalent to that of mutation coverage, which is a more truthful indicator of testing quality. Gokhale [Gokhale2004] proposed a relationship between test costs and the benefits, specifically between the quantity of testing and test coverage, based on the lognormal failure rate model. Cai [Cai2007] incorporates both testing time and testing coverage in software reliability prediction. Experiments were carried out using the model in multiversion, fault-tolerant software. C.2.1.6 Other Considerations Other software reliability growth models have been proposed in the literature. Yoshihiro Tohma and coworkers [Tohma91] worked on a hypergeometric model and its application to software reliability growth. Zhao and Xie [Zhao92] presented a log-power NHPP model that possesses properties such as simplicity and good graphical interpretation. Some papers also addressed the application of the software reliability models. Schneidewind [Schneidewind92] reported some software reliability studies of application to the U.S. space shuttle. He used experimental approaches to evaluate many existing reliability models and validated them using real software failure data. Schneidewind [Schneidewind93] claimed that it is not necessary that all the failure data be used to estimate software reliability since some of the failure data collected in the earlier testing phase is unstable. His research showed that improved reliability prediction can be achieved by using a subset of the failure data. Wood [Wood96] reported his experiments on software reliability models at Tandem Computer. He compared some existing software reliability models by applying them to the data collected from four releases of the software products. He observed that the number of defects predicted by the Goel–Okumoto model is close to the number reported in the field data. Research on software size estimation has also been conducted. Hakuta and coworkers [Hakuta96] proposed a model for estimating software size based on the program design and other docu-
bappc.qxd
2/10/2009
250
2:40 PM
Page 250
SOFTWARE RELIABILITY GROWTH MODELS
ments, then evaluated the model by looking at some application examples. Their model assumed a stepwise evaluation of software size at different levels of program design documents. Software reliability models based on NHPP have indeed been successfully used to evaluate software reliability [Musa83; Miller86; Musa87, Pham99]. Musa [Musa83, Musa87] promoted the use of NHPP models in software reliability growth modeling. Miller [Miller86] also provided a strong theoretical justification for using NHPP. An important advantage of NHPP models is that they are closed under superposition and time transformation. This characteristic is useful to describe different types of failures, even systems failures, including both hardware and software failures. Pham, Nordmann and Zhang [Pham99] developed a general NHPP model from which new models can be derived and existing models can be unified. C.2.1.7 Failure Rate Prediction SRGMs are used to evaluate and predict the software failure rate. References [Jeske05a, Zhang02, Zhang06] document details on how to use SRGM to evaluate testing data and predict software field failure rates. Typically, the software failure rate in the testing environment needs to be calibrated when predicting the software field failure rate to adjust the mismatch between the testing and the field environment. To adjust the mismatch, this rate should be correlated with the software failure rate observed in the field by some calibration factor. This calibration factor is best estimated by comparing the lab failure rate of a previous release against the field failure rate for that same release; assuming that testing strategies, operational profiles, and other general development processes remain relatively consistent, the calibration factor should be relatively consistent. References [Zhang02] and [Jeske05a] discuss more details on calibrating software failure rates estimated in the testing environment to predict software failure rates in the field. They also discuss two other practical issues: noninstantaneous defect removal time and deferral of defect fixes. Most SRGMs focus on defect detection and they assume that fault removal is instantaneous and all of the detected defects will be fixed before software is released. So software reliability growth can be achieved after software defects are detected. In practice, it takes a significant amount of time to remove defects, and fixes of some defects might be deferred to the next release for various rea-
bappc.qxd
2/10/2009
2:40 PM
Page 251
C.2
NONHOMOGENEOUS POISSON PROCESS MODELS
251
sons; for example, it is part of a new feature. References [Jeske05b] and [Zhang06] explained how to address these issues in realworld applications. C.2.2
Theory of SRGM
To use SRGM to describe the fault detection process, let N(t) denote the cumulative number of software failures by time t. The counting process {N(t),t ⱖ 0} is said to be a nonhomogeneous Poisson process with intensity function (t), t ⱖ 0, if N(t) follows a Poisson distribution with mean value function m(t): [m(t)]k Pr{N(t) = k} = ᎏ e–m(t), k!
k = 0, 1, 2 . . . ,
(1)
where m(t) = E[N(t)] is the expected number of cumulative failures, which is also known as the mean value function. The failure intensity function (or hazard function) is given by R(t) – R(⌬t + t) f(t) (t) = lim ᎏᎏ = ᎏ ⌬t씮0 ⌬tR(t) R(t) Given (t), the mean value function m(t) satisfies
冕 (s)ds t
m(t) =
(2)
0
Inversely, knowing m(t), the fault detection rate function at time t, can be obtained as dm(t) (t) = ᎏ dt
(3)
Software reliability R(x/t) is defined as the probability that a software failure does not occur in (t, t + x), given that the last failure occurred at testing time t(t ⱖ 0, x > 0). That is, R(x/t) = e–[m(t+x)–m(t)]
(4)
For special cases, when t = 0, R(x/0) = e–m(x); and when t = ⬁, R(x/⬁) = 1.
bappc.qxd
2/10/2009
252
2:40 PM
Page 252
SOFTWARE RELIABILITY GROWTH MODELS
Most of the NHPP software reliability growth models in the literature are based on the same underlying theory, with different formats of the mean value functions. NHPP models assume that failure intensity is proportional to the residual fault content. A general class of NHPP SRGMs can be obtained by solving the following differential equation: dm(t) ᎏ = b(t)[a(t) – m(t)] dt
(5)
Where a(t) stands for the total number of defects in the software and b(t) is known as the defect detection function. The second term a(t) – m(t) represents the number of (undetected) residual defects. The model in Equation (5) is a general model that can summarize most of the NHPP models. Depending on how elaborate a model one wishes to obtain, one can use a(t) and b(t) to yield more or less complex analytical solutions for the function m(t). Different a(t) and b(t) functions also reflect different assumptions of the software testing processes. The GO model cited in Chapter 7 is the simplest NHPP with a(t) = a and b(t) = b. A constant a(t) indicates that no new faults are introduced during the debugging process and, therefore, is considered a perfect debugging assumption. A constant b(t) implies that the failure intensity function (t) is proportional to the number of remaining faults. The general solution of the differential Equation (5) is given by [Pham97, Pham99]:
冤
m(t) = e–B(t) m0 +
冕 a()b()e t
t0
B()
冥
d
(6)
where B(t) = 兰tt0b() d, and m(t0) = m0 is the marginal condition of Equation (5), with t0 representing the starting time of the debugging process. Many existing NHPP models can be considered as special cases of the general model in Equation (6). An increasing a(t) function implies an increasing total number of faults (note that this includes those already detected and removed and those inserted during the debugging process) and reflects imperfect debugging. A time-dependent b(t) implies an increasing fault detection rate, which could be either attributed to a learning curve phenomenon [Ohba84,
bappc.qxd
2/10/2009
2:40 PM
Page 253
C.2
NONHOMOGENEOUS POISSON PROCESS MODELS
253
Yamada92], or to software process fluctuations [Rivers98], or a combination of both. This group of models with time-dependent fault detection functions are also referred to as “S-shaped” models since the fault detection function captures the delay at the beginning due to learning. Table C.1 summarizes the widely used NHPP software reliability growth models and their mean value functions (MVFs). C.2.2.1 Parameter Estimation Once the analytical expression for the mean value function m(t) is developed, the parameters of the mean value function need to be estimated, which is usually carried out by using the maximum likelihood estimates (MLE) [Schneidewind75]. There are two widely used formats for recording software failure data in practice. The first type records the cumulative number of failures for every separate time interval. The second records the exact failure occurrence time for each fault. Since there are two types of data, two methodologies of the parameter estimation are derived accordingly [Pham96]. Case 1 Let t1, t2, . . . , tn be the time interval of n software failures, and y1, y2,. . . , yn be the cumulative number of failures for each time interval. If data are given on the cumulative number of failures at discrete times [yi = y(ti) for i = 1, 2, . . . , n], then the log of the likelihood function (LLF) can be expressed as n
LLF = 冱(yi – yi–1) × log[m(ti) – m(ti–1)] – m(tn)
(7)
i=1
Thus, the maximum of the LLF is determined by the following: n
冱
i=1
⭸ ⭸ ᎏᎏ m(ti) – ᎏᎏ m(ti–1) ⭸ ⭸x ⭸x ᎏᎏᎏ (yi – yi–1) – ᎏ m(tn) = 0 m(ti) – m(ti–1) ⭸x
(8)
where x represents the unknown parameters in the mean value function m(t) that need to be substituted. Case 2 The second method records the failure occurrence time for each failure. Let Sj be the occurrence time of the failure j (j = 1, 2, . . . ,
bappc.qxd
2/10/2009
254
2:40 PM
Page 254
SOFTWARE RELIABILITY GROWTH MODELS
Table C.1. Summary of the NHPP software reliability models Model name
Model type MVF, m(t)
Delayed S-shaped model [Yamada83]
S-shaped
m(t) = a[1 – (1 + bt)e–bt]
Modification of G-O model to make it S-shaped
Goel–Okumoto (G-O) model [Goel79]
Concave
m(t) = a(1 – e–bt) a(t) = a b(t) = b
Also called exponential model
Inflection S-shaped model [Hossain93]
S-shaped
a(1 – e–bt) m(t) = ᎏᎏ 1 + e–bt
Solves a technical condition with the G-O model. Becomes the same as G-O if =0
a(t) = a b b(t) = ᎏᎏ 1 + e–bt
Pham– Nordmann– Zhang (PNZ) model
S-shaped and concave
␣ a(1 – e–bt)(1 – ᎏᎏ) + ␣at b m(t) = ᎏᎏᎏ –bt 1 + e a(t) = a(1 + ␣t) b b(t) = ᎏᎏ 1 + e–bt
Pham–Zhang (PZ) model [Pham99]
S-shaped and concave
1 + a)(1 – e–bt) m(t) = ᎏᎏ[(c (1 + e–bt) ab – ᎏᎏ(e–␣t – e–bt)] b–␣ a(t) = c + a(1 – e–␣t) b b(t) = ᎏᎏ 1 + e–bt
Yamada exponential model [Yamada86]
Concave
m(t) = a(1 – e–r␣(1–e(–t))) a(t) = a b(t) = r␣e–t
Comments
Assumes the introduction rate is a linear function of the testing time, and the fault detection rate is nondecreasing with an inflexion S-shaped model Assumes introduction rate is an exponential function of the testing time, and the fault detection rate is nondecreasing with an inflexion S-shaped model Incorporates an exponential testing-effort function
bappc.qxd
2/10/2009
2:40 PM
Page 255
C.2
255
NONHOMOGENEOUS POISSON PROCESS MODELS
Table C.1. Continued Model name
Model type MVF, m(t)
Comments 2 –r␣(1–e(–t /2))
Yamada Rayleigh model [Yamada86]
S-shaped
m(t) = a(1 – e a(t) = a b(t) = r␣te–t2/2
Yamada imperfect debugging model (1) [Yamada92]
S-shaped
ab m(t) = ᎏᎏ (e–␣t – ebt) ␣+b
Yamada imperfect debugging model (2) [Yamada92]
S-shaped
)
Assumes exponential fault content function and a constant fault detection rate
a(t) = ae␣t b(t) = b
␣ m(t) = a(1 – e–bt) 1 – ᎏᎏ + ␣at b
冢
Incorporates a Rayleigh testing-effort function
冣
a(t) = a(1 + ␣t) b(t) = b
Assumes constant introduction rate ␣ and fault rate detection b
n), then the log of the likelihood function takes the following form: n
LLF = 冱 log[(Si)] – m(Sn)
(9)
i=1
where (Si) is the failure intensity function at time Si. Thus, the maximum of the LLF is determined by the following: n
冱
i=1
⭸ ᎏᎏ (Si) ⭸x ⭸ ᎏ – ᎏ m(Sn) = 0 ⭸x (Si)
(10)
where x represents the unknown parameters in the mean value function m(t) that need to be substituted. Typically, software tools can help users to fit a specific model to a given data set. The tool will run algorithms to maximize the likelihood function, which yields the values of the parameters in the model for a given dataset. C.2.2.2
SRGM Model Selection Criteria
Descriptive Power. An NHPP SRGM model proposes a mean value function m(t) that can be used to estimate the number of expected
bappc.qxd
2/10/2009
256
2:40 PM
Page 256
SOFTWARE RELIABILITY GROWTH MODELS
failures by time t. Once a mean value function is fit to the actual debugging data (typically in the format of number of cumulative failures by cumulative test time), the goodness of fit can be measured and compared using the following three criteria: mean squared error (MSE), Akaike’s information criterion (AIC) [Akaike74], and predictive-ratio risk (PRR) proposed by Pham and Deng [Pham2003]. The closeness of fit between a given model and the actual dataset provides the descriptive power of the model. These three metrics compare the SRGMs on how well each of them fits the underlying data, in other words, they compare the SRGMs based on their descriptive power. Criterion 1 The MSE measures the distance of a model estimate from the actual data with the consideration of the number of observations and the number of parameters in the model. It is calculated as follows:
冱i (m(ti) – yi)2
MSE = ᎏᎏ n–N
(11)
where n is the number of observations, N stands for the number of parameters in the model, yi is the cumulative number of failures observed by time ti, m(ti) is the mean value function of the NHPP model, and i is the index for the reported defects. Criterion 2 AIC measures the ability of a model to maximize the likelihood function that is directly related to the degrees of freedom during fitting. The AIC criterion assigns a larger penalty to a model with more parameters: AIC = –2 × log(max value of the likelihood function) + 2 × N (12) where N stands for the number of parameters in the model. Criterion 3 The third criterion, predictive-ratio risk (PRR), is defined as
冢
m(ti) – yi PRR = 冱 ᎏᎏ m(ti) i
冣
2
(13)
bappc.qxd
2/10/2009
2:40 PM
Page 257
C.2
NONHOMOGENEOUS POISSON PROCESS MODELS
257
where yi is the number of failures observed by time ti and m(ti) is the mean value function of an NHPP model. PRR assigns a larger penalty to a model that has underestimated the cumulative number of failures at any given time. For all three, a lower metric value indicates a better fit to the data. Predictive Power. The predictive power is defined as the ratio of the difference between the predicted number of residual faults and number of faults observed in the postsystem test to the number of observed faults in the postsystem test, that is, — ˆ
Ns(T) – Npost P = ᎏᎏ Npost
(14)
— ˆ
where Ns is the estimated number of remaining faults by the end of the system test, and Npost is the number of faults detected during the post system test phase. A negative value indicates that the model has underestimated the number of remaining faults. A lower absolute metric value indicates better predictive power. For projects that have defect data from the test and field/trial intervals, the model that provides the best descriptive power and predictive power should be selected. If only test data is available and the test data is relatively large, the data can be divided into two subsets. The first subset can be used for descriptive power comparison, whereas the second subset can be used for predictive power comparison. If the test data is small, then only descriptive power can be used for model selection. C.2.3 SRGM Example–Evaluation of the Predictive Power: Data from a Real-Time Control System In this section, the predictive power of the proposed model is evaluated by using a dataset collected from testing a program for monitor and real-time control systems. The data is published in [Tohma91]. The software consists of about 200 modules, and each module has, on average, 1000 lines of a high-level language like FORTRAN. Table C.2 records the software failures detected during a 111-day testing period. This actual data set is concave overall, with two clusters of significant increasing detected faults.
bappc.qxd
2/10/2009
258
2:40 PM
Page 258
SOFTWARE RELIABILITY GROWTH MODELS
Table C.2. Failure per day and cumulative failure

Days   Faults   Cumulative faults     Days   Faults   Cumulative faults
  1      5*           5*               46      3          414
  2      5*          10*               47      3          417
  3      5*          15*               48      8          420
  4      5*          20*               49      5          430
  5      6*          26*               50      1          431
  6      8           34                51      2          433
  7      2           36                52      2          435
  8      7           43                53      2          437
  9      4           47                54      7          444
 10      2           49                55      1          446
 11     31           80                56      0          446
 12      4           84                57      2          448
 13     24          108                58      3          451
 14     49          157                59      2          453
 15     14          171                60      7          460
 16     12          183                61      3          463
 17      8          191                62      0          463
 18      9          200                63      1          464
 19      4          204                64      0          464
 20      7          211                65      1          465
 21      6          217                66      0          465
 22      9          226                67      0          465
 23      4          230                68      1          466
 24      4          234                69      1          467
 25      2          236                70      0          467
 26      4          240                71      0          467
 27      3          243                72      1          468
 28      9          252                73      1          469
 29      2          254                74      0          469
 30      5          259                75      0          469
 31      4          263                76      0          469
 32      1          264                77      1          470
 33      4          268                78      2          472
 34      3          271                79      0          472
 35      6          277                80      1          473
 36     13          293                81      0          473
 37     19          309                82      0          473
 38     15          324                83      0          473
 39      7          331                84      0          473
 40     15          346                85      0          473
 41     21          367                86      0          473
 42      8          375                87      2          475
 43      6          381                88      0          475
 44     20          401                89      0          475
 45     10          411                90      0          475
 91      0          475               102      0          477
 92      0          475               103      1          478
 93      0          475               104      0          478
 94      0          475               105      9          478
 95      0          475               106      1          479
 96      1          476               107      0          479
 97      0          476               108      0          479
 98      0          476               109      1          480
 99      0          476               110      0          480
100      1          477               111      1          481
101      0          477

*Interpolated data.
Table C.3. MLEs of model parameters—control system data

Delayed S-shaped model
  MVF: m(t) = a[1 – (1 + bt)e^(–bt)];  a(t) = a;  b(t) = b²t/(1 + bt)
  MLEs (61 data points):  a = 522.49, b = 0.06108
  MLEs (111 data points): a = 483.039, b = 0.06866

Goel–Okumoto (G-O) model
  MVF: m(t) = a(1 – e^(–bt));  a(t) = a;  b(t) = b
  MLEs (61 data points):  a = 852.97, b = 0.01283
  MLEs (111 data points): a = 497.282, b = 0.0308

Inflexion S-shaped model
  MVF: m(t) = a(1 – e^(–bt))/(1 + βe^(–bt));  a(t) = a;  b(t) = b/(1 + βe^(–bt))
  MLEs (61 data points):  a = 852.45, b = 0.01285, β = 0.001
  MLEs (111 data points): a = 482.017, b = 0.07025, β = 4.15218

Pham–Nordmann–Zhang (PNZ) model
  MVF: m(t) = [a(1 – e^(–bt))(1 – α/b) + αat]/(1 + βe^(–bt));  a(t) = a(1 + αt);  b(t) = b/(1 + βe^(–bt))
  MLEs (61 data points):  a = 470.759, b = 0.07497, α = 0.00024, β = 4.69321
  MLEs (111 data points): a = 470.759, b = 0.07497, α = 0.00024, β = 4.69321

Pham–Zhang (PZ) model
  MVF: m(t) = [1/(1 + βe^(–bt))][(c + a)(1 – e^(–bt)) – (ab/(b – α))(e^(–αt) – e^(–bt))];
       a(t) = c + a(1 – e^(–αt));  b(t) = b/(1 + βe^(–bt))
  MLEs (61 data points):  a = 0.920318, b = 0.0579, α = 2.76 × 10^(–5), β = 3.152, c = 520.784
  MLEs (111 data points): a = 0.46685, b = 0.07025, α = 1.4 × 10^(–5), β = 4.15213, c = 482.016

Yamada exponential model
  MVF: m(t) = a(1 – e^(–rα(1 – e^(–βt))));  a(t) = a;  b(t) = rαβe^(–βt)
  MLEs (61 data points):  a = 9219.7, α = 0.09995, β = 0.01187
  MLEs (111 data points): a = 67958.8, α = 0.00732, β = 0.03072

Yamada Rayleigh model
  MVF: m(t) = a(1 – e^(–rα(1 – e^(–βt²/2))));  a(t) = a;  b(t) = rαβte^(–βt²/2)
  MLEs (61 data points):  a = 611.70, α = 1.637, β = 0.00107
  MLEs (111 data points): a = 500.146, α = 3.31944, β = 0.00066

Yamada imperfect debugging model (1)
  MVF: m(t) = [ab/(α + b)](e^(αt) – e^(–bt));  a(t) = ae^(αt);  b(t) = b
  MLEs (61 data points):  a = 1795.7, b = 0.00614, α = 0.002
  MLEs (111 data points): a = 654.963, b = 0.02059, α = 0.0027

Yamada imperfect debugging model (2)
  MVF: m(t) = a(1 – e^(–bt))(1 – α/b) + αat;  a(t) = a(1 + αt);  b(t) = b
  MLEs (61 data points):  a = 16307, b = 0.0068, α = 0.009817
  MLEs (111 data points): a = 591.804, b = 0.02423, α = 0.0019
The software in this example appears to stabilize after 61 days of testing. We therefore compare the descriptive power of the models using the first 61 data points and compare their predictive power by treating the last 50 days of data as the "actual" data observed after the prediction was made. The parameter estimates obtained from the first 61 days of data (and, for reference, from all 111 days) are summarized in Table C.3. The result of the predictive power comparison is shown in Table C.4: the PZ model shows the best predictive power (the lowest prediction MSE), followed by the Yamada Rayleigh, inflexion S-shaped, and delayed S-shaped models. Table C.3 is also used to estimate the total number of defects.
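To illustrate how such a comparison can be carried out, the sketch below fits two of the simpler mean value functions to the first 61 days of the Table C.2 data by nonlinear least squares and then scores their predictions against the last 50 days. This is only an approximation of the procedure used for Tables C.3 and C.4: the estimates there are maximum likelihood estimates and the exact prediction-MSE convention is not spelled out here, so the numbers produced by this sketch will not match the tables exactly.

```python
import numpy as np
from scipy.optimize import curve_fit

# Cumulative failures by day, transcribed from Table C.2 (days 1 through 111).
cum_faults = np.array([
      5,  10,  15,  20,  26,  34,  36,  43,  47,  49,  80,  84, 108, 157, 171,
    183, 191, 200, 204, 211, 217, 226, 230, 234, 236, 240, 243, 252, 254, 259,
    263, 264, 268, 271, 277, 293, 309, 324, 331, 346, 367, 375, 381, 401, 411,
    414, 417, 420, 430, 431, 433, 435, 437, 444, 446, 446, 448, 451, 453, 460,
    463, 463, 464, 464, 465, 465, 465, 466, 467, 467, 467, 468, 469, 469, 469,
    469, 470, 472, 472, 473, 473, 473, 473, 473, 473, 473, 475, 475, 475, 475,
    475, 475, 475, 475, 475, 476, 476, 476, 476, 477, 477, 477, 478, 478, 478,
    479, 479, 479, 480, 480, 481], dtype=float)
days = np.arange(1, len(cum_faults) + 1, dtype=float)

def goel_okumoto(t, a, b):
    """G-O model: m(t) = a(1 - exp(-b t))."""
    return a * (1.0 - np.exp(-b * t))

def delayed_s_shaped(t, a, b):
    """Delayed S-shaped model: m(t) = a[1 - (1 + b t) exp(-b t)]."""
    return a * (1.0 - (1.0 + b * t) * np.exp(-b * t))

train_t, train_y = days[:61], cum_faults[:61]   # first 61 days: fit the models
test_t, test_y = days[61:], cum_faults[61:]     # last 50 days: score the predictions

for name, model in [("G-O", goel_okumoto), ("Delayed S-shaped", delayed_s_shaped)]:
    params, _ = curve_fit(model, train_t, train_y, p0=(500.0, 0.05), maxfev=10000)
    pred_mse = np.mean((model(test_t, *params) - test_y) ** 2)
    print(f"{name}: a = {params[0]:.1f}, b = {params[1]:.4f}, prediction MSE = {pred_mse:.1f}")
```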
Table C.4. Model comparison—control system data

Model Name                                MSE (Prediction)
Delayed S-shaped model                          935.88
Goel–Okumoto (G-O) model                      11611.42
Inflexion S-shaped model                        590.38
Pham–Nordmann–Zhang (PNZ) model                2480.7
Pham–Zhang (PZ) model                           102.66
Yamada exponential model                      12228.25
Yamada Rayleigh model                           187.57
Yamada imperfect debugging model (1)           8950.54
Yamada imperfect debugging model (2)           2752.83
For all the models except the PZ model, the estimated number of total defects is â; for the PZ model, the number of total defects can be estimated as (ĉ + â). For example, using all 111 days of data, the total number of defects estimated from the PZ model is 482.47 and that estimated from the delayed S-shaped model is 483.03. Hence, for this data set, we can conclude that the PZ model is the best model for estimating the software reliability parameters, that is, the number of residual defects and the per-fault failure rate. Some of the other models provide close estimates or predictive power; they can be used to confirm the prediction.

Let us use the PZ model to further estimate the software reliability parameters. Based on all 111 data points, the number of total defects is (ĉ + â) = 482.47, hence the number of residual defects is N̂(T) = (ĉ + â) – n = 482.47 – 481 = 1.47. The per-fault failure rate is b̂ = 0.07025 failures/day/fault, so the initial failure rate based on the testing data is

λ̂(T = 111 days) = N̂(T) × b̂ = 1.47 × 0.07025 = 0.1053 failures/day, or 38.5 failures/year

As discussed in Section 7.2, assuming a calibration factor of K = 10 estimated from other releases or similar products, the adjusted initial field failure rate can be estimated as

λ̂(tfield = 0) = N̂(T) × b̂/K = 1.47 × 0.07025/10 = 0.01053 failures/day, or 3.85 failures/year

This software failure rate can be used in the architecture-based reliability model to update the downtime and availability prediction.
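The arithmetic above can be reproduced in a few lines. The sketch below simply re-computes these quantities from the PZ estimates in Table C.3 and the assumed calibration factor K = 10; the variable names are ours, and small rounding differences from the figures quoted in the text are expected.

```python
# Failure-rate estimate from the PZ model (values from Table C.3, 111 data points).
c_hat, a_hat = 482.016, 0.46685    # PZ parameters; total defects = c_hat + a_hat
b_hat = 0.07025                    # per-fault failure rate (failures/day/fault)
n_observed = 481                   # faults found during the 111-day test (Table C.2)
K = 10                             # assumed test-to-field calibration factor (Section 7.2)

residual = (c_hat + a_hat) - n_observed   # residual defects (the text quotes 1.47)
lambda_test = residual * b_hat            # initial failure rate during test, failures/day
lambda_field = lambda_test / K            # calibrated initial field failure rate

print(f"residual defects   : {residual:.2f}")
print(f"test failure rate  : {lambda_test:.4f}/day = {lambda_test * 365:.1f}/year")
print(f"field failure rate : {lambda_field:.5f}/day = {lambda_field * 365:.2f}/year")
```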
Here let us discuss how to use the software failure rate estimated from the test data to predict the downtime at a high level. Assume that the software coverage factor is C = 95%; that is, 95% of the time software failures can be automatically detected and recovered, and assume that automatic detection and recovery takes 10 seconds on average. The remaining 5% of failures escape the system fault detection mechanisms and are detected and recovered through human intervention; assume that manual detection and recovery takes 30 minutes on average. We can then quickly estimate the annual software downtime as
DT = 3.85 × (0.95 × 10/60 + 0.05 × 30) = 6.38 minutes/year

This corresponds to an availability of 99.9988%. With this example, we have shown how to use an SRGM to estimate the software failure rate, which can be fed back into the architecture-based model to produce an updated downtime prediction (as compared to an early architectural-phase prediction).
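For completeness, the downtime and availability arithmetic can be written out as a small sketch using the values assumed above (the variable names are ours):

```python
# Annual software downtime and availability from the calibrated field failure rate.
failures_per_year = 3.85
coverage = 0.95                  # fraction of failures detected and recovered automatically
auto_recovery_min = 10.0 / 60.0  # 10 seconds, expressed in minutes
manual_recovery_min = 30.0       # manual detection and recovery, in minutes

downtime_min_per_year = failures_per_year * (
    coverage * auto_recovery_min + (1.0 - coverage) * manual_recovery_min)
availability = 1.0 - downtime_min_per_year / (365 * 24 * 60)

print(f"downtime     = {downtime_min_per_year:.2f} minutes/year")  # about 6.38
print(f"availability = {availability:.4%}")                        # about 99.9988%
```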
APPENDIX D

ACRONYMS AND ABBREVIATIONS
CB      Control Board
CLT     Central Limit Theorem
COTS    Commercial Off-The-Shelf
CPLD    Complex Programmable Logic Device
CPU     Central Processing Unit
CRC     Cyclic Redundancy Check
CWT     Cleared While Testing
DC      Direct Current
DSL     Digital Subscriber Line
ECC     Error Checking and Correction
EPROM   Electrically Programmable Read-Only Memory
EPA     Environmental Protection Agency
ERI     Early Return Indicator
FCC     Federal Communications Commission
FIT     Failures in 1 Billion Hours
FPGA    Field-Programmable Gate Array
FRU     Field-Replaceable Unit
GA      General Availability
GO      Goel–Okumoto
GSPN    Generalized Stochastic Petri Net
HA      High Availability
HLR     Home Location Register
HW      Hardware
IC      Interface Card
IP      Internet Protocol
IT      Information Technology
KLOC    Kilolines of Code
LED     Light-Emitting Diode
LTR     Long-Term Return Rate
MLE     Maximum Likelihood Estimation
MOP     Method of Procedures
MR      Modification Request; a defect recorded in a defect tracking system
MTBF    Mean Time Between Failures
MTTF    Mean Time to Failure
MTTO    Mean Time to Outage
MTTR    Mean Time to Repair
NE      Network Element
NEBS    Network Equipment Building Standards
NEO4    Network Element Outage, Category 4
NESAC   National Electronics Systems Assistance Center
NHPP    Nonhomogeneous Poisson Process
NOC     Network Operations Center
NFF     No Fault Found, same as No Trouble Found
NTF     No Trouble Found, same as No Fault Found
OAM     Operations, Administration, and Maintenance
OEM     Original Equipment Manufacturer
OS      Operating System
PBX     Private Branch Exchange
PC      Personal Computer
PCI     Peripheral Component Interconnect
PDF     Probability Density Function
PE      Prediction Error
PMC     PCI Mezzanine Card
QuEST   Quality Excellence for Suppliers of Telecommunications
RBD     Reliability Block Diagram
RF      Radio Frequency
RFP     Request for Proposal
RPP     Reliability Prediction Procedure
SLA     Service-Level Agreement
SNMP    Simple Network Management Protocol
SO4     Product-Attributable Service Outage Downtime
SPN     Stochastic Petri Net
SRGM    Software Reliability Growth Model
SRQAC   Software Reliability and Quality Acceptance Criteria
SW      Software
YRR     Yearly Return Rate
APPENDIX E

BIBLIOGRAPHY
References are grouped first as a complete list, then by topic in the following sections to make it easier for the readers to find additional information on a specific subject. [Akaike74] Akaike, H., A New Look at Statistical Model Identification, IEEE Transactions on Automatic Control, 19, 716–723 (1974). [ANSI91] ANSI/IEEE, Standard Glossary of Software Engineering Terminology, STD-729-1991, ANSI/IEEE (1991). [AT&T90] Klinger, D. J., Nakada, Y., and Menendez, M. A., (Eds.), AT&T Reliability Manual, Springer (1990). [Bagowsky61] Bagowsky, I., Reliability Theory and Practice, PrenticeHall (1961). [Baldwin54] Baldwin, C., J. et al., Mathematical Models for Use in the Simulation of Power Generation Outage, II, Power System Forced Outage Distributions, AIEE Transactions, 78, TP 59–849 (1954). [Bastani82] Bastani, F. B., and Ramamoorthy, C. V., Software Reliability Status and Perspectives, IEEE Trans. Software Eng., SE-11, 1411–1423 (1985) [Cai 2007] Cai, X., and Lyu, M. R., Software Reliability Modeling with Test Coverage: Experimentation and Measurement With a Fault-Tolerant Software Project, in The 18th IEEE International Symposium on Software Reliability, 17–26 (2007). [Chaudhuri2005] Chaudhuri, A., and Stenger, H., Survey Sampling: Theory and Methods, Second Edition, Chapman & Hall/CRC Press (2005). [Crowder 94] Crowder, M. J., Kimber, A., Sweeting, T., and Smith, R., Statistical Analysis of Reliability Data, Chapman & Hall/CRC Press (1994). [Feller68] Feller, W., An Introduction to Probability Theory and Its Applications, Vol., 1, Wiley (1968). [FP08] Website of Function Points, http://www.ifpug.org/. Practical System Reliability. By Eric Bauer, Xuemei Zhang, and Douglas A. Kimber Copyright © 2009 the Institute of Electrical and Electronics Engineers, Inc.
[Goel79a] Goel, A. L. and Okumoto, K., Time-Dependent Fault Detection Rate Model for Software and Other Performance Measures, IEEE Transactions on Reliability, 28, 206–211 (1979). [Goel79b] Goel, A. L. and Okumoto, K., A Markovian Model for Reliability and Other Performance Measures of Software Systems, in Proceedings of the National Computer Conference, pp. 769–774 (1979). [Goel85] Goel, A. L., Software Reliability Models: Assumptions, Limitations, and Applicability. IEEE Trans Software Eng., SE-11, 1411–1423 (1985). [Gokhale1998] Gokhale, S. S., Lyu, M. R., and Trivedi, K. S., Software Reliability Analysis Incorporating Fault Detection and Debugging Activities, in Proceedings of the Ninth International Symposium on Software Reliability Engineering, 4–7 November, 202–211 (1998). [Gokhale2004] Gokhale, S. S., and Mullen, R. E., From Test Count to Code Coverage Using the Lognormal Failure Rate, in 15th International Symposium on Software Reliability Engineering, 295–305 (2004). [Gossett08] Gossett W. S., The Probable Error of a Mean, Biometrika 6(1), 1–25 (1908). [Hakuta97] Hakuta, M., Tone, F., and Ohminami, M., A Software Estimation Model and Its Evaluation, J. Systems Software, 37, 253–263 (1997). [Halstead77] Halstead M. H., Elements of Software Science, Elsevier North-Holland (1977). [Hossain93] Hossain, S. A. and Dahiya, R. C., Estimating the Parameters of a Non-Homogeneous Poisson Process Model for Software Reliability, IEEE Trans. on Reliability, 42(4), 604–612 (1993). [Huang2004] Huang, C.-Y., Lin, C.-T., Lyu, M. R., and Sue, C.-C., Software Reliability Growth Models Incorporating Fault Dependency With Various Debugging Time Lags, in Proceedings of the 28th Annual International Computer Software and Applications Conference, COMPSAC 2004, 1, 186–191 (2004). [IEEE95] IEEE, Charter and Organization of the Software Reliability Engineering Committee (1995). [Jeske05a] Jeske, D. R., and Zhang, X., Some Successful Approaches to Software Reliability Modeling in Industry, Journal of Systems and Software, 74, 85–99 (2005). [Jeske05b] Jeske, D. R., Zhang, X. and Pham, L., Adjusting Software Failure Rates that are Estimated from Test Data, IEEE Transactions on Reliability, 54(1), 107–114 (2005). [Jones1991] Jones, C., Applied Software Measurement, McGraw-Hill (1991). [Keene94] Keene, S. J., Comparing Hardware and Software Reliability, Reliability Review, 14(4), 5–7, 21 (1994). [Keiller91] Keiller, P. A. and Miller, D. R., On the Use and the Perfor-
mance of Software Reliability Growth Models, Software Reliability and Safety, 32, 95–117 (1991). [Kemeny60] Kemeny, J. G., and Snell, J. L., Finite Markov Chains, Van Nostrand (1960). [Kremer83] Kremer, W., “Birth-Death and Bug Counting,” IEEE Transactions on Reliability,” R-32(1), pp. 37–47 (1983). [Levendel1989] Levendel, Y., Software Quality and Reliability Prediction: A Time-Dependent Model with Controllable Testing Coverage and Repair Intensity, in Proceedings of the Fourth Israel Conference on Computer Systems and Software Engineering, 175–181 (1989). [Lipow82] Lipow, M., Number of Faults per Line of Code, IEEE Trans. on Software Eng., 8(4), 437–439 (1982). [Littlewood2000] Littlewood, B., Popov, P. T., Strigini, L., and Shryane, N., Modeling the Effects of Combining Diverse Software Fault Detection Techniques, IEEE Transactions on Software Engineering, 26(12), 1157–1167 (2000). [Lyu95] Lyu, M. R., Software Fault Tolerance, Wiley (1995). [Lyu96] Lyu, M. R. (Ed.), Handbook of Software Reliability Engineering, IEEE Computer Society Press (1996). [Lyu2003] Lyu, M. R., Huang, Z., Sze, S. K. S., and Cai, X., An Empirical Study on Testing and Fault Tolerance for Software Reliability Engineering, in 14th International Symposium on Software Reliability Engineering, 119–130 (2003). [Malaiya1994] Malaiya, Y. K., Li, N., Bieman, J. Karcich, R., and Skibbe, B., The Relationship Between Test Coverage and Reliability, in Proceedings of 5th International Symposium on Software Reliability Engineering, 186–195 (1994). [Malaiya2002] Malaiya, Y. K. Li, M. N., Bieman, J. M., and Karcich, R., Software Reliability Growth With Test Coverage, in Transactions of Reliability Engineering, 420–426 (2002). [McCabe76] McCabe, T. J., A Complexity Measure, IEEE Transactions on Software Engineering, SE-2(4), 308–320 (1976). [Mellor87] Mellor, P., Software Reliability Modelling: the State of the Art, Information and Software Technology, 29(2), 81–88 (1987). [Miller86] Miller, D. R., Exponential Order Static Models of Software Reliability Growth, IEEE Transactions on Software Engineering, SE-12(1), 12–24 (1986). [Musa83] Musa J. and Okumoto K., Software Reliability Nodels: Concepts, Classification, Comparison, and Practice, Electronic Systems Effectiveness and Life Cycle Costing, J. K. Skwirzynski (Ed.), NATO ASI Series, F3, Spring-Verlag, 395–424 (1983). [Musa84a] Musa, J. D., Software Reliability. Handbook of Software Engineering, C. R. Vick and C. V. Ramamoorthy (Eds.), 392–412 (1984). [Musa84b] Musa, J. D. and Okumoto, K, A Logarithmic Poisson Execution
Time Model for Software Reliability Measurement, International Conference on Software Engineering, Orlando, Florida, 230–238 (1984). [Musa87] Musa, J., Iannino, A., and Okumoto, K., Software Reliability, McGraw-Hill (1987). [Musa98] Musa, J. D., Software Reliability Engineering, McGraw-Hill (1998). [NASA2002] National Aeronautics and Space Administration (NASA), Fault Tree Handbook with Aerospace Applications, Version 1.1 (2002). [NUREG-0492] U.S. Nuclear Regulatory Commission, Fault Tree Handbook, NUREG-0492 (1981). [Odeh77] Odeh, et al. Pocket Book of Statistical Tables, Marcel Dekker (1977). [Ohba82] Ohba, M. et al., S-shaped Software Reliability Growth Curve: How Good Is It? COMPSAC'82, 38–44 (1982). [Ohba84a] Ohba, M., Software Reliability Analysis Models, IBM Journal of Research Development, 28, 428–443 (1984). [Ohba84b] Ohba, M., Inflexion S-shaped Software Reliability Growth Models, in Stochastic Models in Reliability Theory, Osaki, S. and Hatoyama, Y. (Eds.), Springer, 144–162 (1984). [Ohba84c] Ohba, M. And Yamada, S., S-shaped Software Reliability Growth Models, Proc. 4th Int. Conf. Reliability and Maintainability, 430–436 (1984. [Ohtera90a] Ohtera H. and Yamada, S., Optimal Allocation and Control Problems for Software- Testing Resources, IEEE Trans. Reliab., R-39, 171–176 (1990). [Ohtera90b] Ohtera, H., and Yamada, S., Optimal Software-Release Time Considering an Error-Detection Phenomenon during Operation, IEEE Transactions on Reliability, 39(5), 596–599 (1990). [Ottenstein81] Ottenstein, L., Predicting Numbers of Errors Using Software Science, Proceedings of the 1981 ACM Workshop/Symposium on Measurement and Evaluation of Software Quality, 157–167 (1981). [Pham91] Pham, H., and Pham, M., Software Reliability Models for Critical Applications, Idaho National Engineering Laboratory, EG&G2663, 1991. [Pham93] Pham, H., Software Reliability Assessment: Imperfect Debugging and Multiple Failure Types in Software Development. Report EG&G-RAAM-10737; Idaho National Engineering Laboratory (1993). [Pham96] Pham, H., 1996, A Software Cost Model with Imperfect Debugging, Random Life Cycle and Penalty Cost, International Journal of Systems Science, 27(5), 455–463 (1996). [Pham97] Pham, H. and Zhang, X., An NHPP Software Reliability Model and Its Comparison, International Journal of Reliability, Quality and Safety Engineering, 4, 269–282 (1997). [Pham99] Pham, H. Nordmann, L. and Zhang X., A General Imperfect
Software Debugging Model with S-shaped Fault Detection Rate, IEEE Transactions on Reliability, 48(2), 169–175 (1999). [Pham2000] Pham, H., Software Reliability, Springer (2000). [Pham2002] Pham, L., Zhang, X., and Jeske, D. R., Scaling System Test Software Failure Rate Estimates For Use in Field Environments, in Proceedings of the Annual Conference of the American Statistical Association, pp. 2692–2696 (2002). [Pham2003] Pham, H., and Deng, C., Predictive-Ratio Risk Criterion for Selecting Software Reliability Models, in Proceeding of the Ninth ISSAT International Conference on Reliability and Quality in Design, Honolulu, Hawaii, pp. 17–21 (2003). [Pukite98] Pukite, J., and Pukite, P., Modeling for Reliability Analysis, IEEE Press (1998). [Rivers98] Rivers, A. T., and Vouk M. A., Resource-Constrained NonOperational Testing of Software, in Proceedings of the 9th International Symposium on Software Reliability Engineering, Paderborn, Germany, November, pp. 154–163, IEEE Computer Society Press (1998). [Sandler63] Sandler, G. H, System Reliability Engineering, Prentice-Hall (1963). [Schneidewind75] Schneidewind, N. F., Analysis of Error Processes in Computer Software. Sigplan Notices, 10, 337–346 (1975). [Schneidewind79] Schneidewind, N. and Hoffmann, H., An Experiment in Software Error Data Collection and Analysis, IEEE Trans. Software Eng., 5(3), 276–286 (1979). [Schneidewind92] Schneidewind, N. F., Applying Reliability Models to the Space Shuttle, IEEE Software, 28–33 (1992). [Schneidewind93] Schneidewind, N. F., Software Reliability Model with Optimal Selection of Failure Data, IEEE Trans. on Software Engineering, 19(11), 997–1007(1993). [Shooman68] Shooman, M. L., Probabilistic Reliability, An Engineering Approach. McGraw-Hill (1968). [Singpurwalla91] Singpurwalla, N. D., Determining an Optimal Time Interval for Testing and Debugging Software, IEEE Transactions on Software Engineering, 17(4), 313–319 (1991). [Stalhane92] Stalhane, T., Practical Experience with Safety Assessment of a System for Automatic Train Control, in Proceedings of SAFECOMP’92, Zurich, Switzerland, Pergamon Press (1992). [Telcordia08] Telcordia, Telcordia Roadmap to Reliability Documents, Issue 4, Aug 2008, Telcordia. [Tohma91] Tohma, Y., Yamano, H., Ohba, M., and Jacoby, R., The Estimation of Parameters of the Hypergeometric Distribution and its Application to the Software Reliability Growth Model, IEEE Transactions on Software Engineering, 17(5), 483–489 (1991). [Trivedi02] Trivedi, K., Probability and Statistics with Reliability,
Queueing, and Computer Science Applications, 2nd Edition, Wiley (2002). [Wood96] Wood, A., Predicting Software Reliability,” IEEE Computer Magazine, November, 69–77 (1996). [Wu2007] Wu, Y. P., Hu, Q. P., Xie, M., and Ng, S. H., Modeling and Analysis of Software Fault Detection and Correction Process by Considering Time Dependency, IEEE Transactions on Reliability, 56(4), 629–642 (2007). [Xie91] Xie, M., Software Reliability Engineering, World Scientific (1991). [Xie92] Xie, M., and Zhao, M., The Schneidewind Software Reliability Model Revisited, Proceedings of the Third International Symposium on Software Reliability Engineering, 184–193 (1992). [Yamada83] Yamada, S., Ohba, M., and Osaki, S., S-shaped Reliability Growth Modeling for Software Error Detection, IEEE Transactions on Reliability, 12, 475–484 (1983). [Yamada84] Yamada, S., et al., Software Reliability Analysis Based on an Nonhomogeneous Error Detection Rate Model, Microelectronics and Reliability, 24, 915–920 (1984). [Yamada85] Yamada, S., and Osaki, S., Discrete Software Reliability Growth Models, Applied Stochastic Models and Data Analysis, 1, 65–77 (1985). [Yamada86] Yamada, S., Ohtera, H., and Narihisa, H., Software Reliability Growth Models with Testing Effort, IEEE Trans. on Reliability, 35(1), 19–23 (1986). [Yamada90] Yamada, S. and Ohtera, H., Software Reliability Growth Models Testing Effort Control. European J. Operation Research, 46, 343–349 (1990). [Yamada91a] Yamada, S., Software Quality/Reliability Measurement and Assessment: Software Reliability Growth Models and Data Analysis, Journal of Information Processing, 14(3), 254–266 (1991). [Yamada91b] Yamada, S., Tokuno, K., and Osaki, S., Imperfect Debugging Models with Fault Introduction Rate for Software Reliability Assessment, International Journal of Systems Science, 23(12), 2253– 2264 (1991). [Yamada 1990b] Yamada, S. and Ohtera, H., Software Reliability Growth Model for Testing Effort Control, European J. Operation Research 46, 343–349 (1990). [Yamada92] Yamada, S., Tokuno, K., and Osaki, S., Imperfect Debugging Models with Fault Introduction Rate for Software Reliability Assessment, International Journal of Systems Science, 23(12), 2241–2252 (1992). [Zhang02] Zhang, X., Jeske, D. R., and Pham, H., Calibrating Software Reliability Models When the Test Environment Does Not Match the User Environment, Applied Stochastic Models in Business and Industry, 18, 87–99 (2002).
[Zhang06] Zhang, X., and Pham, H., Field Failure Rate Prediction Before Deployment, Journal of Systems and Software, 79(3) (2006). [Zhao92] Zhao, M. and Xie, M., On the Log-Power NHPP Software Reliability Model, Proceedings of the Third International Symposium on Software Reliability Engineering, 14–22 (1992).
The following textbooks document reliability modeling techniques and statistics background. Bagowsky, I., Reliability Theory and Practice, Prentice-Hall (1961). Crowder, M. J., Kimber, A., Sweeting, T., and Smith, R., Statistical Analysis of Reliability Data, Chapman & Hall/CRC Press (1994). Feller, W., An Introduction to Probability Theory and Its Applications, Vol., 1, Wiley (1968). Kemeny, J. G., and Snell, J. L., Finite Markov Chains, Van Nostrand (1960). Lyu, M. R., Software Fault Tolerance, Wiley (1995). Pukite, J., and Pukite, P., Modeling for Reliability Analysis, IEEE Press (1998). Sandler, G. H, System Reliability Engineering, Prentice-Hall (1963). Shooman, M. L., Probabilistic Reliability, An Engineering Approach. McGraw-Hill (1968). Trivedi, K., Probability and Statistics with Reliability, Queueing, and Computer Science Applications, 2nd Edition, Wiley (2002).
The following are references on reliability terminologies, specific reliability topics, statistic tables, and so on. Akaike, H., A New Look at Statistical Model Identification, IEEE Transactions on Automatic Control, 19, 716–723 (1974). ANSI/IEEE, Standard Glossary of Software Engineering Terminology, STD-729-1991, ANSI/IEEE (1991). Baldwin, C., J. et al., Mathematical Models for Use in the Simulation of Power Generation Outage, II, Power System Forced Outage Distributions, AIEE Transactions, 78, TP 59–849 (1954). Chaudhuri, A., and Stenger, H., Survey Sampling: Theory and Methods, Second Edition, Chapman & Hall/CRC Press (2005). Gossett W. S., The Probable Error of a Mean, Biometrika 6(1), 1–25 (1908). IEEE, Charter and Organization of the Software Reliability Engineering Committee (1995). Keene, S. J., Comparing Hardware and Software Reliability, Reliability Review, 14(4), 5–7, 21 (1994).
National Aeronautics and Space Administration (NASA), Fault Tree Handbook with Aerospace Applications, Version 1.1 (2002). Odeh, et al. Pocket Book of Statistical Tables, Marcel Dekker (1977). Stahlhane, T., Practical Experience with Safety Assessment of a System for Automatic Train Control, in Proceedings of SAFECOMP’92, Zurich, Switzerland, Pergamon Press (1992). U.S. Nuclear Regulatory Commission, Fault Tree Handbook, NUREG0492 (1981).
The following references document software reliability modeling techniques and applications. These are SRGM textbooks that readers who are new to software reliability modeling will find very useful. Jones, C., Applied Software Measurement, McGraw-Hill (1991). Lyu, M. R. (Ed.), Handbook of Software Reliability Engineering, IEEE Computer Society Press (1996). Musa, J. D., Software Reliability Engineering, McGraw-Hill (1998). Pham, H., Software Reliability, Springer (2000). Xie, M., Software Reliability Engineering, World Scientific (1991).
Useful papers on software system data analysis and field failure rate predictions. Jeske, D. R., and Zhang, X., Some Successful Approaches to Software Reliability Modeling in Industry, Journal of Systems and Software, 74, 85–99 (2005). Jeske, D. R., Zhang, X. and Pham, L., Adjusting Software Failure Rates that are Estimated from Test Data, IEEE Transactions on Reliability, 54(1), 107–114 (2005). Zhang, X., Jeske, D. R., and Pham, H., Calibrating Software Reliability Models When the Test Environment Does Not Match the User Environment, Applied Stochastic Models in Business and Industry, 18, 87–99 (2002). Zhang, X., and Pham, H., Field Failure Rate Prediction Before Deployment, Journal of Systems and Software, 79(3) (2006).
Useful papers on widely used software reliability growth models. Goel, A. L. and Okumoto, K., Time-Dependent Fault Detection Rate Model for Software and Other Performance Measures, IEEE Transactions on Reliability, 28, 206–211 (1979). Goel, A. L. and Okumoto, K., A Markovian Model for Reliability and Other Performance Measures of Software Systems, in Proceedings of the National Computer Conference, pp. 769–774 (1979).
Keiller, P. A. and Miller, D. R., On the Use and the Performance of Software Reliability Growth Models, Software Reliability and Safety, 32, 95–117 (1991). Kremer, W., “Birth-Death and Bug Counting,” IEEE Transactions on Reliability,” R-32(1), pp. 37–47 (1983). Ohba, M., Inflexion S-shaped Software Reliability Growth Models, in Stochastic Models in Reliability Theory, Osaki, S. and Hatoyama, Y. (Eds.), Springer, 144–162 (1984). Pham, H. and Zhang, X., An NHPP Software Reliability Model and Its Comparison, International Journal of Reliability, Quality and Safety Engineering, 4, 269–282 (1997). Pham, H. Nordmann, L. and Zhang X., A General Imperfect Software Debugging Model with S-shaped Fault Detection Rate, IEEE Transactions on Reliability, 48(2), 169–175 (1999). Pham, L., Zhang, X., and Jeske, D. R., Scaling System Test Software Failure Rate Estimates For Use in Field Environments, in Proceedings of the Annual Conference of the American Statistical Association, pp. 2692–2696 (2002). Pham, H., and Deng, C., Predictive-Ratio Risk Criterion for Selecting Software Reliability Models, in Proceeding of the Ninth ISSAT International Conference on Reliability and Quality in Design, Honolulu, Hawaii, pp. 17–21 (2003). Rivers, A. T., and Vouk M. A., Resource-Constrained NonOperational Testing of Software, in Proceedings of the 9th International Symposium on Software ReliabilityEngineering, Paderborn, Germany, November, pp. 154–163, IEEE Computer Society Press (1998). Yamano, H., Ohba, M., and Jacoby, R., The Estimation of Parameters of the Hypergeometric Distribution and its Application to the Software Reliability Growth Model, IEEE Transactions on Software Engineering, 17(5), 483–489 (1991). Wood, A., Predicting Software Reliability,” IEEE Computer Magazine, November, 69–77 (1996). Yamada, S., Ohba, M., and Osaki, S., S-shaped Reliability Growth Modeling for Software Error Detection, IEEE Transactions on Reliability, 12, 475–484 (1983). Yamada, S., Tokuno, K., and Osaki, S., Imperfect Debugging Models with Fault Introduction Rate for Software Reliability Assessment, International Journal of Systems Science, 23(12), 2241–2252 (1992).
Software reliability growth models incorporating fault detection and removal processes are discussed in the following. Gokhale, S. S., Lyu, M. R., and Trivedi, K. S., Software Reliability Analysis Incorporating Fault Detection and Debugging Activities, in
Proceedings of the Ninth International Symposium on Software Reliability Engineering, 4–7 November, 202–211 (1998). Huang, C.-Y., Lin, C.-T., Lyu, M. R., and Sue, C.-C., Software Reliability Growth Models Incorporating Fault Dependency With Various Debugging Time Lags, in Proceedings of the 28th Annual International Computer Software and Applications Conference, COMPSAC 2004, 1, 186–191 (2004). Littlewood, B., Popov, P. T., Strigini, L., and Shryane, N., Modeling the Effects of Combining Diverse Software Fault Detection Techniques, IEEE Transactions on Software Engineering, 26(12), 1157–1167 (2000). Wu, Y. P., Hu, Q. P., Xie, M., and Ng, S. H., Modeling and Analysis of Software Fault Detection and Correction Process by Considering Time Dependency, IEEE Transactions on Reliability, 56(4), 629–642 (2007).
Software reliability growth models incorporating testing coverage are covered in the following references. Cai, X., and Lyu, M. R., Software Reliability Modeling with Test Coverage: Experimentation and Measurement With a Fault-Tolerant Software Project, in The 18th IEEE International Symposium on Software Reliability, 17–26 (2007). Gokhale, S. S., and Mullen, R. E., From Test Count to Code Coverage Using the Lognormal Failure Rate, in 15th International Symposium on Software Reliability Engineering, 295–305 (2004). Levendel, Y., Software Quality and Reliability Prediction: A TimeDependent Model with Controllable Testing Coverage and Repair Intensity, in Proceedings of the Fourth Israel Conference on Computer Systems and Software Engineering, 175–181 (1989). Lyu, M. R., Huang, Z., Sze, S. K. S., and Cai, X., An Empirical Study on Testing and Fault Tolerance for Software Reliability Engineering, in 14th International Symposium on Software Reliability Engineering, 119–130 (2003). Malaiya, Y. K., Li, N., Bieman, J. Karcich, R., and Skibbe, B., The Relationship Between Test Coverage and Reliability, in Proceedings of 5th International Symposium on Software Reliability Engineering, 186–195 (1994). Malaiya, Y. K. Li, M. N., Bieman, J. M., and Karcich, R., Software Reliability Growth With Test Coverage, in Transactions of Reliability Engineering, 420–426 (2002).
Materials on software metrics (such as complexity, etc.). Website of Function Points, http://www.ifpug.org/.
Halstead, M. H., Elements of Software Science, Elsevier North-Holland (1977). McCabe, T. J., A Complexity Measure, IEEE Transactions on Software Engineering, SE-2(4), 308–320 (1976).
The following standards documents describe the methodology used to determine the downtime, availability, and failure rate estimates in system reliability analysis. Telcordia standard documents: Special Report SR-TSY-001171, Methods and Procedures for System Reliability Analysis, Issue 2, November 2007, Telcordia. Special Report SR-332, Reliability Prediction Procedure for Electronic Equipment, Issue 2, January 2006, Telcordia Technologies. GR-63-CORE, NEBS Requirements: Physical Protection, Issue 2, April 2002, Telcordia. GR-357-CORE, Generic Requirements for Assuring the Reliablity of Components Used in Telecommunications Equipment, Issue 1, March 2001, Telcordia. SR-TSY-000385, Bell Communications Research Reliability Manual, Issue 1, June 1986, Bell Communications Research. GR-418-CORE, Generic Reliability Assurance Requirements for Fiber Optic Transport Systems, Issue 2, December 1999, Telcordia. GR-512-CORE, LSSGR: Reliability, Chapter 12, Issue 2, January 1998, Telcordia. GR-874-CORE, An Introduction to the Reliability and Quality Generic Requirements (RQGR), Issue 3, April 1997, Bellcore. GR-929, Reliability and Quality Measurements for Telecommunications Systems (RQMS-Wireline), Issue 8, December 2002, Telcordia. GR-1339-CORE, Generic Reliability Requirements for Digital Cross-Connect Systems, Issue 1, March 1997, Bellcore. GR-1929, Reliability and Quality Measurements for Telecommunications Systems (RQMS-Wireless), Issue 1, December 1999, Telcordia. GR-2813, Generic Requirements for Software Reliability Prediction, Issue 1, December 1993, Bellcore. GR-2841-CORE, Generic Requirements for Operations Systems Platform Reliability, Issue 1, June 1994, Telcordia.
Other documents related to reliability standards: IEC 60300-3-1, Dependability Management—Part 3-1: Application Guide—Analysis Techniques for Dependability—Guide on Methodology, International Electrotechnical Commission, second edition (2003).
IEC 61713, Software Dependability through the Software Life-Cycle Process Application Guide (2000). NESAC Recommendations, http://questforum.asq.org/public/nesac/index.shtml. NRIC Best Practices: http://www.nric.org. TL 9000 Quality Measurement System, Measurements Handbook, Release 4.0, QuEST Forum, December 31, 2006.
More Telcordia reliability and quality documents from “Telcordia Roadmap to Reliabiltiy Documents” (there may be some overlap with the documents listed previously): GR-282-CORE, Software Reliability and Quality Acceptance Criteria (SRQAC), Issue 4, July 2006, Telecordia. GR-326-CORE, Generic Requirements for Singlemode Optical Connectors and Jumper Assemblies, Issue 3, September 1999, Telecordia. GR-357-CORE, Generic Requirements for Assuring the Reliability of Components Used in Telecommunications Equipment, Issue 1, March 2001, Telcordia.. GR-418-CORE, Generic Reliability Assurance Requirements for Fiber Optic Transport Systems, Issue 2, December 1999, Telcordia. GR-449-CORE, Generic Requirements and Design Considerations for Fiber Distributing Frames, Issue 2, July 2003, Telcordia. GR-468-CORE, Generic Reliability Assurance Requirements for Optoelectronic Devices Used in Telecommunications Equipment, Issue 2, September 2004, Telcordia. GR-508-CORE, Automatic Message Accounting (AMA), Issue 4, September 2003, Telcordia. GR-910-CORE, Generic Requirements for Fiber Optic Attenuators, Issue 2, December 2000, Telcordia. GR-929-CORE, Reliability and Quality Measurements for Telecommunications Systems (RQMS-Wireline), Issue 8, December 2002, Telcordia. GR-1110-CORE, Broadband Switching System (BSS) Generic Requirements, Issue 4, December 2000, Telcordia. GR-1221-CORE, Generic Requirements Assurance Requirements for Passive Optical Components, Issue 2, January 1999, Telcordia. GR-1241-CORE, Supplemental Service Control Point (SCP) Generic Requirements, Issue 7, December 2006, Telcordia. GR-1274-CORE, Generic Requirements for Reliability Qualification Testing of Printed Wiring Assemblies Exposed to Airborne Hygroscopic Dust, Issue 1, July 1994, Telcordia. GR-1280-CORE, Advanced Intelligent Network (AIN) Service Control Point (SCP) Generic Requirements, Issue 1, November 1993, Telcordia.
GR-1312-CORE, Generic Requirements for Optical Fiber Amplifiers and Proprietary Dense Wavelength-Division Multiplexed Systems, Issue 3, April 1999, Telecordia. GR-2813-CORE, Generic Requirements for Software Reliability Prediction, Issue 1, December 1993, Bellcore. GR-2853-CORE, Generic Requirements for AM/Digital Video Laser Transmitters, Optical Fiber Amplifiers and Receivers, Issue 3, December 1996. GR-3020-CORE, Nickel Cadmium Batteries in the Outside Plant, Issue 1, April 2000, Telcordia. SR-TSY-000385, Bell Communications Research Reliability Manual, Issue 1, July 1986, Telcordia. SR-TSY-001171, Methods and Procedures for System Reliability Analysis, Issue 2, November 2007, Telcordia. SR-1547, The Analysis and Use of Software Reliability and Quality Data, Issue 2, December 2000, Telecordia.
INDEX
Active-active model, 70–72 Active-standby model, 72–73 Activity Recorder, 190 Application restart time, 65 Application restart success probability, 66 Asserts, 195 Assumptions, modeling, 76–78 Attributability, 11, 18 customer-attributable, 11, 26 external causes, 36 product-attributable, 11, 27 hardware, 19 software, 19 Audits, 197 Automate procedures, 203 Automatic failover time, 68, 131, 148 Automatic failover success probability, 68, 132, 149 Automatic fault detection, 205 Automatic recoveries, 23, 25 Automatic response procedure display, 205 Availability: classical view, 8 conceptual model, 15 customer’s view, 9 definition, 6, 59, 221–222, 226–227 road map , 209–210 Backout, 206 Binomial distribution, 228–230
Budgets (downtime), 26–29 Bus parity, 188 Bus watchdog timers, 189 Calibration factor, 114 Camp-on diagnostics, 200 Capacity loss, 10–12 Checksums, 199 Circuit breakers, 183 Clock failure detectors, 190 Compensation policies, 33 Complexity, 142 Confidence intervals, 105–106, 237–243 Connectors, 183 Coverage, 107, 129–130 hardware, 67, 139 software, 67 Covered fault detection time, 64, 109, 139 Counting rules, 12, 61 CPU watchdog timers, 187 CRC: message, 197 serial bus, 188 Critical failure-mode monitoring, 199 Critical severity problem, 14 Customer attributable (outages/downtime), 11, 26, 60 Customer policies and procedures, 32 Cut set, 53
Dataset characterization, 171 Diagnostics: camp-on, 200 routine, 200 runtime, 201 Defect: severity, 14 software, 114–119 Displaying reminders, 205 Distribution: probability, 228–237 Downtime: customer attributable, 60 definition, 59, 227 product attributable, 59 unplanned, 61 Duration: outage duration, 11, 19 parking duration, 35 uncovered failure detection, 65 Element availability, 12 Error checking and correcting on memory, 189 Exclusion rules, 12, 61 Exposure time, 123, 170 Exponential distribution, 230–231 Failed-restart detection probability, 68 Fail-safe fan controllers, 181 Failover time, 68, 131 Failure: hardware, 19, 21 software, 19, 21 covered, 66–67 category, 16, 19, 20 Failure intensity function, 251 Failure rate: calibration, 208–209 estimating, 108 function, 224–226 hardware, 62–63, 111–114, 138–139 prediction, 250 software, 63–64, 114–115 Fans, 181, 182 Fan alarms, 181 Fan controllers, 181, 182
Fan fusing, 184 Fault detection and correction process, 248 Fault tree, 52–53 Fault insertion: testing,129 Fault tree models, 38, 52–53 Feasibility, 206–208 FIT, 46 Field data analysis: alarm, 106–108 outage, 96–106 Field replaceable: circuit breakers,183 electronics, 184 fans, 181 power supplies,181 Full-application restart time, 65 Full-application restart success probability, 66 Function point, 114 Gamma distribution, 234–235 Goel-Okumoto model, 116 Growth (reliability growth), 115, 209 Hardware failures, 19, 21 Hardware failure rate, 62–63, 90–91 Hardware fault coverage, 67 Hardware fault injection testing, 187 Hardware MTTR, 140 Hardware redundancy, 187 Heartbeats, 194 Helping the humans, 203 Hot swap, 185 Imperfect debugging, 247, 252 Independently replaceable fan controllers,182 Input checking, 205 Insertion/injection of faults, 187 JTAG fault insertion,188 Lab: data, 114–135 test, 208 Lognormal distribution, 236–237
Major severity problem, 14 Manual failover time, 67, 131, 149 Manual failover success probability, 68, 135, 149 Manual recoveries, 24 Markov models, 38, 42–52 Maturity, 141–142 Maximum likelihood estimate (MLE), 116 Mean value function, 251 Memory: error checking and correcting,189 leak detection, 194 protection, 193 Message: CRC/parity,197 validation, 197 Minimal cut set models, 38, 53–55 Minimize tasks/decisions per procedure, 203 Minor severity problem, 14 Modeling: assumptions, 76–78 definitions, 58–69 fault tree, 38, 52–53 Markov, 38, 42–52 minimal cut set, 38, 53–55 Monte Carlo Simulation, 38, 57–58 parameters, 87, 152 petri net, 38, 55–57 reliability block diagrams, 38, 39–42 standards, 92–93 Models : active-active, 70–72 active-standby, 72–73 N+K redundancy, 73–75 N-out-of-M redundancy, 75–76 simplex, 69–70 MTTR, 8, 140 MTTF, 8, 223–224 MTTO, 102 N+1 fans, 181 N+K : protection, 199 redundancy model, 73–75 N-out-of-M redundancy model, 75–76
Network element impact outage, 12 Nonhomogenous Poisson process (NHPP), 115 Normal distribution, 235–236 Normalization factors, 13 Null pointer access detection, 193 Outage: classifications, 20, 100 definition, 11 downtime, 12 duration, 11, 19 exclusions, 12 partial outage, 60 Overload detection and control, 193 Parameter validation, 196 Parity: bus, 188 message,197 parking, 35 Partial outage, 60 Pass rate, 129–135 Petri net models, 38, 55–57 Planned downtime, 28 Poisson distribution, 229–230 Postmortem data collection, 199 Power: feeds, 180, 183 supplies, 181 supply monitoring, 188 switch protection, 184 Power-on self-tests, 200 Prediction: accuracy, 167–171 error, 172 software failure rate prediction, 114–129, 140–145 Primary functionality, 32, 58 Problem severities, 14 Procedures: automating, 203 clarity, 204 documenting, 203, 205 making intuitive, 204 similarity, 204 simplification, 204 testing, 203 Procedural outages, 26, 27
Process monitoring, 194 Process restart time, 65 Process restart success probability, 66 Product attributable outage/downtime, 11, 27, 59 Progress indicator, 204 Pro-rating outages/downtime, 60 QuEST forum, 5 Rayleigh distribution, 233–234 Reboot time, 65, 200 Reboot success probability, 65 Recovery: automatic recovery, 23,24 manual emergency recovery, 24 planned/scheduled recovery, 24–26 time, 64, 130, 145–146 Redundant power feeds, 180 Reliability, 7, 222–223 block diagrams, 38, 39–42 definition, 7 report, 110–111, 215–220 road map, 209–210 Requirements: availability requirements, 179 fault coverage requirement, 15 recovery time requirement, 145–146 Return code checking, 195 Repair time, 64, 140 Report system state, 206 Residual defects, 118 Reuse, 141 Road map, 209–210 Rolling updates, 198 Routine diagnostics, 200 Runtime consistency checking, 197 Runtime diagnostics, 201 Safe point identification, 205 Sampling error, 172 Scheduled events, 32 Sensitivity analysis, 149–166 Service impact outage, 12 Service life, 63 Service , 7
Service affecting, 120, 123 Severity, 120, 123 Simplex model, 69 Single process restart time, 65 Single process restart success probability, 66 Size (software), 141 Soft error detection, 189 Software failures, 21 Software failure rate, 63–64, 91–92, 140–145 Software fault coverage, 67, 145 Software metrics, 141 complexity, 142, 245 maturity, 141 reuse, 141 size, 141 Software reliability growth modeling, 115–129 application, 249 concave, 126 fault detection and correction process, 248 hypergeometric models, 249 failure intensity function, 251 mean value function, 251 imperfect debugging, 247, 252 parameter estimation, 253 residual defects, 118 SRGM model selection criteria, 255 S-shaped, 124–127 testing coverage, 248 Software updates, 198 S-shaped model, 124–127 Standards, 89–94 System activity recorder, 190 System state, 206 Task monitoring, 194 Techniques: hardware, 186–192 physical design, 179–186 procedural techniques , 202–206 software, 192–202 Temperature monitoring, 183 Tight loop detection, 198 Timeouts, 197 TL 9000, 5
Unavailability, definition, 59 Uncovered failure recovery time, 108, 131, 146–147 Uncovered fault detection time, 65 Undo, 206 Unplanned downtime, 61 Validation: message, 197 parameter, 196
Visual status indicators, 185 Watchdog timers: bus, 189 CPU, 187 Weibull distribution, 231–233 Widget example, 78–89 Yamada exponential model, 126
ABOUT THE AUTHORS
ERIC BAUER is technical manager of Reliability Engineering in the Wireline Division of Alcatel-Lucent. He originally joined Bell Labs to design digital telephones and went on to develop multitasking operating systems on personal computers. Mr. Bauer then worked on network operating systems for sharing resources across heterogeneous operating systems, and developed an enhanced, high-performance UNIX file system to facilitate file sharing across Microsoft, Apple, and UNIX platforms, which led to work on an advanced Internet service platform at AT&T Labs. Mr. Bauer then joined Lucent Technologies to develop a new Java-based private branch exchange (PBX) telephone system that was a forerunner of today's IP Multimedia Subsystem (IMS) solutions, and later worked on a long-haul/ultra-long-haul optical transmission system. When Lucent centralized reliability engineering, Mr. Bauer joined the Lucent reliability team to lead a reliability group, and has since worked on reliability engineering for a variety of wireless and wireline products and solutions. He has been awarded 11 U.S. patents, holds a Bachelor of Science degree in Electrical Engineering from Cornell University, Ithaca, New York, and a Master of Science degree in Electrical Engineering from Purdue University, West Lafayette, Indiana. He lives in Freehold, New Jersey.

XUEMEI ZHANG received her Ph.D. in Industrial Engineering and her Master of Science degree in Statistics from Rutgers University, New Brunswick, New Jersey. Currently she is a principal member of technical staff in the Network Design and Performance
Analysis Department in AT&T Labs. Prior to joining AT&T Labs, she has worked in the Performance Analysis Department and the Reliability Department in Bell Labs in Lucent Technologies (and later Alcatel-Lucent), in Holmdel, New Jersey. She has been working on reliability and performance analysis of wire line and wireless communications systems and networks. Her major work and research areas are system and architectural reliability and performance, product and solution reliability and performance modeling, and software reliability. She has published more than 30 journal and conference papers. She has 6 awarded and pending U.S. patent applications in the areas of system redundancy design, software reliability, radio network redundancy, and end-to-end solution key performance and reliability evaluation. She has served as a program committee member and conference session chair for international conferences and workshops. She was an invited committee member for Ph.D. and Master theses at Rutgers University, Piscataway, New Jersey and New Jersey Institute of Technology, Newark, New Jersey. Dr. Zhang is the recipient of a number of awards and scholarships, including the Bell Labs President's Gold Awards in 2002 and 2004, Bell Labs President's Silver Award in 2005, Best Contribution Award 3G WCDMA in 2000 and 2001, fellowship and scholarships from Rutgers University. DOUGLAS A. KIMBER earned his Bachelor of Science in Electrical Engineering degree from Purdue University, West Lafayette, Indiana, and a Master of Science in Electrical Engineering degree from the University of Michigan, Ann Arbor, Michigan. He began his career designing telecommunications circuit boards for an Integrated Services Digital Network (ISDN) packet switch at AT&T Bell Labs. He followed the corporate transition from AT&T to Lucent Technologies, and finally to Alcatel-Lucent. During this time he did software development in the System Integrity department, which was responsible for monitoring and maintaining service on the 5ESS digital switch. This is where he got his start in system reliability. He then moved on to develop circuitry and firmware for the Reliable Clustered Computing (RCC) Department. RCC created hardware and software that enhanced the reliability of commercial products, and allowed Mr. Kimber to work on all aspects of system reliability. After RCC Mr. Kimber did systems engineering and architecture, and ultimately worked in the Reliability Department where he was able to apply his experience to analyze and
improve the reliability of a variety of products. He holds 4 U.S. patents with several more pending, and was awarded two Bell Laboratories President's Silver Awards in 2004. Mr. Kimber is currently retired and spends his time pursuing his hobbies, which include circuit design, software, woodworking, automobiles (especially racing), robotics, and gardening.