Alessandro Birolini Reliability Engineering
Alessandro Birolini
Reliability Engineering Theory and Practice Fifth edition
With 140 Figures, 60 Tables, 120 Examples, and 50 Problems
Prof. Dr. Alessandro Birolini∗
Ponte Vecchio – Torre degli Amidei
I-50122 Firenze
Tuscany, Italy
email: [email protected]

∗ Ingénieur et penseur, Ph.D., Professor Emeritus of Reliability Engineering at the Swiss Federal Institute of Technology (ETH), Zürich
biography on: www.ethz.ch/people/whoiswho
Library of Congress Control Number: 2007921004
First and second edition printed under the title “Quality and Reliability of Technical Systems”

ISBN 978-3-540-49388-4 5th ed. Springer Berlin Heidelberg New York
ISBN-10 3-540-40287-X 4th ed. Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable for prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media
springer.com

© Springer-Verlag Berlin Heidelberg 1994, 1997, 1999, 2004, and 2007

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: Camera ready by author
Production: LE-TEX Jelonek, Schmidt & Vöckler GbR, Leipzig
Cover-Design: medio Technologies AG
Printed on acid-free paper
62/3100/YL - 5 4 3 2 1 0

"La chance vient à l'esprit qui est prêt à la recevoir." 1)
Louis Pasteur

"Quand on aperçoit combien la somme de nos ignorances dépasse celle de nos connaissances, on se sent peu porté à conclure trop vite." 2)
Louis De Broglie

"One has to learn to consider causes rather than symptoms of undesirable events and avoid hypocritical attitudes."
A. B.

1) "Opportunity comes to the intellect which is ready to receive it."
2) "When one recognizes how much the sum of our ignorance exceeds that of our knowledge, one is less ready to draw rapid conclusions."
Preface to the 5th Edition

This 5th edition differs from the 4th in some refinements and extensions, mainly on the investigation and test of complex repairable systems. For phased-mission systems, a new approach is given for both reliability and availability (Section 6.8.6.2). Effects of common cause failures (CCF) are carefully investigated for a 1-out-of-2 redundancy (6.8.7). Petri nets and dynamic FTA are introduced as alternative investigation methods for repairable systems (6.9). Approximate expressions are further developed. A unified approach for availability estimation and demonstration is given for exponentially and Erlangian distributed failure-free and repair times (7.2.2, A8.2.2.4, A8.3.1.4). Confidence limits at system level are given for the case of constant failure rates (7.2.3.1). Investigation of nonhomogeneous Poisson processes is refined and more general point processes (superimposed, cumulative) are discussed (A7.8), with application to data analysis (7.6.2) & cost optimization (4.7). Trend tests to detect early failures or wearout are introduced (7.6.3). A simple demonstration for mean & variance in a cumulative process is given (A7.8.4). Expansion of a redundancy 2-out-of-3 to a redundancy 1-out-of-3 is discussed (2.2.6.5). Some present production-related reliability problems in VLSI ICs are shown (3.3.4). Maintenance strategies are reviewed (4.6).

As in the previous editions of this book, reliability figures at system level have indices Si (e.g. MTTF_Si), where S stands for system and i is the state entered at t = 0 (Table 6.2). Furthermore, considering that for a repairable system, operating times between system failures can be neither identically distributed nor independent, failure rate is confined to nonrepairable systems or to repairable systems which are as-good-as-new after repair. Failure intensity is used for general repairable systems. For the cases in which renewal is assumed to occur, the variable x, starting with x = 0 at each renewal, is used instead of t, as for interarrival times. Also because of the estimate MTBF = T/k, often used in practical applications, MTBF is confined to repairable systems whose failure occurrence can be described by a homogeneous Poisson process, for which (and only for which) interarrival times are independent exponentially distributed random variables with the same parameter λ_S and mean MTBF_S = 1/λ_S (p. 358). For Markov and semi-Markov models, MUT_S is used (pp. 265, 477). Repair is used as a synonym for restoration, with the assumption that repaired elements in a system are as-good-as-new after repair (the system is as-good-as-new, with respect to the state considered, only if all nonrepaired elements have constant failure rate).

Reliability growth has been moved to Chapter 7, and Table 3.2 on electronic components has been put in the new Appendix A10. A set of problems for homework assignment has been added in the new Appendix A11.

This edition extends and replaces the previous editions. The comments of many friends and the agreeable cooperation with Springer-Verlag are gratefully acknowledged.

Zurich and Florence, September 13, 2006
Alessandro Birolini
Preface to the 4th Edition

The great interest shown in this book made a 4th edition necessary. The structure of the book is unchanged, with its main part in Chapters 1 - 8 and self-contained appendices A1 - A5 on management aspects and A6 - A8 on basic probability theory, stochastic processes & statistics. Such a structure allows rapid access to practical results and a comprehensive introduction to the mathematical foundation of reliability theory. The content has been extended and reviewed. New models & considerations have been added to Appendix A7 for stochastic processes (NHPP), Chapter 4 for spare parts provisioning, Chapter 6 for complex repairable systems (imperfect switching, incomplete coverage, items with more than two states, phased-mission systems, fault tolerant reconfigurable systems with reward and frequency / duration aspects, Monte Carlo simulation), and Chapters 7 & 8 for reliability data analysis. Some results come from a stay in 2001 as Visiting Fellow at the Institute of Advanced Study of the University of Bologna.

Performance, dependability, cost, and time to market are key factors for today's products and services. However, failure of complex systems can have major safety consequences. Also here, one has to learn to consider causes rather than symptoms of undesirable events and avoid hypocritical attitudes. Reliability engineering can help. Its purpose is to develop methods and tools to evaluate and demonstrate reliability, maintainability, availability, and safety of components, equipment & systems, and to support development and production engineers in building in these characteristics. To build reliability, maintainability, and safety into complex systems, failure rate and failure mode analyses must be performed early in the development phase and be supported (as far as possible) by failure mechanism analysis, design guidelines, and design reviews. Before production, qualification tests are necessary to verify that targets have been achieved. In the production phase, processes have to be qualified and monitored to assure the required quality level. For many systems, availability requirements have to be met and stochastic processes are used to investigate and optimize reliability and availability, including logistic support as well. Software often plays a dominant role, requiring specific quality assurance activities. Finally, to be cost and time effective, reliability engineering has to be coordinated with quality management (TQM) efforts, including value engineering and concurrent engineering, as appropriate.

This book presents the state-of-the-art of reliability engineering in theory and practice. It is a textbook based on the author's experience of 30 years in this field, half in industry and as founder of the Swiss Test Lab. for VLSI ICs in Neuchâtel, and half as Professor (full since 1992) of Reliability Engineering at the Swiss Federal Institute of Technology (ETH), Zurich. It also reflects the experience gained in an effective cooperation between University and industry over 10 years with more than 30 medium and large industries [1.2 (1996)]+). Following Chapter 1, the book is structured in three parts:

1. Chapters 2 - 8 deal with reliability, maintainability, and availability analysis and test, with emphasis on practical aspects in Chapters 3, 5, and 8. This part answers the question of how to build in, evaluate, and demonstrate reliability, maintainability, and availability.

2. Appendices A1 - A5 deal with definitions, standards, and program plans for quality and reliability assurance / management of complex systems. This minor part of the book has been added to comment on definitions and standards, and to support managers in answering the question of how to specify and achieve high reliability targets for complex systems, when tailoring is not mandatory.

3. Appendices A6 - A8 give a comprehensive introduction to probability theory, stochastic processes, and statistics, as needed in Chapters 2, 6, and 7, respectively. Markov, semi-Markov, and semi-regenerative processes are introduced with a view developed by the author in [A7.2 (1975 & 1985)]. This part is addressed to system oriented engineers.

Methods and tools are presented in a way that they can be tailored to cover different levels of reliability requirements (the reader has to select this level). Investigation of repairable systems is performed systematically for many of the structures occurring in practical applications, starting with constant failure and repair rates and generalizing step by step up to the case in which the process involved is regenerative with a minimum number of regeneration states. Considering for each element MTTR (mean time to repair) << MTTF (mean time to failure), it is shown that the shape of the repair time distribution has a small influence on the results at system level and, for constant failure rate, the reliability function at the system level can often be approximated by an exponential function. For large series - parallel systems, approximate expressions for reliability and availability are developed in depth, in particular using macro structures as introduced by the author in [6.5 (1991)]. Procedures to investigate repairable systems with complex structure (for which a reliability block diagram often does not exist) are given as further application of the tools introduced in Appendix A7, in particular for imperfect switching, incomplete fault coverage, elements with more than two states, phased-mission systems, and fault tolerant reconfigurable systems with reward & frequency / duration aspects. New design rules have been added for imperfect switching and incomplete coverage. A Monte Carlo approach useful for rare events is given. Spare parts provisioning is discussed for decentralized and centralized logistic support. Estimation and demonstration of a constant failure rate and statistical evaluation of general reliability data are considered in depth. Qualification tests and screening for components and assemblies are discussed in detail. Methods for causes-to-effects analysis, design guidelines for reliability, maintainability & software quality, and checklists for design reviews are considered carefully. Cost optimization is investigated for some practical applications. Standards and trends in quality management are discussed. A large number of tables, figures, and examples support practical aspects.

It is emphasized that care is necessary in the statistical analysis of reliability data (in particular for accelerated tests and reliability growth), that causes-to-effects analysis should be performed systematically at least where redundancy appears (also to support remote maintenance), and that further efforts should be made to develop approximate expressions for complex repairable systems as well as models for fault tolerant systems with hardware and software. Most of the methods & tools given in this book can be used to investigate / improve safety as well, which no longer has to be considered separately from reliability (although modeling human aspects can lead to some difficulties). The same holds for process and services reliability.

The book has been used for many years (1st German Ed. 1985, Springer) as a three-semester textbook for beginning graduate students at the ETH Zurich and for courses aimed at engineers in industry. The basic course (Chapters 1, 2, 5 & 7, with introduction to Chapters 3, 4, 6 & 8) should belong to the curriculum of most engineering degrees.

This edition extends and reviews the 3rd Edition (1999). It aims further to establish a link between theory and practice, to be a contribution to a continuous learning program and a sustainable development, and to support creativity (stimulated by an internal confidence and a deep observation of nature, but restrained by excessive bureaucracy or depersonalization). The comments of many friends and the agreeable cooperation with Springer-Verlag are gratefully acknowledged.

Zurich and Florence, March 2003
+) For [...], see References at the end of the book.
Alessandro Birolini
Contents
1 Basic Concepts, Quality and Reliability Assurance of Complex Equipment & Systems
  1.1 Introduction
  1.2 Basic Concepts
    1.2.1 Reliability
    1.2.2 Failure
    1.2.3 Failure Rate
    1.2.4 Maintenance, Maintainability
    1.2.5 Logistic Support
    1.2.6 Availability
    1.2.7 Safety, Risk, and Risk Acceptance
    1.2.8 Quality
    1.2.9 Cost and System Effectiveness
    1.2.10 Product Liability
    1.2.11 Historical Development
  1.3 Basic Tasks & Rules for Quality & Reliability Assurance of Complex Equip. & Systems
    1.3.1 Quality and Reliability Assurance Tasks
    1.3.2 Basic Quality and Reliability Assurance Rules
    1.3.3 Elements of a Quality Assurance System
    1.3.4 Motivation and Training

2 Reliability Analysis During the Design Phase (Nonrepairable Items up to System Failure)
  2.1 Introduction
  2.2 Predicted Reliability of Equipment and Systems with Simple Structure
    2.2.1 Required Function
    2.2.2 Reliability Block Diagram
    2.2.3 Operating Conditions at Component Level, Stress Factors
    2.2.4 Failure Rate of Electronic Components
    2.2.5 Reliability of One-Item Structure
    2.2.6 Reliability of Series-Parallel Structures
      2.2.6.1 Systems without Redundancy
      2.2.6.2 Concept of Redundancy
      2.2.6.3 Parallel Models
      2.2.6.4 Series - Parallel Structures
      2.2.6.5 Majority Redundancy
    2.2.7 Part Count Method
  2.3 Reliability of Systems with Complex Structure
    2.3.1 Key Item Method
      2.3.1.1 Bridge Structure
      2.3.1.2 Rel. Block Diagram in which Elements Appear More than Once
    2.3.2 Successful Path Method
    2.3.3 State Space Method
    2.3.4 Boolean Function Method
    2.3.5 Parallel Models with Constant Failure Rates and Load Sharing
    2.3.6 Elements with more than one Failure Mechanism or one Failure Mode
    2.3.7 Basic Considerations on Fault Tolerant Structures
  2.4 Reliability Allocation
  2.5 Mechanical Reliability, Drift Failures
  2.6 Failure Mode Analysis
  2.7 Reliability Aspects in Design Reviews

3 Qualification Tests for Components and Assemblies
  3.1 Basic Selection Criteria for Electronic Components
    3.1.1 Environment
    3.1.2 Performance Parameters
    3.1.3 Technology
    3.1.4 Manufacturing Quality
    3.1.5 Long-Term Behavior of Performance Parameters
    3.1.6 Reliability
  3.2 Qualification Tests for Complex Electronic Components
    3.2.1 Electrical Test of Complex ICs
    3.2.2 Characterization of Complex ICs
    3.2.3 Environmental and Special Tests of Complex ICs
    3.2.4 Reliability Tests
  3.3 Failure Modes, Failure Mechanisms, and Failure Analysis of Electronic Components
    3.3.1 Failure Modes of Electronic Components
    3.3.2 Failure Mechanisms of Electronic Components
    3.3.3 Failure Analysis of Electronic Components
    3.3.4 Examples of VLSI Production-Related Reliability Problems
  3.4 Qualification Tests for Electronic Assemblies

4 Maintainability Analysis
  4.1 Maintenance, Maintainability
  4.2 Maintenance Concept
    4.2.1 Fault Recognition and Isolation
    4.2.2 Equipment and System Partitioning
    4.2.3 User Documentation
    4.2.4 Training of Operating and Maintenance Personnel
    4.2.5 User Logistic Support
  4.3 Maintainability Aspects in Design Reviews
  4.4 Predicted Maintainability
    4.4.1 Calculation of MTTR_S
    4.4.2 Calculation of MTTPM_S
  4.5 Basic Models for Spare Parts Provisioning
    4.5.1 Centralized Logistic Support, Nonrepairable Spare Parts
    4.5.2 Decentralized Logistic Support, Nonrepairable Spare Parts
    4.5.3 Repairable Spare Parts
  4.6 Repair Strategies
  4.7 Cost Considerations

5 Design Guidelines for Reliability, Maintainability, and Software Quality
  5.1 Design Guidelines for Reliability
    5.1.1 Derating
    5.1.2 Cooling
    5.1.3 Moisture
    5.1.4 Electromagnetic Compatibility, ESD Protection
    5.1.5 Components and Assemblies
      5.1.5.1 Component Selection
      5.1.5.2 Component Use
      5.1.5.3 PCB and Assembly Design
      5.1.5.4 PCB and Assembly Manufacturing
      5.1.5.5 Storage and Transportation
    5.1.6 Particular Guidelines for IC Design and Manufacturing
  5.2 Design Guidelines for Maintainability
    5.2.1 General Guidelines
    5.2.2 Testability
    5.2.3 Accessibility, Exchangeability
    5.2.4 Operation, Adjustment
  5.3 Design Guidelines for Software Quality
    5.3.1 Guidelines for Software Defect Prevention
    5.3.2 Configuration Management
    5.3.3 Guidelines for Software Testing
    5.3.4 Software Quality Growth Models

6 Reliability and Availability of Repairable Systems
  6.1 Introduction and General Assumptions
  6.2 One-Item Structure
    6.2.1 One-Item Structure New at Time t = 0
      6.2.1.1 Reliability Function
      6.2.1.2 Point Availability
      6.2.1.3 Average Availability
      6.2.1.4 Interval Reliability
      6.2.1.5 Special Kinds of Availability
    6.2.2 One-Item Structure New at Time t = 0 and with Constant Failure Rate λ
    6.2.3 One-Item Structure with Arbitrary Initial Conditions at Time t = 0
    6.2.4 Asymptotic Behavior
    6.2.5 Steady-State Behavior
  6.3 Systems without Redundancy
    6.3.1 Series Structure with Constant Failure and Repair Rates
    6.3.2 Series Structure with Constant Failure and Arbitrary Repair Rates
    6.3.3 Series Structure with Arbitrary Failure and Repair Rates
  6.4 1-out-of-2 Redundancy
    6.4.1 1-out-of-2 Redundancy with Constant Failure and Repair Rates
    6.4.2 1-out-of-2 Redundancy with Constant Failure and Arbitrary Repair Rates
    6.4.3 1-out-of-2 Red. with Const. Failure Rate in Res. State and Arbitr. Repair Rates
  6.5 k-out-of-n Redundancy
    6.5.1 k-out-of-n Warm Redundancy with Constant Failure and Repair Rates
    6.5.2 k-out-of-n Active Redundancy with Const. Failure and Arbitrary Repair Rates
  6.6 Simple Series - Parallel Structures
  6.7 Approximate Expressions for Large Series - Parallel Structures
    6.7.1 Introduction
    6.7.2 Application to a Practical Example
  6.8 Systems with Complex Structure
    6.8.1 General Considerations
    6.8.2 Preventive Maintenance
    6.8.3 Imperfect Switching
    6.8.4 Incomplete Coverage
    6.8.5 Elements with more than two States or one Failure Mode
    6.8.6 Fault Tolerant Reconfigurable Systems
      6.8.6.1 Ideal Case
      6.8.6.2 Time Censored Reconfiguration (Phased-Mission Systems)
      6.8.6.3 Failure Censored Reconfiguration
      6.8.6.4 With Reward and Frequency / Duration Aspects
    6.8.7 Systems with Common Cause Failures
    6.8.8 General Procedure for Modeling Complex Systems
  6.9 Alternative Investigation Methods
    6.9.1 Petri Nets
    6.9.2 Dynamic Fault Trees
    6.9.3 Computer-Aided Reliability and Availability Computation
      6.9.3.1 Numerical Solution of Equations for Reliability and Availability
      6.9.3.2 Monte Carlo Simulations

7 Statistical Quality Control and Reliability Tests
  7.1 Statistical Quality Control
    7.1.1 Estimation of a Defective Probability p
    7.1.2 Simple Two-sided Sampling Plans for Demonstration of a Def. Probability p
      7.1.2.1 Simple Two-sided Sampling Plans
      7.1.2.2 Sequential Tests
    7.1.3 One-sided Sampling Plans for the Demonstration of a Def. Probability p
  7.2 Statistical Reliability Tests
    7.2.1 Reliability & Availability Estimation & Demon. for the case of a given Mission
    7.2.2 Availability Estimation & Demonstration for Continuous Operation (steady-state)
      7.2.2.1 Availability Estimation
      7.2.2.2 Availability Demonstration
      7.2.2.3 Further Availability Evaluation Methods for Continuous Operation
    7.2.3 Estimation and Demonstration of a Constant Failure Rate λ (or of MTBF = 1/λ)
      7.2.3.1 Estimation of a Constant Failure Rate λ
      7.2.3.2 Simple Two-sided Test for the Demonstration of λ
      7.2.3.3 Simple One-sided Test for the Demonstration of λ
  7.3 Statistical Maintainability Tests
    7.3.1 Estimation of an MTTR
    7.3.2 Demonstration of an MTTR
  7.4 Accelerated Testing
  7.5 Goodness-of-fit Tests
    7.5.1 Kolmogorov-Smirnov Test
    7.5.2 Chi-square Test
  7.6 Statistical Analysis of General Reliability Data
    7.6.1 General Considerations
    7.6.2 Tests for Nonhomogeneous Poisson Processes
    7.6.3 Trend Tests
      7.6.3.1 Tests of a HPP versus a NHPP with increasing intensity
      7.6.3.2 Tests of a HPP versus a NHPP with decreasing intensity
      7.6.3.3 Heuristic Tests to distinguish between HPP and Gen. Monotonic Trend
  7.7 Reliability Growth

8 Quality & Reliability Assurance During the Production Phase (Basic Considerations)
  8.1 Basic Activities
  8.2 Testing and Screening of Electronic Components
    8.2.1 Testing of Electronic Components
    8.2.2 Screening of Electronic Components
  8.3 Testing and Screening of Electronic Assemblies
  8.4 Test and Screening Strategies, Economic Aspects
    8.4.1 Basic Considerations
    8.4.2 Quality Cost Optimization at Incoming Inspection Level
    8.4.3 Procedure to handle first deliveries

Annexes

A1 Terms and Definitions

A2 Quality and Reliability Standards
  A2.1 Introduction
  A2.2 Requirements in the Industrial Field
  A2.3 Requirements in the Aerospace, Defense, and Nuclear Fields

A3 Definition and Realization of Quality and Reliability Requirements
  A3.1 Definition of Quality and Reliability Requirements
  A3.2 Realization of Quality and Reliability Requirements for Complex Equip. & Systems
  A3.3 Elements of a Quality and Reliability Assurance Program
    A3.3.1 Project Organization, Planning, and Scheduling
    A3.3.2 Quality and Reliability Requirements
    A3.3.3 Reliability and Safety Analysis
    A3.3.4 Selection and Qualification of Components, Materials & Manuf. Processes
    A3.3.5 Configuration Management
    A3.3.6 Quality Tests
    A3.3.7 Quality Data Reporting System

A4 Checklists for Design Reviews
  A4.1 System Design Review
  A4.2 Preliminary Design Reviews
  A4.3 Critical Design Review (System Level)
A5 Requirements for Quality Data Reporting Systems . . . . . . . . . . . . . . . .388 A6 Ba& A6.1 A6.2 A6.3 A6.4
Probability Theory . . . . . . . . . . . . . . . . . Field of Events . . . . . . . . . . . . . . . . . . . Concept of Probability . . . . . . . . . . . . . . . . Conditional Probability. Independence . . . . . . . . . . Fundamental Rules of Probability Theory . . . . . . . . . A6.4.1 Addition Theorem for Mutually Exclusive Events . . A6.4.2 Multiplication Theorem for Two Independent Events A6.4.3 Multiplication Theorem for Arbitrary Events . . . .
XVI
Contents
    A6.4.4 Addition Theorem for Arbitrary Events
    A6.4.5 Theorem of Total Probability
  A6.5 Random Variables, Distribution Functions
  A6.6 Numerical Parameters of Random Variables
    A6.6.1 Expected Value (Mean)
    A6.6.2 Variance
    A6.6.3 Modal Value, Quantile, Median
  A6.7 Multidimensional Random Variables, Conditional Distributions
  A6.8 Numerical Parameters of Random Vectors
    A6.8.1 Covariance Matrix, Correlation Coefficient
    A6.8.2 Further Properties of Expected Value and Variance
  A6.9 Distribution of the Sum of Indep. Positive Random Variables and of τmin, τmax
  A6.10 Distribution Functions used in Reliability Analysis
    A6.10.1 Exponential Distribution
    A6.10.2 Weibull Distribution
    A6.10.3 Gamma Distribution, Erlangian Distribution, and χ²-Distribution
    A6.10.4 Normal Distribution
    A6.10.5 Lognormal Distribution
    A6.10.6 Uniform Distribution
    A6.10.7 Binomial Distribution
    A6.10.8 Poisson Distribution
    A6.10.9 Geometric Distribution
    A6.10.10 Hypergeometric Distribution
  A6.11 Limit Theorems
    A6.11.1 Law of Large Numbers
    A6.11.2 Central Limit Theorem
A7 Basic Stochastic-Processes Theory
  A7.1 Introduction
  A7.2 Renewal Processes
    A7.2.1 Renewal Function, Renewal Density
    A7.2.2 Recurrence Times
    A7.2.3 Asymptotic Behavior
    A7.2.4 Stationary Renewal Processes
    A7.2.5 Homogeneous Poisson Processes
  A7.3 Alternating Renewal Processes
  A7.4 Regenerative Processes
  A7.5 Markov Processes with Finitely Many States
    A7.5.1 Markov Chains with Finitely Many States
    A7.5.2 Markov Processes with Finitely Many States
    A7.5.3 State Probabilities and Stay (Sojourn) Times in a Given Class of States
      A7.5.3.1 Method of Differential Equations
      A7.5.3.2 Method of Integral Equations
      A7.5.3.3 Stationary State and Asymptotic Behavior
    A7.5.4 Frequency / Duration and Reward Aspects
      A7.5.4.1 Frequency / Duration
      A7.5.4.2 Reward
    A7.5.5 Birth and Death Process
  A7.6 Semi-Markov Processes with Finitely Many States
  A7.7 Semi-regenerative Processes
  A7.8 Nonregenerative Stochastic Processes
    A7.8.1 General Considerations
    A7.8.2 Nonhomogeneous Poisson Processes (NHPP)
    A7.8.3 Superimposed Renewal Processes
    A7.8.4 Cumulative Processes
    A7.8.5 General Point Processes
A8 Basic Mathematical Statistics
  A8.1 Empirical Methods
    A8.1.1 Empirical Distribution Function
    A8.1.2 Empirical Moments and Quantiles
    A8.1.3 Further Applications of the Empirical Distribution Function
  A8.2 Parameter Estimation
    A8.2.1 Point Estimation
    A8.2.2 Interval Estimation
      A8.2.2.1 Estimation of an Unknown Probability p
      A8.2.2.2 Estimation of the Parameter λ for an Exponential Distribution, Fixed T
      A8.2.2.3 Estimation of the Parameter λ for an Exponential Distribution, Fixed n
      A8.2.2.4 Availability Estimation (Erlangian Failure-Free & Repair Times)
  A8.3 Testing Statistical Hypotheses
    A8.3.1 Testing an Unknown Probability p
      A8.3.1.1 Simple Two-sided Sampling Plan
      A8.3.1.2 Sequential Test
      A8.3.1.3 Simple One-sided Sampling Plan
      A8.3.1.4 Availability Demonstration (Erlangian Failure-Free & Repair Times)
    A8.3.2 Goodness-of-fit Tests for Completely Specified F0(t)
    A8.3.3 Goodness-of-fit Tests for F0(t) with Unknown Parameters
A9 Tables and Charts
  A9.1 Standard Normal Distribution
  A9.2 χ²-Distribution (Chi-Square Distribution)
  A9.3 t-Distribution (Student distribution)
  A9.4 F-Distribution (Fisher distribution)
  A9.5 Table for the Kolmogorov-Smirnov Test
  A9.6 Gamma Function
  A9.7 Laplace Transform
  A9.8 Probability Charts (Probability Plot Papers)
    A9.8.1 Lognormal Probability Chart
    A9.8.2 Weibull Probability Chart
    A9.8.3 Normal Probability Chart
A10 Basic Technological Component's Properties
A11 Problems for Home-Work
Acronyms
References
Index
1 Basic Concepts, Quality and Reliability Assurance of Complex Equipment and Systems
The purpose of reliability engineering is to develop methods and tools to evaluate and demonstrate reliability, maintainability, availability, and safety of components, equipment, and systems, as well as to support development and production engineers in building in these characteristics. In order to be cost and time effective, reliability engineering must be integrated in project activities, and support quality assurance and concurrent engineering efforts. This chapter introduces basic concepts, shows their relationships, and discusses the tasks necessary to assure quality and reliability of complex equipment and systems with high quality and reliability requirements. A comprehensive list of definitions is given in Appendix A1. Standards for quality assurance (management) systems are discussed in Appendix A2. Refinements of management aspects are given in Appendices A3 - A5 for the cases in which tailoring is not mandatory.
1.1 Introduction

Until the nineteen-sixties, quality targets were deemed to have been reached when the item considered was found to be free of defects or systematic failures at the time it left the manufacturer. The growing complexity of equipment and systems, as well as the rapidly increasing cost incurred by loss of operation as a consequence of failures, have brought to the forefront the aspects of reliability, maintainability, availability, and safety. The expectation today is that complex equipment and systems are not only free from defects and systematic failures at time $t = 0$ (when they are put into operation), but also perform the required function failure free for a stated time interval and have a fail-safe behavior in the case of critical or catastrophic failures. However, the question of whether a given item will operate without failures during a stated period of time cannot be simply answered by yes or no, on the basis of a compliance test. Experience shows that only a probability for this occurrence can be given. This probability is a measure of the item's reliability and can be interpreted as follows:
If $n$ statistically identical items are put into operation at time $t = 0$ to perform a given mission and $\nu \le n$ of them accomplish it successfully, then the ratio $\nu / n$ is a random variable which converges for increasing $n$ to the true value of the reliability (Appendix A6.11).

Performance parameters as well as reliability, maintainability, availability, and safety have to be built in during design & development and retained during production and operation of an item. After the introduction of some important concepts in Section 1.2, Section 1.3 gives basic tasks and rules for quality and reliability assurance of complex equipment and systems with high quality and reliability requirements (see Appendix A1 for a comprehensive list of definitions and Appendices A2 - A5 for a refinement of management aspects).
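The interpretation of reliability as the limit of the ratio $\nu / n$ can be illustrated with a small simulation. The following Python sketch is only an illustration under stated assumptions: the true reliability `R_TRUE = 0.9`, the sample sizes, and the helper `success_ratio` are hypothetical and not taken from the text, and each item is assumed to accomplish the mission independently with probability `R_TRUE`.

```python
import random


def success_ratio(n, reliability, rng):
    """Simulate n statistically identical items and return nu / n,
    where nu is the number of items that accomplish the mission."""
    nu = sum(1 for _ in range(n) if rng.random() < reliability)
    return nu / n


R_TRUE = 0.9              # assumed true reliability (illustrative value)
rng = random.Random(1)    # fixed seed, used only for reproducibility

for n in (10, 1_000, 100_000):
    print(n, success_ratio(n, R_TRUE, rng))
```

For small $n$ the ratio scatters noticeably around the assumed value; for large $n$ it settles close to it, in line with the law of large numbers (Appendix A6.11).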
1.2 Basic Concepts

This section introduces important concepts used in reliability engineering and shows their relationships (see Appendix A1 for a more complete list).
1.2.1 Reliability

Reliability is a characteristic of an item, expressed by the probability that the item will perform its required function under given conditions for a stated time interval. It is generally designated by $R$. From a qualitative point of view, reliability can be defined as the ability of the item to remain functional. Quantitatively, reliability specifies the probability that no operational interruptions will occur during a stated time interval. This does not mean that redundant parts may not fail; such parts can fail and be repaired (without operational interruption at item (system) level). The concept of reliability thus applies to nonrepairable as well as to repairable items (Chapters 2 and 6, respectively). To make sense, a numerical statement of reliability (e.g., $R = 0.9$) must be accompanied by the definition of the required function, the operating conditions, and the mission duration. In general, it is also important to know whether or not the item can be considered new when the mission starts.

An item is a functional or structural unit of arbitrary complexity (e.g. component, assembly, equipment, subsystem, system) that can be considered as an entity for investigations. It may consist of hardware, software, or both and may also include human resources. Often, ideal human aspects and logistic support are assumed, even if (for simplicity) the term system is used instead of technical system.
The required function specifies the item's task. For example, for given inputs, the item outputs have to be constrained within specified tolerance bands (performance parameters should still be given with tolerances and not merely as fixed values). The definition of the required function is the starting point for any reliability analysis, as it defines failures.

Operating conditions have an important influence upon reliability, and must therefore be specified with care. Experience shows, for example, that the failure rate of semiconductor devices doubles for an operating temperature increase of 10 to 20°C.

The required function and/or operating conditions can be time dependent. In these cases, a mission profile has to be defined and all reliability figures will be related to it. A representative mission profile and the corresponding reliability targets should be given in the item's specification.

Often the mission duration is considered as a parameter $t$; the reliability function is then defined by $R(t)$. $R(t)$ is the probability that no failure at item level will occur in the interval $(0, t]$. The item's condition at $t = 0$ (new or not) influences final results. To consider this, reliability figures at system level will have indices $Si$ (e.g. $R_{Si}(t)$), where $S$ stands for system and $i$ is the state entered at $t = 0$ (Table 6.2).

A distinction between predicted and estimated or assessed reliability is important. The first is calculated on the basis of the item's reliability structure and the failure rate of its components (Sections 2.2 & 2.3); the second is obtained from a statistical evaluation of reliability tests (Section 7.2) or from field data with known environmental and operating conditions.

The concept of reliability can be extended to processes and services as well, although human aspects can lead to modeling difficulties (see e.g. Section 1.2.7).
1.2.2 Failure

A failure occurs when the item stops performing its required function. As simple as this definition is, it can become difficult to apply it to complex items. The failure-free time (hereafter used as a synonym for failure-free operating time) is generally a random variable. It is often reasonably long, but it can be very short, for instance because of a failure caused by a transient event at turn-on. A general assumption in investigating failure-free times is that at $t = 0$ the item is free of defects and systematic failures. Besides their frequency, failures should be classified (as far as possible) according to the mode, cause, effect, and mechanism:
1. Mode: The mode of a failure is the symptom (local effect) by which a failure is observed; e.g., opens, shorts, or drift for electronic components (Table 3.4); brittle rupture, creep, cracking, seizure, fatigue for mechanical components.
2. Cause: The cause of a failure can be intrinsic, due to weaknesses in the item and/or wearout, or extrinsic, due to errors, misuse or mishandling during the design, production, or use. Extrinsic causes often lead to systematic failures, which are deterministic and should be considered like defects (dynamic defects in software quality). Defects are present at $t = 0$, even if often they cannot be discovered at $t = 0$. Failures always appear in time, even if the time to failure is short as it can be with systematic or early failures.
3. Effect: The effect (consequence) of a failure can be different if considered on the item itself or at higher level. A usual classification is: non relevant, partial, complete, and critical failure. Since a failure can also cause further failures, distinction between primary and secondary failure is important.
4. Mechanism: Failure mechanism is the physical, chemical, or other process resulting in a failure (see Table 3.5 for some examples).

Failures can also be classified as sudden and gradual. In this case, sudden and complete failures are termed cataleptic failures, gradual and partial failures are termed degradation failures. As failure is not the only cause for an item being down, the general term used to define the down state of an item (not caused by a preventive maintenance, other planned actions, or lack of external resources) is fault. Fault is thus a state of an item and can be due to a defect or a failure.
1.2.3 Failure Rate

The failure rate plays an important role in reliability analysis. This section introduces it heuristically; see Appendix A6.5 for an analytical derivation.

Let us assume that $n$ statistically identical and independent items are put into operation at time $t = 0$, under the same conditions, and at the time $t$ a subset $\bar{\nu}(t)$ of these items have not yet failed. $\bar{\nu}(t)$ is a right continuous decreasing step function (Fig. 1.1). $t_1, \ldots, t_n$, measured from $t = 0$, are the observed failure-free times (times to failure) of the $n$ items considered. They are independent realizations of a random variable $\tau$ (hereafter identified as a failure-free time) and must not be confused with arbitrary points on the time axis ($t_1^*, t_2^*, \ldots$). The quantity

$$\hat{E}[\tau] = \frac{t_1 + \ldots + t_n}{n} \tag{1.1}$$

is the empirical mean (empirical expected value) of $\tau$. Empirical quantities are statistical estimates, marked with $\hat{\ }$ in this book. For $n \to \infty$, $\hat{E}[\tau]$ converges to the true value $E[\tau] = MTTF$ (given by Eq. (1.8)) of the mean failure-free time $\tau$ (Eq. (A6.147), see also Appendix A8.1.2). The function

$$\hat{R}(t) = \frac{\bar{\nu}(t)}{n} \tag{1.2}$$

is the empirical reliability function. As shown in Appendix A8.1.1, $\hat{R}(t)$ converges to the reliability function $R(t)$ for $n \to \infty$. For an arbitrary time interval $(t, t + \delta t]$, the empirical failure rate is defined as

$$\hat{\lambda}(t) = \frac{\bar{\nu}(t) - \bar{\nu}(t + \delta t)}{\bar{\nu}(t)\,\delta t} \tag{1.3}$$
Figure 1.1 Number $\bar{\nu}(t)$ of (nonrepairable) items still operating at time $t$
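$\hat{\lambda}(t)\,\delta t$ is the ratio of the items failed in the interval $(t, t + \delta t]$ to the number of items still operating (or surviving) at time $t$. Applying Eq. (1.2) to Eq. (1.3) yields

$$\hat{\lambda}(t) = \frac{\hat{R}(t) - \hat{R}(t + \delta t)}{\delta t\,\hat{R}(t)} \tag{1.4}$$

For $n \to \infty$ and $\delta t \to 0$, and assuming $R(t)$ differentiable, $\hat{\lambda}(t)$ converges to the failure rate

$$\lambda(t) = \frac{-\,dR(t)/dt}{R(t)} \tag{1.5}$$

Considering $R(0) = 1$ (at $t = 0$ all items are new) it follows that

$$R(t) = e^{-\int_0^t \lambda(x)\,dx} \tag{1.6}$$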
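The empirical quantities introduced in this section (mean of the observed failure-free times, fraction of items still operating at time $t$, and failures in $(t, t + \delta t]$ per surviving item and unit time) can be computed directly from a list of observed times. The following Python sketch is a minimal illustration; the function names and the eight failure-free times (in hours) are hypothetical and not taken from the text.

```python
from bisect import bisect_right


def empirical_mean(times):
    """Empirical mean of the observed failure-free times."""
    return sum(times) / len(times)


def surviving(times_sorted, t):
    """Number of items not yet failed at time t (failure-free time > t)."""
    return len(times_sorted) - bisect_right(times_sorted, t)


def empirical_reliability(times_sorted, t):
    """Fraction of the n items still operating at time t."""
    return surviving(times_sorted, t) / len(times_sorted)


def empirical_failure_rate(times_sorted, t, dt):
    """Items failed in (t, t + dt] per item surviving at t and per unit time."""
    alive = surviving(times_sorted, t)
    if alive == 0:
        raise ValueError("no item still operating at time t")
    failed = alive - surviving(times_sorted, t + dt)
    return failed / (alive * dt)


# Hypothetical observed failure-free times in hours (illustrative data only)
times = sorted([120.0, 340.0, 410.0, 560.0, 800.0, 950.0, 1300.0, 2100.0])

print(empirical_mean(times))
print(empirical_reliability(times, 500.0))
print(empirical_failure_rate(times, 500.0, 500.0))
```

With so few observations the empirical failure rate depends strongly on the chosen interval length; the convergence statements in the text apply only for large $n$ and small $\delta t$.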
The failure rate $\lambda(t)$ given by Eqs. (1.3) - (1.5) applies in particular to nonrepairable items (Figs. 1.1 & 1.2). However, considering Eq. (A6.25) it can also be used for repairable items which are as-good-as-new after repair (renewal), taking instead of $t$ the variable $x$ starting by $x = 0$ at each renewal (as for interarrival times). If a repairable system cannot be restored to be as-good-as-new after repair (with respect to the state considered), i.e. if at least one element with time dependent failure rate has not been renewed at every repair, failure intensity $z(t)$ has to be used (see pp. 355, 356, 358 for comments). The use of hazard rate for $\lambda(t)$ should also be avoided.
In many practical applications, $\lambda(t) = \lambda$ can be assumed. Eq. (1.6) then yields

$$R(t) = e^{-\lambda t}, \qquad \text{for } \lambda(t) = \lambda. \tag{1.7}$$
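The step from a given failure rate to the reliability function can be checked numerically. The sketch below assumes the usual relation $R(t) = \exp(-\int_0^t \lambda(x)\,dx)$ with $R(0) = 1$ and evaluates the integral with a simple midpoint rule; the function name `reliability`, the values of `LAM` and `BETA`, and the Weibull-type increasing failure rate used as a second case are illustrative assumptions.

```python
import math


def reliability(failure_rate, t, steps=10_000):
    """R(t) = exp(-integral_0^t lambda(x) dx), integral by the midpoint rule."""
    dx = t / steps
    integral = sum(failure_rate((i + 0.5) * dx) for i in range(steps)) * dx
    return math.exp(-integral)


LAM = 1e-4    # assumed failure rate parameter in 1/h (illustrative)
BETA = 2.0    # assumed shape parameter for the increasing failure rate case

# Constant failure rate: the result should agree with exp(-LAM * t)
r_const = reliability(lambda x: LAM, 1000.0)

# Increasing failure rate lambda(x) = BETA * LAM**BETA * x**(BETA - 1),
# for which the integral equals (LAM * t)**BETA
r_wear = reliability(lambda x: BETA * LAM**BETA * x ** (BETA - 1.0), 1000.0)

print(r_const, math.exp(-LAM * 1000.0))
print(r_wear, math.exp(-((LAM * 1000.0) ** BETA)))
```

The constant case reproduces the exponential form; any other integrable failure rate can be passed in the same way.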
With $\lambda(t) = \lambda$, the failure-free time $\tau > 0$ is exponentially distributed ($F(t) = \Pr\{\tau \le t\} = 1 - e^{-\lambda t}$). For this case, and only in this case, the failure rate $\lambda$ can be estimated by $\hat{\lambda} = k / T$, where $T$ is a given (fixed) cumulative operating time and $k$ the total number of failures during $T$ (Eqs. (7.28) and (A8.46)). The mean (expected value) of the failure-free time $\tau > 0$ is given by (Eq. (A6.38))

$$MTTF = E[\tau] = \int_0^\infty R(t)\,dt \tag{1.8}$$
where MTTF stands for mean time to failure. For $\lambda(t) = \lambda$ it follows that $E[\tau] = 1/\lambda$. Constant (time independent) failure rate $\lambda$ is often assumed for repairable items too, considered as-good-as-new after repair (renewal). For this case, and only in this case, successive failure-free times are independent random variables, exponentially distributed with the same parameter $\lambda$, and have mean

$$MTBF = \frac{1}{\lambda}, \qquad \text{for } \lambda(x) = \lambda, \tag{1.9}$$
where MTBF stands for mean operating time between failures. Also because of the statistical estimate $\widehat{MTBF} = T / k$ (Section 7.2.3.1), often used in practical applications, MTBF should be confined to the case of repairable items with constant failure rate (p. 358). For Markov and semi-Markov models, $MUT_S$ is used (Eqs. (6.287) or (A7.142)).

The failure rate of a large population of statistically identical and independent items often exhibits a typical bathtub curve (Fig. 1.2) with the following 3 phases:
1. Early failures: $\lambda(t)$ decreases (in general) rapidly with time; failures in this phase are attributable to randomly distributed weaknesses in materials, components, or production processes.
2. Failures with constant (or nearly so) failure rate: $\lambda(t)$ is approximately constant; failures in this period are Poisson distributed and often cataleptic.
3. Wearout failures: $\lambda(t)$ increases with time; failures in this period are attributable to aging, wearout, fatigue, etc. (e.g. corrosion, electromigration).

Early failures are not deterministic and appear in general randomly distributed in time and over the items. During the early failure period, $\lambda(t)$ need not decrease as in Fig. 1.2; in some cases it can oscillate. To eliminate early failures, burn-in or environmental stress screening is used (Chapter 8). Early failures must be distinguished from systematic failures, which are deterministic and caused by errors or mistakes, and whose elimination requires a change in design, production process, operational procedure, documentation or other. The length of the early failure period varies greatly in practice. However, in most applications it will be shorter than a few thousand hours.

Figure 1.2 Typical shape for the failure rate of a large population of statistically identical and independent (nonrepairable) items (dashed is a possible shift for a higher stress, e.g. ambient temperature)

The presence of a period with constant (or nearly so) failure rate $\lambda(t) = \lambda$ is realistic for many equipment & systems, and useful for calculations. The memoryless property, which characterizes this period, leads to a homogeneous Poisson process for the flow of failures (Appendix A7.2.5) and to a Markov process for the time behavior of a repairable item if also constant repair rates can be assumed (Chapter 6). An increasing failure rate after a given operating time (> 10 years for many electronic equipment) is typical for most items and appears because of degradation phenomena due to wearout.

A possible explanation for the shape of $\lambda(t)$ given in Fig. 1.2 is that the population of $n$ statistically identical and independent items contains $n\,p_f$ weak elements and $n\,(1 - p_f)$ good ones. The distribution of the failure-free time can then be expressed by a weighted sum of the form $F(t) = p_f\,F_1(t) + (1 - p_f)\,F_2(t)$. For calculation or simulation purposes, $F_1(t)$ could be a gamma distribution with $\beta < 1$ and $F_2(t)$ a shifted Weibull distribution with $\beta > 1$ (Eqs. (A6.34), (A6.96), (A6.97)).

The failure rate strongly depends upon the item's operating conditions. For semiconductor devices, experience shows for example that the value of $\lambda$ doubles for an operating temperature increase of 10 to 20°C and becomes more than an order of magnitude higher if the device is exposed to elevated mechanical stresses (Table 2.3). Typical figures for $\lambda$ are $10^{-10}$ to $10^{-7}\,\mathrm{h}^{-1}$ for electronic components. The concept of failure rate also applies to humans and a shape similar to that depicted in Fig. 1.2 can be obtained from a mortality table.

As stated with Eqs. (1.3) - (1.5), the failure rate $\lambda(t)$ is a conditional density and must not be confused with the failure intensity $z(t)$ (Eq. (A7.228)) or the intensity $h(t)$ of a renewal process (Eq. (A7.18)) or $m(t)$ of a Poisson process (Eq. (A7.193)). $z(t)$, $h(t)$, and $m(t)$ are unconditional densities and differ basically from $\lambda(t)$.
This distinction is important also for the case of a homogeneous Poisson process, for which $z(t) = h(t) = m(t) = \lambda$ holds for the intensity and $\lambda(x) = \lambda$ holds for the interarrival times ($x$ starting by 0 at each interarrival time, see also p. 356). To reduce ambiguities, force of mortality has been suggested for $\lambda(t)$ in [6.3, A7.30].
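The explanation of the bathtub shape by a mixed population of weak and good items can be sketched numerically. The Python example below evaluates the failure rate of the weighted sum $F(t) = p_f\,F_1(t) + (1 - p_f)\,F_2(t)$ as density divided by survival function. To keep the sketch in closed form, a Weibull distribution with $\beta < 1$ is used for the weak items in place of the gamma distribution mentioned in the text; this substitution, the form $F(t) = 1 - e^{-(\lambda t)^\beta}$, all parameter values, and the function names are assumptions made only for illustration.

```python
import math


def weibull_pdf_sf(t, lam, beta, shift=0.0):
    """Density and survival function of a Weibull distribution
    F(t) = 1 - exp(-(lam * (t - shift))**beta) for t > shift.
    For t <= shift the pair (0.0, 1.0) is returned (no failure possible yet)."""
    x = t - shift
    if x <= 0.0:
        return 0.0, 1.0
    sf = math.exp(-((lam * x) ** beta))
    pdf = beta * lam * (lam * x) ** (beta - 1.0) * sf
    return pdf, sf


def mixture_failure_rate(t, p_f=0.02):
    """Failure rate f(t) / (1 - F(t)) of the mixed population."""
    f1, r1 = weibull_pdf_sf(t, lam=1e-3, beta=0.5)             # weak items
    f2, r2 = weibull_pdf_sf(t, lam=1e-5, beta=3.0, shift=2e4)  # good items
    return (p_f * f1 + (1.0 - p_f) * f2) / (p_f * r1 + (1.0 - p_f) * r2)


for t in (10.0, 1e3, 1e4, 5e4, 1e5):
    print(f"t = {t:>8.0f} h   failure rate = {mixture_failure_rate(t):.3e} 1/h")
```

With these hypothetical parameters the failure rate first falls as the weak items are removed from the population and rises again once wearout of the good items sets in. The flat middle region is carried only by the tail of the weak subpopulation here, so the sketch reproduces the overall shape rather than a true constant failure rate period.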
1.2.4 Maintenance, Maintainability

Maintenance defines the set of activities performed on an item to retain it in or to restore it to a specified state. Maintenance is thus subdivided into preventive maintenance, carried out at predetermined intervals to reduce wearout failures, and corrective maintenance, carried out after failure recognition and intended to put the item into a state in which it can again perform the required function. Preventive maintenance also aims to detect and repair hidden failures, i.e. failures in redundant elements not identified at their occurrence. Corrective maintenance is also known as repair, and can include any or all of the following steps: recognition, isolation (localization & diagnosis), elimination (disassembly, replacement, reassembly), checkout. Repair is used hereafter as a synonym for restoration. To simplify calculations, it is generally assumed that the element in the reliability block diagram for which a maintenance action has been performed is as-good-as-new after maintenance. This assumption is valid for the whole equipment or system in the case of constant failure rate for all elements which have not been repaired or replaced.

Maintainability is a characteristic of an item, expressed by the probability that a preventive maintenance or a repair of the item will be performed within a stated time interval for given procedures and resources (skill level of personnel, spare parts, test facilities, etc.). From a qualitative point of view, maintainability can be defined as the ability of an item to be retained in or restored to a specified state. The expected value (mean) of the repair time is denoted by MTTR (mean time to repair), that of a preventive maintenance by MTTPM. Often used for unscheduled removals is also MTBUR. Maintainability has to be built into complex equipment or systems during design and development by realizing a maintenance concept. Due to the increasing maintenance cost, maintainability aspects have grown in importance. However, maintainability achieved in the field largely depends on the resources available for maintenance (human and material), as well as on the correct installation of the equipment or system, i.e. on the logistic support and accessibility.
1.2.5 Logistic Support

Logistic support designates all activities undertaken to provide effective and economical use of an item during its operating phase. To be effective, logistic support should be integrated into the maintenance concept of the item under consideration and include after-sales service. An emerging aspect related to maintenance and logistic support is that of obsolescence management, i.e. how to assure functionality over a long operating period, e.g. 20 years, when technology is rapidly evolving and components needed for maintenance are no longer manufactured. Care has to be given here to design aspects, to assure interchangeability during the equipment's useful life without important redesign. Standardization in this direction is in progress [1.9].
1.2.6 Availability

Availability is a broad term, expressing the ratio of delivered to expected service. It is often designated by $A$ and used for the stationary & steady-state value of the point and average availability ($PA = AA$). Point availability ($PA(t)$) is a characteristic of an item expressed by the probability that the item will perform its required function under given conditions at a stated instant of time $t$. From a qualitative point of view, point availability can be defined as the ability of the item to perform its required function under given conditions at a stated instant of time (dependability).

Availability evaluations are often difficult, as logistic support and human factors should be considered in addition to reliability and maintainability. Ideal human and logistic support conditions are thus often assumed, yielding the intrinsic (inherent) availability. Hereafter, availability is used as a synonym for intrinsic availability. Further assumptions for calculations are continuous operation and complete renewal for the repaired element in the reliability block diagram (assumed as-good-as-new after repair). For a given item, the point availability $PA(t)$ rapidly converges to a stationary & steady-state value, given by (Eq. (6.48))

$$PA = AA = \frac{MTTF}{MTTF + MTTR} \tag{1.10}$$
$PA$ is also the stationary & steady-state value of the average availability ($AA$) giving the expected value (mean) of the percentage of the time during which the item performs its required function. $PA_S$ and $AA_S$ are used for considerations at system level. Other availability measures can be defined, e.g. mission availability, work-mission availability, overall availability (Sections 6.2.1.5, 6.8.2). Application specific figures are also known, see e.g. [6.11]. In contrast to reliability analyses for which no failure at item (system) level is allowed (only redundant parts can fail and be repaired on line), availability analyses allow failures at item (system) level.
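As a small numerical illustration of the stationary availability, the sketch below assumes the widely used steady-state expression $PA = AA = MTTF / (MTTF + MTTR)$ for a repairable item that is as-good-as-new after each repair. The function name and the values for MTTF and MTTR are hypothetical.

```python
def steady_state_availability(mttf, mttr):
    """Stationary point and average availability, assuming
    PA = AA = MTTF / (MTTF + MTTR)."""
    if mttf <= 0.0 or mttr < 0.0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf / (mttf + mttr)


# Hypothetical item: MTTF = 2000 h, MTTR = 4 h, continuous operation
pa = steady_state_availability(mttf=2000.0, mttr=4.0)
downtime_hours_per_year = (1.0 - pa) * 8760.0

print(f"PA = {pa:.6f}")
print(f"mean down time per year = {downtime_hours_per_year:.1f} h")
```

Because the availability of such an item is close to 1, the mean down time per year is often the more readable figure.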
1.2.7 Safety, Risk, and Risk Acceptance

Safety is the ability of the item not to cause injury to persons, nor significant material damage or other unacceptable consequences during its use. Safety evaluation must consider the following two aspects: safety when the item functions and is operated correctly and safety when the item or a part of it has failed. The first aspect deals with accident prevention, for which a large number of national and international regulations exist. The second aspect is that of technical safety which is investigated using the same tools as for reliability. However, a distinction between technical safety and reliability is necessary. While safety assurance examines measures which allow an item to be brought into a safe state in the case of failure (fail-safe behavior), reliability assurance deals more generally with measures for minimizing the total number of failures. Moreover, for technical safety the effects of external influences like human errors, catastrophes, sabotage, etc. are of great importance and must be considered carefully. The safety level of an item influences the number of product liability claims. However, an increase in safety can reduce reliability.

Closely related to the concept of (technical) safety are those of risk, risk management, and risk acceptance, including risk analysis and risk assessment [1.21, 1.26]. Risk problems are generally interdisciplinary and have to be solved in close cooperation between engineers and sociologists to find common solutions to controversial questions. An appropriate weighting between probability of occurrence and effect (consequence) of a given accident is important. The multiplicative rule is one among different possibilities. It is also necessary to consider the different causes (machine, machine & human, human) and effects (location, time, involved people, effect duration) of an accident. Statistical tools can support risk assessment. However, although the behavior of a homogeneous human population is often known, experience shows that the reaction of a single person can become unpredictable. Similar difficulties also arise in the evaluation of rare events in complex systems.

Considerations on risk and risk acceptance should take into account that the probability $p_a$ for a given accident which can be caused by one of $n$ statistically identical and independent items, each of them with occurrence probability $p$, is for $n p$ small nearly equal to $n p$ as per

$$p_a = n\,p\,(1 - p)^{n-1} \approx n\,p\,e^{-n p} \approx n\,p\,(1 - n\,p) \tag{1.11}$$
Equation (1.11) follows from the binomial distribution and the Poisson approximation (Eqs. (A6.120) & (A6.129)). It also applies with $n p = \lambda_{tot} T$ to the case in which one assumes that the accident occurs randomly in the interval $(0, T]$, caused by one of $n$ independent items (systems) with failure rates $\lambda_1, \ldots, \lambda_n$, where $\lambda_{tot} = \lambda_1 + \ldots + \lambda_n$. This is because the sum of $n$ independent Poisson processes is again a Poisson process (Eq. (7.27)) and the probability $\lambda_{tot} T\,e^{-\lambda_{tot} T}$ for one failure in the interval $(0, T]$ is nearly equal to $\lambda_{tot} T$. Thus, for $n p \ll 1$ or $\lambda_{tot} T \ll 1$ it holds that

$$p_a \approx n\,p = \lambda_{tot}\,T = (\lambda_1 + \ldots + \lambda_n)\,T \tag{1.12}$$
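How close the approximations are for small $n p$ can be checked with a few lines of Python. The sketch compares the binomial probability that exactly one of $n$ independent items causes the accident, its Poisson approximation, and the first-order value $n p$; the function name and the values of `n` and `p` are hypothetical.

```python
import math


def accident_probability(n, p):
    """Return (binomial, Poisson approximation, first-order value n * p)
    for the probability that exactly one of n independent items,
    each with occurrence probability p, causes the accident."""
    binomial = n * p * (1.0 - p) ** (n - 1)
    poisson = n * p * math.exp(-n * p)
    first_order = n * p
    return binomial, poisson, first_order


binomial, poisson, first_order = accident_probability(n=1000, p=1e-6)
print(binomial, poisson, first_order)

# Growth of the accident probability with the number of items in use
for n in (10, 100, 1000):
    print(n, accident_probability(n, 1e-6)[0])
```

For $n p = 10^{-3}$ the three values differ only in the fourth significant digit, and the accident probability grows almost linearly with the number of items in use.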
Also by assuming a reduction of the individual occurrence probability $p$ (or failure rate $\lambda_i$), one recognizes that in the future it will be necessary either to accept greater risks $p_a$ or to keep the spread of high-risk technologies under tighter control. Similar considerations could also be made for the problem of environmental stresses caused by mankind. Aspects of ecologically acceptable production, use, disposal, and recycling or reuse of products will become subject for international regulations, in the general context of sustainable development. In the context of a product development, risks related to feasibility and time to market within the given cost constraints must be considered during all development phases (feasibility checks in Fig. 1.6 and Tables A3.3 & 5.3).
Mandatory for risk management are psychological aspects related to risk awareness and safety communication. As long as a danger or risk is not perceived, people often do not react. Since safe behavior presupposes risk awareness, communication is an important tool to prevent a risk related to the system considered from being underestimated, see e.g. [1.26].
1.2.8 Quality

Quality is understood as the degree to which a set of inherent characteristics fulfills requirements. This definition, given now also in the ISO 9000:2000 [A1.6], follows closely the traditional definition of quality, expressed by fitness for use, and applies to products and services as well.
1.2.9 Cost and System Effectiveness

All previously introduced concepts are interrelated. Their relationship is best shown through the concept of cost effectiveness, as given in Fig. 1.3. Cost effectiveness is a measure of the ability of the item to meet a service demand of stated quantitative characteristics, with the best possible usefulness to life-cycle cost ratio. It is often also referred to as system effectiveness. Figure 1.3 deals essentially with technical and cost aspects. Some management aspects are considered in Appendices A2 - A5. From Fig. 1.3, one recognizes the central role of quality assurance, bringing together all assurance activities (Section 1.3.3), and of dependability (collective term for availability performance and its influencing factors).

As shown in Fig. 1.3, life-cycle cost (LCC) is the sum of the cost for acquisition, operation, maintenance, and disposal of an item. For complex systems, higher reliability in general leads to a higher acquisition cost and lower operating cost, so that the optimum of life-cycle cost seldom lies at extremely low or high reliability figures. For such a system, yearly operating and maintenance cost often lies between 3 and 6% of acquisition cost, and experience shows that up to 80% of the life-cycle cost is frequently generated by decisions early in the design phase. In the future, life-cycle cost will take more into account current and deferred damage to the environment caused by production, use, and disposal of an item. Life-cycle cost optimization is project specific, in general, and falls within the framework of cost effectiveness or systems engineering. It can be positively influenced by concurrent engineering [1.13, 1.15, 1.22].

Figure 1.4 shows as an example the influence of the attainment level of quality and reliability targets on the sum of cost for quality assurance and for the assurance of reliability, maintainability, and logistic support for two complex systems [2.3 (1986)]. To introduce this model, let us first consider Example 1.1.
1 Basic Concepts, Quality and Reliability Assurance of Complex Equipment and Systems
Example 1.1
An assembly contains n independent components, each with probability p of being defective. Let c_k be the cost to replace k defective components. Determine (i) the expected value (mean) C_(i) of the total replacement cost (no defective components are allowed in the assembly) and (ii) the mean of the total cost (test and replacement) C_(ii) if the components are submitted to an incoming inspection which reduces the defective probability from p to p_0 (test cost c_t per component).
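Solution
(i) The solution makes use of the binomial distribution (Appendix A6.10.7); question (i) is also solved in Example A6.18. The probability of having exactly k defective components in a lot of size n is given by (Eq. (A6.120))
$$p_k = \binom{n}{k}\, p^k (1-p)^{n-k}. \qquad (1.13)$$
The mean C_(i) of the total cost (deferred cost) caused by the defective components follows then from
$$C_{(i)} = \sum_{k=1}^{n} c_k\, p_k = \sum_{k=1}^{n} c_k \binom{n}{k}\, p^k (1-p)^{n-k}. \qquad (1.14)$$
(ii) To the cost caused by the defective components, calculated from Eq. (1.14) with p_0 instead of p, one must add the incoming inspection cost n c_t
$$C_{(ii)} = \sum_{k=1}^{n} c_k \binom{n}{k}\, p_0^k (1-p_0)^{n-k} + n\, c_t. \qquad (1.15)$$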
The difference between C(i) and C(ii) gives the gain (or loss) obtained by introducing the incoming inspection, allowing thus a cost optimization (see also Section 8.4 for a deeper discussion).
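The comparison in Example 1.1 can be tried out numerically. The following is a minimal sketch, assuming a replacement cost that grows linearly with the number of defective components; the function names, the linear cost, and all numerical values are hypothetical and serve only as an illustration.

```python
from math import comb


def mean_replacement_cost(n, p, cost):
    """Mean of cost(k), where k (number of defective components) is binomial(n, p)."""
    return sum(cost(k) * comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(1, n + 1))


def mean_cost_with_inspection(n, p0, cost, c_t):
    """Mean replacement cost at defective probability p0 plus the test cost n * c_t."""
    return mean_replacement_cost(n, p0, cost) + n * c_t


def linear_cost(k):
    """Hypothetical cost to replace k defective components (10 cost units each)."""
    return 10.0 * k


# Hypothetical values: 100 components, p = 1%, p0 = 0.1%, test cost 0.05 per component
n, p, p0, c_t = 100, 0.01, 0.001, 0.05

c_i = mean_replacement_cost(n, p, linear_cost)
c_ii = mean_cost_with_inspection(n, p0, linear_cost, c_t)
gain = c_i - c_ii  # positive: the incoming inspection pays off

print(f"C(i) = {c_i:.3f}, C(ii) = {c_ii:.3f}, gain = {gain:.3f}")
```

With a linear cost the mean replacement cost reduces to 10 n p, which gives a quick plausibility check of the sum. For a nonlinear c_k the same functions apply unchanged.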
With similar considerations to those in Example 1.1, one obtains for the expected value (mean) of the total repair cost C_cm during the cumulative operating time T of an item with failure rate λ and cost c_cm per repair
$$C_{cm} = \lambda\, T\, c_{cm} = \frac{T}{MTBF}\, c_{cm}. \qquad (1.16)$$
In Eq. (1.16), the term λT gives the mean value of the number of failures during T (Eq. (A7.42)), and MTBF is used as MTBF = 1/λ. From the above considerations, the following equation expressing the mean C of the sum of the cost for quality assurance and for the assurance of reliability, maintainability, and logistic support of a system can be obtained
$$C = C_q + C_r + C_{cm} + C_{pm} + C_l + \frac{T}{MTBF_S}\, c_{cm} + (1 - OA_S)\, T\, c_{off} + n_d\, c_d. \qquad (1.17)$$
Thereby, q denotes quality, r reliability, cm corrective maintenance, pm preventive maintenance, l logistic support, off down time, and d defects.
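As a small numerical illustration of Eq. (1.16), the sketch below computes the mean total repair cost as the mean number of failures during T times the cost per repair. The function name and all numerical values are hypothetical.

```python
def mean_repair_cost(failure_rate, operating_time, cost_per_repair):
    """Eq. (1.16): mean number of failures (failure_rate * operating_time) times cost per repair."""
    mean_failures = failure_rate * operating_time
    return mean_failures * cost_per_repair


# Hypothetical values
failure_rate = 1e-4        # failures per hour, i.e. MTBF = 1 / failure_rate = 10,000 h
operating_time = 50_000.0  # cumulative operating time T in hours
cost_per_repair = 500.0    # cost units per repair

c_cm = mean_repair_cost(failure_rate, operating_time, cost_per_repair)
mtbf = 1.0 / failure_rate

print(f"mean number of failures = {failure_rate * operating_time:.1f}")
print(f"mean total repair cost  = {c_cm:.1f}")
```

Writing the same quantity as T / MTBF times the cost per repair gives the identical result, since MTBF = 1/λ is assumed here.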
1.2 Basic Concepts
[Figure 1.3: block diagram. Recoverable box labels: Cost Effectiveness (System Effectiveness); Capability; Operational Availability (Dependability); Design, development, evaluation; Production; Cost analyses (life-cycle costs, VE, VA); Cost Effectiveness Assurance (System Effectiveness Assurance). Recoverable activity lists: (a) configuration management, quality testing (incl. reliability, maintainability, and safety tests), quality control during production (hardware), quality data reporting system, software quality; (b) reliability targets, required function, environmental conditions, parts & materials, design guidelines, derating, screening, redundancy, FMEA, FTA, etc., reliability block diagram, reliability prediction, design reviews; (c) maintainability targets, maintenance concept, design guidelines, partitioning in LRUs, operating control, diagnosis, maintainability analysis, design reviews; (d) safety targets, design guidelines, safety analysis (FMEA/FMECA, FTA, etc.), design reviews; (e) maintenance concept, customer documentation, spare parts provisioning, tools and test equipment for maintenance, after sales service]
Figure 1.3 Cost Effectiveness (System Effectiveness) for complex equipment & systems with high quality and reliability requirements (see Appendices A1 - A5 for definitions and management aspects; dependability can be used instead of operational availability, for a qualitative meaning)
MTBF_S and OA_S are the system mean operating time between failures (assumed here = 1/λ_S) and the system steady-state overall availability (Eq. (6.196) with T_pm instead of T_PM). T is the total system operating time (useful life) and n_d is the number of hidden defects discovered (and eliminated) in the field. C_q, C_r, C_cm, C_pm, and C_l are the cost for quality assurance and for the assurance of reliability, repairability, serviceability, and logistic support, respectively. c_cm, c_off, and c_d are the cost per repair, per down time hour, and per hidden defect, respectively (preventive maintenance cost are scheduled cost, considered here as a part of C_pm). The first five terms in Eq. (1.17) represent a part of the acquisition cost, the last three terms are deferred cost occurring during field operation.
A model for investigating the cost C according to Eq. (1.17) was developed in [2.3 (1986)] by assuming C_q, C_r, C_cm, C_pm, C_l, MTBF_S, OA_S, T, c_cm, c_off, and c_d as parameters and investigating the variation of the total cost expressed by Eq. (1.17) as a function of the level of attainment of the specified targets, i.e. by introducing the variables g_q = QA / QA_g, g_r = MTBF_S / MTBF_Sg, g_cm = MTTR_Sg / MTTR_S, g_pm = MTTPM_Sg / MTTPM_S, and g_l = MLD_Sg / MLD_S, where the subscript g denotes the specified target for the corresponding quantity. A power relationship
$$C_i = C_{ig}\, g_i^{m_i} \qquad (1.18)$$
was assumed between the actual cost C_i, the cost C_ig to reach the specified target (goal) of the considered quantity, and the level of attainment of the specified target (0 < m_l < 1 and all other m_i > 1). The following relationship between the number of hidden defects discovered in the field and the ratio C_q / C_qg was also included in the model
The final equation for the cost C as function of the variables g_q, g_r, g_cm, g_pm, and g_l follows then as (using Eq. (6.196) for OA_S)
The relative cost C / C_g given in Fig. 1.4 is obtained by dividing C by the value C_g from Eq. (1.20) with all g_i = 1. Extensive analyses with different values for m_i, C_i, MTBF_S, OA_S, T, c_cm, c_off, and c_d have shown that the value C / C_g is only moderately sensitive to the parameters m_i.
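The structure of this cost model can be illustrated with a short sketch. It rests on two assumptions that are not confirmed term by term in the text: that the power relationship has the form C_i = C_ig · g_i^(m_i), and that the total cost is the plain sum of the five assurance costs and the three deferred terms (repairs, down time, hidden defects) described for Eq. (1.17). Function names and all numerical values are hypothetical, and the dependence of MTBF_S, OA_S, and n_d on the attainment levels is deliberately left out.

```python
def attained_cost(cost_at_target, attainment, exponent):
    """Assumed power relationship: cost at attainment level g is C_ig * g**m."""
    return cost_at_target * attainment**exponent


def total_cost(assurance_costs, T, mtbf_s, oa_s, c_cm, c_off, n_d, c_d):
    """Assumed form of the total cost: assurance costs plus deferred field costs."""
    repairs = T / mtbf_s * c_cm          # mean number of failures times cost per repair
    down_time = (1.0 - oa_s) * T * c_off  # mean down time hours times cost per hour
    hidden_defects = n_d * c_d
    return sum(assurance_costs) + repairs + down_time + hidden_defects


# Hypothetical target costs C_qg, C_rg, C_cmg, C_pmg, C_lg, all targets exactly met (g = 1)
costs_at_target = [100_000.0, 80_000.0, 40_000.0, 30_000.0, 50_000.0]
assurance = [attained_cost(c, 1.0, 1.5) for c in costs_at_target]

c_g = total_cost(assurance, T=50_000.0, mtbf_s=5_000.0, oa_s=0.99,
                 c_cm=500.0, c_off=100.0, n_d=10, c_d=1_000.0)

# Over-fulfilling a target by 20% costs more than proportionally when m > 1 ...
steep = attained_cost(100_000.0, 1.2, 1.5)
# ... and less than proportionally when 0 < m < 1
flat = attained_cost(100_000.0, 1.2, 0.5)

print(f"C_g = {c_g:.0f}, steep = {steep:.0f}, flat = {flat:.0f}")
```

The exponent thus controls how sharply the cost reacts to over- or under-fulfilling a target, which is why the sensitivity of C / C_g to the parameters m_i is of interest.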
Figure 1.4 Sum of the relative cost C / C_g for quality assurance and for the assurance of reliability, maintainability, and logistic support of two complex systems with different mission profiles, as a function of the level of attainment of the specified quality and reliability targets g_q and g_r, respectively (the specified targets are dashed, results based on Eq. (1.20))
1.2.10 Product Liability
Product liability is the onus on a manufacturer (producer) or others to compensate for losses related to injury to persons, material damage, or other unacceptable consequences caused by a product (item). The manufacturer has to specify a safe operational mode for the product (user documentation). In legal documents related to product liability, the term product often indicates hardware only and the term defective product is in general used instead of defective or failed product. Responsible in a product liability claim are all those people involved in the design, production, sale, and maintenance of the product (item), including suppliers. Basically, strict liability is applied (the manufacturer has to demonstrate that the product was free from defects). This holds in the USA and increasingly in Europe [1.8]. However, in Europe the causality between damage and defect still has to be demonstrated by the user. The rapid increase of product liability claims (in the USA alone, 50,000 in 1970 and over one million in 1990) cannot be ignored by manufacturers. Although such a situation has probably been influenced by the peculiarity of US legal procedures, configuration management and safety analysis (in particular causes-to-effects analyses) as well as considerations on risk management should be performed to increase safety and avoid product liability claims (see Sections 1.2.7 & 2.6, and Appendix A3.3).
1.2.11 Historical Development
Methods and procedures of quality assurance and reliability engineering have been developed extensively over the last 50 years. For indicative purposes, Table 1.1 summarizes the major steps of this development and Fig. 1.5 shows the approximate distribution of the relative effort between quality assurance and reliability engineering during the same period of time. Because of the rapid progress of microelectronics, considerations on redundancy, fault tolerance, test strategy, and software quality have increased in importance. A skillful, allegorical presentation of the story of reliability (as an Odyssey) is given in [1.25].
Table 1.1 Historical development of quality assurance (management) and reliability engineering

Before 1940: Quality attributes and characteristics are defined. In-process and final tests are carried out, usually in a department within the production area. The concept of quality of manufacture is introduced.

1940 - 50: Defects and failures are systematically collected and analyzed. Corrective actions are carried out. Statistical quality control is developed. It is recognized that quality must be built into an item. The concept quality of design becomes important.

1950 - 60: Quality assurance is recognized as a means for developing and manufacturing an item with a specified quality level. Preventive measures (actions) are added to tests and corrective actions. It is recognized that correct short-term functioning does not also signify reliability. Design reviews and systematic analysis of failures (failure data and failure mechanisms), performed often in the research & development area, lead to important reliability improvements.

1960 - 70: Difficulties with respect to reproducibility and change control, as well as interfacing problems during the integration phase, require a refinement of the concept of configuration management. Reliability engineering is recognized as a means of developing and manufacturing an item with specified reliability. Reliability estimation methods and demonstration tests are developed. It is recognized that reliability cannot easily be demonstrated by an acceptance test. Instead of a reliability figure (λ or MTBF = 1/λ), the contractual requirement is for a reliability assurance program. Maintainability, availability, and logistic support become important.

1970 - 80: Due to the increasing complexity and cost for maintenance of equipment and systems, the aspects of man-machine interface and life-cycle cost become important. Terms like product assurance, cost effectiveness and systems engineering are introduced. Product liability becomes important.

1980 - 90: Quality and reliability assurance activities are made project specific and carried out in close cooperation with all engineers involved in a project. Customers require demonstration of reliability and maintainability during the warranty period. The aspect of testability gains in significance. Test and screening strategies are developed to reduce testing cost and warranty services. Because of the rapid progress in microelectronics, greater possibilities are available for redundant and fault tolerant structures. The concept of software quality is introduced.

After 1990: The necessity to further shorten the development time leads to the concept of concurrent engineering. Total Quality Management (TQM) appears as a refinement to the concept of quality assurance as used at the end of the seventies.
Figure 1.5 Approximate distribution of the relative effort between quality assurance and reliability engineering for complex equipment and systems (axes: relative effort [%] versus year; area labels: quality assurance; systems engineering (part); fault causes / modes / effects / mechanisms analysis; reliability analysis; software quality; configuration management; quality testing, quality control, quality data reporting system)
1.3 Basic Tasks & Rules for Quality and Reliability Assurance of Complex Systems
This section deals with some important considerations on the organization of quality and reliability assurance in the case of complex equipment and systems with high quality and reliability requirements. This minor part of the book aims to support managers in answering the question of how to specify and realize high reliability targets for complex equipment and systems when tailoring is not mandatory. Refinements are in Appendices A1 - A5, with considerations on quality management and total quality management (TQM) as well. As a general rule, quality assurance and reliability engineering must avoid bureaucracy, be integrated in project activities, and support quality management and concurrent engineering efforts, as per TQM.
1.3.1 Quality and Reliability Assurance Tasks
Experience shows that the development and production of complex equipment and systems with high reliability, maintainability, availability, and/or safety targets requires specific activities during all life-cycle phases of the item considered. For complex equipment and systems, Fig. 1.6 shows the life-cycle phases and Table 1.2 gives the main tasks for quality and reliability assurance. Table 1.2 also depicts the period of time over which the tasks have to be performed. Within a project, the tasks of Table 1.2 must be refined in a project-specific quality and reliability assurance program (Appendix A3).
Table 1.2 Main tasks for quality and reliability assurance of complex equipment and systems with high quality and reliability requirements (the bar height is a measure of the relative effort)

Main tasks for quality and reliability assurance of complex equipment and systems, conforming to TQM (see Table A3.2 for more details and for task assignment)
1. Customer and market requirements
2. Preliminary analyses
3. Quality and reliability aspects in specs, quotations, contracts, etc.
4. Quality and reliability assurance program
5. Reliability and maintainability analyses
6. Safety and human factor analyses
7. Selection and qualification of components and materials
8. Supplier selection and qualification
9. Project-dependent procedures and work instructions
10. Configuration management
11. Prototype qualification tests
12. Quality control during production
13. In-process tests
14. Final and acceptance tests
15. Quality data reporting system
16. Logistic support
17. Coordination and monitoring
18. Quality costs
19. Concepts, methods, and general procedures (quality and reliability)
20. Motivation and training
[Figure 1.6: flow diagram. Recoverable phase labels: Conception, Definition, Design, Development, Evaluation (incl. full development); Production (Manufacturing), with pilot production and series production; Installation, Operation. Recoverable item lists, in order of appearance: (a) idea, market requirements, evaluation of delivered equipment and systems, proposal for preliminary study; (b) system specifications, feasibility check, interface definition, proposal for the design phase; (c) feasibility check, revised system specifications, qualified and released prototypes, technical documentation, proposal for pilot production; (d) feasibility check, production documentation, qualified production processes, qualified and released first series item, proposal for series production; (e) series item, customer documentation, logistic concept, spare provisioning]
Figure 1.6 Basic life-cycle phases of complex equipment and systems (the output of a given phase is the input of the next phase; see Tab. 5.3 for software)
1.3.2 Basic Quality and Reliability Assurance Rules
Performance, dependability, cost, and time to market are key factors for today's products and services. Taking care of the considerations in Section 1.3.1, the basic rules for a quality and reliability assurance optimized by considering cost and time schedule aspects (conforming to TQM) can be summarized as follows:
1. Quality and reliability targets should be just as high as necessary to satisfy real customer needs
   → Apply the rule "as-good-as-necessary".
2. Activities for quality & reliability assurance should be performed continuously throughout all project phases, from definition to operating phase (Table 1.2)
   → Do not change the project manager before ending the pilot production.
3. Activities must be performed in close cooperation between all engineers involved in the project (Table A3.2)
   → Use TQM and concurrent engineering approaches.
4. Quality and reliability assurance activities should be monitored by a central quality & reliability assurance department (Q & RA), which cooperates actively in all project phases (Fig. 1.7 and Table A3.2)
   → Establish an efficient and independent quality & reliability assurance department (Q & RA) active in the projects.
Figure 1.7 shows a basic organization which could embody the above rules and satisfy requirements of quality management standards (Appendix A2). As shown in Table A3.2, the assignment of quality and reliability assurance tasks should be such that every engineer in a project bears his/her own responsibilities (as per TQM). A design engineer should for instance be responsible for all aspects of his/her own product (e.g. an assembly) including reliability, maintainability & safety, and the production department should be able to manufacture and test such an item within its own competence. The quality & reliability assurance department (Q & RA in Fig. 1.7) can for instance be responsible for: setting targets for reliability and quality levels; coordination of the activities belonging to quality and reliability assurance; preparation of guidelines and working documents (quality and reliability aspects); qualification, testing and screening of components and material (quality and reliability aspects); release of manufacturing processes (quality and reliability aspects); development and operation of the quality data reporting system; solution of quality and reliability problems at the equipment and system level; acceptance testing. This central quality and reliability department should be neither too small (credibility) nor too large (sluggishness).
Figure 1.7 Basic organizational structure for quality & reliability assurance in a company producing complex equipment and systems with high quality (Q), reliability (R), and/or safety requirements (connecting lines indicate close cooperation; A denotes assurance, I inspection)
1.3.3 Elements of a Quality Assurance System
As stated in Section 1.3.1, many of the tasks associated with quality assurance (here in the sense of quality management as per TQM) are interdisciplinary in nature. In order to have a minimum impact on cost and time schedules, their solution requires the concurrent efforts of all engineers involved in a project. To improve coordination, it can be useful to group the quality assurance activities into the following basic areas (Fig. 1.3):
1. Configuration Management: Procedure used to specify, describe, audit & release the configuration of an item, as well as to control it during modifications or changes. Configuration management is an important tool for quality assurance. It can be subdivided into configuration identification, auditing (design reviews), control, and accounting (Appendix A3.3.5).
2. Quality Tests: Tests to verify whether the item conforms to specified requirements. Quality tests include incoming inspections, as well as qualification tests, production tests, and acceptance tests. They also cover reliability, maintainability, safety, and software aspects. To be cost effective, quality tests must be coordinated and integrated into a test strategy.
3. Quality Control During Production: Control (monitoring) of the production processes and procedures to reach a stated quality of manufacturing.
4. Quality Data Reporting System (QDS, FRACAS): A system to collect, analyze, and correct all defects and failures (faults) occurring during the production and test of an item, as well as to evaluate and feed back the corresponding quality and reliability data. Such a system is generally computer assisted. Analysis of failures and defects must be traced to the cause, to avoid repetition of the same problem.
5. Software Quality: Special procedures and tools to specify, develop, and test software (Section 5.3).
Configuration management spans from the definition up to the operating phase (Appendices A3 & A4). Quality tests encompass technical and statistical aspects (Chapters 3, 7, and 8). The concept of a quality data reporting system is depicted in Fig. 1.8 (see Appendix A5 for basic requirements). Table 1.3 shows an example of data reporting sheets for PCBs evaluation. The quality and reliability assurance system must be described in an appropriate quality handbook supported by the company management. A possible content of such a handbook for a company producing complex equipment and systems with high quality & reliability requirements can be: General, Project Organization, Quality Assurance (Management), Quality & Reliability Assurance Program, Reliability Engineering, Maintainability Engineering, Safety Engineering, Software Quality Assurance.
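To make the idea of a computer-assisted quality data reporting system more concrete, the following minimal sketch shows one possible record layout and two simple evaluations (faults per place of occurrence and faults per component). The record fields, category names, and sample entries are hypothetical; they are loosely modeled on the headings of Table 1.3 and do not describe a particular system.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultReport:
    """One defect or failure report (hypothetical field names)."""
    pcb: str        # board type on which the fault was observed
    component: str  # component found defective or failed
    place: str      # "incoming", "in-process", "final test", or "warranty"
    cause: str      # "systematic", "inherent", or "not identified"


# Hypothetical sample data for one reporting period
REPORTS = [
    FaultReport("PCB-A", "C12", "incoming", "systematic"),
    FaultReport("PCB-A", "U3", "final test", "inherent"),
    FaultReport("PCB-B", "U3", "warranty", "not identified"),
    FaultReport("PCB-B", "U3", "final test", "inherent"),
]

# Evaluation: where do faults show up, and which components dominate?
faults_per_place = Counter(report.place for report in REPORTS)
faults_per_component = Counter(report.component for report in REPORTS)
open_causes = [r for r in REPORTS if r.cause == "not identified"]

print(faults_per_place.most_common())
print(faults_per_component.most_common(1))
print(f"{len(open_causes)} report(s) still need a cause analysis")
```

Keeping the cause as an explicit field reflects the requirement that failures and defects be traced to the cause; reports without an identified cause remain visible as open items.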
Table 1.3 Example of information status for PCBs (populated printed circuit boards) from a quality data reporting system

a) Defects and failures at PCB level (Period: ....)
b) Defects and failures at component level (Period: ....; PCB: ....; No. of PCBs: ....)
c) Cause analysis for defects and failures due to components (Period: ....)
d) Correlation between components and PCBs (Period: ....)

Recoverable column headings: component; manufacturer; type; application; number of components; number of faults; %; number of faults per place of occurrence (incoming, in-process, final test, warranty period); PCB; cause (systematic, inherent failure, not identified); percent defective (%) (observed, predicted); failure rate (observed, predicted); measures (short term, long term)
1.3.4 Motivation and Training
Cost effective quality and reliability assurance (management) can be achieved if every engineer involved in a project is made responsible for his/her assigned activities (e.g. as per Table A3.2). Figure 1.9 shows a comprehensive, practice oriented, motivation and training program in a company producing complex equipment and systems with high quality & reliability requirements.
Basic training
Title: Quality Management and Reliability Engineering
Aim: Introduction to tasks, methods, and organization of the company's quality and reliability assurance
Participants: Top and middle management, project managers, selected engineers
Duration: 4 h (seminar with discussion)
Documentation: ca. 30 pp.

Advanced training
Title: Methods of Reliability Engineering
Aim: Learning the methods used in reliability assurance
Participants: Project managers, engineers from marketing & production, selected engineers from development
Duration: 8 h (seminar with discussion)
Documentation: ca. 40 pp.

Advanced training
Title: Reliability Engineering
Aim: Learning the techniques used in reliability engineering (applications oriented and company specific)
Participants: Design engineers, Q&R specialists, selected engineers from marketing and production
Duration: 24 h (course with exercises)
Documentation: ca. 150 pp.

Special training
Title: Special Topics*
Aim: Learning special tools and techniques
Participants: Q&R specialists, selected engineers from development and production
Duration: 4 to 16 h per topic
Documentation: ca. 20 pp. per topic
* Examples: Statistical Quality Control, Test and Screening Strategies, Software Quality, Testability, Reliability and Availability of Repairable Systems, Fault-Tolerant Systems with Hardware and Software, Mechanical Reliability, Failure Mechanisms and Failure Analysis, etc.

Figure 1.9 Example for a practical oriented training and motivation program in a company producing complex equipment and systems with high quality (Q) & reliability (R) requirements
2 Reliability Analysis During the Design Phase (Nonrepairable Items up to System Failure)
Reliability analysis during the design and development of complex equipment and systems is important to detect and eliminate reliability weaknesses as early as possible and to perform comparative studies. Such an investigation includes failure rate and failure mode analysis, verification of the adherence to design guidelines, and cooperation in design reviews. This chapter presents methods and tools for failure rate and failure mode analysis of complex equipment and systems considered as nonrepairable (up to system failure). After a short introduction (Section 2.1), Section 2.2 deals with series-parallel structures. Complex structures, elements with more than one failure mode, and parallel models with load sharing are investigated in Section 2.3. Reliability allocation is introduced in Section 2.4. Stress / strength and drift analysis is discussed in Section 2.5. Section 2.6 deals with failure mode and causes-to-effects analyses. Section 2.7 gives a checklist for reliability aspects in design reviews. Maintainability is considered in Chapter 4 and repairable systems are investigated in Chapter 6 (including complex systems for which a reliability block diagram does not exist, imperfect switching, incomplete coverage, common cause failures, reconfigurable systems, as well as an introduction to Petri nets, dynamic FT, and computer-aided analysis). Design guidelines are in Chapter 5, qualification tests in Chapter 3, and reliability tests in Chapters 7 & 8. Theoretical foundations for this chapter are in Appendix A6.
2.1 Introduction
An important part of the reliability analysis during the design and development of complex equipment and systems deals with failure rate and failure mode investigation as well as with the verification of the adherence to appropriate design guidelines for reliability. Failure mode and causes-to-effects analysis is considered in Section 2.6, design guidelines are given in Chapter 5. Sections 2.2 - 2.5 are devoted to failure rate analysis. Investigating the failure rate of a complex equipment or system leads to the calculation of the predicted reliability, i.e. that reliability which can be calculated from the structure of the item and the reliability of its elements. Such a prediction is necessary for an early detection of reliability weaknesses, for comparative studies, for availability investigation taking care of maintainability and logistic support, and for the definition of quantitative reliability targets for subcontractors. However, because of different kinds of uncertainties, the predicted reliability can often be given only with a limited accuracy. To these uncertainties belong simplifications in the mathematical modeling (independent elements, complete and sudden failures, no flaws during design and manufacturing, no damages), insufficient consideration of faults caused by internal or external interference (switching, transients, EMC, etc.), and inaccuracies in the data used for the calculation of the component failure rates. On the other hand, the true reliability of an item can only be determined by reliability tests, performed often at the prototype's qualification tests, i.e. late in the design and development phase. Practical applications also show that with an experienced reliability engineer, the predicted failure rate at equipment or system level often agrees reasonably well (within a factor of 2) with field data. Moreover, relative values obtained by comparative studies generally have a much greater accuracy than absolute values. All these reasons support the efforts for a reliability prediction during the design of equipment and systems with specified reliability targets.
Besides theoretical considerations, discussed in the following sections, practical aspects have to be considered when designing reliable equipment or systems, for instance with respect to operating conditions and to the mutual influence between elements (input/output, load sharing, effects of failures, transients, etc.). Concrete possibilities for reliability improvement are reduction of thermal, electrical and mechanical stresses, correct interfacing of components and materials, simplification of design and construction, use of qualitatively better components and materials, protection against ESD and EMC, screening of critical components and assemblies, use of redundancy, in that order. Design guidelines (Chapter 5) and design reviews (Tables A3.3, 2.8, 4.3, and 5.5, Appendix A4) are mandatory to support such improvements. This chapter deals with nonrepairable (up to system failure) equipment and systems. Maintainability is discussed in Chapter 4. Reliability and availability of repairable equipment and systems is considered carefully in Chapter 6.
[Figure 2.1: flow chart. Recoverable steps, in order: required function; set up the reliability block diagram (RBD), by performing a FMEA where redundancy appears; determine the component stresses; compute the failure rate λ_i of each component; compute R(t) at the assembly level; check the fulfillment of reliability design rules; perform a preliminary design review; decision step with a feedback branch "eliminate reliability weaknesses" (component/material selection, derating, screening, redundancy) and a "yes" branch "go to the next assembly or to the next integration level"]
Figure 2.1 Reliability analysis procedure at assembly level
Taking account of the above considerations, Fig. 2.1 shows the reliability analysis procedure for an assembly. The procedure of Fig. 2.1 is based on the part stress method given in Section 2.2.4 (see Section 2.2.7 for the part count method). Also included are a failure mode analysis (FMEA/FMECA to check the validity of the assumed failure modes) and a verification of the adherence to design guidelines for reliability in a preliminary design review (Section 5.1, Appendix A3.3.5). Verification of the assumed failure mode is mandatory where redundancy appears, in particular because of the series element in the reliability block diagram (see for instance Examples 2.1 - 2.3 and, for a comparative investigation, Figs. 2.8 - 2.9, and 6.17 - 6.18). To simplify the notation in the following sections, reliability will be used for predicted reliability and system instead of technical system, i.e. for a system with ideal human factors and logistic support.
2.2 Predicted Reliability of Equipment and Systems with Simple Structure
Simple structures are those for which a reliability block diagram exists and can be reduced to a series/parallel form with independent elements. For such an item, the predicted reliability is calculated according to the following procedure, see Fig. 2.1:
1. Definition of the required function and of its associated mission profile.
2. Derivation of the corresponding reliability block diagram (RBD).
3. Determination of the operating conditions for each element of the RBD.
4. Determination of the failure rate for each element of the RBD.
5. Calculation of the reliability for each element of the RBD.
6. Calculation of the item (system) reliability function R_S(t).
7. Elimination of reliability weaknesses and return to step 1 or 2, as necessary.
This section discusses steps 1 to 6 at some length; see Example 2.6 for the application to a simple situation. For the investigation of equipment or systems for which a reliability block diagram does not exist, refer to Section 6.8.
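Steps 4 to 6 of the procedure can be sketched numerically. The following minimal example assumes independent elements with constant failure rates (so that each element has reliability exp(-λ_i t)), the product rule for elements in series, and active redundancy for elements in parallel. These modeling assumptions, the function names, the block diagram, and the failure rates are illustrative choices, not values taken from the text.

```python
from math import exp, prod


def element_reliability(failure_rate, t):
    """Reliability of one element with constant failure rate (per hour) at time t (hours)."""
    return exp(-failure_rate * t)


def series(reliabilities):
    """All elements are necessary for the required function: product of reliabilities."""
    return prod(reliabilities)


def parallel(reliabilities):
    """Active redundancy: the group fails only if all of its elements fail."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical assembly: element A and element B in series with two redundant elements C
t = 10_000.0  # mission duration in hours
r_a = element_reliability(2e-6, t)
r_b = element_reliability(5e-6, t)
r_c = element_reliability(1e-5, t)

r_s = series([r_a, r_b, parallel([r_c, r_c])])

print(f"R_A = {r_a:.6f}, R_B = {r_b:.6f}, R_C = {r_c:.6f}")
print(f"R_S({t:.0f} h) = {r_s:.6f}")
```

Under these assumptions the failure rates of series elements simply add, so the two series elements A and B behave like a single element with failure rate 7e-6 per hour. Step 7 would then compare R_S with the target and revisit the weakest block.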
2.2.1 Required Function
The required function specifies the item's task. Its definition is the starting point for any analysis, as it defines failures. For practical purposes, parameters should be defined with tolerances and not merely as fixed values. In addition to the required function, environmental conditions at system level must also be defined. Among these are ambient temperature (e.g. +40°C), storage temperature (e.g. -20 to +60°C), humidity (e.g. 40 to 60%), dust, corrosive atmosphere, vibration (e.g. 0.5 g_n at 2 to 60 Hz), shock, noise (e.g. 40 to 70 dB), and power supply voltage variations (e.g. ±20%). From these global environmental conditions, the constructive characteristics of the system, and the internal loads, operating conditions (actual stresses) for each element of the system can be determined. Required function and environmental conditions are often time dependent, leading to a mission profile (operational profile for software). A representative mission profile and the corresponding reliability targets should be defined in the system specifications (initially as a rough description and then refined step by step); see the remark on p. 38 and Section 6.8.6.2 for phased-mission systems.
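The recommendation to specify parameters with tolerances rather than as fixed values can be illustrated by a small sketch. It stores some of the example ranges quoted in this section (storage temperature, humidity, noise, vibration frequency) and checks measured values against them. The class, the dictionary keys, and the measured values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Specified tolerance range for one environmental parameter."""
    low: float
    high: float
    unit: str

    def contains(self, value):
        return self.low <= value <= self.high


# Example ranges as quoted in the text; the key names are illustrative
ENVIRONMENT = {
    "storage_temperature": Range(-20.0, 60.0, "°C"),
    "relative_humidity": Range(40.0, 60.0, "%"),
    "noise": Range(40.0, 70.0, "dB"),
    "vibration_frequency": Range(2.0, 60.0, "Hz"),
}


def out_of_spec(measured):
    """Return the names of measured parameters that fall outside their specified range."""
    return [name for name, value in measured.items()
            if name in ENVIRONMENT and not ENVIRONMENT[name].contains(value)]


violations = out_of_spec({"storage_temperature": 65.0, "relative_humidity": 50.0})
print(violations)
```

A time-dependent mission profile could be represented in the same spirit as a sequence of such range sets, one per mission phase.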
2.2.2 Reliability Block Diagram
The reliability block diagram (RBD) is an event diagram. It answers the following question: Which elements of the item under consideration are necessary for the fulfillment of the required function and which can fail without affecting it? Setting up an RBD involves, at first, partitioning the item into elements with clearly defined tasks. The elements which are necessary for the required function are connected in series, while elements which can fail with no effect on the required function (redundancy) are connected in parallel. Obviously, the ordering of the series elements in the reliability block diagram can be arbitrary. Elements which are not relevant for (or used in) the required function under consideration are removed (put into a reference list), after having verified (FMEA) that their failure does not affect elements involved in the required function. These considerations make it clear that for a given system, each required function has its own reliability block diagram. In setting up the reliability block diagram, care must be taken regarding the fact that only two states (good or failed) and one failure mode (e.g. opens or shorts) can be considered for each element. Particular attention must also be paid to the correct identification of the parts which appear in series with a redundancy (see e.g. Section 6.8).
For large equipment or systems the reliability block diagram is derived top down as indicated in Fig. 2.2 (for 4 levels as an example). At each level, the corresponding required function is derived from that at the next higher level. The technique of setting up reliability block diagrams is shown in Examples 2.1 to 2.3 (see also Examples 2.6, 2.13, 2.14). One recognizes that a reliability block diagram basically differs from a functional block diagram. Examples 2.2, 2.3, 2.14 also show that one or more elements can appear more than once in a reliability block diagram, while the corresponding element is physically present only once in the item considered. To point out the strong dependence created by this fact, it is mandatory to use a box form other than a square for these elements (in Example 2.2, if $E_2$ fails the required function is fulfilled only if $E_1$, $E_3$ and $E_5$ work). To avoid ambiguities, each physically different element of the item should bear its own number. The typical structures of reliability block diagrams are summarized in Table 2.1 (see Section 6.8 for situations in which a reliability block diagram does not exist).
Figure 2.2 Procedure for setting up the reliability block diagram (RBD) of a system with four levels
Example 2.1
Set up the reliability block diagrams for the following circuits:
[Circuit diagrams: (i) resistive voltage divider, (ii) electronic switch, (iii) simplified radio receiver]
Solution
Cases (i) and (iii) exhibit no redundancy, i.e. for the required function (tacitly assumed here) all elements must work. In case (ii), transistors $TR_1$ and $TR_2$ are redundant if their failure mode is a short between emitter and collector (the failure mode for resistors is generally an open). From these considerations, the reliability block diagrams follow as
[Reliability block diagrams: (i) resistive voltage divider, (ii) electronic switch, (iii) simplified radio receiver]
Example 2.2
An item is used for two different missions with the corresponding reliability block diagrams given in the figures below. Give the reliability block diagram for the case in which both functions are simultaneously required in a common mission.
[Reliability block diagrams: Mission 1, Mission 2]
Solution
The simultaneous fulfillment of both required functions leads to the series connection of both reliability block diagrams. Simplification is possible for element $E_1$ but not for element $E_2$. A deeper discussion of phased-mission reliability analysis is given in Section 6.8.6.2.
[Reliability block diagram: Mission 1 and 2]
Table 2.1 Basic reliability block diagrams and associated reliability functions (nonrepairable up to system failure, new at $t = 0$ ($R_i(0) = 1$), independent elements (except $E_2$ in 9), active redundancy; 7-9 are complex structures and cannot be reduced to a series-parallel structure with independent elements). Notation: $R_S = R_{S0}(t)$; $R_i = R_i(t)$, $R_i(0) = 1$. The block diagrams themselves are not reproduced.
1. One-item structure: $R_S = R_i$; for $\lambda_i(t) = \lambda_i$, $R_i(t) = e^{-\lambda_i t}$
2. Series structure: $R_S = \prod_{i=1}^{n} R_i$; $\lambda_S(t) = \lambda_1(t) + \ldots + \lambda_n(t)$
3. 1-out-of-2 redundancy: $R_S = R_1 + R_2 - R_1 R_2$
4. k-out-of-n redundancy ($R_1 = \ldots = R_n = R$): $R_S = \sum_{i=k}^{n} \binom{n}{i} R^i (1-R)^{n-i}$; for $k = 1$, $R_S = 1 - (1-R)^n$
5. Series/parallel structure: $R_S = (R_1 R_2 R_3 + R_4 R_5 - R_1 R_2 R_3 R_4 R_5)\, R_6 R_7$
6. Majority redundancy, general case $(n+1)$-out-of-$(2n+1)$, $n = 1, 2, \ldots$; for $n = 1$ with $R_1 = R_2 = R_3 = R$ and voter $E_v$: $R_S = (3R^2 - 2R^3)\, R_v$
7. Bridge structure (bi-directional on $E_5$): see Section 2.3
8. Bridge structure (unidirectional on $E_5$): see Section 2.3
9. The element $E_2$ appears twice in the reliability block diagram (not in the hardware): see Section 2.3
Example 2.3
Set up the reliability block diagram for the electronic circuit shown on the right. The required function asks for operation of $P_2$ (main assembly) and of $P_1$ or $P_1'$ (control cards).
Solution
This example is not as trivial as Examples 2.1 and 2.2. A good way to derive the reliability block diagram is to consider the missions "$P_1$ or $P_1'$ must work" and "$P_2$ must work" separately, and then to put both missions together as in Example 2.2 (see also Example 2.14).
Also given in Table 2.1 are the associated reliability functions for the case of nonrepairable elements (up to system failure) with active redundancy and independent elements except case 9 (Sections 2.2.6, 2.3.1-2.3.4); see Section 2.3.5 for load sharing, Section 2.5 for mechanical systems, and Chapter 6 for repairable systems.
Table 2.2 Most important parameters influencing the failure rate of electronic components (D denotes dominant, x denotes important)
Component families covered: digital and linear ICs, hybrid circuits, bipolar transistors, diodes, thyristors, optoelectronic components, resistors, capacitors, coils and transformers, relays and switches, connectors.
[Table body with the D and x entries per influencing parameter is not reproduced.]
Figure 2.3 Load capability and typical derating curve (dashed) for a bipolar Si transistor as function of the ambient temperature $\theta_A$ ($P$ = dissipated power, $P_N$ = rated power)
2.2.3 Operating Conditions at Component Level, Stress Factors
The operating conditions of each element in the reliability block diagram influence the item's reliability and have to be considered. These operating conditions are a function of the environmental conditions (Section 3.1.1) and internal loads, in operating and dormant state. Table 2.2 gives an overview of the most important parameters influencing electronic component failure rates. A basic assumption is that components are in no way overstressed. In this context it is important to consider that the load capability of many electronic components decreases with increasing ambient temperature. This holds in particular for power, but also for voltage and current. As an example, Fig. 2.3 shows the variation of the power capability as function of the ambient temperature $\theta_A$ for a bipolar Si transistor (with constant thermal resistance $R_{JA}$). The continuous line represents the load capability. To the right of the break point the junction temperature is nearly equal to 175°C (max. specified operating temperature). The dashed line gives a typical derating curve for such a device. Derating is the designed (intentional) non-utilization of the full load capability of a component with the purpose of reducing its failure rate. The stress factor (stress ratio, stress) $S$ is defined as
$$S = \frac{\text{applied load}}{\text{rated load at } 40^\circ\text{C}} \tag{2.1}$$
To give a feel for the orders of magnitude, Figs. 2.4 - 2.6 show the influence of the temperature (ambient $\theta_A$, case $\theta_C$, or junction $\theta_J$) and of the stress factor $S$ on the failure rate of some electronic components (from IEC 61709 [2.23]). Experience shows that for a good design and $\theta_A \le 40^\circ$C one should have $0.1 < S < 0.6$ for power, voltage, and current, $S \le 0.8$ for fan-out, and $S \le 0.7$ for $U_{in}$ of linear ICs (Table 5.1). $S < 0.1$ should also be avoided.
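The stress factor of Eq. (2.1) and the rule-of-thumb ranges quoted above can be checked with a few lines of Python. This is only a sketch: the function names, the dictionary keys, and the choice to apply the lower bound of 0.1 to fan-out and input voltage as well are assumptions made for illustration, not part of the text.

```python
def stress_factor(applied_load: float, rated_load_at_40c: float) -> float:
    """Stress factor S of Eq. (2.1): applied load divided by rated load at 40 °C."""
    if rated_load_at_40c <= 0:
        raise ValueError("rated load must be positive")
    return applied_load / rated_load_at_40c


# Upper bounds quoted in the text for a good design at ambient temperature <= 40 °C.
# The keys are hypothetical labels chosen for this sketch.
UPPER_BOUND = {"power": 0.6, "voltage": 0.6, "current": 0.6,
               "fan_out": 0.8, "linear_ic_input_voltage": 0.7}
LOWER_BOUND = 0.1  # the text advises against S < 0.1


def within_guideline(s: float, kind: str) -> bool:
    """True if S lies in the recommended range for the given kind of load."""
    if kind in ("power", "voltage", "current"):
        return LOWER_BOUND < s < UPPER_BOUND[kind]
    # Assumption of this sketch: the lower bound is applied to the other kinds too.
    return LOWER_BOUND <= s <= UPPER_BOUND[kind]


s = stress_factor(150.0, 300.0)  # e.g. 150 mW dissipated, 300 mW rated at 40 °C
print(s, within_guideline(s, "power"))
```

The same value of S can be acceptable for one kind of load and not for another, which is why the check takes the kind of load as an argument.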
Figure 2.4 Factor $\pi_T$ as function of the case temperature $\theta_C$ for capacitors and resistors, and factor $\pi_U$ as function of the voltage stress for capacitors (examples from IEC 61709 [2.23]); curves for paper, metallized paper and plastic; ceramic; aluminum, non-solid electrolyte; tantalum, solid electrolyte
Figure 2.5 Factor $\pi_T$ as function of the junction temperature $\theta_J$ (left, half log for semiconductors and right, linear for semiconductors, resistors, and coils) and factor $\pi_U$ as function of the power supply voltage for semiconductors (examples from IEC 61709 [2.23])
Figure 2.6 Factor $\pi_T$ as function of the junction temperature $\theta_J$ and factors $\pi_U$ and $\pi_I$ as function of voltage and current stress for optoelectronic devices (examples from IEC 61709 [2.23])
2.2.4 Failure Rate of Electronic Components
The failure rate $\lambda(t)$ of an item is the probability (referred to $\delta t$) of a failure in the interval $(t, t + \delta t]$ given that the item was new at $t = 0$ and did not fail in the interval $(0, t]$, see Eqs. (1.3), (1.5), (A6.25). For a large population of statistically identical and independent items, $\lambda(t)$ often exhibits three successive phases: one of early failures, one with constant (or nearly so) failure rate, and one involving failures due to wearout. Early failures should be eliminated through screening (Chapter 8). Wearout failures can be expected for some electronic components (electrolytic capacitors, power and optoelectronic devices, ULSI ICs) as well as for mechanical and electromechanical components. They must be considered on a case-by-case basis in setting up a preventive maintenance strategy (Sections 4.6, 6.8.2). To simplify calculations, reliability prediction is often performed by assuming a constant (time independent) failure rate during the useful life
$$\lambda(t) = \lambda.$$
This approximation greatly simplifies calculations, since a constant failure rate leads to a flow of failures described by a homogeneous Poisson process, i.e. to a process with memoryless property (Eqs. (A6.30), (A6.87), Appendix A7.2.5).
The failure rate of components can be assessed experimentally by accelerated reliability tests or from field data (if operating conditions are sufficiently well known) with appropriate statistical data analysis (Sections 7.2, 7.4 - 7.6). For established electronic and electromechanical components, models and figures are often given in failure rate handbooks [2.21-2.29]. Among these are FIDES Guide 2004 [2.21], HDBK-217Plus (2006) [2.22], IEC 61709 (1996) [2.23], IEC TR 62380 (2004) [2.24], IRPH 2003 [2.25], RDF-96 [2.28], and Telcordia SR-332 (2001) [2.29]. IEC 61709 gives laws of dependency of the failure rate on different stresses (temperature, voltage, etc.) and must be supported by a set of reference failure rates $\lambda_{ref}$ (e.g. for a standard industrial environment, i.e. 40°C ambient temperature $\theta_A$, $G_B$ as per Table 2.3, and steady-state conditions in the field). IRPH 2003 is based on IEC 61709 and gives reference failure rates. Effects of thermal cycling, dormant state, and ESD are considered in IEC TR 62380 and HDBK-217Plus. Refined models are in FIDES Guide 2004. HDBK-217Plus is the next generation of the PRISM software tool and aims to replace MIL-HDBK-217.
Table 2.3 Indicative figures for environmental conditions and corresponding factors $\pi_E$
Surviving entries (environment; temperature range; vibration frequency range; acceleration where legible):
$G_F$ (Ground fixed); -40 to +45°C; 2 - 200 Hz
$G_M$ (Ground mobile); -40 to +45°C; 2 - 500 Hz
$N_S$ (Nav. sheltered); -40 to +45°C; 2 - 200 Hz; 2 $g_n$
$N_U$ (Nav. unsheltered); -40 to +70°C; 2 - 200 Hz; 5 $g_n$
C = capacitors, DS = discrete semiconductors, R = resistors, RH = relative humidity, h = high, m = medium, l = low, $g_n \approx 10\,\mathrm{m/s^2}$ ($G_B$ is Ground stationary weather protected in [2.25, 2.29] and is taken as reference value in [2.22, 2.23, 2.24])
An international agreement on failure rate models for reliability predictions at equipment and system level in practical applications should be found to simplify comparative investigations ([1.2 (1996)] and remark on p. 38). Failure rates are taken from one of the above handbooks or from one's own field experience for the calculation of the predicted reliability. Models in these handbooks often have a simple structure, of the form
often further simplified to
by taking $\pi_E = \pi_Q = 1$ because of the assumed standard (industrial) environment ($G_B$ in Table 2.3) and standard quality level. Indicative figures for the factors $\pi_E$ and $\pi_Q$ are in Tables 2.3 and 2.4. $\lambda$ lies between $10^{-10}\,\mathrm{h^{-1}}$ for passive components and $10^{-7}\,\mathrm{h^{-1}}$ for VLSI ICs (Table A10.1, Example 2.4). The unit $10^{-9}\,\mathrm{h^{-1}}$ is designated by FIT (failures in time).
Table 2.4 Reference values for the quality factors $\pi_Q$
Columns (qualification): reinforced; CECC*; no special
Rows: monolithic ICs, hybrid ICs, discrete semiconductors, resistors, capacitors
[Table body is not reproduced; only the entries 0.1 and 2.0 survive, without legible position.]
* Reference value in [2.22-2.25] and class II in [2.29] (corresponds to MIL-HDBK-217F classes B, JANTX, M)
In general, $\lambda_0$ and $\lambda_{ref}$ increase exponentially with temperature (see Figs. 2.4 - 2.6 for some examples). The influence of the stress factor is illustrated by factors $\pi_U$ and $\pi_I$. For the factor $\pi_T$ as a function of the junction temperature $\theta_J$, an Arrhenius model is often used. In the case of only one dominant failure mechanism, Eq. (7.56) gives the ratio of the $\pi_T$ factors at two temperatures $T_2$ and $T_1$
$$\frac{\pi_{T_2}}{\pi_{T_1}} = A \approx e^{\frac{E_a}{k}\left(\frac{1}{T_1} - \frac{1}{T_2}\right)},$$
where $A$ is the acceleration factor, $k$ the Boltzmann constant ($8.6 \cdot 10^{-5}$ eV/K), $T$ the junction temperature (in kelvin), and $E_a$ the activation energy in eV. As in Figs. 2.4 - 2.6, experience shows that a global value for $E_a$ often lies between 0.3 eV and 0.6 eV for Si devices. The design guideline $\theta_J \le 100^\circ$C, if possible $\theta_J \le 80^\circ$C, given in Section 5.1 for semiconductor devices is based on this consideration. Models in IEC 61709 assume for $\pi_T$ two dominant failure mechanisms with activation energies $E_{a1}$ and $E_{a2}$ (about 0.3 eV for $E_{a1}$ and 0.6 eV for $E_{a2}$). The corresponding equation for $\pi_T$ takes in this case the form
$$\pi_T = \frac{a\, e^{z E_{a1}} + (1-a)\, e^{z E_{a2}}}{a\, e^{z_{ref} E_{a1}} + (1-a)\, e^{z_{ref} E_{a2}}},$$
where $0 \le a \le 1$ is a constant, $z = (1/T_{ref} - 1/T_2)/k$, and $z_{ref} = (1/T_{ref} - 1/T_1)/k$ with $T_{ref} = 313$ K (40°C). For components of good commercial quality, and using $\pi_E = \pi_Q = 1$, failure rate calculations lead to figures which for practical applications in standard industrial environments ($\theta_A = 40^\circ$C, $G_B$) often agree reasonably well with field data (up to a factor of 2). This holds at the equipment and system level, although deviations can occur at component level, depending on the failure rate catalog used (Example 2.4).
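To get a feel for the temperature dependence discussed above, the Arrhenius acceleration factor and a two-mechanism temperature factor can be evaluated numerically. The sketch below assumes the usual Arrhenius form $A = e^{(E_a/k)(1/T_1 - 1/T_2)}$ and a two-mechanism ratio built from the quantities $z$ and $z_{ref}$ defined in the text; the function names and the value of the weight `a` are hypothetical, and temperatures are converted with 273 K so that 40°C gives the 313 K quoted above.

```python
import math

K_BOLTZMANN_EV = 8.6e-5  # eV/K, rounded value quoted in the text


def kelvin(theta_c: float) -> float:
    """Convert °C to K using 273, consistent with T_ref = 313 K for 40 °C."""
    return theta_c + 273.0


def arrhenius_factor(theta1_c: float, theta2_c: float, e_a: float) -> float:
    """Acceleration factor A between junction temperatures theta1 and theta2 (°C)."""
    t1, t2 = kelvin(theta1_c), kelvin(theta2_c)
    return math.exp(e_a / K_BOLTZMANN_EV * (1.0 / t1 - 1.0 / t2))


def pi_t_two_mechanisms(theta2_c: float, theta1_c: float, a: float,
                        e_a1: float = 0.3, e_a2: float = 0.6,
                        theta_ref_c: float = 40.0) -> float:
    """Temperature factor for two failure mechanisms weighted by a and (1 - a).

    theta1_c is the reference junction temperature, theta2_c the actual one.
    The weight a is component specific; any value used here is illustrative.
    """
    t_ref = kelvin(theta_ref_c)
    z = (1.0 / t_ref - 1.0 / kelvin(theta2_c)) / K_BOLTZMANN_EV
    z_ref = (1.0 / t_ref - 1.0 / kelvin(theta1_c)) / K_BOLTZMANN_EV
    numerator = a * math.exp(z * e_a1) + (1.0 - a) * math.exp(z * e_a2)
    denominator = a * math.exp(z_ref * e_a1) + (1.0 - a) * math.exp(z_ref * e_a2)
    return numerator / denominator


for e_a in (0.3, 0.6):
    print(e_a, arrhenius_factor(55.0, 100.0, e_a))
```

With `a = 1` the two-mechanism expression collapses to the single-mechanism Arrhenius factor, which is a convenient sanity check on the implementation.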
Discussions based on comparisons with obsolete data should be dropped, and it would seem opportune to unify models and data, taking from each model the "good part" and putting them together for new, better models (a strategy applicable to other situations as well). Models for prediction in practical applications should remain reasonably simple, laws for dominant failure mechanisms should be given in international standards, and the list of reference failure rates $\lambda_{ref}$ should be updated yearly. Models based on failure mechanisms have to be used as a basis for simplified models. The assumption of $\lambda < 10^{-9}\,\mathrm{h^{-1}}$ should be confined to components with a stable production process and a reserve to technological limits. Calculation of the failure rate at equipment and system level often requires considerations on the mission profile. If the mission can be partitioned in time spans with almost homogeneous stresses, switching effects are negligible, and the failure rate is time independent (between successive state changes of the system), the contribution of each time span can be added linearly, as often assumed for duty cycles. With these assumptions, investigation of phased-mission systems (systems whose elements are used at different rates) becomes possible (Section 6.8.6.2). Estimation and demonstration of component and system failure rates are considered in Sections 7.2.3.1 and 7.2.3.2 - 7.2.3.3, respectively.
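The linear addition of contributions from mission phases can be illustrated as follows. This is a minimal sketch under the assumptions stated in the text (homogeneous stress within each phase, negligible switching effects, constant failure rate per phase); the phase durations and failure rates are invented for illustration.

```python
import math


def mission_average_failure_rate(phases):
    """Time-weighted mean failure rate over phases given as (duration_h, rate_per_h)."""
    total_time = sum(duration for duration, _ in phases)
    return sum(duration * rate for duration, rate in phases) / total_time


def mission_reliability(phases):
    """Probability of no failure over all phases, with a constant rate in each phase."""
    return math.exp(-sum(duration * rate for duration, rate in phases))


# Hypothetical 24 h duty cycle: 8 h operating at 100 FIT, 16 h dormant at 10 FIT.
phases = [(8.0, 100e-9), (16.0, 10e-9)]
print(mission_average_failure_rate(phases), mission_reliability(phases))
```

The average rate is what would be entered for the element when the rest of the analysis assumes a single constant failure rate.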
Example 2.4
For indicative purposes, the following table gives failure rates calculated according to some different data bases [2.29, 2.25, 2.24] for continuous operation in non-interface application; $\theta_A = 40^\circ$C, $\theta_J = 55^\circ$C, $S = 0.5$, $G_B$, and $\pi_Q = 1$ as for CECC certified and class II Telcordia; Pl is used for plastic package; $\lambda$ in $10^{-9}\,\mathrm{h^{-1}}$ (FIT), quantified at $1 \cdot 10^{-9}\,\mathrm{h^{-1}}$.
Columns: Telcordia 2001; IRPH 2003; IEC [2.24]**; $\lambda_{ref}$*
Rows: DRAM, CMOS, 1 M, Pl; SRAM, CMOS, 1 M, Pl; EPROM, CMOS, 1 M, Pl; 16 bit µP, CMOS, Pl; gate array, CMOS, 30,000 gates, 40 pins, Pl; lin, bip, 70 Tr, Pl; GP diode, Si, 100 mA, lin, Pl; bip. transistor, 300 mW, switching, Pl; JFET, 300 mW, switching, Pl; ceramic capacitor, 100 nF, 125°C, class 1; foil capacitor, 1 µF; Ta solid capacitor, herm., 100 µF, 0.3 Ω/V; MF resistor, 1/4 W, 100 kΩ; cermet pot, 50 kΩ, < 10 annual shaft rot.
[Numerical entries of the table are not reproduced.]
* Assumed value for computations as per IEC 61709 [2.23], $\theta_A = 40^\circ$C; ** Production year 2001 for ICs
2.2.5 Reliability of One-Item Structure
A one-item nonrepairable structure is characterized by the distribution function $F(t) = \Pr\{\tau \le t\}$ of its failure-free time $\tau$, assumed $> 0$ with $F(0) = 0$, and hereafter used as a synonym for failure-free operating time. The reliability function $R(t)$, i.e. the probability of no failure in the interval $(0, t]$, follows as (Eq. (A6.24))
$$R(t) = \Pr\{\tau > t\} = 1 - F(t), \qquad R(0) = 1. \tag{2.7}$$
The expected value (mean) of the failure-free time $\tau$, designated as MTTF (mean time to failure), can be calculated from Eq. (A6.38)
$$MTTF = E[\tau] = \int_0^\infty R(t)\,dt. \tag{2.8}$$
Should the one-item structure exhibit a useful life limited to $T_L$, Eq. (2.8) yields
$$MTTF_L = \int_0^{T_L} R(t)\,dt, \qquad R(t) = 0 \ \text{for} \ t > T_L.$$
In the following, $T_L = \infty$ will be assumed (except in Example 6.21). Equation (2.8) is an important relationship. It is valid not only for a one-item structure (often considered as an indivisible entity), but also for a one-item structure of arbitrary complexity. $R_S(t)$ and $MTTF_S$ are used to emphasize this
$$MTTF_{Si} = \int_0^\infty R_{Si}(t)\,dt. \tag{2.9}$$
Thereby, $S$ stands for system and $i$ for the state entered at $t = 0$ (Table 6.2). $i = 0$ holds for system new at $t = 0$, i.e. for $R_{S0}(0) = 1$ (this notation is used in the following sections, in particular in Chapter 6 dealing with repairable systems). Back to the one-item structure, considered in this section as an indivisible entity, and assuming $R(t)$ derivable, the failure rate $\lambda(t)$ of a nonrepairable one-item structure new at $t = 0$ is given by (Eq. (A6.25))
$$\lambda(t) = \lim_{\delta t \downarrow 0} \frac{1}{\delta t} \Pr\{t < \tau \le t + \delta t \mid \tau > t\} = -\frac{dR(t)/dt}{R(t)}. \tag{2.10}$$
Considering $R(0) = 1$, it follows that
$$R(t) = e^{-\int_0^t \lambda(x)\,dx}, \tag{2.11}$$
from which, for $\lambda(t) = \lambda$,
$$R(t) = e^{-\lambda t}. \tag{2.12}$$
The mean time to failure in this case is equal to $1/\lambda$. In practical applications
$$MTBF = \frac{1}{\lambda} \tag{2.13}$$
is often used, where MTBF stands for mean operating time between failures, thus expressing a figure applicable to repairable one-item structures. To avoid misuses, and also because of the often used estimate $\widehat{MTBF} = T/k$, MTBF should be confined to repairable items with constant (time independent) failure rate (constant failure rates for all elements in the case of a system, see remark on p. 358). As shown by Eq. (2.11), the reliability function of a nonrepairable one-item structure new at $t = 0$ is completely defined by its failure rate $\lambda(t)$. In the case of electronic components, $\lambda(t) = \lambda$ can often be assumed. The failure-free time $\tau$ then exhibits an exponential distribution ($F(t) = \Pr\{\tau \le t\} = 1 - e^{-\lambda t}$). For a time dependent failure rate, the distribution function of the failure-free time can often be approximated by the weighted sum of a Gamma distribution (Eq. (A6.97)) with $\beta < 1$ and a shifted Weibull distribution (Eq. (A6.96)) with $\beta > 1$ (Eq. (A6.34)). Equations (2.7) - (2.12) imply that the nonrepairable one-item structure is new at time $t = 0$. Also of interest in some applications is the probability of failure-free operation during an interval $(0, t]$ under the condition that the item has already operated without failure for $x_0$ time units before $t = 0$. This quantity is a conditional probability, designated by $R(t, x_0)$ and given by (Eq. (A6.27))
$$R(t, x_0) = \Pr\{\tau > t + x_0 \mid \tau > x_0\} = \frac{R(t + x_0)}{R(x_0)} = e^{-\int_{x_0}^{t + x_0} \lambda(x)\,dx}. \tag{2.14}$$
For $\lambda(x) = \lambda$, Eq. (2.14) reduces to Eq. (2.12). This memoryless property occurs only with constant (time independent) failure rate. Its use greatly simplifies calculations in the next sections, in particular in Chapter 6 dealing with repairable systems. Equations (2.8) and (2.9) can also be used for repairable items. In fact, assuming that at failure the item is replaced by a statistically equivalent one, a new, independent, failure-free time $\tau$ with the same distribution function as the former one is started after repair (replacement), yielding the same expected value. However, for these cases the variable $x$ starting at $x = 0$ after each repair has to be used instead of $t$ (as for interarrival times). With this, $MTTF_{Si}$ (Eq. (2.9)) can be used for the mean time to failure of a given system, independently of whether it is repairable or not. The only assumption is that the system is as-good-as-new after repair, with respect to the state $i$ considered (Table 6.2). At system level, this occurs only if all nonrepaired (nonrenewed) elements in the system have constant failure rates. If the failure rate of one nonrenewed element is not constant, difficulties can arise. This holds even if the assumption of an as-bad-as-old situation (pp. 405 & 497) applies. In some applications, it can happen that elements of a population of similar items exhibit different failure rates. Considering as an example the case of components
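delivered from two manufacturers with proportions $p$ and $(1 - p)$ and failure rates $\lambda_1$ and $\lambda_2$, the reliability function of an arbitrarily selected component is (Eq. (A6.34))
$$R(t) = p\, e^{-\lambda_1 t} + (1 - p)\, e^{-\lambda_2 t}.$$
According to Eq. (2.10), it follows for the failure rate that
$$\lambda(t) = \frac{p\, \lambda_1 e^{-\lambda_1 t} + (1 - p)\, \lambda_2 e^{-\lambda_2 t}}{p\, e^{-\lambda_1 t} + (1 - p)\, e^{-\lambda_2 t}}. \tag{2.15}$$
From Eq. (2.15) one recognizes that the failure rate of mixture distributions is time dependent and decreases monotonically from the (weighted) average of the failure rates at $t = 0$ to the minimum of the failure rates as $t \to \infty$.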
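The behavior of the mixture failure rate can be made visible numerically. The sketch below evaluates the failure rate $-R'(t)/R(t)$ of a two-component exponential mixture with weights $p$ and $1 - p$; the proportion and the two failure rates are invented for illustration.

```python
import math


def mixture_reliability(t: float, p: float, lam1: float, lam2: float) -> float:
    """Reliability of a component drawn from a two-supplier mixture."""
    return p * math.exp(-lam1 * t) + (1.0 - p) * math.exp(-lam2 * t)


def mixture_failure_rate(t: float, p: float, lam1: float, lam2: float) -> float:
    """Failure rate -R'(t)/R(t) of the mixture."""
    w1 = p * math.exp(-lam1 * t)
    w2 = (1.0 - p) * math.exp(-lam2 * t)
    return (lam1 * w1 + lam2 * w2) / (w1 + w2)


# Hypothetical lot: half the parts at 1e-6 per hour, half at 1e-5 per hour.
p, lam1, lam2 = 0.5, 1e-6, 1e-5
times = (0.0, 1e5, 1e6, 5e6)
rates = [mixture_failure_rate(t, p, lam1, lam2) for t in times]
for t, rate in zip(times, rates):
    print(f"t = {t:9.0f} h   lambda(t) = {rate:.3e} per h")
```

The decrease reflects the fact that the weaker subpopulation fails first, so the survivors are increasingly drawn from the better one.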
2.2.6 Reliability of Series - Parallel Structures
For nonrepairable items (up to item failure), reliability calculation at equipment and system level can often be performed using the basic models given in Table 2.1. The one-item structure has been introduced in Section 2.2.5. Series, parallel, and series - parallel structures are considered in this section. The last three models of Table 2.1 are investigated in Section 2.3. To unify the notation, system will be used for item and it is assumed that at $t = 0$ the system is new (i.e. $R_{S0}(t)$ is given).
2.2.6.1 Systems without Redundancy
From a reliability point of view, a system has no redundancy (series system) if all elements must work in order to fulfill the required function. The reliability block diagram consists in this case of the series connection of all elements ($E_1$ to $E_n$) of the system. For calculation purposes it is often assumed that each element operates and fails independently from every other element (independent elements as defined on p. 52). For series systems, this assumption need not (in general) be verified, because the first failure is a system failure for reliability purposes. Let $e_i$ be the event {element $E_i$ works without failure in the interval $(0, t]$}. The probability of this event is the reliability function $R_i(t)$ of the element $E_i$, i.e.
$$\Pr\{e_i\} = R_i(t). \tag{2.16}$$
The system does not fail in the interval $(0, t]$ if and only if all elements $E_1, \ldots, E_n$ do not fail in that interval, thus
$$R_{S0}(t) = \Pr\{e_1 \cap \ldots \cap e_n\}.$$
Here and in the following, $S$ stands for system and 0 specifies that the system is new at $t = 0$. Due to the assumed independence among the elements $E_1, \ldots, E_n$ and thus among the events $e_1, \ldots, e_n$, it follows (Eq. (A6.9)) that for the reliability function $R_{S0}(t)$
$$R_{S0}(t) = \prod_{i=1}^{n} R_i(t).$$
The failure rate of the system can be calculated from Eq. (2.10)
$$\lambda_{S0}(t) = \sum_{i=1}^{n} \lambda_i(t). \tag{2.18}$$
Equation (2.18) leads to the following important conclusion: The failure rate of a series system (system without redundancy) that consists of independent elements (p. 52) is equal to the sum of the failure rates of its elements.
The system's mean time to failure follows from Eq. (2.9). The special case in which all elements have a constant failure rate $\lambda_i(t) = \lambda_i$ leads to
$$R_{S0}(t) = e^{-\lambda_{S0} t}, \qquad \lambda_{S0} = \sum_{i=1}^{n} \lambda_i, \qquad MTTF_{S0} = \frac{1}{\lambda_{S0}}. \tag{2.19}$$
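For a series system with independent elements and constant failure rates, the calculation reduces to a sum and a product. The following sketch multiplies element reliabilities and compares the result with the exponential of the summed failure rates; the three failure rates and the mission time are invented for illustration.

```python
import math


def series_reliability(element_reliabilities) -> float:
    """System reliability as product of independent element reliabilities."""
    return math.prod(element_reliabilities)


def series_failure_rate(failure_rates) -> float:
    """System failure rate as sum of the element failure rates."""
    return sum(failure_rates)


def series_mttf(failure_rates) -> float:
    """Mean time to failure for constant element failure rates."""
    return 1.0 / series_failure_rate(failure_rates)


# Hypothetical board: three elements with 100, 50, and 850 FIT.
rates = [100e-9, 50e-9, 850e-9]
t = 10_000.0  # mission time in hours
r_elements = [math.exp(-lam * t) for lam in rates]
r_system = series_reliability(r_elements)
print(r_system, series_mttf(rates))
```

Because the rates simply add, the element with the largest failure rate dominates the result, which is where design effort pays off first.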
2.2.6.2 Concept of Redundancy
High reliability, availability, and/or safety at equipment or system level can often only be reached with the help of redundancy. Redundancy is the existence of more than one means (in an item) for performing the required function. Redundancy does not just imply a duplication of hardware, since it can be implemented at the software level or as a time redundancy. However, to avoid common mode and single-point failures, redundant elements should be realized (designed and manufactured) independently from each other. Irrespective of the failure mode (e.g. shorts or opens), redundancy still appears in parallel on the reliability block diagram, not necessarily in the hardware (see e.g. Example 2.6). In setting up the reliability block diagram, particular attention must be paid to the series element to a redundancy. An FMEA is generally mandatory for such a decision. Should the redundant elements fulfill only a part of the required function, a pseudo redundancy exists. From the operating point of view, one distinguishes between active, warm, and standby redundancy:
1. Active Redundancy (parallel, hot): Redundant elements are subjected from the beginning to the same load as operating elements; load sharing is possible, but is not considered in the case of independent elements (see Section 2.2.6.3).
2. Warm Redundancy (lightly loaded): Redundant elements are subjected to a lower load until one of the operating elements fails; load sharing is present; however, the failure rate in the reserve state is lower than in the operating state.
3. Standby Redundancy (cold, unloaded): Redundant elements are subjected to no load until one of the operating elements fails; no load sharing is possible, and the failure rate in the reserve state is assumed to be zero ($\lambda = 0$).
Important redundant structures with independent elements in active redundancy are considered in Sections 2.2.6.3 to 2.3.4. Warm and standby redundancies are investigated in Section 2.3.5 and Chapter 6 (repair rate $\mu = 0$).
2.2.6.3 Parallel Models
A parallel model consists of $n$ (often statistically identical) elements in active redundancy, of which $k$ are necessary to perform the required function and the remaining $n - k$ are in reserve. Such a structure is designated as a k-out-of-n redundancy (also known as k-out-of-n: G). Investigations assume in general no load sharing, i.e. independent elements (see Sections 2.3.5 & 6.5 for load sharing). Let us consider at first the case of an active 1-out-of-2 redundancy as given in Table 2.1 (3rd row). The required function is fulfilled if at least one of the elements $E_1$ or $E_2$ works without failure in the interval $(0, t]$. With the same notation as for Eq. (2.16) it follows that
$$R_{S0}(t) = \Pr\{e_1 \cup e_2\}. \tag{2.20}$$
Assuming that elements $E_1$ and $E_2$ are independent (p. 52), Eq. (2.20) yields for the reliability function $R_{S0}(t)$ (Eqs. (A6.13), (A6.8), (2.16))
$$R_{S0}(t) = \Pr\{e_1\} + \Pr\{e_2\} - \Pr\{e_1 \cap e_2\} = R_1(t) + R_2(t) - R_1(t) R_2(t). \tag{2.21}$$
The mean time to failure $MTTF_{S0}$ can be calculated from Eq. (2.9). The special case of two identical elements with constant failure rate ($R_1(t) = R_2(t) = e^{-\lambda t}$) leads to
$$R_{S0}(t) = 2 e^{-\lambda t} - e^{-2\lambda t}, \qquad \lambda_{S0}(t) = 2\lambda\, \frac{1 - e^{-\lambda t}}{2 - e^{-\lambda t}}, \qquad MTTF_{S0} = \frac{2}{\lambda} - \frac{1}{2\lambda} = \frac{3}{2\lambda}. \tag{2.22}$$
Equation (2.22) shows that in the presence of redundancy, the failure rate $\lambda_{S0}(t)$ at system level is a function of time, even if the elements' failure rates are time independent. However, the stochastic behavior of the system is still described by a Markov process (see e.g. Section 2.3.5). This time dependence becomes negligible in the case of repairable systems with constant failure and repair rates (Eq. (6.93)).
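The time dependence of the system failure rate in a 1-out-of-2 active redundancy can be checked numerically. The sketch below uses $R_{S0}(t) = 2e^{-\lambda t} - e^{-2\lambda t}$ for two identical, independent elements, derives the failure rate as $-R'/R$, and cross-checks the mean time to failure $3/(2\lambda)$ by trapezoidal integration of the reliability function; the value of $\lambda$, the step count, and the integration horizon are arbitrary choices for illustration.

```python
import math


def reliability_1oo2(lam: float, t: float) -> float:
    """Reliability of two identical, independent elements in active redundancy."""
    return 2.0 * math.exp(-lam * t) - math.exp(-2.0 * lam * t)


def failure_rate_1oo2(lam: float, t: float) -> float:
    """System failure rate -R'(t)/R(t); starts at 0 and tends to lam."""
    e = math.exp(-lam * t)
    return 2.0 * lam * (1.0 - e) / (2.0 - e)


def mttf_1oo2(lam: float) -> float:
    """Closed-form mean time to failure, 3 / (2 lam)."""
    return 1.5 / lam


def numeric_mttf(lam: float, steps: int = 40_000, horizon: float = 40.0) -> float:
    """Trapezoidal integral of R(t) over [0, horizon / lam] as a cross-check."""
    h = horizon / lam / steps
    total = 0.5 * (reliability_1oo2(lam, 0.0) + reliability_1oo2(lam, steps * h))
    total += sum(reliability_1oo2(lam, i * h) for i in range(1, steps))
    return total * h


lam = 1e-6  # per hour, illustrative
for t in (0.0, 1e5, 1e6, 1e7):
    print(f"t = {t:9.0f} h   lambda_S(t) / lambda = {failure_rate_1oo2(lam, t) / lam:.4f}")
```

The ratio printed in the loop starts at zero and approaches one, which is why the gain from redundancy is concentrated in short missions.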
Generalization to an active k-out-of-n redundancy with identical and independent elements ($R_1(t) = \ldots = R_n(t) = R(t)$) follows from the binomial distribution (Eq. (A6.120)) by setting $p = R(t)$
$$R_{S0}(t) = \sum_{i=k}^{n} \binom{n}{i} R^i(t)\, (1 - R(t))^{n-i}. \tag{2.23}$$
$R_{S0}(t)$ can be interpreted as the probability of observing at least $k$ successes in $n$ Bernoulli trials with $p = R(t)$. The mean time to failure $MTTF_{S0}$ can be calculated from Eq. (2.9). For $k = 1$ and $R(t) = e^{-\lambda t}$ it follows that
$$R_{S0}(t) = 1 - (1 - e^{-\lambda t})^n \qquad \text{and} \qquad MTTF_{S0} = \frac{1}{\lambda}\left(1 + \frac{1}{2} + \ldots + \frac{1}{n}\right). \tag{2.24}$$
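The k-out-of-n reliability is a binomial tail sum and is easy to evaluate directly. The sketch below computes the probability of at least $k$ working elements out of $n$ identical, independent ones, and the mean time to failure for constant failure rate $\lambda$. For general $k$ it uses the standard result $MTTF_{S0} = (1/\lambda) \sum_{i=k}^{n} 1/i$, which reduces to Eq. (2.24) for $k = 1$; this generalization is added here as an assumption and is not stated in the text.

```python
import math


def k_out_of_n_reliability(k: int, n: int, r: float) -> float:
    """Probability that at least k of n identical, independent elements work."""
    return sum(math.comb(n, i) * r**i * (1.0 - r)**(n - i) for i in range(k, n + 1))


def k_out_of_n_mttf(k: int, n: int, lam: float) -> float:
    """Mean time to failure for constant failure rate lam (active redundancy)."""
    return sum(1.0 / i for i in range(k, n + 1)) / lam


r = 0.9
for k, n in ((1, 1), (1, 2), (2, 3), (1, 3)):
    print(f"{k}-out-of-{n}: R = {k_out_of_n_reliability(k, n, r):.4f}, "
          f"MTTF * lambda = {k_out_of_n_mttf(k, n, 1.0):.4f}")
```

Note that a 2-out-of-3 redundancy has a lower mean time to failure than a single element, although its reliability for short missions is higher.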
The improvement in $MTTF_{S0}$ shown by Eq. (2.24) becomes much greater when repair without interruption of operation at system level is allowed (factor $\mu/2\lambda$ instead of $3/2$ for an active 1-out-of-2 redundancy, where $\mu = 1/MTTR$ is the constant repair rate; Tables 6.6 and 6.8). However, as shown in Fig. 2.7, the increase of the reliability function $R_{S0}(t)$ caused by redundancy is very important for short missions ($t \ll 1/\lambda$), even in the nonrepairable case. Other comparisons between series - parallel structures are given in Figs. 2.8 and 2.9 (Figs. 6.17 and 6.18 for the repairable case).
Figure 2.7 Reliability function for the one-item structure (as reference) and for some active redundancies (nonrepairable up to system failure, constant failure rates, identical and independent elements, no load sharing; see Section 2.3.5 for load sharing)
In addition to the k-out-of-n redundancy described by Eq. (2.23), attention has been paid in the literature to cases in which the fulfillment of the required function asks that not more than $n - k$ consecutive elements fail. Such a structure, known as consecutive k-out-of-n system, is theoretically more reliable than the corresponding k-out-of-n redundancy. For a consecutive k-out-of-n system with $n$ identical and independent elements in active redundancy (each with reliability $R$) it holds that [2.40]
$$R_{S0} = \Pr\{\text{no block with more than } n - k \text{ consecutive failed elements}\} = \sum_{i=0}^{n} g(n, i)\, R^{n-i} (1 - R)^i, \tag{2.25}$$
with $g(n, i) = \binom{n}{i}$ for $i \le n - k$, $g(a, a) = 0$ for $a \ge n - k + 1$, and $g(a, b) = g(a-1, b) + g(a-2, b-1) + \ldots + g(a-n+k-1, b-n+k)$ otherwise. $n = 5$ and $k = 3$ yields
$$R_S = R^5 + 5R^4(1-R) + 10R^3(1-R)^2$$
from Eq. (2.23) and
$$R_S = R^5 + 5R^4(1-R) + 10R^3(1-R)^2 + 7R^2(1-R)^3 + R(1-R)^4$$
from Eq. (2.25), with $R_S = R_{S0}(t)$, $R = R(t)$, $R(0) = 1$. Examples for consecutive k-out-of-n systems are conveying systems and relay stations. However, for these kinds of application it is important to verify that all elements are independent (with respect to external influences, load sharing, etc.).
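The definition of the consecutive k-out-of-n system (no block with more than $n - k$ consecutive failed elements) can be checked by brute force for small $n$. The sketch below enumerates all $2^n$ element states instead of using a recursion; this enumeration is a substitute technique chosen for transparency, not the method of [2.40], and the function names are hypothetical.

```python
from itertools import product


def longest_failed_run(states) -> int:
    """Length of the longest block of consecutive failed elements (False = failed)."""
    longest = current = 0
    for works in states:
        current = 0 if works else current + 1
        longest = max(longest, current)
    return longest


def consecutive_k_out_of_n_reliability(k: int, n: int, r: float) -> float:
    """Reliability by enumeration of all 2**n states (practical for small n only)."""
    total = 0.0
    for states in product((True, False), repeat=n):
        if longest_failed_run(states) <= n - k:
            working = sum(states)
            total += r**working * (1.0 - r)**(n - working)
    return total


r = 0.9
q = 1.0 - r
plain = r**5 + 5 * r**4 * q + 10 * r**3 * q**2          # 3-out-of-5 redundancy
consecutive = consecutive_k_out_of_n_reliability(3, 5, r)
print(plain, consecutive)
```

For $n = 5$ and $k = 3$ the enumeration reproduces the extra terms $7R^2(1-R)^3 + R(1-R)^4$, i.e. the states with three or four failures that contain no block of three consecutive failures.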
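2.2.6.4 Series - Parallel Structures
Series - parallel structures can be investigated through successive use of the results for series and parallel models. This holds in particular for nonrepairable systems with active redundancy and independent elements (p. 52). To demonstrate the procedure, let us consider the 5th row in Table 2.1:
1st step: The series elements $E_1$ - $E_3$ are replaced by $E_8$, $E_4$ & $E_5$ by $E_9$, and $E_6$ & $E_7$ by $E_{10}$, yielding $R_8(t) = R_1(t) R_2(t) R_3(t)$, $R_9(t) = R_4(t) R_5(t)$, and $R_{10}(t) = R_6(t) R_7(t)$.
2nd step: The 1-out-of-2 redundancy $E_8$ and $E_9$ is replaced by $E_{11}$, giving
$$R_{11}(t) = R_8(t) + R_9(t) - R_8(t) R_9(t).$$
3rd step: From steps 1 and 2, the reliability function of the system follows as (with $R_S = R_{S0}(t)$, $R_i = R_i(t)$, $R_i(0) = 1$, $i = 1, \ldots, 7$)
$$R_S = R_{11} R_{10} = (R_1 R_2 R_3 + R_4 R_5 - R_1 R_2 R_3 R_4 R_5)\, R_6 R_7.$$
The mean time to failure can be calculated from Eq. (2.9). Should all elements have a constant failure rate ($\lambda_1$ to $\lambda_7$), then
$$R_{S0}(t) = e^{-(\lambda_1 + \lambda_2 + \lambda_3 + \lambda_6 + \lambda_7) t} + e^{-(\lambda_4 + \lambda_5 + \lambda_6 + \lambda_7) t} - e^{-(\lambda_1 + \ldots + \lambda_7) t}$$
and
$$MTTF_{S0} = \frac{1}{\lambda_1 + \lambda_2 + \lambda_3 + \lambda_6 + \lambda_7} + \frac{1}{\lambda_4 + \lambda_5 + \lambda_6 + \lambda_7} - \frac{1}{\lambda_1 + \ldots + \lambda_7}. \tag{2.27}$$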
Under the assumptions of active redundancy, nonrepairable (up to system failure), independent elements (p. 52), and constant failure rates, the reliability function $R_{S0}(t)$ of a system with series-parallel structure is given by a sum of exponential functions. The mean time to failure $\mathrm{MTTF}_{S0}$ then follows directly from the exponent terms of $R_{S0}(t)$, see Eq. (2.27) for an example.

The use of redundancy implies the introduction of a series element in the reliability block diagram which takes into account the parts which are common to the redundant elements, creates the redundancy (Example 2.5), or assumes a control and/or switching function. For a design engineer it is important to evaluate the influence of the series element in a redundant structure. Figures 2.8 and 2.9 allow such an evaluation to be made for the case in which constant failure rates, independent elements, and active redundancy can be assumed. In Fig. 2.8, a one-item structure (element $E_1$ with failure rate $\lambda_1$) is compared with a 1-out-of-2 redundancy with a series element (element $E_2$ with failure rate $\lambda_2$). In Fig. 2.9, the 1-out-of-2 redundancy with a series element $E_2$ is compared with the structure which would be obtained if a 1-out-of-2 redundancy for element $E_2$ with a series element $E_3$ became necessary. Obviously $\lambda_3 < \lambda_2 < \lambda_1$ (the limiting cases $\lambda_1 = \lambda_2$ for Fig. 2.8 and $\lambda_1 = \lambda_2 = \lambda_3$ for Fig. 2.9 have an indicative purpose only). The three cases are labeled a), b), and c). The upper parts of Figs. 2.8 and 2.9 depict the reliability functions and the lower parts the ratios $\mathrm{MTTF}_{S0b} / \mathrm{MTTF}_{S0a}$ and $\mathrm{MTTF}_{S0c} / \mathrm{MTTF}_{S0b}$, respectively. The comparison between case a) of Fig. 2.8 and case c) of Fig. 2.9, given as $\mathrm{MTTF}_{S0c} / \mathrm{MTTF}_{S0a}$ on Fig. 2.8, shows a much lower dependency on $\lambda_2 / \lambda_1$. From Figs. 2.8 and 2.9 the following design guideline can be formulated: The failure rate $\lambda_2$ of the series element in a nonrepairable (up to system failure) 1-out-of-2 active redundancy should not be larger than 10% of the failure rate $\lambda_1$ of the redundant elements; the 10% rule applies also for the case of $\lambda_3$ in Fig. 2.9, i.e.

$\lambda_2 < 0.1\,\lambda_1$ and $\lambda_3 < 0.1\,\lambda_2$.
The investigation of the structures given in Figs. 2.8 and 2.9 for the repairable case (with $\mu = 1/\mathrm{MTTR}$ as constant repair rate) leads in Section 6.6 to more
2.2 Predicted Reliability of Equipment and Systems with Simple Structures
severe conditions ($\lambda_2 < 0.01\,\lambda_1$ in general, and $\lambda_2 < 0.002\,\lambda_1$ for $\mu / \lambda_1 > 500$), see Figs. 6.17 and 6.18.
2.2.6.5 Majority Redundancy
Majority redundancy is a special case of a k-out-of-n redundancy, frequently used in, but not limited to, redundant digital circuits. $2n + 1$ outputs are fed to a voter whose output represents the majority of its $2n + 1$ input signals. The investigation is based on the previously described procedure for series-parallel structures, see for example the case of $n = 1$ (active redundancy 2-out-of-3 in series with the voter $E_\nu$) given in the 6th line of Table 2.1. The majority redundancy realizes in a simple way a fault-tolerant structure without the need for control or switching elements. The required function is performed with no operational interruption up to the time point of the second failure, since the first failure is automatically masked by the majority redundancy. In digital circuits, the voter for a majority redundancy with $n = 1$ consists of three two-input NAND gates and one three-input NAND gate, for a bit by bit solution. An alarm circuit is also simple to realize, and can be implemented with three two-input EXOR gates and one three-input OR gate (Example 2.5). A structure similar to that of the alarm circuit can be used to realize a second alarm circuit giving a pulse at the second failure, thus expanding the 2-out-of-3 active redundancy to a 1-out-of-3 active redundancy (Problem 2.7 in Appendix A11). A majority redundancy can also be realized with software (N-version programming).
Example 2.5
Realize a majority redundancy for $n = 1$, inclusive of voter and alarm signal at the first failure of a redundant element (bit by bit solution with "1" for operating and "0" for failure).
Solution
Using the same notation as for Eq. (2.16), the 2-out-of-3 active redundancy can be implemented by $(e_1 \cap e_2) \cup (e_1 \cap e_3) \cup (e_2 \cap e_3)$. With this, the functional block diagram per bit of the voter for a majority redundancy with $n = 1$ is obtained as the realization of the logic equation related to the above expression. The alarm circuit giving a logic 1 at the occurrence of the first failure is also easy to implement. It is also possible to realize a second alarm circuit to detect the second failure (Problem 2.7 in Appendix A11).
[Figure: functional block diagram per bit, with the inputs (In) feeding the voter and the alarm circuit]
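The gate-level description of the voter and the alarm circuit can be checked with a few lines of Python. The following is only a sketch: the function names `nand`, `voter`, and `alarm` are illustrative choices, and the wiring is one possible realization consistent with the gate counts quoted in Section 2.2.6.5 (three two-input NAND plus one three-input NAND for the voter, three two-input EXOR plus one three-input OR for the alarm), not necessarily the circuit drawn in the book.

```python
from itertools import product

def nand(*bits):
    """NAND gate: output is 0 only if every input is 1."""
    return 0 if all(bits) else 1

def voter(e1, e2, e3):
    """2-out-of-3 majority built from three two-input NANDs and one three-input NAND."""
    return nand(nand(e1, e2), nand(e1, e3), nand(e2, e3))

def alarm(e1, e2, e3):
    """Three two-input EXORs feeding a three-input OR: 1 as soon as the inputs disagree."""
    return (e1 ^ e2) | (e1 ^ e3) | (e2 ^ e3)

# Truth table over all eight input combinations (1 = operating, 0 = failed)
for bits in product((0, 1), repeat=3):
    print(bits, "voter =", voter(*bits), "alarm =", alarm(*bits))
```

Note that this alarm only detects disagreement between the three inputs; it stays at 0 if all three inputs carry the same (possibly wrong) value.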
Figure 2.8 Comparison between the one-item structure and a 1-out-of-2 active redundancy with series element (nonrepairable up to system failure, independent elements, constant failure rates $\lambda_1$ & $\lambda_2$, $\lambda_1$ remains the same in both structures; equations according to Table 2.1; given on the right-hand side is $\mathrm{MTTF}_{S0c} / \mathrm{MTTF}_{S0a}$, with $\mathrm{MTTF}_{S0c}$ from Fig. 2.9; see Fig. 6.17 for the repairable case)
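c) 1-out-of-2 active ($E_{1'} = E_1$), 1-out-of-2 active ($E_{2'} = E_2$)

$R_{S0c}(t) = (2e^{-\lambda_1 t} - e^{-2\lambda_1 t})\,(2e^{-\lambda_2 t} - e^{-2\lambda_2 t})\,e^{-\lambda_3 t}$,

$\mathrm{MTTF}_{S0c} = \dfrac{4}{\lambda_1 + \lambda_2 + \lambda_3} - \dfrac{2}{\lambda_1 + 2\lambda_2 + \lambda_3} - \dfrac{2}{2\lambda_1 + \lambda_2 + \lambda_3} + \dfrac{1}{2\lambda_1 + 2\lambda_2 + \lambda_3}$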
Figure 2.9 Comparison between basic series-parallel structures (nonrepairable up to system failure, active redundancy, independent elements, constant failure rates $\lambda_1$ to $\lambda_3$, $\lambda_1$ and $\lambda_2$ remain the same in both structures; equations according to Table 2.1; see Fig. 6.18 for the repairable case)
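The influence of the series element on the mean time to failure can be explored numerically. The sketch below assumes constant failure rates and independent elements, as in Figs. 2.8 and 2.9. The expression for case c) is the one given with Fig. 2.9; the expression for case b) is obtained here by integrating $(2e^{-\lambda_1 t} - e^{-2\lambda_1 t})e^{-\lambda_2 t}$ over $(0, \infty)$ and is therefore a derived assumption of this example. The function names and the choice $\lambda_3 = 0.1\,\lambda_2$ are illustrative only.

```python
def mttf_a(l1):
    """Case a): single element E1."""
    return 1.0 / l1

def mttf_b(l1, l2):
    """Case b): 1-out-of-2 active redundancy of E1 in series with E2."""
    return 2.0 / (l1 + l2) - 1.0 / (2.0 * l1 + l2)

def mttf_c(l1, l2, l3):
    """Case c): 1-out-of-2 of E1 and 1-out-of-2 of E2, in series with E3."""
    return (4.0 / (l1 + l2 + l3) - 2.0 / (l1 + 2.0 * l2 + l3)
            - 2.0 / (2.0 * l1 + l2 + l3) + 1.0 / (2.0 * l1 + 2.0 * l2 + l3))

l1 = 1.0  # failure rates are expressed relative to lambda_1
for ratio in (0.0, 0.01, 0.1, 0.5, 1.0):
    l2 = ratio * l1
    l3 = 0.1 * l2  # illustrative assumption: series element E3 follows the 10% rule
    print(f"l2/l1 = {ratio:4.2f}   "
          f"MTTF_b/MTTF_a = {mttf_b(l1, l2) / mttf_a(l1):.3f}   "
          f"MTTF_c/MTTF_a = {mttf_c(l1, l2, l3) / mttf_a(l1):.3f}")
```

With an ideal series element ($\lambda_2 = 0$) the ratio in case b) is 1.5; it drops to 2/3 for $\lambda_2 = \lambda_1$, which is why a small $\lambda_2 / \lambda_1$ is required for the redundancy to pay off.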
Example 2.6
Compute the predicted reliability for the following circuit, for which the required function asks that the LED must light when the control voltage $u_1$ is high. The environmental conditions correspond to GB in Table 2.3, with ambient temperature $\theta_A = 50$°C inside the equipment and 30°C at the location of the LED; quality factor $\pi_Q = 1$ as per Table 2.4.

[Figure: circuit diagram with the following component data]
$u_1$: 0.1 V and 4 V
$V_{CC}$: 5 V
LED: 1 V at 20 mA, $I_{max}$ = 100 mA
$R_C$: 150 Ω, 1/2 W, MF
TR1: Si, 0.3 W, 30 V, β > 100, plastic
$R_{B1}$: 10 kΩ, 1/2 W, MF
Solution
The solution is based on the procedure given in Fig. 2.1.
1. The required function can be fulfilled since the transistor works as an electronic switch with $I_C = 20$ mA and $I_B = 0.33$ mA in the on state (saturated) and the off state is assured by $u_1 = 0.1$ V.
2. Since all elements are involved in the required function, the reliability block diagram consists of the series connection of the five items $E_1$ to $E_5$, where $E_5$ represents the printed circuit with soldering joints.
3. The stress factor of each element can be easily determined from the circuit and the given rated values. A stress factor of 0.1 is assumed for all elements when the transistor is off. When the transistor is on, the stress factor is 0.2 for the diode and about 0.1 for all other elements. The ambient temperature is 30°C for the LED and 50°C for the remaining elements.
4. The failure rates of the individual elements are determined (approximately) with data from Section 2.2.4 (Example 2.4, Figs. 2.4-2.6, Tables 2.3 and 2.4 with $\pi_E = \pi_Q = 1$). Thus,
LED: $\lambda_1 = 1.3 \cdot 10^{-9}\,\mathrm{h}^{-1}$
Transistor: $\lambda_4 = 3 \cdot 10^{-9}\,\mathrm{h}^{-1}$
Resistor: $\lambda_2 = \lambda_3 = 0.3 \cdot 10^{-9}\,\mathrm{h}^{-1}$,
when the transistor is on. For the printed circuit board and soldering joints, $\lambda_5 = 2 \cdot 10^{-9}\,\mathrm{h}^{-1}$ is assumed. The above values for $\lambda$ remain practically unchanged when the transistor is off due to the low stress factors (the stress factor in the off state was set at 0.1).
5. Based on the results of step 4, the reliability function of each element can be determined as $R_i(t) = e^{-\lambda_i t}$.
6. The reliability function $R_S(t)$ for the whole circuit can now be calculated. Equation (2.19) yields $R_S(t) = e^{-6.9 \cdot 10^{-9}\,t}$ ($t$ in hours), where $6.9 \cdot 10^{-9}\,\mathrm{h}^{-1}$ is the sum of the five element failure rates of step 4. For 10 years of continuous operation, for example, the predicted reliability of the circuit is > 0.999.
7. Supplementary result: To discuss this example further, let us assume that the failure rate of the transistor is too high (e.g. for safety reasons) and that no transistor of better quality can be obtained. Redundancy should be implemented for this element. Assuming as failure modes short between emitter and collector for transistors and open for resistors, the resulting circuit and the corresponding reliability block diagram are
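[Figure: circuit with the redundant transistors TR1 and TR2 and the base resistors $R_{B1}$ and $R_{B2}$, together with the corresponding reliability block diagram ($E_1$ to $E_5$ as above, $E_6 = R_{B2}$, $E_7$ = TR2)]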
Due to the very small stress factor, calculation of the individual element failure rates yields the same values as without redundancy. Thus, for the reliability function of the circuit one obtains (assuming independent elements)
from which it follows that
Circuit reliability is then practically no longer influenced by the transistor. This agrees with the discussion made with Fig. 2.7 for $\lambda t \ll 1$. If the failure mode of the transistors were an open between collector and emitter, both elements $E_4$ and $E_7$ would appear in series in the reliability block diagram; redundancy would be a disadvantage in this case. The intention to put $R_{B1}$ and $R_{B2}$ in parallel (redundancy) or to use just one base resistor is wrong; the functionality of the circuit would be compromised because of the saturation voltage of TR2.
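The arithmetic of Example 2.6 can be reproduced with a short script. This is a sketch under stated assumptions: the failure rates are those quoted in step 4 of the example, one year is taken as 8760 h (an assumption of this sketch), and the dictionary keys are illustrative labels. The last lines look only at the transistor pair, assuming that the failure mode is a short so that the two transistors form a 1-out-of-2 active redundancy in the reliability block diagram.

```python
import math

# Failure rates in 1/h, as quoted in step 4 of Example 2.6
failure_rates = {
    "E1 LED": 1.3e-9,
    "E2 resistor": 0.3e-9,
    "E3 resistor": 0.3e-9,
    "E4 transistor": 3e-9,
    "E5 printed circuit and solder joints": 2e-9,
}

# Series structure: the failure rates add up
lam_series = sum(failure_rates.values())

t = 10 * 8760.0  # ten years of continuous operation, assuming 8760 h per year
r_series = math.exp(-lam_series * t)
print(f"lambda_S = {lam_series:.2e} 1/h, R_S(10 years) = {r_series:.6f}")

# Transistor alone versus a 1-out-of-2 active redundancy of two such transistors
r_tr = math.exp(-failure_rates["E4 transistor"] * t)
r_tr_pair = 1.0 - (1.0 - r_tr) ** 2
print(f"single transistor: {r_tr:.8f}, redundant pair: {r_tr_pair:.10f}")
```

The unreliability of the redundant pair is the square of that of a single transistor, which is why the transistor practically disappears from the circuit reliability for $\lambda t \ll 1$.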
2.2.7 Part Count Method
In an early development phase, for logistic purposes, or in some particular applications, a rough estimate of the predicted reliability can be required. For such an analysis, it is generally assumed that the system under consideration is without redundancy (series structure as in Section 2.2.6.1) and the calculation of the failure rate at component level is made either using field data or by considering technology, environmental, and quality factors only. This procedure is known as the part count method and differs fundamentally from the part stress method introduced in Section 2.2.4. The advantage of a part count prediction is its great simplicity, but its usefulness is often limited to specific applications.
2.3 Reliability of Systems with Complex Structure
Complex structures arise in many applications, e.g. in power, telecommunications, defense, and aerospace systems. In the context of this book, a structure is complex when the reliability block diagram either cannot be reduced to a series-parallel structure with independent elements or does not exist. For instance, a reliability block diagram does not exist if more than two states (good/failed) or more than one failure mode (e.g. short or open) must be considered for an element. Moreover, the reduction of a reliability block diagram to a series-parallel structure with independent elements is in general not possible with distributed structures or when elements appear in the diagram more than once (cases 7, 8, 9 in Table 2.1). The term independent elements refers to independence up to the system failure, in particular without load sharing between redundant elements (load sharing is considered in Section 2.3.5 and Chapter 6). For comparative investigations in Chapter 6, the term totally independent elements will be used to indicate, for repairable systems, independence with respect to operation and repair (each element in the reliability block diagram operates and fails independently from every other element and has its own repair crew).

Analysis of complex structures can become difficult and time-consuming. However, methods are well developed, should the reliability block diagram exist and the system satisfy the following requirements:
1. Only active (parallel) redundancy is considered.
2. Elements can appear more than once in the reliability block diagram, but different elements are independent (totally independent for Eq. (2.48)).
3. On/off operations are either 100% reliable, or their effect has been considered in the reliability block diagram according to the above restrictions.
Under these assumptions, analysis can be performed using Boolean models. However, for practical applications, simple heuristically oriented methods apply well. Heuristic methods are given in Sections 2.3.1-2.3.3, Boolean models in Section 2.3.4. Section 2.3.5 deals then with warm redundancy, allowing for load sharing. Section 2.3.6 considers elements with two failure modes. Stress/strength analysis is discussed in Section 2.5. Further aspects, as well as situations in which the reliability block diagram does not exist, are considered in Section 6.8 (see also Section 6.9 for an introduction to Petri nets, dynamic FT, and computer-aided analysis). As in the previous sections, reliability figures have the indices S0, where S stands for system and 0 specifies system new at $t = 0$.
2.3.1 Key Item Method
The key item method is based on the theorem of total probability (Eq. (A6.17)). The event {the item operates failure free in the interval $(0, t]$}, or in a short form {system up in $(0, t]$}, can be split into the following two complementary events

{Element $E_i$ up in $(0, t]$ ∩ system up in $(0, t]$}

and

{Element $E_i$ fails in $(0, t]$ ∩ system up in $(0, t]$}.

From this it follows that, for the reliability function $R_{S0}(t)$,

$R_{S0}(t) = R_i(t)\,\Pr\{\text{system up in } (0, t] \mid E_i \text{ up in } (0, t]\} + (1 - R_i(t))\,\Pr\{\text{system up in } (0, t] \mid E_i \text{ failed in } (0, t]\}$, (2.29)

where $R_i(t) = \Pr\{E_i \text{ up in } (0, t]\}$ as in Eq. (2.16). The element $E_i$ must be chosen in such a way that a series-parallel structure is obtained for the reliability block diagrams conditioned by the events {$E_i$ up in $(0, t]$} and {$E_i$ failed in $(0, t]$}. Successive application of Eq. (2.29) is also possible (Examples 2.9 and 2.14). Sections 2.3.1.1 and 2.3.1.2 present two typical situations.
2.3.1.1 Bridge Structure
The reliability block diagram of a bridge structure with a bi-directional connection is shown in Fig. 2.10 (case 7 in Table 2.1). Element $E_5$ can work with respect to the required function in both directions, from $E_1$ via $E_5$ to $E_4$ and from $E_2$ via $E_5$ to $E_3$. It is therefore in a key position (key element). This property is used to calculate the reliability function by means of Eq. (2.29) with $E_i = E_5$. For the conditional probabilities in Eq. (2.29), the corresponding reliability block diagrams are
[Figure: the two conditional reliability block diagrams, labeled "$E_5$ did not fail in $(0, t]$" and "$E_5$ failed in $(0, t]$"]
From Eq. (2.29), it follows that (with $R_S = R_{S0}(t)$, $R_i = R_i(t)$, $R_i(0) = 1$, $i = 1, ..., 5$)

$R_S = R_5\,(R_1 + R_2 - R_1R_2)(R_3 + R_4 - R_3R_4) + (1 - R_5)(R_1R_3 + R_2R_4 - R_1R_2R_3R_4)$. (2.30)
Figure 2.10 Reliability block diagram of a bridge circuit with a bi-directional connection on $E_5$
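Equation (2.30) can be cross-checked against a brute-force enumeration of all $2^5$ element states. The sketch below assumes independent elements and uses the four paths of the bridge of Fig. 2.10, {1,3}, {2,4}, {1,5,4}, and {2,5,3}; the function names and the numerical values of the $R_i$ are arbitrary illustrations.

```python
from itertools import product

def bridge_key_item(r1, r2, r3, r4, r5):
    """Eq. (2.30): total probability with E5 as key element."""
    e5_up = (r1 + r2 - r1 * r2) * (r3 + r4 - r3 * r4)
    e5_failed = r1 * r3 + r2 * r4 - r1 * r2 * r3 * r4
    return r5 * e5_up + (1.0 - r5) * e5_failed

# Paths through the bridge (element numbers as in Fig. 2.10)
PATHS = [{1, 3}, {2, 4}, {1, 5, 4}, {2, 5, 3}]

def bridge_enumerated(r):
    """Sum the probabilities of all element states in which at least one path is up."""
    total = 0.0
    for state in product((0, 1), repeat=5):
        up = {i + 1 for i, s in enumerate(state) if s}
        if any(path <= up for path in PATHS):
            prob = 1.0
            for i, s in enumerate(state):
                prob *= r[i] if s else 1.0 - r[i]
            total += prob
    return total

r = [0.9, 0.8, 0.85, 0.95, 0.7]  # arbitrary values of R_1 .. R_5 at some time t
print(bridge_key_item(*r), bridge_enumerated(r))
```

Both functions should return the same value up to rounding, since the key item method is only a structured way of grouping the same 32 states.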
The same considerations apply to the bridge structure with a directed connection (case 8 in Table 2.1). Here, $E_i$ must be different from $E_5$. Choosing $E_i = E_4$ yields

The same result is obtained by choosing e.g. $E_i = E_1$.
Example 2.7 shows a further application of the key item method.
Example 2.7
Give the reliability of the item according to case a) below. How much would the reliability be improved if the structure were modified according to case b)? (Assumptions: nonrepairable up to system failure, active redundancy, independent elements, $R_{E_1}(t) = R_{E_{1'}}(t) = R_{E_{1''}}(t) = R_1(t)$ and $R_{E_2}(t) = R_{E_{2'}}(t) = R_2(t)$.)

[Figure: reliability block diagrams for case a) and case b)]

Solution
Element $E_{1'}$ is in a key position in case a). Thus, similarly to Eq. (2.30), one obtains

$R_a = R_1\,(2R_2 - R_2^2) + (1 - R_1)(2R_1R_2 - R_1^2R_2^2)$,

with $R_a = R_{S0a}(t)$, $R_i = R_i(t)$, $R_i(0) = 1$, $i = 1, 2$. Case b) represents a series connection of a 1-out-of-3 redundancy with a 1-out-of-2 redundancy. From Sections 2.2.6.3 and 2.2.6.4 it follows that

$R_b = R_1R_2\,(3 - 3R_1 + R_1^2)(2 - R_2)$,

with $R_b = R_{S0b}(t)$, $R_i = R_i(t)$, $R_i(0) = 1$, $i = 1, 2$. From this,

$R_b - R_a = 2R_1R_2\,(1 - R_2)(1 - R_1)^2$. (2.32)
The difference $R_b - R_a$ reaches as maximum the value 2/27 for $R_1 = 1/3$ and $R_2 = 1/2$, i.e. $R_b = 57/108$ and $R_a = 49/108$ ($R_b - R_a = 0$ for $R_1 = 0$, $R_1 = 1$, $R_2 = 0$, $R_2 = 1$); the advantage of case b) is small, as far as reliability is concerned.
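The numbers quoted at the end of Example 2.7 can be verified with exact rational arithmetic. The sketch below simply evaluates the expressions for $R_a$, $R_b$, and Eq. (2.32); the function names are illustrative, and the grid search is only a coarse plausibility check of the maximum, not a proof.

```python
from fractions import Fraction

def r_a(r1, r2):
    """Case a), key item method."""
    return r1 * (2 * r2 - r2**2) + (1 - r1) * (2 * r1 * r2 - r1**2 * r2**2)

def r_b(r1, r2):
    """Case b), 1-out-of-3 in series with 1-out-of-2."""
    return r1 * r2 * (3 - 3 * r1 + r1**2) * (2 - r2)

def gain(r1, r2):
    """Eq. (2.32): R_b - R_a."""
    return 2 * r1 * r2 * (1 - r2) * (1 - r1)**2

r1, r2 = Fraction(1, 3), Fraction(1, 2)
print(r_a(r1, r2), r_b(r1, r2), r_b(r1, r2) - r_a(r1, r2))

# Coarse grid search for the largest gain on [0, 1] x [0, 1]
grid = [Fraction(i, 30) for i in range(31)]
best = max((gain(x, y), x, y) for x in grid for y in grid)
print(best)
```

Fractions are printed in lowest terms, so 57/108 appears as 19/36.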
2.3.1.2 Reliability Block Diagram in Which at Least One Element Appears More than Once
In practice, situations often occur in which an element appears more than once in the reliability block diagram, although, physically, there is only one such element in the system considered. These situations can be investigated with the key item method introduced in Section 2.3.1.1, see Examples 2.8, 2.9, and 2.14.
Example 2.8 Give the reliability for the equipment introduced in Example 2.2.
Solution
In the reliability block diagram of Example 2.2, element $E_2$ is in a key position. Similarly to Eq. (2.30) it follows that

$R_S = R_2R_1\,(R_4 + R_5 - R_4R_5) + (1 - R_2)\,R_1R_3R_5$, (2.33)

with $R_S = R_{S0}(t)$ and $R_i = R_i(t)$, $R_i(0) = 1$, $i = 1, ..., 5$.
Example 2.9
Give the reliability for the redundant circuit of Example 2.3.
Solution
In the reliability block diagram of Example 2.3, U1 and U2 are in a key position. Using the method introduced in Section 2.3.1 successively on U1 and U2, i.e. on $E_5$ and $E_6$, yields

With $R_1 = R_2 = R_3 = R_4 = R_D$, $R_5 = R_6 = R_U$, $R_7 = R_8 = R_I$, $R_9 = R_{II}$, it follows that

$R_S = R_U R_{II}\,[\,R_U\,(2R_DR_I - R_D^2R_I^2)(2R_D - R_D^2) + 2\,(1 - R_U)\,R_D^2R_I\,]$, (2.34)

with $R_S = R_{S0}(t)$, $R_U = R_U(t)$, $R_D = R_D(t)$, $R_I = R_I(t)$, $R_{II} = R_{II}(t)$, $R_i(0) = 1$ ($i = 1, ..., 9$).
2.3.2 Successful Path Method
In this and in the next section, two general (closely related) methods are introduced. For simplicity, considerations will be based on the reliability block diagram given in Fig. 2.11. As in Section 2.2.6.1, $e_i$ stands for the event {element $E_i$ up in the interval $(0, t]$}, hence $\Pr\{e_i\} = R_i(t)$, as in Eq. (2.16), and $\Pr\{\bar e_i\} = 1 - R_i(t)$. The successful path method is based on the following concept: The system fulfills its required function if there is at least one path between the input and the output upon which all elements perform their required function. Paths must lead from left to right and may not contain any loops. Only the given direction is possible along a directed connection. The following successful paths exist in the reliability block diagram of Fig. 2.11
Figure 2.11 Reliability block diagram of a complex structure (elements $E_3$ and $E_4$ each appear twice in the RBD, the directed connection has reliability 1)
Consequently it follows that
Using the addition theorem of probability theory (Eq. (A6.15)), Eq. (2.35) leads to
with $R_S = R_{S0}(t)$ and $R_i = R_i(t)$, $R_i(0) = 1$, $i = 1, ..., 5$.
2.3.3 State Space Method
This method is based on the following concept: Every element $E_i$ is assigned an indicator $\zeta_i(t)$ with the following property: $\zeta_i(t) = 1$ as long as $E_i$ does not fail, and $\zeta_i(t) = 0$ if $E_i$ has failed. The vector with components $\zeta_i(t)$ determines the system state at time $t$. Since each element in the interval $(0, t]$ functions or fails independently of the others, $2^n$ states are possible for an item with $n$ elements. After listing the $2^n$ possible states at time $t$, all those states are determined in which the system performs the required function. The probability that the system is in one of these states is the reliability function $R_{S0}(t)$ of the system considered. The $2^n$ possible conditions at time $t$ for the reliability block diagram of Fig. 2.11 are
A "1" in this table means that the element or item considered has not failed in $(0, t]$ (see footnote on p. 58 for fault tree analysis). For Fig. 2.11, the event {system up in the interval $(0, t]$} is equivalent to the event
After appropriate simplification, this reduces to
from which
Evaluation of Eq. (2.37) leads to Eq. (2.36). In contrast to the successful path method, all events in the state space method (columns in the state space table and terms in Eq. (2.37)) are mutually exclusive.
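The state space method lends itself to direct enumeration. The following sketch lists the $2^5$ states for the structure of Fig. 2.11 and sums the probabilities of the states in which the system is up. It assumes independent elements and takes the five successful paths {1,3,4}, {1,3,5}, {1,4,5}, {2,3,5}, {2,4,5} from the system function quoted for Fig. 2.11 in Section 2.3.4; the function names are illustrative.

```python
from itertools import product

# Successful paths of Fig. 2.11 (element numbers), as used in Section 2.3.4
PATHS_FIG_2_11 = [{1, 3, 4}, {1, 3, 5}, {1, 4, 5}, {2, 3, 5}, {2, 4, 5}]

def up_states(paths, n):
    """Return all indicator vectors (zeta_1, ..., zeta_n) for which the system is up."""
    states = []
    for zeta in product((1, 0), repeat=n):
        up = {i + 1 for i, z in enumerate(zeta) if z}
        if any(p <= up for p in paths):
            states.append(zeta)
    return states

def reliability(paths, r):
    """Sum the probabilities of the (mutually exclusive) up states."""
    total = 0.0
    for zeta in up_states(paths, len(r)):
        prob = 1.0
        for z, ri in zip(zeta, r):
            prob *= ri if z else 1.0 - ri
        total += prob
    return total

print(len(up_states(PATHS_FIG_2_11, 5)), "of 32 states are up states")
print(reliability(PATHS_FIG_2_11, [0.9, 0.9, 0.9, 0.9, 0.9]))
```

Because the states are mutually exclusive, their probabilities can simply be added, which is the property emphasized in the comparison with the successful path method.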
2.3.4 Boolean Function Method
The Boolean function method generalizes and formalizes the methods introduced in Sections 2.3.2 & 2.3.3. For this analysis, besides the 3 assumptions given on p. 52, it is assumed that the item (system) considered is coherent, that is (basically)
1. The state of the system depends on the states of all of its elements; in particular, the system is up for all elements up and down for all elements down.
2. If the system is down, no additional failure of any element can bring it into an up state (monotony).
In the case of repairable systems, the second property must be extended to: If the system is up, it remains up if any element is repaired. Almost all systems in practical applications are coherent. In the following, up is used for system in operating state and down for system in a failed state (in repair if repairable).

The states of a coherent system can be described by a system function (structure function). A system function $\phi$ is a Boolean function +)

$\phi = \phi(\zeta_1, ..., \zeta_n) = \begin{cases} 1 & \text{for item up} \\ 0 & \text{for item down} \end{cases}$

of the indicators $\zeta_i = \zeta_i(t)$ defined in Section 2.3.3 ($\zeta_i = 1$ if element $E_i$ is up and $\zeta_i = 0$ if element $E_i$ is down), for which the following applies (coherent system):
1. $\phi$ depends on all the variables $\zeta_i$ ($i = 1, ..., n$).
2. $\phi$ is nondecreasing in all variables, $\phi = 0$ for all $\zeta_i = 0$ and $\phi = 1$ for all $\zeta_i = 1$.
Since the indicators $\zeta_i$ and the system function $\phi$ take on only the values 0 and 1,

$R_i(t) = \Pr\{\zeta_i(t) = 1\} = E[\zeta_i(t)]$

applies for the reliability function of element $E_i$, and

$R_{S0}(t) = \Pr\{\phi(\zeta_1(t), ..., \zeta_n(t)) = 1\} = E[\phi(\zeta_1(t), ..., \zeta_n(t))]$

applies for the reliability function of the system (calculation of $E[\phi]$ is in general easier than calculation of $\Pr\{\phi = 1\}$). The Boolean function method thus transfers the problem of calculating $R_{S0}(t)$ to that of the determination of the system function $\phi(\zeta_1, ..., \zeta_n)$. Two methods with a great intuitive appeal are available for this purpose:
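1. Minimal Path Set approach: A set $\mathcal{P}_i$ of elements is a minimal path set if the system is up when $\zeta_j = 1$ for all $E_j \in \mathcal{P}_i$ and $\zeta_k = 0$ for all $E_k \notin \mathcal{P}_i$, but this does not apply for any subset of $\mathcal{P}_i$ (for the bridge in Fig. 2.10, {1,3}, {2,4}, {1,5,4}, and {2,5,3} are the minimal path sets). The elements $E_j$ within $\mathcal{P}_i$ form a series model with system function

$\phi_{\mathcal{P}_i} = \prod_{E_j \in \mathcal{P}_i} \zeta_j$.

If for a given system there are $r$ minimal path sets, these form an active 1-out-of-$r$ redundancy; i.e.,

$\phi = \phi(\zeta_1, ..., \zeta_n) = 1 - \prod_{i=1}^{r} (1 - \phi_{\mathcal{P}_i}) = 1 - \prod_{i=1}^{r} \bigl(1 - \prod_{E_j \in \mathcal{P}_i} \zeta_j\bigr)$. (2.42)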
+) In fault tree analysis (FTA), "0" for operation (up) and "1" for failure (down) is often used [A2.6 (IEC 61025)].
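2. Minimal Cut Set approach: A set $\mathcal{C}_i$ is a minimal cut set if the system is down when $\zeta_j = 0$ for all $E_j \in \mathcal{C}_i$ and $\zeta_k = 1$ for all $E_k \notin \mathcal{C}_i$, but this does not apply for any subset of $\mathcal{C}_i$ (for the bridge in Fig. 2.10, {1,2}, {3,4}, {1,5,4}, and {2,5,3} are the minimal cut sets). The elements $E_j$ within $\mathcal{C}_i$ form a parallel model (active redundancy with $k = 1$) with system function

$\phi_{\mathcal{C}_i} = 1 - \prod_{E_j \in \mathcal{C}_i} (1 - \zeta_j)$.

If for a given system there are $m$ minimal cut sets, these form a series model; i.e.,

$\phi = \phi(\zeta_1, ..., \zeta_n) = \prod_{i=1}^{m} \phi_{\mathcal{C}_i} = \prod_{i=1}^{m} \bigl(1 - \prod_{E_j \in \mathcal{C}_i} (1 - \zeta_j)\bigr)$. (2.44)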
A series model with elements $E_1, ..., E_n$ has one path set and $n$ cut sets, a parallel model (1-out-of-$n$) has one cut set and $n$ path sets. Algorithms for finding all minimal path sets and all minimal cut sets are known, see e.g. [2.33 (1975)]. From Eqs. (2.40), (2.42), and (2.44) it holds that

with $R_{S0}(0) = 1$. For practical applications, the following bounds for the reliability function $R_{S0}(t)$ can be used [2.36]
If the minimal path sets are mutually exclusive, the right-hand inequality of Eq. (2.46) becomes an equality; the same holds for the minimal cut sets (left-hand inequality). The paths given with Eq. (2.35) are the minimal path sets for the reliability block diagram of Fig. 2.11. Using Eq. (2.42) this leads to the system function

$\phi(\zeta_1, ..., \zeta_5) = 1 - (1 - \zeta_1\zeta_3\zeta_4)(1 - \zeta_1\zeta_3\zeta_5)(1 - \zeta_1\zeta_4\zeta_5)(1 - \zeta_2\zeta_3\zeta_5)(1 - \zeta_2\zeta_4\zeta_5)$

and then to Eq. (2.36). Investigation of the block diagram of Fig. 2.11 by the method of minimal cut sets is more laborious. Obviously, minimal path sets and minimal cut sets deliver the same system function $\phi(\zeta_1, ..., \zeta_n)$, with different effort depending on the structure of the reliability block diagram considered (structures with many series elements can be treated easily with minimal path sets, see Example 2.10).
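That minimal path sets and minimal cut sets deliver the same system function can be checked numerically for the bridge of Fig. 2.10, using the sets listed above. The sketch below also evaluates a pair of bounds of the usual Esary-Proschan form (product over cut sets as lower bound, one minus product over path sets as upper bound); that these are the bounds meant in Eq. (2.46) is an assumption of this example, as are the function names and the numerical values.

```python
from itertools import product
from math import prod

PATH_SETS = [{1, 3}, {2, 4}, {1, 5, 4}, {2, 5, 3}]
CUT_SETS = [{1, 2}, {3, 4}, {1, 5, 4}, {2, 5, 3}]

def phi_paths(zeta):
    """System function from minimal path sets (1-out-of-r of series models)."""
    return 1 - prod(1 - prod(zeta[j] for j in p) for p in PATH_SETS)

def phi_cuts(zeta):
    """System function from minimal cut sets (series of parallel models)."""
    return prod(1 - prod(1 - zeta[j] for j in c) for c in CUT_SETS)

def exact_reliability(r):
    """E[phi] by enumeration of all 2**5 indicator vectors (independent elements)."""
    total = 0.0
    for bits in product((0, 1), repeat=5):
        zeta = dict(zip(range(1, 6), bits))
        assert phi_paths(zeta) == phi_cuts(zeta)  # both approaches agree state by state
        weight = prod(r[j] if zeta[j] else 1 - r[j] for j in zeta)
        total += phi_paths(zeta) * weight
    return total

def bounds(r):
    """Lower bound from cut sets, upper bound from path sets (assumed form of the bounds)."""
    lower = prod(1 - prod(1 - r[j] for j in c) for c in CUT_SETS)
    upper = 1 - prod(1 - prod(r[j] for j in p) for p in PATH_SETS)
    return lower, upper

r = {j: 0.9 for j in range(1, 6)}  # illustrative element reliabilities
print(bounds(r)[0], exact_reliability(r), bounds(r)[1])
```

For highly reliable elements the lower (cut set) bound is typically the tighter one, which is one reason cut sets are popular in practice despite the extra effort.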
Example 2.10
Give the system function according to the minimal path set and the minimal cut set approach for the following reliability block diagram, and calculate the reliability function assuming independent elements.
Solution
For the above reliability block diagram, there exist 2 minimal path sets $\mathcal{P}_1$, $\mathcal{P}_2$ and 4 minimal cut sets $\mathcal{C}_1, ..., \mathcal{C}_4$, as given below.
The system function follows then from Eq. (2.42) for the minimal path sets

or from Eq. (2.44) for the minimal cut sets (in both cases by considering $\zeta_i \zeta_i = \zeta_i$)
Assuming independence for the (different) elements, it follows for the reliability function (for both cases and with $R_S = R_{S0}(t)$ & $R_i = R_i(t)$, $R_i(0) = 1$, $i = 1, ..., 5$)

$R_S = R_1R_2R_5 + R_2R_3R_4R_5 - R_1R_2R_3R_4R_5$.

Supplementary results: Calculation with the key item method leads directly to

$R_S = R_2\,(R_1 + R_3R_4 - R_1R_3R_4)\,R_5 + (1 - R_2) \cdot 0 = R_2\,(R_1 + R_3R_4 - R_1R_3R_4)\,R_5$.
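The expansion of the system function with the idempotency rule $\zeta_i \zeta_i = \zeta_i$ can be automated by storing a multilinear polynomial as a mapping from sets of element indices to integer coefficients. The following sketch applies this to Example 2.10. The path sets {1,2,5}, {2,3,4,5} and cut sets {2}, {5}, {1,3}, {1,4} used here are inferred from the reliability function stated in the example (the original figure is not reproduced above), so they should be read as an assumption; all function names are illustrative.

```python
from collections import defaultdict

def multiply(p, q):
    """Product of two multilinear polynomials; set union implements zeta_i * zeta_i = zeta_i."""
    out = defaultdict(int)
    for s1, c1 in p.items():
        for s2, c2 in q.items():
            out[s1 | s2] += c1 * c2
    return {s: c for s, c in out.items() if c != 0}

def one_minus(p):
    """Return the polynomial 1 - p."""
    out = defaultdict(int, {frozenset(): 1})
    for s, c in p.items():
        out[s] -= c
    return {s: c for s, c in out.items() if c != 0}

def from_path_sets(path_sets):
    """phi = 1 - prod_i (1 - prod_{j in P_i} zeta_j)."""
    acc = {frozenset(): 1}
    for ps in path_sets:
        acc = multiply(acc, one_minus({frozenset(ps): 1}))
    return one_minus(acc)

def from_cut_sets(cut_sets):
    """phi = prod_i (1 - prod_{j in C_i} (1 - zeta_j))."""
    acc = {frozenset(): 1}
    for cs in cut_sets:
        term = {frozenset(): 1}
        for j in cs:
            term = multiply(term, one_minus({frozenset({j}): 1}))
        acc = multiply(acc, one_minus(term))
    return acc

# Sets inferred for Example 2.10 (assumption, see lead-in)
poly_paths = from_path_sets([{1, 2, 5}, {2, 3, 4, 5}])
poly_cuts = from_cut_sets([{2}, {5}, {1, 3}, {1, 4}])
for subset, coeff in sorted(poly_paths.items(), key=lambda kv: sorted(kv[0])):
    print(coeff, sorted(subset))
```

Replacing each product of indicators by the product of the corresponding $R_i(t)$ then gives the reliability function, which is valid only for independent elements.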
For items (systems) with independent, nonrepairable elements (up to system failure), the reliability function $R_{S0}(t) = E[\phi(\zeta_1, ..., \zeta_n)]$ can generally be obtained directly from the system function by considering in Eqs. (2.42) and (2.44) the idempotency property ($\zeta_i \zeta_i = \zeta_i$) and substituting $R_i(t)$ for $\zeta_i$ (Eq. (A6.69)). A further possibility is to use the disjunctive normal form $\phi_D$(
For coherent repairable systems with totally independent elements (every element works and is repaired independently from every other element and disposes of its own repair crew), Eq. (2.40) or Eq. (2.47) can be used to calculate the point availability $PA_{S0}(t)$, yielding for the case of Eq. (2.47)

with $PA_i = PA_{i0}(t)$ for the general case (Eq. (6.17)) or $PA_i = \mathrm{MTTF}_i / (\mathrm{MTTF}_i + \mathrm{MTTR}_i)$ for steady-state or $t \to \infty$ (Eq. (6.48)). However, in practical applications, a repair crew for each element in the reliability block diagram of a system is rarely available. Nevertheless, Eq. (2.48) can be used as an approximation (upper bound) for $PA_{S0}(t)$. For repairable elements, $\zeta_i(t)$ given by Eq. (2.38) is defined as $\zeta_i(t) = 1$ for element $E_i$ operating (up) and $\zeta_i(t) = 0$ for $E_i$ in repair (down), yielding $E[\zeta_i(t)] = PA_{i0}(t)$. In practical applications, computation is often easily performed for the unavailability $1 - PA_{S0}(t)$.
2.3.5 Parallel Models with Constant Failure Rates and Load Sharing
In the redundancy structures investigated in the previous sections, all elements were operating under the same conditions. For this type of redundancy, called active (parallel) redundancy, the assumed statistical independence of the elements implies in particular that there is no load sharing. This assumption does not hold in many practical applications, for example, at component level or in the presence of power elements. The investigation of the reliability function in the case of load sharing or of other kinds of dependency involves the use of stochastic processes. The situation is simple if one can assume that the failure rate of each element changes only when a failure occurs. In this case, the general model for a k-out-of-n redundancy is a death process as given in Fig. 2.12 (birth and death process as in Fig. 6.13 for the repairable case with constant failure & repair rates). $Z_0, ..., Z_{n-k+1}$ are the states of the process. In state $Z_i$, $i$ elements are down. At state $Z_{n-k+1}$ the system is down.
Figure 2.12 Diagram of the transition probabilities in $(t, t + \delta t]$ for a k-out-of-n redundancy (nonrepairable, constant failure rates during the sojourn time in each state (not necessarily at a state change, e.g. because of load sharing), $t$ arbitrary, $\delta t \to 0$, Markov process)
Assuming

$\lambda$ = failure rate of an element in the operating state (2.49)

and

$\lambda_r$ = failure rate of an element in the reserve state ($\lambda_r \le \lambda$), (2.50)

the model of Fig. 2.12 considers in particular the following cases:
1. Active redundancy without load sharing (independent elements)

$\nu_i = (n - i)\,\lambda$, $i = 0, ..., n - k$, (2.51)

$\lambda$ is the same for all states.
2. Active redundancy with load sharing ($\lambda = \lambda(i)$)

$\nu_i = (n - i)\,\lambda(i)$, $i = 0, ..., n - k$, (2.52)

$\lambda(i)$ increases at each state change.
3. Warm (lightly loaded) redundancy ($\lambda_r < \lambda$)

$\nu_i = k\lambda + (n - k - i)\,\lambda_r$, $i = 0, ..., n - k$, (2.53)

$\lambda$ and $\lambda_r$ are the same for all states.
4. Standby (cold) redundancy ($\lambda_r \equiv 0$)

$\nu_i = k\lambda$, $i = 0, ..., n - k$, (2.54)

$\lambda$ is the same for all states.

For a standby redundancy, it is assumed that the failure rate in the reserve state is $\equiv 0$ (the reserve elements are switched on when needed). Warm redundancy is somewhere between active and standby ($0 < \lambda_r < \lambda$). It should be noted that the k-out-of-n active, warm, or standby redundancies are only the simplest representatives of the general concept of redundancy. Series-parallel structures, voting techniques, bridges, and more complex structures are frequently used (see Sections 2.2.6, 2.3.1 - 2.3.4, and 6.6 - 6.8 with repair rate $\mu = 0$, for some examples). Furthermore, redundancy can also appear in other forms, e.g. at software level, and the benefit of redundancy can be limited by the involved failure modes as well as by control and switching elements (see Section 6.8 for some examples).

For the analysis of the model shown in Fig. 2.12, let

$P_i(t) = \Pr\{\text{the process is in state } Z_i \text{ at time } t\}$ (2.55)

be the state probabilities ($i = 0, ..., n - k + 1$). $P_i(t)$ is obtained by considering the process at two adjacent time points $t$ and $t + \delta t$ and by making use of the memoryless property resulting from the constant failure rate assumed between consecutive state changes (Appendix A7.5). The function $P_i(t)$ thus satisfies the following difference equation
$P_i(t + \delta t) = P_i(t)\,(1 - \nu_i\,\delta t) + P_{i-1}(t)\,\nu_{i-1}\,\delta t$, $i = 1, ..., n - k$. (2.56)
For $\delta t \to 0$, there follows then a system of differential equations describing the death process

Assuming the initial conditions $P_i(0) = 1$ and $P_j(0) = 0$ for $j \ne i$ at $t = 0$, the solution (generally obtained using the Laplace transform) leads to $P_i(t)$, $i = 0, ..., n - k + 1$. Knowing $P_i(t)$, one can evaluate the reliability function $R_S(t)$
and the mean time to failure from Eq. (2.9). Assuming for instance $P_0(0) = 1$ as initial condition, one obtains for the Laplace transform of $R_{S0}(t)$,

$\tilde R_{S0}(s) = \int_0^\infty R_{S0}(t)\,e^{-st}\,dt$,

the expression
The mean time to failure follows then from $\mathrm{MTTF}_{S0} = \tilde R_{S0}(0)$ and leads to

$\mathrm{MTTF}_{S0} = \sum_{i=0}^{n-k} \dfrac{1}{\nu_i}$. (2.62)
Thereby, S stands for system and 0 specifies the initial condition $P_0(0) = 1$ (Table 6.2). For a k-out-of-n standby redundancy (Eq. (2.54)), it follows that

and

$\mathrm{MTTF}_{S0} = \dfrac{n - k + 1}{k\lambda}$. (2.64)

Equation (2.63) shows the relation existing between the Poisson distribution and the occurrence of exponentially distributed events.
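The difference equation (2.56) can be iterated directly with a small time step to obtain the state probabilities numerically. The sketch below does this with a plain forward Euler scheme and compares the result with two closed forms: $2e^{-\lambda t} - e^{-2\lambda t}$ for a 1-out-of-2 active redundancy, and a Poisson sum $\sum_{i=0}^{n-k} (k\lambda t)^i e^{-k\lambda t} / i!$ for the standby case. That this Poisson sum is the content of Eq. (2.63) is an assumption of this example; function names, step count, and numerical values are illustrative.

```python
import math

def death_process(nu, t_end, steps=20000):
    """Iterate P_i(t+dt) = P_i(t)(1 - nu_i dt) + P_{i-1}(t) nu_{i-1} dt from P_0(0) = 1.

    nu holds the transition rates nu_0 .. nu_{n-k}; the returned list contains the
    probabilities of the up states Z_0 .. Z_{n-k} at time t_end.
    """
    dt = t_end / steps
    p = [1.0] + [0.0] * (len(nu) - 1)
    for _ in range(steps):
        new = [p[0] * (1.0 - nu[0] * dt)]
        for i in range(1, len(nu)):
            new.append(p[i] * (1.0 - nu[i] * dt) + p[i - 1] * nu[i - 1] * dt)
        p = new
    return p

def reliability(nu, t):
    """R_S0(t) as the sum of the up-state probabilities."""
    return sum(death_process(nu, t))

def poisson_sum(m, x):
    """Sum over i = 0..m of x**i exp(-x) / i!."""
    return sum(x**i / math.factorial(i) for i in range(m + 1)) * math.exp(-x)

lam, t = 1.0, 1.0
# 1-out-of-2 active redundancy: nu_0 = 2 lam, nu_1 = lam
print(reliability([2 * lam, lam], t), 2 * math.exp(-lam * t) - math.exp(-2 * lam * t))
# 2-out-of-3 standby redundancy: nu_0 = nu_1 = 2 lam
print(reliability([2 * lam, 2 * lam], t), poisson_sum(1, 2 * lam * t))
```

The Euler iteration is only first-order accurate, so agreement to three or four digits is what one should expect with the default step count.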
For the case of a k-out-of-n active redundancy without load sharing, it follows from Eqs. (2.62) and (2.51) that
see also Table 6.8 with $\mu = 0$ and $\lambda_r = \lambda$. Some examples for $R_{S0}(t)$ with different values for $n$ and $k$ are given in Fig. 2.7.
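Equation (2.62) reduces the mean time to failure of the death process to a sum over the reciprocal transition rates, so the four cases of Eqs. (2.51) to (2.54) differ only in how the $\nu_i$ are built. The following sketch makes this explicit; the function names and the load-sharing law used in the last lines are hypothetical illustrations, not taken from the text.

```python
def nu_active(n, k, lam):
    """Eq. (2.51): active redundancy without load sharing."""
    return [(n - i) * lam for i in range(n - k + 1)]

def nu_load_sharing(n, k, lam_of_i):
    """Eq. (2.52): active redundancy with a state-dependent failure rate lam_of_i(i)."""
    return [(n - i) * lam_of_i(i) for i in range(n - k + 1)]

def nu_warm(n, k, lam, lam_r):
    """Eq. (2.53): warm redundancy with reserve failure rate lam_r."""
    return [k * lam + (n - k - i) * lam_r for i in range(n - k + 1)]

def nu_standby(n, k, lam):
    """Eq. (2.54): standby redundancy (reserve failure rate zero)."""
    return nu_warm(n, k, lam, 0.0)

def mttf(nu):
    """Eq. (2.62): MTTF_S0 as the sum of 1 / nu_i."""
    return sum(1.0 / v for v in nu)

lam = 1.0
print("1-out-of-2 active :", mttf(nu_active(2, 1, lam)))
print("1-out-of-2 warm   :", mttf(nu_warm(2, 1, lam, 0.2 * lam)))
print("1-out-of-2 standby:", mttf(nu_standby(2, 1, lam)))
# Hypothetical load-sharing law: the failure rate doubles once one element has failed
print("1-out-of-2 sharing:", mttf(nu_load_sharing(2, 1, lambda i: lam * 2**i)))
```

Setting $\lambda_r = \lambda$ in the warm case reproduces the active case, and $\lambda_r = 0$ reproduces Eq. (2.64), which is a convenient sanity check.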
2.3.6 Elements with more than one Failure Mechanism or one Failure Mode
In the previous sections, it was assumed that each element exhibits only one dominant failure mechanism, causing one dominant failure mode; for example intermetallic compound causing a short or corrosion causing an open for integrated circuits. However, in practical applications, components can have several failure mechanisms and fail in different manners (see e.g. Table 3.4). A simple way to consider more than one failure mechanism is to assume that each failure mechanism is independent of the others and causes a failure at item level. In this case, a series model can be used by assigning a failure rate to each failure mechanism, and Eq. (2.18) or Eq. (7.57) delivers the total failure rate of the item considered. More sophisticated models are possible. A mixture of failure rates and/or mechanisms has been discussed in Section 2.2.5 (Eq. (2.15)). This section will consider as an example the case of a diode exhibiting two failure modes. Let

$R(t) = \Pr\{\text{no failure in } (0, t] \mid \text{diode new at } t = 0\}$

$\bar R(t) = 1 - R(t) = \Pr\{\text{failure in } (0, t] \mid \text{diode new at } t = 0\}$

$\bar R_U(t) = \Pr\{\text{open in } (0, t] \mid \text{diode new at } t = 0\}$

$\bar R_K(t) = \Pr\{\text{short in } (0, t] \mid \text{diode new at } t = 0\}$.
Obviously (Example 2.1 1)
The series connection of two diodes exhibits a circuit failure if either one open or two shorts occur. From this,

$\bar R_S = 1 - (1 - \bar R_U)^2 + \bar R_K^2 = 2\bar R_U - \bar R_U^2 + \bar R_K^2$, (2.67)

with $\bar R_S = \bar R_{S0}(t)$, $\bar R_K = \bar R_K(t)$, $\bar R_U = \bar R_U(t)$.
Example 2.11
In an accelerated test of 1000 diodes, 100 failures occur, of which 30 are opens and 70 shorts. Give an estimate for $\bar R$, $\bar R_U$, and $\bar R_K$.
Solution
The maximum likelihood estimate of an unknown probability $p$ is, according to Eq. (A8.29), $\hat p = k/n$. Hence, $\hat{\bar R} = 0.1$, $\hat{\bar R}_U = 0.03$, and $\hat{\bar R}_K = 0.07$.
Similarly, for two diodes in parallel (Example 2.12),
To be simultaneously protected against at least one failure of arbitrary mode (short or open), a quad redundancy is necessary. Depending upon whether opens or shorts are more frequent, a quad redundancy with or without a bridge connection is used. For both these cases it follows that
and
Equations (2.67) to (2.70) can be obtained using the state space method introduced in Section 2.3.3, however with three states for every element (good, open (U), and short (K)), leading to a state space with $3^n$ elements, see Example 2.12.
Example 2.12 Using the state space method, give the reliability of two parallel connected diodes, assuming that opens and shorts are possible.
Solution
Considering the three possible states (good (1), open (U), and short (K)), the state space for two parallel connected diodes is

    D1   1  1  1  U  U  U  K  K  K
    D2   1  U  K  1  U  K  1  U  K
    S    1  1  0  1  0  0  0  0  0

From the above table, it follows that

$$\bar R_S = \Pr\{S = 0\} = 2(1 - \bar R_U - \bar R_K)\bar R_K + \bar R_U^2 + 2\bar R_U \bar R_K + \bar R_K^2 = 2\bar R_K - \bar R_K^2 + \bar R_U^2.$$

The linear superposition of the two failure modes, appearing in the final result for $\bar R_S$, does not necessarily apply to arbitrary structures.
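The state space argument of Example 2.12 can be checked numerically. The following Python sketch enumerates all $3^n$ states of $n$ diodes and sums the probabilities of the failed states; it assumes independent diodes and uses the estimates of Example 2.11 as inputs. The function names (`state_probs`, `series_ok`, `parallel_ok`, `failure_probability`) are illustrative assumptions and do not come from the text.

```python
from itertools import product

GOOD, OPEN, SHORT = "1", "U", "K"  # state labels as in Example 2.12


def state_probs(p_open, p_short):
    """Probabilities of the three mutually exclusive diode states."""
    return {GOOD: 1.0 - p_open - p_short, OPEN: p_open, SHORT: p_short}


def series_ok(states):
    # Series connection: fails if any diode is open or if all diodes are short.
    return OPEN not in states and GOOD in states


def parallel_ok(states):
    # Parallel connection: fails if any diode is short or if all diodes are open.
    return SHORT not in states and GOOD in states


def failure_probability(structure_ok, p_open, p_short, n=2):
    """Sum the probabilities of all states in which the structure fails."""
    probs = state_probs(p_open, p_short)
    total = 0.0
    for states in product(probs, repeat=n):  # 3**n states
        if not structure_ok(states):
            p = 1.0
            for s in states:
                p *= probs[s]
            total += p
    return total


# Estimates from Example 2.11 (30 opens and 70 shorts out of 1000 diodes).
p_open, p_short = 0.03, 0.07
series_fail = failure_probability(series_ok, p_open, p_short)
parallel_fail = failure_probability(parallel_ok, p_open, p_short)
print(series_fail, parallel_fail)
```

With these inputs the enumeration agrees with the closed forms $2\bar R_U - \bar R_U^2 + \bar R_K^2$ for the series connection and $2\bar R_K - \bar R_K^2 + \bar R_U^2$ for the parallel connection.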
2.3.7 Basic Considerations on Fault Tolerant Structures

In applications with high reliability, availability, or safety requirements, equipment and systems must be designed to be fault tolerant. This means that without external help (autonomously) the item considered should be able to recognize a fault (failure or defect) and quickly reconfigure itself in such a way as to remain safe and possibly continue to operate with minimal performance loss (fail-safe, graceful degradation). Methods to investigate fault tolerant items have been introduced in Sections 2.2.6.2 through 2.3.6, in particular Sections 2.2.6.5 (majority redundancy) and 2.3.6 (quad redundancy). The latter is one of the few structures which can support at least one failure of any mode; the price paid is four devices instead of one. Other possibilities are known to implement fault tolerance at component level, e.g. [2.39]. Repairable fault tolerant systems are considered carefully in Chapter 6, in particular in Section 6.8 for nonideal reconfiguration (incomplete coverage, imperfect switching), phased-mission systems, common cause failures, and reward & frequency/duration aspects. It is shown there that the stochastic processes introduced in Appendix A7 can also be used to investigate reliability and availability of fault tolerant systems in cases for which no reliability block diagram exists. To avoid common cause or single-point failures, redundant elements should be designed and produced independently of each other, in critical cases with different technology, tools, and personnel. Investigation of all possible failure (fault) modes during the design of fault tolerant equipment or systems is mandatory. This is generally done using failure modes and effects analysis (FMEA/FMECA), fault tree analysis (FTA), causes-to-effects diagrams, or similar tools (Section 2.6), supported by appropriate investigation models (see e.g. Examples 6.14 and 6.16).

Failure mode analysis is essential where redundancy appears, among others to identify the parts which are in series to the ideal redundancy (in the reliability block diagram), to discover interactions between elements of the given item, and to find appropriate measures to avoid failure propagation (secondary failures). Protection against secondary failures can be realized, at component level, with decoupling elements such as diodes, resistors, or capacitors (diodes $E_1$ - $E_4$ in Example 2.3). Other possibilities are the introduction of standby elements which are activated at failure of active elements, the use of basically different technologies for redundant elements, etc. Quite generally, all parts which are essential for basic functions (e.g. interfaces and monitoring circuits) have to be designed with care. Adherence to appropriate design guidelines is important (Chapter 5). Recognition and localization of hidden failures, as well as avoidance of false alarms (caused e.g. by synchronization problems), are mandatory. These and similar considerations apply in particular to equipment and systems with high reliability and/or safety requirements, as used e.g. in aerospace, automotive, and nuclear applications. Many of the above aspects also apply to defects, both in hardware and software (see Section 5.3.1 for software defects).
2.4 Reliability Allocation
With complex equipment and systems, it is important to allocate reliability goals at subsystem and assembly levels early in the design phase. Such an allocation motivates the design engineer to consider reliability aspects at all system levels. Allocation is simple if the item (system) has no redundancy and its components have constant failure rates. The system's failure rate $\lambda_S$ is then constant and equal to the sum of the failure rates of its elements (Eq. (2.19)). In such a case, the allocation of $\lambda_S$ can be done as follows:

1. Break down the system into elements $E_1, \ldots, E_n$.
2. Define a complexity factor $k_i$ for each element ($0 \le k_i \le 1$, $k_1 + \ldots + k_n = 1$).
3. Determine the duty cycle $d_i$ for each element ($d_i$ = operating time of element $E_i$ / operating time of the system).
4. Allocate the system's failure rate $\lambda_S$ among elements $E_1, \ldots, E_n$ according to

$$\lambda_i = \lambda_S\, k_i / d_i. \tag{2.71}$$

Should all elements have the same complexity ($k_1 = \ldots = k_n = 1/n$) and the same duty cycle ($d_1 = \ldots = d_n = 1$), then

$$\lambda_i = \lambda_S / n. \tag{2.72}$$
In addition to the above, cost, technology risks, and failure effects should also be considered. A case-by-case optimization is often possible. Should the individual element failure rates not be constant and/or the system contain redundancy, allocation of reliability goals is more difficult. The results of Sections 2.2 and 2.3 can be used. If repairable series-parallel structures appear, one can often assume that the failure rate at equipment or system level is fixed by the series elements (Section 6.6), for which Eqs. (2.71) and (2.72) can be used.
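The allocation rule of Eq. (2.71) can be sketched in a few lines of Python. The function name `allocate_failure_rates` and the numerical values (system failure rate, complexity factors, duty cycles) are hypothetical and only serve as an illustration for a series system with constant failure rates.

```python
def allocate_failure_rates(lambda_s, k, d):
    """Allocate a system failure rate among elements: lambda_i = lambda_s * k_i / d_i."""
    if len(k) != len(d):
        raise ValueError("k and d must have the same length")
    if abs(sum(k) - 1.0) > 1e-9:
        raise ValueError("complexity factors must sum to 1")
    if any(di <= 0.0 or di > 1.0 for di in d):
        raise ValueError("duty cycles must lie in (0, 1]")
    return [lambda_s * ki / di for ki, di in zip(k, d)]


# Hypothetical inputs: system failure rate in 1/h, three elements.
lambda_s = 1e-5
k = [0.5, 0.3, 0.2]   # complexity factors
d = [1.0, 1.0, 0.5]   # duty cycles
lambdas = allocate_failure_rates(lambda_s, k, d)
print(lambdas)

# Consistency check: weighting each allocated rate by its duty cycle
# gives back the system failure rate.
print(sum(li * di for li, di in zip(lambdas, d)))
```

An element that operates only half of the time is allowed twice the failure rate during its operating time, which is why the duty cycle appears in the denominator.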
2.5 Mechanical Reliability, Drift Failures
As long as the reliability is considered to be the probability R for a mission success (without relation to the distribution of the failure-free time), the reliability analysis procedure for mechanical equipment or systems is similar to that used for electronic equipment or systems and is based on the following steps:

1. Definition of the system and of its associated mission profile.
2. Derivation of the corresponding reliability block diagram.
3. Determination of the reliability for each element of the reliability block diagram.
4. Calculation of the system reliability $R_{S0}$.
5. Elimination of reliability weaknesses and return to step 1 or 2, as necessary.

Such a procedure is currently used in practical applications and is illustrated by Examples 2.13 and 2.14.
Example 2.13
The fastening of two mechanical parts should be easy and reliable. It is done by means of two flanges which are pressed together with 4 clamps $E_1$ to $E_4$ placed at 90° to each other. Experience has shown that the fastening holds when at least 2 opposing clamps work. Set up the reliability block diagram for this fixation and compute its reliability (each clamp is new at $t = 0$ and has reliability $R_1 = R_2 = R_3 = R_4 = R$).

Solution
Since at least two opposing clamps ($E_1$ and $E_3$, or $E_2$ and $E_4$) have to function without failure, the reliability block diagram is obtained as the series connection of $E_1$ and $E_3$ in parallel with the series connection of $E_2$ and $E_4$ (see graph on the right). Under the assumption that each clamp is independent of every other one, the item reliability follows as $R_S = 2R^2 - R^4$.

Supplementary result: If two arbitrary clamps were sufficient for the required function, a 2-out-of-4 active redundancy would apply, yielding (Tab. 2.1) $R_S = 6R^2 - 8R^3 + 3R^4$.
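Both results of Example 2.13 can be verified by enumerating the $2^4$ up/down states of the clamps. The sketch below assumes independent clamps with equal reliability; the helper names and the value $R = 0.9$ are hypothetical and used only for illustration.

```python
from itertools import product


def system_reliability(works, r, n=4):
    """Sum the probabilities of all element states for which the system works."""
    total = 0.0
    for state in product((True, False), repeat=n):
        if works(state):
            p = 1.0
            for up in state:
                p *= r if up else 1.0 - r
            total += p
    return total


def opposing_pair(state):
    # Fastening holds if clamps E1 and E3, or E2 and E4, both work.
    e1, e2, e3, e4 = state
    return (e1 and e3) or (e2 and e4)


def any_two(state):
    # 2-out-of-4 active redundancy: any two working clamps are sufficient.
    return sum(state) >= 2


r = 0.9  # hypothetical clamp reliability
r_opposing = system_reliability(opposing_pair, r)
r_two_out_of_four = system_reliability(any_two, r)
print(r_opposing, 2 * r**2 - r**4)
print(r_two_out_of_four, 6 * r**2 - 8 * r**3 + 3 * r**4)
```

The enumeration reproduces the two polynomials and makes the benefit of accepting any two clamps, rather than only opposing ones, directly visible.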
Example 2.14
To separate a satellite's protective shielding, a special electrical-pyrotechnic system described in the block diagram on the right is used. An electrical signal comes through the cables $E_1$ and $E_2$ (redundancy) to the electrical-pyrotechnic converter $E_3$, which lights the fuses. These carry the pyrotechnic signal to explosive charges for guillotining bolts $E_{12}$ and $E_{13}$ of the tensioning belt. The charges can be ignited from two sides, although one ignition will suffice (redundancy). For fulfillment of the required function, both bolts must be exploded simultaneously. Give the reliability of this separation system as a function of the reliabilities $R_1, \ldots, R_{13}$ of its elements (new at $t = 0$).

Solution
The reliability block diagram is easily obtained by considering first the ignition of bolts $E_{12}$ and $E_{13}$ separately and then connecting these two parts of the reliability block diagram in series.
Elements $E_4$, $E_5$, $E_{10}$, and $E_{11}$ each appear twice in the reliability block diagram. Repeated application of the key item method (successively on $E_5$, $E_{11}$, $E_4$, and $E_{10}$; see Section 2.3.1 and Example 2.9), by assuming that the elements $E_1, \ldots, E_{13}$ are independent, leads to
$$R_{S0} = R_3 R_{12} R_{13} (R_1 + R_2 - R_1 R_2)\,\{\,\ldots\,\}$$
More complicated is the situation when the reliability function $R(t)$ is required. For electronic components it is possible to operate with the failure rate, since models and data are often available. This is generally not the case for mechanical parts, although failure rate models for some parts and units (bearings, springs, couplings, valves, etc.) have been developed [2.26]. If no information about failure rates is available, a general approach based on the stress-strength method, often supported by finite element analysis, can be used. Let $\xi_L(t)$ be the stress (load) and $\xi_S(t)$ the strength; a failure occurs at the time $t$ for which $|\xi_L(t)| > |\xi_S(t)|$ holds for the first time. Often, $\xi_L(t)$ and $\xi_S(t)$ can be considered as deterministic values, and the ratio $\xi_S(t)/\xi_L(t)$ is the safety factor. In many practical applications, $\xi_L(t)$ and $\xi_S(t)$ are random variables, often stochastic processes. A practice-oriented procedure for the reliability analysis of mechanical systems in these cases is:

1. Definition of the system and of its associated mission profile.
2. Formulation of failure hypotheses (buckling, bending, etc.) and validation of them using a FMEA/FMECA (Section 2.6); failure hypotheses are often correlated, and this dependence must be identified and considered.
3. Evaluation of the stresses applied with respect to the critical failure hypotheses.
4. Evaluation of the strength limits by considering also dynamic stresses, notches, surface condition, etc.
5. Calculation of the system reliability (Eqs. (2.74) - (2.80)).
6. Elimination of reliability weaknesses and return to step 1 or 2, as necessary.

Reliability calculation often leads to one of the following situations:
1. One failure hypothesis, stress and strength are > 0: The reliability function is given by

$$R_{S0}(t) = \Pr\{\xi_S(x) > \xi_L(x),\ 0 < x \le t\}, \qquad R_{S0}(0) = 1. \tag{2.74}$$
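2. More than one ($n > 1$) failure hypothesis that can be correlated, stresses and strengths are > 0: The reliability function is given by

$$R_{S0}(t) = \Pr\{(\xi_{S_1}(x) > \xi_{L_1}(x)) \cap \ldots \cap (\xi_{S_n}(x) > \xi_{L_n}(x)),\ 0 < x \le t\}, \qquad R_{S0}(0) = 1. \tag{2.75}$$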
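Equation (2.75) can take a complicated form, according to the degree of dependence encountered. The situation is easier when stress and strength can be assumed to be independent and positive random variables. In this case, $\Pr\{\xi_S > \xi_L \mid \xi_L = x\} = \Pr\{\xi_S > x\} = 1 - F_S(x)$, and the theorem of total probability leads to

$$R_{S0} = \Pr\{\xi_S > \xi_L\} = \int_0^\infty f_L(x)\,(1 - F_S(x))\,dx, \tag{2.76}$$

where $f_L(x)$ is the density of the stress and $F_S(x)$ the distribution function of the strength.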
Examples 2.15 and 2.16 illustrate the use of Eq. (2.76).
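Example 2.15
Let the stress $\xi_L$ of a mechanical joint be normally distributed with mean $m_L = 100\,\mathrm{N/mm^2}$ and standard deviation $\sigma_L = 40\,\mathrm{N/mm^2}$. The strength $\xi_S$ is also normally distributed with mean $m_S = 150\,\mathrm{N/mm^2}$ and standard deviation $\sigma_S = 10\,\mathrm{N/mm^2}$. Compute the reliability of the joint.

Solution
Since $\xi_L$ and $\xi_S$ are normally distributed, their difference is also normally distributed (A6.16). Its mean and standard deviation are $m_S - m_L = 50\,\mathrm{N/mm^2}$ and $\sqrt{\sigma_S^2 + \sigma_L^2} \approx 41\,\mathrm{N/mm^2}$, respectively. The reliability of the joint is then given by (Table A9.1)

$$R_{S0} = \Pr\{\xi_S > \xi_L\} = \Pr\{\xi_S - \xi_L > 0\} = \frac{1}{\sqrt{2\pi}} \int_{-50/41}^{\infty} e^{-y^2/2}\,dy \approx 0.89.$$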
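For normally distributed, independent stress and strength, the calculation of Example 2.15 reduces to one evaluation of the standard normal distribution function. The following sketch uses only the Python standard library; the function names `phi` and `stress_strength_reliability` are assumptions made for this illustration.

```python
import math


def phi(z):
    """Standard normal distribution function, expressed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def stress_strength_reliability(m_s, sigma_s, m_l, sigma_l):
    """Pr{strength > stress} for independent normal stress and strength."""
    # The difference strength - stress is normal with mean m_s - m_l
    # and standard deviation sqrt(sigma_s**2 + sigma_l**2).
    return phi((m_s - m_l) / math.hypot(sigma_s, sigma_l))


# Data of Example 2.15, all values in N/mm^2.
r_joint = stress_strength_reliability(m_s=150.0, sigma_s=10.0, m_l=100.0, sigma_l=40.0)
print(round(r_joint, 2))
```

The result rounds to 0.89 and is dominated by the large scatter of the stress, not by the scatter of the strength.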
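Example 2.16
Let the strength $\xi_S$ of a rod be normally distributed with mean $m_S = 450\,\mathrm{N/mm^2} - 0.01\,t\ \mathrm{N/mm^2\,h^{-1}}$ and standard deviation $\sigma_S = 25\,\mathrm{N/mm^2} + 0.001\,t\ \mathrm{N/mm^2\,h^{-1}}$. The stress $\xi_L$ is constant and equal to $350\,\mathrm{N/mm^2}$. Calculate the reliability of the rod at $t = 0$ and $t = 10^4\,\mathrm{h}$.

Solution
At $t = 0$, $m_S = 450\,\mathrm{N/mm^2}$ and $\sigma_S = 25\,\mathrm{N/mm^2}$. Thus,

$$R_{S0} = \Pr\{\xi_S > \xi_L\} = \frac{1}{\sqrt{2\pi}} \int_{(350-450)/25}^{\infty} e^{-y^2/2}\,dy \approx 0.99997.$$

After 10,000 operating hours, $m_S = 350\,\mathrm{N/mm^2}$ and $\sigma_S = 35\,\mathrm{N/mm^2}$. The reliability is then

$$R_{S0} = \Pr\{\xi_S > \xi_L\} = \frac{1}{\sqrt{2\pi}} \int_{(350-350)/35}^{\infty} e^{-y^2/2}\,dy = 0.5.$$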
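The drift of the strength in Example 2.16 can be followed over time with a short Python sketch. It assumes, as in the example, a constant stress and a normally distributed strength whose mean and standard deviation change linearly with the operating time; the function names are illustrative.

```python
import math


def phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rod_reliability(t_hours, stress=350.0):
    """Pr{strength > stress} at operating time t (data of Example 2.16, N/mm^2)."""
    mean_strength = 450.0 - 0.01 * t_hours   # mean decreases with time
    sd_strength = 25.0 + 0.001 * t_hours     # scatter increases with time
    return phi((mean_strength - stress) / sd_strength)


for t in (0.0, 2500.0, 5000.0, 7500.0, 10000.0):
    print(f"t = {t:7.0f} h   R = {rod_reliability(t):.5f}")
```

The two end points reproduce the values asked for in the example; the intermediate points show how the reliability degrades as the mean strength approaches the stress.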
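Equation (2.76) holds for a one-item structure. For a series model, i.e. in particular for the series connection of two independent elements, one obtains (with $f_L$, $f_{L_i}$ the densities of the stresses and $F_{S_i}$ the distribution functions of the strengths):

1. Same stress $\xi_L$

$$R_{S0} = \Pr\{\xi_{S_1} > \xi_L \cap \xi_{S_2} > \xi_L\} = \int_0^\infty f_L(x)\,(1 - F_{S_1}(x))\,(1 - F_{S_2}(x))\,dx. \tag{2.77}$$

2. Independent stresses $\xi_{L_1}$ and $\xi_{L_2}$

$$R_{S0} = \Pr\{\xi_{S_1} > \xi_{L_1} \cap \xi_{S_2} > \xi_{L_2}\} = \Big(\int_0^\infty f_{L_1}(x)\,(1 - F_{S_1}(x))\,dx\Big)\Big(\int_0^\infty f_{L_2}(x)\,(1 - F_{S_2}(x))\,dx\Big) = R_1 R_2. \tag{2.78}$$

For a parallel model, i.e. in particular for the parallel connection of two nonrepairable independent elements, it follows that:

1. Same stress $\xi_L$

$$R_{S0} = 1 - \Pr\{\xi_{S_1} \le \xi_L \cap \xi_{S_2} \le \xi_L\} = 1 - \int_0^\infty f_L(x)\,F_{S_1}(x)\,F_{S_2}(x)\,dx. \tag{2.79}$$

2. Independent stresses $\xi_{L_1}$ and $\xi_{L_2}$

$$R_{S0} = 1 - (1 - R_1)(1 - R_2) = R_1 + R_2 - R_1 R_2. \tag{2.80}$$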
As with Eqs. (2.78) and (2.80), the results of Table 2.1 can be applied in the case of independent stresses and elements. However, this ideal situation is seldom true for mechanical systems, for which Eqs. (2.77) and (2.79) are often more realistic. Moreover, the uncertainty about the exact form of the distributions for stress and strength far from the mean value severely reduces the accuracy of the results obtained from the above equations in practical applications. For mechanical items, tests are thus often the only way to evaluate their reliability. Investigations into new methods are in progress, paying particular attention to the dependence between stresses and to a realistic truncation of the stress and strength densities (Eq. (A6.33)). Other approaches are possible for mechanical systems, see e.g. [2.61 - 2.75]. For electronic items, Eqs. (2.76) and (2.77) - (2.80) can often be used to investigate drift failures. Quite generally, all considerations of Section 2.5 could be applied to electronic items. However, the method based on the failure rate, introduced in Section 2.2, is easier to use and works reasonably well in many practical applications dealing with electronic and electromechanical equipment and systems.
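The difference between a common stress and independent stresses can be made visible numerically. The sketch below is an illustration under stated assumptions: two identical elements with independent strengths, normal distributions with the parameters of Example 2.15 (integrated over the whole real axis, since the normal model is not strictly positive), and a simple trapezoidal rule. For a common stress, the strength terms are multiplied under one integral over the stress density; for independent stresses, the single-element reliabilities are combined as in Table 2.1. All function and variable names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, sd):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def integrate(f, a, b, steps=4000):
    """Composite trapezoidal rule on [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


M_L, SD_L = 100.0, 40.0   # stress parameters (N/mm^2), as in Example 2.15
M_S, SD_S = 150.0, 10.0   # strength parameters of each element (N/mm^2)
A, B = M_L - 8.0 * SD_L, M_L + 8.0 * SD_L   # integration range for the stress


def load(x):
    return normal_pdf(x, M_L, SD_L)


def survive(x):
    # Probability that one element's strength exceeds the stress value x.
    return 1.0 - normal_cdf(x, M_S, SD_S)


r_single = integrate(lambda x: load(x) * survive(x), A, B)

# Series model: same stress versus independent stresses.
series_same = integrate(lambda x: load(x) * survive(x) ** 2, A, B)
series_indep = r_single ** 2

# Parallel model: same stress versus independent stresses.
parallel_same = 1.0 - integrate(lambda x: load(x) * (1.0 - survive(x)) ** 2, A, B)
parallel_indep = 1.0 - (1.0 - r_single) ** 2

print(f"single element         : {r_single:.4f}")
print(f"series,   same stress  : {series_same:.4f}   independent: {series_indep:.4f}")
print(f"parallel, same stress  : {parallel_same:.4f}   independent: {parallel_indep:.4f}")
```

Under these assumptions a common stress makes the series structure look better and the parallel structure look worse than the independent-stress formulas suggest, which illustrates why redundancy is less effective for mechanical items subjected to the same load.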
2.6 Failure Mode Analysis
Failure rate analysis (Sections 2.1 - 2.5) basically does not account for the mode and effect (consequence) of a failure. To understand the mechanism of system failures and in order to identify potential weaknesses of a fail-safe concept, it is necessary to perform a failure mode analysis, at least where redundancy appears and for critical parts of the item considered. Such an analysis is termed FMEA (Failure Modes and Effects Analysis) or, alternatively, FMECA (Failure Modes, Effects, and Criticality Analysis) if the failure severity is also of interest. If failures and defects have to be considered, Fault is used instead of Failure. A FMEA/FMECA consists of the systematic analysis of failure (fault) modes, their causes, effects, and criticality [2.81 - 2.84, 2.87 - 2.93, 2.95, 2.97], including common-mode and common-cause failures as well. All possible failure (fault) modes (for the item considered), their causes, and their consequences are systematically investigated, in one run or in several steps (design FMEA/FMECA, process FMEA/FMECA). For critical cases, the possibilities to avoid the failure (fault) or to minimize its consequence are analyzed and corresponding corrective (or preventive) actions are initiated. The criticality describes the severity of the consequence of the failure (fault) and is designated by categories or levels which are a function of the risk for damage or loss of performance. Considerations on failure modes for electronic components are in Tables 3.4 and A10.1 and Section 3.3. The FMEA/FMECA is performed bottom-up by the designer in cooperation with the reliability engineer. The procedure is well established in international standards [2.89]. It is easy to understand but can become time-consuming for complex equipment and systems. For this reason it is recommended to concentrate efforts on critical parts of the item considered, in particular where redundancy appears. Table 2.5 shows the procedure for a FMEA/FMECA (conforming to IEC 60812 [2.89]).

Steps 3 to 8 are basic. Table 2.6 gives an example of a detailed FMECA for the electronic switch given in Example 2.6, Point 7. Each row of Tab. 2.5 is a column in Tab. 2.6. Other worksheet forms for FMEA/FMECA are possible [2.82 - 2.84, 2.91, 2.95]. The FMEA/FMECA is mandatory for items with fail-safe behavior and for all parts of an item in which redundancy appears (to verify the effectiveness of the redundancy when a failure occurs and to define the element in series on the reliability block diagram), as well as for failures which can cause a safety problem (liability claim). A FMEA/FMECA is also useful to support maintainability analyses. For a visualization of the item's criticality, the FMECA is often completed by a criticality grid (criticality matrix), see e.g. [2.89]. In such a matrix, each failure mode gives an entry (dot or other) with the criticality category as ordinate and the corresponding probability (frequency) of occurrence as abscissa (Fig. 2.13). Generally accepted classifications are minor (I), major (II), critical (III), and catastrophic (IV) for the criticality level and very low, low, medium, and high for the probability of occurrence. In a criticality grid, the farther an entry is from the origin, the greater is the necessity for a corrective/preventive action.
Table 2.5 Basic procedure for performing a FMECA (according to IEC 60812 [2.89])*)

1. Sequential numbering of the step.
2. Designation of the element or part under consideration, short description of its function, and reference to the reliability block diagram, part list, etc. (3 steps in IEC 60812).
3. Assumption of a possible fault**) mode (all possible fault modes have to be considered).
4. Identification of possible causes for the fault mode assumed in step 3 (a cause for a fault can also be a flaw in the design phase, production phase, transportation, installation, or use).
5. Description of the symptoms which will characterize the fault mode assumed in step 3 and of its local effect (output/input relationships, possibilities for secondary failures or faults, etc.).
6. Identification of the consequences of the fault mode assumed in step 3 on the next higher integration levels (up to the system level) and on the mission to be performed.
7. Identification of fault detection provisions and of corrective actions which can mitigate the severity of the fault mode assumed in step 3, reduce the probability of occurrence, or initiate an alternate operational mode which allows continued operation when the fault occurs.
8. Identification of possibilities to avoid the fault mode assumed in step 3.
9. Evaluation of the severity of the fault mode assumed in step 3 (FMECA only); e.g. I for minor, II for major, III for critical, IV for catastrophic (or alternatively, 1 for failure to complete a task, 2 for large economic loss, 3 for large material damage, 4 for loss of human life).
10. Estimation of the probability of occurrence (or failure rate) of the fault mode assumed in step 3 (FMECA only), with consideration of the cause of fault identified in step 4.
11. Formulation of pertinent remarks which complete the information in the previous columns and also of recommendations for corrective actions, which will reduce the consequences of the fault mode assumed in step 3.

*) FMEA by omitting steps 9 & 10; steps are columns in Tab. 2.6; **) fault includes failure & defect
The procedure for the FMEA/FMECA has been developed for hardware, but can also be used for software [2.87, 2.88, 5.64, 5.68]. For mechanical items, the FMEA/FMECA is an essential tool in reliability analysis (Section 2.5).
Figure 2.13 Example of criticality grid for a FMECA (according to IEC 60812 [2.89]); abscissa: probability of failure / fault (very low, low, medium, high)
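The idea of a criticality grid can be sketched as a small data structure. The Python example below is only an illustration: the worksheet entries are hypothetical, and ranking by Euclidean distance from the origin on an ordinal 1 to 4 scale is an assumption chosen here to mimic the rule that entries far from the origin need action first; it is not prescribed by IEC 60812 or by the text.

```python
import math
from dataclasses import dataclass

# Ordinal scales following the classifications named in the text.
SEVERITY = {"I": 1, "II": 2, "III": 3, "IV": 4}  # minor ... catastrophic
PROBABILITY = {"very low": 1, "low": 2, "medium": 3, "high": 4}


@dataclass(frozen=True)
class FaultMode:
    element: str
    mode: str
    severity: str      # key of SEVERITY
    probability: str   # key of PROBABILITY

    def distance_from_origin(self):
        # Assumed ranking measure: distance of the grid entry from the origin.
        return math.hypot(SEVERITY[self.severity], PROBABILITY[self.probability])


def rank(entries):
    """Order fault modes so that the entries farthest from the origin come first."""
    return sorted(entries, key=lambda e: e.distance_from_origin(), reverse=True)


# Hypothetical worksheet entries, for illustration only.
worksheet = [
    FaultMode("E1", "open", "II", "low"),
    FaultMode("E1", "short", "IV", "medium"),
    FaultMode("E2", "drift", "I", "high"),
]

for entry in rank(worksheet):
    print(entry.element, entry.mode, entry.severity, entry.probability,
          round(entry.distance_from_origin(), 2))
```

In practice the categories are assigned by engineering judgment, and a project may prefer another ordering rule, for instance treating every category IV entry as mandatory regardless of its probability.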
Table 2.6 Example of a detailed FMECA for the electronic switch given in Example 2.6, Point 7

FAILURE (FAULT) MODES AND EFFECTS ANALYSIS / FAILURE (FAULT) MODES, EFFECTS, AND CRITICALITY ANALYSIS
Equipment: control cabinet XYZ; Item: LED display circuit; Date: Sept. 13, 2000; Mission / required function: fault signaling; State: operating phase; Pages: 1 & 2
Columns: (1) No.; (2) Element, function, position; (3) Assumed fault mode; (4) Possible causes; (5) Symptoms, local effects; (6) Effect on mission; (7) Fault detection possibilities; (8) Possibilities to avoid the fault mode in (3); (9) Severity; (10) Probability of occurrence; (11) Remarks and suggestions
Figure 2.14 Example of Fault Tree (FT) for the electronic switch given in Example 2.6, Point 7, p. 51 (O = open, S = short, Ext. are possible external causes, such as power out, manufacturing error, etc.); as in use for FTA, "0" holds for operating and "1" for failure (Section 6.9.2)
A further possibility to investigate failure-causes-to-effects relationships is the Fault Tree Analysis (FTA) [2.6, 2.85, 2.86, 2.95, 2.96]. The FTA is a top-down procedure in which the undesired event, for example a critical failure at system level, is represented by AND, OR, and NOT combinations of causes at lower levels. It is a current rule in FTA [A2.6 (IEC 61025)] to use "0" for operating and "1" for failure (the top event "1" being in general a failure). An example of Fault Tree (FT) for the electronic switch of Example 2.6 (Point 7) is shown in Fig. 2.14. In a fault tree, a cut set is a set of basic events whose occurrence causes the top event to occur. If the top event is system failure, minimal cut sets defined by Eq. (2.43) can be identified. Algorithms have been developed to obtain from a fault tree the minimal cut sets (and minimal path sets) belonging to the system considered, see e.g. [2.35, 2.96]. From a complete and correct fault tree it is thus possible to calculate the reliability function and the point availability of the corresponding system in the case of parallel (active) redundancy and totally independent elements (p. 52). To consider some dependencies, dynamic gates have been introduced (Section 6.9.2). Compared to FMEA/FMECA, FTA can take external influences (human and/or environmental) better into account, and handle situations where more than one primary fault (multiple faults) has to occur in order to cause the undesired event at system level. However, it does not necessarily go through all possible fault modes. Combination of FMEA/FMECA with FTA leads to a causes-to-effects chart, showing the logical relationship between causes and their single or multiple consequences.
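The notions of top event, basic events, and minimal cut sets can be illustrated with a small Python sketch. The tree used here (top event = A OR (B AND C)) and the basic event probabilities are hypothetical and unrelated to Fig. 2.14; the brute-force search is only practical for small, coherent trees built from AND and OR gates with independent basic events, and is not one of the algorithms cited in the text.

```python
from itertools import combinations, product


def top_event(node, failed):
    """Evaluate a tree of nested ("AND" | "OR", child, ...) tuples; leaves are event names."""
    if isinstance(node, str):
        return node in failed
    gate, *children = node
    results = [top_event(child, failed) for child in children]
    return all(results) if gate == "AND" else any(results)


def basic_events(node):
    if isinstance(node, str):
        return {node}
    return set().union(*(basic_events(child) for child in node[1:]))


def minimal_cut_sets(tree):
    """Brute-force search by increasing size; supersets of found cut sets are skipped."""
    events = sorted(basic_events(tree))
    found = []
    for size in range(1, len(events) + 1):
        for combo in combinations(events, size):
            candidate = set(combo)
            if top_event(tree, candidate) and not any(c <= candidate for c in found):
                found.append(candidate)
    return found


def top_probability(tree, p_fail):
    """Exact top event probability by enumerating all basic event outcomes."""
    events = sorted(basic_events(tree))
    total = 0.0
    for outcome in product((True, False), repeat=len(events)):
        failed = {e for e, is_failed in zip(events, outcome) if is_failed}
        if top_event(tree, failed):
            p = 1.0
            for e, is_failed in zip(events, outcome):
                p *= p_fail[e] if is_failed else 1.0 - p_fail[e]
            total += p
    return total


tree = ("OR", "A", ("AND", "B", "C"))        # hypothetical fault tree
p_fail = {"A": 0.01, "B": 0.1, "C": 0.2}      # hypothetical failure probabilities

print(minimal_cut_sets(tree))
print(top_probability(tree, p_fail))
```

For this tree the minimal cut sets are {A} and {B, C}, and the enumeration agrees with the direct calculation 1 - (1 - 0.01)(1 - 0.1 · 0.2).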
Further methods which can support causes-to-effects analyses are sneak analysis (circuit, path, timing), worst case analysis, drift analysis, stress-strength analysis, Ishikawa (fishbone) diagrams, Kepner-Tregoe method, Pareto diagrams, and Shewhart cycles (Plan-Analyze-Check-Do), see e.g. [1.19, 1.22, 2.13]. Table 2.7 gives a comparison of the most important tools used for causes-to-effects analyses. Figure 2.15 shows the basic structure of an Ishikawa (fishbone) diagram. The Ishikawa diagram is a graphical visualization of the relationships between causes and effect, grouping the causes into machine, material, method, and human (man), into failure mechanisms, or into a combination of all of them, as appropriate.
Figure 2.15 Typical structure of a cause and effect (Ishikawa or fishbone) diagram, with major and minor causes leading to the effect (causes can often be grouped into Machine, Material, Method, and Human (Man), into failure mechanisms, or into a combination of all of them, as appropriate)
Performing a FMEA/FMECA, FTA, or any other similar investigation presupposes detailed technical knowledge and a thorough understanding of the item and the technology considered. This is necessary to identify all relevant potential flaws (during design, development, manufacture, operation), their causes, and the most appropriate corrective or preventive actions.
2.7 Reliability Aspects in Design Reviews
Design reviews are important to point out, discuss, and eliminate design weaknesses. Their objective is also to decide about continuation or stopping of the project on the basis of objective considerations (feasibility checks in Fig. 1.6 and in Tables A3.3 and 5.3). The most important design reviews are described in Table A3.3 for hardware and in Table 5.5 for software. To be effective, design reviews must be supported by project specific checklists. Table 2.8 gives an example of a catalog of questions which can be used to generate project specific checklists for reliability aspects in design reviews (see Table 4.3 for maintainability and Appendix A4 for other aspects). As shown in Table 2.8, checking the reliability aspects during a design review is more than just verifying the value of the predicted reliability or the source used for failure rate calculation. The purpose of a design review is in particular to discuss the selection and use of components and materials, the adherence to given design guidelines, the presence of potential reliability weaknesses, and the results of analyses and tests. Table 2.8 and Table 2.9 can be used to support this aim.
Table 2.7 Important tools for causes-to-effects analysis

FMEA/FMECA (Fault Mode Effects Analysis / Fault Mode, Effects, and Criticality Analysis)*)
  Description: Systematic bottom-up investigation of the effects (consequences) at system (item) level of the fault modes of all parts of the system considered, as well as of manufacturing flaws and (as far as possible) of user's errors / mistakes*)
  Application: Development phase (design FMEA/FMECA) and production phase (process FMEA/FMECA); mandatory for all interfaces, in particular where redundancy appears and for safety relevant parts
  Effort: Very large if performed for all elements (0.1 MM for a PCB)

FTA (Fault Tree Analysis)
  Description: Quasi-systematic top-down investigation of the effects (consequences) of faults (failures and defects) as well as of external influences on the reliability and/or safety of the system (item) considered; the top event (e.g. a specific catastrophic failure) is the result of AND & OR combinations of elementary events
  Application: Similar to FMEA/FMECA; however, combinations of more than one fault (or elementary event) can be better considered than by a FMEA/FMECA; the influence of external events (natural catastrophe, sabotage, etc.) is also easier to consider
  Effort: Large to very large if many top events are considered

Ishikawa Diagram (Fishbone Diagram)
  Description: Graphical representation of the causes-to-effects relationships; the causes are often grouped in four classes: machine, material, method / process, and human dependent
  Application: Ideal for team-work discussions, in particular for the investigation of design, development, or production weaknesses
  Effort: Small to large

Kepner-Tregoe Method
  Description: Structured problem detection, analysis, and solution for complex situations; the main steps of the method deal with a careful problem analysis, decision making, and ...
  Application: Generally applicable, in particular for complex situations and in interdisciplinary work-groups
  Effort: Largely dependent on the specific situation

Pareto Diagram
  Description: Graphical presentation of the frequency (histogram) and (cumulative) distribution of the problem causes, grouped in application specific classes
  Application: Decision making in selecting the causes of a fault and thus in defining the appropriate corrective action (Pareto rule: 80% of the problems are generated by 20% of the possible causes)
  Effort: Small

Correlation Diagram
  Description: Graphical representation of (two) quantities with possible functional (deterministic or stochastic) relation on an appropriate x/y-Cartesian coordinate system
  Application: Assessment of a relationship between two quantities
  Effort: Small

*) Faults include failures and defects, allowing errors as possible causes as well; MM stands for man month
Table 2.8 Example of a catalog of questions for the preparation of project specific checklists for the evaluation of reliability aspects in preliminary design reviews (Appendices A3 and A4) of complex equipment and systems with high reliability requirements

1. Is it a new development, redesign, or change / modification?
2. Is there test or field data available from similar items? What were the problems?
3. Has a list of preferred components been prepared and consistently used?
4. Is the selection / qualification of nonstandard components and material specified? How?
5. Have the interactions among elements been minimized? Can interface problems be expected?
6. Have all the specification requirements of the item been fulfilled? Can individual requirements be reduced?
7. Has the mission profile been defined? How has it been considered in the analysis?
8. Has a reliability block diagram been prepared?
9. Have the environmental conditions for the item been clearly defined? What are the operating conditions for each element?
10. Have derating rules been appropriately applied?
11. Has the junction temperature of all semiconductor devices been kept lower than 100°C?
12. Have drift, worst-case, and sneak path analyses been performed? What are the results?
13. Has the influence of on-off switching and of external interference (EMC) been considered?
14. Is it necessary to improve the reliability by introducing redundancy?
15. Has a FMEA/FMECA been performed, at least for the parts where redundancy appears? How? Are single-point failures present? Can nothing be done against them? Are there safety problems? Can liability problems be expected?
16. Does the predicted reliability of each element correspond to its allocated value? With which π-factors has it been calculated?
17. Has the predicted reliability of the whole item been calculated? Does this value correspond to the target given in the item's specifications?
18. Are there elements with a limited useful life?
19. Are there components which require screening? Assemblies which require environmental stress screening (ESS)?
20. Can design or construction be further simplified?
21. Is failure detection, localization, and removal easy? Are hidden failures possible?
22. Have reliability tests been planned? What does this test program include?
23. Have the aspects of manufacturability, testability, and reproducibility been considered?
24. Have the supply problems (second source, long-term deliveries, obsolescence) been solved?
Table 2.9 Example of form sheets for detecting and investigating potential reliability weaknesses at assembly and equipment level

a) Assembly design
   Columns: Component; Failure ...; Parameters; ... design, develop.; ... guidelines

b) Assembly manufacturing
   Columns: Item; Layout; Soldering; Cleaning; El. tests; Screening; Fault (defect, failure) analysis; Corrective actions; Transportation and storage

c) Prototype qualification tests
   Columns: Item; Electrical tests; Environmental tests; Reliability tests; Fault (defect, failure) analysis; Corrective actions

d) Equipment or system level
   Columns: Assembling; Test; Screening (ESS); Fault (defect, failure) analysis; Corrective actions; Transportation and storage; Operation (field data)
3 Qualification Tests for Components and Assemblies
Components, materials, and assemblies have a great impact on the quality and reliability of the equipment and systems in which they are used. Their selection and qualification have to be considered with care for new technologies or important redesigns, on a case-by-case basis. Besides cost and availability on the market, important selection criteria are intended application, technology, quality, long-term behavior of relevant parameters, and reliability. A qualification test includes characterization at different stresses (for instance electrical and thermal for electronic components), environmental tests, reliability tests, and failure analysis. After some considerations about selection criteria for electronic components (Section 3.1), this chapter deals with qualification tests for complex integrated circuits (Section 3.2) and electronic assemblies (Section 3.4), and discusses basic aspects of failure modes, mechanisms, and analysis of electronic components (Section 3.3). Procedures given in this chapter can be extended to nonelectronic components and materials as well. Reliability related basic technological properties of electronic components are summarized in Appendix A10. Statistical tests are in Chapter 7, test and screening strategies in Chapter 8, design guidelines in Chapter 5.
3.1 Basic Selection Criteria for Electronic Components

As given in Section 2.2 (Eq. (2.18)), the failure rate of equipment and systems without redundancy is the sum of the failure rates of their elements. Thus, for large equipment or systems without redundancy, high reliability can only be achieved by selecting components and materials with sufficiently low failure rates. Useful information for such a selection is:

1. Intended application, in particular required function, environmental conditions, as well as reliability and safety targets.
2. Specific properties of the component or material considered (technological limits, useful life, long-term behavior of relevant parameters, etc.).
3. Possibility for accelerated tests.
4. Results of qualification tests on similar components or materials.
5. Influence of screening, experience from field operation.
6. Influence of derating.
7. Potential design problems (sensitivity of performance parameters, interface problems, EMC, etc.).
8. Limitations due to standardization or logistic aspects.
9. Potential production problems (assembling, testing, handling, storing, etc.).
10. Purchasing considerations (cost, delivery time, second sources, long-term availability, quality level).

As many of the above requirements are conflicting, component selection often results in a compromise. The following is a brief discussion of the most important aspects in selecting electronic components.
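The additive rule recalled at the beginning of this section (failure rate of a structure without redundancy equals the sum of the element failure rates) can be illustrated with a minimal Python sketch. The parts list and all failure rate values are hypothetical and serve only to show the arithmetic; the conversion to a mean time to failure assumes constant failure rates.

```python
import math

def series_failure_rate(failure_rates_per_hour):
    """Failure rate (1/h) of a structure without redundancy:
    the sum of the failure rates of its elements."""
    return sum(failure_rates_per_hour)

# Hypothetical parts list (illustrative values, not taken from the text):
# 200 ICs at 1e-7 1/h each and 500 passive components at 2e-9 1/h each.
element_rates = [1e-7] * 200 + [2e-9] * 500

lambda_system = series_failure_rate(element_rates)

# With constant failure rates, the mean time to failure is 1 / lambda.
mttf_hours = 1.0 / lambda_system

print(f"system failure rate: {lambda_system:.2e} 1/h")
print(f"MTTF (constant failure rate assumed): {mttf_hours:.0f} h")
```

Even with individually low failure rates, the large number of elements dominates the result, which is the reason for the selection criteria listed above.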
3.1.1 Environment

Environmental conditions have a major impact on the functionality and reliability of electronic components, equipment, and systems. They are defined in international standards [3.8]. Such standards specify stress limits and test conditions, among others for heat (steady-state, rate of temperature change), cold, humidity, precipitation (rain, snow, hail), radiation (solar, heat, ionizing), salt, sand, dust, noise, vibration (sinusoidal, random), shock, fall, acceleration. Several combinations of stresses have also been defined, for instance temperature and humidity, temperature and vibration, humidity and vibration. Not all stress combinations are relevant, and when combining stresses or defining sequences of stresses, care must be taken to avoid the activation of failure mechanisms which would not appear in the field.

Environmental conditions at equipment and system level are given by the application. They can range from severe, as in aerospace and defense fields (with extreme low and high ambient temperatures, 100% relative humidity, rapid thermal changes, vibration, shock, and high electromagnetic interference), to favorable, as in computer rooms (with forced cooling at constant temperature and no mechanical stress). International standards can be used to fix representative environmental conditions for many applications, e.g. IEC 60721 [3.8]. Table 3.1 gives examples of environmental test conditions for electronic/electromechanical equipment and systems. The stress conditions given in Table 3.1 are indicative and have to be refined according to the specific application, to be cost and time effective.
Table 3.1 Examples for environmental test conditions for electronic/electromechanical equipment and systems (according to IEC 60068 [3.8])

Environmental condition | Stress profile: Procedure | Induced failures

Dry heat
  Stress profile: 48 or 72 h at 55, 70 or 85°C
  Procedure: El. test, warm up (2°C/min), hold (80% of test time), power-on (20% of test time), el. test, cool down (1°C/min), el. test between 2 and 16 h
  Induced failures: Physical: Oxidation, structural changes, softening, drying out, viscosity reduction, expansion. Electrical: Drift parameters, noise, insulating resistance, opens, shorts

Damp heat (cycles)
  Stress profile: 2, 6, 12 or 24 × 24 h cycles 25 to 55°C with rel. humidity over 90% at 55°C and 95% at 25°C
  Procedure: El. test, warm up (3 h), hold (9 h), cool down (3 h), hold (9 h), at the end dry with air and el. test between 6 and 16 h
  Induced failures: Physical: Corrosion, electrolysis, absorption, diffusion. Electrical: Drift parameters, insulating resistance, leakage currents, shorts

Low temperature
  Stress profile: 48 or 72 h at -25, -40 or -55°C
  Procedure: El. test, cool down (2°C/min), hold (80% of test time), power-on (20% of test time), el. test, warm up (1°C/min), el. test between 6 and 16 h
  Induced failures: Physical: Ice formation, structural changes, hardening, brittleness, increase in viscosity, contraction. Electrical: Drift parameters, opens

Vibrations (random)
  Stress profile: 30 min random acceleration with rectangular spectrum 20 to 2000 Hz and an acceleration spectral density of 0.03, 0.1, or 0.3 g_n²/Hz
  Procedure: El. test, stress, visual inspection, el. test

Vibrations (sinusoidal)
  Stress profile: 30 min at 2 g_n (0.15 mm), 5 g_n (0.35 mm), or 10 g_n (0.75 mm) at the resonant freq. and the same test duration for swept freq. (3 axes)
  Procedure: El. test, resonance determination, stress at the resonant frequencies, stress at swept freq. (10 to 500 Hz), visual inspection, el. test

  Induced failures (vibrations): Physical: Structural changes, fracture of fixings and housings, loosening of connections, fatigue. Electrical: Opens, shorts, contact problems, noise

Mechanical shock (impact)
  Stress profile: 1000, 2000 or 4000 impacts (half sine curve, 30 or 50 g_n peak value and 6 ms duration) in the main loading direction or distributed in the various impact directions
  Procedure: El. test, stress (1 to 3 impacts/s), inspection (shock absorber), visual inspection, el. test

Free fall
  Stress profile: 26 free falls from 50 or 100 cm drop height distributed over all surfaces, corners and edges, with or without transport packaging
  Procedure: El. test, fall onto a 5 cm thick wooden block (fir) on a 10 cm thick concrete base, visual insp., el. test

  Induced failures (shock, free fall): Physical: Structural changes, fracture of fixings and housings, loosening of connections, fatigue

g_n ≈ 10 m/s²; el. = electrical
At component level, the stresses caused by the equipment or system environmental conditions must be added to those produced by the component itself, due to its internal electrical or mechanical load. The sum of these stresses gives the operating conditions, necessary to determine the stress at component level and the corresponding failure rate. For instance, the ambient temperature inside an electronic assembly can be just a few °C higher than the temperature of the cooling medium if forced cooling is used, but can become more than 30°C higher than the ambient temperature if cooling is poor.
3.1.2 Performance Parameters

The required performance parameters at component level are defined by the intended application. Once these requirements are established, the necessary derating is determined taking into account the quantitative relationship between failure rate and stress factors (Sections 2.2.3, 2.2.4, 5.1.1). It must be noted that the use of better components does not necessarily imply better performance and/or reliability. For instance, a faster IC family can cause EMC problems, besides higher power consumption and chip temperature. In critical cases, component selection should not be based only on short data sheet information. Knowledge of parameter sensitivity can be mandatory for the application considered.
3.1.3 Technology

Technology is rapidly evolving for many electronic components, see Fig. 3.1 and Table A10.1 for some basic information. As each technology has its advantages and weaknesses with respect to performance parameters and/or reliability, it is necessary to have a set of rules which can help to select a technology. Such rules (design guidelines in Section 5.1) are evolving and have to be periodically refined.

Of particular importance for integrated circuits (ICs) is the selection of the packaging form and type. For the packaging form, distinction is made between inserted and surface-mounted devices. Inserted devices offer the advantage of easy handling during the manufacture of PCBs and also of lower sensitivity to manufacturing defects or deviations. However, the number of pins is limited. Surface mount devices (SMD) allow a large number of pins (more than 196 for PQFP and BGA), are cost and space saving, and have better electrical performance because of the shortened and symmetrical bond wires. However, compared to inserted devices, they have greater junction to ambient thermal resistance, are more stressed during soldering, and solder joints have a much lower mechanical strength (Section 3.4). Difficulties can be expected with pitch lower than 0.3 mm, in particular if thermal and/or mechanical stresses occur in the field (Sections 3.4 and 8.3).

Packaging types are subdivided into hermetic (ceramic, cerdip, metal can) and nonhermetic (plastic) packages. Hermetic packages should be preferred in applications with high humidity or in corrosive ambiance, in any case if moisture condensation occurs on the package surface. Compared to plastic packages they offer lower thermal resistance between chip and case (Table 5.2), but are more expensive and sensitive to damage (microcracks) caused by inappropriate handling (mechanical shocks during testing or PCB production). Plastic packages are inexpensive and less sensitive to thermal or mechanical damage, but are permeable to moisture (other problems related to epoxy, such as ionic contamination and low glass-transition temperature, have been solved). However, better epoxy quality as well as new passivation (glassivation) based on silicon nitride leads to a much better protection against corrosion than formerly (Section 3.2.3, point 8). If the results of qualification tests are good, the use of ICs in plastic packages can be allowed if one of the following conditions is satisfied:

Figure 3.1 Basic IC technology evolution (vertical axis: approximate sales volume [%])
1. Continuous operation, relative humidity < 70%, noncorrosive or marginally corrosive environment, junction temperature ≤ 100°C, and equipment useful life less than 10 years.
2. Intermittent operation, relative humidity < 60%, noncorrosive environment, no moisture condensation on the package, junction temperature < 100°C, and equipment useful life less than 10 years.

For ICs with silicon nitride passivation (glassivation), the conditions stated in Point 1 above should also apply for the case of intermittent operation.
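The two conditions above lend themselves to a simple screening rule. The following Python sketch encodes them as stated; the class name, field names, and the three-level encoding of corrosiveness are hypothetical choices made for illustration, and a positive result presupposes good qualification test results, as required in the text.

```python
from dataclasses import dataclass

@dataclass
class Application:
    """Hypothetical description of an application (names are illustrative)."""
    continuous_operation: bool
    relative_humidity_pct: float
    corrosiveness: str            # "none", "marginal", or "corrosive"
    condensation_on_package: bool
    junction_temp_c: float
    useful_life_years: float
    nitride_passivation: bool = False
    qualification_ok: bool = True

def plastic_package_allowed(app: Application) -> bool:
    """Apply conditions 1 and 2 of Section 3.1.3 as quoted in the text."""
    if not app.qualification_ok:
        return False
    point_1 = (app.relative_humidity_pct < 70
               and app.corrosiveness in ("none", "marginal")
               and app.junction_temp_c <= 100
               and app.useful_life_years < 10)
    point_2 = (app.relative_humidity_pct < 60
               and app.corrosiveness == "none"
               and not app.condensation_on_package
               and app.junction_temp_c < 100
               and app.useful_life_years < 10)
    if app.continuous_operation:
        return point_1
    # Intermittent operation: point 2 applies; with silicon nitride
    # passivation the conditions of point 1 are accepted as well.
    return point_2 or (app.nitride_passivation and point_1)

office_pc = Application(True, 50, "none", False, 85, 8)
outdoor_unit = Application(False, 65, "none", False, 90, 8)
outdoor_unit_si3n4 = Application(False, 65, "none", False, 90, 8,
                                 nitride_passivation=True)
print(plastic_package_allowed(office_pc))
print(plastic_package_allowed(outdoor_unit))
print(plastic_package_allowed(outdoor_unit_si3n4))
```

The three example applications differ only in operating mode, humidity, and passivation, which makes the effect of the silicon nitride clause visible.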
3.1.4 Manufacturing Quality

The quality of manufacture has a great influence on electronic component reliability. However, information about global defective probabilities (fraction of defective items) or agreed AQL values (even zero defects) is often not sufficient to monitor the reliability level (AQL is nothing more than an agreed upper limit of the defective probability, generally at a producer risk α = 10%, see Section 7.1.3). Information about changes in the defective probability and the results of the corresponding fault analysis are important. For this, a direct feedback to the component manufacturer is generally more useful than an agreement on an AQL value.
3.1.5 Long-Term Behavior of Performance Parameters

The long-term stability of performance parameters is an important selection criterion for electronic components, allowing differentiation between good and poor manufacturers (Fig. 3.2). Verification of this behavior is generally undertaken with accelerated reliability tests (trends are often enough for many practical applications).
3.1.6 Reliability

The reliability of an electronic component can often be specified by its failure rate λ. Failure rate figures obtained from field data are valid if intrinsic failures can be separated from extrinsic ones and reliable data/information are available. Those figures given by component manufacturers are useful if calculated with appropriate values for the (global) activation energy (for instance, 0.4 to 0.6 eV for ICs) and confidence level (> 60% two sided or ≥ 80% one sided, see Section 7.1.1). Moreover, besides the numerical value of λ, the influence of the stress factor (derating) S is important as a selection criterion (Eq. (2.1), Table 5.1).
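To see why the choice of activation energy matters when failure rates are extrapolated from accelerated tests, the following Python sketch evaluates the standard Arrhenius acceleration factor for the 0.4 to 0.6 eV range quoted above. The Arrhenius form is the usual model for temperature acceleration; the use temperature of 55°C and the stress temperature of 125°C are assumptions chosen for illustration only.

```python
import math

K_BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K

def arrhenius_acceleration(ea_ev, t_use_c, t_stress_c):
    """Arrhenius acceleration factor between a use temperature and a
    stress temperature (both given in degrees Celsius)."""
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return math.exp(ea_ev / K_BOLTZMANN_EV_PER_K
                    * (1.0 / t_use_k - 1.0 / t_stress_k))

# Assumed temperatures (illustrative): 55 C in use, 125 C under stress.
for ea in (0.4, 0.5, 0.6):
    factor = arrhenius_acceleration(ea, 55.0, 125.0)
    print(f"Ea = {ea:.1f} eV -> acceleration factor ~ {factor:.0f}")
```

Under these assumptions the acceleration factor changes by more than a factor of 3 between 0.4 and 0.6 eV, so a failure rate figure is only meaningful together with the activation energy used to derive it.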
Figure 3.2 Long-term behavior of performance parameters (vertical axis: performance parameter [%], reference level 100%; curves labeled fair, unstable, and bad)
3.2 Qualification Tests for Complex Electronic Components
The purpose of a qualification test is to verify the suitability of a given item (material, component, assembly, equipment, system) for a stated application. Qualification tests are often a part of a release procedure, for instance prototype release for a manufacturer and release for acceptance in a preferred list (qualified part list) for a user. Such a test is generally necessary for new technologies or after important redesigns or production process changes. Additionally, periodic requalification of critical parameters is often necessary to monitor quality and reliability. Electronic component qualification tests cover characterization, environmental and special tests, as well as reliability tests. They must be supported by intensive failure (fault) analysis to investigate relevant failure mechanisms (and fault causes). For a user, such a qualification test must consider:

1. Range of validity, narrow enough to be representative, but sufficiently large to cover the company's needs and to repay test cost.
2. Characterization, to investigate the electrical performance parameters.
3. Environmental and special tests, to check technology limits.
4. Reliability tests, to gain information on the failure rate.
5. Failure analysis, to detect failure causes and investigate failure mechanisms.
6. Supply conditions, to define cost, delivery schedules, second sources, etc.
7. Final report and feedback to the manufacturer.

The extent of the above steps depends on the importance of the component being considered, the effect (consequence) of its failure in an equipment or system, and the experience previously gained with similar components and with the same manufacturer. National and international activities are moving toward agreements which should make a qualification test by the user unnecessary for many components [3.8, 3.18]. Procedures for environmental tests are often defined in standards [3.8, 3.11].
A comprehensive qualification test procedure for ICs in plastic packages is given in Fig. 3.3. One recognizes the major steps (characterization, environmental and special tests, reliability tests, and failure analysis) of the above list. Environmental tests cover the thermal, climatic, and mechanical stresses expected in the application under consideration. The number of devices required for the reliability tests should be chosen so that 3 to 6 failures can be expected during burn-in. The procedure of Fig. 3.3 has been applied extensively (with device-specific aspects like data retention and programming cycles for nonvolatile memories, or modifications because of ceramic packages) to 12 memories, each with 2 to 4 manufacturers, for comparative investigations [3.2 (1993), 3.6, 3.16]. The cost for a qualification test based on Fig. 3.3 for 2 manufacturers (comparative studies) can exceed US$50,000.
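The sizing rule for the reliability test (3 to 6 expected failures during burn-in) can be sketched as follows. The sketch assumes a constant failure rate under burn-in conditions and a number of failures that is small compared with the sample size; the accelerated failure rate of 2·10⁻⁵ h⁻¹ and the 2000 h duration are used here as illustrative inputs, and the function names are hypothetical.

```python
import math

def expected_failures(n_devices, accelerated_failure_rate_per_h, test_hours):
    """Expected number of failures for a constant failure rate,
    neglecting the reduction of the sample by failed devices."""
    return n_devices * accelerated_failure_rate_per_h * test_hours

def devices_needed(target_failures, accelerated_failure_rate_per_h, test_hours):
    """Smallest sample size giving at least the target expected failures."""
    exact = target_failures / (accelerated_failure_rate_per_h * test_hours)
    return math.ceil(exact - 1e-9)  # guard against floating-point round-off

# Illustrative inputs: failure rate under burn-in conditions and test time.
rate_per_h = 2e-5
hours = 2000

print(expected_failures(150, rate_per_h, hours))
print(devices_needed(3, rate_per_h, hours), "to",
      devices_needed(6, rate_per_h, hours), "devices")
```

With these inputs a sample between 75 and 150 devices yields 3 to 6 expected failures; a lower accelerated failure rate would require a proportionally larger sample or a longer test.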
3.2.1 Electrical Test of Complex ICs

Electrical test of VLSI ICs is performed according to the following three steps:

1. Continuity test.
2. Test of DC parameters.
3. Functional and dynamic test (AC).
The continuity test checks whether every pin is connected to the chip. It consists in forcing a prescribed current (100 µA) into one pin after another (with all other pins grounded) and measuring the resulting voltage. For inputs with protection diodes and for normal outputs this voltage should lie between -0.1 and -1.5 V.

Verification of DC parameters is simple. It is performed according to the manufacturer's specifications without restrictions (disregarding very low input currents). For this purpose a precision measurement unit (PMU) is used to force a current and measure a voltage (V_OH, V_OL, etc.) or to force a voltage and measure a current (I_IH, I_IL, etc.). Before each step, the IC inputs and outputs are brought to the logical state necessary for the measurement.

The functional test is performed together with the verification of the dynamic parameters, as shown in Fig. 3.4. The generator in Fig. 3.4 delivers one row after another of the truth table which has to be verified, with a frequency f_0. For a 40-pin IC, these are 40-bit words. Of these binary words, called test vectors, the inputs are applied to the device under test (DUT) and the expected outputs to a logical comparator. The actual outputs from the DUT and the expected outputs are compared at a time point selected with high accuracy by a strobe. Modern VLSI automatic test equipment (ATE) for digital ICs has test frequencies f_0 > 600 MHz and an overall precision better than 200 ps (resolution < 30 ps). In a VLSI ATE not only the strobe but also other pulses can be varied over a wide range. The dynamic parameters can be verified in this way. However, the direct measurement of a time delay or of a rise time is in general time-consuming.

The main problem with a functional test is that it is not possible to verify all the states and state sequences of a VLSI IC. To see this, consider for instance that for an n × 1 cell memory there are $2^n$ states and $n!$ possible address sequences; the corresponding truth table would contain $2^n \cdot n!$ rows, giving more than $10^{100}$ for $n = 64$.

Figure 3.4 Principle of functional and AC testing for LSI and VLSI ICs (test vector applied to the DUT; expected output and actual output fed to a comparator delivering the result; strobe delayed by the specified propagation time)

Figure 3.3 Example for a comprehensive qualification test procedure for complex ICs in plastic (Pl) packages (industrial application, normal environmental conditions, 3 to 6 expected failures during the reliability test ($A\lambda \approx 2 \cdot 10^{-5}\,\mathrm{h}^{-1}$ in this example), RH = relative humidity); * 150°C by epoxy resin, 175°C by silicone resin; ** 1000 h by Si3N4 passivation

  Characterization (20 ICs)
    - DC characterization (histograms at -55, 0, …)
    - AC characterization (histograms and shmoo plots at -55, 0, 25, 70 and 125°C)
  Environmental and Special Tests (66 ICs, in groups of 36 ICs and 30 ICs)
    - Internal visual inspection (reference ICs)
    - Passivation (2 ICs)
    - High temperature storage (18 ICs; 168 h at 150°C)*, electr. test at 0, 16, 24 and 168 h at 70°C
    - Thermal cycles (2000 × -65/+150°C)*, electr. test at 0, 1000, 2000 cycles at 70°C
    - Humidity test (120°C, 85% RH, V_CC = 5.5 V), electr. test at 0, 96, 192, 408 h** at 70°C, failure analysis at 192, 408 h** (recovery 1-2 h, electr. test within 8 h after recovery)
    - Humidity test (85°C, 85% RH, V_CC = 7 V), electr. test at 0, 500, 1000, 2000 h at 70°C, failure analysis at 500, 1000, 2000 h (recovery 1-2 h, electr. test within 8 h after recovery)
    - Electrostatic discharge (ESD) at 500, 1000, 2000 V, until V_ESD and at V_ESD - 250 V (HBM), el. test before and after stress
    - Technological investigations: latch-up (for CMOS), hot carriers, dielectric breakdown, electromigration, soft errors
  Reliability Tests (154 ICs)
    - Screening (e.g. MIL-STD-883 class B without internal visual inspection)
    - 150 ICs: 2000 h burn-in at 125°C, electr. test at 0, 16, 64, 250, 1000 and 2000 h at 70°C, failure analysis at 16, 64, 250, 1000 and …
  Failure Analysis
  Final Report

The procedure used in practical applications takes into account one or more of the following:

- partitioning the device into modules and testing each of them separately,
- finding out regularities in the truth table or given by technological properties,
- limiting the test to the part of the truth table which is important for the application under consideration.
The above limitations raise the question of test coverage, i.e. the percentage of faults which are detected by the test. A precise answer to this question can only be given in some particular cases, because information about the faults which actually appear in a given IC is often lacking. Fault models, such as stuck-at-zero, stuck-at-one, or bridging, are useful for PCB testing, but generally of limited utility for a test engineer at the component level. For packaged VLSI ICs, the electrical test should be performed at 70°C or at the highest specified operating temperature.
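The size of the exhaustive truth table mentioned in this section (2^n states times n! address sequences for an n × 1 cell memory) can be checked numerically. The short Python sketch below uses exact integer arithmetic; the function names are illustrative.

```python
import math

def truth_table_rows(n_cells):
    """Number of rows 2^n * n! of the exhaustive truth table for an
    n x 1 cell memory (states times address sequences)."""
    return 2 ** n_cells * math.factorial(n_cells)

def truth_table_rows_log10(n_cells):
    """Decimal order of magnitude of the number of rows."""
    return math.log10(truth_table_rows(n_cells))

for n in (8, 16, 64):
    print(f"n = {n:2d}: about 10^{truth_table_rows_log10(n):.0f} rows")
```

Already for n = 64 the number of rows exceeds 10^100, which is why exhaustive functional testing is not feasible and the partitioning and restriction strategies listed above are used instead.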
3.2.2 Characterization of Complex ICs

Characterization is a parametric, experimental analysis of the electrical properties of a given IC. Its purpose is to investigate the influence of different operating conditions such as supply voltage, temperature, frequency, and logic levels on the IC's behavior and to deliver a cost-effective test program for incoming inspection. For this reason a characterization is performed at 3 to 5 different temperatures and with a large number of different patterns.
Figure 3.5 Example of test patterns for memories (see Table 3.2 for pattern sensitivity); patterns shown: Diagonal, March, Checkerboard, Surround, Butterfly, Galloping one
Table 3.2 Suitability of various test patterns for detecting faults in SRAMs, and approximate test times for a 100 ns 128K × 8 SRAM (tests on a Sentry S50, scrambling table with IDS5000 EBT)

                 Functional          Dyn. parameters     Number of          Approx. test time [s]
Test pattern     D,H,S,O   C*        A, RA     C**       test steps         bit addr.    word addr.
Checkerboard     fair      poor      -         -         4n                 …            …
March            good      poor      poor      -         5n                 …            …
Diagonal         good      fair      poor      poor      10n                1            0.13
Surround         good      good      fair      fair      26n - 16√n         2.7          0.34
Butterfly        good      good      good      fair      8n^(3/2) + 2n      8·10^2       38
Galloping one    good      good      good      good      4n^2 + 6n          4·10^5       7·10^3

A = addressing, C = capacitive coupling, D = decoder, H = stuck at 0 or at 1, O = open, S = short, RA = read amplifier recovery time, * pattern dependent, ** pattern and level dependent
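The test times of Table 3.2 follow from the number of test steps. The Python sketch below reproduces the order of magnitude under two assumptions that are not stated explicitly in the table: each test step takes one 100 ns access cycle, and n is the number of addressed locations (2^20 for bit addressing and 2^17 for word addressing of a 128K × 8 SRAM). Function and variable names are illustrative.

```python
import math

CYCLE_TIME_S = 100e-9  # assumed: one test step per 100 ns access cycle

# Number of test steps as a function of n, as listed in Table 3.2.
TEST_STEPS = {
    "Checkerboard": lambda n: 4 * n,
    "March": lambda n: 5 * n,
    "Diagonal": lambda n: 10 * n,
    "Surround": lambda n: 26 * n - 16 * math.sqrt(n),
    "Butterfly": lambda n: 8 * n ** 1.5 + 2 * n,
    "Galloping one": lambda n: 4 * n ** 2 + 6 * n,
}

def pattern_time_s(pattern, n_locations, cycle_time_s=CYCLE_TIME_S):
    """Approximate duration of one pass of the pattern, in seconds."""
    return TEST_STEPS[pattern](n_locations) * cycle_time_s

N_BIT = 2 ** 20    # 128K x 8 = 1M cells, bit addressing (assumption)
N_WORD = 2 ** 17   # 128K words, word addressing (assumption)

for name in TEST_STEPS:
    print(f"{name:14s} bit: {pattern_time_s(name, N_BIT):10.3g} s   "
          f"word: {pattern_time_s(name, N_WORD):10.3g} s")
```

The quadratic growth of the galloping pattern explains why it is practical only for small memories or for selected address ranges, while linear patterns remain usable during a characterization with several thousand repetitions.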
Referring to the functional and AC measurements, Figure 3.5 shows some basic patterns for memories. These patterns are generally performed twice, direct and inverse. For the patterns of Fig. 3.5, Table 3.2 gives a qualitative indication of the corresponding pattern sensitivity for static random access memories (SRAMs), and the approximate test time for a 128K × 8 SRAM. Quantitative evaluation of pattern sensitivity or of test coverage is in general seldom possible, because of the limited validity of the fault models available (Sections 4.2.1 and 5.2.2). As shown in Table 3.2, test time strongly depends on the pattern selected. As test times greater than 10 s per pattern are long even in the context of a characterization (the same pattern will be repeated several thousand times, see e.g. Fig. 3.6), development of efficient test patterns is mandatory [3.2 (1989), 3.6, 3.16, 3.19]. For such investigations, knowledge of the relationship between address and physical location (scrambling) of the corresponding cell on the chip is important. If design information is not available, an electron beam tester (EBT) can be used to establish the scrambling table.

An important evaluation tool during a characterization of complex ICs is the shmoo plot. A shmoo plot is the representation in an x/y-diagram of the operating region of an IC as a function of two parameters. As an example, Fig. 3.6 gives the shmoo plots for t_A versus V_CC of a 128K × 8 SRAM for two patterns and two ambient temperatures [3.6]. For Fig. 3.6, the test pattern has been performed about 4000 times (2 × 29 × 61), each time with a different combination of V_CC and t_A. If no fault is detected, an × or an o is plotted (defective cells are generally retested once, to confirm the fault). As shown in Fig. 3.6, a small (probably capacitive) coupling between nearby cells exists for this device, as the butterfly pattern is more sensitive than the diagonal pattern to this kind of fault. Statistical evaluation of shmoo plots is often done with composite shmoo plots in which each record is labeled in 10% steps.
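The construction of a shmoo plot can be sketched in a few lines of Python: a pass/fail test is repeated for every combination of two parameters and a mark is plotted where the device passes. The pass criterion used below is a hypothetical toy model standing in for a real functional test on an ATE, and the grid values are illustrative.

```python
def shmoo_rows(passes, vcc_values, t_a_values):
    """Return one text row per V_CC value (highest first);
    'x' marks a passing combination, '.' a failing one."""
    rows = []
    for vcc in sorted(vcc_values, reverse=True):
        marks = "".join("x" if passes(vcc, t_a) else "." for t_a in t_a_values)
        rows.append(f"{vcc:4.2f} V  {marks}")
    return rows

def toy_device_passes(vcc, t_a_ns):
    """Hypothetical pass criterion, NOT measured data: the minimum usable
    access time grows linearly as the supply voltage drops below 5.5 V."""
    return t_a_ns >= 70.0 + 20.0 * (5.5 - vcc)

vcc_grid = [4.0 + 0.1 * i for i in range(16)]   # 4.0 V to 5.5 V
t_a_grid = [60 + 2 * i for i in range(31)]      # 60 ns to 120 ns

rows = shmoo_rows(toy_device_passes, vcc_grid, t_a_grid)
for row in rows:
    print(row)
```

In a real characterization the function `toy_device_passes` would be replaced by a run of the selected test pattern at the given supply voltage and access time, repeated for each pattern and temperature of interest.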
Table 3.3 DC parameters for a 40 pin CMOS ASIC specially developed for high noise immunity and with Schmitt-trigger inputs (20 ICs)

max (V)    min     mean    max
0.52       2.65    2.76    2.85
0.44       3.19    3.33    3.44
0.44       3.89    3.97    4.09
0.60       2.70    2.75    2.85
0.52       3.19    3.32    3.44
0.48       3.79    3.93    4.04
From the above considerations one recognizes that in general only a small part of the possible states and state sequences can be tested. The definition of appropriate test patterns must thus pay attention to the specific device, its technology and regularities in the truth table, as well as to information about its application and experience with similar devices [3.2 (1989), 3.6]. A close cooperation between test engineer and user, and also if possible with the device designer and manufacturer, can help to reduce the amount of testing. As stated in Section 3.2.1, measurement of DC parameters presents no difficulties. As an example, Table 3.3 gives some results for an application-specific CMOS IC (ASIC) specially developed for high noise immunity.
3.2.3 Environmental and Special Tests of Complex ICs

The aim of environmental and special tests is to submit a given IC to stresses which can be more severe than those encountered in field operation, in order to investigate technological limits and failure mechanisms. Such tests are often destructive. A failure analysis after each stress is important to evaluate failure mechanisms and to detect degradation (Section 3.3). Kind and extent of environmental and special tests depend on the intended application (G_F for Fig. 3.3) and on specific characteristics of the component considered. The following is a description of the environmental and special tests given in Fig. 3.3 (considerations on production-related potential reliability problems are in Sections 3.3 & 3.4, see also Figs. 3.7, 3.9, 3.10):
Figure 3.6 Shmoo plots of a 100 ns 128K × 8 SRAM for test patterns a) Diagonal and b) Butterfly at two ambient temperatures, 0°C (o) and 70°C (×) (vertical axis: V_CC)
1. Internal Visual Inspection: Two ICs are inspected and then kept as a reference for comparative investigation (check for damage after stresses). Before opening (using wet chemical or plasma etching), the ICs are X-rayed to locate the chip and to detect irregularities (package, bonding, die attach, etc.) or impurities. After opening, inspection is made with optical microscopes (conventional × 1,000 and/or stereo × 100). Improper placement of bonds, excessive height and looping of the bonding wires, contamination, etching, or metallization defects can be seen. Many of these deficiencies often have only a marginal effect on reliability. Figure 3.7a shows a limiting case (mask misalignment). Figure 3.7b shows voids in the metallization of a 1M DRAM.
2. Passivation Test: Passivation (glassivation) is the protective coating, usually silicon dioxide (PSG) and/or silicon nitride, placed on the entire (die) surface. For ICs in plastic packages it should ideally be free from cracks and pinholes. To check this, the chip is immersed for about 5 min in a 50°C warm mixture of nitric and phosphoric acid and then inspected with an optical microscope (e.g. as in MIL-STD-883 method 2021 [3.11]). Cracks occur in a silicon dioxide passivation if the content of phosphorus is < 2%. However, more than 4% phosphorus activates the formation of phosphoric acid. As a solution, silicon nitride passivation (often together with silicon dioxide in separate layers) has been introduced. Such a passivation shows much more resistance to the penetration of moisture (see humidity tests in Point 8 below) and of ionic contamination.
3. Solderability: Solderability of tinned pins should no longer constitute a problem today, except after a very long storage time in a non-protected ambient or after a long burn-in or high-temperature storage. However, problems can arise with gold or silver plated pins, see Section 5.1.5.4. The solderability test is performed according to established standards (e.g. IEC 60068-2 or MIL-STD-883 [3.8, 3.11]) after conditioning, generally using the solder bath or the meniscograph method.

4. Electrostatic Discharge (ESD): Electrostatic discharges during handling, assembling, and testing of electronic components and populated printed circuit boards (PCBs) can destroy or damage sensitive components, particularly semiconductor devices. All IC families and many discrete electronic components are sensitive to ESD. Integrated circuits generally have protection circuits, passive and more recently active (better protection by a factor ≥ 2). To determine ESD immunity, i.e. the voltage value at which damage occurs, different pulse shapes (models) and procedures to perform the test have been proposed. For semiconductor devices, the human body model (HBM) and the charged device model (CDM) are the most widely used. The CDM seems to apply better than the HBM in reproducing some of the damage observed in field applications (see Section 5.1.4 for further details). Based on the experience gained in qualifying 12 memory types according to Fig. 3.3 [3.2 (1993), 3.6], the following procedure can be suggested for the HBM:

   1. 9 ICs divided into 3 equal groups are tested at 500, 1000, and 2000 V, respectively. Taking note of the results obtained during these preliminary tests, 3 new ICs are stressed with steps of 250 V up to the voltage at which damage occurs (V_ESD). 3 further ICs are then tested at V_ESD - 250 V to confirm that no damage occurs.
   2. The test consists of 3 positive and 3 negative pulses applied to each pin within 30 s. Pulses are generated by discharging a 100 pF capacitor through a 1.5 kΩ resistor placed in series with the capacitor (HBM), wiring inductance < 10 µH. Pulses are applied between pin and ground, unused pins open.
   3. Before and after each test, leakage currents (when possible with the limits ±1 µA for open and ±200 nA for short) and electrical characteristics are measured (electrical test as after any other environmental test).

   Experience shows that an electrostatic discharge often occurs between 1000 and 4000 V. The model parameters of 100 pF and 1.5 kΩ for the HBM are average values measured with humans (80 to 500 pF, 50 to 5000 Ω, 2 kV on synthetic floor and 0.8 kV on an antistatic floor with a relative humidity of about 50%). A new model for latent damages caused by ESD has been developed in [3.61 (1995)]. Protection against ESD is discussed in Sections 5.1.4 and 5.1.5.4, see also Section 3.3.4.
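To get a feeling for the stress applied by an HBM pulse, the following Python sketch evaluates an idealized RC discharge with the model parameters quoted above (100 pF, 1.5 kΩ). The wiring inductance and the impedance of the device under test are neglected, which is an assumption of this sketch and overestimates the peak current of a real pulse; the function name is illustrative.

```python
import math

def hbm_pulse(v_charge, capacitance_f=100e-12, resistance_ohm=1500.0):
    """Idealized human body model discharge into a short circuit
    (wiring inductance and device impedance neglected)."""
    return {
        "peak_current_a": v_charge / resistance_ohm,
        "time_constant_s": resistance_ohm * capacitance_f,
        "stored_energy_j": 0.5 * capacitance_f * v_charge ** 2,
    }

# Test voltages of the preliminary tests described above.
for voltage in (500, 1000, 2000):
    pulse = hbm_pulse(voltage)
    print(f"{voltage:5d} V: peak {pulse['peak_current_a']:.2f} A, "
          f"tau {pulse['time_constant_s'] * 1e9:.0f} ns, "
          f"energy {pulse['stored_energy_j'] * 1e6:.1f} uJ")
```

The stored energy grows with the square of the charging voltage, so the step from 1000 to 2000 V quadruples the energy available to damage the protection structures.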
Figure 3.7 Failure analyses on ICs (Rel. Laboratory at the ETH Zurich); see also Figs. 3.9 & 3.10
a) Alignment error at a contact window (SEM, × 10,000)
b) Opens in the metallization of a 1M DRAM bit line, due to particles present during the photolithographic process (SEM, × 2,500)
c) Cross section through two trench-capacitor cells of a 4M DRAM (SEM, × 5,000)
d) Silver dendrites near an Au bond ball (SEM, × 800)
e) Electromigration in a 16K Schottky TTL PROM after 7 years field operation (SEM, × 500)
f) Bond wire damage (delamination) in a plastic-packaged device after 500 × -50/+150°C thermal cycles (SEM, × 500)
5. Technological Characterization: Technological investigations are performed to check technological and process parameters with respect to adequacy and maturity. The extent of these investigations can range from a simple check (Fig. 3.7c) to a comprehensive analysis prompted by detected weaknesses. Refinement of techniques and evaluation methods for technological characterization is in progress, see e.g. [3.31 - 3.65, 3.71 - 3.89]. The following is a simplified, short description of some important technological characterization methods:

Latch-up is a condition in which an IC latches into a nonoperative state drawing an excessive current (often a short between power supply and ground), and can only be returned to an operating condition through removal and reapplication of the power supply. It is typical for CMOS structures, but can also occur in other technologies where a PNPN structure appears. Latch-up is primarily induced by voltage overstresses (on signal or power supply lines) or by radiation. Modern devices often have a relatively high latch-up immunity (up to 200 mA injection current). A verification of latch-up sensitivity can become necessary for some special devices (ASICs for instance). Latch-up tests simulate voltage overstresses on signal and power supply lines as well as power-on / power-off sequences.

Hot Carriers arise in micron and submicron MOSFETs as a consequence of high electric fields ($10^4$ to $10^5$ V/cm) in transistor channels. Carriers may gain sufficient kinetic energy (some eV, compared to 0.02 eV in thermal equilibrium) to surmount the potential barrier at the oxide interface. The injection of carriers into the gate oxide is generally followed by the creation of electron-hole pairs and causes an increasing degradation of the transistor parameters, in particular an increase with time of the threshold voltage $V_{TH}$, which can be measured in NMOS transistors. Effects on VLSI and ULSI ICs are an increase of switching times (access times in RAMs for instance), possible data retention problems (soft writing in EPROMs), and in general an increase of noise. Degradation through hot carriers is accelerated with increasing drain voltage and decreasing temperature (negative activation energy of about -0.03 eV). The test is generally performed under dynamic conditions, at high power supply voltages (7 to 9 V) and at low temperatures (-70 to -20°C).

Time-Dependent Dielectric Breakdown (TDDB) occurs in very thin gate oxide layers (< 20 nm) as a consequence of extremely high electric fields ($10^7$ - $10^8$ V/cm). The mechanism is described by the thermochemical (E) model up to about $10^7$ V/cm and by the carrier injection (1/E) model up to about $2 \cdot 10^7$ V/cm. An approach to unify both models has been proposed in [3.48 (1999)]. As soon as the critical threshold is reached, breakdown takes place, often suddenly. The effects of gate oxide breakdowns are increased leakage currents or shorts between gate and substrate. The development in time of this failure mechanism depends on process parameters and oxide defects. Particularly sensitive are memories > 4 M. An Arrhenius model can be used for the temperature. Time-dependent dielectric breakdown tests are generally performed on special test structures (often capacitors).

Electromigration is the migration of metal atoms, and also of Si at the Al/Si interface, as a result of very high current densities, see Fig. 3.7e for an example of a 16K TTL PROM after 7 years of field operation. Earlier limited to ECL, electromigration also occurs today with other technologies (because of scaling). The median $t_{50}$ of the failure-free time as a function of the current density and temperature can be obtained from the empirical model given by Black [3.46], $t_{50} = B\, j^{-n} e^{E_a / kT}$, where $E_a$ = 0.55 eV for pure Al (0.75 eV for Al-Cu alloy), n = 2, and B is a process-dependent constant. Electromigration tests are generally performed at wafer level on test structures. Measures to avoid electromigration are optimization of grain structure (bamboo structures), use of Al-Si-Cu alloys for the metallization and of compressive passivation, as well as introduction of multilayer metallizations.

Soft errors can be caused by the process or chip design as well as by process deviations. Key parameters are MOSFET threshold voltages, oxide thickness, doping concentrations, and line resistance. If for instance the post-implant of a silicon layer has been improperly designed, its conductivity might become too low. In this case, the word lines of a DRAM could suffer from signal reductions, and soft errors could be observed on some cells at the end of the word line. As a further example, if logical circuits with different signal levels are unshielded and arranged close to the border of a cell array, stray coupling may destroy the information of cells located close to the circuit (chip design problem). Finally, process deviations can cause soft errors. For instance, signal levels can be degraded when metal lines are locally reduced to less than half of their width by the influence of dirt particles. The characterization of soft errors is difficult in general. At the chip level, an electron beam tester allows the measurement of signals within the chip circuitry. At the wafer level, single test structures located in the space between the chips (kerf) can be used to measure and characterize important parameters independently of the chip circuitry. These structures can usually be contacted by needles, so that a well equipped bench setup with high-resolution I-V and C-V measurement instrumentation is a suitable characterization tool.

Data Retention and Program/Erase Cycles are important for nonvolatile memories (EPROM, EEPROM, Flash). A test for data retention generally consists of storage (bake) at high temperature (2,000 h at 125°C for plastic packages and 500 h at 250°C for ceramic packages) with an electrical test at 70°C at 0, 250, 500, 1,000, and 2,000 h (often using a checkerboard pattern with measurement of $t_{AA}$ and of the margin voltage). Experimental investigation of EPROM data retention at temperatures higher than 250°C has shown a deviation from the charge loss predicted by the thermionic model [3.6, 3.36]. Typical values for program/erase cycles during a qualification test are 100 for EPROMs and 10,000 for EEPROMs and Flash memories.
6. High-Temperature Storage: The purpose of high-temperature storage is the stabilization of the thermodynamic equilibrium, and consequently of the IC's electrical parameters. Failure mechanisms related to surface problems (contamination, oxidation, contacts, charge induced failures) are activated. To perform the test, the ICs are placed on a metal tray (pins on the tray to avoid thermal voltage stresses) in an oven at 150°C for 200 h. Should solderability be a problem, a protective atmosphere (N2) can be used. Experience shows that for a mature technology (design and production processes), high-temperature storage produces only very few failures (see also Section 8.2.2).

7. Thermal Cycles: The purpose of thermal cycles is to test the IC's ability to withstand rapid temperature changes. This activates failure mechanisms related to mechanical stresses caused by mismatch in the expansion coefficients of the materials used, as well as wearout because of fatigue, see Fig. 3.7f for an example. Thermal cycles are generally performed from air to air in a two-chamber oven (transfer from one chamber to the other with a lift). To perform the test, the ICs are placed on a metal tray (pins on the tray to avoid thermal voltage stresses) and subjected to 2,000 thermal cycles from -65°C (+0, -10) to +150°C (+15, -0), transfer time ≤ 1 min, time to reach the specified temperature ≤ 15 min, dwell time at the temperature extremes ≥ 10 min. Should solderability be a problem, a protective atmosphere (N2) can be used. Experience shows that for a mature technology (design and production processes), failures should not appear before some thousand thermal cycles (lower figures for power devices).

8. Humidity or Damp Heat Test, 85/85 and Pressure Cooker: The aim of humidity tests is to investigate the influence of moisture on the chip surface, in particular corrosion. The following two procedures are often used:
(i) Atmospheric pressure, 85 ± 2°C and 85 ± 5% rel. humidity (85/85 test) for 168 to 5,000 h.
(ii) Pressurized steam, 110 ± 2°C or 120 ± 2°C or 130 ± 2°C and 85 ± 5% rel. humidity (pressure-cooker test or highly accelerated stress test (HAST)) for 24 to 408 h (1,000 h for silicon nitride passivation).
In both cases, a voltage bias is applied during exposure in such a way that power consumption is as low as possible, while the voltage is kept as high as possible (reverse bias with adjacent metallization lines alternately polarized high and low, e.g. 1 h on / 3 h off intermittently if power consumption is greater than 0.01 W). For a detailed procedure one may refer to IEC 60749 [3.8]. In the procedure of Fig. 3.3, both 85/85 and HAST tests are performed in order to correlate results and establish (empirically) a conversion factor. Of great importance for applications is the relation between the failure rates at elevated temperature and humidity (e.g. 85/85 or 120/85) and at field operating conditions (e.g. 40/60). A large number of models have been proposed in the literature to empirically fit the acceleration factor A associated with the 85/85 test
$$A = \frac{\text{mean time to failure at lower stress } (\theta_1 / RH_1)}{\text{mean time to failure at 85/85 } (\theta_2 / RH_2)}$$
The most important of these models are
In Eqs. (3.2) to (3.6), $E_a$ is the activation energy, k the Boltzmann constant ($8.6 \cdot 10^{-5}$ eV/K), θ the temperature in °C, T the absolute temperature (K), RH the relative humidity, and $C_1$ to $C_4$ are constants. Equations (3.2) to (3.6) are based on the Eyring model (Eq. (7.59)); the influence of the temperature and the humidity is multiplicative in Eqs. (3.2) to (3.5). Eq. (3.2) has the same structure as in the case of electromigration (Eq. (7.60)). In all models, the technological parameters (type, thickness, and quality of the passivation, kind of epoxy, type of metallization, etc.) appear indirectly in the activation energy $E_a$ or in the constants $C_1$ to $C_4$. Relationships for HAST are more empirical. From the above considerations, 85/85 and HAST tests can be used as accelerated tests to assess the effect of damp heat combined with bias on ICs by accepting a numerical uncertainty in calculating the acceleration factor. As a global value for the acceleration factor referred to operating field conditions of 40°C and 60% RH, one can assume for PSG a value between 100 and 150 for the 85/85 test and between 1,000 and 1,500 for the 120/85 test. To assure 10 years field operation at 40°C and 60% RH, PSG-ICs should thus pass without evident corrosion damage about 1,000 h at 85/85 or 100 h at 120/85. Practical results show that silicon-nitride glassivation offers a much greater resistance to moisture than PSG, by a factor up to 10 [3.6].

Also related to the effects of humidity is metal migration in the presence of reactive chemicals and voltage bias, leading to the formation of conductive paths (dendrites) between electrodes, see an example in Fig. 3.7d.

A further problem related to plastic packaged ICs is that of bonding a gold wire to an aluminum contact surface. Because of the different interdiffusion constants of gold and aluminum, an inhomogeneous intermetallic layer (Kirkendall voids) appears at high temperature and / or in presence of contaminants, considerably reducing the electrical and mechanical properties of the bond. Voids grow into the gold surface like a plague, from which the name purple plague derives. Purple plague was an important reliability problem in the sixties. It propagates exponentially at temperatures greater than about 180°C. Although almost generally solved (bond temperature, Al-alloy, metallization thickness, wire diameter, etc.), verification after high-temperature storage and thermal cycles is a part of a qualification test, especially for ASICs and devices in small-scale production.

Table 3.4 Indicative values for failure modes of electronic components (%)
[Table body only partly recoverable. Surviving rows: Resistors, variable (Cermet); Capacitors (foil, ceramic, Ta (solid)); Coils; Relays; Quartz crystals.]
Footnotes: * input and output half each; short to VCC or to GND half each; + no output; improper output; fail to off; # localized wearout; fail to trip / spurious trip = 3/2
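The test durations quoted in point 8 for 10 years of field operation can be checked with a short calculation: the required test time is the field operating time divided by the acceleration factor. The following is only a sketch under stated assumptions: continuous operation (8,760 h per year) is assumed, the function name is illustrative, and the acceleration factors are the global values quoted above for PSG.

```python
HOURS_PER_YEAR = 8760  # assumption: continuous operation, 365 days per year


def equivalent_test_hours(field_years: float, acceleration_factor: float) -> float:
    """Test duration (h) corresponding to field_years of field operation."""
    if acceleration_factor <= 0:
        raise ValueError("acceleration factor must be positive")
    return field_years * HOURS_PER_YEAR / acceleration_factor


if __name__ == "__main__":
    # Global acceleration factors quoted in the text for PSG passivation,
    # referred to field conditions of 40 degC and 60% RH.
    quoted_factors = {"85/85": (100, 150), "120/85": (1000, 1500)}
    for test_name, (a_low, a_high) in quoted_factors.items():
        longest = equivalent_test_hours(10, a_low)
        shortest = equivalent_test_hours(10, a_high)
        print(f"{test_name}: {shortest:.0f} to {longest:.0f} h for 10 years in the field")
```

The resulting ranges are of the same order as the rounded values of about 1,000 h at 85/85 and 100 h at 120/85 given in the text.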
3.2.4 Reliability Tests
The aim of a reliability test for electronic components is to obtain information about the failure rate, the long-term behavior of critical parameters, and the effectiveness of screening to be performed at the incoming inspection. The test consists in general of a dynamic burn-in with electrical measurements and failure analysis at appropriate time points (Fig. 3.3), also including some components which have not failed (to check for degradation). The number (n) of devices under test can be estimated from the predicted failure rate λ (Section 2.2.4) and the acceleration factor A (Eq. (7.56)) in order to expect 3 to 6 failures (k) during burn-in ($n = k / (\lambda A t)$). Half of the devices can be submitted to a screening (Section 8.2.2) in order to better isolate early failures (Fig. 3.3). Statistical data analyses are given in Section 7.2 and Appendix A8.
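The sample size estimate above can be illustrated numerically. The sketch below evaluates the relation between n, k, the predicted failure rate, the acceleration factor, and t, where t is taken here to be the burn-in duration in hours (an interpretation, since the text does not define it explicitly). The numerical values (100 FIT, A = 50, 2,000 h) are hypothetical and serve only as an example.

```python
import math


def devices_under_test(expected_failures: float, failure_rate_per_h: float,
                       acceleration_factor: float, test_hours: float) -> int:
    """Estimate n = k / (lambda * A * t), rounded up to a whole device."""
    denominator = failure_rate_per_h * acceleration_factor * test_hours
    if denominator <= 0:
        raise ValueError("failure rate, acceleration factor and time must be positive")
    n = expected_failures / denominator
    # Small tolerance so that floating-point noise does not add a device.
    return math.ceil(n - 1e-9)


if __name__ == "__main__":
    # Hypothetical example: 5 expected failures, 100 FIT predicted failure rate,
    # acceleration factor 50, 2,000 h dynamic burn-in.
    print(devices_under_test(5, 1e-7, 50, 2000))
```

A lower predicted failure rate or a shorter test increases the number of devices needed to observe the same number of failures, which is why the acceleration factor matters in planning such a test.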
3.3 Failure Modes, Failure Mechanisms, and Failure Analysis of Electronic Components
3.3.1 Failure Modes of Electronic Components
A failure mode is the symptom (local effect) through which a failure is observed. Typical failure modes are opens, shorts, drift, or functional faults for electronic components, and brittle rupture, creep, or cracking for mechanical components. Average values for the relative frequency of failure modes in electronic components are given in Table 3.4. The values given in Table 3.4 are indicative and have to be supplemented by application-specific results, as far as necessary. The different failure modes of hardware, often influenced also by the specific application, cause difficulties in investigating the effect (consequence) of a failure, and thus in the concrete implementation of redundancy (series if short, parallel if open). For critical situations it can become necessary to use quad redundancy (Section 2.3.6). Quad redundancy is the simplest fault tolerant structure which can accept at least one failure (short or open) of any one of the 4 elements involved in the redundancy.
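The statement on quad redundancy can be checked by enumeration. The sketch below assumes one common arrangement, two series branches of two elements each connected in parallel (without bridge); this arrangement, the three-state element model, and all names are assumptions made for illustration and are not taken from Section 2.3.6.

```python
from itertools import product

OK, OPEN, SHORT = "ok", "open", "short"


def branch_state(e1: str, e2: str) -> str:
    """State of two elements connected in series."""
    if OPEN in (e1, e2):
        return OPEN   # a single open element interrupts the branch
    if e1 == SHORT and e2 == SHORT:
        return SHORT  # the branch is shorted only if both elements are shorted
    return OK


def quad_state(e1: str, e2: str, e3: str, e4: str) -> str:
    """State of two series branches (e1-e2 and e3-e4) connected in parallel."""
    b1, b2 = branch_state(e1, e2), branch_state(e3, e4)
    if SHORT in (b1, b2):
        return SHORT  # one shorted branch shorts the whole structure
    if b1 == OPEN and b2 == OPEN:
        return OPEN   # the structure is open only if both branches are open
    return OK


def tolerated_patterns(n_failures: int) -> tuple[int, int]:
    """Return (tolerated, total) failure patterns with exactly n_failures failed elements."""
    tolerated = total = 0
    for states in product((OK, OPEN, SHORT), repeat=4):
        if sum(state != OK for state in states) != n_failures:
            continue
        total += 1
        tolerated += quad_state(*states) == OK
    return tolerated, total


if __name__ == "__main__":
    print("single failures (tolerated, total):", tolerated_patterns(1))
    print("double failures (tolerated, total):", tolerated_patterns(2))
```

Under these assumptions every single short or open is tolerated, while some double failures are not (for example one open element in each branch).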
3.3.2 Failure Mechanisms of Electronic Components
A failure mechanism is the physical, chemical, or other process which results in failure. A large number of failure mechanisms have been investigated in the literature, e.g. [3.31 - 3.65] & [3.71 - 3.89]. For some of them, appropriate physical explanations have been found. For others, the models are empirical and often of limited validity. Evaluation of models for failure mechanisms should be developed in two steps: (i) verify the physical validity of the model and (ii) give its analytical formulation with the appropriate set of parameters to fit the model to the data. In any case, experimental verification of the model should be performed with at least a second, independent experiment. The limits of the model should be clearly indicated. The two most important models used to describe failure mechanisms, the Arrhenius model and the Eyring model, are introduced in Section 7.4 with accelerated tests (Eqs. (7.56) - (7.60)). Models to describe the influence of temperature and humidity in damp heat tests have been given with Eqs. (3.2) - (3.6). A new model for latent damages caused by ESD is given in [3.61 (1995)]. Table 3.5 summarizes some important failure mechanisms for ICs, specifying influencing factors and the approximate distribution of the failure mechanisms for plastic-packaged ICs in industrial applications ($G_F$ in Table 2.3). The percentage of misuse and mishandling failures can vary over a large range (20 to 80%) depending on the design engineer using the device, the equipment manufacturer, and the end user. For ULSI-ICs one can expect that the percentage of failure mechanisms related to oxide breakdown and hot carriers will grow in the future.
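As an illustration of how such a model is used, Black's empirical model for electromigration from Section 3.2.3 (median failure-free time proportional to $j^{-n} e^{E_a / kT}$) yields an acceleration factor between test and use conditions in which the process-dependent constant B cancels. The sketch below is a minimal example; the function name, the default values (pure Al, n = 2, as quoted in the text) and the stress conditions in the usage part are hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K


def black_acceleration(j_use: float, j_test: float,
                       temp_use_c: float, temp_test_c: float,
                       ea_ev: float = 0.55, n: float = 2.0) -> float:
    """Ratio t50(use) / t50(test) for t50 = B * j**(-n) * exp(Ea / (k * T)).

    Current densities must share the same unit; temperatures are in degC.
    The constant B cancels in the ratio.
    """
    t_use_k = temp_use_c + 273.15
    t_test_k = temp_test_c + 273.15
    current_term = (j_test / j_use) ** n
    thermal_term = math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_use_k - 1.0 / t_test_k))
    return current_term * thermal_term


if __name__ == "__main__":
    # Hypothetical conditions: test at twice the use current density and 150 degC,
    # use at 55 degC, pure Al metallization.
    factor = black_acceleration(j_use=1.0, j_test=2.0, temp_use_c=55.0, temp_test_c=150.0)
    print(f"acceleration factor: {factor:.0f}")
```

With n = 2, doubling the current density alone shortens the median failure-free time by a factor of 4; the temperature term is the usual Arrhenius factor.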
3.3.3 Failure Analysis of Electronic Components
The aim of a failure analysis is to investigate the failure mechanisms and find out the possible failure causes. A procedure for failure analysis of ICs (from a user's point of view) is shown in Fig. 3.8. It is based on the following steps and can be terminated as soon as the necessary information has been obtained:
1. Failure detection and description: A careful description of the failure, as observed in situ, and of the surrounding circumstances (operating conditions at the failure occurrence time) is important. Also necessary is information on the IC itself (type, manufacturer, manufacturing data, etc.), on the electrical circuit in which it was used, on the operating time, and if possible on the tests to which the IC was submitted previous to the final use (evaluation of possible damage, e.g. ESD). In a few cases the failure analysis procedure can be terminated, if evident mishandling or misuse failure can be confirmed.

2. Nondestructive analysis: The nondestructive analysis begins with an external visual inspection (mechanical damage, cracks, corrosion, burns, overheating, etc.), followed by an X-ray inspection (evident internal fault or damage) and a careful electrical test (Section 3.2.1). For ICs in hermetic packages, it can also be necessary to perform a seal test and if possible a dew-point test. The result of the nondestructive analysis is a careful description of the external failure mode and first information about possible failure causes and mechanisms. For evident failure causes, the failure analysis can be terminated.

3. Semidestructive analysis: The semidestructive analysis begins by opening the package, mechanically for hermetic packages and with wet chemical (or plasma) etching for plastic ICs. A careful internal visual check is then performed with optical microscopes, conventional 1000× or stereo 100×. This evaluation includes opens, shorts, state of the passivation / glassivation, bonding, damage due to ESD, corrosion, cracks in the metallization, electromigration, particles, etc. If the IC is still operating (at least partially), other procedures can be used to localize the fault on the die more accurately. Among these are the electron beam tester (or other voltage contrast techniques), liquid crystals (LC), infrared thermography (IRT), emission microscopy (EMMI), or one of the methods to detect irregular recombination centers, like electron beam induced current (EBIC) or optical beam induced current (OBIC). For further investigations it is then necessary to use a scanning electron microscope (SEM). The result of the semidestructive analysis is a careful description of the internal failure mode and improved information about possible failure causes and failure mechanisms. In the case of evident failure causes, the failure analysis procedure can be terminated.

4. Destructive analysis: A destructive analysis is performed if the previous investigations yield unsatisfactory results and there is a realistic chance of success through further analyses. After removal of the passivation and other layers (as necessary), an inspection is carried out with a scanning electron microscope supported by a material investigation (e.g. EDX spectrometry). Analyses are then continued using methods of microanalysis (electron microprobe, ion probe, diffraction, etc.) and performing microsections. The destructive analysis is the last possibility to recognize the original failure cause and the failure mechanisms involved. However, it cannot guarantee success, even with skilled personnel and suitable analysis equipment.

5. Failure mechanism analysis: This step implies a correct interpretation of the results from steps 1 through 4. Additional investigations have to be made in some cases, but questions related to failure mechanisms can still remain open. In general, feedback to the manufacturer at this stage is mandatory.

6. Final report: All relevant results of steps 1 to 5 above and the agreed corrective actions must be included in a (short and clear) final report.

7. Corrective actions: Depending on the identified failure causes, appropriate corrective actions should be started. These have to be discussed with the IC manufacturer as well as with the equipment designer, manufacturer, or user, depending on the failure causes which have been identified.

Table 3.5 (failure mechanisms for ICs, see Section 3.3.2)

Bonding
- Purple plague. Short description: formation of an intermetallic layer at the interface between wire (Au) and metallization (Al), causing a brittle region (voids in Au due to diffusion) which can provoke bond lifting. Causes: different interdiffusion constants of Au and Al, bonding temperature, contamination, too thick metallization. Acceleration factors: temperature > 180°C ($E_a$ = 0.7 - 1.1 eV).
- Fatigue. Short description: mechanical fatigue of bonding wires or bonding pads because of thermomechanical stress (also because of vibrations at the resonance frequency for hermetically sealed devices). Causes: different expansion coefficients of the materials in contact (for hermetically sealed devices also wire resonance). Acceleration factors: thermal cycles with Δθ > 150°C (vibrations at res. freq. for hermetic dev.).

Surface
- Charge spreading (leakage currents, inversion). Short description: charge spread laterally from the metallization or along the isolation interface, resulting in an inversion layer outside the active region which can provide for instance a conduction path between two diffusion regions. Causes: contamination with Na+, K+, etc., too thin oxide layer (MOS), package material. Acceleration factors: E, $\theta_J$ ($E_a$ = 0.5 - 1.2 eV, up to 2 eV for linear ICs).

Metallization
- Corrosion. Short description: electrochemical or galvanic reaction in the presence of humidity and ionic contamination (P, Na, Cl, etc.), critical for PSG (SiO2) passivation with > 4% P (< 2% P gives cracks). Causes: humidity, voltage, contamination (Na+, Cl-, K+), cracks or pinholes in the passivation. Acceleration factors: RH, E, $\theta_J$ ($E_a$ = 0.5 - 0.7 eV).
- Electromigration. Short description: migration of metal atoms (also of Si at contacts) in the direction of the electron flow, creating voids or opens in the metallization. Causes (partly recoverable): temperature gradient, anomalies in structure. Acceleration factors (partly recoverable): 1 eV for large-grain Al.
- Metal migration. Short description: migration of metal atoms in the presence of reactive chemicals, water, and bias, leading to conductive paths (dendrites) between electrodes. Causes: humidity, voltage, migrating metals (Au, Ag, Pd, Cu, Pb, Sn), contaminant (encapsulant).

Oxide
- Time-dependent dielectric breakdown (TDDB). Short description: breakdown of thin oxide layers occurring suddenly when sufficient charge has been injected to trigger a runaway process. Causes: high voltages, thin oxides, oxide defects. Acceleration factors: E, $\theta_J$ ($E_a$ = 0.2 - 0.4 eV for oxide with defects, 0.5 - 0.6 eV for intrinsic oxide).
- Ion migration (parasitic transistors, inversion). Short description: carrier injection in the gate oxide because of E and $\theta_J$; creation of charges in the SiO2/Si interface. Causes: contamination with alkaline ions, pinholes, oxide or diffusion defects.

Others
- Intermetallic compound. Short description: formation of an intermetallic layer between metallization (Al) and substrate (Si). Causes: mask defects, overheating, pure Al.
- Hot carriers. Short description: injection of electrons because of high E. Causes: dimensions, diffusion profiles, E.
- α-particles. Short description: generation of electron-hole pairs by α-particles (DRAMs). Causes: package material, external radiation.
- Latch-up, etc. Short description: activation of PNPN paths. Causes: PNPN paths.
- Acceleration factors listed for this group: external radiation, voltage overstress.

Misuse / Mishandling
- Short description: electrical (ESD / EOS), thermal, mechanical, or climatic overstress. Causes: application, design, handling, test.

E = electric field, RH = relative humidity, j = current density, $\theta_J$ = junction temperature, passivation (= glassivation), % = indicative distribution in percent
[Flowchart]
1. Failure detection and description: component identification; reason / motivation for the analysis; operating conditions at failure; environmental conditions at failure
2. Nondestructive analysis: external visual inspection; X-ray microscope examination; ultrasonic microscope analysis; electrical test; high-temperature storage; seal test (possibly also a dew-point test); some other special tests, as necessary
2a. Failure cause follows from the analysis of the external causes
3. Semidestructive analysis: package opening; optical microscope inspection; failure (fault) localization on the chip (liquid crystals, microthermography, electron beam tester, emission microscope, OBIC, EBIC, etc.); preliminary analysis with the scanning electron microscope (SEM)
3a. Failure cause follows from the ...
4. Destructive analysis: material analysis at the surface (EDX); glassivation / passivation removal; material analysis (EDX); metallization removal; SEM examination; analysis in greater depth (possibly with microsections, FIB, TEM, SEM, etc.)
5. Failure mechanism analysis
6. Failure analysis report
7. Corrective actions (with manufacturer)
Figure 3.8 Basic procedure for failure analysis of electronic components (ICs as an example)
The failure analysis procedure described in Section 3.3.3 for ICs can be applied to other electronic or mechanical components and extended to cover populated printed circuit boards (PCBs) as well as subassemblies or assemblies.
3.3.4 Examples of VLSI Production-Related Reliability Problems
Production-related potential reliability problems, i.e. flaws or damages which can lead to failures, can occur for VLSI devices at packaging or soldering level (Fig. 3.10), as well as on silicon dies. Those on dies are often more difficult to identify. The following examples show three cases of production-related potential reliability problems on silicon dies, in increasing order of difficulty with respect to their identification [3.49] (see also Fig. 3.7 for further examples).

Fig. 3.9a shows a contact step coverage flaw. The contact to a diffusion in bulk silicon is made by the first metal layer, which usually is protected by a barrier against Al penetration into bulk silicon. However, the first metal layer often must adapt itself to some topography. Design rules make sure that the contact is flat enough. However, if the contact slopes are too steep (e.g. etching process problem), the step coverage may be reduced. In this case, electric contact is often still given, but melting or electromigration may start, leading to a failure. OBIRCH (optical beam induced resistivity change) can help to detect such weak contacts.

Fig. 3.9b shows a wafer processing flaw. Semiconductor devices include at least one poly-Si layer, which usually forms the MOS-transistor gates. It is isolated from bulk silicon by a thin (some nm) gate oxide, or by a thicker field oxide in active regions. The isolation against further poly-Si layers is given by a self-grown re-oxidation of the poly-Si surface and (in part) by doped silicate glass (PSG, BPSG). In the structuring process of poly-Si (usually photolithography and plasma etching), an improper etching process may result in poly-Si residues or particles, which during subsequent re-oxidation form an irregular and thin oxide around themselves. A short at t = 0 will be avoided; however, a latent short path is created, and a small voltage peak may be enough to break down the oxide, causing a leakage path.

Figs. 3.9c and 3.9d show ESD damage giving failures at t = 0 or latent failures, formerly considered as mechanical surface damage. Silicon dies are often delivered as wafers to customers which perform subsequent pre-assembly processes (wafer dicing, back grinding, and pick & place). These operations can include great risks for electrostatic discharge from robotics equipment to the device via the device passivation (e.g. when the picker setup of the pneumatic handler moves rapidly on a Teflon bearing). The term ESDFOS (electrostatic discharge from outside-to-surface) has been introduced to describe this failure cause. Like a lightning strike, the electrostatic spark comes onto the passivation, cracks it, melts the aluminum of the top metal, and cracks the interlevel dielectric (ILD), where the metal underneath locally melts and penetrates into the crack. Depending on the degree of Al penetration, the damage causes a failure at t = 0 or a latent failure. While great care has been taken with ESD protection of persons and workplaces, less attention has often been paid to robotics tools. An extended ESD concept with periodic audits, including survey and location of air ionizer fans, grounding concepts, materials, etc., is an effective method against this damage.
a) A steep slope topography causing a bad contact coverage with Al (×5000)
b) Slightly oxidized poly residue (small white line) buried between a poly-Si gate and a neighboring contact (×5000)
c) Latent ESDFOS damage, see also Fig. 3.9d (×5000)
d) Short between two top metal layers as a consequence of ESDFOS damage (×5000)
Figure 3.9 Examples of production-related (hidden) potential reliability problems in Si-dies [3.49] (see also Figs. 3.7 & 3.10)
3.4 Qualification Tests for Electronic Assemblies
As outlined in Section 3.2 for components, the purpose of a qualification test is to verify the suitability of a given item (electronic assemblies in this section) for a stated application. Such a qualification involves performance, environmental, and reliability tests, and has to be supported by a careful failure (fault) analysis. To be efficient, it should be performed on prototypes which are representative of the production line, in order to check not only the design but also the production process. Results of qualification tests are an important input to the critical design review (Table A3.3). This section deals with qualification tests of electronic assemblies, in particular of populated printed circuit boards (PCBs).
The aim of the performance test is similar to that of the characterization discussed in Section 3.2.2 for complex ICs. It is an experimental analysis of the electrical properties of the given assembly, with the purpose of investigating the influence of the most relevant electrical parameters on the behavior of the assembly at different ambient temperatures and power supply conditions (see Section 8.3 for considerations on electrical tests of PCBs).

Environmental tests have the purpose of submitting the given assembly to stresses which can be more severe than those encountered in the field, in order to investigate technological limits and failure mechanisms (see Section 3.2.3 for complex ICs). The following procedure, based on the experience with a large number of equipment [3.76], can be recommended for assemblies of standard technology used in high reliability (or safety) applications (total ≥ 10 assemblies):
1. Electrical behavior at extreme temperatures with functional monitoring, 100 h at -40°C, 0°C, and +80°C (2 assemblies).
2. 4,000 thermal cycles -40 / +120°C with functional monitoring, < 5°C/min or ≥ 20°C/min within the components according to the field application, ≥ 10 min dwell time at -40°C and ≥ 5 min at 120°C after the thermal equilibrium has been reached within ± 5°C (≥ 3 assemblies, metallographic analysis after 2,000 and 4,000 cycles).
3. Random vibrations at low temperature, 1 h with 2 - 6 g_rms, 20 - 500 Hz at -20°C (2 assemblies).
4. EMC and ESD tests (2 assemblies).
5. Humidity tests, 240 h 85/85 test (1 assembly).

Experience shows [3.76] that electronic equipment often behaves well even under extreme environmental conditions (operation at +120°C and -60°C, thermal cycles -40 / +120°C with up to 60°C/min within the components, humidity test 85/85, cycles of 4 h 95/95 followed by 4 h at -20°C, random vibrations 20 - 500 Hz at 4 g_rms and -20°C, ESD/EMC with pulses up to 15 kV). However, problems related to crack propagation in solder joints appear, and metallographic investigations on more than 1,000 microsections [3.76] confirm that cracks in solder joints are often initiated by production flaws, see Figs. 3.10d to 3.10f for some examples. Many of the production flaws with inserted components can be avoided and would cause (if existent) only minor reliability problems. For instance, voids can be eliminated by a better plating of the through-holes (reduced surface roughness of the walls and optimization of the plating parameters). Since even voids up to 50% of the solder volume do not severely reduce the reliability of solder joints for inserted components, it is preferable to avoid rework. Poor wetting of the leads or the excessive formation of brittle intermetallic layers are major potential reliability problems for solder joints. This last kind of defect must be avoided through a better production process.
More critical are surface mount devices (SMD), for which a detectable crack propagation in solder joints often begins after some few thousand thermal cycles. Extensive investigations [3.79, 3.80, 3.891 show that crack propagation is almost independent of pitch, at least down to a pitch of 0.3mm. Experimental results indicate an increase in the reliability of solder joints of IC's with shrinking pitches, due to the increasing flexibility of the leads. A new model to describe the viscoplastic behavior of SMT solder joints has been developed in [3.89]. This model outlines the strong impact of deformation energy on damage evolution and points out, on the basis of experimental observations, that cracks begin in a locally restricted recrystallized area within the joint and propagate in a stripe along the main Stress. The faster the deformation rate (the higher the thermal gradient) and the lower the temperature, the faster damage accumulates in the solder joint. Basically, two different deformation mechanisms are present, grain boundary sliding at rather low thermal gradient and dislocation climbing at higher thermal gradient. Hence attention must be paid in defining environmental and reliability tests or screening procedures for assemblies in SMT (Section 8.3). In such a test, or screening, it is mandatory to activate only the failure mechanism which would also be activated in the field. Because of the elastic behavior of the components and PCB, the dwell time during thermal cycles also plays an important role. The dwell time must be long enough to allow relaxation of the Stresses and depends on temperature, temperature swing, and materials stiffness. As for the thermal gradient, it is difficult to give general rules. Reliability tests at the assembly and higher integration level have as a primary purpose the detection of all systematicfailures (Section 7.7) and an estimation of the failure rate (Section 7.2.3). 
Precise information on the failure rate shape is seldom possible from qualification tests, because of cost and time limits. If reliability tests are necessary, the following procedure can be used (total ≥ 8 assemblies):
1. 4,000 h dynamic burn-in at 80°C ambient temperature (≥ 2 assemblies, functional monitoring, intermediate electrical tests at 24, 96, 240, 1,000, and 4,000 h).
2. 5,000 thermal cycles -20/+100°C with < 5°C/min for applications with slow heat up and ≥ 20°C/min for rapid heat up, dwell time ≥ 10 min at -20°C and ≥ 5 min at 100°C after the thermal equilibrium has been reached within ± 5°C (≥ 3 assemblies, metallographic analysis after 1,000, 2,000, and 5,000 cycles; crack propagation can be estimated using a Coffin-Manson relationship of the form $N = A\,\varepsilon^{n}$ with $\varepsilon = (\alpha_B - \alpha_C)\, l\, \Delta\theta / d$ [3.89, 3.79]; the parameter $A$ has to be determined with tests at different temperature swings).
3. 5,000 thermal cycles 0/+80°C, with temperature gradient as in point 2 above, combined with random vibrations 1 g_rms, 20-500 Hz (≥ 3 assemblies, metallographic analysis after 1,000, 2,000, and 5,000 cycles).
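The Coffin-Manson relationship quoted in point 2 can be sketched numerically as follows. This is only an illustration under stated assumptions: the reading of the symbols (expansion coefficients of board and component, component length, temperature swing, joint height), the exponent n = -2, and all numerical values are hypothetical and not taken from the text; the helper names are invented for this sketch.

```python
import math

def thermal_strain(alpha_b, alpha_c, length, delta_theta, height):
    """Strain eps = (alpha_B - alpha_C) * l * delta_theta / d."""
    return (alpha_b - alpha_c) * length * delta_theta / height

def fit_coffin_manson_a(strains, cycles, n):
    """Estimate A in N = A * eps**n for a given exponent n.

    Averages log(N) - n*log(eps) over all tests, i.e. a least-squares
    estimate of log(A) in log space.
    """
    logs = [math.log(c) - n * math.log(e) for e, c in zip(strains, cycles)]
    return math.exp(sum(logs) / len(logs))

# Hypothetical tests at three temperature swings (all values illustrative).
swings = [60.0, 100.0, 120.0]
strains = [thermal_strain(16e-6, 6e-6, 5.0, dt, 0.1) for dt in swings]
# Synthetic cycle counts generated with A = 0.5 and n = -2 (assumed values).
cycles = [0.5 * e ** -2.0 for e in strains]
a_hat = fit_coffin_manson_a(strains, cycles, n=-2.0)
```

With noise-free synthetic data the estimate reproduces the assumed A; with real metallographic data the scatter of the individual log terms gives a first impression of how well a single A describes the different temperature swings.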
Figure 3.10 Examples of production flaws responsible for the initiation of cracks in solder joints: a) to c) inserted devices, d) to f) SMD (Reliability Laboratory at the ETH Zurich); see also Figs. 3.7 & 3.9
a) Void caused by an s-shaped pin gassing out in the area A (×20)
b) Flaw caused by the insertion of the insulation of a resistor network (×20)
c) Defect in the copper plating of a hole in a multilayer printed board (×50)
d) A row of voids along the pin of an SOP package (×30)
e) Soldering defect in a surface mounted resistor, area A (×30)
f) Detail A of Fig. 3.10e (×500)
Thermal cycles with random vibrations highly activate failure mechanisms at the assembly level, i.e. crack propagation in solder joints. If such a stress occurs in the field, insertion technology should be preferred to SMT. Figure 3.11 shows a comparative investigation of crack propagation [3.79 (1993)].
Figure 3.11 Crack propagation in different SMD solder joints as a function of the number of thermal cycles (crack length in % of the solder joint length, mean over 20 values; thermal cycles -20/+100°C with 60°C/min inside the solder joint; Reliability Laboratory at the ETH Zurich)
Horizontal axis: number of thermal cycles. Curves: QFP, 52 pins, pitch 0.65 mm (tin plated and unplated); SOP, 28 pins, pitch 1.27 mm (tin plated and unplated); ceramic capacitor (tin plated and unplated); MELF resistor (tin plated and unplated).
4 Maintainability Analysis
At equipment and system level, maintainability has a great influence on reliability and availability. This holds in particular if redundancy has been implemented and redundant parts have to be repaired on line, i.e. without interruption of operation at system level. Maintainability is thus an important parameter in the optimization of availability and life-cycle cost. Achieving high maintainability in complex equipment and systems requires appropriate activities which must be started early in the design & development phase and be coordinated by a maintenance concept. To this concept belong failure recognition and isolation (built-in tests), partitioning of the equipment or system into (as far as possible) independent line replaceable units, and logistic support. A maintenance concept has to be tailored to the equipment or system considered. After some basic concepts (Section 4.1), Section 4.2 deals with a maintenance concept for complex equipment and systems. Section 4.3 considers maintainability aspects in design reviews and Section 4.4 presents methods and tools for maintainability prediction. Spare parts provisioning is investigated in Section 4.5, repair strategies in Section 4.6, and cost optimization in Section 4.7. Design guidelines for maintainability are given in Section 5.2. The influence of preventive maintenance, imperfect switching, and incomplete coverage on a system's reliability & availability is investigated in Section 6.8. For simplicity, repair is used as a synonym for restoration.
4.1 Maintenance, Maintainability
Maintenance defines all those activities performed on an item to retain it in or to restore it to a specified state. Maintenance thus includes preventive maintenance, carried out at predetermined intervals and according to prescribed procedures to reduce the probability of failures or the degradation of the functionality of an item, and corrective maintenance, initiated after fault recognition and intended to bring the item into a state in which it can again perform the required function (Fig. 4.1). The aim of preventive maintenance must also be to detect and repair hidden failures, i.e. failures in redundant elements. Corrective maintenance is also known as repair (restoration) and can include any or all of the following steps:
Figure 4.1 Maintenance tasks (failure can be replaced by fault, including defects and failures)
MAINTENANCE
  PREVENTIVE MAINTENANCE (retainment of the item functionality): test of all relevant functions, also to recognize hidden failures; activities to compensate for drift and to reduce wearout failures; overhaul to increase useful life
  CORRECTIVE MAINTENANCE (reestablishment of the item functionality): failure recognition; failure localization & diagnosis (isolation); failure correction (removal); function checkout
recognition, localization & diagnosis (isolation), correction (disassemble, remove, replace, reassemble, adjust), and function checkout. For simplicity, repair is hereafter used as a synonym for restoration. The time elapsed from the recognition of a failure until the start-up after failure correction, including all logistic delays (waiting for spare parts or tools), is the repair time (restoration time). Often, ideal logistic support, with no logistic delay, is assumed for calculations. Maintainability is a characteristic of an item, expressed by the probability that preventive maintenance (serviceability) or repair (repairability) of the item will be performed within a stated time interval by given procedures and resources (number and skill level of the personnel, spare parts, test facilities, logistic support). If $\tau'$ and $\tau''$ are the (random) times required to carry out a repair or a preventive maintenance, then
$$\text{Repairability} = \Pr\{\tau' \le x\} \quad \text{and} \quad \text{Serviceability} = \Pr\{\tau'' \le x\}. \qquad (4.1)$$
Considering $\tau'$ and $\tau''$ as interarrival times, the variable $x$ is used instead of $t$ in Eq. (4.1). For a rough characterization, the expected values (means) of $\tau'$ and $\tau''$,
$E[\tau'] = \mathrm{MTTR}$ = mean time to repair (restoration),
$E[\tau''] = \mathrm{MTTPM}$ = mean time to preventive maintenance,
are often used. Assuming $x$ as a parameter, Eq. (4.1) gives the distribution functions of $\tau'$ and $\tau''$, respectively. These distribution functions characterize the repairability and the serviceability of the item considered. Experience shows that $\tau'$ and $\tau''$ often exhibit a lognormal distribution (Eq. (A6.110)). The typical shape of the corresponding density is shown in Fig. 4.2. A characteristic of the lognormal density is the sudden increase after a period of time in which its value is practically zero, and the relatively fast decrease after reaching the maximum (modal value $x_M$).
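The repairability of Eq. (4.1) for a lognormally distributed repair time can be evaluated with a few lines of code. The sketch below assumes the parametrization $F(x) = \Phi(\ln(\lambda x)/\sigma)$, which is consistent with the values $\lambda = 0.6\,\mathrm{h}^{-1}$, $\sigma = 0.3$ and MTTR = 1.75 h quoted for Fig. 4.2; the function names are illustrative only.

```python
import math

def lognormal_cdf(x, lam, sigma):
    """Repairability Pr{tau' <= x} for F(x) = Phi(ln(lam * x) / sigma)."""
    if x <= 0:
        return 0.0
    z = math.log(lam * x) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lognormal_pdf(x, lam, sigma):
    """Density belonging to lognormal_cdf."""
    if x <= 0:
        return 0.0
    z = math.log(lam * x) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))

def lognormal_mean(lam, sigma):
    """Mean (MTTR) of the lognormal repair time."""
    return math.exp(sigma ** 2 / 2.0) / lam

def lognormal_mode(lam, sigma):
    """Modal value, i.e. the location of the density maximum."""
    return math.exp(-sigma ** 2) / lam

lam, sigma = 0.6, 0.3                      # parameters quoted for Fig. 4.2
mttr = lognormal_mean(lam, sigma)          # close to the quoted 1.75 h
x_mode = lognormal_mode(lam, sigma)
early_density = lognormal_pdf(0.5, lam, sigma)   # practically zero
repairability_3h = lognormal_cdf(3.0, lam, sigma)
```

The very small density at 0.5 h and the maximum near 1.5 h reflect the shape described above: practically zero at first, then a sudden increase and a relatively fast decrease.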
Figure 4.2 Density of the lognormal distribution function for $\lambda = 0.6\,\mathrm{h}^{-1}$ and $\sigma = 0.3$ (dashed is the approximation given by a shifted exponential distribution with same mean)
This shape can be accepted, taking into consideration the main terms of a repair time (Fig. 4.1). However, calculations using a lognormal distribution can become time-consuming. In practical applications it is therefore useful to distinguish between the following two situations:
1. Investigation of maintenance times, often under assumption of ideal logistic support: In this case, the actual distribution function must be considered, see Sections 7.3 and 7.5 for some examples with a lognormal distribution.
2. Investigation of the reliability and availability of repairable systems: The exact shape of the repair time distribution has in general less influence on the reliability and availability values at system level, as long as the MTTR is unchanged and MTTR ≪ MTTF holds (Examples 6.7, 6.8, 6.9); in this case, the actual repair time distribution function can often be approximated by an exponential function with same mean.
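A further possibility to Point 2 above is to use e.g. a shifted exponential distribution function (Examples 6.8 and 6.9). Figure 4.2 shows (dashed) an example with
$$g(x) = \mu' e^{-\mu'(x - \psi)} \ \text{ for } x \ge \psi, \qquad g(x) = 0 \ \text{ for } x < \psi.$$
The parameter $\mu'$ of the exponential d.f. follows from the equality of the mean values
$$\psi + \frac{1}{\mu'} = \mathrm{MTTR}.$$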
For the numerical example given in Fig. 4.2 ($\lambda = 0.6\,\mathrm{h}^{-1}$, $\sigma = 0.3$; MTTR = 1.75 h, Var = 0.29 h²) one obtains $\psi = 0.99$ h and $\mu' = 1.32\,\mathrm{h}^{-1}$. A shift which considers equal mean and variance leads to $\psi = 1.2$ h & $\mu' = 1.9\,\mathrm{h}^{-1}$. For a deeper investigation, one can refer to Examples 6.7 - 6.9. In some cases, an Erlang distribution (Eq. (A6.102)) with $\beta \ge 3$ can be assumed for repair times, yielding simple results. As in the case of the failure rate $\lambda(x)$, for a statistical evaluation of repair times ($\tau'$) it would be preferable to omit data attributable to systematic failures. For the remaining data, a repair rate $\mu(x)$ can be obtained from the distribution function $G(x) = \Pr\{\tau' \le x\}$, with density $g(x) = dG(x)/dx$, as per Eq. (A6.25)
$$\mu(x) = \lim_{\delta x \downarrow 0} \frac{1}{\delta x} \Pr\{x < \tau' \le x + \delta x \mid \tau' > x\} = \frac{g(x)}{1 - G(x)} \qquad (4.3)$$
(considering that $\tau'$ starts anew at each repair (restoration), $x$ is used instead of $t$). In evaluating the maintainability achieved in the field, the influence of the logistic support must be considered. MTTR requirements are discussed in Appendix A3.1. MTTR estimation and demonstration is considered in Section 7.3.
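The numerical values quoted for Fig. 4.2 and the repair rate of Eq. (4.3) can be checked with the short sketch below. It assumes the lognormal parametrization with mean $e^{\sigma^2/2}/\lambda$ and variance $e^{\sigma^2}(e^{\sigma^2}-1)/\lambda^2$, and a shifted exponential density $\mu' e^{-\mu'(x-\psi)}$ for $x \ge \psi$; the function names are hypothetical.

```python
import math

def lognormal_mean_var(lam, sigma):
    """Mean and variance of a lognormal repair time (assumed parametrization)."""
    mean = math.exp(sigma ** 2 / 2.0) / lam
    var = math.exp(sigma ** 2) * (math.exp(sigma ** 2) - 1.0) / lam ** 2
    return mean, var

def shifted_exp_equal_mean(mean, psi):
    """mu' from psi + 1/mu' = mean, for a chosen shift psi."""
    return 1.0 / (mean - psi)

def shifted_exp_equal_mean_var(mean, var):
    """(psi, mu') from psi + 1/mu' = mean and 1/mu'**2 = var."""
    sd = math.sqrt(var)
    return mean - sd, 1.0 / sd

def repair_rate(g, G, x):
    """Repair rate mu(x) = g(x) / (1 - G(x)) as in Eq. (4.3)."""
    return g(x) / (1.0 - G(x))

mean, var = lognormal_mean_var(0.6, 0.3)          # about 1.74 h and 0.29 h^2
mu_1 = shifted_exp_equal_mean(mean, psi=0.99)     # about 1.3 per hour
psi_2, mu_2 = shifted_exp_equal_mean_var(mean, var)   # about 1.2 h, 1.9 per hour

# For the shifted exponential, the repair rate is constant for x > psi.
g = lambda x: mu_2 * math.exp(-mu_2 * (x - psi_2)) if x >= psi_2 else 0.0
G = lambda x: 1.0 - math.exp(-mu_2 * (x - psi_2)) if x >= psi_2 else 0.0
rate_at_2h = repair_rate(g, G, 2.0)               # equals mu_2
```

The small differences to the quoted values (e.g. 1.33 instead of 1.32 per hour) come from rounding the mean to 1.75 h in the text.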
4.2 Maintenance Concept
As for reliability, maintainability must be built into equipment and systems during the design and development phase. This holds in particular because maintainability cannot be easily predicted and a maintainability improvement often requires important changes in layout or construction of the item (system) considered. For these reasons, attaining a prescribed maintainability in complex equipment and systems generally requires the planning and realization of a maintenance concept. Such a concept deals with the following aspects:
1. Fault recognition and isolation, including checkout after repair (isolation can be subdivided into localization and diagnosis, and fault is used to consider failures and defects).
2. Partitioning of the equipment or system into independent line replaceable units (LRUs), i.e. spare parts at equipment or system level (line repairable, last repairable, or last replaceable is often used for line replaceable).
3. Preparation of the user documentation (operating & maintenance manuals).
4. Training of operating and maintenance personnel.
5. Logistic support for the user, including after-sales service.
This section introduces the above points for the case of complex equipment and systems with high maintainability requirements.
4.2.1 Fault Recognition and Isolation
For complex equipment and systems, recognition of partial failures or of hidden failures (failure of redundant elements) can be difficult. For this reason, a status test, initiated by operating personnel, or an operation monitoring, running autonomously, must often be implemented. Properties, advantages, and disadvantages of both methods are summarized in Table 4.1. The choice between a status test or a (more complete) operation monitoring must consider cost, reliability, availability, and safety requirements at system level. The goal of fault isolation (localization and diagnosis) is to isolate faults (failures and defects) down to the line replaceable units (LRUs), i.e. to the part which is considered as a spare part at the equipment or system level. LRUs are generally assemblies, e.g. populated printed circuit boards, or units which for repair purposes are considered as an entity and replaced on a plug-out/plug-in basis to reduce repair times. Repair of LRUs is generally performed by specialized personnel and repaired LRUs are stored for reuse. Fault isolation should be performed using built-in test (BIT) facilities, if necessary supported by built-in test equipment (BITE). Use of external special tools should be avoided; however, check lists and portable test equipment can be useful to limit the amount of built-in facilities. Fault recognition and fault isolation are closely related and should be considered together using common hardware and/or software. A high degree of automation
Table 4.1 Automatic and semiautomatic fault recognition
Status test, rough (quick test)
  Properties: Testing of all important functions, if necessary with help of external test equipment; initiated by the operating personnel, then runs automatically
  Advantages: Lower cost; allows fast checking of the functional conditions
  Disadvantages: Limited fault isolation (localization and diagnosis) capability
Status test, complete (functional test)
  Properties: Periodic testing of all important functions; initiated by the operating personnel, then runs automatically or semi-automatically (possibly without external stimulation or test equipment)
  Advantages: Gives a clear status of the functional conditions of the item considered; allows fault isolation down to LRU level
  Disadvantages: Relatively expensive; runs generally off-line (i.e. not in background)
Operation monitoring
  Properties: Monitoring of all important functions and automatic display of complete and partial faults; performed with built-in means (BIT/BITE)
  Advantages: Runs automatically on-line, i.e. in background
  Disadvantages: Expensive
LRU = line replaceable unit; BIT = built-in test; BITE = built-in test equipment
should be striven for, and test results should be automatically recorded. A one-to-one correspondence between test messages and content of the user documentation (operating and maintenance manuals) must be assured. Built-in tests (BIT) should be able to identify hidden faults, i.e. faults (defects or failures) of redundant elements and, as far as possible, also software defects. This ability is generally characterized by the following testability parameters: degree of fault recognition (coverage, e.g. 99% of all relevant failures), degree of fault isolation (e.g. down to LRUs), correctness of the fault isolation (e.g. 95%), and test duration (e.g. 1 s). The first two parameters can be expressed by a probability. Distinction between failures and defects is important. As a measure of the correctness of the fault isolation capability, one can use the ratio between the number of correctly isolated faults and the number of isolation tests performed. This figure, similar to that of test coverage, must often remain at an empirical level, because of the lack of exact information about the defects and failures really present or assumed in the item considered. For the test duration, it is generally sufficient to work with mean values. Failure (fault) mode analysis methods (FMEA/FMECA, FTA, cause-to-effect charts, etc.) are useful to check the effectiveness of built-in facilities (Section 2.6). Built-in test facilities, in particular built-in test equipment (BITE), must be defined taking into consideration not only price/performance aspects but also their impact on the reliability and availability of the equipment or system in which they are used. Standard BITE can often be integrated into the equipment or system considered. However, project specific BITE is generally more efficient than standard solutions. For such a selection, the following aspects are important:
1. Simplicity: Test sequences, procedures, and documentation should be as easy as possible.
2. Standardization: The greatest possible standardization should be striven for, in hardware and software.
3. Reliability: Built-in facilities should have a failure rate at least one order of magnitude lower than that of the equipment or system in which they are used; their failure should not influence the item's operation (FMEA/FMECA).
4. Maintenance: The maintenance of BIT/BITE must be simple and should not interfere with that of the equipment or system; the user should be connected to the field data change service of the manufacturer.
For some applications, it is important that fault isolation (or at least part of the diagnosis) can be remotely controlled. Such a requirement can often be satisfied, if stated early in the design phase. Remote diagnosis must be investigated on a case-by-case basis, using results from a careful failure modes and effects analysis (FMEA, FTA).
A further step beyond the above considerations leads to maintenance concepts which allow automatic or semiautomatic reconfiguration of an item after failure. Design guidelines for maintainability are given in Section 5.2. Effects of imperfect switching and incomplete coverage are investigated in Section 6.8.
4.2.2 Equipment and System Partitioning
The consistent partitioning of complex equipment and systems into (as far as possible) independent line replaceable units (LRUs) is important for good maintainability. Partitioning must be performed early in the design phase, because of its impact on layout and construction of the equipment or system considered. LRUs should constitute functional units and have clearly defined interfaces with other LRUs. Ideally, LRUs should allow a modular construction of the equipment or system, i.e. constitute autonomous units which can each be tested independently of every other (for hardware as well as for software). Related to the above aspects are those of accessibility, adjustment, and exchangeability. Accessibility should be easy for LRUs with limited useful life, high failure rate, or wearout. The use of digital techniques largely reduces the need for adjustment (alignment). As a general rule, hardware adjustment in the field should be avoided. Exchangeability can be a problem for equipment and systems with long useful life. Spare parts provisioning and aspects of obsolescence can in such cases become mandatory (Section 4.5).
4.2.3 User Documentation
User (or product) documentation for complex equipment and systems can include all of the following manuals or handbooks: General Description, Operating Manual, Preventive Maintenance (Service) Manual, Corrective Maintenance (Repair) Manual, Illustrated Spare Parts Catalog, and Logistic Support. It is important for the content of the user documentation to be consistent with the hardware and software status of the item considered. Emphasis must be placed on a clear and concise presentation, with block diagrams, flow charts, and check lists. The language should be easily understandable to non-specialized personnel. Procedures should be self-sufficient and contain checkpoints to prevent the skipping of important steps.
4.2.4 Training of Operating and Maintenance Personnel
Suitably equipped, well trained, and motivated maintenance personnel are an important prerequisite to achieve short maintenance times and to avoid human errors. Training must be comprehensive enough to cover present needs. However, for a complex system it should be periodically updated to cover technological changes introduced in the system and to further motivate the operating and maintenance personnel.
4.2.5 User Logistic Support
For complex equipment and systems, customers (users) generally expect from the manufacturer logistic support during the useful life of the item under consideration. This can range from support on an on-call basis up to a maintenance contract with manufacturer's personnel located at the user site. One important point in such logistic support is the definition of responsibilities. For this reason, maintenance is often subdivided into different levels (four for military applications (Table 4.2) and three for industry, in general). The first level concerns simple maintenance work such as the status test, fault recognition, and fault isolation down to the subsystem level. This task is generally performed by operating personnel. At the second level, fault isolation is refined, the defective LRU is replaced by a good one, and the functional test is performed. For this task, first line maintenance personnel is often required. At the third level, faulty LRUs are repaired by maintenance personnel and stored for reuse. The fourth level generally relates to
Table 4.2 Maintenance levels in the defense area
Logistic level 1
  Location: Field
  Carried out by: Operating personnel
  Tasks: Simple maintenance work; status test; fault recognition; fault isolation down to subsystem level
Logistic level 2
  Location: Cover
  Carried out by: First line maintenance personnel
  Tasks: Preventive maintenance; fault isolation down to LRU level; first line repair (LRU replacement); functional test
Logistic level 3
  Location: Depot
  Carried out by: Maintenance personnel
  Tasks: Difficult maintenance; repair of LRUs
Logistic level 4
  Location: Arsenal or industry
  Carried out by: Specialized personnel from arsenal or industry
  Tasks: Reconditioning work; important changes or modifications
LRU = line replaceable unit (spare part at system level)
overhaul or revision (essentially for large mechanical parts subjected to wear, erosion, scoring, etc.) and is often performed at the manufacturer's site by specialized personnel. For large mechanical systems, maintenance can account for over 30% of the operating cost. A careful optimization of these costs may be necessary in many cases. The part contributed by preventive maintenance is more or less deterministic. For corrective maintenance, cost equations weighted by probabilities of occurrence can be established from considerations similar to those given in Sections 1.2.9 and 8.4, see also Sections 4.5, 4.6, and 4.7.
Table 4.3 Example of catalog of questions for the preparation of project specific checklists for the evaluation of maintainability aspects in preliminary design reviews (Appendices A3 and A4) of complex equipment and systems with high maintainability requirements
1. Has the equipment or system been conceived with modularity in mind? Are the modules functionally independent and separately testable?
2. Has a concept for fault recognition and isolation been planned and realized? Is fault detection automatic? Which kind of faults are recognized? How does fault isolation work? Is isolation down to line replaceable (repairable) units (LRUs) possible? How large are the values for fault recognition and fault isolation (coverage)? Is remote diagnostic possible?
3. Can redundant elements be repaired on-line?
4. Are enough test points provided? Do they have pull-up/pull-down resistors?
5. Have hardware adjustments (or alignments) been reduced to a minimum? Are the adjustable elements clearly marked and easily accessible? Is the adjustment uncritical?
6. Has the amount of external test equipment been kept to a minimum?
7. Has the standardization of components, materials, and maintenance tools been considered?
8. Are line replaceable units (LRUs) identical with spare parts? Can they be easily tested? Is a spare parts provisioning concept available?
9. Are all elements with limited useful life clearly marked and easily accessible?
10. Are access flaps (and doors) easy to open (without special tools) and self-latching? Do plug-in unit guide rails have self-blocking devices? Can a standardized extender for PCBs be used?
11. Have indirect connectors been used? Is the plugging-out/plugging-in of PCBs (LRUs) easy? Are power supplies and ground distributed across different contacts?
12. Have wires and cables been conveniently placed? Also with regard to maintenance?
13. Are sensitive elements sufficiently protected against mishandling during maintenance?
14. Can preventive maintenance be performed on-line? Does preventive maintenance also allow the detection of hidden failures?
15. Can the item (the system) be considered as-good-as-new after a maintenance action?
16. Have man-machine aspects been sufficiently considered?
17. Have all safety aspects also for operating and maintenance personnel been considered? Also in the case of failure (FMEA/FMECA, FTA, etc.)?
4.3 Maintainability Aspects in Design Reviews
Design reviews are important to point out, discuss, and eliminate design weaknesses. Their objective is also to decide about continuation or stopping of the project on the basis of objective considerations (feasibility checks in Tables A3.3 & 5.3 and Fig. 1.6). The most important design reviews (PDR & CDR) are described in Table A3.3. To be effective, design reviews must be supported by project specific checklists. Table 4.3 gives an example of a catalog of questions which can be used to generate project specific checklists for maintainability aspects in design reviews (see Table 2.8 for reliability and Appendix A4 for other aspects).
4.4 Predicted Maintainability
Knowing the reliability structure of a system and the reliability and maintainability of its elements, it is possible to calculate the maintainability of the system considered as a one-item structure (e.g. calculating the reliability function and the point availability at system level and extracting $g(t)$ as the density of the repair time at the system level using Eqs. (6.14) and (6.18)). However, such a calculation soon becomes laborious for arbitrary systems (Chapter 6). For many practical applications it is often sufficient to know the mean time to repair at the system level $\mathrm{MTTR}_S$ (expected value of the repair (renewal) time at system level) as a function of the system reliability structure, and of the mean time to failure $\mathrm{MTTF}_i$ and mean time to repair $\mathrm{MTTR}_i$ of its elements. Such a calculation is discussed in Section 4.4.1. Section 4.4.2 deals then with the calculation of the mean time to preventive maintenance at system level $\mathrm{MTTPM}_S$. The method used in Sections 4.4.1 and 4.4.2 is easy to understand and delivers mathematically exact results for $\mathrm{MTTR}_S$ and $\mathrm{MTTPM}_S$. The use of statistical methods to estimate or demonstrate a maintainability or an MTTR is discussed in Sections 7.2.1, 7.3, 7.5, and 7.6.
4.4.1 Calculation of $\mathrm{MTTR}_S$
Let us first consider a system without redundancy, with elements $E_1, \ldots, E_n$ in series as given in Fig. 6.4. $\mathrm{MTTF}_i$ and $\mathrm{MTTR}_i$ are the mean time to failure and the mean time to repair of element $E_i$, respectively ($i = 1, \ldots, n$). Assume now that each element works for the same cumulative operating time $T$ (the system is disconnected during repair or repair times are neglected because of $\mathrm{MTTR}_i \ll \mathrm{MTTF}_i$) and let $T$ be arbitrarily large. In this case, the expected value (mean) of the number of failures of element $E_i$ during $T$ is given by (Eq. (A7.27))
$$\frac{T}{\mathrm{MTTF}_i}.$$
The mean of the total repair time necessary to restore the $T / \mathrm{MTTF}_i$ failures follows then from
$$\mathrm{MTTR}_i \, \frac{T}{\mathrm{MTTF}_i}.$$
For the whole system, there will be in mean
$$\sum_{i=1}^{n} \frac{T}{\mathrm{MTTF}_i} \qquad (4.4)$$
failures and a mean total repair time of
$$\sum_{i=1}^{n} \mathrm{MTTR}_i \, \frac{T}{\mathrm{MTTF}_i}. \qquad (4.5)$$
From Eqs. (4.4) and (4.5) it follows then for the mean time to repair (restoration) at the system level $\mathrm{MTTR}_S$ the final value
$$\mathrm{MTTR}_S = \frac{\sum_{i=1}^{n} \mathrm{MTTR}_i / \mathrm{MTTF}_i}{\sum_{i=1}^{n} 1 / \mathrm{MTTF}_i}. \qquad (4.6)$$
Equation (4.6) gives the mathematically exact value for the mean system repair time $\mathrm{MTTR}_S$ under the assumption that at system down (during a repair) no further failures can occur and that switching is ideal (no influence on the reliability). From Eq. (4.6) one can easily verify that
$$\mathrm{MTTR}_S = \mathrm{MTTR} \quad \text{when} \quad \mathrm{MTTR}_1 = \ldots = \mathrm{MTTR}_n = \mathrm{MTTR},$$
and
$$\mathrm{MTTR}_S = \frac{1}{n} \sum_{i=1}^{n} \mathrm{MTTR}_i \quad \text{when} \quad \mathrm{MTTF}_1 = \ldots = \mathrm{MTTF}_n.$$
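The counting argument behind Eq. (4.6) can be retraced in a few lines. The sketch below uses hypothetical function names and arbitrary illustrative element data; it computes the mean number of failures and the mean total repair time over a cumulative operating time T and shows that their ratio does not depend on T.

```python
def mttr_system(mttf, mttr):
    """Eq. (4.6): sum(MTTR_i / MTTF_i) / sum(1 / MTTF_i)."""
    if not mttf or len(mttf) != len(mttr):
        raise ValueError("need one MTTF and one MTTR per element")
    num = sum(r / f for f, r in zip(mttf, mttr))
    den = sum(1.0 / f for f in mttf)
    return num / den

def mttr_system_from_counts(mttf, mttr, T):
    """Same quantity, derived from mean failure counts over operating time T."""
    failures = [T / f for f in mttf]             # mean failures per element
    total_failures = sum(failures)               # Eq. (4.4)
    total_repair_time = sum(n * r for n, r in zip(failures, mttr))  # Eq. (4.5)
    return total_repair_time / total_failures

# Illustrative (hypothetical) element data, in hours.
mttf = [1000.0, 200.0, 50.0]
mttr = [4.0, 1.0, 0.5]
direct = mttr_system(mttf, mttr)
via_counts_short = mttr_system_from_counts(mttf, mttr, T=1e3)
via_counts_long = mttr_system_from_counts(mttf, mttr, T=1e6)

# Special cases stated in the text.
equal_mttr = mttr_system([100.0, 300.0, 700.0], [2.0, 2.0, 2.0])   # gives 2.0
equal_mttf = mttr_system([500.0, 500.0, 500.0], [1.0, 2.0, 6.0])   # arithmetic mean
```

Elements that fail often dominate the result: the system value is a mean of the element repair times weighted by the failure frequencies.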
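Example 4.1 Give the mean time to repair at system level $\mathrm{MTTR}_S$ for the following system.
[Figure: series system of four elements with MTTF = 500 h, MTTR = 2 h; MTTF = 400 h, MTTR = 2.5 h; MTTF = 250 h, MTTR = 1 h; MTTF = 100 h, MTTR = 0.5 h]
How large is the mean of the total system down time during the interval $(0, t]$ for $t \to \infty$?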
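Solution From Eq. (4.6) it follows that
$$\mathrm{MTTR}_S = \frac{\dfrac{2\,\mathrm{h}}{500\,\mathrm{h}} + \dfrac{2.5\,\mathrm{h}}{400\,\mathrm{h}} + \dfrac{1\,\mathrm{h}}{250\,\mathrm{h}} + \dfrac{0.5\,\mathrm{h}}{100\,\mathrm{h}}}{\dfrac{1}{500\,\mathrm{h}} + \dfrac{1}{400\,\mathrm{h}} + \dfrac{1}{250\,\mathrm{h}} + \dfrac{1}{100\,\mathrm{h}}} = \frac{0.01925}{0.0185\,\mathrm{h}^{-1}} = 1.04\,\mathrm{h}.$$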
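The mean down time at the system level is also 1.04 h, since for a system without redundancy it holds that down time = repair time. The mean operating time at the system level in the interval $(0, t]$ can be obtained from the expression for the average availability $AA_S$ (Eqs. (6.23), (6.24), (6.48), and (6.49))
$$\lim_{t \to \infty} E[\text{total operating time in } (0, t]] = t \cdot AA_S = t \cdot \mathrm{MTTF}_S / (\mathrm{MTTF}_S + \mathrm{MTTR}_S).$$
From this, the mean of the total system down time during $(0, t]$ for $t \to \infty$ follows then from
$$\lim_{t \to \infty} E[\text{total system down time in } (0, t]] = t - t \cdot AA_S = t \cdot \mathrm{MTTR}_S / (\mathrm{MTTF}_S + \mathrm{MTTR}_S).$$
Numerical computation then leads to
$$\lim_{t \to \infty} E[\text{total system down time in } (0, t]] = t \cdot \frac{1.04\,\mathrm{h}}{54.05\,\mathrm{h} + 1.04\,\mathrm{h}} \approx 0.019\, t.$$
If every element exhibits a constant failure rate $\lambda_i$, then $\mathrm{MTTF}_i = 1 / \lambda_i$ and
$$\mathrm{MTTR}_S = \sum_{i=1}^{n} \frac{\lambda_i}{\lambda_S} \, \mathrm{MTTR}_i, \qquad \text{with} \quad \lambda_S = \sum_{i=1}^{n} \lambda_i. \qquad (4.7)$$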
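The numbers of Example 4.1 can be reproduced with Eq. (4.7), as in the following sketch. The element data are those used in the solution of Example 4.1; the function name is illustrative only.

```python
def mttr_system_from_rates(rates, mttr):
    """Eq. (4.7): MTTR_S = sum(lambda_i / lambda_S * MTTR_i), lambda_S = sum(lambda_i)."""
    lam_s = sum(rates)
    mttr_s = sum(lam / lam_s * r for lam, r in zip(rates, mttr))
    return mttr_s, lam_s

# Element data of Example 4.1, in hours.
mttf = [500.0, 400.0, 250.0, 100.0]
mttr = [2.0, 2.5, 1.0, 0.5]
rates = [1.0 / f for f in mttf]          # constant failure rates lambda_i

mttr_s, lam_s = mttr_system_from_rates(rates, mttr)
mttf_s = 1.0 / lam_s                     # series system: rates add up
down_fraction = mttr_s / (mttf_s + mttr_s)   # long-run share of down time

print(f"MTTR_S = {mttr_s:.2f} h, lambda_S = {lam_s:.4f} 1/h")
print(f"mean total down time in (0, t] is about {down_fraction:.3f} * t")
```

The down-time fraction of about 0.019 corresponds to roughly 165 h of down time per year of continuous operation for this system.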
Equations (4.6) and (4.7) can also be used for systems with redundancy. However, in this case, a distinction at system level between repair time and down time is necessary. If the system contains only active redundancy, the mean time to repair at the system level $\mathrm{MTTR}_S$ is given by Eq. (4.6) or (4.7) by summing over all elements of the system, as if they were in series (a similar consideration holds for spare parts provisioning). By assuming that failures of redundant elements are repaired without interruption of operation at the system level, Eq. (4.6) or (4.7) can be used to obtain an approximate value of the mean down time at the system level, by summing only over all elements without redundancy (series elements), see Example 4.2.
Example 4.2 How does the $\mathrm{MTTR}_S$ of the system in Example 4.1 change, if an active redundancy is introduced to the element with MTTF = 100 h?
[Figure: reliability block diagram of the system of Example 4.1, with the element MTTF = 100 h, MTTR = 0.5 h duplicated in active redundancy, in series with the elements MTTF = 500 h, MTTR = 2 h; MTTF = 400 h, MTTR = 2.5 h; MTTF = 250 h, MTTR = 1 h]
Under the assumption that the redundancy is repaired without interruption of operation at the system level, is there a difference between the mean time to repair and the mean down time at the system level?
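Solution Because of the assumed active redundancy, the operating elements and the reserve elements show the same mean number of failures. The mean system repair time follows then from Eq. (4.6) by summing over all system elements, yielding
$$\mathrm{MTTR}_S = \frac{\dfrac{2\,\mathrm{h}}{500\,\mathrm{h}} + \dfrac{2.5\,\mathrm{h}}{400\,\mathrm{h}} + \dfrac{1\,\mathrm{h}}{250\,\mathrm{h}} + \dfrac{0.5\,\mathrm{h}}{100\,\mathrm{h}} + \dfrac{0.5\,\mathrm{h}}{100\,\mathrm{h}}}{\dfrac{1}{500\,\mathrm{h}} + \dfrac{1}{400\,\mathrm{h}} + \dfrac{1}{250\,\mathrm{h}} + \dfrac{1}{100\,\mathrm{h}} + \dfrac{1}{100\,\mathrm{h}}} = \frac{0.02425}{0.0285\,\mathrm{h}^{-1}} = 0.85\,\mathrm{h}.$$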
However, the system down time differs now from the system repair time. Assuming for the redundancy an availability equal to one (for constant failure rate $\lambda = 1/\mathrm{MTTF}$, constant repair rate $\mu = 1/\mathrm{MTTR}$, and one repair crew, Table 6.6 (p. 195) gives for the 1-out-of-2 active redundancy $PA = AA = \mu (2\lambda + \mu) / (2\lambda(\lambda + \mu) + \mu^2)$, yielding $AA = 0.99995$ for this example), the system down time is defined by the elements in series on the reliability block diagram (see Point 9 in Section 6.8.8 (Eq. (6.291)) for precise considerations), thus
$$\text{mean down time at system level} = \frac{\dfrac{2\,\mathrm{h}}{500\,\mathrm{h}} + \dfrac{2.5\,\mathrm{h}}{400\,\mathrm{h}} + \dfrac{1\,\mathrm{h}}{250\,\mathrm{h}}}{\dfrac{1}{500\,\mathrm{h}} + \dfrac{1}{400\,\mathrm{h}} + \dfrac{1}{250\,\mathrm{h}}} = \frac{0.01425}{0.0085\,\mathrm{h}^{-1}} = 1.68\,\mathrm{h}.$$
Similarly to Example 4.1, the mean of the system down time during the interval $(0, t]$ follows then from
$$\lim_{t \to \infty} E[\text{total down time in } (0, t]] = t\,(1 - AA_S) \approx t \cdot \frac{\mathrm{MTTR}_S}{\mathrm{MTTF}_S} = t \cdot 1.68\,\mathrm{h} \cdot 0.0085\,\mathrm{h}^{-1} = 0.014\, t.$$
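The figures of Example 4.2 can be recomputed as follows. The sketch uses the element data of the example and the availability expression for the 1-out-of-2 active redundancy quoted above; the function names are hypothetical.

```python
def weighted_mean_repair_time(mttf, mttr):
    """Eq. (4.6) applied to the given set of elements."""
    num = sum(r / f for f, r in zip(mttf, mttr))
    den = sum(1.0 / f for f in mttf)
    return num / den

def availability_1_out_of_2(lam, mu):
    """AA = mu (2 lam + mu) / (2 lam (lam + mu) + mu^2), one repair crew."""
    return mu * (2 * lam + mu) / (2 * lam * (lam + mu) + mu ** 2)

# Series elements and the duplicated element of Example 4.2, in hours.
series_mttf = [500.0, 400.0, 250.0]
series_mttr = [2.0, 2.5, 1.0]
redundant_mttf = [100.0, 100.0]
redundant_mttr = [0.5, 0.5]

# Mean repair time: sum over all elements, as if they were in series.
mttr_s = weighted_mean_repair_time(series_mttf + redundant_mttf,
                                   series_mttr + redundant_mttr)
# Mean down time: sum over the series elements only.
down_time_s = weighted_mean_repair_time(series_mttf, series_mttr)

aa_redundancy = availability_1_out_of_2(lam=1 / 100.0, mu=1 / 0.5)
lam_series = sum(1.0 / f for f in series_mttf)
down_fraction = down_time_s * lam_series     # approximation t * MTTR_S / MTTF_S
```

The redundancy lowers the mean repair time (more frequent short repairs) but raises the mean down time per system failure, because only the slower series repairs now interrupt operation; the overall down-time fraction still drops from about 0.019 to about 0.014.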
4.4.2 Calculation of $\mathrm{MTTPM}_S$
Based on the results of Section 4.4.1, the calculation of the mean time to preventive maintenance at system level $\mathrm{MTTPM}_S$ can be performed for the following two cases:
1. Preventive maintenance is carried out at once for the entire system, one element after the other. If the system consists of elements $E_1, \ldots, E_n$ (arbitrarily grouped on the reliability block diagram) and the mean time to preventive maintenance of element $E_i$ is $\mathrm{MTTPM}_i$, then
$$\mathrm{MTTPM}_S = \sum_{i=1}^{n} \mathrm{MTTPM}_i. \qquad (4.8)$$
2. Every element $E_i$ of the system is serviced for preventive maintenance independently of all other elements and has a mean time to preventive maintenance $\mathrm{MTTPM}_i$. In this case, Eq. (4.6) can be used with $\mathrm{MTBPM}_i$ instead of $\mathrm{MTTF}_i$ and $\mathrm{MTTPM}_i$ instead of $\mathrm{MTTR}_i$, where $\mathrm{MTBPM}_i$ is the mean time between preventive maintenance for the element $E_i$.
Case 2 has a practical significance when preventive maintenance can be performed without interruption of the operation at the system level.
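The two cases can be sketched as follows. The element values below are purely hypothetical and the function names are illustrative; case 1 adds the element times, case 2 reuses the structure of Eq. (4.6) with MTBPM in place of MTTF and MTTPM in place of MTTR.

```python
def mttpm_system_sequential(mttpm):
    """Case 1: all elements serviced one after the other, times add up."""
    return sum(mttpm)

def mttpm_system_independent(mtbpm, mttpm):
    """Case 2: Eq. (4.6) with MTBPM_i for MTTF_i and MTTPM_i for MTTR_i."""
    num = sum(p / b for b, p in zip(mtbpm, mttpm))
    den = sum(1.0 / b for b in mtbpm)
    return num / den

# Hypothetical element data, in hours.
mttpm = [0.5, 1.0, 0.25]            # mean time to preventive maintenance
mtbpm = [1000.0, 2000.0, 500.0]     # mean time between preventive maintenance

case_1 = mttpm_system_sequential(mttpm)          # 1.75 h per system service
case_2 = mttpm_system_independent(mtbpm, mttpm)  # mean duration per single service
```

In case 2 the result is the mean duration of one maintenance action, weighted by how often each element is serviced, so it always lies between the smallest and the largest element value.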
4.5 Basic Models for Spare Parts Provisioning
Spare parts provisioning is important for systems with long useful life or when short repair times and/or independence from the manufacturer is required (spare part is used here e.g. for line replaceable unit (LRU)). Basically, a distinction is made between centralized and decentralized logistic support. It is also important to take into account whether spare parts are repairable or not. This section presents the basic models for the provision of nonrepairable and of repairable spare parts. For nonrepairable spare parts, the cases of centralized and decentralized logistic support are considered in order to quantify the advantage of a centralized logistic support with respect to a decentralized one. More general maintenance strategies are discussed in Section 4.6, cost specific aspects in Section 4.7.
4.5.1 Centralized Logistic Support, Nonrepairable Spare Parts

In centralized logistic support, spare parts are stocked at one place. The basic problem can be formulated as follows:
At time t = 0, the first part is put into operation, it fails at time t = τ_1 and is replaced (in a negligible time) by a second part which fails at time t = τ_1 + τ_2, and so forth; asked is the number n of parts which must be stocked in order that the requirement for parts during the cumulative operating time T is met with a given (fixed) probability γ. To answer this question, the smallest integer n must be found for which

\[
\Pr\Big\{ \sum_{i=1}^{n} \tau_i > T \Big\} \ge \gamma \tag{4.9}
\]

holds. In general, τ_1, ..., τ_n are assumed to be independent positive random variables with the same distribution function F(x), density f(x), and finite mean E[τ_i] = E[τ] = MTTF & Var[τ_i] = Var[τ]. If the number of parts is calculated from

\[
n = \frac{T}{E[\tau]} = \frac{T}{MTTF}, \tag{4.10}
\]

the requirement can only be covered (for T large) with a probability of 0.5. Thus, more than T / MTTF parts are necessary to meet the requirement with γ > 0.5. According to Eq. (A7.12), the probability as per Eq. (4.9) can be expressed by the (n - 1)th convolution of the distribution function F(t) with itself, i.e.

\[
\Pr\Big\{ \sum_{i=1}^{n} \tau_i > T \Big\} = 1 - F_n(T),
\]

with

\[
F_1(T) = F(T) \quad \text{and} \quad F_n(T) = \int_0^T F_{n-1}(T - x)\, f(x)\, dx, \quad n > 1. \tag{4.11}
\]
Of the distribution functions F(x) used in reliability theory, a closed, simple form for the function F_n(x) exists only for the exponential, gamma, and normal distribution functions, yielding a Poisson, gamma, and normal distribution, respectively. In particular, the exponential distribution F(x) = 1 - e^{-λx} leads to (Eq. (A7.39))

\[
\Pr\Big\{ \sum_{i=1}^{n} \tau_i > T \Big\} = \sum_{i=0}^{n-1} \frac{(\lambda T)^i}{i!}\, e^{-\lambda T} \ge \gamma . \tag{4.12}
\]
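The important case of the Weibull distribution F(x) = 1 - e^{-(λx)^β} must be solved numerically. Figure 4.3 shows the results with γ and β as parameters [4.2 (1974)]. For large values of n, an approximate solution for a wide class of distribution functions F(x) can be obtained using the central limit theorem. From Eq. (A6.148) it follows that (for Var[τ] < ∞)

\[
\lim_{n\to\infty} \Pr\Big\{ \frac{\sum_{i=1}^{n} \tau_i - nE[\tau]}{\sqrt{n\,\mathrm{Var}[\tau]}} > x \Big\} = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-y^2/2}\, dy
\]

and thus, using x √(n Var[τ]) + n E[τ] = T,

\[
\lim_{n\to\infty} \Pr\Big\{ \sum_{i=1}^{n} \tau_i > T \Big\} = \frac{1}{\sqrt{2\pi}} \int_{\frac{T - nE[\tau]}{\sqrt{n\,\mathrm{Var}[\tau]}}}^{\infty} e^{-y^2/2}\, dy . \tag{4.13}
\]

Setting

\[
\frac{T - nE[\tau]}{\sqrt{n\,\mathrm{Var}[\tau]}} = -d \tag{4.14}
\]

it follows that

\[
n = \Big[ \frac{d\,\kappa}{2} + \sqrt{\Big(\frac{d\,\kappa}{2}\Big)^2 + \frac{T}{E[\tau]}}\; \Big]^2 , \tag{4.15}
\]

with

\[
\kappa = \frac{\sqrt{\mathrm{Var}[\tau]}}{E[\tau]} \quad \text{and} \quad E[\tau] = MTTF .
\]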
Figure 4.3 Number of parts (n) which are necessary to cover a total cumulative operating time T with a probability ≥ γ, i.e. smallest n for which Pr{τ_1 + ... + τ_n > T} ≥ γ holds, with Pr{τ_i ≤ x} = 1 - e^{-(λx)^β} and MTTF = Γ(1 + 1/β) / λ (abscissa: T / MTTF; dashed are the results given by the central limit theorem as per Eq. (4.15), β = 1 yields the exponential distribution function)
Figure 4.4 Coefficient of variation for the Weibull distribution for 1 ≤ β ≤ 3
From Eqs. (4.13) and (4.14) one recognizes that d is the γ quantile of the standard normal distribution (1 - Φ(-d) = Φ(d) = γ), yielding (Table A9.1), for example, d ≈ 1.28 for γ = 0.90, d ≈ 1.64 for γ = 0.95, and d ≈ 2.33 for γ = 0.99.

Equation (4.15) gives for γ ≤ 0.95 a good approximation of the number of parts n down to low values of n (see e.g. Fig. 4.3). κ = √Var[τ] / E[τ] is the coefficient of variation (κ = 1 for the exponential distribution and κ = √( Γ(1 + 2/β) / (Γ(1 + 1/β))² - 1 ) for the Weibull distribution (Fig. 4.4)). For the case of a Weibull distribution with β ≥ 1, approximate values for n obtained using the central limit theorem (Eq. (4.15)) are shown dashed in Fig. 4.3. For β = 1, deviation from the exact value is < 1.3 for γ ≤ 0.95 and n ≥ 5; this deviation drops off rapidly for increasing values of β (F_n(x) already approaches a normal distribution for small n). From Eq. (4.14) one recognizes that for γ = 0.5, T - nE[τ] = 0 and thus, for n large, n = T / E[τ] (Eq. (4.10)).

Let us now consider the case in which the same part occurs k times in the system. For F(x) = 1 - e^{-λx}, Eqs. (4.12) - (4.15) hold with

\[
k\lambda \tag{4.16}
\]

instead of λ. This is because the sum of independent Poisson processes is a Poisson process (Eq. (7.27)) and k parts must be operating for the required function (see also Point 2 on p. 131). The same holds if l systems use the same part, one or more per system with total k parts of the same type, and storage is centralized (Example 4.3). Considering that k parts are available at t = 0 (operating at t = 0), it is reasonable to define as number of spare parts n_sp the quantity

\[
n_{sp} = n - k , \tag{4.17}
\]

where n is the number of parts obtained from Eqs. (4.12) - (4.16); see Examples 4.3 and 4.4 for two practical applications.
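The central limit approximation and the Weibull coefficient of variation are easy to evaluate numerically. The sketch below assumes the closed form n = [dκ/2 + √((dκ/2)² + T/MTTF)]² for Eq. (4.15), rounded up to the next integer, and uses the standard library normal quantile in place of Table A9.1; function names are illustrative.

```python
import math
from statistics import NormalDist


def weibull_cv(beta):
    """Coefficient of variation of a Weibull distribution with shape parameter beta."""
    g1 = math.gamma(1.0 + 1.0 / beta)
    g2 = math.gamma(1.0 + 2.0 / beta)
    return math.sqrt(g2 / g1**2 - 1.0)


def parts_clt(t_over_mttf, gamma, kappa=1.0):
    """Number of parts from the central limit approximation (assumed form of Eq. (4.15))."""
    d = NormalDist().inv_cdf(gamma)  # gamma quantile of the standard normal distribution
    root = d * kappa / 2.0 + math.sqrt((d * kappa / 2.0) ** 2 + t_over_mttf)
    return math.ceil(root**2)


for beta in (1.0, 2.0, 3.0):
    print(f"beta = {beta}: kappa = {weibull_cv(beta):.3f}")

# Exponential case (kappa = 1) with T/MTTF = 30 and gamma = 0.90
print(parts_clt(30.0, 0.90))
```

For β = 1 the coefficient of variation is 1, and it decreases for larger β, which is why the normal approximation improves as β grows.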
Example 4.3 A part with a constant failure rate λ = 10^{-3} h^{-1} is used three times in a system (k = 3). Give the number of spare parts n_sp which must be stored to cover a cumulative operating time T = 10,000 h with a probability γ ≥ 0.90.

Solution Considering kλT = 30, the exact solution is given by the smallest integer n_sp = n - 3 for which

\[
\sum_{i=0}^{n-1} \frac{30^{\,i}}{i!}\, e^{-30} \ge 0.9
\]

holds (Eq. (4.12)). From Table A9.2 it follows, for q = 1 - 0.9 = 0.1 and t_{ν,q} = 2 · 30 = 60, the value ν = 75.2 (lin. interpolation); thus, ν = 76 and (Appendix A9.2 & Eq. (4.12)) n = ν / 2 = 38 (the same result is obtained with Fig. 7.3 for m = 30, or with Eq. (4.15) for κ = 1 and d = 1.28, yielding n = 38). Considering that 3 parts are operating at t = 0, it follows that (Eq. (4.17)) n_sp = 38 - 3 = 35.
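The exact solution of this example can also be found by direct summation of the Poisson terms instead of a χ² table. The following minimal sketch assumes the data of Example 4.3 and the Poisson sum of Eq. (4.12); the helper names are illustrative.

```python
import math


def poisson_cdf(mean, c):
    """Pr{at most c failures} for a Poisson distributed number of failures."""
    term = math.exp(-mean)
    total = term
    for i in range(1, c + 1):
        term *= mean / i
        total += term
    return total


def parts_needed(mean_failures, gamma):
    """Smallest n for which Pr{at most n - 1 failures} >= gamma."""
    n = 1
    while poisson_cdf(mean_failures, n - 1) < gamma:
        n += 1
    return n


k, lam, T, gamma = 3, 1e-3, 10_000.0, 0.90   # data of Example 4.3
n = parts_needed(k * lam * T, gamma)
n_sp = n - k                                  # k parts are already operating at t = 0

print(n, n_sp)
```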
4.5.2 Decentralized Logistic Support, Nonrepairable Spare Parts

For users who have the same system located at different places, spare parts are often stored decentralized, i.e. separately at each location (decentralized means that spare parts cannot be transferred from one location to another location). If there are l systems, each with a given part, and the storage of spare parts is decentralized at each system (or location), a first approach could be to store with each system the same number of spare parts obtained using Eqs. (4.9) and (4.17). In this case, the total number of parts would be n · l ((n - k) · l spare parts). This number of parts, which would be sufficient to meet, with a probability > γ (often >> γ), the needs of the l systems with a centralized storage (Example 4.4), would now in general be too small to meet all the individual needs at each location. In fact, assuming that failures at each location are independent, and that with n parts the probability of meeting the needs at any location individually is γ, then the probability of meeting the need at all locations is γ^l. Thus, to meet the need at the l locations with a probability γ,

\[
n_{dec} = l\, n_l \tag{4.18}
\]

parts are required, where n_l is computed for each location individually with

\[
\gamma_l = \sqrt[l]{\gamma} , \tag{4.19}
\]

e.g. using Eq. (4.15) with d_l instead of d (Φ(d) = γ, Φ(d_l) = γ^{1/l}). To make a comparison between a centralized and a decentralized logistic support, let us assume that the part considered appears k times in each of the l locations, has constant failure rate λ, and kλT >> d_l²/2 > d²/2 holds. In this case, Eqs. (4.15) & (4.16) lead to

\[
n = k\lambda T + d\,\sqrt{k\lambda T}, \qquad k\lambda T \gg d^2/2, \quad k = 1, 2, \ldots, \quad \text{probability } \gamma . \tag{4.20}
\]
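For centralized logistic support, Eq. (4.20) yields

\[
n_{cen} = l\,k\lambda T + d\,\sqrt{l\,k\lambda T}, \qquad l\,k\lambda T \gg d^2/2, \quad k, l = 1, 2, \ldots, \quad \text{probability } \gamma . \tag{4.21}
\]

For decentralized logistic support, Eq. (4.20) yields

\[
n_{dec} = l\,\big(k\lambda T + d_l\,\sqrt{k\lambda T}\,\big), \qquad k\lambda T \gg d_l^2/2, \quad k, l = 1, 2, \ldots, \quad \text{probability } \gamma , \tag{4.22}
\]

where d_l is obtained as for d in Eq. (4.15) with γ_l = γ^{1/l} instead of γ (for example, d = 1.64 for γ = 0.95 and d_l = 2.57 for l = 10, i.e. for γ_l = 0.9949, see Table A9.1). From the above considerations it follows that for kλT >> d_l²/2 > d²/2

\[
\frac{n_{dec}}{n_{cen}} \approx \frac{1 + d_l/\sqrt{k\lambda T}}{1 + d/\sqrt{l\,k\lambda T}}
\qquad \text{and} \qquad
\frac{n_{sp_{dec}}}{n_{sp_{cen}}} \approx \frac{k\lambda T - k + d_l\,\sqrt{k\lambda T}}{k\lambda T - k + d\,\sqrt{k\lambda T / l}} , \tag{4.23}
\]

with Φ(d) = γ & Φ(d_l) = γ^{1/l} (see Example 4.4). Setting λT = T / MTTF, Eq. (4.23) can be used for arbitrary distribution of the failure-free time of the spare parts.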
Example 4.4 Let λ = 10^{-4} h^{-1} be the constant failure rate of a part in a given system. The user has 6 locations (l = 6) and would like to achieve a cumulative operating time T = 50,000 h at each location with a probability γ ≥ 0.95. How many spare parts could be saved if the user would store all spare parts at the same location (centralized logistic support)?

Solution From Fig. 4.3 (T / MTTF = 5, γ_l = 0.95^{1/6} ≈ 0.99), Fig. 7.3 (m = 5, γ = 0.99, c = n_l - 1), or from a χ² table (t_{ν,q} = 10, q = 1 - 0.99 = 0.01, ν = 2 n_l) each user would need n_l = 12 parts (n_l = 14 using Eq. (4.15) with d = d_l = 2.33 and λT = 5); thus n_dec = 6 · 12 = 72 parts and (Eq. (4.17)) n_sp,dec = 6 · 11 = 66 spare parts. Combining the storage (l = 6), it follows from Fig. 7.3 (m = 30, γ = 0.95, c = n_cen - 1) or from Table A9.2 (t_{ν,q} = 60, q = 0.05, ν = 2 n_cen) that n_cen = 40 (n_cen = 41 using Eq. (4.15) with d = 1.64 and λT = 30); thus, n_sp,cen = 40 - 6 = 34. A centralized storage would save 66 - 34 (or 72 - 40) = 32 spare parts (Eq. (4.23) gives 1.57 instead of 1.8 (left) and 1.67 instead of 1.94 (right), because kλT = 5 is not >> d_l²/2 = 2.71).

Supplementary result: Provisioning independently for each location with γ = 0.95 yields n_l = 10 (Fig. 4.3) and thus n = 6 · 10 = 60.
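The comparison between centralized and decentralized storage in Example 4.4 can be reproduced by summing Poisson terms directly. This is a minimal sketch under the assumptions of the example (constant failure rate, one part per location, independent locations); names such as `parts_needed` are illustrative.

```python
import math


def poisson_cdf(mean, c):
    """Pr{at most c failures} for a Poisson distributed number of failures."""
    term = math.exp(-mean)
    total = term
    for i in range(1, c + 1):
        term *= mean / i
        total += term
    return total


def parts_needed(mean_failures, gamma):
    """Smallest n for which Pr{at most n - 1 failures} >= gamma."""
    n = 1
    while poisson_cdf(mean_failures, n - 1) < gamma:
        n += 1
    return n


locations, k, lam, T, gamma = 6, 1, 1e-4, 50_000.0, 0.95   # data of Example 4.4

# Decentralized: each location must reach gamma**(1/locations) on its own
gamma_l = gamma ** (1.0 / locations)
n_l = parts_needed(k * lam * T, gamma_l)
n_dec = locations * n_l
n_sp_dec = locations * (n_l - k)

# Centralized: one common stock serves all locations
n_cen = parts_needed(locations * k * lam * T, gamma)
n_sp_cen = n_cen - locations * k

print(n_l, n_dec, n_sp_dec)                 # per location, total, spare parts
print(n_cen, n_sp_cen)                      # centralized total, spare parts
print("saved:", n_sp_dec - n_sp_cen)
```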
4.5.3 Repairable Spare Parts

In Sections 4.5.1 and 4.5.2 it was assumed that the spare parts (LRUs) were nonrepairable, i.e. that a new spare part was necessary at each failure. In many cases, spare parts can be repaired and then stored for reuse. Calculation of the number of spare parts which should be stored can be performed in a way similar to the investigation of a k-out-of-n standby redundancy, where k is the number of parts used in the system (as in Eq. (4.17)) and n is the smallest integer to be determined
such that the requirement is met with a given (fixed) probability γ. The following two cases have to be considered:

1. γ is the probability that a request for a spare part at a time point t can be met without time delay; in this case, γ can be considered as the point availability PA_S (in steady-state to simplify investigations) and n is the smallest integer such that PA_S ≥ γ for a given (fixed) γ.

2. γ is the probability that any request for a spare part during the time interval (0, t] will be met without time delay; in this case, γ can be considered as the reliability function R_S0(t) and n is the smallest integer such that R_S0(t) ≥ γ for given (fixed) γ and t.

If the spare parts have a constant failure rate λ = 1 / MTTF and a constant repair rate μ = 1 / MTTR, birth-and-death processes can be used (Section A7.5.5). To simplify investigations, it is assumed that only one spare part at a time can be repaired (only 1 repair crew is available) and no further failures are considered when a request for a spare part cannot be met (corresponds to the assumption no further failure at system down (Fig. 6.13)). For Case 1 above, Eq. (6.138) with λ_r = 0 and Eq. (6.140) yield

\[
PA_S = \sum_{j=0}^{n-k} P_j = 1 - P_{n-k+1} \ge \gamma , \tag{4.24}
\]

with

\[
P_j = \frac{(k\lambda/\mu)^j}{\sum_{i=0}^{n-k+1} (k\lambda/\mu)^i}, \qquad j = 0, \ldots, n-k+1 . \tag{4.25}
\]

Sought is the smallest integer n which satisfies Eq. (4.24) for given (fixed) γ, k, λ, and μ. Often n = k + 1 (one spare part) or n = k + 2 (two spare parts) will be sufficient. In these cases, results of Table 6.8 yield

\[
PA_{S1} \approx 1 - (k\lambda/\mu)^2, \qquad n_{sp} = n - k = 1 \text{ spare part, 1 repair crew, Case 1,} \tag{4.26}
\]

\[
PA_{S2} \approx 1 - (k\lambda/\mu)^3, \qquad n_{sp} = n - k = 2 \text{ spare parts, 1 repair crew, Case 1.} \tag{4.27}
\]

If PA_S2 is still < γ, more than 2 spare parts are necessary. A good approximation for the number n_sp of spare parts can be obtained using the smallest integer n_sp = n - k satisfying (Table 6.8)

\[
PA_{S n_{sp}} \approx 1 - (k\lambda/\mu)^{n_{sp}+1} \ge \gamma, \qquad n_{sp} = n - k \text{ spare parts, 1 repair crew, Case 1.} \tag{4.28}
\]
Using results of Appendix A7.5.5 (Eq. (A7.157)) and considering kλ << μ, it can be shown that the approximations given by Eqs. (4.26) - (4.28) hold also if the assumption "no further failures are considered when a request for a spare part cannot be met" is not made. The case in which n_sp + 1 repair crews are available (instead of 1 repair crew) is considered by Eq. (4.32) for comparative investigations.

For Case 2 above, the reliability function can be approximated by an exponential function (Eq. (6.93)), yielding (Eqs. (6.144) & (6.145) with ν_i = kλ)

\[
R_{S0\,1}(t) \approx e^{-t\,(k\lambda)^2/\mu}, \qquad n_{sp} = n - k = 1 \text{ spare part, 1 repair crew, Case 2,} \tag{4.29}
\]

\[
R_{S0\,2}(t) \approx e^{-t\,(k\lambda)^3/\mu^2}, \qquad n_{sp} = n - k = 2 \text{ spare parts, 1 repair crew, Case 2.} \tag{4.30}
\]

If R_S02(t), with t as mission time, is still < γ, more than 2 spare parts are necessary. A good approximation for the number n_sp of spare parts can be obtained using the smallest integer n_sp = n - k satisfying (Table 6.8)

\[
R_{S0\,n_{sp}}(t) \approx e^{-t\,\mu\,(k\lambda/\mu)^{n_{sp}+1}} \ge \gamma, \qquad n_{sp} = n - k \text{ spare parts, 1 repair crew, Case 2.} \tag{4.31}
\]

For Eqs. (4.29) to (4.31) it holds necessarily that no further failures are considered when a request for a spare part cannot be met (system down states are made absorbing for reliability calculations). The case in which n_sp repair crews are available is considered by Eq. (4.33) for comparative investigations.
Example 4.5 A system contains k = 100 identical parts (LRUs) with a constant failure rate λ = 10^{-5} h^{-1} and which can be repaired with a constant repair rate μ = 10^{-1} h^{-1}. (i) Give the number of spare parts which must be stored in order to meet without any time delay and with a probability γ ≥ 0.99 a request for a spare part at a time point t (consider the steady-state only, one repair crew, and no further failure when a request for a spare part cannot be met). (ii) If one spare part is stored (n = k + 1), how large is the probability that any request for a spare part during the time interval (0, 10^4 h] will be met without any time delay?

Solution (i) Taking n = k + 1 = 101, Eq. (4.26) yields

\[
PA_{S1} \approx 1 - (10^{-3}/10^{-1})^2 = 1 - 10^{-4} \ge 0.99 .
\]

Thus only one spare part (n_sp = 1) must be stored.

(ii) For n = k + 1, Eq. (4.29) yields R_S01(t) ≈ e^{-0.00001 t} and thus R_S01(10^4 h) ≈ e^{-0.1} ≈ 0.905.

Supplementary result: To reach R_S(10^4 h) ≥ 0.99 one needs n_sp = 2 spare parts (R_S02(10^4 h) ≈ 0.999).
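The numbers of Example 4.5 can be verified with a short script. The sketch below assumes the birth-and-death model described in this section (failure rate kλ in every up state, one repair crew with rate μ, no further failures at system down) for the steady-state value, and the exponential approximation e^{-tμ(kλ/μ)^{n_sp+1}} for Case 2; function names are illustrative.

```python
import math


def pa_one_crew(k, lam, mu, n_sp):
    """Steady-state probability that a request can be met (states 0 .. n_sp are up)."""
    r = k * lam / mu
    weights = [r**j for j in range(n_sp + 2)]   # unnormalized state probabilities
    return 1.0 - weights[-1] / sum(weights)


def pa_one_crew_approx(k, lam, mu, n_sp):
    """Approximation 1 - (k*lam/mu)**(n_sp + 1), valid for k*lam << mu."""
    return 1.0 - (k * lam / mu) ** (n_sp + 1)


def reliability_one_crew_approx(k, lam, mu, n_sp, t):
    """Approximate probability that every request in (0, t] is met without delay."""
    return math.exp(-t * mu * (k * lam / mu) ** (n_sp + 1))


k, lam, mu = 100, 1e-5, 1e-1                    # data of Example 4.5

print(pa_one_crew(k, lam, mu, 1), pa_one_crew_approx(k, lam, mu, 1))
print(reliability_one_crew_approx(k, lam, mu, 1, 1e4))   # one spare part
print(reliability_one_crew_approx(k, lam, mu, 2, 1e4))   # two spare parts
```

With kλ/μ = 0.01, exact and approximate steady-state values differ only in the sixth decimal place, which illustrates why the simple approximation is usually sufficient.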
Assuming for comparative investigations that each of the n_sp = n - k spare parts can be repaired independently from each other (n_sp + 1 repair crews, no further failures when a request for a spare part cannot be met), results of Section A7.5.5, with ν_i = kλ, i = 0, ..., n - k, and θ_i = iμ, i = 1, ..., n - k + 1, yield (see also Eq. (6.149))

\[
PA_{S n_{sp}} \approx 1 - \frac{(k\lambda/\mu)^{n_{sp}+1}}{(n_{sp}+1)!}, \qquad n_{sp} = n - k \text{ spare parts, } n_{sp}+1 \text{ repair crews, Case 1,} \tag{4.32}
\]

and, with ν_i as before and θ_i = iμ, i = 1, ..., n - k,

\[
R_{S0\,n_{sp}}(t) \approx e^{-t\,\mu\,(k\lambda/\mu)^{n_{sp}+1}/\,n_{sp}!}, \qquad n_{sp} = n - k \text{ spare parts, } n_{sp} \text{ repair crews, Case 2.} \tag{4.33}
\]
Using results of Appendix A7.5.5 (Eq. (A7.157)) and considering kλ << μ, it can be shown that the approximation given by Eq. (4.32) holds also if the assumption "no further failures are considered when a request for a spare part cannot be met" is not made. For Eq. (4.33) it holds necessarily that no further failures are considered when a request for a spare part cannot be met (system down states are made absorbing for reliability calculations).

Generalization of the repair rate leads to semi-regenerative processes with n - k + 1 regeneration and n - k nonregeneration states (Sections 6.4.2 and 6.5.2, Appendix A7.7). For instance, assuming for the repair time a density g(t), a mean MTTR, and a variance Var[τ'], Eq. (6.109) with kλ instead of λ and λ_r = 0 (see remark on p. 490) and Eq. (6.113) for g̃(λ) lead to Eq. (4.34) for

n_sp = n - k = 1 spare part, 1 repair crew, Case 1.

Similarly, Eq. (6.107) with kλ instead of λ and λ_r = 0 and Eq. (6.114) lead to Eq. (4.35) for

n_sp = n - k = 1 spare part, 1 repair crew, Case 2.

The last approximation in Eq. (4.34) assumes for the coefficient of variation κ of the repair time a condition (Eq. (4.36)) which holds for most of the distribution functions used for repair times (Fig. 4.4). Assuming MTTR = 1/μ, i.e. the same mean time to repair regardless of the distribution of the repair time, the last approximations in Eqs. (4.34) and (4.35)
yield the same result as given by Eqs. (4.26) and (4.29). This shows the small influence of the repair time distribution on results at system level. The last approximation in Eq. (4.35) is obtained by assuming kλ MTTR << 1, i.e. using g̃(kλ) ≈ 1 - kλ MTTR (Eq. (6.114)). For the approximation in Eq. (4.34) it was necessary to use g̃(kλ) ≈ 1 - kλ MTTR + (kλ)² (MTTR² + Var[τ']) / 2 (Eq. (6.113)). Taking R_S(t) ≈ e^{-t / MTTF_S} in Eqs. (4.31), (4.33) & (4.35), and PA_S as in Eqs. (4.28), (4.32) & (4.34), PA_S can be expressed as (Eq. (A7.189))

\[
PA_S \approx 1 - \frac{MTTR_S}{MTTF_S} , \tag{4.37}
\]

with MTTR_S = 1/μ, MTTR_S = 1/((n - k + 1) μ) & MTTR_S = MTTR, respectively.

The results of Sections 4.5.1 to 4.5.3, in particular those of Section 4.5.2 on decentralized logistic support, can be extended to cover the more general case of systems with different spare parts.
4.6 Repair Strategies
Repair (restoration) strategies can be very different according to the objective to be reached (choice between block and age replacement, minimization of the number of spare parts or of the down time at system level, maximization of the availability by given cost and/or logistic support constraints, etc.). In addition to the considerations of Section 4.5 on spare parts provisioning, this section deals with some basic repair strategies from a system performance point of view. Specific cost aspects are considered in Section 4.7.

In order to avoid wearout failures, replacement can be performed at a given (fixed) operating time θ or at failure if the operating time is smaller than θ (age replacement). Assuming that after replacement the system is as-good-as-new, each replacement is a renewal point for the underlying process. Fig. 4.5a gives a possible time schedule for this case. If F(x) is the distribution function of the involved failure-free time τ, results of Appendix A7.2 for renewal processes and of Section 4.5 for spare parts provisioning can be used, taking for the failure-free time τ the truncated distribution function F_θ(x)
\[
F_\theta(x) = \begin{cases} F(x) & \text{for } 0 \le x < \theta \\ 1 & \text{for } x \ge \theta , \end{cases} \tag{4.38}
\]

instead of F(x). In particular, for F(x) = Pr{τ ≤ x} = 1 - e^{-λx} it holds that

\[
MTTF = E[\tau] = \int_0^{\theta} (1 - F(x))\, dx = \frac{1 - e^{-\lambda\theta}}{\lambda} \tag{4.39}
\]

and

\[
\mathrm{Var}[\tau] = \sigma^2 = E[\tau^2] - E^2[\tau] = \frac{1}{\lambda^2}\,\big(1 - e^{-2\lambda\theta}\big) - \frac{2\theta}{\lambda}\, e^{-\lambda\theta} \tag{4.40}
\]

(Eqs. (A6.38), (A6.41), (A6.45)). For the number of replacements ν(t) in the interval (0, t] it follows in particular that (Eq. (A7.34))

\[
\lim_{t\to\infty} \Pr\Big\{ \frac{\nu(t) - t/MTTF}{\sigma\,\sqrt{t/MTTF^{3}}} \le x \Big\} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-y^2/2}\, dy , \tag{4.41}
\]

with MTTF and σ from Eqs. (4.39) and (4.40) for the case of constant failure rate.

Figure 4.5 Possible time schedules for a repairable system with preventive maintenance: a) After θ operating hours or at failure; b) Only at fixed times θ, 2θ, ... (x starts by x = 0 at each renewal point)

A further possibility is to perform a replacement only at times θ, 2θ, ..., accepting that if there is a failure between kθ and (k + 1)θ the system is down from the failure occurrence up to the time (k + 1)θ. Figure 4.5b shows a possible time schedule. If ν(nθ) is the number of failures in the time interval (0, nθ], the probability to have ν(nθ) = k is given by the binomial distribution (Eq. (A6.120) with p = F(θ))

\[
\Pr\{\nu(n\theta) = k\} = \binom{n}{k}\, F(\theta)^k\, \big(1 - F(\theta)\big)^{n-k}, \qquad k = 0, \ldots, n . \tag{4.42}
\]
Mean and variance of the number of failures in (0, nθ] are then given by (Eqs. (A6.122) and (A6.123))

\[
E[\nu(n\theta)] = n\,F(\theta) \qquad \text{and} \qquad \mathrm{Var}[\nu(n\theta)] = n\,F(\theta)\,\big(1 - F(\theta)\big) . \tag{4.43}
\]
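Both replacement strategies can be evaluated numerically for a constant failure rate. The sketch below assumes the closed forms (1 - e^{-λθ})/λ and (1 - e^{-2λθ})/λ² - 2θe^{-λθ}/λ for mean and variance of the failure-free time truncated at θ (Eqs. (4.39) and (4.40)), and a binomial number of failures for replacement at fixed times; all numerical values are hypothetical.

```python
import math


def truncated_mean(lam, theta):
    """Mean of min(tau, theta) for an exponentially distributed tau."""
    return (1.0 - math.exp(-lam * theta)) / lam


def truncated_variance(lam, theta):
    """Variance of min(tau, theta) for an exponentially distributed tau."""
    return ((1.0 - math.exp(-2.0 * lam * theta)) / lam**2
            - 2.0 * theta * math.exp(-lam * theta) / lam)


def block_replacement_pmf(n, lam, theta):
    """Pr{k failures in (0, n*theta]} for k = 0..n, replacement only at theta, 2*theta, ..."""
    p = 1.0 - math.exp(-lam * theta)            # F(theta)
    return [math.comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(n + 1)]


lam, theta, n = 1e-3, 500.0, 10                 # hypothetical values (h^-1, h, intervals)

print(truncated_mean(lam, theta), truncated_variance(lam, theta))

pmf = block_replacement_pmf(n, lam, theta)
mean_failures = sum(k * pk for k, pk in enumerate(pmf))
print(mean_failures)                            # should agree with n * F(theta)
```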
If the age replacement is too expensive, a further strategy is to assume that at times θ, 2θ, ... the system is inspected, but a replacement at the time (k + 1)θ is performed only if a failure has occurred between kθ and (k + 1)θ. If the failure-free time τ has distribution function F(x), the replacement time τ_repl has distribution

\[
\Pr\{\tau_{repl} = k\theta\} = F(k\theta) - F((k-1)\theta), \qquad k = 1, 2, \ldots . \tag{4.44}
\]
This case has been investigated in [6.16] with cost considerations. If

c_1 = inspection cost (c_1 > 0),
c_1 + c_2 = cost for inspection and replacement (c_2 > 0),
c_3 = cost for unit of time (h) in which the system is down waiting for replacement (c_3 > 0),

the total cost C for unit time is for t → ∞ given by

\[
C = \frac{c_1\, E[\tau_{repl}]/\theta + c_2 + c_3\,\big(E[\tau_{repl}] - MTTF\big)}{E[\tau_{repl}]} , \tag{4.45}
\]
where MTTF = E[τ]. For θ → ∞, E[τ_repl] → ∞ and C → c_3; thus, inspections are useful for C < c_3. For given F(x) it is possible to find a θ which minimizes C.

For the mission availability and work-mission availability, as defined by Eqs. (6.28) and (6.31), it can be asked in some applications that the number of repairs (replacements) be limited to N (e.g. because just N - 1 spare parts are available). In this case, the summation in Eqs. (6.29) and (6.32) goes up to n = N. If k elements E_1, ..., E_k with constant failure rates λ_1, ..., λ_k and constant repair rates μ_1, ..., μ_k are in series, a good approximation for the work-mission availability with limited repairs is obtained by multiplying the probability for total system down time ≤ x for unlimited repairs (Eq. (7.22) with λ = λ_S and μ = μ_S from Table 6.10 (2nd row)) with the k probabilities that N_i - 1 spare parts will be sufficient for element E_i [6.10] (similar as for Eq. (4.19)).

A strategy can also be based on the repair time τ' itself. Assuming for example that if the repair is not finished at time Δ the failed element is replaced (at time Δ) by a new equivalent one in a negligible time, the distribution function G(x) of the repair (restoration) times τ' is truncated at Δ (Eq. (4.38) with Δ instead of θ). For the case of constant repair rate μ, the Laplace transform of G(x) to be used in reliability or availability computations is given by Eq. (4.46) (Appendix A9.7).
However, a truncated distribution function breaks the memoryless property and must thus be treated like a general distribution function, leading to semi-regenerative processes (Appendix A7.7 and Sections 6.4.2 and 6.5.2).
4.7 Cost Considerations
Cost considerations are important in practical applications and apply in particular to spare parts provisioning (Section 4.5) and maintenance strategies (Section 4.6). In the following, two basic models based on homogeneous Poisson processes (HPP) with fixed and random cost are discussed.
As a first example consider the case in which a constant cost c_0 is related to each repair (renewal) of a given item. Assuming that repair duration is negligible and times between successive failures are independent and exponentially distributed with parameter λ, the failure flow is a homogeneous Poisson process and the probability for n failures during the operating time t (ν(t) = n) is given by (Eq. (A7.41))

\[
\Pr\{\nu(t) = n\} = \frac{(\lambda t)^n}{n!}\, e^{-\lambda t}, \qquad n = 0, 1, \ldots . \tag{4.47}
\]

Eq. (4.47) is also the probability that the cumulated repair cost over t is C = n c_0. Mean and variance of C are (Eqs. (A6.40) and (A6.46) with Eq. (A7.42))

\[
E[C] = c_0\,\lambda t \qquad \text{and} \qquad \mathrm{Var}[C] = c_0^2\,\lambda t . \tag{4.48}
\]
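The fixed-cost model can be checked by summing over the Poisson probabilities. This is a minimal sketch with hypothetical values for c_0, λ, and t; it assumes the Poisson probability (λt)^n e^{-λt}/n! for n failures in the operating time t.

```python
import math


def prob_n_failures(lam, t, n):
    """Probability of exactly n failures in the operating time t for an HPP with rate lam."""
    return (lam * t) ** n * math.exp(-lam * t) / math.factorial(n)


def cost_mean_variance(c0, lam, t):
    """Mean and variance of the cumulated repair cost C = n * c0."""
    return c0 * lam * t, c0**2 * lam * t


c0, lam, t = 200.0, 1e-3, 5000.0        # hypothetical cost per repair, rate (h^-1), time (h)

mean_cost, var_cost = cost_mean_variance(c0, lam, t)

# Cross-check by direct summation over the number of failures
probs = [prob_n_failures(lam, t, n) for n in range(100)]
mean_by_sum = sum(n * c0 * p for n, p in enumerate(probs))
var_by_sum = sum((n * c0 - mean_by_sum) ** 2 * p for n, p in enumerate(probs))

print(mean_cost, var_cost)
print(mean_by_sum, var_by_sum)
```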
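For large λt, C is approximately normally distributed with mean and variance as per Eq. (4.48).

If the repair cost is a random variable ξ_i > 0 distributed according to F(x) = Pr{ξ_i ≤ x} (i = 1, 2, ...), ν(t) the count function giving the number of failures in the operating time interval (0, t], and ξ_t the sum of the ξ_i over (0, t], it holds that (Eq. (A7.218))

\[
\xi_t = \begin{cases} \sum_{i=1}^{\nu(t)} \xi_i & \text{for } \nu(t) > 0 \\ 0 & \text{for } \nu(t) = 0 . \end{cases} \tag{4.49}
\]

ξ_t is distributed as the (cumulative) repair time for failures occurred in a total operating time t of a repairable item, and is given by the work-mission availability (Eq. (6.32) with T_0 = t). Assuming that the failure flow is a homogeneous Poisson process (HPP) with parameter λ and all ξ_i are independent of ν(t) and have the same exponential distribution with parameter μ, Eq. (6.32) with constant failure and repair rates λ(x) = λ and μ(x) = μ and T_0 = t yields (Eq. (A7.219))

\[
\Pr\{\xi_t \le x\} = 1 - e^{-(\lambda t + \mu x)} \sum_{n=1}^{\infty} \frac{(\lambda t)^n}{n!} \sum_{k=0}^{n-1} \frac{(\mu x)^k}{k!} . \tag{4.50}
\]

Mean and variance of ξ_t follow as (Eq. (A7.220), see also Eqs. (4.50), (A6.38), (A6.45), (A6.41))

\[
E[\xi_t] = \frac{\lambda t}{\mu} \qquad \text{and} \qquad \mathrm{Var}[\xi_t] = \frac{2\lambda t}{\mu^2} . \tag{4.51}
\]

Furthermore, for t → ∞ the distribution of ξ_t approaches a normal distribution with mean and variance as per Eq. (4.51). Moments of ξ_t can also be obtained for arbitrary F(x) = Pr{ξ_i ≤ x}, with F(0) = 0 (Example A7.14)

\[
E[\xi_t] = E[\nu(t)]\,E[\xi_i] \qquad \text{and} \qquad \mathrm{Var}[\xi_t] = E[\nu(t)]\,\mathrm{Var}[\xi_i] + \mathrm{Var}[\nu(t)]\,E^2[\xi_i] . \tag{4.52}
\]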
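The distribution of the cumulative random cost can be evaluated numerically as a Poisson mixture of Erlang distributions. The sketch below assumes a homogeneous Poisson failure flow with rate λ and independent, exponentially distributed costs with parameter μ; the truncation index `n_max` and all numerical values are hypothetical choices for illustration.

```python
import math


def cumulative_cost_cdf(lam, t, mu, x, n_max=200):
    """Pr{cumulative cost over (0, t] <= x} for Poisson failures and exponential costs."""
    m = lam * t
    p_n = math.exp(-m)          # probability of n = 0 failures (cost is then 0)
    total = p_n
    for n in range(1, n_max + 1):
        p_n *= m / n            # Poisson probability of exactly n failures
        # Probability that the sum of n exponential costs exceeds x (Erlang tail)
        term = math.exp(-mu * x)
        tail = term
        for k in range(1, n):
            term *= mu * x / k
            tail += term
        total += p_n * (1.0 - tail)
    return total


def cumulative_cost_moments(lam, t, mu):
    """Mean lam*t/mu and variance 2*lam*t/mu**2 of the cumulative cost."""
    return lam * t / mu, 2.0 * lam * t / mu**2


lam, t, mu = 1e-3, 5000.0, 0.01     # hypothetical: 5 expected failures, mean cost 100 each

print(cumulative_cost_moments(lam, t, mu))
for x in (0.0, 100.0, 500.0, 2000.0):
    print(x, round(cumulative_cost_cdf(lam, t, mu, x), 4))
```

At x = 0 the result reduces to the probability of no failure in (0, t], which is a convenient sanity check.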
Of interest in some practical applications can also be the distribution of the time τ_C at which the cumulative cost ξ_t crosses a given (fixed) barrier C. For the case given by Eq. (4.50), i.e. in particular for ξ_i > 0, the events

\[
\{\tau_C > t\} \qquad \text{and} \qquad \{\xi_t \le C\} \tag{4.53}
\]

are equivalent. From Eq. (4.50) it follows then (Eq. (A7.223))

\[
\Pr\{\tau_C > t\} = \Pr\{\xi_t \le C\} = 1 - e^{-(\lambda t + \mu C)} \sum_{n=1}^{\infty} \frac{(\lambda t)^n}{n!} \sum_{k=0}^{n-1} \frac{(\mu C)^k}{k!} . \tag{4.54}
\]

More general cost optimization strategies are often necessary in practical applications. For example, spare parts provisioning has to be considered as a parameter in the optimization between performance, reliability, availability, logistic support, and cost, taking care of obsolescence aspects as well. In some cases, one parameter is given (e.g. cost) and the best logistic structure is sought to maximize system availability or system performance. Basic considerations, as discussed above and in Sections 1.2.9, 8.4, A6.10.7, A7.5.3.3, apply. However, even assuming constant failure and repair rates, numerical solutions can become necessary, see [4.24] for an example.
5 Design Guidelines for Reliability, Maintainability, and Software Quality
Reliability, maintainability, and software quality have to be built into complex equipment and systems during the design and development phase. This has to be supported by analytical investigations (Chapters 2, 4, and 6) as well as by design guidelines. Adherence to such guidelines limits the influence of those aspects which can invalidate the models assumed for analytical investigations, and contributes greatly to building in reliability, maintainability, and software quality. This chapter gives a comprehensive list of design guidelines for reliability, maintainability, and software quality of complex equipment and systems, harmonized with industry's needs [1.2 (1996)].
5.1 Design Guidelines for Reliability

Reliability analysis in the design and development phase (Chapter 2) gives an estimate of an item's true reliability, based on some assumptions regarding data used, interface problems, dependence between components, compatibility between materials, environmental influences, transients, EMC, ESD, etc., as well as on the quality of manufacture and the user's skill level. To consider exhaustively all these aspects is difficult. The following design guidelines can be used to alleviate intrinsic weaknesses and improve the inherent reliability of complex equipment and systems.
5.1.1 Derating

Thermal and electrical stresses greatly influence the failure rate of electronic components. Derating is mandatory to improve the inherent reliability of equipment and systems. Table 5.1 gives recommended stress factors S (Eq. (2.1)) to be used
Table 5.1 Recommended derating values for electronic components at ambient temperature 20°C ≤ θ_A ≤ 40°C

* breakdown voltage; ** isolation voltage (0.7 for U_in); + sink current; ++ low values for inductive loads; × θ_J ≤ 100°C
for industrial applications (40°C ambient temperature θ_A, G_B as per Table 2.3). For θ_A > 40°C, a further reduction of S is necessary, in general linearly up to the limit temperature, as shown in Fig. 2.3. Too low values of S (S < 0.1) can also cause problems. S = 0.1 can be used in many cases to calculate the failure rate in a standby or dormant state. As a rule of thumb, S ≤ 0.5 is a good choice for reliability.
5.1.2 Cooling

As a general rule, the junction temperature θ_J of semiconductor devices should be kept as near as possible to the ambient temperature of the equipment or system
in which they are used. For a good design, θ_J ≤ 100°C is recommended. In a steady-state situation, i.e. with constant power dissipation P, the following relationships

\[
\theta_J = \theta_A + R_{JA}\, P \tag{5.1}
\]

\[
\theta_J = \theta_A + (R_{JC} + R_{CS} + R_{SA})\, P \tag{5.2}
\]

can be established and used to define the thermal resistance

R_JA for junction - ambient
R_JC for junction - case
R_CS for case - surface
R_SA for surface - ambient,

where surface is used for heat sink.
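Example 5.1 Determine the thermal resistance R_SA of a heat sink by assuming P = 400 mW, θ_J = 70°C, θ_A = 40°C, and R_JC + R_CS = 35°C/W.

Solution From Eq. (5.2) it follows that

\[
R_{SA} = \frac{\theta_J - \theta_A}{P} - R_{JC} - R_{CS}
\]

and thus

\[
R_{SA} = \frac{30\,^{\circ}\mathrm{C}}{0.4\,\mathrm{W}} - 35\,^{\circ}\mathrm{C/W} = 40\,^{\circ}\mathrm{C/W} .
\]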
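The calculation of Example 5.1 amounts to solving the series thermal resistance relation for R_SA. The sketch below assumes an ambient temperature of 40°C, which is what the 30°C temperature difference in the solution implies; the function names are illustrative.

```python
def heat_sink_resistance(theta_j, theta_a, power_w, r_jc_plus_r_cs):
    """Required surface-to-ambient thermal resistance R_SA in degC/W."""
    return (theta_j - theta_a) / power_w - r_jc_plus_r_cs


def junction_temperature(theta_a, power_w, r_total):
    """Steady-state junction temperature for a total junction-to-ambient resistance."""
    return theta_a + r_total * power_w


# Data of Example 5.1; theta_a = 40 degC is an assumption inferred from the solution
r_sa = heat_sink_resistance(theta_j=70.0, theta_a=40.0, power_w=0.4, r_jc_plus_r_cs=35.0)
print(round(r_sa, 1))                                   # degC/W

# Consistency check: with this heat sink the junction sits at the target temperature
print(round(junction_temperature(40.0, 0.4, 35.0 + r_sa), 1))
```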
For many practical applications, thermal resistance can be assumed to be independent of the temperature. However, R_JC generally depends on the package used (lead frame, packaging form and type), R_CS varies with the kind and thickness of thermal compound between the device package and the heat sink (or device support), and R_SA is a function of the heat-sink dimensions and form as well as of the type of cooling used (free convection, forced air, liquid-cooled plate, etc.). Typical thermal resistance values R_JC and R_JA for free convection in ambient air without heat sinks are given in Table 5.2. The values of Table 5.2 are indicative and have to be replaced with specific values for exact calculations.

Cooling problems should not only be considered locally at the component level, but be integrated into a thermal design concept (thermal management). In defining the layout of an assembly, care must be taken in placing high power dissipation parts away from temperature sensitive components like wet Al capacitors and optoelectronic devices (the useful life is reduced by a factor of 2 for a 10 - 20°C increase of the ambient temperature). In placing the assemblies in a rack, the cooling flow should be directed from the parts with low toward those with high power dissipation.
Table 5.2 Typical thermal resistance values for semiconductor component packages

Package form     Package type       R_JC [°C/W]    R_JA [°C/W]**
DIL              Plastic            10 - 40*       30 - 100*
DIL              Ceramic/Cerdip     7 - 20*        30 - 100*
PGA              Ceramic            6 - 10*        20 - 40*
SOL, SOM, SOP    Plastic (SMT)      20 - 60*       70 - 240*
PLCC             Plastic            10 - 20*       30 - 70*
QFP              Plastic            15 - 25*       30 - 80*
TO               Plastic            2 - 20         60 - 300

JC = junction to case; JA = junction to ambient; * lower values for ≥ 64 pins; ** free convection at 0.15 m/s (factor 1.5 - 2 lower for forced cooling at 4 m/s)
5.1.3 Moisture

For electronic components in nonhermetic packages, moisture can cause drift and activate various failure mechanisms such as corrosion and electrolysis (see Section 3.2.3, Point 8 for considerations on ICs). Critical in these cases is not the water itself, but the impurities and gases dissolved in it. If high relative humidity can occur, care must be taken to avoid the formation of galvanic couples as well as condensation or ice formation on the component packages or on conductive parts. As stated in Section 3.1.3, the use of ICs in plastic packages can be allowed if one of the following conditions is satisfied:

1. Continuous operation, relative humidity < 70%, noncorrosive or marginally corrosive environment, junction temperature
5.1.4 Electromagnetic Compatibility, ESD Protection
Electromagnetic compatibility (EMC) is the ability of an item to function properly in its intended electromagnetic environment without introducing unacceptable electromagnetic noise (disturbances) into that environment. EMC thus has two aspects, susceptibility and emission. Agreed susceptibility and emission levels are given in international standards (IEC 61000 [3.8]). Electrostatic discharge (ESD) protection is a part of an electromagnetic immunity concept, mandatory for semiconductor devices (Section 3.2.3). Causes of EMC problems in electronic equipment and systems are in particular switching and transient phenomena, electrostatic discharges, and stationary electromagnetic fields. Coupling can be conductive (galvanic), through common impedance, or by radiated electromagnetic fields. In the context of ESD or EMC, disturbances often appear as electrical pulses with rise times in the range 0.1 to 10 kV/ns, peak values of 0.1 to 10 kV, and energies of 0.1 to 10^3 mJ (high values for equipment). EMC aspects, in particular ESD protection, have to be considered early in the design and development of equipment and systems. The following design guidelines can help to avoid problems:
1. For high-speed logic circuits (f > 20 MHz) use a whole plane (layer of a multilayer), or at least a tight grid, for ground and power supply, to minimize inductance and to ensure a distributed decoupling capacitance (4 layers as signal / Vcc / ground / signal or better 6 layers as shield / signal / Vcc / ground / signal / shield are recommended).
2. For low-frequency digital circuits, analog circuits, and power circuits use a single-point ground concept, and wire all different grounds separately to a common ground point at system level (across antiparallel suppressor diodes).
3. Use low-inductance decoupling capacitors (generally 10 nF ceramic capacitors, placed where spikes may occur, i.e. at every IC for fast logic and bus drivers, every 4 ICs for HCMOS) and a 1 µF metallized paper (or a 10 µF electrolytic) capacitor per board; in the case of a highly pulsed load, locate the voltage regulator on the same board as the logic circuits.
4. Avoid logic which is faster than necessary and ICs with widely different rise times; adhere to required rise times and use Schmitt-trigger inputs if necessary.
5 Design Guidelines for Reliability, Maintainability, and Software Quality
5. Pay attention to dynamic stresses (particularly of breakdown voltages on semiconductor devices) as well as to switching phenomena on inductors or capacitors; implement noise reduction measures near the noise source (preferably with Zener diodes or suppressor diodes).
6. Match signal lines whose length is greater than v · t_r, also when using differential transmission (often possible with a series resistor at the source or a parallel resistor at the sink, v = signal propagation speed ≈ c / √(εr µr)); for HCMOS also use a 1 to 2 kΩ pull-up resistor and a pull-down resistor equal to the line impedance Z0, in series with a capacitor of about 200 pF per meter of line.
7. Capture induced noise at the beginning and at the end of long signal lines using parallel suppressors (suppressor diodes), series protectors (ferrite beads), or series/parallel networks (RC), in that order, taking into account the required rise and fall times.
8. Use twisted pairs for signal and return lines (one twist per centimeter); ground the return line at one end and the shield at both ends for magnetic shielding (at more points to shield against electric fields); provide a closed (360°) contact with the shield for the ground line; clock leads should have adjacent ground returns; for clock signals leaving a board consider the use of fiber optics, coax, trileads, or twisted pairs, in that order.
9. Avoid apertures in shielded enclosures (many small holes disturb less than a single aperture having the same area); use magnetic material to shield against low-frequency magnetic fields and materials with good surface conductivity against electric fields, plane waves, and high-frequency magnetic fields (above 10 MHz, absorption loss predominates and shield thickness is determined more by its mechanical than by its electrical characteristics); filter or trap all cables entering or leaving a shielded enclosure (filters and cable shields should make very low inductance contacts to the enclosure); RF parts of analog or mixed-signal equipment should be appropriately shielded (air core inductors have greater emission but less reception capability than magnetic core inductors); all signal lines entering or leaving a circuit should be investigated for common-mode emission; minimize common-mode currents.
10. Implement ESD current-flow paths with multipoint grounds at least for plug-in populated printed circuit boards (PCBs), e.g. with guard rings, ESD networks, or suppressor diodes, making sure in particular that all signal lines entering or leaving a PCB are sufficiently ESD protected (360° contact with the shield if shielded cables are used, latched and strobed inputs, etc.); ground all exposed metal to chassis ground, if necessary use secondary shields between sensitive parts and chassis; design keyboards and other operating parts to be immune to ESD.
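The line-matching guideline above compares the length of a signal line with the product of propagation speed and rise time. A minimal sketch of that check follows. It assumes that the propagation speed can be approximated by c / √εr for a non-magnetic dielectric (µr = 1) and that the time in the rule is the signal rise time; the function names and the default εr = 4.5 (a typical value for epoxy-glass boards) are illustrative assumptions, not values taken from the text.

```python
import math

C0 = 299_792_458.0  # speed of light in vacuum, m/s


def propagation_speed(eps_r: float) -> float:
    """Assumed approximation v = c / sqrt(eps_r), valid for a non-magnetic dielectric."""
    return C0 / math.sqrt(eps_r)


def needs_matching(length_m: float, rise_time_s: float, eps_r: float = 4.5) -> bool:
    """True if the line is longer than v * t_r, i.e. the rule of thumb asks for matching."""
    return length_m > propagation_speed(eps_r) * rise_time_s


def termination_capacitance_pf(length_m: float, pf_per_meter: float = 200.0) -> float:
    """Capacitor in series with the pull-down resistor: about 200 pF per meter of line."""
    return pf_per_meter * length_m
```

For a 1 ns rise time and εr = 4, the limit is about 0.15 m, so under these assumptions a 0.3 m line would be matched and a 0.1 m line would not.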
5.1.5 Components and Assemblies
5.1.5.1 Component Selection
1. Pay attention to all specification limits given by the manufacturer and to company-specific rules, in particular to dynamic parameters and breakdown limits.
2. Limit the number of entries in the list of preferred parts (QPL) and whenever possible ensure a second source procurement; if obsolescence problems are possible (very long warranty or operation time), observe this aspect in the QPL and/or in the design / layout of the equipment or system considered.
3. Use non-qualified parts and components only after checking the technology and reliability risks involved (the learning phase at the manufacturer's plant can take more than 6 months); in the case of critical applications, intensify the feedback to the manufacturer and plan an appropriate incoming inspection with screening.
5.1.5.2 Component Use
1. Tie unused logic inputs to the power supply or ground, usually through pull-up / pull-down resistors (100 kΩ for CMOS), also to improve testability; pull-up / pull-down resistors are also recommended for inputs driven by three-state outputs; unused outputs basically remain open.
2. Protect all CMOS terminals from or to a connector with a 100 kΩ pull-up / pull-down resistor and a 1 to 10 kΩ series resistor (latch-up) for an input, or an appropriate series resistor for an output (add diodes if Vin and Vout cannot be limited between -0.3 V and VDD + 0.3 V); observe power-up and power-down sequences, and make sure that the ground and power supply are applied before and disconnected after the signals.
3. Analyze the thermal stress (internal operating temperature) of each part and component carefully, placing dissipating devices away from temperature-sensitive ones, and adequately cooling components with high power dissipation (failure rates generally double for a temperature increase of 10 - 20°C); for semiconductor devices, design for a junction temperature θJ ≤ 100°C (if possible keep θJ ≤ 80°C).
4. Pay attention to transients, especially in connection with breakdown voltages of transistors (VBEO ≤ 5 V; stress factor S < 0.5 for VCE, VGS, and VDS).
5. Derate power devices more than signal devices (stress factor S < 0.4 if more than 10^5 power cycles occur during the useful life).
6. Avoid special diodes (tunnel, step-recovery, pin, varactor, which are 2 to 20 times less reliable than normal Si diodes); Zener diodes are about one half as reliable as Si switching diodes, their stress factor should be > 0.1.
7. Allow a ±30% drift of the coupling factor for optocouplers during operation; regard optocouplers and LEDs as having a limited useful life (generally > 10^6 h for θJ < 40°C and < 10^5 h for θJ > 80°C), design for θJ ≤ 70°C (if possible keep θJ ≤ 40°C); pay attention to optocoupler voltage (S ≤ 0.3).
8. Observe operating temperature, voltage stress (DC and AC), and technological suitability of capacitors for a given application: foil capacitors have a reduced impulse handling capability; wet Al capacitors have a limited useful life (which halves for every 10°C increase in temperature), a large series inductance, and a moderately high series resistance; for solid Ta capacitors the AC impedance of the circuit as viewed from the capacitor terminals should not be too small (the failure rate is an order of magnitude higher with 0.1 Ω/V than with 2 Ω/V, although new types are less sensitive); use a 10 - 100 nF ceramic capacitor parallel to each electrolytic capacitor; avoid electrolytic capacitors < 1 µF.
9. Cover EPROM windows with metallized foils, also when stored.
10. Avoid the use of variable resistors in final designs (50 to 100 times less reliable than fixed resistors); for power resistors, check the internal operating temperature as well as the voltage stress.
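Two of the rules of thumb above are simple exponential scalings: failure rates generally double for a temperature increase of 10 - 20°C, and the useful life of wet Al capacitors halves for every 10°C. The sketch below turns them into small helper functions. The function names are illustrative, and the definition of the stress factor S as the ratio of applied to rated value is an assumption made here for the example, since the definition is not repeated in this section.

```python
def failure_rate_factor(delta_t_c: float, doubling_interval_c: float = 10.0) -> float:
    """Multiplier on the failure rate for a temperature rise of delta_t_c.

    doubling_interval_c is the temperature increase that doubles the failure
    rate; the text gives a range of 10 to 20 degrees C.
    """
    return 2.0 ** (delta_t_c / doubling_interval_c)


def wet_al_capacitor_life_h(life_ref_h: float, temp_ref_c: float, temp_c: float) -> float:
    """Useful life of a wet Al capacitor, halving for every 10 degrees C increase."""
    return life_ref_h * 2.0 ** ((temp_ref_c - temp_c) / 10.0)


def stress_factor(applied: float, rated: float) -> float:
    """Assumed definition: S = applied value / rated value (same units)."""
    return applied / rated


def is_derated(applied: float, rated: float, s_max: float) -> bool:
    """True if the stress factor stays below the chosen limit, e.g. 0.5 or 0.4."""
    return stress_factor(applied, rated) < s_max
```

As a usage example, a capacitor with a reference life of 2000 h at 85°C would reach 8000 h at 65°C under the halving rule, and a 20°C rise multiplies the failure rate by 2 to 4 depending on the doubling interval chosen.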
5.1.5.3 PCB and Assembly Design
1. Design all power supplies to handle permanent short circuits and monitor for under / over voltage (protection diode across the voltage regulator to avoid Vout > Vin at power shutdown); use a 10 to 100 nF decoupling ceramic capacitor parallel to each electrolytic capacitor.
2. Clearly define, and implement, interfaces between different logic families.
3. Establish timing diagrams using worst-case conditions, also taking the effects of glitches into consideration.
4. Pay attention to inductive and capacitive coupling in parallel signal leads (0.5 - 1 µH/m, 50 - 100 pF/m); place signal leads near to ground returns and away from power supply leads, in particular for clocks; for high-speed circuits, investigate the requirement for wave matching (parallel resistor at the sink, series resistor at the source); introduce guard rings or ground tracks to limit coupling effects.
5. Place all input/output drivers close together, near the connectors, but away from clock circuitry and power supply lines (inputs latched and strobed).
6. Protect PCBs against damage through insertion or removal under power (use appropriate connectors).
7. For PCBs employing surface mount technology (SMT), make sure that the component spacing is not smaller than 0.5 mm and that the lead width and spacing are not smaller than 0.25 mm; test pads and solder-stop pads should be provided; for large leadless ceramic ICs, use an appropriate lead frame (problems in SMT arise with soldering, heat removal, mismatch of expansion coefficients, pitch dimensions, pin alignment, cleaning, and contamination); pitch < 0.3 mm can give production problems.
8. Observe the power-up and power-down sequences, especially in the case of different power supplies (no signals applied to unpowered devices).
9. Make sure that the mechanical fixing of power devices is appropriate, in particular of those with high power dissipation; avoid having current-carrying contacts under thermomechanical stress.
10. The testability of PCBs and assemblies should be considered early in the design of the layout (number and dimension of test points, pull-up / pull-down resistors, activation / deactivation of three-state outputs, see also Section 5.2); manually extend the capability of CAD tools if necessary.
5.1.5.4 PCB and Assembly Manufacturing
1. Keep the workplaces for assembling, soldering, and testing conductive, in particular ground tools and personnel with 1 MΩ resistors; avoid touching the active parts of components during assembling; use soldering irons with transformers and grounded tips.
2. When using automatic placing machines for inserted devices, verify that only the parts of pins free from insulation go into the soldering holes (resistor networks, capacitors, relays) and that IC pins are not bent into the soldering holes (hindering degassing); for surface mount devices (SMD), make sure that the correct quantity of solder material is deposited, and that the stand-off height between the component body and the printed circuit surface is not less than 0.25 mm (pitch < 0.3 mm can give production problems, see also Section 3.3.4 for possible placing related ESD damages).
3. Control the soldering temperature profile; for wave soldering choose the best compromise between soldering time and soldering temperature (about 3 s at 245°C) as well as an appropriate preheating (about 60 s to reach 100°C); check the solder bath periodically and make sure that there is sufficient distance between the solder joints and the package for temperature-sensitive devices; for surface mount technology (SMT) give preference to IR reflow soldering and provide good solder-stop pads (vapor-phase can be preferred for substrates with metal core or PCBs with high component density); avoid having inserted and surface mounted devices (SMD) on the same (two-sided) PCB (thermal shock on the SMD with consequent crack formation and possibly ingress of flux to the active part of the component, in particular for ceramic capacitors greater than 100 nF and large plastic ICs).
4. Avoid soldering gold-plated pins; if not possible, tin-plate the pins in order to reduce Au concentration to < 4% in the solder joint (intermetallic layers) and < 0.5% in the solder bath (contamination), 0.2 µm
5. Avoid having more than one heating process that reaches the soldering temperature, and hence any kind of rework; for temperature-sensitive devices, consider the possibility of adequate protection during soldering (support, cooling ring, etc.).
6. For high reliability applications, wash PCBs and assemblies after soldering (deionized water (< 5 µS/cm), in any case with halogen-free liquids); periodically check the washing liquid for contamination; use ultrasonic cleaning only when resonance problems in components are excluded.
7. Avoid any kind of electrical overstress when testing components, PCBs or assemblies; avoid removal and insertion under power.
5.1.5.5 Storage and Transportation
1. Keep the storage temperature between 10 and 30°C and the relative humidity between 40 and 60%; avoid dust, corrosive atmospheres, and mechanical stresses (particularly for electromechanical components); use hermetically sealed containers for high-humidity environments only.
2. Limit the storage time by implementing first-in / first-out rules (storage time should be no longer than two years, just-in-time shipping is often only possible for a stable production line).
3. Ensure antistatic storage and transportation of all ESD sensitive electronic components, in particular semiconductor devices (use metallized, unplasticized bags, avoid PVC for bags).
4. Transport PCBs and assemblies in antistatic containers and with all connectors shorted.
5.1.6 Particular Guidelines for IC Design and Manufacturing
1. Reduce latch-up sensitivity by increasing critical distances, changing local doping, or introducing vertical thick-oxide isolation.
2. Avoid significant voltage drops along resistive leads (polysilicon) by increasing line conductivity and/or dimensions or by using multilayer metallizations.
3. Give sufficient size to the contact windows and avoid large contact depth and thus sharp edges (slopes); ensure material compatibility, in particular with respect to metallization layers.
4. Take into account chemical compatibility between materials and tools used in sequential processes; limit the use of planarization processes to uncritical metallization line distances; preferably employ stable processes (low-risk processes) which allow a reasonable parameter deviation; carefully control the wafer raw material (CZ/FZ material, crystal orientation, O2 concentration, etc.).
5.2 Design Guidelines for Maintainability
Maintainability, even more than reliability, must be built into complex equipment and systems. This generally has to be done on a project-specific basis, with a maintenance concept. However, a certain number of design guidelines for maintainability apply quite generally. These are discussed in this section for the case of complex electronic equipment and systems with high maintainability requirements.
5.2.1 General Guidelines
1. Plan and implement a concept for automatic fault recognition and automatic or semiautomatic fault isolation (localization and diagnosis) down to the line replaceable unit (LRU) level, including hidden failures and software defects, as far as possible.
2. Partition the equipment or system into line replaceable units (LRUs) and apply techniques of modular construction, starting from the functional structure; make modules functionally independent and electrically as well as mechanically separable; develop easily replaceable LRUs which can be tested with commonly available test equipment.
3. Aim for the greatest possible standardization of parts, tools, and testing equipment; keep the need for external testing facilities to a minimum.
4. Conceive operation and maintenance procedures to be as simple as possible, also considering personnel safety, and describe them in appropriate manuals.
5. Consider environmental conditions (thermal, climatic, mechanical) in field operation as well as during transportation and storage.
5.2.2 Testability
Testability includes the degrees of failure recognition and isolation, the correctness of test results, and test duration. High testability can generally be achieved by improving observability (the possibility to check internal signals at the outputs) and controllability (the possibility to modify internal signals from the inputs). Of the following design guidelines, the first five are more for assemblies, and the last five are more for ICs (ASICs in particular).
1. Avoid asynchronous logic (asynchronous signals should be latched and strobed at the inputs).
2. Simplify logical expressions as far as possible.
3. Improve testability of connection paths and simple circuitry using ICs with boundary-scan (IEEE STD 1149 [4.10]).
4. Separate analog and digital circuit paths, as well as circuitry with different supply voltages; make power supplies mechanically separable.
5. Make feedback paths separable. (Sketch labels: Logic, Test point, Control signal.)
6. Realize modules as self-contained as possible, with small sequential depth, electrically separable and individually testable. (Sketch labels: Control signal 1, Control signal 2, with MUXs, with gates.)
7. Allow for external initialization of sequential logic. (Sketch labels: Ext. clock, Clock, Flip-Flop, Test point.)
8. Develop and introduce built-in self-test (BIST); introduce test modes also for the detection of hidden failures.
9. Provide enough test points (at a minimum on functional-unit inputs and outputs as well as on bus lines) and support them with pull-up / pull-down resistors; provide access for a probe, taking into account the capacitive load (resistive in the case of DC measurements).
10. Make use of a scan path to reduce test time; the basic idea of a scan path is shown on the right-hand side of Fig. 5.1, the test procedure with a scan path is as follows (n = 3 in Fig. 5.1):
   1. Activate the MUX control signal (connect Z to B).
   2. Scan-in with n clock pulses an appropriate n-bit test pattern; this pattern appears in parallel at the FF outputs and can be read serially with n - 1 additional clock pulses (repeat this step to completely test MUXs & FFs).
   3. Scan-in with n clock pulses a first test pattern for the combinatorial logic (feedback part) and apply an appropriate pattern also to the input (both patterns are applied to the combinatorial circuit and generate corresponding results which appear at the output Y and at the inputs A of the MUXs).
   4. Verify the results at the output Y.
   5. Deactivate the MUX control signal (connect Z to A).
   6. Give one clock pulse (feedback results from the combinatorial circuit appear in parallel at the FF outputs).
   7. Activate the MUX control signal (connect Z to B).
   8. Scan-out with n - 1 clock pulses and verify the results; at the same time a second test pattern for the combinatorial circuit can be scanned-in.
   9. Repeat steps 3 - 8 up to a satisfactory test of the combinatorial part of the circuit (see e.g. [4.12, 4.13, 4.23] for test algorithms specially developed for combinatorial circuits).
Figure 5.1 Basic structure of a synchronous sequential circuit, without a scan path on the left-hand side and with a scan path on the right-hand side (panel labels: Without scan path, With scan path, Combinational logic)
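The scan-path test procedure above can be followed step by step in a small behavioral model. The sketch below is only an illustration under stated assumptions: each flip-flop is fed by a MUX that selects either the feedback from the combinational logic (Z = A) or the previous flip-flop of the chain (Z = B); the class and function names, and the 3-bit counter used as example logic, are hypothetical and not taken from Fig. 5.1.

```python
from typing import Callable, List, Tuple

Bits = List[int]
# (inputs, present state) -> (outputs Y, next state fed back to the MUX inputs A)
Logic = Callable[[Bits, Bits], Tuple[Bits, Bits]]


class ScanCircuit:
    """n flip-flops, each fed by a MUX: Z = A (logic feedback) or Z = B (scan chain)."""

    def __init__(self, n: int, logic: Logic) -> None:
        self.ff: Bits = [0] * n
        self.logic = logic
        self.scan_mode = False  # MUX control signal; True connects Z to B

    def clock(self, x: Bits, scan_in: int = 0) -> None:
        if self.scan_mode:
            # Shift register: scan_in -> FF0 -> FF1 -> ... -> FF(n-1)
            self.ff = [scan_in] + self.ff[:-1]
        else:
            # Normal operation: the FFs capture the feedback results
            self.ff = list(self.logic(x, self.ff)[1])

    def output(self, x: Bits) -> Bits:
        return self.logic(x, self.ff)[0]

    @property
    def scan_out(self) -> int:
        return self.ff[-1]


def scan_in(circuit: ScanCircuit, pattern: Bits) -> None:
    """Load pattern so that circuit.ff == pattern, using n clock pulses."""
    circuit.scan_mode = True
    for bit in reversed(pattern):
        circuit.clock([], bit)


def scan_out(circuit: ScanCircuit) -> Bits:
    """Read all FFs serially; the first bit is already visible, so n - 1 pulses."""
    circuit.scan_mode = True
    bits = [circuit.scan_out]
    for _ in range(len(circuit.ff) - 1):
        circuit.clock([], 0)
        bits.append(circuit.scan_out)
    return bits[::-1]


def run_scan_test(circuit: ScanCircuit, state: Bits, x: Bits) -> Tuple[Bits, Bits]:
    scan_in(circuit, state)        # scan-in the test pattern for the feedback part
    y = circuit.output(x)          # observe the output Y for the applied input
    circuit.scan_mode = False      # connect Z to A
    circuit.clock(x)               # one clock pulse captures the feedback results
    return y, scan_out(circuit)    # connect Z to B and scan-out


def counter_logic(x: Bits, state: Bits) -> Tuple[Bits, Bits]:
    """Example logic (hypothetical): counter with enable x[0], state bits LSB first."""
    n = len(state)
    value = sum(bit << i for i, bit in enumerate(state))
    nxt = (value + x[0]) % (1 << n)
    y = [1 if value == (1 << n) - 1 else 0]
    return y, [(nxt >> i) & 1 for i in range(n)]
```

With `ScanCircuit(3, counter_logic)`, the call `run_scan_test(circuit, [1, 1, 0], [1])` loads the state 3 (LSB first), applies the enable input, and returns the output `[0]` together with the captured next state `[0, 0, 1]`. In this simplified model the scan-out shifts in zeros; overlapping the scan-out with the scan-in of the next pattern, as mentioned in the procedure, is left out for brevity.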
5.2.3 Accessibility, Exchangeability
1. Provide self-latching access flaps of sufficient size; avoid the need for special tools (one-way screws, Allen screws, etc.); use clamp fastening.
2. Plan accessibility by considering the frequency of maintenance tasks.
3. Preferably use indirect plug connectors; distribute power supply and ground over several contacts (20% of the contacts should be used for power supply and ground); plan to have reserve contacts; avoid any external mechanical stress on connectors; define (if possible) only one kind of extender for PCBs and plan its use.
4. Provide for speedy replaceability by means of plug-out / plug-in techniques.
5. Prevent faulty installation or connection (of PCBs for instance) through mechanical keying.
5.2.4 Operation, Adjustment
1. Use high standardization in selecting operational tools and make any labeling simple and clear.
2. Consider human aspects in the layout of operating consoles and in defining operating and maintenance procedures.
3. Order all steps of a procedure in a logical sequence and document these steps by a visual feedback.
4. Describe system status, detected fault, or action to be accomplished concisely in full text.
5. Avoid any form of hardware adjustment (or alignment) in the field; if unavoidable, describe the procedure carefully.
5.3 Design Guidelines for Software Quality
Software plays an increasingly important role in equipment and systems, both in terms of technical relevance and of development cost (often higher than 70% even for small systems). Unlike hardware, software does not go through a production phase. Also, software cannot break or wear out. However, it can fail to satisfy its required function because of defects which manifest themselves while the system is operating (dynamic defects). A fault in the software is thus caused by a defect, even if it appears randomly distributed in time, and software problems are basically quality problems which have to be solved with quality assurance tools (defect prevention, configuration management, testing, and quality data reporting systems).
For equipment and systems exhibiting high reliability or safety requirements, software should be conceived and developed to be defect tolerant, i.e. to be able to continue operation despite the presence of software defects. For this purpose, redundancy considerations are necessary, in the time domain (protocol with retransmission, cyclic redundancy check, assertions, exception handling, etc.), in the space domain (error correcting codes, parallel processes, etc.), or as a combination of both. Moreover, if the interaction between hardware and software in the realization of the required function at the system level is large (embedded software), redundancy considerations should also be extended to cover hardware defects and failures, i.e. to make the system fault tolerant (Sections 2.3.7 and 6.8.3 - 6.8.7). In this context, effort should be devoted to the investigation of causes-to-effects aspects (criticality) of hardware and software faults from a system level point of view, including hardware, software, human factors, and logistic support as well.
This section introduces basic concepts and tools for software quality assurance, with particular emphasis on design guidelines and preventive actions. Because of their utility in debugging complex software packages, models for software quality growth are also discussed (Section 5.3.4). More details can be found in [5.31-5.70] and [A2.8], in particular [5.44, 5.50, 5.60, 5.67, A2.8 (730 for SQ Assurance Plans)].
A first difference between hardware and software appears in the life-cycle phases (Table 5.3). In contrast to Fig. 1.6, the production phase does not appear in the software life-cycle phases, since software can be copied without errors. A partition of the software life-cycle into clearly defined phases, each of them closed with an extensive design review, is mandatory for software quality assurance. A second basic distinction between hardware and software is given by the quality attributes or characteristics (Table 5.4). The definitions of Table 5.4 extend those given in Appendix A1 and take care of established standards [A2.8, 5.50].
Not all quality attributes of Table 5.4 can be fulfilled at the same time. In general, a priority list must be established and consequently followed by all engineers involved in a project. A further difficulty is the quantitative evaluation (assessment) of software quality attributes, i.e. the definition of software quality metrics. An attempt to aggregate (as user) some of the attributes in Table 5.4 is given in [5.45]. From the above considerations, software quality can be defined as the degree to which a software package possesses a stated combination of quality attributes (characteristics). If supported by an appropriate set of software quality metrics, this allows an objective assessment of the quality level achieved. Since only a limited number of quality attributes can be reasonably well satisfied by a specific software package, the main purpose of software quality assurance is to maximize the common part of the quality attributes needed, specified, and realized. To reach this target, specific activities have to be performed during all software life-cycle phases. Many of these activities can be derived from hardware quality assurance tasks, in particular regarding preventive actions (defect prevention), configuration management, testing, and corrective actions. However, auditing software quality assurance activities in a project should be more intensive and with a shorter feedback than for hardware (Fig. 5.2, Tab. 5.5).
Table 5.3 Software life-cycle phases (see Fig. 1.6 for hardware life-cycle phases)
Phase: Concept
  Objective / Tasks: Problem definition; feasibility check
  Input: Problem definition; constraints on computer size, programming languages, I/O, etc.
  Output: System specifications for functional (what) and performance (how) aspects; proposal for the definition phase
Phase: Definition
  Objective / Tasks: Investigation of alternative solutions; interface definitions; feasibility check
  Input: System specifications; proposal for the definition phase
  Output: Revised system specifications; interface specifications; updated estimation of cost and schedule; feedback from users; proposal for the design, coding, and testing phase
Phase: Design, Coding, Testing
  Objective / Tasks: Setup of detailed specifications; software design; coding; test of each module; verification of compliance with module specifications (design reviews); data acquisition; feasibility check
  Input: Revised system specifications; interface specifications; proposal for the design, coding, and testing phase
  Output: Definitive flowcharts, data flow diagrams, and data analysis diagrams; test procedures; completed and tested software modules; tested I/O facilities; proposal for the integration, validation, and installation phase; software documentation
Phase: Integration, Validation, Installation
  Objective / Tasks: Integration and validation of the software; verification of compliance with system specifications (design reviews); setup of the definitive documentation
  Input: Completed and tested software modules; tested I/O facilities; proposal for the integration, validation, and installation phase
  Output: Completed and tested software; complete and definitive documentation
Phase: Operation, Maintenance
  Objective / Tasks: Use / application of the software; maintenance (corrective and perfective)
  Input: Completed and tested software; complete and definitive documentation
Concerning the design and development of complex equipment and systems, the traditional separation between hardware and software should be overcome, taking from each side the "good part" of methods and tools and putting them together for new, better methods and tools (a strategy applicable to other situations as well).
Table 5.4 Important software quality attributes and characteristics
Attribute: Definition
Compatibility: Degree to which two or more software modules or packages can perform their required functions while sharing the same hardware or software environment
Completeness: Degree to which a software module or package possesses the functions necessary and sufficient to satisfy user needs
Consistency: Degree of uniformity, standardization, and freedom from contradiction within the documentation or parts of a software package
Defect Freedom (Reliability): Degree to which a software package can execute its required function without causing system failures
Defect Tolerance (Robustness): Degree to which a software module or package can function correctly in the presence of invalid inputs or highly stressed environmental conditions
Documentation: Totality of documents necessary to describe, design, test, install, and maintain a software package
Efficiency: Degree to which a software module or package performs its required function with minimum consumption of resources (hardware and/or software)
Flexibility: Degree to which a software module or package can be modified for use in applications or environments other than those for which it was designed
Integrity: Degree to which a software package prevents unauthorized access to or modification of computer programs or data
Maintainability: Degree to which a software module or package can be easily modified to correct faults, improve the performance, or other attributes
Portability: Degree to which a software package can be transferred from one hardware or software environment to another
Reusability: Degree to which a software module can be used in another program
Simplicity: Degree to which a software module or package has been conceived and implemented in a straightforward and easily understandable way
Testability: Degree to which a software module or package facilitates the establishment of test criteria and the performance of tests to determine whether those criteria have been met
Usability: Degree to which a user can learn to operate, prepare inputs for, and interpret outputs of a software package
Software module is used here also for software element
5.3.1 Guidelines for Software Defect Prevention
Defects can be introduced in different ways and at different points along the life-cycle phases of software. The following are some causes of defects:
1. During the concept and definition phase: misunderstandings in the problem definition; constraints on CPU performance, memory size, computing time, I/O facilities, or others; inaccurate interface specifications; too little attention to user needs and/or skills.
2. During the design, coding, and testing phase: inaccuracies in detailed specifications; misinterpretation of detailed specifications; inconsistencies in procedures or algorithms; timing problems; data conversion errors; complex software structuring or large dependence between software modules.
3. During the integration, validation, and installation phase: too large interaction between software modules; errors during software corrections or modifications; unclear or incomplete documentation; changes in the hardware or software environment; exceeding important resources (dynamic memory, disk, etc.).
Figure 5.2 Procedure for software development (top-down design and bottom-up integration with vertical and horizontal control loops); block labels: Software System Specification; Basic Software Structure; Modules (Software Element) Design, Coding, and Testing; Modules Validation; Modules Integration; ... and Installation
Defects are thus generally caused by human errors (software developer or user). Their detection and removal become more expensive as the software life cycle progresses (often by a factor of 10 between each of the four main phases of Table 5.3, as in Fig. 8.2 for hardware). Considering that many defects can remain undiscovered for a long time after the software installation (since they are detected only by particular combinations of data and system states), defect prevention through an appropriate software quality assurance becomes mandatory. The following design guidelines can be useful:
1. Fix written procedures / rules and follow them during software development; such rules specify quality attributes with project-specific priority and corresponding quality assurance procedures.
2. Formulate detailed specifications and interfaces as carefully as possible; such specifications / interfaces should exist before coding begins.
3. Give priority to object-oriented programming.
4. Use well-behaved high-level programming languages, assembler only when a problem cannot be solved in any other way; use established Computer Aided Software Engineering (CASE) for program development and testing.
5. Partition software into independent software modules (modules should be individually testable, developed top-down, and integrated bottom-up).
6. Take into account all constraints given by I/O facilities.
7. Develop software able to protect itself and its data; plan for automatic testing and validation of data.
8. Consider aspects of testing / testability as early as possible in the development phase; increase testability through the use of definition languages (Vienna, RTRL, PSL, IORL).
9. Improve understandability and readability of software by introducing appropriate comments.
10. Document software carefully and carry out sufficient configuration management, in particular with respect to design reviews (Table 5.5).
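Guideline 7 above (software able to protect itself and its data, with automatic validation of data) can be illustrated with a small input-validation routine. This is a minimal sketch under stated assumptions: the record layout, the names `Reading`, `InvalidReading`, and `validate_reading`, and the limits (8 channels, -55 to 125°C) are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    channel: int
    value_c: float


class InvalidReading(ValueError):
    """Raised when an input record fails validation."""


def validate_reading(raw, n_channels: int = 8,
                     lo: float = -55.0, hi: float = 125.0) -> Reading:
    """Check type, presence, and plausibility of a record before it is used."""
    try:
        channel = int(raw["channel"])
        value = float(raw["value_c"])
    except (KeyError, TypeError, ValueError) as exc:
        raise InvalidReading(f"malformed record: {raw!r}") from exc
    if not 0 <= channel < n_channels:
        raise InvalidReading(f"channel out of range: {channel}")
    if not lo <= value <= hi:  # this comparison also rejects NaN
        raise InvalidReading(f"value out of range: {value}")
    return Reading(channel, value)


def mean_of_valid(records) -> float:
    """Average the valid readings; invalid ones are skipped, not propagated."""
    values = []
    for raw in records:
        try:
            values.append(validate_reading(raw).value_c)
        except InvalidReading:
            continue  # a real system would also log or count the rejection
    assert values, "no valid reading available"
    return sum(values) / len(values)
```

The design choice here is to reject bad data at the module boundary with an explicit exception, so that the rest of the module only ever works on validated `Reading` objects.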
Software for on-line systems (product and embedded software) should further be conceived to be as far as possible tolerant of hardware failures and to allow a system reconfiguration, particularly in the context of a fail-safe concept (hardware and software involved in fail-safe procedures should be periodically checked during the operation phase). For this purpose, redundancy considerations are necessary, in the time domain (protocol with retransmission, cyclic redundancy check, assertions, exception handling, etc.), in the space domain (error correcting codes, parallel processes, etc.), or a combination of both. Moreover, if the interaction between hardware and software in the realization of the required function at the system level is large (embedded software), redundancy considerations should be extended to cover hardware defects and failures, i.e. to make the system fault tolerant (Sections 2.3.7 and 6.8.6). In this context, effort should be devoted to the investigation of causes-to-effects aspects (criticality) of hardware and software faults from a system level point of view, including hardware, software, human factors, and logistic support as well (Section 2.6).
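As an illustration of redundancy in the time domain, the following minimal Python sketch combines a cyclic redundancy check with retransmission. The channel model (`noisy_channel`, flipping at most one bit per frame), the 4-byte frame trailer, and the retry limit are assumptions made for this example only; `zlib.crc32` is the CRC-32 routine of the Python standard library.

```python
import random
import zlib


def send_with_crc(payload: bytes) -> bytes:
    """Build a frame: payload followed by its CRC-32 (4 bytes, big-endian)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_crc(frame: bytes) -> bool:
    """Recompute the CRC-32 of the payload and compare it with the trailer."""
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return zlib.crc32(payload) == received


def noisy_channel(frame: bytes, rng: random.Random, flip_prob: float) -> bytes:
    """Hypothetical channel: with probability flip_prob, flip one random bit."""
    if rng.random() >= flip_prob:
        return frame
    corrupted = bytearray(frame)
    pos = rng.randrange(len(corrupted) * 8)
    corrupted[pos // 8] ^= 1 << (pos % 8)
    return bytes(corrupted)


def transmit(payload: bytes, rng: random.Random,
             flip_prob: float = 0.3, max_attempts: int = 10):
    """Retransmit until a frame passes the CRC check or the limit is reached."""
    frame = send_with_crc(payload)
    for attempt in range(1, max_attempts + 1):
        received = noisy_channel(frame, rng, flip_prob)
        if check_crc(received):
            return received[:-4], attempt
    raise RuntimeError("retransmission limit reached")
```

The retry limit turns a persistent fault into an explicit exception, which the calling software can handle within a fail-safe procedure instead of waiting indefinitely.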
5.3.2 Configuration Management
Configuration management is an important quality assurance tool during the design and development of complex equipment and systems, both for hardware and software. Applicable methods and procedures are outlined in Section 1.3.3 and discussed in Appendices A3 and A4 for hardware. Some of these methods have been introduced in software standards [A2.8]. Of particular importance for software are design reviews, as given in Table 5.5 (see also Table A3.3 for hardware aspects), and configuration control, i.e. management of changes and modifications.
5.3.3 Guidelines for Software Testing
Planning for software testing is generally a difficult task, as even small programs can have an extremely large number of states, which makes a complete test impossible. A test strategy is then necessary. The problem is also known for hardware, for which special design guidelines to increase testability have been developed (Section 5.2). The most important rule, which applies to both hardware and software, is the partitioning of the item (hardware or software) into independent modules which can be individually tested and integrated bottom-up to constitute the system. Many rules can be project specific. The following design guidelines can be useful in establishing a test strategy for software used in complex equipment and systems:
1. Plan software tests early in the design and coding phases, and integrate them step by step into a test strategy.
2. Use appropriate tools (debugger, coverage-analyzer, test generators, etc.).
3. Perform tests first at the module level, exercising all instructions, branches and logic paths.
4. Integrate and test successively the modules bottom-up to the system level.
5. Test carefully all suspected paths (with potential defects) and software parts whose incorrect running could cause major system failures.
6. Account for all defects which have been discovered with indication of running time, software & hardware environments at the occurrence time (state, parameter set, hardware facilities, etc.), changes introduced, and debugging effort.
7. Test the complete software in its final hardware and software environment.
Testing is the only practical possibility to find (and eliminate) defects. It includes debug tests (generally performed early in the design phase using breakpoints, desk checking, dumps, inspections, reversible executions, single-step operation, or traces) and run tests. Although costly (often up to 50% of the software development cost), tests cannot guarantee freedom from defects. A balanced distribution of the efforts between preventive actions (defect prevention) and testing must thus be found for each project.

Table 5.5 Software design reviews (IEEE Std 1028-1988 [A2.8])

| Type | Objective |
|---|---|
| Management Review | Provide recommendations for the following: activities progress, based on an evaluation of product development status; changing project direction or identifying the need for alternate planning; adequate allocation of resources through global control of the project |
| Technical Review | Evaluate a specific software element and provide management with evidence that: the software element conforms to its specifications; the design (or maintenance) of the software element is being done according to plans, standards, and guidelines applicable for the project; changes to the software element are properly implemented and affect only those system areas identified by change specifications |
| Inspection | Detect and identify software element defects, in particular: verify that every software element satisfies its specifications; verify that every software element conforms to applicable standards; identify deviations from standards and specifications; evaluate software engineering data (e.g. defect and effort data) |
| Walkthrough | Find defects, omissions, and contradictions in the software elements and consider alternative implementations (long associated with code examination, this process is also applicable to other aspects, e.g. architectural design, detailed design, test plans / procedures, and change control procedures) |

Software element is used here also for software module; see also Tab. A3.3 for system oriented design reviews.
5.3.4 Software Quality Growth Models
Since the beginning of the seventies, a large number of models have been proposed to describe the occurrence of software defects during operation of complex equipment and systems. Such an occurrence can generate a failure at system level and often appears randomly distributed in time. For this reason, modeling has been done in a similar way as for hardware failures, i.e. by introducing the concept of software failure rate. Such an approach may be valid to investigate software quality growth during software validation and installation, as for the reliability growth models developed in the sixties for hardware (Section 7.7).
However, from the considerations of the preceding sections, the main target should be the development of software free from defects, and thus to focus the effort on defect prevention rather than on defect modeling. Nevertheless, because of their use in investigating software quality growth, this section briefly introduces the basic models known for software defect modeling.
1. Between consecutive occurrence points of a software defect, the "failure rate" is a function of the number of defects present in the software. This model leads to a death process and is known as the Jelinski-Moranda model. If at t = 0 the software contains n defects, the probability
$P_i(t) = \Pr\{i \text{ defects have been removed up to the time } t \mid n \text{ defects were present at } t = 0\}$
can be calculated recursively from (see Fig. A7.9 with $\nu_0 = n\lambda$, $\nu_i = (n-i)\lambda$, and $\theta_i = 0$)
$$P_0(t) = e^{-n\lambda t}, \qquad P_i(t) = \int_0^t (n-i+1)\,\lambda\, e^{-(n-i)\lambda x}\, P_{i-1}(t-x)\, dx, \quad i = 1, \ldots, n, \tag{5.3}$$
or directly as
$$P_i(t) = \binom{n}{i}\,(1 - e^{-\lambda t})^{i}\, e^{-(n-i)\lambda t}, \quad i = 0, \ldots, n.$$
Figure 5.3 shows $P_0(t)$ to $P_3(t)$ for n = 10. This model can be easily extended to cover the case in which the parameter $\lambda$ also depends on the number of defects still present in the software.
2. Between consecutive occurrence points of a software defect, the "failure rate" is a function of the number of defects still present in the software and of the time elapsed since the last occurrence point of a defect. This model generalizes Model 1 above and can be investigated using semi-Markov processes (Appendix A7.6).
Figure 5.3 $P_i(t) = \Pr\{i \text{ defects have been removed up to the time } t \mid n \text{ defects were present at } t = 0\}$ for i = 0 - 3 and n = 10, plotted over t with axis marks at $1/n\lambda$, $2/n\lambda$, $3/n\lambda$ (the time interval between consecutive occurrence points of a defect is exponentially distributed with parameter $\lambda_i = (n-i)\lambda$)
Figure 5.4 Simplified modeling for the time behavior of a system whose failure is caused by a hardware failure ($Z_i \to Z_i'$) or by the occurrence of a software defect ($Z_i \to Z_i''$)
3. The flow of occurrence of software defects constitutes a nonhomogeneous Poisson process (Appendix A7.8.2). This model has been extensively investigated in the literature, together with reliability growth models for hardware, with different assumptions on the form of the process intensity (Section 7.7).
4. The flow of occurrence of software defects constitutes an arbitrary point process. This model is very general but difficult to investigate.
All the above models have a theoretical foundation. However, in practical applications they often suffer from a lack of information (for instance about the number of defects actually present in the software) and data. Also, they do not take into account the criticality (effect at system level) of the defects still present in the software under consideration (several minor faults are in general less critical than just one major fault). The use of nonhomogeneous Poisson processes is discussed in Section 7.7; see e.g. also [6.3, A7.30] for some critical comments. Oversimplified models should also be avoided [5.69]. For systems with hardware and software, one can often assume that defects in the software will be detected and eliminated one after the other. Only hardware failures should then remain. Figure 5.4 shows a possibility to take this into account [6.9]. However, interdependence between hardware and software can be greater than assumed in Fig. 5.4. Also, the number (n) of defects in the software at the time t = 0 is unknown, and eliminating a software defect can introduce new defects. Modeling software defects as well as systems with hardware and software is still evolving.
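Model 1 (Jelinski-Moranda) can be checked numerically with a short Python sketch. It evaluates the recursion of Eq. (5.3) with the trapezoidal rule and compares the result with the binomial form $P_i(t) = \binom{n}{i}(1-e^{-\lambda t})^i e^{-(n-i)\lambda t}$, which holds for a death process in which each of the n defects is removed after an independent exponentially distributed time with parameter $\lambda$. Function names, the step count, and the numerical values are illustrative assumptions.

```python
import math


def jm_closed_form(i: int, n: int, lam: float, t: float) -> float:
    """Binomial form of P_i(t) for a death process with rates (n - i) * lam."""
    p = 1.0 - math.exp(-lam * t)
    return math.comb(n, i) * p**i * math.exp(-(n - i) * lam * t)


def jm_recursive(n: int, lam: float, t: float, max_i: int, steps: int = 400):
    """Evaluate P_0(t), ..., P_max_i(t) from the recursion on a uniform grid."""
    h = t / steps
    grid = [k * h for k in range(steps + 1)]
    prev = [math.exp(-n * lam * x) for x in grid]        # P_0 on the grid
    out = [prev[-1]]
    for i in range(1, max_i + 1):
        # kernel (n - i + 1) * lam * exp(-(n - i) * lam * x) of the recursion
        kernel = [(n - i + 1) * lam * math.exp(-(n - i) * lam * x) for x in grid]
        cur = [0.0]                                      # P_i(0) = 0 for i >= 1
        for k in range(1, steps + 1):
            vals = [kernel[j] * prev[k - j] for j in range(k + 1)]
            # trapezoidal rule: end points carry weight 1/2
            cur.append(h * (sum(vals) - 0.5 * (vals[0] + vals[-1])))
        out.append(cur[-1])
        prev = cur
    return out


if __name__ == "__main__":
    n, lam, t = 10, 1.0, 0.2                             # illustrative values only
    for i, approx in enumerate(jm_recursive(n, lam, t, max_i=3)):
        print(i, approx, jm_closed_form(i, n, lam, t))
```

In practice n and $\lambda$ are unknown and must be estimated from observed defect occurrence times, which is the main limitation noted above.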
6 Reliability and Availability of Repairable Systems
Reliability and availability analysis of repairable systems is generally performed using stochastic processes, including Markov, semi-Markov, and semi-regenerative processes. The mathematical foundation of these processes is in Appendix A7. Equations used to investigate Markov and semi-Markov models are summarized in Table 6.2. This chapter investigates systematically most of the reliability models encountered in practical applications. Reliability figures at system level have indices Si (e.g. $MTTF_{Si}$), where S stands for system and i is the state entered at t = 0 (Table 6.2). After Section 6.1 (introduction, assumptions, conclusions), Section 6.2 investigates the one-item structure under general conditions. Sections 6.3 - 6.6 deal extensively with series, parallel, and series-parallel structures. To unify models and simplify calculations, it is assumed that the system has only one repair crew and no further failures occur at system down. Starting from constant failure and repair rates between successive states (Markov processes), generalization is performed step by step (beginning with the repair rates) up to the case in which the process involved is regenerative with a minimum number of regeneration states. Approximate expressions for large series-parallel structures are investigated in Section 6.7. Section 6.8 considers systems with complex structure for which a reliability block diagram often does not exist. On the basis of practical examples, preventive maintenance, imperfect switching, incomplete coverage, elements with more than two states, phased-mission systems, common cause failures, and general reconfigurable fault tolerant systems with reward & frequency / duration aspects are investigated. A general procedure for complex structures is given in Section 6.8.8. Section 6.9 introduces alternative investigation methods (Petri nets, dynamic FTA, computer-aided analysis), and gives a Monte Carlo approach useful for rare events.
Asymptotic & steady-state is used as a synonym for stationary (p. 476). Results are summarized in tables. Selected examples illustrate the practical aspects.
6.1 Introduction, General Assumptions, Conclusions
Investigation of the time behavior of repairable systems spans a very large class of stochastic processes, from simple Poisson process through Markov and semi-Markov processes up to sophisticated regenerative processes with only one or just a few regeneration states. Nonregenerative processes are rarely considered because
of mathematical difficulties. Important for the choice of the class of processes to be used are the distribution functions for the failure-free and repair times involved. If failure and repair rates of all elements in the system are constant during the stay time in every state (not necessarily at a state change, e.g. because of load sharing), the process involved is a (time-homogeneous) Markov process with finitely many states, for which the stay time in each state is exponentially distributed. The same holds if Erlang distributions occur (supplementary states, see e.g. Section 6.3.3). The possibility to transform a given stochastic process into a Markov process by introducing supplementary variables is not considered here. Generalization of the distribution functions for repair times leads to semi-regenerative processes, i.e. to processes with an embedded semi-Markov process. This holds in particular if the system has only one repair crew, since each termination of a repair is a renewal point (because of the constant failure rates). Arbitrary distributions of repair and failure-free times lead in general to nonregenerative stochastic processes. Table 6.1 shows the processes used in reliability investigations of repairable systems, with their possibilities and limits. Appendix A7 introduces these processes with particular emphasis on reliability applications. All equations necessary for the reliability and availability calculation of systems described by time-homogeneous Markov processes and semi-Markov processes are summarized in Table 6.2.

Table 6.1 Stochastic processes used in reliability and availability analysis of repairable systems

| Stochastic process | Can be used in modeling | Background | Difficulty |
|---|---|---|---|
| Renewal process | Spare parts provisioning in the case of arbitrary failure rates and negligible replacement or repair time (Poisson process for const. $\lambda$) | Renewal theory | Medium |
| Alternating renewal process | One-item repairable (renewable) structure with arbitrary failure and repair rates | Renewal theory | Medium |
| Markov process (MP) (finite state space, time-homogeneous) | Systems of arbitrary structure whose elements have constant failure and repair rates ($\lambda_i$, $\mu_i$) during the stay time (sojourn time) in every state (not necessarily at a state change, e.g. because of load sharing) | Differential equations or integral equations | Low |
| Semi-Markov process (SMP) | Some systems whose elements have constant or Erlangian failure rates (Erlang distributed failure-free times) and arbitrary repair rates | Integral equations | Medium |
| Semi-regenerative process (proc. with only few regen. states) | Systems with only one repair crew, arbitrary structure, and whose elements have constant failure rates and arbitrary repair rates | Integral equations | High |
| Nonregenerative process | Systems of arbitrary structure whose elements have arbitrary failure and repair rates | | |

Besides the assumption about the involved distribution functions for failure-free and repair times, reliability and availability calculation is largely influenced by the maintenance strategy, logistic support, type of redundancy, and dependence between elements. Existence of a reliability block diagram is assumed in Sections 6.2 - 6.7, not necessarily in Sections 6.8 and 6.9. Results are expressed as functions of time by solving appropriate systems of differential (or integral) equations, or given by the mean time to failure or the steady-state point availability at system level ($MTTF_{Si}$ or $PA_S$) by solving appropriate systems of algebraic equations. If the system has no redundancy, the reliability function is the same as in the nonrepairable case. In the presence of redundancy, it is generally assumed that redundant elements will be repaired without operational interruption at system level. Reliability investigations thus aim to find the occurrence of the first system down, whereas the point availability is the probability to find the system in an up state at a time t, independently of whether down states at system level have occurred before t. In order to unify models and simplify calculations, the following assumptions are made for the analyses in Sections 6.2 - 6.6 (partly also in Sections 6.7 - 6.9).
1. Continuous operation: Each element of the system is in operating or reserve state, when not under repair or waiting for repair. (6.1)
2. No further failures at system down: At system down the system is repaired (restored) according to a given maintenance strategy to an up state at system level from which operation is continued; failures during a repair at system down are not considered. (6.2)
3. Only one repair crew: At system level only one repair crew is available; repair is performed according to a stated strategy, e.g. first-in / first-out. (6.3)
4. Redundancy: Redundant elements are repaired without interruption of operation at system level; failure of redundant parts is immediately recognized. (6.4)
5. States: Each element in the reliability block diagram has only two states (good or failed); after repair (restoration) it is as-good-as-new. (6.5)
6. Independence: Failure-free and repair times of each element are stochastically independent, > 0, and continuous random variables with finite mean (MTTF, MTTR) and variance (failure-free time is used as a synonym for failure-free operating time and repair as a synonym for restoration). (6.6)
7. Support: Preventive maintenance is neglected; fault coverage, switching, and logistic support are ideal (repair time = restoration time = down time). (6.7)
The above assumptions hold for Sections 6.2 - 6.6 and apply in many practical situations. However, assumption (6.5) must be critically verified, in particular for the aspect as-good-as-new, when the repaired element does not consist of just one part which has been replaced by a new one, but contains parts which have not been replaced during the repair. This assumption is valid if the nonreplaced parts have constant (time independent) failure rates, and applies in this case to considerations at system level. At system level, reliability figures have indices Si (e.g. $MTTF_{Si}$), where S stands for system and i is the state entered at t = 0 (Table 6.2).
Assuming irreducible processes, asymptotic & steady-state is used as a synonym for stationary.
Section 6.2 considers the one-item repairable structure under general assumptions, allowing a careful investigation of the asymptotic and stationary behavior. For the basic reliability structures encountered in practical applications (series, parallel, and series-parallel), investigations in Sections 6.3 - 6.6 begin by assuming constant failure and repair rates for every element in the reliability block diagram. Distributions of the repair times, and as far as possible of the failure-free times, are then generalized step by step up to the case in which the process involved remains regenerative with a minimum number of regeneration states. This is done also to show the capability and limits of the models involved. For large series-parallel structures, approximate expressions are developed in depth in Section 6.7. Procedures for investigating repairable systems with complex structure (for which a reliability block diagram often does not exist) are given in Section 6.8 on the basis of practical examples, including imperfect switching, incomplete coverage, more than two states, phased-mission systems, common cause failures, and fault tolerant reconfigurable systems with reward & frequency / duration aspects. It is shown that the tools developed in Appendix A7 (summarized in Tab. 6.2) can be used to solve many of the problems occurring in practical applications, on a case-by-case basis working with the diagram of transition rates or a time schedule. Alternative investigation methods (Petri nets, dynamic FTA) as well as computer-aided analysis are discussed in Section 6.9, and a Monte Carlo approach useful for rare events is given. From the results of Sections 6.2 - 6.9, the following conclusions can be drawn:
1. As long as for each element in the reliability block diagram the condition MTTR ≪ MTTF holds, the shape of the distribution function of the repair time has small influence on the mean time to failure and on the steady-state availability at system level (see for instance Examples 6.7, 6.8, 6.9).
2. As a consequence of Point 1, it is preferable to start investigations by assuming Markov models (constant failure and repair rates for all elements, Table 6.2); in a second step, more appropriate distribution functions can be considered.
3. The assumption (6.2) of no further failure at system down has no influence on the reliability function; it allows a reduction of the state space and simplifies calculation of the availability and interval reliability (yielding good approximate values for the cases in which this assumption does not apply).
4. Already for moderately large systems, the use of Markov models can become time-consuming (up to e · n! states for a reliability block diagram with n elements); approximate expressions are important, and the method based on macro-structures (Table 6.10) adheres well to many practical applications.
5. For large systems or complex structures, the following possibilities are available: work directly with the diagram of transition rates (Section 6.8), calculation of the mean time to failure and of the steady-state availability at system level only (Table 6.2, Eqs. (A7.126), (A7.173), (A7. l U ), (A7.175)), use of approximate expressions (Section 6.7), use of alternative methods or Monte Carlo simulation (Section 6.9).
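The order of magnitude e · n! in Point 4 can be made plausible with a small count. The sketch below assumes (as an interpretation, not stated in the text above) that a state is identified by the ordered sequence of failed elements, as would be the case for first-in / first-out repair with one repair crew. The number of such sequences is $\sum_{k=0}^{n} n!/(n-k)!$, which equals the integer part of e · n! for n ≥ 1. The function name is hypothetical.

```python
import math


def ordered_failure_states(n: int) -> int:
    """Count ordered selections of k failed elements out of n, for k = 0..n."""
    return sum(math.perm(n, k) for k in range(n + 1))


if __name__ == "__main__":
    for n in range(1, 11):
        # compare the exact count with e * n!
        print(n, ordered_failure_states(n), math.e * math.factorial(n))
```

Already for n = 10 the count is close to 10^7, which illustrates why approximate expressions and macro-structures become important.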
Table 6.2 Relationships for the reliability, point availability, and interval reliability of systems described by time-homogeneous Markov processes & semi-Markov processes (Appendix A7.5 - A7.6)
6.2 One-Item Structure
A one-item structure is a unit of arbitrary complexity, generally considered as an entity for investigations. Its reliability block diagram is a single element (Fig. 6.1). Considering that in practical applications the one-item structure can have the complexity of a system, and also to use the same notation as in the following sections of this chapter, reliability figures are given with the indices S or S0 (e.g. $PA_S$, $R_{S0}(t)$, $MTTF_{S0}$), where S stands for system and 0 specifies item new at t = 0. Under the assumptions (6.1) to (6.3) and (6.5) to (6.7), the repairable one-item structure is completely characterized by the distribution function of the failure-free times $\tau_0, \tau_1, \ldots$
$$F_A(x) = \Pr\{\tau_0 \le x\} \quad \text{and} \quad F(x) = \Pr\{\tau_i \le x\}, \quad i = 1, 2, \ldots, \tag{6.8}$$
with densities
$$f_A(x) = \frac{dF_A(x)}{dx} \quad \text{and} \quad f(x) = \frac{dF(x)}{dx}, \tag{6.9}$$
the distribution function of the repair times $\tau_0', \tau_1', \ldots$
$$G_A(x) = \Pr\{\tau_0' \le x\} \quad \text{and} \quad G(x) = \Pr\{\tau_i' \le x\}, \quad i = 1, 2, \ldots, \tag{6.10}$$
with densities
$$g_A(x) = \frac{dG_A(x)}{dx} \quad \text{and} \quad g(x) = \frac{dG(x)}{dx}, \tag{6.11}$$
and the probability p that the one-item structure is up at t = 0
$$p = \Pr\{\text{up at } t = 0\} \tag{6.12}$$
or $1 - p = \Pr\{\text{down (i.e. under repair) at } t = 0\}$, respectively ($\tau_i$ and $\tau_i'$ are interarrival times, and x is used instead of t). The time behavior of the one-item structure can be investigated in this case with help of the alternating renewal process introduced in Appendix A7.3.
Figure 6.1 Reliability block diagram for a one-item structure
Figure 6.2 Possible time behavior of a repairable one-item structure new at t = 0 (repair times greatly exaggerated; alternating renewal process with renewal points 0, $S_{duu1}$, $S_{duu2}$, ... for a transition from down state to up state given that the item is up at t = 0, marked by •)
Section 6.2.1 considers the one-item structure new at t = 0, i.e. the case p = 1 and $F_A(x) = F(x)$, with arbitrary F(x) and G(x). Generalization of the initial conditions at t = 0 (Section 6.2.3) allows in Sections 6.2.4 and 6.2.5 an in-depth investigation of the asymptotic and steady-state behavior.
6.2.1 One-Item Structure New at Time t = 0
Figure 6.2 shows the time behavior of a one-item structure new at t = 0. $\tau_1, \tau_2, \ldots$ are the failure-free times. They are statistically independent and distributed according to F(x) as per Eq. (6.8). Similarly, $\tau_1', \tau_2', \ldots$ are the repair times, distributed according to G(x) as per Eq. (6.10). Considering assumption (6.5), the time points 0, $S_{duu1}$, ... are renewal points and constitute an ordinary renewal process embedded in the original alternating renewal process; investigations of this section are based on this property ($S_{duu}$ means a transition from down (repair) to up (operating) starting up at t = 0).
6.2.1.1 Reliability Function
The reliability function $R_{S0}(t)$ gives the probability that the item operates failure free in (0, t] given item new at t = 0
$$R_{S0}(t) = \Pr\{\text{up in } (0, t] \mid \text{new at } t = 0\}. \tag{6.13}$$
Considering Eqs. (2.7) and (6.8) it holds that
$$R_{S0}(t) = \Pr\{\tau_1 > t\} = 1 - F(t). \tag{6.14}$$
The mean time to failure given item new at t = 0 follows from Eq. (A6.38)
$$MTTF_{S0} = \int_0^{\infty} R_{S0}(t)\, dt, \tag{6.15}$$
with the upper limit of the integral being $T_L$ should the useful life of the item be limited to $T_L$ (in this case, $R_{S0}(t)$ jumps to 0 at $t = T_L$). In the following, $T_L = \infty$ will be assumed.
6.2.1.2 Point Availability
The point availability $PA_{S0}(t)$ gives the probability of finding the item operating at time t given item new at t = 0
$$PA_{S0}(t) = \Pr\{\text{up at } t \mid \text{new at } t = 0\}. \tag{6.16}$$
For $PA_{S0}(t)$ it holds that
$$PA_{S0}(t) = 1 - F(t) + \int_0^t h_{duu}(x)\,(1 - F(t - x))\, dx. \tag{6.17}$$
A(t) is often used instead of $PA_{S0}(t)$. Equation (6.17) is derived in Appendix A7.3 (Eq. (A7.56)) using the theorem of total probability. $1 - F(t)$ is the probability of no failure in (0, t], $h_{duu}(x)\,dx$ gives the probability that any one of the renewal points $S_{duu1}, S_{duu2}, \ldots$ lies in (x, x + dx], and $1 - F(t - x)$ is the probability that no further failure occurs in (x, t]. Using Laplace transform (Appendix A9.7) and considering Eq. (A7.50) with $F_A(x) = F(x)$, Eq. (6.17) yields
$$\tilde{PA}_{S0}(s) = \frac{1 - \tilde f(s)}{s\,(1 - \tilde f(s)\,\tilde g(s))}. \tag{6.18}$$
$\tilde f(s)$ and $\tilde g(s)$ are the Laplace transforms of the failure-free time and repair time densities, respectively (given by Eqs. (6.9) and (6.11)).
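Example 6.1
a) Give the Laplace transform of the point availability $PA_{S0}(t)$ for the case of a constant failure rate $\lambda$ ($\lambda(x) = \lambda$).
b) Give the Laplace transform and the corresponding time function of the point availability for the case of constant failure and repair rates $\lambda$ and $\mu$ ($\lambda(x) = \lambda$ and $\mu(x) = \mu$).

Solution
a) With $F(x) = 1 - e^{-\lambda x}$ or $f(x) = \lambda e^{-\lambda x}$, Eq. (6.18) yields
$$\tilde{PA}_{S0}(s) = \frac{1}{s + \lambda\,(1 - \tilde g(s))}.$$
Supplementary results: $g(x) = \alpha\,(\alpha x)^{\beta - 1} e^{-\alpha x} / \Gamma(\beta)$ (Eq. (A6.98)) yields
$$\tilde{PA}_{S0}(s) = \frac{(s + \alpha)^{\beta}}{(s + \lambda)(s + \alpha)^{\beta} - \lambda\,\alpha^{\beta}}.$$
b) With $f(x) = \lambda e^{-\lambda x}$ and $g(x) = \mu e^{-\mu x}$, Eq. (6.18) yields
$$\tilde{PA}_{S0}(s) = \frac{s + \mu}{s\,(s + \lambda + \mu)},$$
and thus (Table A9.7b)
$$PA_{S0}(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu}\, e^{-(\lambda + \mu)t}.$$
$PA_{S0}(t)$ converges rapidly, exponentially with a time constant $1/(\lambda + \mu) \approx 1/\mu = MTTR$, to the asymptotic value $\mu/(\lambda + \mu) \approx 1 - \lambda/\mu$, see Section 6.2.4 for an extensive discussion.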
$PA_{S0}(t)$ can also be obtained using renewal process arguments (Appendices A7.2, A7.3, A7.6). After the first repair the item is as-good-as-new. $S_{duu1}$ is a renewal point and from this time point the process restarts anew as at t = 0. Therefore
$$\Pr\{\text{up at } t \mid S_{duu1} = x\} = PA_{S0}(t - x), \quad x < t.$$
Considering that the event {up at t} occurs with exactly one of the following two mutually exclusive events
{no failure in (0, t]} or {$S_{duu1} \le t$ ∩ up at t},
it follows that
$$PA_{S0}(t) = 1 - F(t) + \int_0^t (f(x) * g(x))\, PA_{S0}(t - x)\, dx, \tag{6.22}$$
where $f(x) * g(x)$ is the density of the sum $\tau_1 + \tau_1'$ (see Fig. 6.2 and Eq. (A6.75)). The Laplace transform of $PA_{S0}(t)$ as per Eq. (6.22) is that given by Eq. (6.18).
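The two routes to the point availability can be cross-checked numerically for constant failure and repair rates. The Python sketch below solves the renewal equation $PA_{S0}(t) = 1 - F(t) + \int_0^t (f * g)(x)\, PA_{S0}(t - x)\, dx$ (Eq. (6.22)) on a uniform grid with the trapezoidal rule and compares the result with the well-known constant-rate expression $\mu/(\lambda+\mu) + \lambda/(\lambda+\mu)\, e^{-(\lambda+\mu)t}$. Function names, the grid size, and the rates are illustrative assumptions; $\lambda \ne \mu$ is required by the closed form used for $f * g$.

```python
import math


def pa_closed_form(lam: float, mu: float, t: float) -> float:
    """Point availability for constant failure rate lam and repair rate mu."""
    s = lam + mu
    return mu / s + lam / s * math.exp(-s * t)


def pa_renewal_equation(lam: float, mu: float, t: float, steps: int = 1000) -> float:
    """Solve PA(t) = 1 - F(t) + int_0^t (f*g)(x) PA(t - x) dx numerically."""
    h = t / steps
    # density of (failure-free time + repair time) for exponential f and g
    conv = [lam * mu / (mu - lam) * (math.exp(-lam * k * h) - math.exp(-mu * k * h))
            for k in range(steps + 1)]
    pa = [1.0]                                  # item new (up) at t = 0
    for k in range(1, steps + 1):
        # conv[0] = 0, so the unknown pa[k] drops out of the trapezoidal sum
        inner = sum(conv[j] * pa[k - j] for j in range(1, k))
        integral = h * (inner + 0.5 * conv[k] * pa[0])
        pa.append(math.exp(-lam * k * h) + integral)
    return pa[-1]


if __name__ == "__main__":
    lam, mu = 0.01, 1.0                         # illustrative rates (per hour)
    print(pa_renewal_equation(lam, mu, 5.0), pa_closed_form(lam, mu, 5.0))
```

For other repair time distributions only `conv` has to be replaced, e.g. by a numerically evaluated convolution of f and g.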
6.2.1.3 Average Availability
The average availability $AA_{S0}(t)$ is defined as the expected proportion of time in which the item is operating in (0, t] given item new at t = 0
$$AA_{S0}(t) = \frac{1}{t}\, E[\text{total up time in } (0, t] \mid \text{new at } t = 0]. \tag{6.23}$$
Considering $PA_{S0}(x)$ from Eq. (6.17), it holds that
$$AA_{S0}(t) = \frac{1}{t} \int_0^t PA_{S0}(x)\, dx. \tag{6.24}$$
Eq. (6.24) has a great intuitive appeal. It can be proved by considering that the time behavior of a repairable item can be described by a binary random function $\zeta(t)$ taking values 1 for up and 0 for down. From this, $E[\zeta(t)] = 1 \cdot PA_{S0}(t) + 0 \cdot (1 - PA_{S0}(t)) = PA_{S0}(t)$ and, taking care of $\int_0^t \zeta(x)\, dx$ = total up time in (0, t], it follows that
$$E[\text{total up time in } (0, t]] = E\Big[\int_0^t \zeta(x)\, dx\Big] = \int_0^t E[\zeta(x)]\, dx = \int_0^t PA_{S0}(x)\, dx.$$
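As a worked instance of Eq. (6.24), assume constant failure and repair rates $\lambda$ and $\mu$, for which the point availability of an item new at t = 0 is $PA_{S0}(t) = \mu/(\lambda+\mu) + \lambda/(\lambda+\mu)\, e^{-(\lambda+\mu)t}$. Integration then gives

```latex
AA_{S0}(t) = \frac{1}{t}\int_0^t \Big(\frac{\mu}{\lambda+\mu}
             + \frac{\lambda}{\lambda+\mu}\, e^{-(\lambda+\mu)x}\Big)\, dx
           = \frac{\mu}{\lambda+\mu}
             + \frac{\lambda}{(\lambda+\mu)^2\, t}\,\big(1 - e^{-(\lambda+\mu)t}\big),
```

so that $AA_{S0}(t)$ approaches the same asymptotic value $\mu/(\lambda+\mu)$ as $PA_{S0}(t)$, but only with 1/t instead of exponentially.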
6.2.1.4 Interval Reliability
The interval reliability $IR_{S0}(t, t + \theta)$ gives the probability that the item operates failure free during an interval $[t, t + \theta]$ given item new at t = 0
$$IR_{S0}(t, t + \theta) = \Pr\{\text{up in } [t, t + \theta] \mid \text{new at } t = 0\}. \tag{6.25}$$
The same method used to obtain Eq. (6.17) leads to
$$IR_{S0}(t, t + \theta) = 1 - F(t + \theta) + \int_0^t h_{duu}(x)\,(1 - F(t + \theta - x))\, dx. \tag{6.26}$$
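Example 6.2
Give the interval reliability $IR_{S0}(t, t + \theta)$ for the case of a constant failure rate $\lambda$ ($\lambda(x) = \lambda$).

Solution
With $F(x) = 1 - e^{-\lambda x}$ it follows that
$$IR_{S0}(t, t + \theta) = e^{-\lambda (t + \theta)} + \int_0^t h_{duu}(x)\, e^{-\lambda (t + \theta - x)}\, dx = \Big[e^{-\lambda t} + \int_0^t h_{duu}(x)\, e^{-\lambda (t - x)}\, dx\Big]\, e^{-\lambda \theta}.$$
Comparison with Eq. (6.17) for $F(x) = 1 - e^{-\lambda x}$ yields
$$IR_{S0}(t, t + \theta) = PA_{S0}(t)\, e^{-\lambda \theta}. \tag{6.27}$$
It must be pointed out that the product rule in Eq. (6.27), expressing $\Pr\{\text{up in } [t, t + \theta] \mid \text{new at } t = 0\} = \Pr\{\text{up at } t \mid \text{new at } t = 0\} \cdot \Pr\{\text{no failure in } (t, t + \theta] \mid \text{up at } t\}$, is valid only because of the constant failure rate (memoryless property, Eq. (2.14)); in the general case, the second term is $\Pr\{\text{no failure in } (t, t + \theta] \mid (\text{up at } t \cap \text{new at } t = 0)\}$, which differs from $\Pr\{\text{no failure in } (t, t + \theta] \mid \text{up at } t\}$.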
6.2.1.5 Special Kinds of Availability
In addition to the point and average availability (Sections 6.2.1.2 and 6.2.1.3), there are several other kinds of availability useful for practical applications [6.5 (1973)]:
1. Mission Availability: The mission availability $MA_{S0}(T_0, t_0)$ gives the probability that in a mission of total operating time (total up time) $T_0$ each failure can be repaired within a time span $t_0$, given item new at t = 0
$$MA_{S0}(T_0, t_0) = \Pr\{\text{each individual failure occurring in a mission with total operating time } T_0 \text{ can be repaired in a time} \le t_0 \mid \text{new at } t = 0\}. \tag{6.28}$$
Mission availability is important in applications where interruptions of length $\le t_0$ can be accepted. Its computation considers all cases with n = 0, 1, ... failures, taking care that at the end of the mission the item is operating (to reach the given (fixed) operating time $T_0$).+) Thus, for given $T_0 > 0$ and $t_0$,
$$MA_{S0}(T_0, t_0) = 1 - F(T_0) + \sum_{n=1}^{\infty} (F_n(T_0) - F_{n+1}(T_0))\,(G(t_0))^n \tag{6.29}$$
holds. $F_n(T_0) - F_{n+1}(T_0)$ is the probability for n failures during the total operating time $T_0$ (Eq. (A7.14)); $(G(t_0))^n$ is the probability that all n repair times will be shorter than $t_0$. For constant failure rate $\lambda$ it holds that $F_n(T_0) - F_{n+1}(T_0) = (\lambda T_0)^n e^{-\lambda T_0} / n!$ and thus
$$MA_{S0}(T_0, t_0) = e^{-\lambda T_0 (1 - G(t_0))}. \tag{6.30}$$
2. Work-Mission Availability: The work-mission availability $WMA_{S0}(T_0, x)$ gives the probability that the sum of the repair times for all failures occurring in a mission of total operating time (total up time) $T_0$ is $\le x$, given item new at t = 0
$$WMA_{S0}(T_0, x) = \Pr\{\text{sum of the repair times for all failures occurring in a mission of total operating time } T_0 \text{ is} \le x \mid \text{new at } t = 0\}. \tag{6.31}$$
Similarly as for Eq. (6.29) it follows that for given (fixed) $T_0 > 0$ and $x > 0$ +)
$$WMA_{S0}(T_0, x) = 1 - F(T_0) + \sum_{n=1}^{\infty} (F_n(T_0) - F_{n+1}(T_0))\, G_n(x), \tag{6.32}$$
where $G_n(x)$ is the distribution function of the sum of n repair times with distribution G(x) (Eq. (A7.13)). As for the mission availability, the item is up at the end of the mission (to reach the given (fixed) operating time $T_0$). For constant failure and repair rates ($\lambda$, $\mu$), Eq. (6.32) yields (see also Eq. (A7.219))
$$WMA_{S0}(T_0, x) = 1 - e^{-(\lambda T_0 + \mu x)} \sum_{n=1}^{\infty} \Big[\frac{(\lambda T_0)^n}{n!} \sum_{k=0}^{n-1} \frac{(\mu x)^k}{k!}\Big], \quad T_0 > 0 \text{ given}, \; x > 0, \quad WMA_{S0}(T_0, 0) = e^{-\lambda T_0}. \tag{6.33}$$
+) An unlimited number n of repairs is assumed here, see e.g. Section 4.6 (p. 136) for n limited.
Defining DT as total down time and UT = t - DT as total up time in (0, t], one can recognize that for given fixed t, $WMA_{S0}(t - x, x) = \Pr\{DT \text{ in } (0, t] \le x\}$ holds for an item described by Fig. 6.2 ($t > 0$, $0 < x \le t$). However, the item can now be up or down at t, and the situation differs from that defined by Eq. (6.31). The function $WMA_{S0}(t - x, x)$ has been investigated in [A7.29 (1957)]. In particular, a closed analytical expression for $WMA_{S0}(t - x, x)$ is given for constant failure and repair rates ($\lambda$, $\mu$), and it is shown that the distribution of DT converges for $t \to \infty$ to a normal distribution with mean $t\lambda/(\lambda + \mu) \approx t\lambda/\mu$ and variance $t\,2\lambda\mu/(\lambda + \mu)^3 \approx t\,2\lambda/\mu^2$. It can be noted that, for the interpretation given by Eq. (6.31), mean and variance of the total repair time are given by $T_0\lambda/\mu$ and $T_0\,2\lambda/\mu^2$, respectively (Eq. (A7.220)).
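For constant failure and repair rates, the mission and work-mission availability are easy to evaluate numerically. The Python sketch below computes the mission availability both from the closed form $e^{-\lambda T_0 (1 - G(t_0))}$ and from the truncated series over the number of failures, and the work-mission availability from the double series $1 - e^{-(\lambda T_0 + \mu x)} \sum_{n \ge 1} \frac{(\lambda T_0)^n}{n!} \sum_{k=0}^{n-1} \frac{(\mu x)^k}{k!}$. Function names, the truncation limit `n_max`, and the numerical values are illustrative assumptions.

```python
import math


def mission_availability(lam: float, T0: float, G_t0: float) -> float:
    """Closed form for constant failure rate; G_t0 = Pr{repair time <= t0}."""
    return math.exp(-lam * T0 * (1.0 - G_t0))


def mission_availability_series(lam: float, T0: float, G_t0: float,
                                n_max: int = 100) -> float:
    """Truncated sum over n of Pr{n failures in T0} * G(t0)**n."""
    a = lam * T0
    term = math.exp(-a)                  # n = 0: no failure during the mission
    total = term
    for n in range(1, n_max + 1):
        term *= a * G_t0 / n             # (a * G)^n / n! * exp(-a)
        total += term
    return total


def work_mission_availability(lam: float, mu: float, T0: float, x: float,
                              n_max: int = 100) -> float:
    """Truncated double series for constant failure and repair rates."""
    a, b = lam * T0, mu * x
    weight = 1.0                         # a^n / n!
    inner = 0.0                          # sum_{k=0}^{n-1} b^k / k!
    inner_term = 1.0                     # b^k / k! for k = n - 1
    outer = 0.0
    for n in range(1, n_max + 1):
        weight *= a / n
        inner += inner_term
        inner_term *= b / n
        outer += weight * inner
    return 1.0 - math.exp(-(a + b)) * outer


if __name__ == "__main__":
    lam, mu, T0 = 0.001, 0.5, 1000.0     # illustrative values (per hour, hours)
    G_t0 = 1.0 - math.exp(-mu * 4.0)     # exponential repair time, t0 = 4 h
    print(mission_availability(lam, T0, G_t0))
    print(work_mission_availability(lam, mu, T0, x=4.0))
```

The truncation at `n_max` terms is adequate as long as $\lambda T_0$ is small compared with `n_max`; for long missions the limit should be increased accordingly.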
-
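The work-mission availability for constant failure and repair rates can be evaluated numerically. The following is a minimal sketch, assuming the Poisson/Erlang form of Eq. (6.32): the number of failures in the operating time $T_0$ is Poisson distributed with parameter $\lambda T_0$, and the sum of n exponential repair times has an Erlang distribution function $G_n(x)$. The function name, argument names, and the truncation tolerance are illustrative choices, not taken from the text.

```python
import math


def work_mission_availability(lam, mu, t0_op, x, tol=1e-12, max_terms=100_000):
    """Pr{total repair time <= x} for a mission with operating time t0_op.

    Evaluates sum_n Pr{n failures in t0_op} * G_n(x), where the number of
    failures is Poisson(lam * t0_op) and G_n is the Erlang(n, mu) distribution
    function of the sum of n exponential repair times (G_0 = 1).
    """
    a, b = lam * t0_op, mu * x
    p_n = math.exp(-a)          # Pr{n failures}, starting with n = 0
    erl_term = math.exp(-b)     # e^{-b} b^k / k!, starting with k = 0
    one_minus_gn = 0.0          # 1 - G_n(x), starting with n = 0
    wma, mass = 0.0, 0.0
    for n in range(max_terms):
        wma += p_n * (1.0 - one_minus_gn)
        mass += p_n
        if 1.0 - mass < tol:    # remaining Poisson mass is negligible
            break
        one_minus_gn += erl_term        # now 1 - G_{n+1}(x)
        erl_term *= b / (n + 1)
        p_n *= a / (n + 1)              # now Pr{n + 1 failures}
    return wma


# Illustrative values: lambda = 1e-3 per hour, mu = 0.5 per hour, T0 = 1000 h.
for x in (0.0, 2.0, 10.0):
    print(x, work_mission_availability(1e-3, 0.5, 1000.0, x))
```

For x = 0 the sum reduces to the probability of no failure in $T_0$, in agreement with $WMA_{S0}(T_0,0) = e^{-\lambda T_0}$.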
3. Joint Availability: The joint availability $JA_{S0}(t, t+\theta)$ gives the probability of finding the item operating at the time points t and t+θ, given item new at t = 0 (t and t+θ are given fixed time points, see e.g. [6.14 (1999), 6.28] for stochastic demand)

$$JA_{S0}(t,t+\theta) = \Pr\{(\text{up at } t \cap \text{up at } t+\theta) \mid \text{new at } t=0\}. \tag{6.34}$$
For the case of a constant failure rate λ(x) = λ, the multiplication rule of Eq. (6.27) yields

$$JA_{S0}(t,t+\theta) = PA_{S0}(t)\, PA_{S0}(\theta). \tag{6.35}$$
For arbitrary failure rate, one has to consider that {up at t ∩ up at t+θ | new at t = 0} occurs with one of the following 2 mutually exclusive events (Appendix A7.3)

{up in [t, t+θ] | new at t = 0}

or

{(up at t ∩ next failure occurs before t+θ ∩ up at t+θ) | new at t = 0}.
The probability for the first event is the interval reliability $IR_{S0}(t,t+\theta)$ given by Eq. (6.26). For the second event, it is necessary to consider the distribution function of the forward recurrence time in the up state $\tau_{Ru}(t)$. As shown in Fig. 6.3, $\tau_{Ru}(t)$ can only be defined if the item is up at time t, hence

$$\Pr\{\tau_{Ru}(t) > x \mid \text{new at } t=0\} = \Pr\{\text{up in } (t,t+x] \mid (\text{up at } t \cap \text{new at } t=0)\}$$

and thus, as in Example A7.2 and considering Eqs. (6.16) and (6.25),
$$\Pr\{\tau_{Ru}(t) > x \mid \text{new at } t=0\} = \frac{\Pr\{\text{up in } [t,t+x] \mid \text{new at } t=0\}}{\Pr\{\text{up at } t \mid \text{new at } t=0\}} = \frac{IR_{S0}(t,t+x)}{PA_{S0}(t)} = 1 - F_{\tau_{Ru}}(x). \tag{6.36}$$
For constant failure rate λ(x) = λ one has $1 - F_{\tau_{Ru}}(x) = e^{-\lambda x}$, as per Eq. (6.27). Considering Eq. (6.36) it follows that

$$JA_{S0}(t,t+\theta) = IR_{S0}(t,t+\theta) + PA_{S0}(t) \int_0^{\theta} f_{\tau_{Ru}}(x)\, PA_{S1}(\theta - x)\, dx, \tag{6.37}$$
where $PA_{S1}(t) = \Pr\{\text{up at } t \mid \text{a repair begins at } t=0\}$ is given by

$$PA_{S1}(t) = \int_0^t h_{dud}(x)\,(1 - F(t-x))\, dx, \tag{6.38}$$
with $h_{dud}(t) = g(t) + g(t)*f(t)*g(t) + g(t)*f(t)*g(t)*f(t)*g(t) + \ldots$ (Eq. (A7.50)). $JA_{S0}(t,t+\theta)$ can also be obtained in a similar way to $PA_{S0}(t)$ in Eq. (6.17), by considering the alternating renewal process starting up at the time t with $\tau_{Ru}(t)$ distributed according to $F_{\tau_{Ru}}(x)$ as per Eq. (6.36). This leads to

$$JA_{S0}(t,t+\theta) = IR_{S0}(t,t+\theta) + \int_0^{\theta} h'_{duu}(x)\,(1 - F(\theta - x))\, dx, \tag{6.39}$$

with $h'_{duu}(x) = f'_{\tau_{Ru}}(x)*g(x) + f'_{\tau_{Ru}}(x)*g(x)*f(x)*g(x) + \ldots$, see Eq. (A7.50), and $f'_{\tau_{Ru}}(x) = PA_{S0}(t)\, f_{\tau_{Ru}}(x) = PA_{S0}(t)\, dF_{\tau_{Ru}}(x)/dx = -\partial IR_{S0}(t,t+x)/\partial x$, see Eqs. (6.36) and (6.37). Similarly as for $\tau_{Ru}(t)$, the distribution function for the forward recurrence time in the down state $\tau_{Rd}(t)$ is given by (Fig. 6.3)

$$\Pr\{\tau_{Rd}(t) \le x \mid \text{new at } t=0\} = 1 - \frac{\int_0^t h_{udu}(y)\,(1 - G(t+x-y))\, dy}{1 - PA_{S0}(t)}, \tag{6.40}$$

with $h_{udu}(t) = f(t) + f(t)*g(t)*f(t) + \ldots$ (Eq. (A7.50)). For constant failure rate λ(x) = λ, Eq. (6.37) or (6.39) leads to Eq. (6.35), by considering Eq. (6.19).
Figure 6.3 Forward recurrence times $\tau_{Ru}(t)$ and $\tau_{Rd}(t)$ in an alternating renewal process
6.2.2 One-Item Structure New at Time t = 0 and with Constant Failure Rate λ

In many practical applications, a constant failure rate λ can be assumed. In this case, the expressions of Section 6.2.1 can be simplified by making use of the memoryless property given by the constant failure rate. Table 6.3 summarizes the results for the cases of constant failure rate (λ) and constant or arbitrary repair rate (μ or μ(x) = g(x)/(1 - G(x))).
6.2.3 One-Item Structure with Arbitrary Initial Conditions at Time t = 0

Generalization of the initial conditions at time t = 0, i.e. the introduction of p, $F_A(x)$ and $G_A(x)$ as defined by Eqs. (6.12), (6.8), and (6.10), leads to a time behavior of the one-item repairable structure described by Fig. A7.3 and to the following results:

1. Reliability function $R_S(t)$

$$R_S(t) = \Pr\{\text{up in } (0,t] \mid \text{up at } t=0\} = 1 - F_A(t). \tag{6.41}$$

Equation (6.41) follows from Pr{up in [0,t]} = Pr{up at t = 0 ∩ up in (0,t]} = Pr{up at t = 0} · Pr{up in (0,t] | up at t = 0} = p · (1 - $F_A(t)$) = p · $R_S(t)$.
2. Point availability $PA_S(t)$

$$PA_S(t) = p \left[ 1 - F_A(t) + \int_0^t h_{duu}(x)\,(1 - F(t-x))\, dx \right] + (1-p) \int_0^t h_{dud}(x)\,(1 - F(t-x))\, dx, \tag{6.42}$$

with $h_{duu}(t) = f_A(t)*g(t) + f_A(t)*g(t)*f(t)*g(t) + \ldots$ and $h_{dud}(t) = g_A(t) + g_A(t)*f(t)*g(t) + g_A(t)*f(t)*g(t)*f(t)*g(t) + \ldots$ .

3. Average availability $AA_S(t)$

$$AA_S(t) = \frac{1}{t}\, E[\text{total up time in } (0,t]] = \frac{1}{t} \int_0^t PA_S(x)\, dx. \tag{6.43}$$
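4. Interval reliability $IR_S(t, t+\theta)$

$$IR_S(t,t+\theta) = p \left[ 1 - F_A(t+\theta) + \int_0^t h_{duu}(x)\,(1 - F(t+\theta-x))\, dx \right] + (1-p) \int_0^t h_{dud}(x)\,(1 - F(t+\theta-x))\, dx. \tag{6.44}$$

5. Joint availability $JA_S(t, t+\theta)$

$$JA_S(t,t+\theta) = \Pr\{\text{up at } t \cap \text{up at } t+\theta\} = IR_S(t,t+\theta) - \int_0^{\theta} \frac{\partial IR_S(t,t+x)}{\partial x}\, PA_{S1}(\theta-x)\, dx, \tag{6.45}$$

with $IR_S(t, t+\theta)$ from Eq. (6.44) and $PA_{S1}(t)$ from Eq. (6.38).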
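Table 6.3 Results for a repairable one-item structure new at t = 0 and with constant failure rate λ

1. Reliability function $R_{S0}(t)$
   Repair rate arbitrary: $e^{-\lambda t}$
   Repair rate constant (μ) *): $e^{-\lambda t}$
   Remarks, assumptions: $R_{S0}(t) = \Pr\{\text{up in } (0,t] \mid \text{new at } t=0\}$

2. Point availability $PA_{S0}(t)$
   Repair rate arbitrary: $e^{-\lambda t} + \int_0^t h_{duu}(x)\, e^{-\lambda (t-x)}\, dx$
   Repair rate constant (μ) *): $\frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu}\, e^{-(\lambda+\mu) t}$
   Remarks, assumptions: $PA_{S0}(t) = \Pr\{\text{up at } t \mid \text{new at } t=0\}$, $h_{duu} = f*g + f*g*f*g + \ldots$

3. Average availability $AA_{S0}(t)$
   Repair rate arbitrary: $\frac{1}{t} \int_0^t PA_{S0}(x)\, dx$
   Repair rate constant (μ) *): $\frac{\mu}{\lambda+\mu} + \frac{\lambda\,(1 - e^{-(\lambda+\mu) t})}{t\,(\lambda+\mu)^2}$
   Remarks, assumptions: $AA_{S0}(t) = E[\text{total up time in } (0,t] \mid \text{new at } t=0]/t$

4. Interval reliability $IR_{S0}(t, t+\theta)$
   Repair rate arbitrary or constant (μ) *): $PA_{S0}(t)\, e^{-\lambda\theta}$
   Remarks, assumptions: $IR_{S0}(t,t+\theta) = \Pr\{\text{up in } [t,t+\theta] \mid \text{new at } t=0\}$, $PA_{S0}(x)$ as in point 2

5. Joint availability $JA_{S0}(t, t+\theta)$
   Repair rate arbitrary or constant (μ) *): $PA_{S0}(t)\, PA_{S0}(\theta)$
   Remarks, assumptions: $JA_{S0}(t,t+\theta) = \Pr\{\text{up at } t \cap \text{up at } t+\theta \mid \text{new at } t=0\}$, $PA_{S0}(x)$ as in point 2

6. Mission availability $MA_{S0}(T_0, t_0)$
   Repair rate arbitrary: $e^{-\lambda T_0 (1 - G(t_0))}$
   Repair rate constant (μ) *): $e^{-\lambda T_0 e^{-\mu t_0}}$
   Remarks, assumptions: $MA_{S0}(T_0,t_0) = \Pr\{\text{each failure in a mission with total operating time } T_0 \text{ can be repaired in a time} \le t_0 \mid \text{new at } t=0\}$

λ = failure rate; $\Pr\{\tau_{Ru}(t) \le x\} = 1 - e^{-\lambda x}$ (Fig. 6.3); up means in the operating state; *) Markov process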
6. Forward recurrence time ($\tau_{Ru}(t)$ and $\tau_{Rd}(t)$ as in Fig. 6.3)

$$\Pr\{\tau_{Ru}(t) \le x\} = 1 - IR_S(t,t+x)/PA_S(t), \tag{6.46}$$

with $IR_S(t, t+x)$ according to Eq. (6.44) and $PA_S(t)$ from Eq. (6.42), and

$$\Pr\{\tau_{Rd}(t) \le x\} = 1 - \frac{\Pr\{\text{down in } [t,t+x]\}}{1 - PA_S(t)}, \tag{6.47}$$

where

$$\Pr\{\text{down in } [t,t+x]\} = p \int_0^t h_{udu}(y)\,(1 - G(t+x-y))\, dy + (1-p) \left[ 1 - G_A(t+x) + \int_0^t h_{udd}(y)\,(1 - G(t+x-y))\, dy \right],$$

with $h_{udu}(t) = f_A(t) + f_A(t)*g(t)*f(t) + f_A(t)*g(t)*f(t)*g(t)*f(t) + \ldots$ and $h_{udd}(t) = g_A(t)*f(t) + g_A(t)*f(t)*g(t)*f(t) + \ldots$ .

Expressions for mission availability and work-mission availability are generally only used with items new at time t = 0 (see [6.5 (1973)] for a generalization).
6.2.4 Asymptotic Behavior

As t → ∞, expressions for the point availability, average availability, interval reliability, joint availability, and distribution function of the forward recurrence time (Eqs. (6.42)-(6.47)) converge to quantities which are independent of t and initial conditions at t = 0. Using the key renewal theorem (Eq. (A7.29)) it follows that

$$\lim_{t\to\infty} PA_S(t) = PA_S = \frac{MTTF}{MTTF + MTTR}, \tag{6.48}$$

$$\lim_{t\to\infty} AA_S(t) = AA_S = \frac{MTTF}{MTTF + MTTR} = PA_S, \tag{6.49}$$

$$\lim_{t\to\infty} IR_S(t,t+\theta) = IR_S(\theta) = \frac{1}{MTTF + MTTR} \int_{\theta}^{\infty} (1 - F(y))\, dy, \tag{6.50}$$

$$\lim_{t\to\infty} JA_S(t,t+\theta) = JA_S(\theta) = \frac{MTTF}{MTTF + MTTR}\, PA_{S0e}(\theta), \tag{6.51}$$

$$\lim_{t\to\infty} \Pr\{\tau_{Ru}(t) \le x\} = \frac{1}{MTTF} \int_0^x (1 - F(y))\, dy, \tag{6.52}$$

$$\lim_{t\to\infty} \Pr\{\tau_{Rd}(t) \le x\} = \frac{1}{MTTR} \int_0^x (1 - G(y))\, dy, \tag{6.53}$$
where MTTF = E[$\tau_i$], MTTR = E[$\tau'_i$], i = 1, 2, ..., and $PA_{S0e}(\theta)$ is the point availability according to Eq. (6.42) with p = 1 and $F_A(t)$ from Eq. (6.57) or Eq. (6.52). In practical applications, PA and AA (or $PA_S$ and $AA_S$ for system oriented values) are often referred to as availability and denoted by A. The use of $PA_S = AA_S$ = (MTBF - MTTR)/MTBF is to be avoided, because it implies MTBF = MTTF + MTTR.
Example 6.3
Show that for a repairable one-item structure in continuous operation, the limit

$$\lim_{t\to\infty} PA_S(t) = PA_S = \frac{MTTF}{MTTF + MTTR}$$

is valid for any distribution function F(x) of the failure-free time and G(x) of the repair time, if MTTF < ∞, MTTR < ∞, and the densities f(x) and g(x) go to 0 as x → ∞.
Solution
Using the renewal density theorem Eq. (A7.31) it follows that

$$\lim_{t\to\infty} h_{duu}(t) = \lim_{t\to\infty} h_{dud}(t) = \frac{1}{MTTF + MTTR}.$$

Furthermore, applying the key renewal theorem Eq. (A7.29) to $PA_S(t)$ given by Eq. (6.42) yields

$$\lim_{t\to\infty} PA_S(t) = p \left( 1 - 1 + \frac{\int_0^{\infty} (1 - F(x))\, dx}{MTTF + MTTR} \right) + (1-p)\, \frac{\int_0^{\infty} (1 - F(x))\, dx}{MTTF + MTTR} = p\, \frac{MTTF}{MTTF + MTTR} + (1-p)\, \frac{MTTF}{MTTF + MTTR} = \frac{MTTF}{MTTF + MTTR}.$$
The limit MTTF/(MTTF + MTTR) can also be obtained from the final value theorem of the Laplace transform (Table A9.7), considering for s → 0

$$\tilde{f}(s) = 1 - s\, MTTF + o(s) \approx 1 - s\, MTTF \quad \text{and} \quad \tilde{g}(s) = 1 - s\, MTTR + o(s) \approx 1 - s\, MTTR, \tag{6.54}$$

with o(s) as per Eq. (A7.89). When considering $\tilde{g}(\lambda)$ for availability calculations, the approximation given by Eq. (6.54) often leads to $PA_S = 1$, even for simple redundancy structures. In these cases, Eq. (6.113) has to be used.
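The distribution-free character of this limit can be illustrated by simulation. The following is a minimal sketch, not part of the original example: it assumes Weibull distributed failure-free times and lognormally distributed repair times (both with arbitrarily chosen parameters) and estimates the long-run fraction of up time, which is the limit of the average availability and thus equals MTTF/(MTTF + MTTR). The function name and all numerical values are illustrative assumptions.

```python
import math
import random


def simulated_up_fraction(n_cycles, seed=1):
    """Fraction of up time over n_cycles failure/repair cycles."""
    rng = random.Random(seed)
    up = down = 0.0
    for _ in range(n_cycles):
        up += rng.weibullvariate(100.0, 2.0)             # failure-free time (scale 100 h, shape 2)
        down += rng.lognormvariate(math.log(4.0), 0.6)   # repair time (median 4 h)
    return up / (up + down)


mttf = 100.0 * math.gamma(1.5)        # mean of the assumed Weibull distribution
mttr = 4.0 * math.exp(0.6 ** 2 / 2)   # mean of the assumed lognormal distribution
limit = mttf / (mttf + mttr)
estimate = simulated_up_fraction(200_000)
print(f"MTTF/(MTTF+MTTR) = {limit:.4f}, simulated up fraction = {estimate:.4f}")
```

Neither distribution is exponential, yet the simulated fraction settles close to the ratio of the means, as stated in Example 6.3.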
In the case of constant failure & repair rates λ(x) = λ and μ(x) = μ, Eq. (6.42) yields

$$PA_S(t) = \frac{\mu}{\lambda+\mu} + \left( p - \frac{\mu}{\lambda+\mu} \right) e^{-(\lambda+\mu) t}. \tag{6.55}$$

Thus, for this important case, the convergence of $PA_S(t)$ toward $PA_S = \mu/(\lambda+\mu)$ is exponential with a time constant 1/(λ+μ) < 1/μ = MTTR. In particular, for
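p = 1, i.e. for $PA_S(0) = 1$ and $PA_S(t) \equiv PA_{S0}(t)$, it follows that

$$PA_{S0}(t) - PA_S = \frac{\lambda}{\lambda+\mu}\, e^{-(\lambda+\mu) t}. \tag{6.56}$$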
Generalizing the distribution function G(x) of the repair time and/or F(x) of the failure-free time, $PA_{S0}(t)$ oscillates damped (as in general for the renewal density h(t) given by Eq. (A7.18)). However, for constant failure rate λ and provided that λ MTTR is sufficiently small and some rather weak conditions on the density g(x) are satisfied, lower and upper bounds for $PA_{S0}(t)$ can be found [6.25]

$$PA_{S0}(t) \ge \frac{1}{1 + \lambda\, MTTR} - C_l\, \frac{\lambda\, MTTR}{1 + \lambda\, MTTR}\, e^{-(\lambda + 1/MTTR)\, t}, \qquad t \ge 0,$$

and

$$PA_{S0}(t) \le \frac{1}{1 + \lambda\, MTTR} + C_u\, \frac{\lambda\, MTTR}{1 + \lambda\, MTTR}\, e^{-(\lambda + 1/MTTR)\, t}, \qquad t \ge 0.$$

$C_l = 1$ holds for many practical applications (λ MTTR << 0.1). Sufficient conditions for $C_u = 1$ are given in [6.25]. However, conditions on $C_u$ are less important than those on $C_l$, since $PA_{S0}(t) \le 1$ is always true. The case of a gamma distribution with density $g(x) = \alpha^{\beta} x^{\beta-1} e^{-\alpha x}/\Gamma(\beta)$, mean β/α, and shape parameter β ≥ 3, leads for instance to $|PA_{S0}(t) - PA_S| \le \lambda\, MTTR\, e^{-t/MTTR}$ at least for t ≥ 3 MTTR = 3β/α.
6.2.5 Steady-State Behavior

For

$$p = \frac{MTTF}{MTTF + MTTR}, \qquad F_A(x) = \frac{1}{MTTF} \int_0^x (1 - F(y))\, dy, \qquad G_A(x) = \frac{1}{MTTR} \int_0^x (1 - G(y))\, dy, \tag{6.57}$$

the alternating renewal process describing the time behavior of a one-item repairable structure is stationary (in steady-state), see Appendix A7.3. With p, $F_A(t)$, and $G_A(t)$ as per Eq. (6.57), the expressions for the point availability (6.42), average availability (6.43), interval reliability (6.44), joint availability (6.45), and the distribution functions of the forward recurrence time (6.46) and (6.47) take the values given by Eqs. (6.48) - (6.53) for all t ≥ 0, see Example 6.4 for the point availability $PA_S$. This relationship between asymptotic & steady-state (stationary) behavior is important in practical applications because it allows the following interpretation (see also the remark on p. 450):
A one-item repairable structure is in a steady-state (stationary behavior) if it began operating at the time t = -∞ and will be considered only for t ≥ 0, the time t = 0 being an arbitrary time point.
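Table 6.4 Results for a repairable one-item structure in asymptotic & steady-state (stationary) behavior

1. Pr{up at t = 0} (p)
   Failure and repair rates arbitrary: $\frac{MTTF}{MTTF + MTTR}$
   Failure and repair rates constant *): $\frac{\mu}{\lambda+\mu}$
   Remarks, assumptions: MTTF = E[$\tau_i$], MTTR = E[$\tau'_i$], i ≥ 1

2. Distribution of $\tau_0$ ($F_A(x) = \Pr\{\tau_{Ru}(t) \le x\}$)
   Arbitrary: $\frac{1}{MTTF} \int_0^x (1 - F(y))\, dy$
   Constant *): $1 - e^{-\lambda x}$
   Remarks, assumptions: $F_A(x)$ is also the distribution function of $\tau_{Ru}(t)$ as in Fig. 6.3

3. Distribution of $\tau'_0$ ($G_A(x) = \Pr\{\tau_{Rd}(t) \le x\}$)
   Arbitrary: $\frac{1}{MTTR} \int_0^x (1 - G(y))\, dy$
   Constant *): $1 - e^{-\mu x}$
   Remarks, assumptions: $G_A(x)$ is also the distribution function of $\tau_{Rd}(t)$ as in Fig. 6.3

4. Renewal densities $h_{du}(t)$ and $h_{ud}(t)$
   Arbitrary: $\frac{1}{MTTF + MTTR}$
   Constant *): $\frac{\lambda\mu}{\lambda+\mu}$

5. Point availability ($PA_S$)
   Arbitrary: $\frac{MTTF}{MTTF + MTTR}$
   Constant *): $\frac{\mu}{\lambda+\mu}$

6. Average availability ($AA_S$)
   Arbitrary: $\frac{MTTF}{MTTF + MTTR}$
   Constant *): $\frac{\mu}{\lambda+\mu}$
   Remarks, assumptions: $AA_S = E[\text{total up time in } (0,t]]/t$, t > 0

7. Interval reliability ($IR_S(\theta)$)
   Arbitrary: $\frac{1}{MTTF + MTTR} \int_{\theta}^{\infty} (1 - F(x))\, dx$
   Constant *): $\frac{\mu}{\lambda+\mu}\, e^{-\lambda\theta}$

8. Joint availability ($JA_S(\theta)$)
   Arbitrary: $\frac{MTTF \cdot PA_{S0e}(\theta)}{MTTF + MTTR}$
   Constant *): $\frac{\mu}{\lambda+\mu} \left( \frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu}\, e^{-(\lambda+\mu)\theta} \right)$
   Remarks, assumptions: $JA_S(\theta) = \Pr\{\text{up at } t \cap \text{up at } t+\theta\}$, $PA_{S0e}(\theta) = PA_{S0}(\theta)$ as per Eq. (6.42) with p = 1 and $F_A(t)$ as in point 2

λ, μ = failure, repair rate; up = operating state; $h_{ud}(t)$ = failure frequency, $h_{du}(t)$ = repair frequency; *) Markov process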
For constant failure rate λ and repair rate μ, the convergence of $PA_{S0}(t)$ to $PA_S$ is exponential with time constant ≈ 1/μ = MTTR as per Eq. (6.55). Extrapolating the results of Section 6.2.4, one can assume that for practical applications, the function $PA_{S0}(t)$ is captured at least for some $t > t_0 > 0$ in the band $|PA_{S0}(t) - PA_S| \le \lambda\, MTTR\, e^{-t/MTTR}$ when generalizing the distribution function of repair times. Thus, for practical purposes one can assume that after a time t ≈ 10 MTTR, the point availability $PA_{S0}(t)$ has reached its steady-state (stationary) value $PA_S = AA_S$.
Important results for the steady-state behavior of a repairable one-item structure are given in Table 6.4.
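The rule of thumb t ≈ 10 MTTR can be checked numerically for constant failure and repair rates. The sketch below assumes the standard two-state Markov result $PA_S(t) = \mu/(\lambda+\mu) + (p - \mu/(\lambda+\mu))\, e^{-(\lambda+\mu)t}$, with p = Pr{up at t = 0}; the function name and the numerical rates are illustrative assumptions.

```python
import math


def point_availability(lam, mu, t, p=1.0):
    """Point availability of a one-item structure with constant rates.

    p is the probability of being up at t = 0 (p = 1: item new at t = 0).
    """
    pa_s = mu / (lam + mu)                        # steady-state value
    return pa_s + (p - pa_s) * math.exp(-(lam + mu) * t)


lam, mu = 1e-3, 0.5                               # per hour, i.e. MTTR = 2 h
mttr = 1.0 / mu
pa_s = mu / (lam + mu)
for k in (0, 1, 3, 10):
    gap = point_availability(lam, mu, k * mttr) - pa_s
    band = lam * mttr * math.exp(-k)              # lambda * MTTR * e^{-t/MTTR}
    print(f"t = {k:2d} MTTR: PA_S0(t) - PA_S = {gap:.3e}, band = {band:.3e}")
```

With these values the distance to the steady-state value after 10 MTTR is below $10^{-7}$ and stays inside the band $\lambda\, MTTR\, e^{-t/MTTR}$ for all t ≥ 0.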
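Example 6.4
Show that for a repairable one-item structure in steady-state, i.e. with p, $F_A(x)$, and $G_A(x)$ as per Eq. (6.57), the point availability is $PA_S(t) = PA_S$ = MTTF/(MTTF + MTTR) for all t > 0.

Solution
Applying the Laplace transform to Eq. (6.42) and using Eqs. (A7.50) and (6.57) yields

$$\tilde{PA}_S(s) = \frac{MTTF}{MTTF + MTTR} \left[ \frac{1}{s} - \frac{1 - \tilde{f}(s)}{s^2\, MTTF} + \frac{(1 - \tilde{f}(s))\, \tilde{g}(s)}{s\, MTTF\, (1 - \tilde{f}(s)\tilde{g}(s))} \cdot \frac{1 - \tilde{f}(s)}{s} \right] + \frac{MTTR}{MTTF + MTTR} \cdot \frac{1 - \tilde{g}(s)}{s\, MTTR\, (1 - \tilde{f}(s)\tilde{g}(s))} \cdot \frac{1 - \tilde{f}(s)}{s}$$

and finally

$$\tilde{PA}_S(s) = \frac{MTTF}{MTTF + MTTR} \cdot \frac{1}{s} + \frac{1 - \tilde{f}(s)}{s^2\, (MTTF + MTTR)} \left[ \frac{(1 - \tilde{f}(s))\, \tilde{g}(s) + 1 - \tilde{g}(s)}{1 - \tilde{f}(s)\tilde{g}(s)} - 1 \right],$$

from which

$$\tilde{PA}_S(s) = \frac{MTTF}{MTTF + MTTR} \cdot \frac{1}{s},$$

and thus $PA_S(t) = PA_S$ for all t > 0.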
6.3 Systems without Redundancy

The reliability block diagram of a system without redundancy consists of the series connection of all its elements $E_1$ to $E_n$, see Fig. 6.4. Each element $E_i$ in Fig. 6.4 is characterized by the distribution functions $F_i(x)$ for the failure-free time and $G_i(x)$ for the repair time.
6.3.1 Series Structure with Constant Failure and Repair Rates for Each Element

In this section, constant failure and repair rates are assumed, i.e.

$$F_i(x) = 1 - e^{-\lambda_i x}, \qquad x \ge 0, \tag{6.58}$$

and

$$G_i(x) = 1 - e^{-\mu_i x}, \qquad x > 0, \tag{6.59}$$
Figure 6.4 Reliability block diagram for a system without redundancy (series structure)
holds for i = 1, ..., n. Because of Eqs. (6.58) and (6.59), the stochastic behavior of the system is described by a time-homogeneous Markov process. Let $Z_0$ be the system up state and $Z_i$ the state in which element $E_i$ is down. Taking assumption (6.2) into account, i.e. neglecting further failures during a repair at system level (in short: no further failures at system down), the corresponding diagram of transition probabilities in (t, t+δt] is given in Fig. 6.5. Equations of Table 6.2 can be used to obtain the expressions for the reliability function, point availability and interval reliability. With $U = \{Z_0\}$, $\bar{U} = \{Z_1, \ldots, Z_n\}$ and the transition rates according to Fig. 6.5, the reliability function (see Table 6.2 for notation) follows from

$$R_{S0}(t) = e^{-\lambda_S t}, \qquad \text{with } \lambda_S = \sum_{i=1}^{n} \lambda_i, \tag{6.60}$$
Figure 6.5 Diagram of the transition probabilities in (t, t+δt] for a repairable series structure (constant failure and repair rates $\lambda_i$ and $\mu_i$, only one repair crew, ideal failure recognition & switch, no further failures at system down, arbitrary t, δt ↓ 0, Markov process)
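and thus, for the mean time to failure,

$$MTTF_{S0} = \frac{1}{\lambda_S}. \tag{6.61}$$

The point availability is given by

$$PA_{S0}(t) = P_{00}(t), \tag{6.62}$$

with $P_{00}(t)$ from (Table 6.2)

$$P_{00}(t) = e^{-\lambda_S t} + \sum_{i=1}^{n} \int_0^t \lambda_i\, e^{-\lambda_S x}\, P_{i0}(t-x)\, dx, \qquad P_{i0}(t) = \int_0^t \mu_i\, e^{-\mu_i x}\, P_{00}(t-x)\, dx, \quad i = 1, \ldots, n. \tag{6.63}$$

The solution of Eq. (6.63) leads to the following Laplace transform (Table A9.7) for $PA_{S0}(t)$

$$\tilde{PA}_{S0}(s) = \frac{1}{s \left( 1 + \sum_{i=1}^{n} \frac{\lambda_i}{s + \mu_i} \right)}. \tag{6.64}$$

From Eq. (6.64) there follows the asymptotic & steady-state value of the point and average availability

$$PA_S = AA_S = \frac{1}{1 + \sum_{i=1}^{n} \lambda_i/\mu_i} \approx 1 - \sum_{i=1}^{n} \frac{\lambda_i}{\mu_i}. \tag{6.65}$$

Because of the constant failure rate of all elements, the interval reliability can be directly obtained from Eq. (6.27) by

$$IR_{S0}(t,t+\theta) = PA_{S0}(t)\, e^{-\lambda_S \theta}, \tag{6.66}$$

with the asymptotic & steady-state value

$$IR_S(\theta) = PA_S\, e^{-\lambda_S \theta}, \tag{6.67}$$

where $\lambda_S = \sum_{i=1}^{n} \lambda_i$.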
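6.3.2 Series Structure with Constant Failure Rate and Arbitrary Repair Rate for Each Element

Generalization of the repair time distribution functions $G_i(x)$, with densities $g_i(x)$ and $G_i(0) = 0$, leads to a semi-Markov process with state space $Z_0, \ldots, Z_n$, as in Fig. 6.5 (this because of Assumption (6.2) of no further failures at system down). The reliability function and the mean time to failure are still given by Eqs. (6.60) and (6.61). For the point availability let us first calculate the semi-Markov transition probabilities $Q_{ij}(x)$ using Table 6.2

$$Q_{0i}(x) = \int_0^x \lambda_i\, e^{-\lambda_S y}\, dy = \frac{\lambda_i}{\lambda_S} \left( 1 - e^{-\lambda_S x} \right), \qquad Q_{i0}(x) = G_i(x), \quad i = 1, \ldots, n. \tag{6.68}$$

The system of integral equations for the transition probabilities (conditional state probabilities) $P_{ij}(t)$ follows then from Table 6.2

$$P_{00}(t) = e^{-\lambda_S t} + \sum_{i=1}^{n} \int_0^t \lambda_i\, e^{-\lambda_S x}\, P_{i0}(t-x)\, dx, \qquad P_{i0}(t) = \int_0^t g_i(x)\, P_{00}(t-x)\, dx, \quad i = 1, \ldots, n. \tag{6.69}$$

For the Laplace transform of the point availability $PA_{S0}(t) = P_{00}(t)$ one obtains finally from Eq. (6.69)

$$\tilde{PA}_{S0}(s) = \frac{1}{s + \lambda_S - \sum_{i=1}^{n} \lambda_i\, \tilde{g}_i(s)}, \tag{6.70}$$

from which the asymptotic & steady-state value of the point and average availability follows as

$$PA_S = AA_S = \frac{1}{1 + \sum_{i=1}^{n} \lambda_i\, MTTR_i}, \tag{6.71}$$

with $\lim_{s \to 0} (1 - \tilde{g}(s)) = s\, MTTR$, as per Eq. (6.54), and

$$MTTR_i = \int_0^{\infty} (1 - G_i(t))\, dt. \tag{6.72}$$

The interval reliability can be calculated either from Eq. (6.66) with $PA_{S0}(t)$ from Eq. (6.70) or from Eq. (6.67) with $PA_S$ from Eq. (6.71).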
Example 6.5
A system consists of elements $E_1$ to $E_4$ which are necessary for the fulfillment of the required function (series structure). Let the failure rates $\lambda_1 = 10^{-3}\,h^{-1}$, $\lambda_2 = 0.5 \cdot 10^{-3}\,h^{-1}$, $\lambda_3 = 10^{-4}\,h^{-1}$, $\lambda_4 = 2 \cdot 10^{-3}\,h^{-1}$ be constant and assume that the repair time of all elements is lognormally distributed with parameters $\lambda = 0.5\,h^{-1}$ and σ = 0.6. The system has only one repair crew and no further failure can occur at system down (failures during repair are neglected). Give the reliability function for a mission of duration t = 168 h, the mean time to failure, the asymptotic & stationary values of the point and average availability, and the asymptotic & stationary value of the interval reliability for θ = 12 h.

Solution
The system failure rate is $\lambda_S = \lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 = 36 \cdot 10^{-4}\,h^{-1}$ according to Eq. (6.60). The reliability function follows as $R_{S0}(t) = e^{-0.0036\,t}$, from which $R_{S0}(168\,h) \approx 0.55$. The mean time to failure is $MTTF_{S0} = 1/\lambda_S \approx 278$ h. The mean time to repair is obtained from Table A6.2 as $E[\tau'] = e^{\sigma^2/2}/\lambda = MTTR \approx 2.4$ h. For the asymptotic & steady-state values of the point and average availability as well as for the interval reliability for θ = 12 h it follows from Eqs. (6.71) and (6.67) that $PA_S = AA_S = 1/(1 + 36 \cdot 10^{-4} \cdot 2.4) \approx 0.991$ and $IR_S(12) \approx 0.991 \cdot e^{-0.0036 \cdot 12} \approx 0.95$.
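The numerical values of Example 6.5 can be reproduced with a few lines of code. The sketch below uses only the formulas quoted in the example (system failure rate as the sum of the element failure rates, lognormal mean $e^{\sigma^2/2}/\lambda$, $PA_S = 1/(1 + \lambda_S\, MTTR)$, and $IR_S(\theta) = PA_S\, e^{-\lambda_S \theta}$); the variable names are illustrative.

```python
import math

failure_rates = [1e-3, 0.5e-3, 1e-4, 2e-3]   # lambda_1 ... lambda_4 in 1/h
lam_repair, sigma = 0.5, 0.6                 # lognormal repair time parameters

lam_s = sum(failure_rates)                   # system failure rate, 36e-4 1/h
r_168 = math.exp(-lam_s * 168)               # reliability for a 168 h mission
mttf_s0 = 1.0 / lam_s                        # mean time to failure in h
mttr = math.exp(sigma ** 2 / 2) / lam_repair # mean of the lognormal repair time in h
pa_s = 1.0 / (1.0 + lam_s * mttr)            # steady-state point/average availability
ir_12 = pa_s * math.exp(-lam_s * 12)         # steady-state interval reliability, theta = 12 h

print(f"lambda_S = {lam_s:.4f} 1/h, R_S0(168 h) = {r_168:.2f}")
print(f"MTTF_S0 = {mttf_s0:.0f} h, MTTR = {mttr:.1f} h")
print(f"PA_S = AA_S = {pa_s:.3f}, IR_S(12 h) = {ir_12:.2f}")
```

All mean repair times are equal here, so the sum $\sum_i \lambda_i\, MTTR_i$ in Eq. (6.71) collapses to $\lambda_S\, MTTR$.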
6.3.3 Series Structure with Arbitrary Failure and Repair Rates for Each Element

Generalization of repair and failure-free time distribution functions leads to a nonregenerative stochastic process. This model can be investigated using supplementary variables, or by approximating the distribution functions of the failure-free time in such a way that the involved stochastic process can be reduced to a regenerative process. Using for the approximation an Erlang distribution function leads to a semi-Markov process. As an example, let us consider the case of a two-element series structure ($E_1$, $E_2$) and assume that the repair times are arbitrary, with densities $g_1(x)$ and $g_2(x)$, and the failure-free times have densities

$$f_1(x) = \lambda_1^2\, x\, e^{-\lambda_1 x} \tag{6.73}$$

and

$$f_2(x) = \lambda_2\, e^{-\lambda_2 x}. \tag{6.74}$$

Equation (6.73) is the density of the sum of two exponentially distributed random time intervals with density $\lambda_1 e^{-\lambda_1 x}$. Under these assumptions, the two-element series structure corresponds to a 1-out-of-2 standby redundancy with constant failure rate $\lambda_1$, in series with an element with constant failure rate $\lambda_2$. Figure 6.6 gives the equivalent reliability block diagram and the corresponding state transition diagram. This diagram only visualizes the possible transitions and can not be
considered as a diagram of the transition probabilities in (t, t+δt]. $Z_0$ is the system up state, $Z_{1'}$ and $Z_{2'}$ are supplementary states necessary for calculation only. For the semi-Markov transition probabilities $Q_{ij}(x)$ one obtains (see Table 6.2)

From Eq. (6.75) it follows that (Table 6.2 and Eq. (6.54))
Figure 6.6 Equivalent reliability block diagram and state transition diagram for a two series element system ($E_1$ and $E_2$) with arbitrarily distributed repair times, constant failure rate for $E_2$, and Erlangian (n = 2) distributed failure-free time for $E_1$ (only one repair crew, ideal failure recognition & switch, no further failures at system down, 5-state semi-Markov process)
(6.79)

(6.80)

The interval reliability $IR_{S0}(t, t+\theta)$ can be obtained from

$$IR_{S0}(t,t+\theta) = P_{00}(t)\, R_{S0}(\theta) + P_{01'}(t)\, R_{S1'}(\theta),$$

with $R_{S1'}(\theta) = e^{-(\lambda_1 + \lambda_2)\theta}$, because of the constant failure rates $\lambda_1$ and $\lambda_2$. Important results for repairable series structures are summarized in Table 6.5.
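Table 6.5 Results for a repairable system without redundancy (elements $E_1, \ldots, E_n$ in series), one repair crew, ideal failure recognition & switch, no further failures at system down

1. Reliability function ($R_{S0}(t)$)
   Expression: $\prod_{i=1}^{n} R_i(t)$
   Remarks, assumptions: independent elements (independent elements at least up to system failure)

2. Mean time to system failure ($MTTF_{S0}$)
   Expression: $\int_0^{\infty} R_{S0}(t)\, dt$
   Remarks, assumptions: $R_i(t) = e^{-\lambda_i t}$ gives $R_{S0}(t) = e^{-\lambda_S t}$ and $MTTF_{S0} = 1/\lambda_S$ with $\lambda_S = \lambda_1 + \ldots + \lambda_n$

3. System failure rate up to system failure ($\lambda_S(t)$)
   Expression: $\sum_{i=1}^{n} \lambda_i(t)$
   Remarks, assumptions: independent elements (independent elements at least up to system failure)

4. Asymptotic & steady-state value of the point availability & average availability ($PA_S = AA_S$); at system down, no further failures can occur
   a) Constant failure rate $\lambda_i$ and constant repair rate $\mu_i$ for each element (i = 1, ..., n): $\frac{1}{1 + \sum_{i=1}^{n} \lambda_i/\mu_i} \approx 1 - \sum_{i=1}^{n} \lambda_i/\mu_i$
   b) Constant failure rate $\lambda_i$ and arbitrary repair rate $\mu_i(t)$ with $MTTR_i$ = mean time to repair for each element (i = 1, ..., n): $\frac{1}{1 + \sum_{i=1}^{n} \lambda_i\, MTTR_i}$
   c) 2-element series structure with failure rates $\lambda_1^2 t/(1 + \lambda_1 t)$ for $E_1$ and $\lambda_2$ for $E_2$: $\frac{2}{2 + \lambda_1 MTTR_1 + 2 \lambda_2 MTTR_2}$

5. Asymptotic & steady-state value of the interval reliability ($IR_S(\theta)$) *)
   Expression: $PA_S\, e^{-\lambda_S \theta}$
   Remarks, assumptions: each element has constant failure rate $\lambda_i$, $\lambda_S = \lambda_1 + \ldots + \lambda_n$

*) Supplementary results: If n repair crews were available, $PA_S = \prod_i (1/(1 + \lambda_i/\mu_i)) \approx 1 - \sum_i \lambda_i/\mu_i$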
6.4 1-out-of-2 Redundancy
The 1-out-of-2 redundancy, also known as 1-out-of-2: G, is the simplest redundant structure arising in practical applications. It consists of two elements $E_1$ and $E_2$, one of which is in the operating state and the other in reserve. When a failure occurs, one element is repaired while the other continues the operation. The system is down when an element fails while the other one is being repaired. Assuming ideal switching and failure recognition, the reliability block diagram is a parallel connection of elements $E_1$ and $E_2$, see Fig. 6.7. Investigations are based on the assumptions (6.1) to (6.7). This implies in particular that the repair of a redundant element begins immediately on failure occurrence and is performed without interruption of operation at system level. The distribution functions of the repair times and of the failure-free times are generalized step by step, beginning with the exponential distribution (memoryless), up to the case in which the process involved has only one regeneration state (Section 6.4.3). Influence of switching and incomplete coverage is considered in Sections 6.8.3 and 6.8.4, preventive maintenance in Section 6.8.2, common cause failures in Section 6.8.7.
6.4.1 1-out-of-2 Redundancy with Constant Failure and Repair Rates for Each Element

Because of the constant failure and repair rates, the time behavior of the 1-out-of-2 redundancy can be described by a time-homogeneous Markov process. The number of states is 3 if elements $E_1$ and $E_2$ are identical (Fig. 6.8) and 5 if they are different (Fig. 6.9); the corresponding diagrams of transition probabilities in (t, t+δt] are also given in Fig. A7.4. Let us consider first the case of identical elements $E_1$ and $E_2$ (see Example 6.6 for different elements) and assume as distribution function of the failure-free time

$$F(x) = 1 - e^{-\lambda x}, \qquad x \ge 0, \tag{6.81}$$

in the operating state and

$$F_r(x) = 1 - e^{-\lambda_r x}, \qquad x \ge 0, \tag{6.82}$$
Figure 6.7 1-out-of-2 redundancy reliability block diagram (ideal failure recognition and switch)
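in the reserve state. This includes active (parallel) redundancy for $\lambda_r = \lambda$, warm redundancy for $\lambda_r < \lambda$, and standby redundancy for $\lambda_r = 0$. Repair time is assumed to be distributed (independently of $\lambda_r$) according to

$$G(x) = 1 - e^{-\mu x}, \qquad x \ge 0. \tag{6.83}$$

For the investigation of more general situations (arbitrary load sharing, more than one repair crew, or other cases in which failure and/or repair rates change at a state transition) one can use the birth and death process introduced in Appendix A7.5.5. For all these cases, investigations are generally performed using the method of differential equations (Table 6.2 and Appendix A7.5.3.1). Figure 6.8 gives the diagram of transition probabilities in (t, t+δt] for the point availability (Fig. 6.8a) and the reliability function (Fig. 6.8b), respectively. Considering the system behavior at times t and t+δt, the following difference equations can be established for the state probabilities $P_0(t)$, $P_1(t)$, and $P_2(t)$ according to Fig. 6.8a, where $P_i(t) = \Pr\{\text{process in } Z_i \text{ at } t\}$, i = 0, 1, 2.

$$\begin{aligned}
P_0(t+\delta t) &= P_0(t)\,(1 - (\lambda + \lambda_r)\,\delta t) + P_1(t)\,\mu\,\delta t \\
P_1(t+\delta t) &= P_1(t)\,(1 - (\lambda + \mu)\,\delta t) + P_0(t)\,(\lambda + \lambda_r)\,\delta t + P_2(t)\,\mu\,\delta t \\
P_2(t+\delta t) &= P_2(t)\,(1 - \mu\,\delta t) + P_1(t)\,\lambda\,\delta t.
\end{aligned} \tag{6.84}$$

For δt ↓ 0, it follows that

$$\begin{aligned}
\dot{P}_0(t) &= -(\lambda + \lambda_r)\, P_0(t) + \mu\, P_1(t) \\
\dot{P}_1(t) &= -(\lambda + \mu)\, P_1(t) + (\lambda + \lambda_r)\, P_0(t) + \mu\, P_2(t) \\
\dot{P}_2(t) &= -\mu\, P_2(t) + \lambda\, P_1(t).
\end{aligned} \tag{6.85}$$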
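The system of differential equations (6.85) can also be obtained directly from Table 6.2 and Fig. 6.8a. Its solution leads to the state probabilities $P_i(t)$, i = 0, 1, 2. Assuming as initial conditions at t = 0, $P_0(0) = 1$ and $P_1(0) = P_2(0) = 0$, the above state probabilities are identical to the transition probabilities $P_{0i}(t)$, i = 0, 1, 2, i.e. $P_{00}(t) = P_0(t)$, $P_{01}(t) = P_1(t)$, and $P_{02}(t) = P_2(t)$. The point availability $PA_{S0}(t)$ is then given by (see Table 6.2 for notation)

$$PA_{S0}(t) = P_{00}(t) + P_{01}(t). \tag{6.86}$$

$PA_{S1}(t)$ or $PA_{S2}(t)$ could have been determined for suitable initial conditions. From Eq. (6.86) it follows for the Laplace transform of $PA_{S0}(t)$ that

$$\tilde{PA}_{S0}(s) = \frac{(s + \mu)(s + \lambda + \lambda_r + \mu) + s\lambda}{s\,[(s + \lambda + \lambda_r)(s + \lambda + \mu) + \mu (s + \mu)]}, \tag{6.87}$$

and thus for t → ∞

$$\lim_{t\to\infty} PA_{S0}(t) = PA_S = \frac{\mu\,(\lambda + \lambda_r + \mu)}{(\lambda + \lambda_r)(\lambda + \mu) + \mu^2}. \tag{6.88}$$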
Figure 6.8 Diagram of the transition probabilities in (t, t+δt] for a repairable 1-out-of-2 warm redundancy (Fig. 6.7, two identical elements, constant failure (λ, $\lambda_r$) and repair (μ) rates, one repair crew, arbitrary t, δt ↓ 0, Markov proc.): a) For the point availability; b) For the reliability function
If $PA_{S0}(t) = PA_S$ for all t ≥ 0, then $PA_S$ is also the value of the point and average availability in the steady-state. Because of $P_0 + P_1 + P_2 = PA_S + P_2 = 1$ it follows that $P_2 = 1 - PA_S = \lambda(\lambda + \lambda_r)/((\lambda + \lambda_r)(\lambda + \mu) + \mu^2)$, with $P_i = \lim_{t\to\infty} P_i(t)$, i = 0, 1, 2. Further important results for a 1-out-of-2 redundancy are in Sections 6.8.3 (imperfect switching), 6.8.4 (incomplete coverage), and 6.8.7 (common cause failures). To calculate the reliability function (by the method of differential equations) it is necessary to consider that the 1-out-of-2 redundancy will operate failure free in (0, t] only if in this time interval the down state at system level (state $Z_2$ in Fig. 6.8) will not be visited. To recognize if the state $Z_2$ has been entered before t it is sufficient to make $Z_2$ absorbing (Fig. 6.8b). In this case, if $Z_2$ is entered the process remains there indefinitely, thus the probability of being in $Z_2$ at t is the probability of having entered $Z_2$ before the time t, i.e. the unreliability $1 - R_S(t)$. To avoid ambiguities, the state probabilities in Fig. 6.8b will be marked by an apostrophe (prime). The procedure is similar to that for Eq. (6.85) and leads to the following system of differential equations

$$\begin{aligned}
\dot{P}'_0(t) &= -(\lambda + \lambda_r)\, P'_0(t) + \mu\, P'_1(t) \\
\dot{P}'_1(t) &= -(\lambda + \mu)\, P'_1(t) + (\lambda + \lambda_r)\, P'_0(t) \\
\dot{P}'_2(t) &= \lambda\, P'_1(t),
\end{aligned} \tag{6.89}$$

and to the corresponding state probabilities $P'_0(t)$, $P'_1(t)$, and $P'_2(t)$. With the initial conditions at t = 0, $P'_0(0) = 1$ and $P'_1(0) = P'_2(0) = 0$, the state probabilities $P'_0(t)$, $P'_1(t)$, and $P'_2(t)$ are identical to the transition probabilities $P'_{00}(t) = P'_0(t)$, $P'_{01}(t) = P'_1(t)$, and $P'_{02}(t) = P'_2(t)$. The reliability function is then given by (see Table 6.2 for notation)

$$R_{S0}(t) = P'_{00}(t) + P'_{01}(t). \tag{6.90}$$

With the initial condition $P'_1(0) = 1$, $R_{S1}(t)$ would have been obtained. Eq. (6.90) yields the following Laplace transform for $R_{S0}(t)$

$$\tilde{R}_{S0}(s) = \frac{s + 2\lambda + \lambda_r + \mu}{(s + \lambda + \lambda_r)(s + \lambda) + s\mu}, \tag{6.91}$$
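from which the mean time to failure ($MTTF_{S0} = \tilde{R}_{S0}(0)$, Eq. (2.61)) follows as

$$MTTF_{S0} = \frac{2\lambda + \lambda_r + \mu}{\lambda\,(\lambda + \lambda_r)}. \tag{6.92}$$

Important for practical applications is the situation for λ, $\lambda_r$ << μ. For active redundancy ($\lambda_r = \lambda$), Eq. (6.91) becomes

$$\tilde{R}_{S0}(s) = \frac{s + 3\lambda + \mu}{(s + 2\lambda)(s + \lambda) + s\mu} = \frac{s + 3\lambda + \mu}{(s - r_1)(s - r_2)},$$

with

$$r_{1,2} = \frac{-(3\lambda + \mu) \pm \sqrt{(3\lambda + \mu)^2 - 8\lambda^2}}{2},$$

and thus (Table A9.7b) $R_{S0}(t) = (r_2\, e^{r_1 t} - r_1\, e^{r_2 t})/(r_2 - r_1)$. For λ << μ, it follows that $r_1 \approx 0$ and $r_2 \approx -\mu$, yielding

$$R_{S0}(t) \approx e^{r_1 t}. \tag{6.93}$$

Using $\sqrt{1 - \varepsilon} \approx 1 - \varepsilon/2$ for $2 r_1 = -(3\lambda + \mu)\,(1 - \sqrt{1 - 8\lambda^2/(3\lambda + \mu)^2})$ leads to $r_1 \approx -2\lambda^2/(3\lambda + \mu)$. $R_{S0}(t)$ can thus be approximated by a decreasing exponential function with time constant $MTTF_{S0} \approx (3\lambda + \mu)/2\lambda^2$. +) This important result shows that: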
For λ << μ, a repairable 1-out-of-2 active redundancy with constant failure rate λ and constant repair rate μ behaves approximately like a one-item structure with constant failure rate $\lambda_S \approx 2\lambda^2/(3\lambda + \mu)$; an equivalent repair rate $\mu_S$ for the one-item structure can be obtained by comparing the equations for the steady-state point availability and leads to $\mu_S \approx \lambda_S/(1 - PA_S) \approx \mu$, with $PA_S$ from Eq. (6.88) (see also Table 6.10).
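The quality of this one-item approximation can be checked numerically. The following sketch compares the two-exponential form $R_{S0}(t) = (r_2 e^{r_1 t} - r_1 e^{r_2 t})/(r_2 - r_1)$ with the approximation $e^{-\lambda_S t}$, $\lambda_S = 2\lambda^2/(3\lambda+\mu)$, for active redundancy. Function names and numerical rates are illustrative assumptions; $r_1$ is computed from the product of the roots, $r_1 r_2 = 2\lambda^2$, to avoid cancellation for λ << μ.

```python
import math


def reliability_exact(lam, mu, t):
    """R_S0(t) of a repairable 1-out-of-2 active redundancy (identical elements)."""
    s = 3 * lam + mu
    r2 = -(s + math.sqrt(s * s - 8 * lam * lam)) / 2   # root close to -mu
    r1 = 2 * lam * lam / r2                            # root close to 0 (r1 * r2 = 2 lam^2)
    return (r2 * math.exp(r1 * t) - r1 * math.exp(r2 * t)) / (r2 - r1)


def reliability_one_item(lam, mu, t):
    """One-item approximation with lambda_S = 2 lam^2 / (3 lam + mu)."""
    return math.exp(-2 * lam * lam / (3 * lam + mu) * t)


lam, mu = 1e-3, 0.5   # per hour
for t in (10.0, 1_000.0, 100_000.0):
    print(t, reliability_exact(lam, mu, t), reliability_one_item(lam, mu, t))
```

For these rates the two curves differ by less than $10^{-5}$ over the whole range, and the time constant $(3\lambda+\mu)/2\lambda^2$ is about 251 500 h compared with 1 500 h for the nonrepairable active redundancy.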
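Extension of the above result to warm redundancy ($\lambda_r < \lambda$) leads to

$$R_{S0}(t) \approx e^{-\lambda_S t}, \qquad \text{with } \lambda_S \approx \frac{\lambda\,(\lambda + \lambda_r)}{2\lambda + \lambda_r + \mu}.$$

As in all these considerations, $\lambda_r = \lambda$ yields active and $\lambda_r = 0$ standby redundancy. Because of the memoryless property of the time-homogeneous Markov process, the interval reliability follows directly from the transition probabilities $P_{ij}(t)$ and the reliability functions $R_{Si}(t)$, see Table 6.2. Assuming $P_0(0) = 1$ yields

$$IR_{S0}(t,t+\theta) = P_{00}(t)\, R_{S0}(\theta) + P_{01}(t)\, R_{S1}(\theta).$$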
+) More exactly: $R_{S0}(t) = e^{r_1 t}/(1 - r_1/r_2) - e^{r_2 t}/(r_2/r_1 - 1) \approx e^{-2\lambda^2 t/(3\lambda+\mu)}\,(1 + 2\lambda^2/\mu^2) - e^{-\mu t}\, 2\lambda^2/\mu^2$.
The Laplace transform of $IR_{S0}(t, t+\theta)$ is then given by
which leads to the following asymptotic & steady-state value (Table 6.2)
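To compare the effectiveness of calculation methods, let us now express the reliability function, point availability, and interval reliability using the method of integral equations (Appendix A7.5.3.2). The $Q_{ij}(x)$ are given according to Eq. (A7.102) and Fig. 6.8a by

$$Q_{01}(x) = 1 - e^{-(\lambda + \lambda_r) x}, \qquad Q_{10}(x) = \frac{\mu}{\lambda + \mu} \left( 1 - e^{-(\lambda + \mu) x} \right), \qquad Q_{12}(x) = \frac{\lambda}{\lambda + \mu} \left( 1 - e^{-(\lambda + \mu) x} \right), \qquad Q_{21}(x) = \Pr\{\tau_{21} \le x\} = 1 - e^{-\mu x},$$

and thus (Table 6.2)

$$R_{S0}(t) = e^{-(\lambda + \lambda_r) t} + \int_0^t (\lambda + \lambda_r)\, e^{-(\lambda + \lambda_r) x}\, R_{S1}(t-x)\, dx, \qquad R_{S1}(t) = e^{-(\lambda + \mu) t} + \int_0^t \mu\, e^{-(\lambda + \mu) x}\, R_{S0}(t-x)\, dx, \tag{6.96}$$

for the reliability functions $R_{S0}(t)$ and $R_{S1}(t)$, as well as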
and
for the transition probabilities. The solution of Eqs. (6.96) and (6.97) yields Eqs. (6.87), (6.91), and (6.94). Equations (6.96) and (6.97) show how the use of integral equations leads to a quicker solution than differential equations for arbitrary initial conditions at t = 0. Table 6.6 summarizes the main results of Section 6.4.1. It gives approximate expressions valid for λ << μ and distinguishes between the cases of active redundancy ($\lambda_r = \lambda$), warm redundancy ($\lambda_r < \lambda$), and standby redundancy ($\lambda_r = 0$). From Table 6.6, the improvement in $MTTF_{S0}$ through repair, without interruption of operation at system level, is given as lower and upper bounds by
active
standby
Investigation of the unavailability in steady-state 1- PAS leads to
1- PAs = 1- AAS
-
active C1
MTBF
standby p
MTBF
The above results can easily be extended to cover situations in which failure or repair rates are modified at state changes (e.g. because of load sharing, differences within the element, repair priority, etc.). These cases simply modify the transition rates on the diagram of transition probabilities in (t, t+δt], see for example Figs. 2.12 and A7.4 - A7.6.

Example 6.6
Give the mean time to failure $MTTF_{S0}$ and the asymptotic & steady-state value of the point availability $PA_S$ for a 1-out-of-2 active redundancy with two different elements $E_1$ and $E_2$, constant failure rates $\lambda_1$, $\lambda_2$, and constant repair rates $\mu_1$, $\mu_2$ (one repair crew).
Table 6.6 Reliability function $R_{S0}(t)$, mean time to failure $MTTF_{S0}$, steady-state availability $PA_S = AA_S$, and interval reliability $IR_S(\theta)$ for a repairable 1-out-of-2 redundancy with identical elements (Fig. 6.7, constant failure rates λ, $\lambda_r$ & repair rate μ (λ, $\lambda_r$ << μ), one repair crew, Markov proc.)

* new at t = 0, ** asymptotic & steady-state value (for practical applications with λ/μ > 0.01, convergence of $PA_S(t)$ to $PA_S$ is good after t ≈ 10/μ, see also p. 181)
Supplementary results: See Table 6.9 for the case with two repair crews; assuming in Fig. 6.8a $Z_2 \to Z_0$ with $\mu_g$ instead of $Z_2 \to Z_1$ with μ yields $PA_S = AA_S \approx 1 - 2\lambda^2/(\mu\,\mu_g)$ (active red.)
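Solution
Figure 6.9 gives the reliability block diagram and the diagram of transition probabilities in (t, t+δt]. $MTTF_{S0}$ and $PA_S$ can be calculated from appropriate systems of algebraic equations. According to Table 6.2 and considering Fig. 6.9 it follows for the mean time to failure that

$$MTTF_{S0} = \frac{(\lambda_1 + \mu_2)(\lambda_2 + \mu_1) + \lambda_1 (\lambda_1 + \mu_2) + \lambda_2 (\lambda_2 + \mu_1)}{\lambda_1 \lambda_2\, (\lambda_1 + \lambda_2 + \mu_1 + \mu_2)}, \tag{6.98}$$

and in particular for $\lambda_1 << \mu_1$ and $\lambda_2 << \mu_2$,

$$MTTF_{S0} \approx \frac{\mu_1 \mu_2}{\lambda_1 \lambda_2\, (\mu_1 + \mu_2)}.$$

As for Eq. (6.93), the reliability function can be expressed by

$$R_{S0}(t) \approx e^{-t/MTTF_{S0}}.$$

For the asymptotic & steady-state value of the point availability and average availability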
PAs = Po + P, + P2 holds with Po, ( 4 + h2) Po = P,
4, and P2 as solution of (Table 6.2)
P, + P2 P2
(L2 PI)^ =
Po + P2 P4 (L, + P2) P2 = L2 PO+ P, q P14 =L24 (1.2P4 =h1P2. One (arbitrarily chosen) of the five equations must be dropped and replaced by Po + 4 + P2 + P3 + P4 = 1. The solution yields Po through P4, from which
(6.101)
Equation (6.101) can also be written in the form

With $\lambda_1 = \lambda_2 = \lambda$ and $\mu_1 = \mu_2 = \mu$, Eqs. (6.98) and (6.101) become Eqs. (6.92) and (6.88), respectively (with $\lambda_r = \lambda$).
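A numerical cross-check of Example 6.6 can be sketched as follows. The function name and the hand elimination of $P_3$, $P_4$ are choices made here for illustration; the state numbering ($Z_1$: $E_1$ under repair, $Z_2$: $E_2$ under repair, $Z_3$, $Z_4$: both elements failed) is assumed from the balance equations of the example, and the MTTF expression is obtained by treating $Z_3$ and $Z_4$ as absorbing.

```python
def two_element_redundancy(lam1, lam2, mu1, mu2):
    """Return (MTTF_S0, PA_S) for a 1-out-of-2 active redundancy with two
    different elements, constant rates, and one repair crew (5-state model)."""
    a1 = lam2 + mu1  # total exit rate of Z1 (E1 in repair, E2 operating)
    a2 = lam1 + mu2  # total exit rate of Z2 (E2 in repair, E1 operating)

    # Mean time to failure (Z3, Z4 absorbing):
    #   M0 = 1/(lam1+lam2) + (lam1*M1 + lam2*M2)/(lam1+lam2)
    #   M1 = (1 + mu1*M0)/a1,  M2 = (1 + mu2*M0)/a2
    mttf = (1 + lam1 / a1 + lam2 / a2) / (lam1 * lam2 * (1 / a1 + 1 / a2))

    # Steady state, scaled so that P0 = 1; P3 and P4 are eliminated with
    # mu1*P3 = lam2*P1 and mu2*P4 = lam1*P2, leaving a 2x2 system for P1, P2.
    det = a1 * a2 - lam1 * lam2
    p1 = lam1 * (a2 + lam2) / det
    p2 = lam2 * (a1 + lam1) / det
    p3 = lam2 * p1 / mu1
    p4 = lam1 * p2 / mu2
    pa = (1 + p1 + p2) / (1 + p1 + p2 + p3 + p4)
    return mttf, pa


if __name__ == "__main__":
    print(two_element_redundancy(1e-3, 2e-3, 0.5, 1.0))
```

For identical elements the result should coincide with the known expressions $(3\lambda+\mu)/(2\lambda^2)$ and $\mu(2\lambda+\mu)/(2\lambda(\lambda+\mu)+\mu^2)$, which makes a convenient sanity check.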
Figure 6.9 Reliability block diagram and diagram of transition probabilities in $(t, t+\delta t]$ for a repairable 1-out-of-2 active redundancy with different elements (ideal failure recognition and switch, constant failure and repair rates $\lambda_1$, $\lambda_2$, $\mu_1$, and $\mu_2$, one repair crew, arbitrary $t$, $\delta t \downarrow 0$, Markov process)
6.4.2 1-out-of-2 Redundancy with Constant Failure Rate and Arbitrary Repair Rate

Consider now a 1-out-of-2 warm redundancy with 2 identical elements $E_1$ and $E_2$, failure-free times distributed according to Eqs. (6.81) and (6.82), and repair time with mean MTTR, distributed according to an arbitrary distribution function $G(x)$ with $G(0) = 0$ and density $g(x)$. The time behavior of this system can be described by a process with states $Z_0$, $Z_1$, and $Z_2$. Because of the arbitrary repair rate, only states $Z_0$ and $Z_1$ are regeneration states. These states constitute a semi-Markov process embedded in the original semi-regenerative process (Fig. A7.11). The semi-Markov transition probabilities $Q_{ij}(x)$ are given by Eq. (A7.183). Setting these quantities in the equations of Table 6.2 (SMP), by considering $Q_0(x) = Q_{01}(x)$ and $Q_1(x) = Q_{10}(x) + Q'_{12}(x)$ with $Q'_{12}(x)$ as per Eq. (A7.184), it follows for the reliability functions $R_{Si}(t)$
and for the transition probabilities Pij ( t ) of the embedded semi-Markov process
The solution of Eq. (6.104) leads in particular to
and (with $MTTF_{S0} = \tilde R_{S0}(0)$, Eq. (2.61))
The Laplace transform of the point availability $PA_{S0}(t) = P_{00}(t) + P_{01}(t)$ follows as a solution of Eq. (6.105)

$$\tilde{PA}_{S0}(s) = \frac{(s+\lambda)(1-\tilde g(s)) + \lambda_r (1-\tilde g(s+\lambda)) + \lambda + s\,\tilde g(s+\lambda)}{(s+\lambda)\,[(s+\lambda+\lambda_r)(1-\tilde g(s)) + s\,\tilde g(s+\lambda)]} \qquad (6.108)$$
and leads to the asymptotic & steady-state value of the point availability $PA_S$ and average availability $AA_S$ (considering $PA_S = \lim_{s \to 0} s\,\tilde{PA}_{S0}(s)$ and $1 - \tilde g(s) = s \cdot MTTR + o(s)$ for $s \to 0$, as per Eq. (6.54))

$$PA_S = AA_S = \frac{\lambda + \lambda_r\,(1-\tilde g(\lambda))}{\lambda(\lambda+\lambda_r)\,MTTR + \lambda\,\tilde g(\lambda)}, \qquad (6.109)$$

where

$$MTTR = \int_0^\infty (1 - G(x))\,dx \qquad (6.110)$$

and $\tilde g(\lambda)$ is the Laplace transform of the density $g(t)$ for $s = \lambda$. See Eq. (6.88) for $g(t) = \mu e^{-\mu t}$, i.e. $\tilde g(\lambda) = \mu/(\lambda+\mu)$, and Examples 6.7 & 6.8 for the approximation of $\tilde g(\lambda)$.

Calculation of the interval reliability is difficult because state $Z_1$ is regenerative only at its occurrence point (Fig. A7.11). However, for $\lambda\,MTTR \ll 1$, $\tilde g(\lambda) \to 1$ and the asymptotic value of the state probability for $Z_1$ ($P_1 = \lim_{t\to\infty} P_{01}(t)$) becomes very small with respect to the state probability for $Z_0$ ($P_0 = \lim_{t\to\infty} P_{00}(t)$). For the asymptotic & steady-state value of the interval reliability it holds then that

In many practical applications, it holds that $\lambda\,MTTR < 0.01$. In such cases, Eq. (6.111) can be further simplified to
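Example 6.7 Let the density $g(x)$ of the repair time $\tau'$ of a system with constant failure rate $\lambda > 0$ be continuous and assume furthermore that $\lambda\,E[\tau'] = \lambda\,MTTR \ll 1$ and $\lambda \sqrt{Var[\tau']} \ll 1$. Investigate the quantity $\tilde g(\lambda)$ for $\lambda \to 0$.

Solution For $\lambda \to 0$, $\lambda\,MTTR \ll 1$, and $\lambda \sqrt{Var[\tau']} \ll 1$, the three first terms of the series expansion of $e^{-\lambda t}$ lead to

$$\tilde g(\lambda) = \int_0^\infty g(t)\,e^{-\lambda t}\,dt \approx \int_0^\infty g(t)\,\Big(1 - \lambda t + \frac{(\lambda t)^2}{2}\Big)\,dt = 1 - \lambda\,E[\tau'] + \frac{\lambda^2}{2}\,E[\tau'^2].$$

From this, follows the approximate expression

$$\tilde g(\lambda) \approx 1 - \lambda\,MTTR + \frac{\lambda^2}{2}\,\big(MTTR^2 + Var[\tau']\big).$$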
In many practical applications, $\tilde g(\lambda) \approx 1 - \lambda\,MTTR$ is a sufficiently good approximation, however not in calculating steady-state availability (Eq. (6.114) would give for Eq. (6.109) $PA_S = 1$, thus Eq. (6.113) has to be used).

Supplementary results: Assuming $g(x) = \mu e^{-\mu x}$ it follows $\tilde g(\lambda) = \mu/(\lambda+\mu) \approx 1 - \lambda/\mu = 1 - \lambda\,MTTR$.
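Example 6.8 In a 1-out-of-2 warm redundancy with identical elements $E_1$ and $E_2$ let the failure rates $\lambda$ in the operating state and $\lambda_r$ in the reserve state be constant. For the repair time let us assume that it is distributed according to $G(x) = 1 - e^{-\mu'(x-\psi)}$ for $x \ge \psi$ and $G(x) = 0$ for $x < \psi$, with $MTTR = 1/\mu > \psi$. Assuming $\lambda\psi \ll 1$, investigate the influence of $\psi$ on the mean time to failure $MTTF_{S0}$ and on the asymptotic & steady-state value of the point availability $PA_S$.

Solution With

$$\tilde g(\lambda) = \int_\psi^\infty \mu'\,e^{-\lambda t}\,e^{-\mu'(t-\psi)}\,dt = \frac{\mu'\,e^{-\lambda\psi}}{\lambda+\mu'} \approx \frac{\mu'\,(1-\lambda\psi)}{\lambda+\mu'}$$

and considering that

$$MTTR = \int_0^\infty t\,g(t)\,dt = \int_\psi^\infty t\,\mu' e^{-\mu'(t-\psi)}\,dt = \psi + \frac{1}{\mu'} = \frac{1}{\mu},$$

i.e. $\mu' = \mu/(1-\mu\psi)$ and thus $\tilde g(\lambda) \approx \mu(1-\lambda\psi)/(\lambda + \mu(1-\lambda\psi))$, Eq. (6.107) (left-hand equality) and Eq. (6.109) lead to the approximate expressions

$$MTTF_{S0,\,\psi>0} \approx \frac{2\lambda + \lambda_r + \mu(1-\lambda\psi)}{\lambda(\lambda+\lambda_r)}$$

and

$$PA_{S,\,\psi>0} \approx \frac{\mu\,(\lambda+\lambda_r+\mu(1-\lambda\psi))}{(\lambda+\lambda_r)(\lambda+\mu(1-\lambda\psi)) + \mu^2(1-\lambda\psi)}.$$

On the other hand, $\psi = 0$ leads to $1 - \tilde g(\lambda) = \lambda/(\lambda+\mu)$ and thus (Eqs. (6.92) and (6.88))

$$MTTF_{S0,\,\psi=0} = \frac{2\lambda+\lambda_r+\mu}{\lambda(\lambda+\lambda_r)}, \qquad PA_{S,\,\psi=0} = \frac{\mu\,(\lambda+\lambda_r+\mu)}{(\lambda+\lambda_r)(\lambda+\mu)+\mu^2}.$$

Assuming $\mu \gg \lambda, \lambda_r$ yields (considering $0 \le \lambda\psi < \lambda/\mu$)

$$\frac{MTTF_{S0,\,\psi>0}}{MTTF_{S0,\,\psi=0}} \approx 1 - \lambda\psi \qquad \text{and} \qquad \frac{PA_{S,\,\psi>0}}{PA_{S,\,\psi=0}} \approx 1 + \lambda\psi\,\frac{\lambda+\lambda_r}{\mu}. \qquad (6.115)$$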
Equation (6.115) allows the conclusion to be made that:

For $\lambda\,MTTR \ll 1$, the shape of the distribution function of the repair time has (as long as MTTR is unchanged) a small influence on results at system level, in particular on the mean time to failure $MTTF_{S0}$ and on the asymptotic & steady-state value of the point availability $PA_S$ of a 1-out-of-2 redundancy.

Example 6.9 shows a numerical comparison. This important result can be extended to complex structures.

Example 6.9 A 1-out-of-2 parallel redundancy with identical elements $E_1$ and $E_2$ has failure rate $\lambda = 10^{-2}\,\mathrm{h}^{-1}$ and lognormally distributed repair times with mean $MTTR = 2.4\,\mathrm{h}$ and variance $0.6\,\mathrm{h}^2$ (Eqs. (A6.112), (A6.113) with $\lambda = 0.438\,\mathrm{h}^{-1}$, $\sigma = 0.315$). Compute the mean time to failure $MTTF_{S0}$ and the asymptotic & steady-state point and average availability $PA_S$ with approximate expressions: (i) $\tilde g(\lambda)$ from Eq. (6.114); (ii) $\tilde g(\lambda)$ from Eq. (6.113); (iii) $g(t) = \mu' e^{-\mu'(t-\psi)}$, $t \ge \psi$, $\psi = 1.3\,\mathrm{h}$, $1/\mu' = 1.1\,\mathrm{h}$, $1/\mu = 2.4\,\mathrm{h}$ (Eq. (4.2)); (iv) $g(t) = \mu e^{-\mu t}$ and $1/\mu = 2.4\,\mathrm{h}$.

Solution (i) With $\tilde g(\lambda) = 0.976$ it follows (Eq. (6.107)) that $MTTF_{S0} = 2183\,\mathrm{h}$ and (Eq. (6.109)) $PA_S = 1$. (ii) With $\tilde g(\lambda) = 0.9763$ it follows (Eq. (6.107)) that $MTTF_{S0} = 2211\,\mathrm{h}$ and (Eq. (6.109)) $PA_S \approx 0.9994$. (iii) Example 6.8 yields $MTTF_{S0,\,\psi=1.3\mathrm{h}} = 2206\,\mathrm{h}$ and $PA_{S,\,\psi=1.3\mathrm{h}} = 0.9995$. (iv) From Eqs. (6.92) and (6.88) it follows that $MTTF_{S0} = 2233\,\mathrm{h}$ and $PA_S = 0.9989$.

Supplementary results: Numerical computation with the lognormal distribution ($MTTR = 2.4\,\mathrm{h}$, $Var[\tau'] = 0.6\,\mathrm{h}^2$) yields $MTTF_{S0} = 2186\,\mathrm{h}$ and $PA_S \approx 0.9995$. For a failure rate $\lambda = 10^{-3}\,\mathrm{h}^{-1}$, results were: 209'333 h, 1; 209'611 h, 0.999997; 209'563 h, 0.999995; 209'833 h, 0.999989; 209'513 h, 0.999994.
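The four approximations of Example 6.9 can be recomputed with a few lines of code. The sketch below is an illustration, not the book's procedure: the two closed forms are written here as $MTTF_{S0} = [\lambda + (\lambda+\lambda_r)(1-\tilde g(\lambda))] / [\lambda(\lambda+\lambda_r)(1-\tilde g(\lambda))]$ and $PA_S = [\lambda + \lambda_r(1-\tilde g(\lambda))] / [\lambda(\lambda+\lambda_r)MTTR + \lambda \tilde g(\lambda)]$, which is an assumed reading of Eqs. (6.107) and (6.109); the function names are hypothetical. With $\lambda_r = \lambda$ (active redundancy) these forms reproduce the rounded values quoted in the solution.

```python
def mttf_s0(lam, lam_r, g):
    """Mean time to failure for a given value g of the Laplace transform
    of the repair-time density at s = lam (assumed form of Eq. (6.107))."""
    q = 1.0 - g
    return (lam + (lam + lam_r) * q) / (lam * (lam + lam_r) * q)


def pa_s(lam, lam_r, g, mttr):
    """Steady-state point availability (assumed form of Eq. (6.109))."""
    return (lam + lam_r * (1.0 - g)) / (lam * (lam + lam_r) * mttr + lam * g)


def example_6_9(lam=1e-2, mttr=2.4, var=0.6, psi=1.3):
    """Return {case: (MTTF_S0 in h, PA_S)} for cases (i) to (iv)."""
    mu = 1.0 / mttr
    g_values = {
        "(i)": 1.0 - lam * mttr,                                        # first order
        "(ii)": 1.0 - lam * mttr + lam ** 2 * (mttr ** 2 + var) / 2.0,  # second order
        "(iii)": mu * (1.0 - lam * psi) / (lam + mu * (1.0 - lam * psi)),  # shifted exp.
        "(iv)": mu / (lam + mu),                                        # exponential
    }
    # Active redundancy: reserve failure rate equals operating failure rate.
    return {k: (mttf_s0(lam, lam, g), pa_s(lam, lam, g, mttr))
            for k, g in g_values.items()}


if __name__ == "__main__":
    for case, (mttf, pa) in example_6_9().items():
        print(f"{case:6s} MTTF_S0 = {mttf:8.1f} h   PA_S = {pa:.4f}")
```

Calling `example_6_9(lam=1e-3)` gives the corresponding figures for the lower failure rate mentioned in the supplementary results.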
6.4.3 1-out-of-2 Redundancy with Constant Failure Rate only in the Reserve State and Arbitrary Repair Rates

Generalization of the repair and failure rates of a 1-out-of-2 redundancy leads to a nonregenerative stochastic process. However, in many practical applications it can be assumed that the failure rate in the reserve state is constant. If this holds, and the 1-out-of-2 redundancy has only one repair crew, then the process involved is regenerative with exactly one regeneration state [6.5 (1975)]. To see this, consider a 1-out-of-2 warm redundancy, satisfying assumptions (6.1) - (6.7), with failure-free times distributed according to $F(x)$ in the operating state and $V(x) = 1 - e^{-\lambda_r x}$ in the reserve state, and repair times distributed according to $G(x)$ for repair of failures in the operating state and $W(x)$ for repair of failures in the reserve state ($F(0) = V(0) = G(0) = W(0) = 0$, densities $f(x)$, $g(x)$, $w(x)$). Figure 6.10a shows a time schedule of such a system and Fig. 6.10b gives the state transition diagram (to visualize possible state transitions) of the involved stochastic process.
Figure 6.10 Repairable 1-out-of-2 warm redundancy with constant failure rate $\lambda_r$ in the reserve state, arbitrary failure rate in the operating state, arbitrary repair rates, one repair crew, ideal failure recognition and switch; a) possible time schedule (repair times greatly exaggerated); b) state transition diagram to visualize possible state transitions (only $Z_1$ is a regeneration state)
States $Z_0$, $Z_1$, and $Z_2$ are up states. State $Z_1$ is the only regeneration state present here (Fig. 6.10a). The occurrence of $Z_1$ brings the process to a situation of total independence from the previous time development. It is therefore sufficient to investigate the time behavior from $t = 0$ up to the first regeneration point and between two consecutive regeneration points (Appendix A7.7). Let us consider first the case in which the regeneration state $Z_1$ is entered at $t = 0$ ($S_{RP0}$) and let $S_{RP1}$ be the first renewal point after $t = 0$. The reliability function $R_{S1}(t)$ is given by (see Table 6.2 for definitions)

$$R_{S1}(t) = 1 - F(t) + \int_0^t u_1(x)\,R_{S1}(t-x)\,dx, \qquad (6.116)$$

with

$$1 - F(t) = \Pr\{\text{failure-free operating time of the element operating at } t = 0 \text{ is} > t \mid Z_1 \text{ entered at } t = 0\}$$

and

$$\int_0^t u_1(x)\,R_{S1}(t-x)\,dx = \Pr\{(S_{RP1} \le t \,\cap\, \text{up in } (S_{RP1}, t]) \mid Z_1 \text{ entered at } t = 0\}.$$
Figure 6.11 Possible time schedules at $t = 0$ for the 1-out-of-2 redundancy according to Fig. 6.10 (panels a to d)
The first renewal point $S_{RP1}$ occurs at the time $x$ (i.e. within the interval $(x, x+dx]$) only if at this time the operating element fails and the reserve element is ready to enter the operating state. The quantity $u_1(x)$, defined as

$$u_1(x) = \lim_{\delta x \downarrow 0} \frac{1}{\delta x} \Pr\{(x < S_{RP1} \le x + \delta x \,\cap\, \text{system not failed in } (0, x]) \mid Z_1 \text{ entered at } t = 0\},$$

can be obtained as (Fig. 6.11a)
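The point availability is given by

$$PA_{S1}(t) = 1 - F(t) + \int_0^t u_1(x)\,PA_{S1}(t-x)\,dx + \int_0^t u_2(x)\,PA_{S1}(t-x)\,dx, \qquad (6.120)$$

with

$$1 - F(t) = \Pr\{\text{failure-free operating time of the element operating at } t = 0 \text{ is} > t \mid Z_1 \text{ entered at } t = 0\},$$

$$\int_0^t u_1(x)\,PA_{S1}(t-x)\,dx = \Pr\{(S_{RP1} \le t \,\cap\, \text{up at } t) \mid Z_1 \text{ entered at } t = 0\},$$

and

$$\int_0^t u_2(x)\,PA_{S1}(t-x)\,dx = \Pr\{(S_{RP1} \le t \,\cap\, \text{system failed in } (0, S_{RP1}] \,\cap\, \text{up at } t) \mid Z_1 \text{ entered at } t = 0\}.$$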
The quantity $u_2(x)$, defined as

$$u_2(x) = \lim_{\delta x \downarrow 0} \frac{1}{\delta x} \Pr\{(x < S_{RP1} \le x + \delta x \,\cap\, \text{system failed in } (0, x]) \mid Z_1 \text{ entered at } t = 0\},$$

can be obtained as (Fig. 6.11b)

$$u_2(x) = g(x)\,F(x) + \int_0^x \dots \qquad (6.121)$$

with
One recognizes that $u_1(x) + u_2(x)$ is the density of the random variable giving successive occurrence times of state $Z_1$, i.e. interarrival times separating the renewal points $0, S_{RP1}, S_{RP2}, \dots$ of the embedded renewal process. Consider now the case in which at $t = 0$ the state $Z_0$ is entered. The reliability function $R_{S0}(t)$ is given by

$$R_{S0}(t) = 1 - F(t) + \int_0^t u_3(x)\,R_{S1}(t-x)\,dx, \qquad (6.123)$$

with (Fig. 6.11c)

$$u_3(x) = \lim_{\delta x \downarrow 0} \frac{1}{\delta x} \Pr\{(x < S_{RP1} \le x + \delta x \,\cap\, \text{system not failed in } (0, x]) \mid Z_0 \text{ entered at } t = 0\} = f(x)\,PA_0(x), \qquad (6.124)$$

where

$$PA_0(x) = \Pr\{\text{reserve element up at time } x \mid Z_0 \text{ entered at } t = 0\}$$
The point availability $PA_{S0}(t)$ is given by

with (Fig. 6.11d)

$$u_4(x) = \lim_{\delta x \downarrow 0} \frac{1}{\delta x} \Pr\{(x < S_{RP1} \le x + \delta x \,\cap\, \text{system failed in } (0, x]) \mid Z_0 \text{ entered at } t = 0\} = \int_0^x h'_{udu}(y)\,w(x-y)\,(F(x) - F(y))\,dy \qquad (6.128)$$

and

$$h'_{udu}(y) = v(y) + v(y) * w(y) * v(y) + v(y) * w(y) * v(y) * w(y) * v(y) + \dots \, . \qquad (6.129)$$

Equations (6.116), (6.120), (6.123), (6.127) can be solved using Laplace transforms. However, analytical difficulties can arise when calculating Laplace transforms for $F(x)$, $G(x)$, $W(x)$, $u_1(x)$, $u_2(x)$, $u_3(x)$, and $u_4(x)$ as well as at the inversion of the final equations. Easier is the calculation of the mean time to failure $MTTF_{S0}$ and of the asymptotic & steady-state values of the point and average availability $PA_S = AA_S$, for which the following expressions can be found using Laplace transforms, see Eqs. (6.123), (6.116) for $MTTF_{S0}$ and Eqs. (6.120), (6.127) for $PA_S$
and
with
Eq. (6.130) considers Eqs. (2.59), (2.61), (6.132). Eq. (6.131) considers that $PA_S$ exists, given by $PA_S = AA_S = \lim_{s \to 0} s\,\tilde{PA}_{S0}(s)$, and that $u_1(x) + u_2(x)$ is the density of a random variable with finite mean (p. 203), and thus $\int_0^\infty (u_3(x) + u_4(x))\,dx = 1$.
The model investigated in this section has as special cases that of Section 6.4.2, with $F(x) = 1 - e^{-\lambda x}$ and $W(x) = G(x)$, as well as the 1-out-of-2 standby redundancy with identical elements and arbitrarily distributed failure-free and repair times, see Example 6.10. Important results for a 1-out-of-2 redundancy with arbitrary repair rates, and failure rates as general as possible within a regenerative process, are given in Table 6.7.
Example 6.10 Using the results of Section 6.4.3, give the expressions for the reliability function $R_{S0}(t)$ and the point availability $PA_{S0}(t)$ for a 1-out-of-2 standby redundancy with 2 identical elements, failure-free time distributed according to $F(x)$, with density $f(x)$, and repair time distributed according to $G(x)$ with density $g(x)$.

Solution For a standby redundancy, $u_1(x) = f(x)\,G(x)$, $u_2(x) = g(x)\,F(x)$, $u_3(x) = f(x)$, and $u_4(x) \equiv 0$ (Eqs. (6.117), (6.121), (6.124), and (6.128)). From this, the expressions for $R_{S0}(t)$, $R_{S1}(t)$, $PA_{S0}(t)$, and $PA_{S1}(t)$ can be given. The Laplace transforms of $R_{S0}(t)$ and $PA_{S0}(t)$ are

with

The mean time to failure $MTTF_{S0}$ follows from Eq. (6.133) (or Eq. (6.130) with $u_3(x) = f(x)$, $\int_0^\infty u_3(x)\,dx = F(\infty) = 1$, $u_1(x) = f(x)\,G(x)$, and Eq. (6.132)) as

$$MTTF_{S0} = MTTF + \frac{MTTF}{1 - \int_0^\infty f(x)\,G(x)\,dx}.$$

For the asymptotic & steady-state value of the point and average availability $PA_S = AA_S$, Eq. (6.134) (or Eq. (6.131) with $u_1(x) = f(x)\,G(x)$ and $u_2(x) = g(x)\,F(x)$) yields

$$PA_S = AA_S = \frac{MTTF}{MTTF + \int_0^\infty F(x)\,(1 - G(x))\,dx}.$$
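The standby results of Example 6.10 lend themselves to a quick numerical check. The sketch below evaluates $MTTF_{S0} = MTTF + MTTF / (1 - \int_0^\infty f(x)G(x)\,dx)$ and, as an assumption derived here from a renewal argument (uptime per cycle $\tau$, cycle length $\max(\tau, \tau')$), $PA_S = MTTF / (MTTF + \int_0^\infty F(x)(1-G(x))\,dx)$. The trapezoidal integrator, the truncation point `upper`, and all function names are illustrative choices.

```python
import math


def integrate(fn, upper, steps=100_000):
    """Composite trapezoidal rule on [0, upper]."""
    h = upper / steps
    total = 0.5 * (fn(0.0) + fn(upper))
    for i in range(1, steps):
        total += fn(i * h)
    return total * h


def standby_1oo2(F, f, G, upper):
    """Return (MTTF_S0, PA_S) for a 1-out-of-2 standby redundancy with
    failure-free time distribution F (density f) and repair distribution G.
    `upper` truncates the improper integrals and must cover both tails."""
    mttf = integrate(lambda x: 1.0 - F(x), upper)
    # Probability that the running repair ends before the operating element fails.
    p_repair_first = integrate(lambda x: f(x) * G(x), upper)
    mttf_s0 = mttf + mttf / (1.0 - p_repair_first)
    # Mean system down time per renewal cycle, E[(tau' - tau)^+].
    down = integrate(lambda x: F(x) * (1.0 - G(x)), upper)
    return mttf_s0, mttf / (mttf + down)


if __name__ == "__main__":
    lam, mu = 0.01, 0.5  # exponential test case, rates in 1/h
    print(standby_1oo2(lambda x: 1.0 - math.exp(-lam * x),
                       lambda x: lam * math.exp(-lam * x),
                       lambda x: 1.0 - math.exp(-mu * x),
                       upper=3000.0))
```

For exponential $F$ and $G$ the output should agree with the Markov results for standby redundancy, $(2\lambda+\mu)/\lambda^2$ and $\mu(\lambda+\mu)/(\mu(\lambda+\mu)+\lambda^2)$.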
Table 6.7 Mean time to failure $MTTF_{S0}$, steady-state point & average availability $PA_S = AA_S$, and interval reliability $IR_S(\theta)$ for a repairable 1-out-of-2 redundancy with two identical elements (Fig. 6.10, one repair crew, arbitrary repair rates, failure rates as general as possible within a regenerative process)
Columns: Standby ($\lambda_r \equiv 0$), Active ($\lambda_r = \lambda$)
Rows: distribution of the failure-free times; distribution of the repair times; mean of the failure-free times (MTTF or $1/\lambda_r$); mean of the repair times ($MTTR = \int_0^\infty (1 - G(x))\,dx$); mean time to failure ($MTTF_{S0}$); point & average availability ($PA_S = AA_S$)*; interval reliability $IR_S(\theta)$*
Standby, mean time to failure: $MTTF_{S0} = MTTF + MTTF / (1 - \int_0^\infty f(x)\,G(x)\,dx)$
$u_1(x)$, $u_2(x)$, $u_3(x)$ as per Eqs. (6.117), (6.121), and (6.124); OS = operating state, RS = reserve state
* asymptotic & steady-state value
6.5 k-out-of-n Redundancy
A k-out-of-n redundancy, also known as k-out-of-n: G, consists of $n$ often identical elements, of which $k$ are necessary for the required function and $n-k$ are in reserve state (or repair). Assuming ideal failure recognition and switching, the reliability block diagram is as given in Fig. 6.12. Investigations in this section assume identical elements $E_1, \dots, E_n$, only one repair crew, and no further failures at system down (failures during a repair at system level are neglected, as per assumption (6.2)). Section 6.5.1 considers the case of warm redundancy with constant failure rate $\lambda$ in the operating state and $\lambda_r < \lambda$ in the reserve state as well as constant repair rate $\mu$. This case includes active redundancy ($\lambda_r = \lambda$) and standby redundancy ($\lambda_r \equiv 0$). An extension to cover other situations in which the failure rate is modified at state changes (e.g. for load sharing) is possible using the equations for the birth and death process developed in Appendix A7.5.5 (see also Section 2.3.5). Section 6.5.2 investigates a k-out-of-n active redundancy with constant failure rate and arbitrary repair rate. The influence of series elements (including switching elements) is considered in Sections 6.6 - 6.7. Imperfect switching, incomplete coverage, and common cause failures are investigated in Section 6.8.

Figure 6.12 k-out-of-n redundancy reliability block diagram (ideal failure recognition & switch)
6.5.1 k-out-of-n Warm Redundancy with Identical Elements and Constant Failure and Repair Rates
Assuming constant failure and repair rates, the time behavior of the k-out-of-n redundancy with identical elements can be investigated using a birth and death process (Appendix A7.5.5). Figure 6.13 gives the corresponding diagram of transition probabilities in $(t, t+\delta t]$. From Fig. 6.13 and Table 6.2, the following system of differential equations can be established for the state probabilities $P_j(t) = \Pr\{\text{in state } Z_j \text{ at } t\}$ of a k-out-of-n warm redundancy with one repair crew and no further failures at system down (constant failure rates $\lambda$ & $\lambda_r$ and repair rate $\mu$)

$$\dot P_0(t) = -\nu_0 P_0(t) + \mu P_1(t),$$
$$\dot P_j(t) = \nu_{j-1} P_{j-1}(t) - (\nu_j + \mu) P_j(t) + \mu P_{j+1}(t), \quad j = 1, \dots, n-k,$$
$$\dot P_{n-k+1}(t) = \nu_{n-k} P_{n-k}(t) - \mu P_{n-k+1}(t), \qquad (6.137)$$

with

$$\nu_j = k\lambda + (n-k-j)\,\lambda_r, \quad j = 0, \dots, n-k. \qquad (6.138)$$

Figure 6.13 Diagram of transition probabilities in $(t, t+\delta t]$ for a repairable k-out-of-n warm redundancy ($n$ identical elements, constant failure & repair rates, no further failures at system down, one repair crew, ideal failure recognition & switch, arbitrary $t$, $\delta t \downarrow 0$, birth and death process, $Z_0$ - $Z_{n-k}$ up states)
For the investigation of more general situations (arbitrary load sharing, more than one repair crew, or other cases in which failure and/or repair rates change at a state transition) one can use the birth and death process introduced in Appendix A7.5.5. The solution of the system (6.137) with the initial conditions at $t = 0$, $P_i(0) = 1$ and $P_j(0) = 0$ for $j \ne i$, yields the point availability (see Table 6.2 for definitions)

$$PA_{Si}(t) = \sum_{j=0}^{n-k} P_{ij}(t), \qquad (6.139)$$

with $P_{ij}(t) \equiv P_j(t)$ from Eq. (6.137) with $P_i(0) = 1$. In many practical applications, only the asymptotic & steady-state value of the point availability $PA_S$ is required. This can be obtained by setting $\dot P_j(t) = 0$ and $P_j(t) = P_j$ ($j = 0, \dots, n-k+1$) in Eq. (6.137). The solution is (Appendix A7.5.5)

$$PA_S = \sum_{j=0}^{n-k} P_j = 1 - P_{n-k+1}, \quad \text{with} \quad P_j = \frac{\pi_j}{\sum_{i=0}^{n-k+1} \pi_i}, \quad \pi_i = \frac{\nu_0 \cdots \nu_{i-1}}{\mu^i}, \quad \pi_0 = 1. \qquad (6.140)$$
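Equation (6.140) translates almost directly into code. The following sketch computes the steady-state probabilities of the birth and death process with $\nu_j = k\lambda + (n-k-j)\lambda_r$ and a single repair crew; the function name and return convention are illustrative, not taken from the text.

```python
def k_out_of_n_availability(n, k, lam, lam_r, mu):
    """Steady-state availability of a k-out-of-n warm redundancy
    (one repair crew, no further failures at system down).

    Returns (PA_S, [P_0, ..., P_{n-k+1}])."""
    m = n - k
    # Failure transition rates out of the up states Z_0 ... Z_{n-k}.
    nu = [k * lam + (m - j) * lam_r for j in range(m + 1)]
    # pi_0 = 1, pi_i = nu_0 * ... * nu_{i-1} / mu**i
    pi = [1.0]
    for rate in nu:
        pi.append(pi[-1] * rate / mu)
    total = sum(pi)
    probs = [p / total for p in pi]
    # Only Z_{n-k+1} is a down state.
    return 1.0 - probs[-1], probs


if __name__ == "__main__":
    pa, probs = k_out_of_n_availability(n=3, k=2, lam=1e-3, lam_r=1e-3, mu=0.5)
    print(pa, probs)
```

Setting `lam_r = lam` gives active redundancy and `lam_r = 0` gives standby redundancy, so the same routine covers the special cases discussed in this section.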
$PA_S$ is also the asymptotic & steady-state value of the average availability $AA_S$. As shown in Example A7.11 (Eq. (A7.157)), for $2\nu_j < \mu$ it holds that

From this, the following bounds for $PA_S$ can be used in many practical applications (assuming $2\nu_j < \mu$, $j = 0, \dots, n-k$) to obtain an approximate expression for $PA_S$
The reliability function follows from Table 6.2 and Fig. 6.13

with $\nu_i$ as in Eq. (6.138). Similar results hold for the mean time to failure

The solution of Eqs. (6.142) and (6.143) shows that $R_{Si}(t)$ and $MTTF_{Si}$ depend on $n-k$ only. This leads for $n-k = 1$ to

and for $n-k = 2$ to

This property holds for the point availability $PA_S$ as well, see Table 6.8 for results. Because of the constant failure rate, the interval reliability follows directly from

$$IR_{Si}(t, t+\theta) = \sum_{j=0}^{n-k} P_{ij}(t)\,R_{Sj}(\theta), \quad i = 0, \dots, n-k, \qquad (6.146)$$

with $P_{ij}(t)$ as in Eq. (6.139) and $R_{Sj}(\theta)$ from Eq. (6.142) with $t = \theta$. The asymptotic & steady-state value is then given by

$$IR_S(\theta) = \sum_{j=0}^{n-k} P_j\,R_{Sj}(\theta),$$
with $P_j$ from Eq. (6.140). Table 6.8 summarizes the main results for a k-out-of-n warm redundancy with constant failure and repair rates. Assuming, only for comparative investigations with the results of Table 6.8 and Section 6.7, $n$ repair crews (one for each element), the following approximate expressions can be found for active redundancy (totally independent elements) [6.27, 6.43]

$$MTTF_{S0} \approx \frac{1}{k\lambda \binom{n}{k}} \Big(\frac{\mu}{\lambda}\Big)^{n-k}, \qquad PA_S \approx 1 - \binom{n}{k-1} \Big(\frac{\lambda}{\mu}\Big)^{n-k+1}, \qquad n \text{ repair crews, active red., } \lambda/\mu \ll 1 \qquad (6.148)$$

and for standby redundancy [6.42]

$$MTTF_{S0} \approx \frac{(n-k)!\,\mu^{n-k}}{(k\lambda)^{n-k+1}}, \qquad PA_S \approx 1 - \frac{(k\lambda/\mu)^{n-k+1}}{(n-k+1)!} = 1 - \frac{1}{(n-k+1)\,\mu\,MTTF_{S0}}, \qquad n \text{ repair crews, standby red., } \lambda/\mu \ll 1 \qquad (6.149)$$
According to Eq. (A7.189), $PA_S$ in Eqs. (6.148) and (6.149) can be expressed as

$$PA_S \approx 1 - \frac{MTTR_S}{MTTF_S} \quad \text{with} \quad MTTR_S = \frac{1}{(n-k+1)\,\mu} \quad \text{and} \quad MTTF_S = MTTF_{S0}.$$

6.5.2 k-out-of-n Active Redundancy with Identical Elements, Constant Failure Rate, and Arbitrary Repair Rate
Generalization of the repair rate (by conserving constant failure rates ($\lambda$, $\lambda_r$), only one repair crew, and no further failure at system down) leads to stochastic processes with $n-k+1$ regeneration and $n-k$ nonregeneration states ($\{Z_0, Z_1\}$ and $\{Z_2\}$ in Fig. A7.11 for $n-k = 1$, $\{Z_0, Z_1, Z_{2'}\}$ and $\{Z_2, Z_3\}$ in Fig. A7.12 for $n-k = 2$). As an example let us consider a 2-out-of-3 active redundancy with 3 identical elements, failure rate $\lambda$ and repair time distributed according to $G(x)$ with $G(0) = 0$ and density $g(x)$. Because of the assumption of no further failure at system down, results of Section 6.4.2 for the 1-out-of-2 warm redundancy can be used for $n-k = 1$ by setting $k\lambda$ instead of $\lambda$ (see Table 6.8 as well as Eqs. (A7.183) & (A7.184) for $n-k = 1$ and Eqs. (A7.185) & (A7.186) for $n-k = 2$). For the 2-out-of-3 active redundancy one has to set $2\lambda$ instead of $\lambda$ and $\lambda$ instead of $\lambda_r$ in Eqs. (6.107) & (6.109) to obtain Eqs. (6.152) & (6.155). However, in order to show the utility of considering time schedules, an alternative derivation is given below.
Table 6.8 Mean time to failure $MTTF_{S0}$, steady-state point & average availability $PA_S = AA_S$, and interval reliability $IR_S(\theta)$ for a repairable k-out-of-n warm redundancy with $n$ identical elements (one repair crew, constant failure and repair rates $\lambda$, $\lambda_r$, $\mu$ ($\lambda_r < \lambda$ in reserve state, $\lambda_r \equiv 0$ for standby), no further failures at system down, ideal failure recognition & switch, Markov process)
Columns: mean time to failure ($MTTF_{S0}$); asymptotic & steady-state point and average availability ($PA_S = AA_S$); interval reliability ($IR_S(\theta)$)*
Rows: $n-k = 1$ (general case; $n = 2$, $k = 1$; $n = 3$, $k = 2$); $n-k = 2$ (general case; $n = 3$, $k = 1$; $n = 5$, $k = 3$); $n-k$ arbitrary
$n-k = 2$, general case: $PA_S = \dfrac{\nu_0\nu_1\mu + \nu_0\mu^2 + \mu^3}{\nu_0\nu_1\nu_2 + \nu_0\nu_1\mu + \nu_0\mu^2 + \mu^3} \approx 1 - \dfrac{\nu_0\nu_1\nu_2}{\mu^3}$, $IR_S(\theta) \approx R_{S0}(\theta)$
$\nu_i = k\lambda + (n-k-i)\,\lambda_r$, $i = 0, \dots, n-k$; $\lambda$, $\lambda_r$ = failure rates ($\lambda_r = \lambda$ → active redundancy ⇒ $\nu_0 \cdots \nu_{n-k} = \lambda^{n-k+1}\,n!/(k-1)!$, $\lambda_r \equiv 0$ → standby redundancy ⇒ $\nu_0 \cdots \nu_{n-k} = (k\lambda)^{n-k+1}$); $\mu$ = repair rate ($\mu = 1/MTTR$ because of only one repair crew); $R_{S0}(\theta)$ from Eq. (6.142); * see [6.5 (1985)] for exact solutions
Using Fig. 6.14a, the following integral equation can be established for the reliability function $R_{S0}(t)$, see Table 6.2 for definitions,

The Laplace transform of $R_{S0}(t)$ follows as

and the mean time to failure as

$$MTTF_{S0} = \frac{5 - 3\,\tilde g(2\lambda)}{6\lambda\,(1 - \tilde g(2\lambda))}.$$
For the point availability, Fig. 6.14b yields

$$PA_{S1}(t) = e^{-2\lambda t}\,(1 - G(t)) + \int_0^t g(x)\,e^{-2\lambda x}\,PA_{S0}(t-x)\,dx + \int_0^t g(x)\,(1 - e^{-2\lambda x})\,PA_{S1}(t-x)\,dx \qquad (6.153)$$
from which,
The asymptotic & steady-state value of the point and average availability follows from

by considering $1 - \tilde g(s) = s \cdot MTTR + o(s)$ as per Eq. (6.54), see Eq. (6.113) for the approximation of $\tilde g(2\lambda)$. For the asymptotic & steady-state value of the interval reliability, Eq. (6.112) can be used in most applications. Generalization of failure and repair rates leads to nonregenerative stochastic processes.
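The substitution rule stated at the beginning of this section ($2\lambda$ instead of $\lambda$, $\lambda$ instead of $\lambda_r$) can be exercised numerically. The sketch below assumes the 1-out-of-2 warm-redundancy expressions in the form $MTTF_{S0} = [\lambda + (\lambda+\lambda_r)(1-\tilde g)] / [\lambda(\lambda+\lambda_r)(1-\tilde g)]$ and $PA_S = [\lambda + \lambda_r(1-\tilde g)] / [\lambda(\lambda+\lambda_r)MTTR + \lambda \tilde g]$; this is a reading of Eqs. (6.107) and (6.109) adopted here, and the function names are hypothetical.

```python
def warm_1oo2(lam, lam_r, g, mttr):
    """(MTTF_S0, PA_S) of a 1-out-of-2 warm redundancy, where g is the
    Laplace transform of the repair-time density evaluated at s = lam."""
    q = 1.0 - g
    mttf = (lam + (lam + lam_r) * q) / (lam * (lam + lam_r) * q)
    pa = (lam + lam_r * q) / (lam * (lam + lam_r) * mttr + lam * g)
    return mttf, pa


def active_2oo3(lam, g_2lam, mttr):
    """2-out-of-3 active redundancy via the substitution
    lam -> 2*lam, lam_r -> lam; g_2lam is the transform at s = 2*lam."""
    return warm_1oo2(2.0 * lam, lam, g_2lam, mttr)


if __name__ == "__main__":
    lam, mu = 1e-3, 0.5
    g = mu / (2.0 * lam + mu)  # exponential repair, transform at s = 2*lam
    print(active_2oo3(lam, g, 1.0 / mu))
```

Under these assumptions the first return value simplifies to $(5 - 3\tilde g(2\lambda)) / (6\lambda(1-\tilde g(2\lambda)))$, the mean time to failure given above, and for exponentially distributed repair times it reduces to the Markov result $(5\lambda+\mu)/(6\lambda^2)$.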
Figure 6.14 Possible time schedule for a repairable 2-out-of-3 active redundancy (constant failure rate, arbitrary repair rate, one repair crew, no further failures at system down, repair times exaggerated); a) calculation of $R_{S0}(t)$; b) calculation of $PA_{S0}(t)$
6.6 Simple Series - Parallel Structures
A series - parallel structure is an arbitrary combination of series and parallel models, see Table 2.1 for some examples. Such a structure is generally investigated on a case-by-case basis using the methods of Sections 6.3 - 6.5. If the time behavior can be described by a Markov or semi-Markov process, Table 6.2 can be used to establish equations for the reliability function, point availability, and interval reliability.

As a first example, let us consider a repairable 1-out-of-2 active redundancy with elements $E_1 = E_2 = E$ in series with a switching element $E_\nu$. The failure rates $\lambda$ and $\lambda_\nu$ as well as the repair rates $\mu$ and $\mu_\nu$ are constant. The system has only one repair crew, repair priority on $E_\nu$ (a repair on $E_1$ or $E_2$ is stopped as soon as a failure of $E_\nu$ occurs, see Example 6.11 for the case of no priority), and no further failures at system down (failures during a repair at system level are neglected). Figure 6.15 gives the reliability block diagram and the diagram of transition probabilities in $(t, t+\delta t]$. The reliability function can be calculated using Table 6.2, or directly by considering that for a series structure the reliability at system level is still the product of the reliability of the elements

$$R_{S0}(t) = e^{-\lambda_\nu t}\,R_{S0,\,\text{1-out-of-2}}(t). \qquad (6.156)$$

Because of the term $e^{-\lambda_\nu t}$, the Laplace transform of $R_{S0}(t)$ follows directly from the Laplace transform of the reliability function for the 1-out-of-2 parallel redundancy $R_{S0,\,\text{1-out-of-2}}$ by replacing $s$ with $s + \lambda_\nu$ (Table A9.7)

The mean time to failure $MTTF_{S0}$ follows from

$$MTTF_{S0} = \tilde R_{S0}(0) = \frac{3\lambda + \lambda_\nu + \mu}{(\lambda+\lambda_\nu)(2\lambda+\lambda_\nu) + \mu\lambda_\nu} \approx \frac{1}{\lambda_\nu + 2\lambda^2/\mu}. \qquad (6.158)$$
The last part of Eq. (6.158) clearly shows the effect of the series element $E_\nu$. The asymptotic & steady-state value of the point and average availability $PA_S = AA_S$ is obtained as solution of the following system of algebraic equations, see Fig. 6.15 and Table 6.2,

Figure 6.15 Reliability block diagram and diagram of transition probabilities in $(t, t+\delta t]$ for a repairable 1-out-of-2 active redundancy with a switch element (two identical elements $E_1 = E_2 = E$, constant failure and repair rates ($\lambda$, $\lambda_\nu$, $\mu$, $\mu_\nu$), one repair crew, repair priority on $E_\nu$, no further failures at system down, ideal failure recognition, arbitrary $t$, $\delta t \downarrow 0$, Markov process, $Z_0$ & $Z_2$ are up states)
For the solution of the system given by Eq. (6.159), one (arbitrarily chosen) equation must be dropped and replaced by $P_0 + P_1 + P_2 + P_3 + P_4 = 1$. The solution yields $P_0$ through $P_4$, from which

As for the mean time to failure (Eq. (6.158)), the last part of Eq. (6.160) shows the influence of the series element $E_\nu$. For the asymptotic & steady-state value of the interval reliability one obtains (Table 6.2)

Example 6.11 Give the reliability function and the asymptotic & steady-state value of the point and average availability for a 1-out-of-2 active redundancy in series with a switching element, as in Fig. 6.15, but without repair priority on the switching element.
Figure 6.16 Reliability block diagram and state transition diagram for a 2-out-of-3 majority redundancy ($E_1 = E_2 = E_3 = E$, constant failure rates $\lambda$ for $E$ and $\lambda_\nu$ for $E_\nu$, repair time distributed according to $G(x)$ with density $g(x)$, one repair crew, no repair priority, no further failures at system down; $Z_0$, $Z_1$, and $Z_4$ constitute an embedded semi-Markov process, $Z_0$ and $Z_1$ are up states)
Solution The diagram of transition probabilities in $(t, t+\delta t]$ of Fig. 6.15 can be used by changing the transition from state $Z_3$ to state $Z_2$ to one from $Z_3$ to $Z_1$ and $\mu_\nu$ to $\mu$. The reliability function is still given by Eq. (6.156), since states $Z_1$, $Z_3$, and $Z_4$ are absorbing states for reliability calculations. For the asymptotic & steady-state value of the point and average availability $PA_S = AA_S$, Eq. (6.159) is modified to

and the solution leads to

$$PA_S = AA_S = \frac{1}{1 + \dfrac{\lambda_\nu}{\mu_\nu} + \dfrac{2\lambda(\lambda+\lambda_\nu)/\mu^2}{1 + (2\lambda+\lambda_\nu)/\mu}} \approx 1 - \frac{\lambda_\nu}{\mu_\nu} - \frac{2\lambda(\lambda+\lambda_\nu)}{\mu^2}. \qquad (6.162)$$
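The two repair strategies (with and without priority on the switching element) can be compared by solving the five balance equations numerically. The sketch below is illustrative only: the state numbering ($Z_0$ all up, $Z_1$ switch failed, $Z_2$ one redundant element failed, $Z_3$ switch failed while one element is down, $Z_4$ both redundant elements failed) is assumed from the caption of Fig. 6.15 and the solution of Example 6.11, and the small Gaussian-elimination solver is a generic helper, not something taken from the book.

```python
def steady_state(rates, n):
    """Solve the balance equations of an n-state Markov chain.
    `rates` maps (i, j) -> transition rate from Z_i to Z_j."""
    a = [[0.0] * n for _ in range(n)]
    for (i, j), r in rates.items():
        a[j][i] += r  # probability flow into Z_j
        a[i][i] -= r  # probability flow out of Z_i
    a[n - 1] = [1.0] * n  # replace one equation by the normalization
    b = [0.0] * (n - 1) + [1.0]
    for col in range(n):  # Gaussian elimination with partial pivoting
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    p = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        p[r] = (b[r] - sum(a[r][c] * p[c] for c in range(r + 1, n))) / a[r][r]
    return p


def series_switch_availability(lam, lam_v, mu, mu_v, priority=True):
    """PA_S of a 1-out-of-2 active redundancy in series with a switch."""
    rates = {(0, 1): lam_v, (0, 2): 2 * lam, (1, 0): mu_v, (2, 0): mu,
             (2, 3): lam_v, (2, 4): lam, (4, 2): mu}
    if priority:
        rates[(3, 2)] = mu_v  # switch repaired first, back to Z2
    else:
        rates[(3, 1)] = mu    # running repair of E is completed first
    p = steady_state(rates, 5)
    return p[0] + p[2]        # Z0 and Z2 are the up states


if __name__ == "__main__":
    for prio in (True, False):
        print(prio, series_switch_availability(1e-3, 1e-4, 0.5, 1.0, prio))
```

Replacing one balance equation by the normalization condition mirrors the procedure described for Eq. (6.159); which equation is dropped does not affect the result.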
As a second example let us consider a 2-out-of-3 majority redundancy (2-out-of-3 active redundancy in series with a voter $E_\nu$) with arbitrary repair rate. Assumptions (6.1) - (6.7) also hold here, in particular Assumption (6.2), i.e. no further failures at system down. The system has constant failure rates, $\lambda$ for the three redundant elements and $\lambda_\nu$ for the series element $E_\nu$, and repair time distributed according to $G(x)$ with $G(0) = 0$ and density $g(x)$. Figure 6.16 shows the corresponding reliability block diagram and the state transition diagram. $Z_0$ and $Z_1$ are up states. $Z_0$, $Z_1$, and $Z_4$ are regeneration states and constitute a semi-Markov process embedded in the original process. This property will be used for the investigations. From Fig. 6.16 and Table 6.2 there follows for the semi-Markov transition probabilities $Q_{01}(x)$, $Q_{10}(x)$, $Q_{04}(x)$, $Q_{40}(x)$, $Q_{121}(x)$, and $Q_{134}(x)$ the expressions

$Q_{121}(x)$ is used to calculate the point availability. It accounts for the process returning from state $Z_2$ to state $Z_1$ and that $Z_2$ is not a regeneration state (probability for the transition $Z_1 \to Z_2 \to Z_1$, see also Fig. A7.11a), similarly for $Q_{134}(x)$. $Q'_{12}(x)$ and $Q'_{13}(x)$ as given in Fig. 6.16 are not semi-Markov transition probabilities ($Z_2$ and $Z_3$ are not regeneration states). However,

yields an equivalent $Q_1(x) = Q_{10}(x) + Q'_{12}(x) + Q'_{13}(x)$ useful for the calculation of the reliability function. Considering that $Z_0$ and $Z_1$ are up states and at the same time regeneration states, as well as the above expressions, the following system of integral equations can be established for the reliability functions $R_{S0}(t)$ and $R_{S1}(t)$
The solution of Eq. (6.164) yields

and

$\tilde R_{S0}(s)$ and $MTTF_{S0}$ could have been obtained as for Eq. (6.157) by setting $s = s + \lambda_\nu$ in Eq. (6.151).
For the point availability, calculation of the transition probabilities Pij ( t ) with Table 6.2 and Eq. (6.163) leads to
and
From Eqs. (6.167) and (6.168) it follows the point availability $PA_{S0}(t) = P_{00}(t) + P_{01}(t)$ and from this (using Laplace transform) the asymptotic & steady-state value

(6.169)

with MTTR as per Eq. (6.110). For the asymptotic & steady-state value of the interval reliability, the following approximate expression can be used for practical applications (Eq. (6.111))

In Eq. (6.170) it holds that $P_0 = \lim_{t\to\infty} P_{00}(t)$, with $P_{00}(t)$ from Eq. (6.167). For $\theta\,(2\lambda + \lambda_\nu) \ll 1$, $IR_S(\theta) \approx R_{S0}(\theta)$ can be used.
Example 6.12 (i) Give using Eqs. (6.166) and (6.169) the mean time to failure $MTTF_{S0}$ and the asymptotic & steady-state point and average availability $PA_S = AA_S$ for the case of a constant repair rate $\mu$. (ii) Compare for the case of constant repair rate the true value of the interval reliability $IR_S(\theta)$ with the approximate expression given by Eq. (6.170).

Solution (i) With $G(x) = 1 - e^{-\mu x}$ it follows that $\tilde g(2\lambda + \lambda_\nu) = \mu/(2\lambda + \lambda_\nu + \mu)$ and thus from Eq. (6.166)

$$MTTF_{S0} = \frac{5\lambda + \lambda_\nu + \mu}{(3\lambda+\lambda_\nu)(2\lambda+\lambda_\nu) + \mu\lambda_\nu} = \frac{1}{\lambda_\nu + 6\lambda^2/(5\lambda+\lambda_\nu+\mu)} \approx \frac{1}{\lambda_\nu + 6\lambda^2/\mu} \qquad (6.171)$$

and from Eq. (6.169)

(ii) With $P_{00}(t)$ and $P_{01}(t)$ from Eqs. (6.167) & (6.168) it follows for the asymptotic & steady-state value of the interval reliability (Table 6.2) that

The approximate expression according to Eq. (6.170) yields

i.e. the same value as per Eq. (6.173) for $3\lambda \ll \mu$ and considering $R_{S1}(\theta) \le R_{S0}(\theta)$.
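Part (i) of Example 6.12 also illustrates the shift rule $s \to s + \lambda_\nu$. The sketch below assumes, for constant repair rate $\mu$, the Laplace transform of the reliability function of the 2-out-of-3 active redundancy in the form $\tilde R(s) = (s + 5\lambda + \mu)/(s^2 + s(5\lambda+\mu) + 6\lambda^2)$; this form is derived here from the two-state Markov model and is not quoted from the text. Evaluating it at $s = \lambda_\nu$ should reproduce Eq. (6.171). Function names are hypothetical.

```python
def r_tilde_2oo3(s, lam, mu):
    """Laplace transform of R_S0(t) for a 2-out-of-3 active redundancy
    with constant failure rate lam and constant repair rate mu
    (assumed form, derived from the two-state transient Markov model)."""
    return (s + 5 * lam + mu) / (s * s + s * (5 * lam + mu) + 6 * lam ** 2)


def mttf_majority(lam, lam_v, mu):
    """Return three values for the 2-out-of-3 majority redundancy:
    MTTF via the shift s -> s + lam_v evaluated at s = 0,
    the closed form of Eq. (6.171), and its approximation for lam << mu."""
    shifted = r_tilde_2oo3(lam_v, lam, mu)
    closed = (5 * lam + lam_v + mu) / ((3 * lam + lam_v) * (2 * lam + lam_v) + mu * lam_v)
    approx = 1.0 / (lam_v + 6 * lam ** 2 / mu)
    return shifted, closed, approx


if __name__ == "__main__":
    print(mttf_majority(lam=1e-3, lam_v=1e-5, mu=0.5))
```

The approximation neglects $5\lambda + \lambda_\nu$ with respect to $\mu$ and therefore slightly underestimates the exact value.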
To give a better feeling for the mutual influence of the different parameters involved, Figs. 6.17 and 6.18 compare the mean time to failure $MTTF_{S0}$ and the asymptotic & steady-state unavailability $1 - PA_S$ of some basic series - parallel structures. The equations are taken from Table 6.10, which summarizes the results of Sections 6.2 to 6.6 for constant failure and repair rates (approximate expressions are used to simplify calculation, see Section 6.7.2). Comparison with Figs. 2.8 and 2.9 (nonrepairable case) confirms that the most important gain is obtained by the first step (structure b), and shows that the influence of series elements is much greater in the repairable than in the nonrepairable case. Referring to the structures a), b), and c) of Figs. 6.17 and 6.18 the following design rule can be formulated:

The failure rate of the series element in a repairable 1-out-of-2 active redundancy should not be greater than 1% (0.2% for $\mu/\lambda_1 > 500$) of the failure rate of the redundant elements, i.e. with respect to Fig. 6.17

$$\lambda_2 < 0.01\,\lambda_1 \text{ in general, and } \lambda_2 < 0.002\,\lambda_1 \text{ for } \mu/\lambda_1 > 500. \qquad (6.174)$$

6.7 Approximate Expressions for Large Series - Parallel Structures

6.7.1 Introduction
Reliability and availability calculation of large series-parallel structures rapidly becomes time-consuming, even if a constant failure rate $\lambda_i$ and repair rate $\mu_i$ are assumed for each element $E_i$ of the reliability block diagram and only the mean time to failure $MTTF_{S0}$ or the steady-state availability $PA_S = AA_S$ is required. This is because of the large number of states involved, which for a reliability block diagram with n elements can reach $1 + \sum_{i=1}^{n} n!/(n-i)! \approx e \cdot n!$, considering all possible repair strategies. For instance, the system of Fig. 6.19 with 4 elements would have more than 50 states if the assumption of no further failure at system down (6.2) were dropped. $2^n$ states holds for nonrepairable systems or for systems with totally independent elements (Point 1 below). Use of approximate expressions thus becomes important. Besides the assumption of one repair crew and no further failure at system down (Sections 6.2 - 6.6, partly 6.7 & 6.8), given below as Point 3, further assumptions yielding approximate expressions for system reliability and availability are possible, provided that $\lambda_i \ll \mu_i$ holds for each element.
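The growth of the state space mentioned above can be checked numerically. The following is a minimal sketch, assuming the state count reads $1 + \sum_{i=1}^{n} n!/(n-i)!$ (one state for "all elements up" plus one state per ordered sequence of failed elements); the function name `max_states` is purely illustrative.

```python
from math import e, factorial


def max_states(n: int) -> int:
    """Assumed upper bound 1 + sum_{i=1..n} n!/(n-i)! on the number of states.

    Each term n!/(n-i)! counts the ordered sequences of i failed elements,
    which matter when every repair strategy has to be distinguished.
    """
    return 1 + sum(factorial(n) // factorial(n - i) for i in range(1, n + 1))


if __name__ == "__main__":
    # Columns: n, 2^n (independent elements), assumed bound, e * n!
    for n in (2, 4, 6, 8):
        print(n, 2 ** n, max_states(n), round(e * factorial(n)))
```

For n = 4 the bound gives 65 states, in line with the statement that the system of Fig. 6.19 would have more than 50 states, while only $2^4 = 16$ states are needed for totally independent elements.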
Figure 6.17 Comparison between a one-item structure and a 1-out-of-2 active redundancy with a series element (repairable, one repair crew, repair priority on $E_2$, no further failure at system down, constant failure rates $\lambda_1$ & $\lambda_2$ and repair rate $\mu$, Markov process, $\lambda_1$ remains the same in both structures, equations according to Table 6.10; given on the right are the ratios of $MTTF_{S0}$ and of $1 - PA_S$ to the values for structure a), with $MTTF_{S0c}$ and $1 - PA_{Sc}$ from Fig. 6.18; see Fig. 2.8 for the nonrepairable case)
Figure 6.18 Comparison between basic series-parallel structures (repairable, one repair crew with repair priority on $E_3$, active redundancy, no further failure at system down, constant failure rates $\lambda_1$ to $\lambda_3$ and repair rate $\mu$, Markov process, $\lambda_1$ and $\lambda_2$ remain the same in both structures, equations according to Table 6.10; see Fig. 2.9 for the nonrepairable case)
1. Totally independent elements: If each element of the reliability block diagram operates independently from every other (active redundancy, independent elements, one repair crew for each element), series-parallel structures can be reduced to one-item structures, which are themselves successively integrated into further series-parallel structures up to the system level. For each of the one-item structures obtained, the mean time to failure $MTTF_{S0}$ and the steady-state availability $PA_S$, calculated for the underlying series-parallel structure, are used to calculate an equivalent $MTTR_S$ from $PA_S = MTTF_S/(MTTF_S + MTTR_S)$ using $MTTF_S = MTTF_{S0}$. To simplify calculations, and considering the comments given to Eq. (6.93), a constant failure rate $\lambda_S = 1/MTTF_{S0}$ and a constant repair rate $\mu_S = 1/MTTR_S$ are assumed for each of the one-item structures obtained. Table 6.9 summarizes basic series-parallel structures based on totally independent elements (see Section 6.7.2 for an example).
2. Macro-structures: A macro-structure is a series, parallel, or simple series-parallel structure which is considered as a one-item structure for calculations at higher levels (integration into further macro-structures up to system level) [6.5 (1991)]. It satisfies Assumptions (6.1) - (6.7), in particular one repair crew for each macro-structure and no further failures during a repair at the macro-structure level. The procedure is similar to that of Point 1 above (see also the remark to Eq. (4.37) for the calculation of an equivalent $MTTR_S$). Table 6.10 summarizes basic macro-structures useful for practical applications, see Sections 6.2 to 6.6 for results and Section 6.7.2 for an example.
3. One repair crew and no further failures at system down: Assumptions (6.3) and (6.2), valid for all models investigated in Sections 6.3 - 6.6, apply in many practical applications. No further failures at system down means that failures during a repair at system level are neglected. This assumption has no influence on the reliability function at system level and its influence on the availability is limited (if $\lambda_i \ll \mu_i$ can be assumed for each element $E_i$).
4. Cutting states: Removing the states with more than k failures from the diagram of transition probabilities in $(t, t + \delta t]$, or the state transition diagram, produces in general an important reduction of the state diagram. The choice of k (often k = 2) is based on the required precision. An upper bound of the error for the asymptotic & steady-state value of the point and average availability $PA_S = AA_S$ (based on the mapping of states with k failures at the system level in the state $Z_k$ of a birth and death process and using Eq. (A7.157), $P_k \ge \sum_{i>k} P_i$, valid for $2(\lambda_1 + \ldots + \lambda_n) < \min\{\mu_1, \ldots, \mu_n\}$) has been given in [2.50 (1992)].
5. Clustering of states: Grouping of elements in the reliability block diagram or of states in the diagram of transition probabilities in $(t, t + \delta t]$ produces in general an important reduction of the number of states in the state diagram.
Combination of the above methods is possible. In any case, series elements must be grouped before any analysis (see Section 6.3 and the second row of Table 6.10). Considering that the steady-state probability for states with more than one failure decreases rapidly as the number of failures increases (by a factor of about $\lambda/\mu$ for each failure, see e.g. pp. 230 and 258 and the corresponding Figs. 6.20 and 6.34), all methods given above yield good approximate expressions for $MTTF_{S0}$ and $PA_S$ in practical applications. However, referring to the unavailability $1 - PA_S$, method 1 above can deliver lower values, for instance by a factor 2 with an order of magnitude $(\lambda/\mu)^2$ for a 1-out-of-2 active redundancy (compare Table 6.9 with Table 6.10). An analytical comparison of the above methods is difficult, in general. Numerical investigations show a close convergence of the results given by the different methods, as illustrated for instance in Section 6.7.2 (p. 230) for a practical example with extremely low values for $\mu/\lambda$ (down to 20).
Figure 6.19 Basic reliability block diagram for an uninterruptible el. power supply (UPS)
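The factor 2 between method 1 and the one-repair-crew models, and the equivalent repair rate of Point 1, can be illustrated for a 1-out-of-2 active redundancy with identical elements. The sketch below is not taken from Tables 6.9 or 6.10; it assumes the standard birth and death results for this structure (element availability $\mu/(\lambda+\mu)$ for independent repair, steady-state probabilities proportional to $1$, $2\lambda/\mu$, $2(\lambda/\mu)^2$ for one repair crew, and $MTTF_{S0} = (3\lambda+\mu)/(2\lambda^2)$ in both cases). All function names are illustrative.

```python
def unavailability_independent(lam: float, mu: float) -> float:
    """1 - PA_S for two totally independent elements (one crew per element)."""
    return (lam / (lam + mu)) ** 2


def unavailability_one_crew(lam: float, mu: float) -> float:
    """1 - PA_S for one repair crew (birth and death process with 3 states)."""
    return 2 * lam ** 2 / (mu ** 2 + 2 * lam * mu + 2 * lam ** 2)


def mttf_s0(lam: float, mu: float) -> float:
    """MTTF_S0 of the 1-out-of-2 active redundancy (same for both models)."""
    return (3 * lam + mu) / (2 * lam ** 2)


def equivalent_repair_rate(mttf: float, unavail: float) -> float:
    """mu_S = 1/MTTR_S, with MTTR_S from PA_S = MTTF_S / (MTTF_S + MTTR_S)."""
    mttr = mttf * unavail / (1.0 - unavail)
    return 1.0 / mttr


if __name__ == "__main__":
    lam, mu = 1e-3, 1.0  # illustrative values in 1/h
    u1 = unavailability_independent(lam, mu)
    u3 = unavailability_one_crew(lam, mu)
    print("ratio of unavailabilities:", u3 / u1)
    print("mu_S, independent elements:", equivalent_repair_rate(mttf_s0(lam, mu), u1))
    print("mu_S, one repair crew:", equivalent_repair_rate(mttf_s0(lam, mu), u3))
```

Under these assumptions the equivalent repair rate is close to $2\mu$ for independent elements and close to $\mu$ for one repair crew, which is where the factor 2 in the unavailability comes from.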
6.7.2 Application to a Practical Example
To illustrate how methods 1 to 3 of Section 6.7.1 work, let us consider the system with a reliability block diagram as in Fig. 6.19, and assume system new at t = 0, active redundancy, constant failure rates $\lambda_1$ to $\lambda_3$, constant repair rates $\mu_1$ to $\mu_3$, and repair priority $E_1$, $E_3$, $E_2$ [6.5 (1988)]. Except for some series elements (to be considered separately in a final step), the reliability block diagram of Fig. 6.19 describes an uninterruptible power supply (UPS) used for instance to buffer electrical power network failures in computer systems ($E_1$ being the power network). +) Although limited to 4 elements, the stochastic process describing the system of Fig. 6.19 would contain more than 50 states if the assumption of no further failure at system down were dropped. Assuming no further failure at system down, the state space is reduced to 12 states (Fig. 6.20). In the following, the mean time to failure ($MTTF_{S0}$) and the asymptotic & steady-state point and average availability ($PA_S = AA_S$) of the system given by Fig. 6.20 are investigated using method 1 (Table 6.9), method 2 (Table 6.10), and method 3 (Table 6.2) of Section 6.7.1. For a numerical comparison, results are given on p. 230 (also for method 4 and for the exact solution obtained by dropping the assumption of no further failure at system down), showing that all methods used deliver good approximate expressions.
+) A refinement to include the battery discharge has been investigated recently [6.45 (2002)].
Method 1 of Section 6.7.1 yields, using Table 6.9,
From Eqs. (6.175) - (6.177) it follows that
and
Method 2 of Section 6.7.1 yields, using Table 6.10,
Table 6.9 Basic structures for the investigation of large series-parallel systems by assuming totally independent elements (each element operates and is repaired independently of every other element, Point 1 p. 221), constant failure & repair rates ($\lambda$, $\mu$), active redundancy, one repair crew for each element, ideal failure recognition, Markov process (for rows 1 to 5 see Eqs. (6.48), (2.48) & (6.60), (2.48) & (6.99), (2.48) & (6.171) with $\lambda_\nu = 0$, and (2.48) & (6.148), respectively; $\lambda_S = 1/MTTF_{S0}$ and $\mu_S = 1/MTTR_S = \lambda_S/(1 - PA_S)$ are used to simplify the notation; approximations valid for $\lambda_i \ll \mu_i$; $PA_S = AA_S$ = asymptotic & steady-state point and average availability, often denoted by A)
Row 1, one-item structure ($E_1$): $\lambda_S = \lambda_1$, $\mu_S = \mu_1$, $PA_S = PA_1 = 1/(1 + \lambda_S/\mu_S) \approx 1 - \lambda_S/\mu_S$
Row 2, series structure ($E_1, \ldots, E_n$): $\lambda_S = \lambda_1 + \ldots + \lambda_n$, $PA_S \approx 1 - (\lambda_1/\mu_1 + \ldots + \lambda_n/\mu_n)$, $\mu_S = \lambda_S/(1 - PA_S) \approx (\lambda_1 + \ldots + \lambda_n)/(\lambda_1/\mu_1 + \ldots + \lambda_n/\mu_n)$
Row 3, 1-out-of-2 active ($E_1$, $E_2$): $\lambda_S = 1/MTTF_{S0}$ with $MTTF_{S0} \approx \mu_1\mu_2/(\lambda_1\lambda_2(\mu_1 + \mu_2))$
Row 4, 2-out-of-3 active ($E_1 = E_2 = E_3 = E$)
Row 5, k-out-of-n active ($E_1 = \ldots = E_n = E$)
From Eqs. (6.180) and (6.181) it follows that
and
Method 3 of Section 6.7.1 yields, using Table 6.2 and Fig. 6.20, the following system of algebraic equations for the mean time to failure ($M_i = MTTF_{Si}$)
where
From Eqs. (6.184) and (6.185) it follows that
Table 6.10 Basic macro-structures for the investigation of large series-parallel systems by successive building of macro-structures bottom up to system level (Point 2 p. 222), constant failure & repair rates ($\lambda$, $\mu$), active redundancy, one repair crew for each macro-structure, no further failure at system down, ideal failure recognition, Markov process (for rows 1-6 see Eqs. (6.48), (6.65) & (6.60), (6.103) & (6.99), and (6.160) & (6.158), (6.65), (6.60) & Tab. 6.8, and same as for row 5, respectively; $\lambda_S = 1/MTTF_{S0}$ and $\mu_S = 1/MTTR_S = \lambda_S/(1 - PA_S)$ are used to simplify the notation; approximations valid for $\lambda_i \ll \mu_i$)
Structures covered (rows 1 to 6): one-item structure; series structure, with $\mu_S = \lambda_S/(1 - PA_S) \approx (\lambda_1 + \ldots + \lambda_n)/(\lambda_1/\mu_1 + \ldots + \lambda_n/\mu_n)$; 1-out-of-2 active; 1-out-of-2 active ($E_1 = E_2 = E$), repair priority on $E_\nu$; 2-out-of-3 active ($E_1 = E_2 = E_3 = E$), repair priority on $E_\nu$; k-out-of-n active ($E_1 = \ldots = E_n = E$)
with
Similarly, for the asymptotic & steady-state value of the point and average availability $PA_S = AA_S$, the following system of algebraic equations can be obtained using Table 6.2 and Fig. 6.20
with $\rho_i$ as in Eq. (6.185). One (arbitrarily chosen) of the Eqs. (6.188) must be dropped and replaced by $P_0 + P_1 + \ldots + P_{11} = 1$. The solution yields $P_0$ to $P_{11}$, from which
with
and
Figure 6.20 Reliability block diagram and diagram of transition probabilities in $(t, t + \delta t]$ for the system described by Fig. 6.19 (active redundancy, one repair crew, repair priority in the sequence $E_1$, $E_3$, $E_2$, no further failures at system down, ideal failure recognition, arbitrary t, $\delta t \downarrow 0$, Markov process ($\rho_i = \sum_j \rho_{ij}$))
An analytical comparison of Eq. (6.186) with Eqs. (6.178) and (6.182), or of Eq. (6.189) with Eqs. (6.179) and (6.183), is difficult. Numerical evaluation ($\lambda$ and $\mu$ in $h^{-1}$, MTTF in h) is tabulated for given values of $\lambda_1$, $\lambda_2$, $\lambda_3$, $\mu_1$, $\mu_2$, $\mu_3$, with the following rows:
$MTTF_{S0}$ (Eq. (6.178), totally IE); $MTTF_{S0}$ (Eq. (6.182), MS); $MTTF_{S0}$ (Eq. (6.186), no FF); $MTTF_{S0}$ (Method 4, Cutting); $MTTF_{S0}$ (only one repair crew)
$1 - PA_S$ (Eq. (6.179), totally IE); $1 - PA_S$ (Eq. (6.183), MS); $1 - PA_S$ (Eq. (6.189), no FF); $1 - PA_S$ (Method 4, Cutting); $1 - PA_S$ (only one repair crew)
Also given in the above numerical comparison are the results obtained by method 4 of Section 6.7.1 (for a given precision of 1 0 - ~on the unavailability 1- PAs) and by dropping the assumption of no further failures at system down in method 3. These results confirm that for Li <
6.8 Systems with Complex Structure
Structures and models investigated in the previous sections of this chapter were based on the existence of a reliability block diagram and on some simplifying assumptions ((6.1) - (6.7)), in particular elements with only two states (good/failed) and ideal fault coverage & switching. This was, so far, good to understand basic investigation methods and tools, see e.g. Figs. 6.9, 6.10 & A7.6, Example 6.11, Section 6.7.2, and Table 6.2. However, in practical applications more complex situations can arise. This section uses tools developed in Appendix A7 (summarized in Table 6.2 for Markov & semi-Markov processes) to investigate complex fault tolerant repairable systems for cases in which a reliability block diagram does not exist or cannot easily be found. Constant failure and, in general, also constant repair rates are assumed. On the basis of practical examples it is shown that, working with the diagram of transition probabilities or a time schedule, problems occurring in practical applications can be solved on a case-by-case basis. To improve readability, the diagram of transition probabilities in $(t, t + \delta t]$ will be replaced in this section by the diagram of transition rates, which considers the transition rates $\rho_{ij}$ only, by omitting $\delta t$ and $1 - \rho_i \delta t$. Of course, each new system can provide a starting point for a better model, and a large number of papers is known on this subject too. After some general considerations (Section 6.8.1), Section 6.8.2 deals with aspects of preventive maintenance. Sections 6.8.3 & 6.8.4 consider imperfect switching and incomplete coverage. Elements with more than two states or one failure mode are discussed in Section 6.8.5. Section 6.8.6 investigates fault tolerant reconfigurable systems by considering that reconfiguration can occur because of mission profile (phased-mission systems) or failure. For this last case, reward and frequency / duration aspects are involved in the analysis as well.
Section 6.8.8 summarizes the procedure for modeling systems with complex structure. Alternative investigation methods (Petri nets, dynamic FTA, computer-aided analysis) are introduced in Section 6.9, and a Monte Carlo procedure, useful for rare events, is given. As a general rule, modeling complex systems is a task which must be solved in close cooperation between project and reliability engineers.
6.8.1 General Considerations
In the context of this book, a structure is complex when the reliability block diagram either does not exist or cannot be reduced to a series-parallel structure with independent elements (p. 52). If the reliability block diagram exists, but not as a series-parallel structure, reliability and availability analysis can be performed using one or more of the following assumptions (as in previous sections, failure-free time is used as a synonym for failure-free operating time, repair as a synonym for restoration):
1. For each element in the reliability block diagram, failure-free times and repair times are statistically independent.
2. Failure and repair rates of each element are constant (time independent).
3. Each element in the reliability block diagram has a constant failure rate.
4. The flow of failures is a Poisson process (homogeneous or nonhomogeneous).
5. No further failures can occur at system down.
6. Redundant elements are repaired on-line (no interruptions at system level).
7. After each repair, the repaired element is as-good-as-new.
8. After each repair, the entire system is as-good-as-new.
9. Only one repair crew is available, repair is started as soon as the repair crew is free (first-in first-out) or according to a given repair priority.
10. Totally independent elements, i.e. each element operates and is repaired independently of every other element (n repair crews for n elements).
11. Ideal failure recognition (in particular no hidden failures or false alarms).
12. All failure-free times and repair times are > 0, continuous, and have a finite mean and variance.
13. For each element, the mean time to repair is much lower than the mean time to failure ($MTTR_i \ll MTTF_i$).
14. Switches and switching operations are 100% reliable and have no influence on the reliability of the system.
15. Preventive maintenance is not considered.
A clear definition of the assumptions stated is important to fix the validity of the results obtained. It is often tacitly assumed that each element has only 2 states (good/failed), one failure mode (e.g. shorts or opens), and a time invariant required function (e.g. continuous operation). Elements with more than two states or one failure mode are discussed in Section 6.8.5 (see also Section 2.3.6 for the nonrepairable case). A time dependent operation and/or required function can be investigated when a constant failure rate is assumed (Section 6.8.6.2).
The following is a brief discussion of the above assumptions. With assumptions 1 and 2, the time behavior of the system can be described by a time-homogeneous Markov process with finitely many states. Equations can be established using the diagram of transition probabilities in $(t, t + \delta t]$ and Table 6.2. Difficulties can arise because of the large number of states involved. In such cases, a first possibility is to limit investigation to the calculation of the mean time to failure $MTTF_{S0}$ and the asymptotic & steady-state value of the point and average availability $PA_S = AA_S$, i.e. to the solution of algebraic equations. A second possibility is to use approximate expressions (Section 6.7) or special software tools (Section 6.9.3). Assumption 4 often applies to systems with a large number of elements. As shown in Sections 6.3 - 6.6, assumption 5 simplifies calculation of the point availability and interval reliability. It has no influence on the reliability function, in particular on $MTTF_{S0}$, and can be used for approximate expressions when assumption 13 applies (see Section 6.7.2 for an example). Assumption 6 must be met during the system design. If it is not satisfied, improvements given by redundancy are questionable (see Example 6.16 and Fig. 6.26) and at least fault recognition should be required and implemented. Assumptions 7 and 8 are satisfied if either assumption 2 or 3 holds. Assumption 7 is frequently used, its validity must be checked. Assumption 8 is rarely used (only with assumptions 2 or 3). Assumption 9 simplifies calculation and is useful for deriving approximate expressions, especially if assumption 13 holds. Together with assumption 3, the behavior of the system can be described by a semi-regenerative process (process with an embedded semi-Markov process). Assumption 3 alone can assure that the process is regenerative. With assumption 10, point availability can be computed using the reliability equation for the nonrepairable case (Eqs. (2.47) & (2.48)). This assumption rarely applies in practical applications. However, it allows a simple calculation of an upper bound for the point availability. Assumption 13 is generally met. It leads to approximate expressions, as illustrated in Section 6.7 or by using asymptotic expansions, see e.g. [6.19, A7.26]. As shown in Examples 6.7 - 6.9, the shape of the distribution function of the repair time has small influence on the results at system level ($PA_S$, $IR_S(\theta)$), if assumption 13 holds. Assumptions 14 and 15 simplify investigations. They are valid for all models discussed in Sections 6.2 - 6.7.
Investigation of large series-parallel structures or of complex structures is in general time-consuming and can become mathematically intractable. As a first step it is useful to operate with Markov models. Refinements can then be considered on a case-by-case basis.
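The reduction to algebraic equations mentioned for assumptions 1 and 2 can be sketched for an arbitrary diagram of transition rates. The code below assumes the usual form of these equations for a time-homogeneous Markov process: $\rho_i M_i = 1 + \sum_j \rho_{ij} M_j$ over the up states for the mean time to failure, and $\rho_j P_j = \sum_i P_i \rho_{ij}$ with one equation replaced by $\sum P_i = 1$ for the steady-state probabilities. The helper names and the 1-out-of-2 test model (one repair crew, states 0, 1, 2 with state 2 down) are illustrative and not taken from the text.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def mttf(rates, up, start):
    """Mean time to failure from `start`: rho_i M_i = 1 + sum_{j up} rho_ij M_j."""
    idx = {s: k for k, s in enumerate(up)}
    a = [[0.0] * len(up) for _ in up]
    for i in up:
        a[idx[i]][idx[i]] += sum(rates[i].values())  # rho_i, all exits from Z_i
        for j, r in rates[i].items():
            if j in idx:  # transitions to down states only contribute to rho_i
                a[idx[i]][idx[j]] -= r
    return solve(a, [1.0] * len(up))[idx[start]]


def steady_state(rates):
    """Steady-state probabilities: balance equations, last one replaced by sum = 1."""
    states = list(rates)
    idx = {s: k for k, s in enumerate(states)}
    n = len(states)
    a = [[0.0] * n for _ in range(n)]
    for i in states:
        for j, r in rates[i].items():
            a[idx[i]][idx[i]] += r  # outflow rho_i P_i
            a[idx[j]][idx[i]] -= r  # inflow P_i rho_ij into Z_j
    a[-1] = [1.0] * n
    return dict(zip(states, solve(a, [0.0] * (n - 1) + [1.0])))


def one_out_of_two(lam, mu):
    """Illustrative model: 1-out-of-2 active redundancy, one repair crew."""
    return {0: {1: 2 * lam}, 1: {0: mu, 2: lam}, 2: {1: mu}}


if __name__ == "__main__":
    model = one_out_of_two(1e-3, 0.5)
    p = steady_state(model)
    print("MTTF_S0 =", mttf(model, [0, 1], 0), "PA_S =", p[0] + p[1])
```

The same two functions can be applied to larger diagrams by listing the transition rates as a dictionary of dictionaries; only the set of up states has to be supplied for the mean time to failure.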
If the reliability block diagram does not exist, stochastic processes and tools introduced in Appendix A7 can be used to investigate reliability and availability of fault tolerant systems, on the basis of the diagram of transition rates or a time schedule. See Sections 6.8.3 - 6.8.7 for some examples on systems with imperfect switching, incomplete coverage, more than two states or one failure mode, reconfigurable structure, and common cause failures. A general procedure for the investigation of complex fault tolerant systems is given in Section 6.8.8. Alternative investigation methods (Petri nets, dynamic FTA, computer-aided analysis) are introduced in Section 6.9, and a Monte Carlo procedure, useful for rare events, is given.
6.8.2 Preventive Maintenance
Preventive maintenance is necessary to avoid wearout failures and to identify and repair latent or hidden failures, i.e. failures of redundant elements which cannot be recognized during normal operation. This section investigates a one-item repairable structure with preventive maintenance at $T_{PM}, 2T_{PM}, \ldots$. The results are basic for the investigation of more complex structures and will be useful in the following sections to investigate some aspects of fault tolerant repairable systems (Section 6.8.6.2). Further models / strategies for preventive maintenance or maintenance optimization are possible (Section 4.6).
Figure 6.21 Reliability functions for a one-item structure with preventive maintenance (of negligible duration) at times $T_{PM}, 2T_{PM}, \ldots$ for two distribution functions $F(t) = 1 - R(t)$ of the failure-free time (item new at $t = 0, T_{PM}, 2T_{PM}, \ldots$; left increasing and right constant failure rate)
The item considered is new at t = 0. Its failure-free time is distributed according to F(x) with density f(x), the repair time has distribution G(x) with density g(x). Preventive maintenance is of negligible time duration (e.g. specialized personnel is available) and restores the item to as-good-as-new. If a preventive maintenance is due at a time in which the item is under repair, one of the following cases will apply:
1. Preventive maintenance will not be performed (as included in the running repair, considering that after each repair the item is as-good-as-new).
2. Preventive maintenance is performed, i.e. a running repair is terminated with the preventive maintenance in a negligible time span (this maintenance strategy is also known as age replacement policy, see also Section 4.6).
Both situations can occur in practical applications. In case 2, the times $0, T_{PM}, 2T_{PM}, \ldots$ are renewal points. Case 2 will be considered in the following. The reliability function $R_{PM}(t)$ can be calculated from
with $R(x) = 1 - F(x)$, where F(x) is the distribution function of the failure-free time of the one-item structure considered. Figure 6.21 shows the shape of $R(t)$ and $R_{PM}(t)$ for an arbitrary F(x), and for $F(x) = 1 - e^{-\lambda x}$. Because of the memoryless property which characterizes the exponential distribution function,
$$R_{PM}(t) = R(t) = e^{-\lambda t} \quad \text{holds for} \quad F(x) = 1 - e^{-\lambda x}, \qquad (6.193)$$
independently of $T_{PM}$. From Eq. (6.192), the mean time to failure with preventive maintenance $MTTF_{PM}$ follows as
For $F(x) = 1 - e^{-\lambda x}$, Eq. (6.194) yields $MTTF_{PM} = 1/\lambda$ independently of $T_{PM}$. Determination of optimal preventive maintenance periods must consider Eq. (6.194) as well as cost and logistic support aspects (for $f(0) = 0$, $MTTF_{PM} \to \infty$ for $T_{PM} \to 0$). Example 6.13 shows a further practical application of preventive maintenance.
Figure 6.22 Point availability for a one-item structure with repair at every failure and preventive maintenance (of negligible duration) at times $T_{PM}, 2T_{PM}, \ldots$ (item new at $t = 0, T_{PM}, 2T_{PM}, \ldots$ and after each repair)
Calculation of the point availability is easy if preventive maintenance is performed at $T_{PM}, 2T_{PM}, \ldots$ (case 2 above) and leads to
with $PA_{S0}(t)$ from Eq. (6.17). Figure 6.22 shows $PA_{PM}(t)$. Contrary to $R_{PM}(t)$, $PA_{PM}(t)$ goes to 1 at 0 (item is new), $T_{PM}, 2T_{PM}, \ldots$, i.e. at each renewal point. If the time duration for the preventive maintenance is not negligible, it is useful to define, in addition to the availability introduced in Section 6.2.1, the overall (or operational) availability OA, defined for $t \to \infty$ as the ratio of the total up time in (0, t] to the sum of total up and down time in (0, t], i.e. to t. Defining MTTF = mean time to failure and MDT = mean down time (with MTTR = mean time to repair (restore), MTTPM = mean time to carry out preventive maintenance, MLD = mean logistic delay, and $T_{PM}$ = preventive maintenance period) it follows that (see p. 122)
$$OA = \frac{MTTF}{MTTF + MDT} = \frac{MTTF}{MTTF + MTTR + MLD + MTTPM\,(MTTF/T_{PM})}. \qquad (6.196)$$
For MLD = 0, the overall availability is often called technical availability. Other availability measures are possible, e.g. as in [6.11] for railway applications.
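The effect of the preventive maintenance period on $R_{PM}(t)$ and $MTTF_{PM}$, and Eq. (6.196), can be explored numerically. The sketch below assumes the standard renewal forms $R_{PM}(t) = R^n(T_{PM})\,R(t - nT_{PM})$ for $nT_{PM} \le t < (n+1)T_{PM}$ and $MTTF_{PM} = \int_0^{T_{PM}} R(x)\,dx / (1 - R(T_{PM}))$, which follow from the item being as-good-as-new at each $T_{PM}$ (these are taken here as what Eqs. (6.192) and (6.194) express). The Weibull example and all numerical values are illustrative.

```python
from math import exp


def integrate(f, a, b, n=20000):
    """Composite trapezoidal rule on [a, b] with n intervals."""
    h = (b - a) / n
    return h * (0.5 * f(a) + sum(f(a + k * h) for k in range(1, n)) + 0.5 * f(b))


def r_pm(r, t, t_pm):
    """Assumed R_PM(t) = R(T_PM)^n * R(t - n T_PM), item renewed at each T_PM."""
    n = int(t // t_pm)
    return r(t_pm) ** n * r(t - n * t_pm)


def mttf_pm(r, t_pm):
    """Assumed MTTF_PM = int_0^T_PM R(x) dx / (1 - R(T_PM))."""
    return integrate(r, 0.0, t_pm) / (1.0 - r(t_pm))


def overall_availability(mttf, mttr, mld, mttpm, t_pm):
    """OA as in Eq. (6.196)."""
    return mttf / (mttf + mttr + mld + mttpm * mttf / t_pm)


if __name__ == "__main__":
    lam, t_pm = 1e-3, 100.0  # illustrative values (1/h and h)
    r_exp = lambda t: exp(-lam * t)             # constant failure rate
    r_weibull = lambda t: exp(-(lam * t) ** 2)  # increasing failure rate
    print("exponential:", mttf_pm(r_exp, t_pm))  # no gain, stays at 1/lam
    print("Weibull:", mttf_pm(r_weibull, t_pm))  # large gain for T_PM << 1/lam
    print("OA:", overall_availability(1000.0, 2.0, 1.0, 0.5, t_pm))
```

With a constant failure rate the preventive maintenance brings no gain, in agreement with Eq. (6.193), while for the increasing failure rate the mean time to failure grows strongly as $T_{PM}$ is shortened.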
Example 6.13
Assume a nonrepairable (up to system failure) 1-out-of-2 active redundancy with two identical elements with constant failure rate $\lambda$. Give the mean time to failure $MTTF_{PM}$ by assuming a preventive maintenance with period $T_{PM} \ll 1/\lambda$. The preventive maintenance is performed in a negligible time span and restores the 1-out-of-2 active redundancy as-good-as-new.
Solution
For a nonrepairable (up to system failure) 1-out-of-2 active redundancy with two identical elements with constant failure rate $\lambda$, the reliability function is given by Eq. (2.22)
$$R(t) = 2e^{-\lambda t} - e^{-2\lambda t}.$$
The mean time to failure with preventive maintenance follows from Eq. (6.194) as
$$MTTF_{PM} = \frac{\int_0^{T_{PM}} (2e^{-\lambda x} - e^{-2\lambda x})\,dx}{1 - 2e^{-\lambda T_{PM}} + e^{-2\lambda T_{PM}}}.$$
Using $e^{-x} \approx 1 - x + x^2/2$ it follows that
$$MTTF_{PM} \approx \frac{T_{PM}}{\lambda^2 T_{PM}^2} = \frac{1}{\lambda^2 T_{PM}} \quad (= MTBF \cdot MTBF / T_{PM} \ \text{for} \ MTBF = 1/\lambda). \qquad (6.197)$$
Without preventive maintenance, Eq. (2.22) yields $MTTF = 3/(2\lambda)$. Equation (6.197) clearly shows the gain given by the preventive maintenance.
6.8.3 Imperfect Switching
In practical applications, switching is necessary for powering down failed elements and powering up repaired elements. In some cases it is sufficient to locate the switching element in series with the redundancy on the reliability block diagram, yielding series-parallel structures as investigated in Section 6.6. However, such an approach is often too simple to cover real situations. This section shows this on the basis of practical examples. Further considerations are given in Section 6.8.4 dealing with incomplete coverage. As a first example, Fig. 6.23 shows a situation in which measurement points $M_1$ and $M_2$, switches $S_1$ and $S_2$, as well as a control unit C must be considered. To simplify, let us consider only the reliability function in the nonrepairable case (up to system failure). From a reliability point of view, switch $S_i$, element $E_i$, and measurement point $M_i$ in Fig. 6.23 are in series (i = 1, 2). Let $\tau_{b1}$ and $\tau_{b2}$ be the corresponding failure-free times with distribution function $F_b(x)$ and density $f_b(x)$. $\tau_c$ is the failure-free time of the control device with distribution function $F_c(x)$ and density $f_c(x)$.
Figure 6.23 Functional block diagram for a 1-out-of-2 redundancy with switches $S_1$ and $S_2$, measurement points $M_1$ and $M_2$, and control device C
Consider first the case of standby redundancy and assume that at t = 0 element $E_1$ is switched on. A system failure in the interval (0, t] occurs with one of the following mutually exclusive events
$$\{\tau_c > \tau_{b1} \cap (\tau_{b1} + \tau_{b2}) \le t\} \quad \text{or} \quad \{\tau_c < \tau_{b1} \le t\}.$$
It is implicitly assumed here that a failure of the control device has no influence on the operating element, and does not lead to a commutation to E I . A verification of these conditions by a FMEA (Section 2.6) is necessary. With these assumptions, the reliability function Rso(t) of the system described by Fig. 6.23 is given by (nonrepairable case, system new at t = 0)
Assuming further $f_b(x) = \lambda_b e^{-\lambda_b x}$ and $f_c(x) = \lambda_c e^{-\lambda_c x}$, Eq. (6.198) yields
$\lambda_c \equiv 0$ leads to the results of Section 2.3.5 for the 1-out-of-2 standby redundancy.
Assuming now an active redundancy (at t = 0, $E_1$ is put into operation and $E_2$ into the reserve state), a system failure occurs in the interval (0, t] with one of the following mutually exclusive events
$$\{\tau_{b1} \le t \cap \tau_c > \tau_{b1} \cap \tau_{b2} \le t\} \quad \text{or} \quad \{\tau_c < \tau_{b1} \le t\}.$$
The reliability function is then given by (nonrepairable case, system new at t = 0)
$$R_{S0}(t) = 1 - \Big[F_b(t)\int_0^t f_b(x)\,(1 - F_c(x))\,dx + \int_0^t f_b(x)\,F_c(x)\,dx\Big]. \qquad (6.201)$$
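Equation (6.201) can be evaluated in closed form for exponentially distributed failure-free times, which makes the influence of the control device visible. The following sketch assumes $f_b(x) = \lambda_b e^{-\lambda_b x}$ and $f_c(x) = \lambda_c e^{-\lambda_c x}$; the expression $1/\lambda_b + 1/(2\lambda_b + \lambda_c)$ for the mean time to failure is derived here by integrating $R_{S0}(t)$ over $(0, \infty)$ and is checked numerically in the code, it is not quoted from the text. Function names and numerical values are illustrative.

```python
from math import exp


def r_s0_active(t, lam_b, lam_c):
    """R_S0(t) of Eq. (6.201) for exponential f_b and f_c."""
    f_b = 1.0 - exp(-lam_b * t)  # F_b(t)
    # a = int_0^t f_b(x) (1 - F_c(x)) dx
    a = lam_b / (lam_b + lam_c) * (1.0 - exp(-(lam_b + lam_c) * t))
    # b = int_0^t f_b(x) F_c(x) dx = F_b(t) - a
    b = f_b - a
    return 1.0 - (f_b * a + b)


def mttf_active(lam_b, lam_c):
    """Integral of R_S0(t) over (0, inf), derived under the above assumptions."""
    return 1.0 / lam_b + 1.0 / (2.0 * lam_b + lam_c)


def integrate(f, a, b, n):
    """Composite trapezoidal rule on [a, b] with n intervals."""
    h = (b - a) / n
    return h * (0.5 * f(a) + sum(f(a + k * h) for k in range(1, n)) + 0.5 * f(b))


if __name__ == "__main__":
    lam_b = 1e-3  # illustrative value in 1/h
    for lam_c in (0.0, 1e-4, 1e-3, 1e-1):
        numeric = integrate(lambda t: r_s0_active(t, lam_b, lam_c), 0.0, 40000.0, 40000)
        print(lam_c, mttf_active(lam_b, lam_c), numeric)
```

For $\lambda_c = 0$ the mean time to failure is $3/(2\lambda_b)$, as for an ideal 1-out-of-2 active redundancy, and for $\lambda_c \gg \lambda_b$ it approaches $1/\lambda_b$, i.e. the value without redundancy.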
From Eq. (6.201) and assuming $f_b(x) = \lambda_b e^{-\lambda_b x}$ and $f_c(x) = \lambda_c e^{-\lambda_c x}$ it follows that
and
$\lambda_c \equiv 0$ leads to the results of Section 2.2.6.3 for the 1-out-of-2 active redundancy. From Eqs. (6.200) and (6.203) one recognizes that
$$MTTF_{S0} \approx 1/\lambda_b \quad \text{for} \ \lambda_c \gg \lambda_b, \qquad (6.204)$$
for both standby and active redundancy, i.e. the same situation as without redundancy.
As a second example consider a 1-out-of-2 warm redundancy with constant failure rates $\lambda$, $\lambda_r$ and repair rate $\mu$. The switching element can fail with constant failure rate $\lambda_\sigma$ and failure mode stuck at the state occupied just before failure. At first, let us consider the case in which the failure of the switch can be immediately recognized and repaired with constant repair rate $\mu_\sigma$. Furthermore, assume only one repair crew, repair priority on the switch, and no further failure at system down. Asked are the mean time to system failure $MTTF_{S0}$ for system new (state $Z_0$) at t = 0 and the asymptotic & steady-state (stationary) point and average availability $PA_S = AA_S$. The involved process is a time-homogeneous Markov process. Figure 6.24 gives the diagrams of transition rates for reliability and availability calculation, respectively (down states hatched). From Fig. 6.24a and Table 6.2 or Eq. (A7.126) it follows that $MTTF_{S0}$ is given as solution of the following system ($M_i = MTTF_{Si}$)
yielding
$$MTTF_{S0} \approx \frac{\mu}{\lambda\,(\lambda + \lambda_r + \lambda_\sigma)}. \qquad (6.206)$$
The approximation assumes $\mu_\sigma = \mu$ and $\lambda, \lambda_r, \lambda_\sigma \ll \mu$. From this approximate expression it follows that the effect of imperfect switching with failure mode stuck at the state occupied just before failure, immediately recognized and repaired, is minor and becomes negligible (yielding results for ideal switch as in Tab. 6.6) for
$$\lambda_\sigma \ll \lambda + \lambda_r \quad (\text{for} \ \mu = \mu_\sigma \gg \lambda, \lambda_r, \lambda_\sigma). \qquad (6.207)$$
The case $\lambda_\sigma = 0$ implies $\mu_\sigma = 0$ and must be investigated using the exact expression for $MTTF_{S0}$, yielding $MTTF_{S0} = (\mu + 2\lambda + \lambda_r)/(\lambda(\lambda + \lambda_r))$ as per Table 6.6.
Figure 6.24 Diagram of transition rates for a repairable 1-out-of-2 warm redundancy (const. failure & repair rates $\lambda$, $\lambda_r$, $\mu$) with imperfect switching (failure rate $\lambda_\sigma$, repair rate $\mu_\sigma$, failure mode stuck at the state occupied), failure of the switch immediately recognized and repaired with repair priority, one repair crew; system down in $Z_2$, $Z_2'$, $Z_2''$; no further failure at system down; Markov process
From Fig. 6.24b and Table 6.2 or Eq. (A7.127) it follows that $PA_S = AA_S$ is given as solution of the following system of algebraic equations
One of the Eqs. (6.208), arbitrarily chosen, must be replaced by $\sum P_i = 1$. The asymptotic & steady-state point and average availability follows from
The approximation assumes $\mu_\sigma = \mu$ and $\lambda, \lambda_r, \lambda_\sigma \ll \mu$. Equation (6.209) allows the same conclusion as Eq. (6.206). $\lambda_\sigma = 0$ implies $\mu_\sigma = 0$ and yields results for ideal switch (Table 6.6).
Further models for imperfect switching are conceivable. For instance, by assuming that for the model of Fig. 6.24 a failure of the switch (with stuck at the state occupied just before failure and failure rate $\lambda_\sigma$) can only be recognized and repaired at system down, together with failed elements (one or both), at a repair rate $\mu_g$. This situation occurs e.g. in power systems (refuse to start). Figure 6.25 gives the corresponding diagrams of transition rates for reliability and availability calculation, respectively (down state hatched). Results are given in Example 6.14. A further possibility is to assume no connection as failure mode (Fig. 6.31) or a constant probability c that the switch will perform correctly when called to operate (Figs. 6.27, 6.28).
Figure 6.25 Diagram of transition rates (a) for reliability, b) for availability) for a repairable 1-out-of-2 warm redundancy (const. failure & repair rates $\lambda$, $\lambda_r$, $\mu$) with imperfect switching (failure rate $\lambda_\sigma$, failure mode stuck at the state occupied), failure of the switch recognized and repaired only at system down ($Z_2$) together with failed elements at a global repair rate $\mu_g$; no further failure at system down; Markov process; $\rho_{00'} = \lambda_\sigma$; $\rho_{01} = \lambda + \lambda_r$; $\rho_{10} = \mu$; $\rho_{12} = \rho_{0'2} = \lambda$; $\rho_0 = \lambda + \lambda_r + \lambda_\sigma$; $\rho_{0'} = \lambda$; $\rho_1 = \lambda + \mu$; $\rho_2 = \mu_g$
Example 6.14
Compute the mean time to system failure $MTTF_{S0}$ for system new (in $Z_0$) at t = 0 and the steady-state point and average availability $PA_S = AA_S$ of the 1-out-of-2 warm redundancy as per Fig. 6.25.
Solution
From Fig. 6.25a and Table 6.2, $MTTF_{S0}$ is given as solution of (with $M_i = MTTF_{Si}$)
$$\rho_0 M_0 = 1 + \lambda_\sigma M_{0'} + (\lambda + \lambda_r) M_1, \quad \rho_{0'} M_{0'} = 1, \quad \rho_1 M_1 = 1 + \mu M_0, \qquad (6.210)$$
yielding
Because of the not recognized failure of the switch, the condition on h, to yield results valid for ideal switching (Table 6.6) is more severe as Eq. (6.207) and is given by (see also Eq. (6.240))
From Fig. 6.25b and Table 6.2 or Eq. (A7.127) it follows that $PA_S = AA_S$ is given as solution of
$$\rho_0 P_0 = \mu P_1 + \mu_g P_2, \qquad \rho_{0'} P_{0'} = \lambda_\sigma P_0, \qquad \rho_1 P_1 = (\lambda + \lambda_r) P_0, \qquad \rho_2 P_2 = \lambda P_1 + \lambda P_{0'}. \qquad (6.213)$$
One of the Eq. (6.213), arbitrarily chosen, must be replaced by $P_0 + P_{0'} + P_1 + P_2 = 1$. The asymptotic & steady-state point and average availability follows from
Equation (6.214) allows the same conclusion for $\lambda_\sigma$ as for Eq. (6.211). If Eq. (6.212) is not satisfied, i.e. for $\mu \lambda_\sigma \gg \lambda(\lambda + \lambda_r)$, Eq. (6.211) yields $MTTF_{S0} \approx 1/\lambda_\sigma + 1/\lambda$ (nonrepairable 1-out-of-2 standby redundancy with $\lambda_\sigma$ and $\lambda$) and Eq. (6.214) $PA_S = AA_S \approx 1 - \lambda/\mu_g$ (one-item).
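The two linear systems of Example 6.14 are small enough to be checked numerically. The following is only a sketch under stated assumptions: the helper names `solve` and `example_6_14` and all numerical values are hypothetical, the equations are those of Eqs. (6.210) and (6.213) as read from Fig. 6.25, and the balance equation of $Z_0$ is the one replaced by the normalization condition.

```python
def solve(a, b):
    """Solve the linear system a x = b (Gauss-Jordan, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def example_6_14(lam, lam_r, lam_s, mu, mu_g):
    """Return (MTTF_S0, PA_S) for the model of Fig. 6.25."""
    rho0, rho0p, rho1, rho2 = lam + lam_r + lam_s, lam, lam + mu, mu_g
    # Eq. (6.210), unknowns (M0, M0', M1)
    m0 = solve([[rho0, -lam_s, -(lam + lam_r)],
                [0.0, rho0p, 0.0],
                [-mu, 0.0, rho1]], [1.0, 1.0, 1.0])[0]
    # Eq. (6.213), unknowns (P0, P0', P1, P2); first row is the normalization
    p = solve([[1.0, 1.0, 1.0, 1.0],
               [-lam_s, rho0p, 0.0, 0.0],
               [-(lam + lam_r), 0.0, rho1, 0.0],
               [0.0, -lam, -lam, rho2]], [1.0, 0.0, 0.0, 0.0])
    return m0, p[0] + p[1] + p[2]  # up states Z0, Z0', Z1


# Hypothetical rates in 1/h with mu * lam_s >> lam * (lam + lam_r)
lam, lam_r, lam_s, mu, mu_g = 1e-3, 1e-4, 1e-2, 1.0, 0.5
mttf, pa = example_6_14(lam, lam_r, lam_s, mu, mu_g)
print(mttf, 1 / lam_s + 1 / lam)  # numerical value vs. limiting value
print(pa, 1 - lam / mu_g)
```

With these values the switch fails much faster than the redundancy can be exploited, so the numerical results should lie close to the limiting expressions $1/\lambda_\sigma + 1/\lambda$ and $1 - \lambda/\mu_g$ quoted above.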
6.8 Systems with Complex Structure
6.8.4 Incomplete Coverage
Incomplete fault (failure) coverage occurs because of lack or failure in the diagnosis. Fault coverage is defined as the proportion of faults of an item that can be recognized under given conditions. A fault coverage greater than 0.9 is often required for complex equipment and systems. Lacks in the diagnosis lead to hidden or latent failures, i.e. failures (in the main system) which are not covered by diagnosis and can be recognized only during a repair or a preventive maintenance. Hidden or latent failures can cause serious reduction of the advantage offered by a redundancy (see for instance Eq. (6.223)). Failure modes of a diagnosis have to be investigated on a case-by-case basis. However, from a logical point of view, two basic failure modes are (Fig. 6.29): false alarm, and no alarm emitted (alarm defection).
Incomplete coverage acts on the switching operation and is often investigated as part of imperfect switching. Following an illustrative example, this section discusses some basic possibilities to consider incomplete coverage. Consider a 1-out-of-2 active redundancy with two different elements $E_1$ and $E_2$, and assume that failures of $E_1$ can be recognized only during the repair of $E_2$ or at a preventive maintenance (hidden failures in $E_1$). Elements $E_1$ and $E_2$ have constant failure rates $\lambda_1$ and $\lambda_2$, the repair time of $E_2$ is distributed according to an arbitrary function $G(x)$ with $G(0) = 0$ and density $g(x)$, and the repair of $E_1$ takes a negligible time (see Example 6.16 for constant repair rate). If no preventive maintenance is performed, Fig. 6.26a shows a possible time schedule of the system (new at $t = 0$), yielding for the reliability function
The Laplace transform of $R_{S0}(t)$ follows as
and the mean time to failure becomes
If preventive maintenance is performed at times $T_{PM}, 2T_{PM}, \ldots$ independently
a) Without preventive maintenance     b) With preventive maintenance (period $T_{PM}$)     (renewal points marked)
Figure 6.26 Possible time schedules for reliability calculation of a repairable 1-out-of-2 active redundancy with hidden failures in element $E_1$ (new at $t = 0$, repair times greatly exaggerated)
of the state of element $E_2$, and after each preventive maintenance (assumed of negligible duration as in Section 6.8.2) the entire system is as-good-as-new, then the times $0, T_{PM}, 2T_{PM}, \ldots$ are renewal points for the system. For the reliability function $R_{S0_{PM}}(t)$ it follows that (considering Eq. (6.192) and Fig. 6.26b)
with $R_{S0}(t)$ as per Eq. (6.215). The mean time to failure $MTTF_{S0_{PM}}$ follows as
The time $T_{PM}$ between two consecutive preventive maintenance operations can be optimized considering Eq. (6.217) or Eq. (6.219) as well as cost and logistic aspects ($MTTF_{S0_{PM}} \to \infty$ for $T_{PM} \to 0$). Examples 6.15 and 6.16 show practical applications of the above incomplete coverage model.
Example 6.15 Give approximate expressions for the mean time to failure $MTTF_{S0}$ given by Eq. (6.217).
Solution For $\tilde g(\lambda_1) \to 1$, it follows from Eq. (6.217) that
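A better approximation is obtained by considering $\tilde g(\lambda_1) \approx 1 - \lambda_1\, MTTR$,
$$MTTF_{S0} \approx \frac{\lambda_1 + \lambda_2 + \lambda_2^2\, MTTR}{\lambda_1 \lambda_2 \,(1 + \lambda_2\, MTTR)},$$
with $MTTR$ as mean time to repair element $E_2$ (Eq. (6.110)).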
Example 6.16 Investigate $R_{S0}(t)$ per Eq. (6.216) and $R_{S0_{PM}}(t)$ per Eq. (6.218) as well as $MTTF_{S0}$ per Eq. (6.217) and $MTTF_{S0_{PM}}$ per Eq. (6.219) for the case of constant repair rate $\mu$ ($g(x) = \mu e^{-\mu x}$).
Solution With $\tilde g(s + \lambda_1) = \mu / (s + \lambda_1 + \mu)$ it follows from Eq. (6.216) that
The mean time to failure $MTTF_{S0}$ follows from Eq. (6.222)
One recognizes that $\lambda_1 + \lambda_2 \ll \mu$ yields directly to
Equations (6.223) and (6.224) show that the repairable 1-out-of-2 active redundancy with hidden failures in one element behaves like a nonrepairable 1-out-of-2 standby redundancy. This result shows how important it is, in the presence of redundancy, to investigate failure recognition and failure modes. In the case of periodic preventive maintenance (period $T_{PM}$), Eq. (6.219) yields
The last part of Eq. (6.225) has been obtained using $e^{-\lambda x} \approx 1 - \lambda x + (\lambda x)^2 / 2$. The optimization of the time $T_{PM}$ between two consecutive preventive maintenance operations must consider Eq. (6.225), cost, and logistic aspects. Equation (6.222) follows also from Table 6.2 (Markov process with up states $Z_0$, $Z_1$ & $Z_2$, absorbing state $Z_3$, transition rates $\rho_{01} = \lambda_1$, $\rho_{02} = \lambda_2$, $\rho_{13} = \lambda_2$, $\rho_{20} = \mu$, $\rho_{23} = \lambda_1$, and $P_0(0) = 1$).
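The Markov formulation just mentioned lends itself to a quick numerical check of the statement that the hidden-failure redundancy behaves like a nonrepairable standby redundancy. The sketch below is illustrative only: the function names and the numerical values are hypothetical, and the closed form used for cross-checking is obtained here by eliminating $M_1$ and $M_2$ from the three equations of the Markov model (it is not quoted from the text).

```python
def solve(a, b):
    """Solve the linear system a x = b (Gauss-Jordan, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def mttf_hidden_failures(lam1, lam2, mu):
    """MTTF_S0 from rho_i M_i = 1 + sum_j rho_ij M_j over the up states.

    Z0: both elements up; Z1: E1 failed (hidden), E2 up; Z2: E2 in repair.
    """
    return solve([[lam1 + lam2, -lam1, -lam2],   # Z0 -> Z1 (lam1), Z0 -> Z2 (lam2)
                  [0.0, lam2, 0.0],              # Z1 -> Z3 (lam2), absorbing
                  [-mu, 0.0, lam1 + mu]],        # Z2 -> Z0 (mu), Z2 -> Z3 (lam1)
                 [1.0, 1.0, 1.0])[0]


def mttf_closed_form(lam1, lam2, mu):
    """Same quantity after eliminating M1 and M2 by hand (derived here)."""
    return ((lam1 + mu) * (lam1 + lam2) + lam2 ** 2) / (
        lam1 * lam2 * (lam1 + lam2 + mu))


# Hypothetical rates in 1/h with lam1 + lam2 << mu
lam1, lam2, mu = 1e-3, 2e-3, 0.5
mttf = mttf_hidden_failures(lam1, lam2, mu)
print(mttf, mttf_closed_form(lam1, lam2, mu), 1 / lam1 + 1 / lam2)
```

For $\lambda_1 + \lambda_2 \ll \mu$ the printed values should be close to $1/\lambda_1 + 1/\lambda_2$, the mean time to failure of a nonrepairable standby pair.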
Figure 6.27 State transition diagram for availability calculation of a repairable 1-out-of-2 active redundancy (const. failure & repair rates ($\lambda$, $\mu$), incomplete coverage (switch to reserve element with probability $c$), one repair crew); semi-Markov process; $Q_{21}(x) \equiv 0$ for reliability; see also Fig. 6.28
A further possibility to consider incomplete coverage is to assume that a failure will be recognized (only) with a probability $c$. This case is similar to that of imperfect switching mentioned at the end of Section 6.8.3 and is known in the literature [6.45 (2001)]. Figure 6.27 gives the state transition diagram of the corresponding semi-Markov process for availability calculation (down state hatched, $Q_{21}(x) \equiv 0$ for reliability calculation). The transition from state $Z_{1'}$ occurs instantaneously to $Z_1$ with probability $p_{1'1} = c$ or to $Z_2$ with probability $p_{1'2} = 1 - c$. Assuming constant failure and repair rates, the model of Fig. 6.27 can be investigated using a time-homogeneous Markov process with the diagram of transition rates given in Fig. 6.28 (also known in the literature of power systems as redundancy with no start at call [6.34]). Examples 6.17 and 6.18 investigate the models of Figs. 6.27 & 6.28, showing their equivalence.
Example 6.17 Give the mean time to system failure $MTTF_{S0}$ for system new (in $Z_0$) at $t = 0$ and the steady-state point and average availability $PA_S = AA_S$ of the 1-out-of-2 active redundancy as per Fig. 6.27.
Solution From Fig. 6.27 and Table 6.2 or Eq. (A7.173), $MTTF_{S0}$ is given as solution of (with $M_i = MTTF_{Si}$)
with $T_i = \int_0^\infty (1 - Q_i(x))\,dx$ and $Q_i(x) = \sum_j Q_{ij}(x)$ (Eqs. (A7.166) and (A7.165)). Considering Fig. 6.27 it follows that $T_0 = 1/(2\lambda)$, $T_{1'} = 0$, $T_1 = 1/(\lambda + \mu)$, $T_2 = 1/\mu$, yielding
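From Fig. 6.27 and Table 6.2 or Eq. (A7.178), $PA_S = AA_S = P_0 + P_{1'} + P_1$ is given as
$$PA_S = AA_S = \frac{\mathcal{P}_0 T_0 + \mathcal{P}_1 T_1}{\mathcal{P}_0 T_0 + \mathcal{P}_1 T_1 + \mathcal{P}_2 T_2}. \qquad (6.228)$$
Thereby, $\mathcal{P}_j$ is the state probability of the embedded Markov chain, obtained as solution of
$$\mathcal{P}_0 = \mathcal{P}_1\,\frac{\mu}{\lambda+\mu}, \qquad \mathcal{P}_{1'} = \mathcal{P}_0, \qquad \mathcal{P}_1 = \mathcal{P}_{1'}\,c + \mathcal{P}_2, \qquad \mathcal{P}_2 = \mathcal{P}_{1'}(1 - c) + \mathcal{P}_1\,\frac{\lambda}{\lambda+\mu} \qquad (6.229)$$
(Table 6.2 or Eq. (A7.175)), yielding (considering $\sum_j \mathcal{P}_j = 1$) $\mathcal{P}_0 = \mathcal{P}_{1'} = \mu / (2(\lambda + 2\mu) - \mu c)$, $\mathcal{P}_1 = (\lambda + \mu) / (2(\lambda + 2\mu) - \mu c)$, $\mathcal{P}_2 = (\lambda + \mu - \mu c) / (2(\lambda + 2\mu) - \mu c)$. From Eq. (6.228) it follows then
$$PA_S = AA_S = \frac{\mu^2 + 2\lambda\mu}{\mu^2 + 2\lambda\mu + 2\lambda(\lambda + \mu - \mu c)} \approx 1 - \frac{2\lambda(\lambda + \mu - \mu c)}{\mu^2 + 2\lambda\mu}. \qquad (6.230)$$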
$\rho_{12} = \lambda$; $\rho_{21} = \mu$ ($\rho_{21} = 0$ for reliability); $\rho_0 = 2\lambda$; $\rho_1 = \lambda + \mu$; $\rho_2 = \mu$ ($\rho_2 = 0$ for reliability)
Figure 6.28 Diagram of transition rates for availability calculation of a repairable 1-out-of-2 active redundancy (const. failure & repair rates ($\lambda$, $\mu$), incomplete coverage (switch to the reserve element with probability $c$), one repair crew); Markov process; $\rho_{21} \equiv 0$ for reliability; see also Fig. 6.27
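The three-state Markov model of Fig. 6.28 can be evaluated numerically and compared with Eq. (6.230). The sketch below is illustrative only: function names and numerical values are hypothetical, and the split of the total rate $\rho_0 = 2\lambda$ into $\rho_{01} = 2\lambda c$ and $\rho_{02} = 2\lambda(1 - c)$ is an assumption that follows from reading $c$ as the probability of a successful switch to the reserve element.

```python
def solve(a, b):
    """Solve the linear system a x = b (Gauss-Jordan, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def coverage_model(lam, mu, c):
    """Return (MTTF_S0, PA_S) for a 1-out-of-2 active redundancy with coverage c."""
    # Reliability: Z2 absorbing, unknowns (M0, M1)
    m0 = solve([[2 * lam, -2 * lam * c],
                [-mu, lam + mu]], [1.0, 1.0])[0]
    # Availability: unknowns (P0, P1, P2); first row is the normalization
    p = solve([[1.0, 1.0, 1.0],
               [2 * lam, -mu, 0.0],                  # balance of Z0
               [-2 * lam * (1 - c), -lam, mu]],      # balance of Z2
              [1.0, 0.0, 0.0])
    return m0, p[0] + p[1]


def pa_eq_6_230(lam, mu, c):
    """Closed form of Eq. (6.230)."""
    num = mu ** 2 + 2 * lam * mu
    return num / (num + 2 * lam * (lam + mu - mu * c))


lam, mu = 1e-3, 0.5  # hypothetical rates in 1/h
for c in (1.0, 0.9, 0.0):
    mttf, pa = coverage_model(lam, mu, c)
    print(c, mttf, pa, pa_eq_6_230(lam, mu, c))
```

For $c = 1$ the mean time to failure reduces to the familiar $(3\lambda + \mu)/(2\lambda^2)$ of an ideal active redundancy, and for $c = 0$ to $1/(2\lambda)$, which illustrates how quickly imperfect coverage erodes the benefit of the redundancy.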
Example 6.18 Give the mean time to system failure $MTTF_{S0}$ for system new (in $Z_0$) at $t = 0$ and the steady-state point and average availability $PA_S = AA_S$ of the 1-out-of-2 active redundancy as per Fig. 6.28.
Solution From Fig. 6.28 & Table 6.2 or Eq. (A7.126), $MTTF_{S0}$ is given as solution of (with $M_i = MTTF_{Si}$)
From Fig. 6.28 and Table 6.2 or Eq. (A7.127), $PA_S = AA_S = P_0 + P_1$ is given as solution of
One of the Eq. (6.233), arbitrarily chosen, must be replaced by $P_0 + P_1 + P_2 = 1$; the solution yields $P_0 = \mu^2 / (\mu^2 + 2\lambda\mu + 2\lambda(\lambda + \mu - \mu c))$ and $P_1 = 2\lambda\mu / (\mu^2 + 2\lambda\mu + 2\lambda(\lambda + \mu - \mu c))$, from which
Comparison of Eqs. (6.227) with (6.232) and (6.230) with (6.234) shows the equivalence of the models given by Figs. 6.27 and 6.28 (for constant failure and repair rates). For $c = 1$, Eqs. (6.232) and (6.234) yield results of Table 6.6 for active redundancy. For $c = 0$, Eqs. (6.232) and (6.234) yield results for a one-item with failure rate $2\lambda$ and repair rate $\mu$; this is the most unfavorable case, since at the first failure it is not possible to identify the failed element, leading to system down. Using results of Table 6.6 one recognizes that (for the model considered in Figs. 6.27 & 6.28) the effect of incomplete coverage is negligible for
Other possibilities to consider incomplete coverage are conceivable. Assuming for instance that the second element continues to operate after nonrecognition of a failure leads to the model considered in Fig. 6.25 with $\lambda_\sigma = 2\lambda(1 - c)$ & $\lambda + \lambda_r = 2\lambda c$, yielding for $c = 0$ to a nonrepairable 1-out-of-2 active redundancy. A more elaborated
$\rho_{20} = \rho_{30} = \rho_{40} = \rho_{50} = \mu$; $\rho_0 = (\lambda_C + \lambda_{NC}) + (\lambda_{DF} + \lambda_{DD}) = \lambda + \lambda_D$; $\rho_1 = \lambda_C + \lambda_{NC} = \lambda$; $\rho_2 = \rho_3 = \rho_4 = \rho_5 = \mu$
C = covered, NC = not covered, DF = false alarm, DD = alarm defection
Figure 6.29 Diagram of transition rates for availability calculation of a one-item structure with incomplete coverage and 2 failure modes for the diagnosis (constant failure and repair rates $\lambda_C$, $\lambda_{NC}$, $\lambda_{DF}$, $\lambda_{DD}$, $\mu$); Markov process; $\mu \equiv 0$ for reliability calculation
model which considers 2 failure modes for the diagnosis, false alarm ($\lambda_{DF}$) and alarm defection ($\lambda_{DD}$), has been proposed in [6.42]. Figure 6.29 shows this model by considering one repair rate ($\mu$) for all failure modes, yielding (Table 6.2)
$$MTTF_{S0} = \frac{\lambda + \lambda_{DD}}{\lambda(\lambda + \lambda_D)} \qquad\text{and}\qquad PA_S = AA_S = \frac{\mu\lambda + \mu\lambda_{DD}}{\mu\lambda + \mu\lambda_{DD} + \lambda(\lambda + \lambda_D)}.$$
$\lambda_D = \lambda_{DF} = \lambda_{DD} = 0$ leads to results of Section 6.2.1. A possible diagram of transition rates for a 1-out-of-2 active redundancy with 2 repair crews on the basis of Fig. 6.29 is Fig. 3 of [6.42]. A further example for a duplex system is Fig. 1 of [1.11].
6.8.5 Elements with more than two States or one Failure Mode
Elements with more than two states (good / failed for instance) or one failure mode (e.g. open or short) often arise in practical applications. Some considerations have been given in Sections 2.3.6 and 6.8.4. This section shows, on the basis of practical examples, that items with more than two states or one failure mode can be investigated operating directly with the diagram of transition rates. As a first example consider an item with the three states good, waiting for repair, repair [6.13]. Figure 6.30 shows this model. From Fig. 6.30 & Table 6.2 it holds that
$$MTTF_{S0} = \frac{1}{\lambda} \qquad\text{and}\qquad PA_S = AA_S = \frac{\mu\mu'}{\mu\mu' + \lambda(\mu + \mu')} \approx 1 - \lambda\left(\frac{1}{\mu} + \frac{1}{\mu'}\right). \qquad (6.237)$$
The item in Fig. 6.30 behaves like a one-item structure with failure rate $\lambda$ and repair rate $1/(1/\mu + 1/\mu')$. More complex structures can also be investigated, see e.g. [6.13].
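The asymptotic value in Eq. (6.237) can also be approached from the time-dependent side, by integrating the state equations of the three-state model until the point availability has settled. The following sketch uses a classical Runge-Kutta scheme in plain Python; the function name, step size, time horizon, and rates are hypothetical choices made for illustration.

```python
def point_availability(lam, mu, mu_p, t_end, dt=0.01):
    """Integrate the state equations of the 3-state item (item new at t = 0).

    Z0 = good, Z1 = waiting for repair, Z2 = in repair; returns P0(t_end),
    which is the point availability since Z0 is the only up state.
    """
    def deriv(p):
        p0, p1, p2 = p
        return (mu * p2 - lam * p0,      # Z0: entered from repair, left at failure
                lam * p0 - mu_p * p1,    # Z1: failure recognition / isolation
                mu_p * p1 - mu * p2)     # Z2: repair

    p = (1.0, 0.0, 0.0)
    for _ in range(int(round(t_end / dt))):
        k1 = deriv(p)
        k2 = deriv(tuple(x + 0.5 * dt * k for x, k in zip(p, k1)))
        k3 = deriv(tuple(x + 0.5 * dt * k for x, k in zip(p, k2)))
        k4 = deriv(tuple(x + dt * k for x, k in zip(p, k3)))
        p = tuple(x + dt / 6.0 * (a + 2 * b + 2 * c + d)
                  for x, a, b, c, d in zip(p, k1, k2, k3, k4))
    return p[0]


lam, mu, mu_p = 1e-3, 0.5, 2.0  # hypothetical rates in 1/h
pa_numeric = point_availability(lam, mu, mu_p, t_end=100.0)
pa_closed = mu * mu_p / (mu * mu_p + lam * (mu + mu_p))
pa_approx = 1 - lam * (1 / mu + 1 / mu_p)
print(pa_numeric, pa_closed, pa_approx)
```

With repair and recognition rates of this order the transient dies out within a few hours, so the value at $t = 100$ h should be indistinguishable from the closed form, and the first-order approximation should differ only in the sixth decimal place.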
$\lambda$ = failure rate, $\mu$ = repair rate, $\mu'$ = failure recognition / isolation rate
Figure 6.30 Diagram of transition rates for availability calculation of a one-item with 3 states good, waiting for repair, repair; constant failure, failure recognition & repair rates ($\lambda$, $\mu'$, $\mu$); Markov process
Figure 6.31 Diagram of transition rates for reliability calculation of a repairable 1-out-of-2 warm redundancy (const. failure and repair rates $\lambda$, $\lambda_r$, $\mu$); switch with failure modes stuck at the state occupied (const. failure & repair rates $\lambda_\sigma$, $\mu_\sigma$) and no connection (const. failure rate $\lambda_{\sigma 0}$), failure of the switch immediately recognized and repaired with repair priority, one repair crew; Markov process
As a second example consider a 1-out-of-2 warm redundancy with constant failure rates $\lambda$, $\lambda_r$ and repair rate $\mu$. The switching element can fail with constant failure rate $\lambda_\sigma$ for failure mode stuck at the state occupied just before failure or $\lambda_{\sigma 0}$ for failure mode no connection. Failure of the switch can be immediately recognized and repaired with constant repair rate $\mu_\sigma$ or $\mu_{\sigma 0}$. Furthermore, assume only one repair crew, repair priority on the switch, and no further failure at system down (also for the switch, no further failure is possible after a failure with one of the two possible failure modes). Asked is the mean time to system failure $MTTF_{S0}$ for system new (in state $Z_0$) at $t = 0$. The involved process is a time-homogeneous Markov process. Figure 6.31 gives the diagram of transition rates for reliability calculation (extension for availability is as in Fig. 6.24). From Fig. 6.31 and Table 6.2 it follows that $MTTF_{S0}$ is given as solution of the following system of algebraic equations ($M_i = MTTF_{Si}$)
The system given by Eq. (6.238) is identical to that of Eq. (6.205), the only difference being in $\rho_0$ and $\rho_1$, to which $\lambda_{\sigma 0}$ has been added (compare Fig. 6.24a with Fig. 6.31). $MTTF_{S0}$ is thus given by Eq. (6.206) with $\rho_0$ and $\rho_1$ as per Fig. 6.31, yielding for $\mu_\sigma = \mu$ and $\lambda, \lambda_r, \lambda_\sigma \ll \mu$ the approximate expression
and $MTTF_{S0} \approx 1/\lambda_{\sigma 0}$ for $\lambda_{\sigma 0}$ large. The failure mode no connection ($\lambda_{\sigma 0}$) is more disturbing than the failure mode stuck at the state occupied just before failure ($\lambda_\sigma$), see Eq. (6.207), and the effect of imperfect switching is negligible (Table 6.6) only for
$$\lambda_{\sigma 0} \ll \lambda(\lambda + \lambda_r)/\mu \qquad\text{and}\qquad \lambda_\sigma \ll \lambda + \lambda_r \qquad (\mu = \mu_\sigma \gg \lambda, \lambda_r, \lambda_\sigma, \lambda_{\sigma 0}). \qquad (6.240)$$
The condition given by Eq. (6.240) is similar to that given by Eq. (6.212). Further models for systems with more than two states or one failure mode are conceivable.
6.8.6 Fault Tolerant Reconfigurable Systems
Fault tolerant structures are able to recognize and isolate failures / faults and reconfigure themselves to continue operation with minimum loss of performance and / or safety (graceful degradation). Such a characteristic must be built in during design & development. Typical examples of fault tolerant systems are safety circuits as well as power and telecommunication networks. Following a short discussion on ideal reconfiguration, this section deals with reconfiguration occurring at given fixed times or at failure by considering also non ideal conditions, for instance imperfect switching in Section 6.8.6.3. Investigation is based on tools introduced in Appendix A7 and summarized in Table 6.2. Constant failure and repair rates are assumed, leading to time-homogeneous Markov processes. Procedures are illustrated on a case-by-case basis using diagrams of transition rates.
6.8.6.1 Ideal Case
Each redundant structure belongs to a fault tolerant reconfigurable structure and must be validated for this purpose during design & development, for instance with a FMEA (Section 2.6). For the redundant structures investigated in Sections 2.2, 2.3, 6.4 - 6.7 and Appendix A7, independent elements (p. 52), ideal fault coverage, ideal switching, and no reduction of system performance at failure of a redundant element were assumed. Because of these assumptions, investigations often lead to series parallel structures (Sections 6.6 and 6.7). Imperfect switching, incomplete coverage, and items with more than two states or one failure mode have been considered in Sections 6.8.3 - 6.8.5. In Sections 6.8.6.2 and 6.8.6.3, time and failure censored reconfiguration is investigated. Section 6.8.6.4 considers reward and frequency / duration aspects. In addition, Section 6.8.7 deals with common cause failures, Section 6.8.8 gives a general procedure for complex repairable systems, and Section 6.9 presents alternative investigation methods for complex systems.
6.8.6.2 Time Censored Reconfiguration (Phased-Mission Systems)
In some practical applications, systems are used for different required functions. If each required function can be considered separately from the others, investigation is performed by considering a reliability block diagram (if it exists) for each required function. Otherwise, if mission phases follow each other, investigation must consider the system reconfiguration at the end of each phase and one calls this a phased-mission system. Investigation of phased-mission systems can be more complicated than stated in some literature [2.7, 2.17, 6.2, 6.7, 6.24, 6.33, 6.41], dealing with binary state assignment (limited to totally independent elements (p. 52)), considering time dependent failure or repair rates (breaking the Markov property),
using semi-Markov processes (of limited validity), or missing Assumption 4 below (important when transferring state probabilities at the end of phase $k$ to initial probabilities for phase $k + 1$). A lower bound $R_{S0_l}$ for the mission reliability $R_{S0}$ is obtained by connecting the reliability block diagrams for each phase in series for the whole mission duration (Example 2.5). An upper bound for $R_{S0}$ is given by the smallest of the reliabilities for each phase taken separately, assuming that all elements involved are as-good-as-new at the begin of the phase considered; thus, $k = 1, \ldots, n$ (for $n$ phases).
Examples 6.19 - 6.21 illustrate some general considerations and Example 6.22 gives a numerical application of Eq. (6.241). For availability, Eq. (6.246) applies. The following practice oriented procedure (Point (ii) below) for reliability and availability analysis of repairable phased-mission systems allows, in particular, consideration of standby redundancy and arbitrary repair strategy.
(i) General assumptions:
1. Failure and repair rates ($\lambda_i$ and $\mu_i$) of all elements are constant during the sojourn time in any state within each phase, but can change (stepwise) at a state (or phase) change because of change in configuration, component use, stress, repair strategy or other; for all elements it holds that $\lambda_i \ll \mu_i$.
2. At the begin of the mission all elements are as-good-as-new.
3. Phase durations $T_1, \ldots, T_n$ are given (fixed) values, each of them so large that asymptotic & steady-state values for availability can be assumed for every phase ($T_1, \ldots, T_n \gg 1/\mu_i$ for all elements, see Section 6.2.5 and Table 6.6).
4. For availability investigation, not used elements in a phase are either as-good-as-new and put in standby (failure rate $\lambda = 0$) at begin of the phase or repaired (Assumption 3) and then put in standby (repair priority on elements used); for reliability investigation, down states at system level are absorbing states and the above rule holds for elements which have not caused system down.
5. System has only one repair crew and no further failures can occur at system down; system down is an absorbing state for reliability; for availability, the system is restored to an operating state according to a given repair strategy.
6. Fault coverage, switch, and logistic support are ideal.
7. For each phase, a reliability block diagram exists.
Example 6.19 A one-item is used in a mission with phase 1 (duration $T_1$, const. failure rate $\lambda_1$), followed by phase 2 (duration $T_2$, const. failure rate $\lambda_2$). Compute the reliability function for item new at $t = 0$.
Solution For the reliability function of the whole mission it holds that ($T_1$, $T_2$ given (fixed))
$$R_S = \Pr\{\text{phase 1 failure free} \cap \text{phase 2 failure free}\} = \Pr\{\text{phase 1 failure free}\} \cdot \Pr\{\text{phase 2 failure free} \mid \text{phase 1 failure free}\} = e^{-\lambda_1 T_1} \cdot e^{-\lambda_2 T_2} = e^{-(\lambda_1 T_1 + \lambda_2 T_2)}. \qquad (6.242)$$
The product rule in Eq. (6.242) holds only because of constant failure rates (see also Eq. (6.27)).
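The remark on the product rule can be made concrete with a few lines of code. The sketch below first multiplies the per-phase survival probabilities as in Eq. (6.242) and then repeats the exercise for an item with increasing failure rate, where restarting the clock at the phase change gives a wrong (too optimistic) result. The Weibull reliability function $e^{-(\lambda t)^\beta}$ and all numerical values are assumptions introduced purely for illustration.

```python
import math


def mission_reliability(phases):
    """Product rule of Eq. (6.242); phases is a list of (failure rate, duration)."""
    r = 1.0
    for lam, duration in phases:
        r *= math.exp(-lam * duration)  # survival probability within the phase
    return r


def weibull_reliability(t, lam, beta):
    """R(t) = exp(-(lam t)^beta); beta = 1 gives a constant failure rate."""
    return math.exp(-((lam * t) ** beta))


# Constant failure rates (hypothetical values, rates in 1/h, durations in h)
r_const = mission_reliability([(1e-3, 100.0), (2e-3, 200.0)])
print(r_const, math.exp(-(1e-3 * 100.0 + 2e-3 * 200.0)))

# Increasing failure rate (beta = 2), same item used in both phases
lam, beta, t1, t2 = 1e-3, 2.0, 500.0, 500.0
r_exact = weibull_reliability(t1 + t2, lam, beta)  # one clock for the mission
r_naive = (weibull_reliability(t1, lam, beta)
           * weibull_reliability(t2, lam, beta))   # clock restarted in phase 2
print(r_exact, r_naive)
```

In the second case the conditional survival probability of phase 2 is $R(T_1 + T_2)/R(T_1)$ rather than $R(T_2)$, which is why the naive product overestimates the mission reliability.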
Figure 6.32 Diagrams of transition rates for a one-item used in a mission with phase 1 (duration $T_1$, const. failure rate $\lambda_1$), followed by phase 2 (duration $T_2$, const. failure rate $\lambda_2$); Markov process
Example 6.20 Show that Eq. (6.242) can be obtained using a Markov approach, i.e. working with two separate transition rate diagrams for phase 1 and for phase 2, and setting final state probabilities from phase 1 as initial state probabilities for phase 2.
Solution Figure 6.32 gives the diagrams of transition rates for phase 1 and 2 (separately). For phase 1, the state probability $P'_{1,0}(t)$ follows from $\dot{P}'_{1,0}(t) = -\lambda_1 P'_{1,0}(t)$ (Table 6.2, Eq. (A7.115)), yielding $P'_{1,0}(t) = e^{-\lambda_1 t}$ for $P'_{1,0}(0) = 1$. Thus, $R_{S0}(T_1) = P'_{1,0}(T_1) = e^{-\lambda_1 T_1}$ and $P'_{1,1}(T_1) = 1 - e^{-\lambda_1 T_1}$. $P'_{1,1}(t)$ follows from $P'_{1,0}(t) + P'_{1,1}(t) = 1$ or by solving (Fig. 6.32 and Table 6.2) $\dot{P}'_{1,1}(t) = \lambda_1 P'_{1,0}(t)$ with $P'_{1,1}(0) = 0$. Similarly, for phase 2 with $t$ starting at $t = T_1$,
Example 6.21 A one-item system with reliability function $R_S(t)$ is used for a mission of random duration $\tau_W > 0$ distributed according to $F_W(t) = \Pr\{\tau_W \le t\}$ with $F_W(0) = 0$ and density $f_W(t)$. Give the reliability, first for the general case and then by assuming constant failure rate $\lambda$ and exponentially distributed mission duration ($f_W(t) = \delta e^{-\delta t}$).
Solution As mission duration can take any value in $(0, \infty)$, reliability takes a constant value given by
$$R_S = \int_0^\infty f_W(t)\, R_S(t)\, dt \qquad (6.244)$$
(see also Eq. (2.76)). For $f_W(t) = \delta e^{-\delta t}$ and constant failure rate $\lambda$, Eq. (6.244) yields
$$R_S = \frac{\delta}{\delta + \lambda}. \qquad (6.245)$$
Supplementary results: In practical applications, mission duration is limited to $T_W > 0$ and $\tau_W$ is a truncated random variable with $\Pr\{\tau_W = T_W\} = 1 - F_W(T_W - 0)$; for this case, Eq. (6.245) becomes $R_S = \delta / (\delta + \lambda) + e^{-(\delta + \lambda) T_W}\, \lambda / (\delta + \lambda)$.
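The truncated case of Example 6.21 is easy to verify by direct numerical integration. The sketch below evaluates the integral of $f_W(t) R_S(t)$ up to $T_W$ with a midpoint rule and adds the probability mass concentrated at $T_W$; the function name, the number of integration steps, and the numerical values are hypothetical.

```python
import math


def mission_reliability(rel, pdf, cdf, t_max, n=20000):
    """Reliability for a random mission duration truncated at t_max.

    Midpoint rule for int_0^t_max f_W(t) R_S(t) dt, plus the mass
    1 - F_W(t_max) of missions that run for the full time t_max.
    """
    h = t_max / n
    integral = h * sum(pdf((i + 0.5) * h) * rel((i + 0.5) * h) for i in range(n))
    return integral + (1.0 - cdf(t_max)) * rel(t_max)


# Hypothetical values: failure rate, mission-duration rate (1/h), time limit (h)
lam, delta, t_w = 1e-3, 1e-2, 200.0
r_numeric = mission_reliability(
    rel=lambda t: math.exp(-lam * t),
    pdf=lambda t: delta * math.exp(-delta * t),
    cdf=lambda t: 1.0 - math.exp(-delta * t),
    t_max=t_w,
)
r_closed = (delta / (delta + lam)
            + math.exp(-(delta + lam) * t_w) * lam / (delta + lam))
print(r_numeric, r_closed, delta / (delta + lam))
```

The last printed value is the untruncated result $\delta/(\delta + \lambda)$, which the truncated expression approaches from above as $T_W$ grows.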
(ii) Procedure for reliability & availability computation of repairable phased-mission systems with fixed phase durations $T_1, \ldots, T_n$, satisfying the general assumptions (i):
1. Group series elements used in all phases (power supply, cooling, etc.) in one element and put this as a series element in final results (Table 6.10, 2nd row).
2. Draw the diagram of transition rates for reliability evaluation, separately for each phase ($1, \ldots, n$), beginning by phase 1 with $Z_{1,0}$ (1 referring to phase 1 and 0 being the state in which all elements are as-good-as-new); down states at system level are absorbing states; use the same state numbering for the same state appearing in successive phases; however, state $Z_{k,j}$ corresponding to a state $Z_{i,j}$ in a phase $i$ preceding phase $k$ can also contain as-good-as-new elements appearing in phase $k$ but not in a previous phase, or standby elements (not used in phase $k$) with failure rate $\lambda = 0$; for $k > 1$, state $Z_{k,0}$ contains all as-good-as-new elements used in phase $k$ and (as necessary) elements not used in phase $k$ which are standby with failure rate $\lambda = 0$; verify correctness of all transition rates diagrams (as-good-as-new is the same as operating or ready to operate, because of the assumed constant failure rate).
3. For availability investigation, use results of Table 6.10 (or extend diagrams of transition rates, allowing a return to an operating state after system down according to a given repair strategy) to compute the asymptotic & steady-state availability for each phase separately ($PA_{k,S} = AA_{k,S}$ for phase $k$), taking care of elements which are not used in the phase considered and can act as standby redundancy ($\lambda = 0$) for working elements; for the whole mission it holds then
$$PA_S = AA_S \ge \min_k\,(PA_{k,S} = AA_{k,S}), \qquad k = 1, \ldots, n \ \text{(for } n \text{ phases)}. \qquad (6.246)$$
4. For reliability investigation, compute the reliability function $R_{1,0}(T_1)$ at the end of phase 1 starting in state $Z_{1,0}$ at $t = 0$ in the same way as for a one mission system (Table 6.2), as well as the state probabilities $P'_{1,j}(T_1)$ for all up states $Z_{1,j}$; if $Z_{1,j}$ (possibly with further as-good-as-new elements used in phase 2) is an up state in phase 2, $P'_{1,j}(T_1)$ becomes the probability $P'_{2,j}(0)$ to start phase 2 in $Z_{2,j}$; if $Z_{1,j}$ is a down state in phase 2, $P'_{1,j}(T_1)$ adds to the initial probability of starting phase 2 in the down state; if $Z_{1,j}$ does not appear in phase 2, $P'_{1,j}(T_1)$ adds to the initial probability in state $Z_{2,0}$ to give $P'_{2,0}(0)$ (from rule 2 above and verifying that for each phase the sum of all state probabilities is 1); reliability calculation must take care of elements which are not used in the phase considered and can act as standby redundancy ($\lambda = 0$) for working elements; continuing in this way, the following equation can be found for the mission reliability $R_{S0}$ starting phase 1 in $Z_{1,0}$
$$R_{S0} = \sum_{Z_{n,j} \in U_n} P'_{n,j}(T_n), \qquad U_n = \text{set of up states in phase } n. \qquad (6.247)$$
To simplify the notation used in Example 6.20, the variable $x$ starting by $x = 0$ at the begin of each phase is used in Rule 4 instead of $t$ (starting by $t = 0$ with phase 1).
Figure 6.33 Reliability block diagrams and diagrams of transition rates for reliability calculation of a phased-mission system with 3 phases (the diagram of transition rates for phase 2 takes care that one element $E_2$ is put in standby with $\lambda_2 \equiv 0$ as soon as available from phase 1); dashed lines indicate to which states the final state probabilities of phase 1 and phase 2 are transferred as initial probabilities for phase 2 and phase 3, resp.; constant failure and repair rates ($\lambda$, $\mu$); one repair crew; Markov process
As an example let us consider the phased-mission system with 3 phases of given (fixed) duration $T_1$, $T_2$ and $T_3$, described by the 3 reliability block diagrams and the corresponding diagrams of transition rates for reliability investigation given in Fig. 6.33. The diagram of transition rates for phase 2 considers that in phase 2 only one element $E_2$ is used and assumes that the second element $E_2$ is put in standby redundancy with failure rate $\lambda_2 = 0$ (either from state $Z_{1,0}$ or as soon as repaired if from state $Z_{1,2}$). Dashed lines indicate to which states the final state probabilities at time $T_1$ for phase 1 and $T_2$ ($T_1 + T_2$ with respect to time $t$) for phase 2 are transferred as initial probabilities for the successive phase. Let us first consider the asymptotic & steady-state mission availability $PA_S = AA_S$. From Tables 6.10 and 6.6, it follows for the 3 phases (taken separately) that
The 2nd equation considers that in phase 2 one of the elements $E_2$ acts as standby redundancy with failure rate $\lambda_2 = 0$, combining thus results from Table 6.6 ($1 - (\lambda_2/\mu_2)^2$) and Table 6.10 (2nd row). Equation (6.246) yields then
For the mission reliability $R_{S0}$, starting in state $Z_{1,0}$ (all elements are as-good-as-new) at $t = 0$, the diagrams of transition rates of Fig. 6.33 yield for phases 1, 2, 3 the following coupled system of differential equations for state probabilities (Table 6.2)
In Eq. (6.250), $P'_{i,j}$ is used instead of $P'_{i,j}(x)$. From Eq. (6.247) it follows then
Analytical solution of the system given by Eq. (6.250) is possible, but time consuming. Numerical solution can be quickly obtained (Example 6.22). A lower bound $R_{S0_l}$ for the mission reliability $R_{S0}$ is obtained by connecting the reliability block diagrams for each phase in series. For Fig. 6.33, this corresponds (practically) to considering phase 3 for a time span $T_1 + T_2 + T_3$ (in phase 2, for element $E_2$ a second element $E_2$ is available in standby redundancy). A good approximation for $R_{S0_l}$ is
Example 6.22 Give the numerical solution of Eqs. (6.250) and (6.25 1 ) for h l = 1 0 - ~h-' , h 2 = 1 0 - ~h-' , h3 = 1 0 - ~h-', =p2 =p3 = 0.5 h-' , = 168 h , Tz = 336 h, and T3 = 672 h. Solution Numencal solution of the 3 coupled Systems of differential equations given by Eq. (6.250) yields P;,, (T3)= 0.598655, (T3)=0.023493, (T3)=0.002388, P;,, (T3 ) = 0.000092, P& (T3)= 0.000094, P;,' (with 6 digits because OE P ; , ~(?)arid
Rso = I - P ; , ~ ( ? ) = 0.625.
(Ti )).
(Ti )= 0.375278
(6.252)
RSOfollows then from Eq. (6.251)
(6.253)
Supplementary results: Computing lower and upper bound for $R_{S0}$ as per Eqs. (6.241) and (6.254) yields for the above numerical example $0.55 \le R_{S0} \le 0.71$.
quickly obtained by computing $MTTF_{S0}$ using Table 6.10 and setting this in $R_{S0_l} \approx e^{-(T_1 + T_2 + T_3)/MTTF_{S0}}$; from this, $MTTF_{S0} \approx 1 / (\lambda_1 + 2\lambda_2^2/\mu_2 + 2\lambda_3^2/\mu_3)$ and
Eq. (6.241) allows computation of an upper bound for R S O (Example 6.22). If the second element E2 were not available in phase 2 as standby redundancy, P A 2 , S = A A 2 , ~ = 1 - h 2 / p 2 and, from Eq. (6.249), P A s = A A s = l - h 2 / p 2 , since h l / p l C h2/ p2 can be assumed when considering the reliability block diagram for phase 1. Assuming furthermore that the second element E2 would be repaired before the end of phase 2, if in a failed state at the end of phase 1 ( Z 1 , 2 ) ,the diagram of transition rates for phase 2 would be equal to that for phase 1, with +&, h2+h3 P ~ - + c L arid ~ , Z1,o + Z2,0, Zl,l+ Z2,1, Z1,2 Z2,3 with 9
The corresponding initial probabilities for phase 3 would be
If an element E„, where common to all 3 phases in Fig. 6.33 (i.e. in series with all3 reliability block diagrams), Table 6.10 (2nd row) can be used to find
(considering Eq. (6.249)) and, with $R_{S0}$ from Eq. (6.251),
The above procedure can be extended to consider more than one repair crew at system level or any kind of repair (restore) strategy. Other procedures (models) are conceivable. For instance, for nonrepairable systems (up to system failure) of complex structure and with independent elements (parallel redundancy), it can be useful to number the states using binary considerations. For randomly distributed phase duration, Eq. (6.246) can be used for availability. Reliability can be obtained by expanding results in Examples 6.19 - 6.21. An alternative approach for phased-mission systems is to assume that at the begin of each mission phase, the system is as-good-as-new with respect to the elements used in the mission phase considered (required elements are repaired in a negligible time at the begin of the mission phase, if they are in a failed state, and not required elements can be repaired during a phase in which they are not used). This assumption can be reasonable for some repairable systems and greatly simplifies investigation. For this case, results developed in Section 6.8.2 for preventive maintenance lead to (for phases 1, 2, ...)
for the reliability function, and
for the point availability. $S_i$ is the state from which the $i$th mission phase starts; $0, T_1^*, T_2^*, \ldots$ are the time points on the time axis at which the mission phases 1, 2, 3, ... begin (the mission duration of phase $i$ being here $T_i^* - T_{i-1}^*$, with $T_0^* = 0$).
6.8.6.3 Failure Censored Reconfiguration In most applications, reconfiguration occurs at the failure of a redundant element. Besides cases with ideal fault coverage, ideal switching, and no system performance reduction at failure (Sections 2.2, 2.3, and 6.4-6.7), more complex structures often arise in practical applications. Such structures must be investigated on a case-bycase basis. A FMEAI FMECA (Section 2.6) is mandatory to validate investigations. Often it is necessary to consider that after a reconfiguration, the system performance is reduced, i.e. reward und frequency/duration aspects have to be involved in the analysis. A reasonably simple and comprehensive example is a power system substation. Figure 6.34 gives the functional block diagram and the diagram of transition rates for availability calculation (pg= 0 for reliability investigation). ZI2is the down state. The substation is powered by a reliable network and consists of: Two branch designated by A l & A2 und capable of performing 100% load, each with HV switch, HV circuit breaker and control elements, transformer, measurement & control elements, and LV switch. Two busbars designated by Cl & C2 und capable of performing 100% load (failure rate basically given by double contingency of faults on control elements). A coupler between the busbars, designated by B und capable of pegorming 100% load (failure modes stuck at the state occupied just before failure(does not open), failure rate ABO,and no connection (does not close), failure rate ABO). Load is distributed between Cl and C2 at 50% rate each. The diagram of transition rates is based on an extensive FMEAJFMECA [6.20 (2002)l showing in particular the key position of the coupler B in the reconfiguration strategy. Coupler B is normally open. A failure of B is recognized only at a failure of A or C. From state Zo, B can fail only with failure mode no connection, from Z1 or 3 only with failure mode
Figure 6.34 Functional block diagram and diagram of transition rates for availability calculation of a power system substation; active redundancy, constant failure and repair rates ($\lambda_{A1}, \lambda_{A2}, \lambda_{B0}, \lambda_{B\infty}, \lambda_{C1}, \lambda_{C2}, \mu_A, \mu_C, \mu_B$); imperfect switching of $B$ with failure modes does not open ($\lambda_{B0}$, from $Z_1$ and $Z_2$) or no connection ($\lambda_{B\infty}$, from $Z_0$), failure of $B$ recognized only at failure of $A$ or $C$; one repair crew, priority on $C$, no further failure at system down; Markov process; $\mu_B = 0$ for reliability
stuck at the state occupied just before failure. Constant failure rates $\lambda_{A1}, \lambda_{A2}, \lambda_{B0}, \lambda_{B\infty}, \lambda_{C1}, \lambda_{C2}$ and constant repair rates $\mu_A, \mu_C, \mu_B$ are assumed. $\mu_A$ and $\mu_C$ remain the same also if a repair of $B$ is necessary, $\mu_B$ is larger than $\mu_A$ and $\mu_C$. From the down state ($Z_{12}$) the system returns to state $Z_0$. Furthermore, only one repair crew, repair priority on $C$ (followed by $C+B$, $A$, $A+B$), and no further failure at system down (50% load is an up state with reduced performance, see Section 6.8.6.4) are assumed. Sought are the mean time to system failure $MTTF_{S0}$ for system new (in state $Z_0$) at $t = 0$ and the asymptotic & steady-state point and average availability $PA_S = AA_S$. The involved process is a time-homogeneous Markov process. If results are required for 100% load, $Z_6$ - $Z_{11}$ are down states (see Section 6.8.6.4 for reward considerations). To simplify investigation, equations use $\lambda_{A1} = \lambda_{A2} = \lambda_A$ and $\lambda_{C1} = \lambda_{C2} = \lambda_C$. To increase readability, the number of states in Fig. 6.34 has been reduced according to Point 2 of Section 6.8.8 (p. 264). From Fig. 6.34 and Table 6.2 or Eq. (A7.126) it follows that $MTTF_{S0}$ is given as solution of the following system of algebraic equations (with $M_i = MTTF_{Si}$)
Because of $\lambda_{A1} = \lambda_{A2} = \lambda_A$, $\lambda_{C1} = \lambda_{C2} = \lambda_C$, and the symmetry in Fig. 6.34, it follows that $P_2 = P_1$, $P_4 = P_3$, $P_7 = P_6$, $P_9 = P_8$, $P_{11} = P_{10}$ and $M_2 = M_1$, $M_4 = M_3$, $M_7 = M_6$, $M_9 = M_8$, $M_{11} = M_{10}$. This has been considered in solving the system of algebraic equations (6.261). From Eq. (6.261) it follows that

[Eq. (6.262) for $MTTF_{S0}$, with its auxiliary quantities; not legible in the source]

$MTTF_{S0}$
per Eq. (6.262) can be approximated by [Eq. (6.265), not legible in the source], yielding

$$MTTF_{S0} \approx \frac{\mu_A}{2(\lambda_A + \lambda_C)(\lambda_A + \lambda_C\,\mu_A/\mu_C)}$$

for $\lambda_{B0} = \lambda_{B\infty} = 0$ (1-out-of-2 active redundancy with A and C in series, as per Table 6.10, 2nd & 3rd row). From Fig. 6.34 and Table 6.2 or Eq. (A7.127) it follows that the asymptotic & steady-state point and average availability $PA_S = AA_S$ is given as solution of
One of the Eqs. (6.266) must be dropped and replaced by $P_0 + P_1 + \ldots + P_{12} = 1$. The solution yields
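[Eq. (6.267), with the auxiliary quantities of Eqs. (6.268) & (6.269); not legible in the source]

From Eqs. (6.267) - (6.269) it follows that

[Eq. (6.270) for $PA_S = AA_S$; not legible in the source]

$PA_S = AA_S$ per Eq. (6.270) can be approximated by [Eq. (6.271), not legible in the source],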
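yielding $PA_S = AA_S \approx 1 - 2\,((\lambda_A + \lambda_C)/\mu)^2$ for $\lambda_{B0} = \lambda_{B\infty} = 0$ and $\mu_A = \mu_C = \mu_B = \mu$ (1-out-of-2 active redundancy with A and C in series, as per Table 6.10). Equations (6.265) and (6.271) show the small influence of the coupler $B$. A numerical evaluation with

$\lambda_{A1} = \lambda_{A2} = \lambda_A = 4 \cdot 10^{-6}\ \mathrm{h}^{-1}$ (≈ 0.035 expected failures per year)
$\lambda_{C1} = \lambda_{C2} = \lambda_C = 0.12 \cdot 10^{-6}\ \mathrm{h}^{-1}$ (≈ 0.001 expected failures per year)
$\lambda_{B0} = 0.08 \cdot 10^{-6}\ \mathrm{h}^{-1}$ (≈ 0.0007 expected failures per year)
$\lambda_{B\infty} = 0.6 \cdot 10^{-6}\ \mathrm{h}^{-1}$ (≈ 0.005 expected failures per year)
$\mu_A = \mu_C = 1/4\ \mathrm{h}$, $\mu_B = 1/12\ \mathrm{h}$

yields $MTTF_{S0} = 7.36 \cdot 10^{9}\ \mathrm{h}$ and $PA_S = AA_S = 1 - 1.63 \cdot 10^{-9}$ from Eqs. (6.262) & (6.270), as well as $MTTF_{S0} \approx 7.3 \cdot 10^{9}\ \mathrm{h}$ and $PA_S = AA_S \approx 1 - 0.9 \cdot 10^{-9}$ from Eqs. (6.265) & (6.271), respectively; moreover,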
Considering the substation as a macro-structure (first row in Table 6.10), it holds that $PA_S = AA_S \approx 1 - \lambda_S/\mu_S$ and $R_S(t) \approx e^{-\lambda_S t}$, with $\mu_S = \mu_B$ and $\lambda_S = 1/MTTF_{S0}$.
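The orders of magnitude quoted above can be cross-checked with a few lines of arithmetic. The following is only a sketch: it evaluates the limiting case $\lambda_{B0} = \lambda_{B\infty} = 0$ (coupler assumed failure-free) and the macro-structure relation $1 - PA_S \approx \lambda_S/\mu_S$ with $\mu_S = \mu_B$; it does not evaluate Eqs. (6.262) or (6.270). The function and variable names are illustrative choices, not from the text.

```python
# Sketch: plausibility check of the substation figures (coupler assumed failure-free).
HOURS_PER_YEAR = 8760.0

lam_A = 4e-6            # h^-1, failure rate of a branch (numerical example)
lam_C = 0.12e-6         # h^-1, failure rate of a busbar
mu_A = mu_C = 1 / 4.0   # h^-1, repair rates of A and C
mu_B = 1 / 12.0         # h^-1, repair rate used for the down state


def mttf_without_coupler(lam_a, lam_c, mu_a, mu_c):
    """MTTF_S0 approximation for 1-out-of-2 active redundancy with A and C in series."""
    return mu_a / (2 * (lam_a + lam_c) * (lam_a + lam_c * mu_a / mu_c))


mttf = mttf_without_coupler(lam_A, lam_C, mu_A, mu_C)
lam_S = 1 / mttf                 # macro-structure failure rate
unavailability = lam_S / mu_B    # 1 - PA_S for the macro-structure

print(f"MTTF_S0  ~ {mttf:.3g} h")
print(f"1 - PA_S ~ {unavailability:.3g}")
print(f"lam_A    ~ {lam_A * HOURS_PER_YEAR:.3f} expected failures per year")
```

Under these assumptions the sketch lands close to the values $7.36 \cdot 10^{9}$ h and $1 - 1.63 \cdot 10^{-9}$ given in the numerical evaluation.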
6.8.6.4 With Reward and Frequency/Duration Aspects

For some applications, e.g. in power and communication systems, it is of importance to consider system performance also in the presence of failures. Reward and frequency/duration aspects are of interest to evaluate system performability. For constant failure and repair rates (Markov processes), the asymptotic & steady-state system failure frequency $f_{udS}$ and system mean down time $MDT_S$ (mean repair (restoration) duration at system level) are given as (Eqs. (A7.143) & (A7.144))

$$f_{udS} = \sum_{Z_j \in U} P_j \Big( \sum_{Z_i \in \bar U} \rho_{ji} \Big) \tag{6.272}$$

and

$$MDT_S = \Big( \sum_{Z_i \in \bar U} P_i \Big) \Big/ f_{udS}, \tag{6.273}$$
respectively. Similar results hold for semi-Markov processes. $U$ is the set of states considered as up states for $f_{udS}$ and $MDT_S$ calculation, $\bar U$ is the complement to the totality of states considered. $P_j$ is the asymptotic & steady-state probability of state $Z_j$ and $\rho_{ji}$ the transition rate from $Z_j$ to $Z_i$. In Eq. (6.272), all transition rates $\rho_{ji}$ leaving state $Z_j \in U$ toward $Z_i \in \bar U$ are considered (cumulated states). Example 6.23 gives an application to the substation investigated in Fig. 6.34. Considering $f_{udS} = f_{duS}$ (Eq. (A7.145)), $f_{udS}$ can be replaced by $f_{duS}$.
Example 6.23
Give the failure frequency $f_{udS}$ and the mean failure duration $MDT_S$ in steady-state for the substation of Fig. 6.34 for failures referred to a load loss of 100% and 50%, respectively.

For loss of 50% load, Fig. 6.34 with $U = \{Z_0, \ldots, Z_5\}$ and $\bar U = \{Z_6, \ldots, Z_{12}\}$ yields
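From Eq. (6.273) it follows that

$$MDT_{S\,|\,\text{loss }100\%} = \frac{P_{12}}{f_{udS\,|\,\text{loss }100\%}} \qquad \text{and} \qquad MDT_{S\,|\,\text{loss }50\%} = \frac{1 - (P_0 + 2P_1 + 2P_3 + P_5)}{f_{udS\,|\,\text{loss }50\%}}.$$

The numerical example on p. 258 (Section 6.8.6.3) yields $f_{udS\,|\,\text{loss }100\%} = 1.36 \cdot 10^{-10}\ \mathrm{h}^{-1}$ (≈ $10^{-6}$ expected failures per year), $f_{udS\,|\,\text{loss }50\%} = 7.83 \cdot 10^{-7}\ \mathrm{h}^{-1}$ (≈ $7 \cdot 10^{-3}$ expected failures per year), $MDT_{S\,|\,\text{loss }100\%} = 12\ \mathrm{h}$, and $MDT_{S\,|\,\text{loss }50\%} = 4\ \mathrm{h}$.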
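Frequency and duration figures can be cross-checked against the availability, since $1 - PA_S = f_{udS} \cdot MDT_S$ and $MUT_S = PA_S / f_{udS}$. The sketch below does this for the 100% load loss case, taking $f_{udS} = 1.36 \cdot 10^{-10}\ \mathrm{h}^{-1}$ and $MDT_S = 12$ h as read from the numerical example; the variable names are illustrative.

```python
# Sketch: consistency of failure frequency, mean down time and unavailability
# for loss of 100% load (values as read from the numerical example).
HOURS_PER_YEAR = 8760.0

f_ud_100 = 1.36e-10   # h^-1, system failure frequency
mdt_100 = 12.0        # h, mean down time

unavailability_100 = f_ud_100 * mdt_100            # 1 - PA_S
mut_100 = (1.0 - unavailability_100) / f_ud_100    # mean up time in h
failures_per_year = f_ud_100 * HOURS_PER_YEAR

print(f"1 - PA_S = {unavailability_100:.3g}")
print(f"MUT_S    = {mut_100:.3g} h")
print(f"expected failures per year = {failures_per_year:.2g}")
```

The unavailability obtained this way agrees with $PA_S = 1 - 1.63 \cdot 10^{-9}$ from the numerical evaluation, and the mean up time is of the same order as $MTTF_{S0}$.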
Example 6.24
Give the expected instantaneous reward rate in steady-state for the substation of Fig. 6.34.
Solution
Considering Fig. 6.34 and the numerical example on p. 258 it follows that
The reward rate $r_i$ takes care of the performance reduction in the state considered ($r_i = 0$ for down states, $0 < r_i < 1$ for partially down states, and $r_i = 1$ for up states with 100% performance). From this, the expected instantaneous reward rate in steady-state or for $t \to \infty$, $MIR_S$, is given as (Eq. (A7.147))

$$MIR_S = \sum_{i=0}^{m} r_i P_i. \tag{6.274}$$
The expected accumulated reward in steady-state (or for $t \to \infty$) follows as $MAR_S(t) = MIR_S \cdot t$, see Example 6.24 for an application. $P_i$ in Eq. (6.274) is the asymptotic & steady-state probability of state $Z_i$, giving also the expected percentage of time the system stays at the performance level specified by $Z_i$ (Eq. (A7.132)).
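As a minimal illustration of the reward computation, the sketch below evaluates $MIR_S = \sum_i r_i P_i$ and $MAR_S(t) = MIR_S \cdot t$ for a three-level system. The state probabilities and reward rates are hypothetical values chosen for the illustration; they are not those of the substation.

```python
# Sketch: expected instantaneous and accumulated reward in steady-state.
# Probabilities and rewards below are hypothetical illustration values.

def expected_reward_rate(probs, rewards):
    """Return MIR_S = sum_i r_i * P_i for steady-state probabilities P_i."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("state probabilities must sum to 1")
    return sum(r * p for r, p in zip(rewards, probs))


probs = [0.98, 0.015, 0.005]   # full performance, 50% performance, down
rewards = [1.0, 0.5, 0.0]      # r_i per state

mir = expected_reward_rate(probs, rewards)
mar = mir * 1000.0             # accumulated reward over t = 1000 h

print(f"MIR_S = {mir:.4f}, MAR_S(1000 h) = {mar:.1f}")
```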
6.8.7 Systems with Common Cause Failures

In some practical applications it is necessary to consider that common cause failures (CCF) can occur. Common cause failures are multiple failures resulting from a single cause. They must be distinguished from common mode failures (CMF), which are multiple failures showing the same symptom. Common cause failures can occur in hardware as well as in software, and their causes can be quite different. Some possible causes for common cause failures in hardware are: overload (electrical, thermal, mechanical), technological weakness (material, design, production), misuse (caused e.g. by operating or maintenance personnel), external event. Similar causes can be found for software. In the following, a comprehensive example for investigating effects of common cause failures is considered. Results (Eqs. (6.276) and (6.280) in particular) show that a common cause failure acts as a series element in the system's reliability structure, with failure rate ($\lambda_C$) equal to the occurrence rate of the common cause failure and repair (restoration) rate ($\mu_C$) equal to the removal rate of the corresponding failure. Graphs given by Figs. 2.8 & 2.9 for nonrepairable systems and Figs. 6.17 & 6.18 for repairable systems can be used to visualize results and to support rules (2.28) and (6.174).
Figure 6.35 Diagram of transition rates for availability calculation of the 1-out-of-2 active redundancy of Fig. 6.36 with common cause failures for 4 different basic possibilities: a) C only on working elements, repair for C includes all other failures; b) C only on working elements, repair for C has priority but does not include other failures; c) C on elements in working or repair state, repair for C includes all other failures; d) C on elements in working or repair state, repair as for case b) (constant failure and repair rates ($\lambda, \lambda_C, \lambda_{Ci}, \mu, \mu_C, \mu_{Ci}$), one repair crew, repair priority for failures caused by common cause (C), no further failures at system down (except $\lambda_{C41}, \lambda_{C45}$); Markov process, $Z_0$, $Z_2$ up states)
Figure 6.35 gives the diagrams of transition rates for the repairable 1-out-of-2 active redundancy of Fig. 6.36 with common cause failures for 4 different basic possibilities (C refers to common cause, repair priority for failures caused by C, one repair crew, no further failures at system down). The 4 possibilities of Fig. 6.35 are summarized in Fig. 6.36 for investigation. From Fig. 6.36 and Table 6.2 or Eq. (A7.126) it follows that $MTTF_{S0}$ is given as solution of the following system of algebraic equations (all down states are absorbing for reliability investigation)

$$(2\lambda + \lambda_C)\, MTTF_{S0} = 1 + 2\lambda\, MTTF_{S2}, \qquad (\lambda + \lambda_C + \mu)\, MTTF_{S2} = 1 + \mu\, MTTF_{S0}. \tag{6.275}$$

From Eq. (6.275), $MTTF_{S0}$ follows as (with $\lambda_C < \lambda$),
Furthermore, from Fig. 6.36 and Table 6.2 or Eq. (A7.127) it follows that the asymptotic & steady-state point and average availability $PA_S = AA_S$ is given as solution of the following system of algebraic equations

One of the Eqs. (6.277) must be dropped and replaced by $P_0 + \ldots + P_5 = 1$ (the first equation because of the particular cases investigated below). The solution yields
with $\rho_0 = 2\lambda + \lambda_C$, $\rho_1 = \mu_C$, and $\rho_2, \ldots, \rho_5$ the sums of the transition rates leaving $Z_2, \ldots, Z_5$ (Fig. 6.36, Eq. (A7.103)). Considering $\lambda \ll \mu$, $\lambda_C \ll \mu_C$, $\lambda_{Ci} \ll \mu_{Ci}$ it follows that

Equations (6.276) & (6.278) can be used to investigate Fig. 6.35, yielding ($\lambda_C < \lambda$)
a) C only on working elements, repair for C includes all other failures
b) C only on working elements, repair for C does not include other failures
c) C on elements in working or repair state, repair for C includes all other failures ($\lambda_{C4i} \ll \mu$)
d) C on elements in working or repair state, repair for C does not include other failures
Figure 6.36 Reliability block diagram (1-out-of-2 active, $E_1 = E_2 = E$) and diagram of transition rates for availability calculation of a 1-out-of-2 active redundancy with common cause failures, for different possibilities as per Fig. 6.35 (constant failure and repair rates ($\lambda, \lambda_C, \lambda_{Ci}, \mu, \mu_C, \mu_{Ci}$), one repair crew, repair priority for common cause (C), no further failures at system down (except $\lambda_{C41}, \lambda_{C45}$); Markov process, $Z_0$, $Z_2$ up states)
Case b) corresponds to a 1-out-of-2 active redundancy in series with a switch (Eqs. (6.158), (6.160)). Further approximations are possible, e.g. with $1 - PA_S = P_1 + P_3 + P_4 + P_5$. Equations (6.276) & (6.280) clearly show the effect (consequence) of a common cause failure on a 1-out-of-2 active redundancy:
The common cause failure acts as a series element with failure rate ($\lambda_C$) equal to the occurrence rate of the common cause failure and repair (restoration) rate ($\mu_C$) equal to the removal rate of the corresponding failure; results given by Figs. 2.8 & 2.9 for nonrepairable systems and Figs. 6.17 & 6.18 for repairable systems (with rules (2.28) and (6.174)) apply.

The above rule holds quite generally if the common cause failure acts at the same time on all redundant elements of a redundant structure. From this,

Good protection against common cause failures can only be given if each element of a redundant structure is realized with different technology (tools), electrically, mechanically and thermally separated, and not designed by the same designer (basically true also for software).

Concrete protection against common cause failures must be worked out on a case-by-case basis, see Example 2.3 for a simple practical situation. In verifying such a protection, an FMEA/FMECA (Section 2.6) is mandatory for hardware and software. In some cases, common cause failures can occur with a time delay on elements of a redundant structure (e.g. because of the drop of a cooling ventilator); in these cases,
automatic fault recognition can avoid multiple failures. Some practical considerations on failure rates for common cause failures in electronic equipment are in [A2.6 (61508-6)], giving $\lambda_C / \lambda = 0.005$ as achievable value (rule (6.174)).
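The series-element rule can be illustrated numerically. The sketch below assumes that the two equations for the mean time to system failure read $(2\lambda + \lambda_C) M_0 = 1 + 2\lambda M_2$ and $(\lambda + \lambda_C + \mu) M_2 = 1 + \mu M_0$, solves them, and compares $M_0$ with the mean time to failure of a series combination of an element with failure rate $\lambda_C$ and a 1-out-of-2 redundancy with approximate failure rate $2\lambda^2/\mu$ (a common approximation for $\lambda \ll \mu$, assumed here). The rates are hypothetical, with $\lambda_C/\lambda = 0.005$; the function name is an illustrative choice.

```python
# Sketch: common cause failure acting as a series element (hypothetical rates).

def mttf_common_cause(lam, lam_c, mu):
    """Solve (2*lam + lam_c)*M0 = 1 + 2*lam*M2 and (lam + lam_c + mu)*M2 = 1 + mu*M0."""
    a11, a12 = 2 * lam + lam_c, -2 * lam
    a21, a22 = -mu, lam + lam_c + mu
    det = a11 * a22 - a12 * a21
    m0 = (a22 - a12) / det   # Cramer's rule with right-hand side (1, 1)
    m2 = (a11 - a21) / det
    return m0, m2


lam = 1e-4            # h^-1, failure rate of one element
lam_c = 0.005 * lam   # h^-1, common cause failure rate
mu = 0.5              # h^-1, repair rate

m0, m2 = mttf_common_cause(lam, lam_c, mu)
series_rule = 1.0 / (lam_c + 2 * lam**2 / mu)   # series-element approximation

print(f"MTTF_S0 (solved)      = {m0:.4g} h")
print(f"MTTF_S0 (series rule) = {series_rule:.4g} h")
print(f"without common cause  = {(3 * lam + mu) / (2 * lam**2):.4g} h")
```

With these values the common cause failure dominates: even at $\lambda_C/\lambda = 0.005$ it reduces the mean time to system failure by more than a factor of 10 compared with the ideal redundancy.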
6.8.8 General Procedure for Modeling Complex Systems

On the basis of the tools introduced in Appendix A7 and results in Sections 6.8.1 - 6.8.7, the following procedure can be given for reliability and availability investigation of complex systems, both when a reliability block diagram exists or not (for series-parallel structures, Section 6.7 applies, in particular Table 6.10).

1. As a first step operate with (time-homogeneous) Markov processes, i.e. assume that failure and repair rates of all elements are constant during the stay time in every state, and can change (stepwise) only at state changes, e.g. because of change in configuration, component use, stress, repair strategy or other (dropping this assumption leads to non-Markovian processes, as shown e.g. in Section 6.4.2); in a further step, refinements can be considered on a case-by-case basis using semi-regenerative processes.

2. Group series elements and assign to each macro-structure $E_1, \ldots, E_n$ a failure rate $\lambda_S = \lambda_1 + \ldots + \lambda_n$ and repair (restoration) rate $\mu_S = \lambda_S / (\lambda_1/\mu_1 + \ldots + \lambda_n/\mu_n)$ (Table 6.10); a further reduction of the diagram of transition rates [...]

[...] the identification of up states which have a direct transition to a down state at system level (see Fig. 6.20), i.e. of critical operating states, [...]
5. Identify the transition rates between each state (combination of failure and repair rates), by considering assumed repair (restoration) priorities, retained failure modes, and particularities specific to the system considered (dependence between elements, sequence of failure or failure modes, etc.).
Figure 6.37 Example for a reduction of a diagram of transition rates for $MTTF_{S0}$ calculation (note that $(\lambda_0 + \lambda_1)/(1 + \lambda_0/\lambda_1) = (1 + \lambda_1/\lambda_0)/(1/\lambda_0 + 1/\lambda_1)$)
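6. For reliability calculation, the mean time to system failure $MTTF_{Si}$ for system entering state $Z_i$ at $t = 0$ is obtained by solving (Eq. (A7.126))

$$\rho_i\, MTTF_{Si} = 1 + \sum_{Z_j \in U,\ j \ne i} \rho_{ij}\, MTTF_{Sj}, \qquad \rho_i = \sum_{j=0,\ j \ne i}^{m} \rho_{ij}, \qquad Z_i \in U. \tag{6.281}$$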
Thereby, $U$ is the set of up states, $\bar U$ the set of down states ($U \cup \bar U = \{Z_0, \ldots, Z_m\}$), $\rho_{ij}$ the transition rate from state $Z_i \in U$ to state $Z_j \in U$, and $\rho_i$ the sum of all transition rates leaving state $Z_i$ (Table 6.2). The system of algebraic equations (6.281) delivers all $MTTF_{Si}$ for any $Z_i \in U$ entered at $t = 0$ (for Markov processes the condition "$Z_i$ is entered at $t = 0$" can be replaced by "system in $Z_i$ at $t = 0$"). At system level, $MTTF_{S0}$ can often be used (in $Z_0$ all elements are operating or ready to operate, i.e. as-good-as-new because of the memoryless Markov property).

7. The asymptotic ($t \to \infty$) & steady-state (stationary) point and average availability $PA_S = AA_S$ is given as

$$PA_S = AA_S = \sum_{Z_j \in U} P_j,$$
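with $P_j$ as solution of (Eq. (A7.127))

$$\rho_j P_j = \sum_{i=0,\ i \ne j}^{m} P_i\, \rho_{ij}, \qquad \text{with } P_j > 0, \quad \sum_{j=0}^{m} P_j = 1, \quad \rho_i = \sum_{j=0,\ j \ne i}^{m} \rho_{ij}, \quad j = 0, \ldots, m. \tag{6.284}$$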
In Eq. (6.284), all transition rates $\rho_{ij}$ leading to state $Z_j$ have to be considered. One equation for $P_j$, arbitrarily chosen, must be dropped and replaced by $\sum_{j=0}^{m} P_j = 1$ (see Section 6.2.1.5 for further availability figures).
8. Considering the constant failure rate for all elements, the asymptotic & steady-state interval reliability follows as (Eq. (6.27))
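9. The asymptotic & steady-state system failure frequency $f_{udS}$ and system mean up time $MUT_S$ are given as (Eqs. (A7.141) & (A7.142))

$$f_{udS} = \sum_{Z_j \in U} P_j \Big( \sum_{Z_i \in \bar U} \rho_{ji} \Big) \tag{6.286}$$

and

$$MUT_S = \Big( \sum_{Z_j \in U} P_j \Big) \Big/ f_{udS} = PA_S / f_{udS}, \tag{6.287}$$

respectively. $U$ is the set of states considered as up states for $f_{udS}$ and $MUT_S$ calculation, $\bar U$ the complement to the totality of states considered.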
The same holds for the system repair (restoration) frequency $f_{duS}$ and the system mean down time $MDT_S$, given as

$$f_{duS} = \sum_{Z_i \in \bar U} P_i \Big( \sum_{Z_j \in U} \rho_{ij} \Big) \tag{6.288}$$

and

$$MDT_S = \Big( \sum_{Z_i \in \bar U} P_i \Big) \Big/ f_{duS} = (1 - PA_S) / f_{duS}, \tag{6.289}$$

respectively. $MUT_S$ is the mean of the time in which the system is moving in the set of up states $Z_j \in U$ (e.g. $Z_0$ to $Z_7$ in Fig. 6.20) before a transition to the set of down states $Z_i \in \bar U$ (e.g. $Z_8$ to $Z_{11}$ in Fig. 6.20) occurs, in steady-state or for $t \to \infty$. $MDT_S$ is the mean repair (restoration) duration at system level. $f_{udS}$ is the system failure intensity $z_S(t) = z_S$, as defined by Eq. (A7.230), in steady-state or for $t \to \infty$. It is not difficult to recognize that one has $f_{udS} = f_{duS}$ and thus
$$f_{udS} = f_{duS} = z_S = 1 / (MUT_S + MDT_S), \tag{6.290}$$

see Example 6.25 for a practical application. Equations (6.287), (6.289), (6.290) lead to the following important relation

$$MDT_S = MUT_S\, (1 - PA_S) / PA_S. \tag{6.291}$$
Considering that the asymptotic & steady-state probability $P_0$ is much greater than all other $P_j$, the approximation [...] can often be used ([...] $Z_j \in U$ [...] for $MUT_S$ is not allowed, see Example 6.25).
10. The asymptotic & steady-state expected instantaneous reward rate $MIR_S$ is given by (Eq. (A7.147))

$$MIR_S = \sum_{i=0}^{m} r_i P_i.$$

Thereby, $r_i = 0$ for down states, $0 < r_i < 1$ for partially down states, and $r_i = 1$ for up states with 100% performance. The asymptotic & steady-state expected accumulated reward $MAR_S$ follows as (Eq. (A7.148))

$$MAR_S(t) = MIR_S \cdot t. \tag{6.293}$$

In some cases it can be useful to operate with a time schedule (e.g. Fig. A7.11). Alternative investigation methods (Petri nets, dynamic FTA, computer-aided analysis) are introduced in Section 6.9. As in the previous sections, failure-free time is used as a synonym for failure-free operating time and repair as a synonym for restoration.
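Steps 6, 7, and 9 of the procedure can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: it takes a matrix `rho` of transition rates (`rho[i][j]` from $Z_i$ to $Z_j$, zero diagonal) and a list of up states, solves the linear systems for $MTTF_{Si}$ and $P_j$ with a small Gauss-Jordan routine, and derives $f_{udS}$, $MUT_S$, $MDT_S$. The model used (1-out-of-2 active redundancy, one repair crew, rates $\lambda$ and $\mu$) and all function names are illustrative choices, not taken from the text.

```python
# Sketch: steps 6, 7 and 9 for a small Markov model (pure Python, no dependencies).

def solve(a, b):
    """Solve the linear system a x = b (Gauss-Jordan with partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def mttf(rho, up):
    """Step 6: rho_i * M_i = 1 + sum over up states j != i of rho_ij * M_j."""
    a = [[(sum(rho[i]) if i == j else -rho[i][j]) for j in up] for i in up]
    return dict(zip(up, solve(a, [1.0] * len(up))))


def steady_state(rho):
    """Step 7: rho_j * P_j = sum_i P_i * rho_ij; last equation replaced by sum P_j = 1."""
    n = len(rho)
    a = [[(-sum(rho[j]) if i == j else rho[i][j]) for i in range(n)] for j in range(n)]
    a[-1] = [1.0] * n
    return solve(a, [0.0] * (n - 1) + [1.0])


def frequency_duration(rho, p, up):
    """Step 9: return (f_udS, MUT_S, MDT_S)."""
    down = [i for i in range(len(p)) if i not in up]
    f_ud = sum(p[j] * rho[j][i] for j in up for i in down)
    pa = sum(p[j] for j in up)
    return f_ud, pa / f_ud, (1.0 - pa) / f_ud


lam, mu = 1e-3, 0.5          # h^-1, illustrative rates
rho = [[0.0, 2 * lam, 0.0],  # Z0: both elements up
       [mu, 0.0, lam],       # Z1: one element in repair
       [0.0, mu, 0.0]]       # Z2: system down
up = [0, 1]

m = mttf(rho, up)
p = steady_state(rho)
f_ud, mut, mdt = frequency_duration(rho, p, up)
print(f"MTTF_S0 = {m[0]:.6g} h, PA_S = {sum(p[j] for j in up):.8f}")
print(f"f_udS = {f_ud:.4g} 1/h, MUT_S = {mut:.6g} h, MDT_S = {mdt:.4g} h")
```

For this small model the results can be compared with the closed forms $MTTF_{S0} = (3\lambda + \mu)/(2\lambda^2)$, $MUT_S = (2\lambda + \mu)/(2\lambda^2)$, and $MDT_S = 1/\mu$; for large models a sparse solver would be preferable to the dense elimination used here.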
Example 6.25
Investigate $MUT_S$, $MDT_S$, $f_{udS}$, and $f_{duS}$ for the 1-out-of-2 redundancy of Fig. 6.8a.
Solution
The solution of Eq. (6.85) with $\dot P_i(t) = 0$ yields (Eq. (6.88))

$$P_0 = \frac{\mu^2}{(\lambda + \lambda_r)(\lambda + \mu) + \mu^2} \qquad \text{and} \qquad P_1 = \frac{\mu\,(\lambda + \lambda_r)}{(\lambda + \lambda_r)(\lambda + \mu) + \mu^2}.$$

From Fig. 6.8a and Eqs. (6.286)-(6.289) it follows that

For this example it holds that $MUT_S = MTTF_{S1}$ (with $MTTF_{S1}$ as solution of Eq. (6.89) with $P_1(0) = 1$ or Eq. (6.281), see also Example A7.9), this because the system enters state $Z_1$ after each system failure; furthermore, $MDT_S = 1/\mu$ because only one repair crew is available.
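The statements of Example 6.25 can be checked numerically. The sketch below assumes the usual three-state model of a 1-out-of-2 warm redundancy with one repair crew ($Z_0 \to Z_1$ with rate $\lambda + \lambda_r$, $Z_1 \to Z_2$ with rate $\lambda$, repairs with rate $\mu$, $Z_2$ the down state), so that $P_2 = \lambda(\lambda + \lambda_r)/[(\lambda + \lambda_r)(\lambda + \mu) + \mu^2]$; the rates and names are illustrative values, not from the text.

```python
# Sketch: frequency/duration figures for a 1-out-of-2 warm redundancy (one repair crew).
import math


def example_6_25(lam, lam_r, mu):
    """Return (f_udS, f_duS, MUT_S, MDT_S) from the steady-state probabilities."""
    d = (lam + lam_r) * (lam + mu) + mu**2
    p0 = mu**2 / d
    p1 = mu * (lam + lam_r) / d
    p2 = lam * (lam + lam_r) / d   # down state
    f_ud = p1 * lam                # only Z1 -> Z2 leads from up to down
    f_du = p2 * mu                 # only Z2 -> Z1 leads from down to up
    return f_ud, f_du, (p0 + p1) / f_ud, p2 / f_du


lam, lam_r, mu = 1e-3, 2e-4, 0.5   # h^-1, illustrative rates
f_ud, f_du, mut, mdt = example_6_25(lam, lam_r, mu)

# MTTF_S1 from rho_i * M_i = 1 + sum rho_ij * M_j, solved by hand for this model
mttf_s1 = (lam + lam_r + mu) / (lam * (lam + lam_r))

print(f"f_udS = {f_ud:.4g} 1/h, f_duS = {f_du:.4g} 1/h")
print(f"MUT_S = {mut:.6g} h, MTTF_S1 = {mttf_s1:.6g} h, MDT_S = {mdt:.4g} h")
```

Under these assumptions the output reproduces $f_{udS} = f_{duS}$, $MUT_S = MTTF_{S1}$, and $MDT_S = 1/\mu$.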
6.9 Alternative Investigation Methods

The methods given in Sections 6.1 to 6.8 are based on Markov, semi-Markov and semi-regenerative processes, according to the involved distributions for failure-free and repair (restoration) times. They have the advantage of great flexibility (arbitrary redundancy or repair strategy, incomplete coverage or switch, common cause failures) and transparency. Further tools are known to model repairable systems, e.g. based on Petri nets or dynamic fault trees. For very large or complex systems, numerical solution or Monte Carlo simulation can also become necessary. Many of these tools are similar in performance and versatility (Petri nets are equivalent to Markov models), others have limitations (fault tree analyses are basically limited to totally independent elements and Monte Carlo simulations deliver only numerical solutions), so that the choice of the tool is often related to the personal experience of the analyst (see e.g. [A2.6 (61165, 60300-3-1), 6.30, 2.48 (2005)] for comparisons). However, modeling large systems still requires a close cooperation between project and reliability engineers. In the following, Sections 6.9.1 and 6.9.2 give a short introduction to Petri nets and dynamic fault trees for reliability investigations. Section 6.9.3 considers some aspects of numerical solutions.
6.9.1 Petri Nets

Petri nets (PN) were introduced in 1962 by C. A. Petri [6.35, 6.2, 6.6] to investigate in particular synchronization, sequentiality, concurrency, and conflict in parallel working digital systems. Several extensions have been at the origin of a large literature [6.1, 6.6, 6.8, 6.30, 2.37, 2.48 (1999)]. Important for reliability investigations was the possibility to create algorithmically the diagram of transition rates belonging to a given Petri net. With this, investigation of time behavior on the basis of (time-homogeneous) Markov processes was opened (stochastic Petri nets).
Extension to semi-Markov processes is straightforward [6.8], but less useful for reliability investigations (Sections 6.3 & 6.4). This section gives a short introduction to Petri nets from a reliability analysis point of view. A Petri net (PN) is a directed graph involving 3 kinds of elements:

• Places $P_1, \ldots, P_n$ (drawn as circles): a place $P_i$ is an input to a transition $T_j$ if an arc exists from $P_i$ to $T_j$, and is an output of a transition $T_k$ if an arc exists from $T_k$ to $P_i$; places may contain tokens (black spots), and a PN with tokens is a marked PN.
• Transitions $T_1, \ldots, T_m$ (drawn as empty rectangles for timed transitions or bars for immediate transitions): a transition can fire, taking one token from each input place and putting one token in each output place.
• Directed arcs: an arc connects a place with a transition or vice versa and has an arrowhead to indicate the direction; multiple arcs are possible and indicate that by firing of the involved transition a corresponding number of tokens is taken from the involved input place (for input multiple arc) or put in the involved output place (for output multiple arc); inhibitor arcs with a circle instead of the arrowhead are also possible and indicate that, for the firing condition, no token must be contained in the corresponding place.

Firing rules for a transition are:

1. A transition is enabled (can fire) only if all places with an input arc to the given transition contain at least one token (no token for inhibitor arcs).
2. Only one transition can fire at a given time; the selection occurs according to the embedded Markov chain describing the stochastic behavior of the PN.
3. Firing of a transition can be immediate or occur after a time interval $\tau_{ij} > 0$ (timed PN); $\tau_{ij} > 0$ is in general a random variable (stochastic PN) with distribution function $F_{ij}(x)$ when firing occurs from transition $T_i$ to place $P_j$ (yielding a Markov process for $F_{ij}(x) = 1 - e^{-\lambda_{ij} x}$, i.e. with transition rate $\lambda_{ij}$, or a semi-Markov process for $F_{ij}(x)$ arbitrary, with $F_{ij}(0) = 0$).

From rule 3, practically only Markov processes (i.e. constant failure and repair rates) will occur in Petri nets for reliability applications (Section 6.4.2). Two further concepts useful when dealing with Petri nets are those of marking and reachability: A marking $M = \{m_1, \ldots, m_n\}$ gives the number $m_i$ of tokens in the place $P_i$ at a given time point and thus defines the state of the PN; $M_j$ is immediately reachable from $M_i$ if $M_j$ can be obtained by firing a transition enabled by $M_i$. With $M_0$ as marking at time $t = 0$, $M_1, \ldots, M_k$ are all the (different) markings reachable from $M_0$; they define the PN states and give the reachability tree, from which the diagram of transition rates of the corresponding Markov model follows. Figure 6.38 gives some examples of reliability structures with corresponding PN.
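The enabling rule, firing, and the construction of the reachable markings can be sketched as follows. The net used here is a hypothetical illustration, not one of the nets of Fig. 6.38: two elements $E_1$, $E_2$, each with an "up" and a "down" place, timed transitions for failure (rate label `lam`) and repair (rate label `mu`), and an inhibitor arc from `E1_down` to `repair_E2` to express repair priority on $E_1$ with a single repair crew. All names are assumptions made for the example.

```python
# Sketch: reachable markings of a small Petri net (breadth-first search).
from collections import deque

PLACES = ["E1_up", "E1_down", "E2_up", "E2_down"]

# (name, rate label, input places, output places, inhibitor places)
TRANSITIONS = [
    ("fail_E1", "lam", ["E1_up"], ["E1_down"], []),
    ("fail_E2", "lam", ["E2_up"], ["E2_down"], []),
    ("repair_E1", "mu", ["E1_down"], ["E1_up"], []),
    ("repair_E2", "mu", ["E2_down"], ["E2_up"], ["E1_down"]),  # priority on E1
]


def enabled(marking, inputs, inhibitors):
    """Rule 1: every input place holds a token, every inhibitor place is empty."""
    return (all(marking[p] >= 1 for p in inputs)
            and all(marking[p] == 0 for p in inhibitors))


def fire(marking, inputs, outputs):
    """Take one token from each input place, put one in each output place."""
    new = dict(marking)
    for p in inputs:
        new[p] -= 1
    for p in outputs:
        new[p] += 1
    return new


def reachability(m0):
    """Return ({marking tuple: index}, [(from, to, transition, rate label)])."""
    def key(m):
        return tuple(m[p] for p in PLACES)

    seen = {key(m0): 0}
    edges = []
    queue = deque([m0])
    while queue:
        m = queue.popleft()
        for name, rate, ins, outs, inh in TRANSITIONS:
            if enabled(m, ins, inh):
                nxt = fire(m, ins, outs)
                if key(nxt) not in seen:
                    seen[key(nxt)] = len(seen)
                    queue.append(nxt)
                edges.append((seen[key(m)], seen[key(nxt)], name, rate))
    return seen, edges


m0 = {"E1_up": 1, "E1_down": 0, "E2_up": 1, "E2_down": 0}
markings, edges = reachability(m0)
for src, dst, name, rate in edges:
    print(f"M{src} --{name} ({rate})--> M{dst}")
```

Each reachable marking corresponds to one state of the Markov model, and each edge to one transition rate; in the marking with both elements down only `repair_E1` is enabled, which reflects the assumed repair priority.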
Figure 6.38 Top: Reliability block diagram (a), diagram of transition rates (c), Petri net (PN) (b), and reachability tree (d) for a repairable 1-out-of-2 warm redundancy ($E_1 = E_2 = E$; two identical elements, constant failure ($\lambda$, $\lambda_r$) and repair ($\mu$) rates, one repair (restoration) crew, Markov process); Bottom: Reliability block diagram (a), diagram of transition rates (c), Petri net (b), and reachability tree (d) for a repairable 1-out-of-2 active redundancy with two identical elements and switch in series (constant failure ($\lambda$, $\lambda_\nu$) and repair ($\mu$, $\mu_\nu$) rates, one repair (restoration) crew, repair priority on switch, Markov process)
6.9.2 Dynamic Fault Tree

A fault tree (FT) is a graphical representation of the conditions or other factors causing or contributing to the occurrence of a defined undesirable event, referred to as the top event [A2.6 (IEC 61025)]. In its original form, as introduced in Section 2.6 for failure mode analysis, a fault tree contains only static gates (essentially AND & OR), is termed static fault tree, and can handle combinatorial events (qualitatively, similar as for an FMEA (Section 2.6), or quantitatively, as with Boolean functions (Section 2.3.3)). However, as the top event is in general a failure at system level, "0" is used for operating and "1" for failure. This is opposite to the notation used in Sections 2.3.3 & 2.3.4 for reliability investigations based on state space or Boolean functions. In fault trees, OR gates thus represent a series structure and AND gates a parallel structure. Figure 6.39 gives two examples of reliability structures with corresponding static fault trees. Static fault trees can be used to compute reliability and availability for the case of totally independent elements (active redundancy and each element has its own repair crew), see e.g. [A2.6 (IEC 61025)] for a comprehensive description. Reliability computation for the nonrepairable case (up to system failure) using fault tree analysis (FTA) leads for the series structure to (Eq. (2.17))

$$\bar R_S(t) = 1 - \prod_{i=1}^{n} R_i(t) = 1 - \prod_{i=1}^{n} \big(1 - \bar R_i(t)\big)$$
and for the k-out-of-n active redundancy (with $R_1(t) = \ldots = R_n(t) = R(t)$) to (Eq. (2.23))

$$\bar R_S(t) = 1 - \sum_{i=k}^{n} \binom{n}{i} R^i(t)\, \big(1 - R(t)\big)^{n-i}$$

($\bar R_i(t) = 1 - R_i(t)$ = failure probability). For complex structures, computation often uses the method of minimal cut sets (Section 2.3.4). However, because of their structure, static fault trees cannot handle states or time dependencies (e.g. standby redundancy or repair strategy). For these cases, it is necessary to extend static fault trees, adding so-called dynamic gates to obtain dynamic fault trees. The most important dynamic gates are [2.85, 6.36, 6.38, A2.6 (IEC 61025)]:
• Priority AND gate (PAND), the output event (failure) occurs only if all input events occur and in sequence from left to right.
• Sequence enforcing gate (SEQ), the output event occurs only if input events occur in sequence from left to right and there are more than two input events.
• Spare gate (SPARE), the output event occurs if the number of spares is less than required.

Further gates (choice gate, redundancy gate, warm spare gate) have been suggested,
Figure 6.39 a) Reliability block diagram and corresponding static fault tree for a 2-out-of-3 active redundancy ($E_1 = E_2 = E_3 = E$) with switch element in series; b) Functional block diagram and corresponding static fault tree for a redundant computer system [6.30]; "0" is used for operating and "1" for failure
e.g. in [6.38]. All above dynamic gates require a Markov analysis, i.e. state probabilities must be computed by a Markov approach (constant failure & repair rates) and then the results used as occurrence probability for the basic event replacing the corresponding dynamic gate. Use of dynamic gates in dynamic fault tree analysis, with corresponding computer programs, has been carefully investigated, e.g. in [2.85, 6.36, 6.38]. Fault tree analysis (FTA) is an established methodology for reliability and availability analysis (emerging in the nineteen-sixties with investigations on nuclear power plants). However, the necessity to use Markov approaches to solve dynamic gates can limit its use in practical applications. The limits of FTA are shared by all methods based on binary considerations (fault trees, reliability block diagrams (RBD), binary decision diagrams (BDD), etc.). However, reliability block diagrams and fault trees are a valid support in generating transition rates diagrams for Markov analysis. So once more, a combination of investigation tools is often a good way to solve difficult problems.
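For totally independent elements, the static gates reduce to simple operations on failure probabilities. The sketch below evaluates the top event probability for a structure like that of Fig. 6.39a (2-out-of-3 active redundancy with a switch in series), using an OR gate for the series structure and the binomial expression for the k-out-of-n redundancy. The numerical failure probabilities and the function names are hypothetical illustration choices.

```python
# Sketch: static fault tree evaluation for independent basic events
# ("1" = failure, so all quantities are failure probabilities).
from math import comb, prod


def or_gate(q):
    """Series structure: the output fails if at least one input fails."""
    return 1.0 - prod(1.0 - qi for qi in q)


def and_gate(q):
    """Parallel structure: the output fails only if all inputs fail."""
    return prod(q)


def k_out_of_n_failure(k, n, q):
    """Failure probability of a k-out-of-n active redundancy, identical elements."""
    r = 1.0 - q
    return 1.0 - sum(comb(n, i) * r**i * q**(n - i) for i in range(k, n + 1))


q_element = 0.1    # failure probability of each element E (hypothetical)
q_switch = 0.01    # failure probability of the switch (hypothetical)

q_redundancy = k_out_of_n_failure(2, 3, q_element)
q_top = or_gate([q_switch, q_redundancy])

print(f"2-out-of-3 redundancy fails with probability {q_redundancy:.4f}")
print(f"top event probability = {q_top:.5f}")
print(f"1-out-of-2 via AND gate = {and_gate([q_element, q_element]):.4f}")
```

The k-out-of-n block is evaluated with the binomial formula rather than as an OR of AND gates over cut sets, because the cut sets share basic events and are therefore not independent.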
6.9.3 Computer-Aided Reliability and Availability Computation

Investigation of large series-parallel structures or of complex systems (for which a reliability block diagram often does not exist) is in general time-consuming and can become mathematically intractable. A large number of computer programs for numerical solution of reliability and availability equations as well as for Monte Carlo simulation have been developed. Such a numerical computation can be in some cases the only way to get results. Section 6.9.3.1 discusses requirements for a versatile program for the numerical solution of reliability and availability equations. Section 6.9.3.2 gives basic considerations on Monte Carlo simulation and introduces an approach useful for rare events. Although appealing, numerical solutions can deliver only case-by-case solutions and can cause problems (instabilities in the presence of sparse matrices, prohibitive run times for Monte Carlo simulation of rare events or if confidence limits are required). As a general rule, analytical solutions (Sections 6.2 - 6.6, 6.8) or approximate expressions (Section 6.7) should be preferred whenever possible.
6.9.3.1 Numerical Solution of Equations for Reliability and Availability

Analytical solution of algebraic or differential / integral equations for reliability and availability computation of large or complex systems can become time-consuming. Software tools exist to solve this kind of problem. From such a software package one generally expects high completeness, usability, robustness, integrity, and portability (Table 5.4). The following is a comprehensive list of requirements:
General requirements:
1. Support interface with CAD/CAE and configuration management packages.
2. Provide a large component data bank with the possibility for manufacturer and company-specific labeling, and storage of non application-specific data.
3. Support different failure rate models [2.21 - 2.29].
4. Have flexible output (regarding medium, sorting capability, weighting), graphic interface, single & multi-user capability, high usability & integrity.
5. Be portable to different platforms.

Specific for nonrepairable (up to system failure) systems:
1. Consider reliability block diagrams (RBD) of arbitrary complexity and with a large number of elements (≥ 1,000) and levels (≥ 10); possibility for any element to appear more than once in the RBD; automatic editing of series and parallel models; powerful method to handle complex structures; constant or time dependent failure rate for each element; possibility to handle as element macro-structures or items with more than one failure mode.
2. Easy editing of application-specific data, with user features such as: automatic computation of the ambient temperature at component level with freely selectable temperature difference between elements, freely selectable duty cycle from the system level downwards, global change of environmental and quality factors, manual selection of stress factors for tradeoff studies or risk assessment, manual introduction of field data and of default values for component families or assemblies.
3. Allow reuse of elements with arbitrary complexity in an RBD (libraries).

Specific for repairable systems:
1. Consider elements with constant failure rate and constant or arbitrary repair rate, i.e. handle Markov, semi-Markov, and (as far as possible) semi-regenerative processes.
2. Have automatic generation of the transition rates $\rho_{ij}$ for Markov models and of the involved semi-Markov transition probabilities $Q_{ij}(x)$ for systems with constant failure rates, one repair crew, and arbitrary repair rate (starting e.g. from a given set of successful paths); automatic generation and solution of the equations describing the system's behavior.
3. Allow different repair strategies (first-in first-out, one repair crew or other).
4. Use sophisticated algorithms for quick inversion of sparse matrices.
5. Consider at least 20,000 states for the exact solution of the asymptotic & steady-state availability $PA_S = AA_S$ and mean time to system failure $MTTF_{S0}$.
6. Support investigations yielding approximate expressions (macro-structures, totally independent elements, cutting states or other, see Section 6.7.1).

A scientific software package satisfying many of the above requirements has been developed at the Reliability Lab. of the ETH [2.50]. Refinement of the requirements is possible. For basic reliability computation, commercial programs are available [2.51 - 2.60]. Specialized programs are e.g. in [2.7, 2.17, 2.59, 2.85, 6.23, 6.24, 6.42]; considerations on numerical methods for reliability evaluation are e.g. in [2.56].
6.9.3.2 Monte Carlo Simulations
The Monte Carlo technique is a numerical method based on a probabilistic interpretation of quantities obtained from algorithmically generated random variables. It was introduced in 1949 by N. Metropolis and S. Ulam [6.32]. Since this first paper, a large amount of literature has been published, see e.g. [6.12, 6.31, A7.18]. This section deals with some basic considerations on Monte Carlo simulation useful for reliability engineering and gives an approach for the simulation of rare events which avoids the difficulty of time truncation caused by the amplitude quantization of the digital numbers used.
For reliability purposes, a Monte Carlo simulation can basically be used to estimate a value (e.g. an unknown probability) or to simulate (reproduce) the stochastic process describing the behavior of a complex system. In this sense, a Monte Carlo simulation is useful to achieve results, numerically verify an analytical solution, get an idea of the possible time behavior of a complex system, or determine interactions among variables. Two main problems related to Monte Carlo simulation are the generation of uniformly distributed random numbers in the interval (0,1) and the transformation of these numbers into random variables with prescribed distribution function. A congruential relation
$$ s_{n+1} = (a\, s_n + b) \bmod m , \qquad (6.296) $$
where mod is used for modulo, is frequently used to generate pseudo-random numbers (for simplicity, pseudo will be omitted in the following). Transformation to an arbitrary distribution function $F(x)$ is often performed with help of the inverse function $F^{-1}(x)$, see Example A6.17. The method of the inverse function is simple but not necessarily good enough for critical applications. A further question arising with Monte Carlo simulation is that of how many repetitions $n$ must be run to have an estimate of the unknown quantity within a given interval $\pm\varepsilon$ at a given confidence level $\gamma$. For the case of an event with probability $p$ and assuming $n$ sufficiently large as well as $p$ or $(1-p)$ not very small, Eq. (A6.152) yields for $p$ known
$$ n = \left( \frac{t_{(1+\gamma)/2}}{\varepsilon} \right)^{2} p\,(1-p) , $$
where $t_{(1+\gamma)/2}$ is the $(1+\gamma)/2$ quantile of the standard normal distribution; for instance, $t_{(1+\gamma)/2} = 1.645$ for $\gamma = 0.9$ and $1.96$ for $\gamma = 0.95$ (Appendix A9.1). For $p$ totally unknown, the value $p = 0.5$ has to be taken. Knowing the number of realizations $k$ in $n$ trials, Eq. (A8.43) can be used to find confidence limits for $p$.
To simulate (reproduce) a time-homogeneous Markov process, the following procedure is useful, starting with a transition into state $Z_i$ at the arbitrary time $t = 0$:
1. Select the next state $Z_j$ to be visited by generating an event with probability
$$ P_{ij} = \frac{\rho_{ij}}{\rho_i} , \qquad j \ne i , \quad \rho_i = \sum_{j \ne i} \rho_{ij} , $$
according to the embedded Markov chain (for uniformly distributed random numbers $\zeta$ in (0,1) it holds that $\Pr\{\zeta \le x\} = x$).
2. Find the stay time (sojourn time) in state $Z_i$ up to the jump to the next state $Z_j$ by generating a random variable with distribution function (Example A6.17)
$$ F_{ij}(x) = 1 - e^{-\rho_i x} . $$
3. Jump to state $Z_j$.
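The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of the software package cited earlier: the names `Lcg` and `simulate_markov` are hypothetical, the generator constants are the Park-Miller "minimal standard" values (an assumption of this sketch, chosen only to give a concrete instance of a congruential relation), and the sojourn times are taken as exponential with parameter equal to the sum of the outgoing transition rates.

```python
import math


class Lcg:
    """Congruential generator s_{n+1} = (a*s_n + b) mod m.

    The constants (a = 16807, b = 0, m = 2**31 - 1) are an assumption
    of this sketch (Park-Miller "minimal standard" generator).
    """

    def __init__(self, seed=1, a=16807, b=0, m=2**31 - 1):
        self.state, self.a, self.b, self.m = seed, a, b, m

    def uniform(self):
        """Return the next pseudo-random number in (0, 1)."""
        self.state = (self.a * self.state + self.b) % self.m
        return self.state / self.m


def simulate_markov(rates, start, t_end, rng):
    """Simulate one path of a time-homogeneous Markov process on (0, t_end).

    rates[i][j] is the transition rate from state i to state j.
    Returns the list of (entry_time, state) pairs.
    """
    t, state, path = 0.0, start, [(0.0, start)]
    while True:
        out = rates.get(state, {})
        rho_i = sum(out.values())
        if rho_i == 0.0:  # absorbing state, no further jump
            break
        # Step 1: next state from the embedded Markov chain (rho_ij / rho_i).
        u, acc, nxt = rng.uniform() * rho_i, 0.0, None
        for j, rho_ij in out.items():
            acc += rho_ij
            nxt = j
            if u < acc:
                break
        # Step 2: sojourn time by the inverse function of 1 - exp(-rho_i * x).
        t += -math.log(1.0 - rng.uniform()) / rho_i
        if t >= t_end:
            break
        # Step 3: jump to the selected state.
        state = nxt
        path.append((t, state))
    return path
```

For a one-item repairable structure one could call, for instance, `simulate_markov({"up": {"down": 0.01}, "down": {"up": 1.0}}, "up", 1e5, Lcg(seed=12345))` and sum the time spent in `"up"` to obtain an estimate of the average availability.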
Figure 6.40 Block diagram of the programmable generator for renewal processes (generator of uniformly distributed random numbers $\zeta_i$; comparator giving an output pulse if $\zeta_i < \lambda_k$; function generator creating the quantity $\lambda_k$, with reset; output pulse train with a pulse at each renewal point $S_1, S_2, \ldots$)
Extension to semi-Markov processes is easy [A7.2 (1974 & 1977)]. For semi-regenerative processes, states visited during a cycle must be considered (see e.g. Fig. A7.11). The advantage of this procedure is that transition sequence and stay (sojourn) times are generated with only a few random numbers. A disadvantage is that the stay times are truncated because of the amplitude quantization of $F_{ij}(x)$. To avoid truncation problems, in particular when dealing with rare events distributed on the time axis, an alternative approach, implemented as a hardware generator for semi-Markov processes in [A7.2 (1974 & 1977)], can be used. To illustrate the basic idea, Fig. 6.40 shows the structure of the generator for renewal processes. The programmable generator for arbitrary renewal processes is driven by a clock $\Delta t = \Delta x$ and consists of three main elements: a generator for (pseudo-) random numbers $\zeta_i$ uniformly distributed in (0,1); a comparator, comparing at each clock the actual random number $\zeta_i$ with $\lambda_k$ and giving an output pulse, marking a renewal point, for $\zeta_i < \lambda_k$; a function generator creating $\lambda_k$ and starting with $\lambda_1$ at each renewal point. It can be shown ($\lambda_k \equiv w_k$ in [A7.2 (1974 & 1977)]) that for
$$ \lambda_k = \frac{F(k\,\Delta x) - F((k-1)\,\Delta x)}{1 - F((k-1)\,\Delta x)} , \qquad k = 1, 2, \ldots , $$
the sequence of output pulses constitutes a realization of an ordinary renewal process with distribution function $F(k\,\Delta x)$ for the times between successive renewal points. $\lambda_k$ is the failure rate belonging to the arithmetic random variable with distribution function $F(k\,\Delta x)$ (p. 405, Appendix A7.2). Generated random times are in this case not truncated, since the last part of $F(k\,\Delta x)$ can be approximated by a geometric distribution ($\lambda_k$ constant as per Eq. (A6.132)). A software implementation of the approach shown in Fig. 6.40 is easy, and hardware limitations disappear.
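A software version of the generator of Fig. 6.40 might look as follows. This is a sketch under stated assumptions: the function names are hypothetical, `F` is any distribution function supplied by the user, and $\lambda_k$ is taken as the failure rate of the arithmetic random variable with distribution function $F(k\,\Delta x)$, i.e. the conditional probability of a renewal at tick $k$ given none before.

```python
import math
import random


def discrete_failure_rate(F, k, dx):
    """Failure rate of the arithmetic random variable at step k:
    (F(k*dx) - F((k-1)*dx)) / (1 - F((k-1)*dx))."""
    prev = F((k - 1) * dx)
    if prev >= 1.0:  # numerical guard: renewal is certain
        return 1.0
    return (F(k * dx) - prev) / (1.0 - prev)


def renewal_points(F, dx, n_clock, rng):
    """Emulate the clock-driven generator for n_clock ticks.

    At each tick a uniform random number is compared with lambda_k;
    if it is smaller, a pulse (renewal point) is produced and the
    function generator restarts with lambda_1.
    Returns the renewal points as multiples of dx.
    """
    points, k = [], 1
    for tick in range(1, n_clock + 1):
        if rng.random() < discrete_failure_rate(F, k, dx):
            points.append(tick * dx)
            k = 1  # reset of the function generator
        else:
            k += 1
    return points
```

With, for example, `F = lambda x: 1.0 - math.exp(-x * x)` and `dx = 0.01`, the mean of the generated times between renewal points should lie close to the mean of the corresponding continuous distribution, up to a discretization error of order `dx`.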
The homogeneous Poisson process (HPP) is a particular renewal process (Appendix A7.2.5) and can thus be generated (reproduced) with the generator given by Fig. 6.40 ($\lambda_k$ is constant, and the generated random time intervals have a geometric distribution). For a nonhomogeneous Poisson process (NHPP) with mean value function $M(t) = E[\nu(t)]$, generation can be based on the considerations given at the end of Appendix A7.8.2 (for fixed $t = T$, generate $k$ according to a Poisson distribution with parameter $M(T)$ (Eq. (A7.190)) and then $k$ random variables with density $m(t)/M(T)$; the ordered values are the $k$ occurrence times of the NHPP on $(0, T)$).
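The NHPP recipe given in parentheses above can be illustrated with a short sketch. The power-law mean value function $M(t) = a\,t^b$ is only an assumed example (it is not prescribed by the text), the function names are hypothetical, and the Poisson variate is drawn with the classical product-of-uniforms method, which is adequate only for moderate values of $M(T)$.

```python
import math
import random


def poisson_sample(mean, rng):
    """Draw a Poisson variate by multiplying uniform random numbers
    until the product falls below exp(-mean) (moderate means only)."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def nhpp_power_law(a, b, T, rng):
    """Occurrence times on (0, T) of an NHPP with M(t) = a * t**b.

    k is Poisson distributed with parameter M(T); the k times have
    density m(t)/M(T), i.e. distribution function (t/T)**b, and are
    generated by the inverse function method, then ordered.
    """
    k = poisson_sample(a * T**b, rng)
    return sorted(T * rng.random() ** (1.0 / b) for _ in range(k))
```

Averaged over many runs, the number of generated points should approach $M(T)$, and the fraction of points falling before $T/2$ should approach $M(T/2)/M(T)$.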
7 Statistical Quality Control & Reliability Tests
Statistical quality control and reliability tests are performed to estimate or demonstrate quality and reliability characteristics on the basis of data collected from sampling tests. Estimation leads to a point or interval estimate (marked with ^ in this book), demonstration is a test of a given hypothesis on the unknown characteristic. Estimation and demonstration of an unknown probability is investigated in Section 7.1 for the case of a defective probability $p$ and in Section 7.2.1 for some reliability figures. Procedures for availability estimation and demonstration for the case of continuous operation are given in Section 7.2.2. Estimation and demonstration of a constant failure rate $\lambda$ (or MTBF for the case $MTBF = 1/\lambda$) are discussed in depth in Section 7.2.3. The case of an MTTR is considered in Section 7.3. Basic models for accelerated tests are discussed in Section 7.4. Goodness-of-fit tests based on graphical and analytical procedures are summarized in Section 7.5. Some considerations on general reliability data analysis, with tests on nonhomogeneous Poisson processes and trend tests, are given in Section 7.6. Models for reliability growth are introduced in Section 7.7. To simplify the notation, sample is used for random sample and the index S, referring to system, is omitted in this chapter ($MTBF$ instead of $MTBF_{S0}$ and $\lambda$ or $PA$ instead of $\lambda_S$ or $PA_S$). Theoretical foundations for this chapter are in Appendix A8. Selected examples illustrate the practical aspects.
7.1 Statistical Quality Control
One of the main purposes of statistical quality control is to use sampling tests to estimate or demonstrate the defective probability $p$ of a given item, to a required accuracy and often on the basis of tests by attributes (i.e. tests of type good/bad). However, considering $p$ as an unknown probability, a broader field of applications can be covered by the same methods. Other tasks, such as tests by variables and statistical process control [7.1-7.5], are not considered hereafter. In this section, $p$ will be considered as a defective probability (fraction of defective items). It will be assumed that $p$ is the same for each element in the sample considered and that the sample elements are statistically independent of each other. These assumptions presuppose that the lot is homogeneous and much larger than the sample. They allow the use of the binomial distribution (Appendix A6.10.7).
7.1.1 Estimation of a Defective Probability p
Let $n$ be the size of a (random) sample from a large homogeneous lot. If $k$ defective items have been observed within the sample of size $n$, then
$$ \hat p = \frac{k}{n} \qquad (7.1) $$
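is the maximum likelihood point estimate of the defective probability $p$ for an item in the lot under consideration, see Eq. (A8.29). For a given confidence level $\gamma = 1 - \beta_1 - \beta_2$ ($0 < \beta_1 < 1 - \beta_2 < 1$), the lower limit $\hat p_l$ and the upper limit $\hat p_u$ of the confidence interval of $p$ can be obtained from
$$ \sum_{i=k}^{n} \binom{n}{i} \hat p_l^{\,i} (1-\hat p_l)^{n-i} = \beta_2 \quad \text{and} \quad \sum_{i=0}^{k} \binom{n}{i} \hat p_u^{\,i} (1-\hat p_u)^{n-i} = \beta_1 , \qquad (7.2) $$
for $0 < k < n$, and from
$$ \hat p_l = 0 \quad \text{and} \quad \hat p_u = 1 - \sqrt[n]{\beta_1} \quad \text{for } k = 0 \quad (\gamma = 1 - \beta_1) , \qquad (7.3) $$
or from
$$ \hat p_l = \sqrt[n]{\beta_2} \quad \text{and} \quad \hat p_u = 1 \quad \text{for } k = n \quad (\gamma = 1 - \beta_2) , \qquad (7.4) $$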
see Eqs. (A8.37) to (A8.40) and the remarks given there. $\beta_1$ is the risk that the true value of $p$ is larger than $\hat p_u$ and $\beta_2$ the risk that the true value of $p$ is smaller than $\hat p_l$. The confidence level is nearly equal to (but not less than) $\gamma = 1 - \beta_1 - \beta_2$. It can be considered as the relative frequency of cases in which the interval $[\hat p_l, \hat p_u]$ overlaps (covers) the true value of $p$, in an increasing series of repetitions of the experiment of taking a random sample of size $n$. In many practical applications, a graphical determination of $\hat p_l$ and $\hat p_u$ is sufficient. The upper diagram in Fig. 7.1 can be used for $\beta_1 = \beta_2 = 0.05$, the lower diagram for $\beta_1 = \beta_2 = 0.1$ ($\gamma = 0.9$ and $\gamma = 0.8$, respectively). The continuous lines in Fig. 7.1 are the envelopes of staircase functions ($k$, $n$ integer) given by Eq. (7.2). They converge rapidly, for $\min(np, n(1-p)) \ge 5$, to the confidence ellipses (dashed lines in Fig. 7.1). Using the confidence ellipses (Eq. (A8.42)), $\hat p_l$ and $\hat p_u$ can be calculated from
$$ \hat p_{l,u} = \frac{k + 0.5\,b^2 \mp b \sqrt{k\,(1 - k/n) + b^2/4}}{n + b^2} . \qquad (7.5) $$
$b$ is the $(1+\gamma)/2$ quantile of the standard normal distribution $\Phi(t)$, given for some typical values of $\gamma$ in Table A9.1 (e.g. $b \approx 1.28$ for $\gamma = 0.8$, $1.64$ for $\gamma = 0.9$, and $1.96$ for $\gamma = 0.95$).
Figure 7.1 Confidence limits $\hat p_l$ and $\hat p_u$ for an unknown probability $p$ (e.g. defective probability) as a function of the observed relative frequency $k/n$ ($n$ = sample size, $k$ = observed events, $\gamma$ = confidence level $= 1 - \beta_1 - \beta_2$, here with $\beta_1 = \beta_2$; continuous lines are the exact solution (Eqs. (7.2) - (7.4)), dashed lines the confidence ellipses (Eqs. (7.5), (A8.42), (A6.149))). Example: $n = 25$, $k = 5$ gives $\hat p = k/n = 0.2$ and for $\gamma = 0.9$ the confidence interval [0.08, 0.38] ([0.0823, 0.3754] using Eq. (7.2), and [0.1011, 0.3572] using Eq. (7.5))
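The confidence limits $\hat p_l$ and $\hat p_u$ can also be used as one-sided confidence intervals. In this case,
$$ 0 \le p \le \hat p_u \ \ (\text{or simply } p \le \hat p_u) , \quad \text{with } \gamma = 1 - \beta_1 , $$
$$ \hat p_l \le p \le 1 \ \ (\text{or simply } p \ge \hat p_l) , \quad \text{with } \gamma = 1 - \beta_2 . \qquad (7.6) $$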
Example 7.1 In a sample of size $n = 25$, exactly $k = 5$ items were found to be defective. Determine for the underlying defective probability $p$, (i) the point estimate, (ii) the interval estimate for $\gamma = 0.8$ ($\beta_1 = \beta_2 = 0.1$), (iii) the upper bound of $p$ for a one-sided confidence interval with $\gamma = 0.9$.
Solution (i) Equation (7.1) yields the point estimate $\hat p = 5/25 = 0.2$. (ii) For the interval estimate, the lower part of Fig. 7.1 leads to the confidence interval [0.10, 0.34], [0.1006, 0.3397] using Eq. (7.2) and [0.1175, 0.3194] using Eq. (7.5). (iii) With $\gamma = 0.9$ it holds $p \le 0.34$. Supplementary result: The upper part of Fig. 7.1 would lead to $p \le 0.38$ with $\gamma = 0.95$.
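The numbers of Example 7.1 can be checked numerically. The following sketch assumes that the exact limits are the solutions of the binomial sums described for Eqs. (7.2) to (7.4), with the upper limit tied to $\beta_1$ and the lower limit to $\beta_2$, and that the confidence-ellipse limits follow Eq. (7.5); the function names are hypothetical and the roots are found by simple bisection.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_cdf(k, n, p):
    """Pr{no more than k events in n Bernoulli trials with probability p}."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def solve_decreasing(f, lo=0.0, hi=1.0, tol=1e-10):
    """Root of a monotonically decreasing function f on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def exact_limits(k, n, beta1, beta2):
    """Lower and upper confidence limits from the binomial sums."""
    if k == 0:
        lower = 0.0
    else:
        # Pr{at least k events | lower} = beta2
        lower = solve_decreasing(
            lambda p: binom_cdf(k - 1, n, p) - (1.0 - beta2))
    if k == n:
        upper = 1.0
    else:
        # Pr{no more than k events | upper} = beta1
        upper = solve_decreasing(lambda p: binom_cdf(k, n, p) - beta1)
    return lower, upper


def ellipse_limits(k, n, gamma):
    """Approximate limits from the confidence ellipse (beta1 = beta2)."""
    b = NormalDist().inv_cdf((1.0 + gamma) / 2.0)
    centre = k + 0.5 * b * b
    half = b * sqrt(k * (1.0 - k / n) + 0.25 * b * b)
    return (centre - half) / (n + b * b), (centre + half) / (n + b * b)
```

Calling `exact_limits(5, 25, 0.1, 0.1)` and `ellipse_limits(5, 25, 0.8)` should reproduce, to about three decimal places, the two intervals quoted in the solution (the ellipse values differ slightly in the fourth decimal because the quantile $b$ is not rounded here).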
The role of $k/n$ and $p$ can be reversed and Eq. (7.5) can be used to calculate the limits $k_1$ and $k_2$ of the number of observations $k$ in $n$ independent trials (e.g. the number $k$ of defective items in a sample of size $n$) for given probability $\gamma = 1 - \beta_1 - \beta_2$ (with $\beta_1 = \beta_2$) and known values of $p$ and $n$ (Eq. (A8.45))
$$ k_{2,1} = n\,p \pm b \sqrt{n\,p\,(1-p)} . \qquad (7.7) $$
As in Eq. (7.5), the quantity $b$ in Eq. (7.7) is the $(1+\gamma)/2$ quantile of the standard normal distribution (e.g. $b = 1.64$ for $\gamma = 0.9$, Table A9.1). For a graphical solution, Fig. 7.1 can be used by taking the ordinate $p$ as known, and by reading $k_1/n$ and $k_2/n$ from the abscissa.
7.1.2 Simple Two-sided Sampling Plans for the Demonstration of a Defective Probability p
In the context of acceptance testing, the demonstration of a defective probability $p$ is often required, instead of its estimation (Section 7.1.1). The main concern of this test is to check a null hypothesis $H_0: p < p_0$ against an alternative hypothesis $H_1: p > p_1$ on the basis of the following agreement between producer and consumer:
The lot should be accepted with a probability nearly equal to (but not less than) $1-\alpha$ if the true (unknown) defective probability $p$ is lower than $p_0$, but rejected with a probability nearly equal to (but not less than) $1-\beta$ if $p$ is greater than $p_1$ ($p_0$, $p_1 > p_0$, and $0 < \alpha < 1-\beta < 1$ are given (fixed) values).
$p_0$ is the specified defective probability and $p_1$ is the maximum acceptable defective probability. $\alpha$ is the allowed producer's risk (type I error), i.e. the probability of rejecting a true hypothesis $H_0: p < p_0$. $\beta$ is the allowed consumer's risk (type II error), i.e. the probability of accepting the hypothesis $H_0: p < p_0$ when the alternative hypothesis $H_1: p > p_1$ is true. Verification of the agreement stated above is a problem of statistical hypothesis testing (Appendix A8.3) and can be performed, for instance, with a simple two-sided sampling plan or a sequential test. In both cases, the basic model is the sequence of Bernoulli trials, as introduced in Appendix A6.10.7.
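7.1.2.1 Simple Two-sided Sampling Plans
The procedure (test plan) for the simple two-sided sampling plan is as follows (Appendix A8.3.1.1):
1. From $p_0$, $p_1$, $\alpha$, and $\beta$, determine the smallest integers $c$ and $n$ which satisfy
$$ \sum_{i=0}^{c} \binom{n}{i} p_0^{\,i} (1-p_0)^{n-i} \ge 1 - \alpha \qquad (7.8) $$
and
$$ \sum_{i=0}^{c} \binom{n}{i} p_1^{\,i} (1-p_1)^{n-i} \le \beta . \qquad (7.9) $$
2. Take a sample of size $n$, determine the number $k$ of defective items in the sample, and
$$ \text{reject } H_0: p < p_0 \ \text{ if } k > c , \qquad \text{accept } H_0: p < p_0 \ \text{ if } k \le c . \qquad (7.10) $$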
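Step 1 of this procedure can be carried out by direct search. The sketch below is one possible way to do it, assuming that (7.8) requires the acceptance probability at $p_0$ to be at least $1-\alpha$ and (7.9) requires the acceptance probability at $p_1$ to be at most $\beta$; `accept_prob` and `two_sided_plan` are hypothetical names.

```python
def accept_prob(c, n, p):
    """Pr{no more than c defective items in a sample of size n | p},
    for 0 < p < 1, accumulated term by term to avoid large factorials."""
    q = 1.0 - p
    term = q**n  # probability of 0 defective items
    total = term
    for i in range(c):
        term *= (n - i) / (i + 1) * p / q
        total += term
    return total


def two_sided_plan(p0, p1, alpha, beta, c_max=100):
    """Smallest c, and smallest n for this c, satisfying both inequalities.

    For fixed c the acceptance probability decreases with n, so the
    smallest n meeting the consumer condition is also the most
    favourable one for the producer condition.
    """
    for c in range(c_max + 1):
        n = c + 1
        while accept_prob(c, n, p1) > beta:  # consumer condition
            n += 1
        if accept_prob(c, n, p0) >= 1.0 - alpha:  # producer condition
            return c, n
    raise ValueError("no plan found up to c_max")
```

Because the search uses the binomial distribution directly, the resulting values of `c` and `n` can differ slightly from tabulated plans that rely on the Poisson approximation.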
The graph of Fig. 7.2 visualizes the validity of the above rule (see Appendix A8.3.1.1 for a proof). It satisfies the inequalities (7.8) and (7.9), and is known as the operating characteristic curve. For each value of $p$, it gives the probability of having no more than $c$ defective items in a sample of size $n$. Since the operating characteristic curve as a function of $p$ decreases monotonically, the risk for a false decision decreases for $p < p_0$ and $p > p_1$, respectively. It can be shown that the quantities $c$ and $n\,p_0$ depend only on $\alpha$, $\beta$, and the ratio $p_1/p_0$ (discrimination ratio). Table 7.3 (p. 301) gives $c$ and $n\,p_0$ for some important values of $\alpha$, $\beta$ and $p_1/p_0$ for the case where the Poisson approximation (Eq. (A6.129)) applies. Using the operating characteristic, the Average Outgoing Quality (AOQ) can be calculated. AOQ represents the percentage of defective items that reach the customer, assuming that all rejected samples have been 100% inspected, and that the defective items have been replaced by good ones, and is given by
$$ AOQ = p \, \Pr\{\text{acceptance} \mid p\} = p \sum_{i=0}^{c} \binom{n}{i} p^i (1-p)^{n-i} . \qquad (7.11) $$
Figure 7.2 Operating characteristic curve, $\Pr\{\text{acceptance} \mid p\} = \Pr\{\text{no more than } c \text{ defects in a sample of size } n \mid p\}$, as a function of the defective probability $p$ for given (fixed) $n$ and $c$ ($p_0 = 2\%$, $p_1 = 4\%$, $\alpha = \beta = 0.1$; $n = 510$ and $c = 14$ as per Table 7.3)
The maximum value of AOQ is the Average Outgoing Quality Limit [7.4, 7.5]. Obtaining the solution of inequalities (7.8) and (7.9) is time-consuming. For small values of $p_0$ & $p_1$ (up to a few %), the Poisson approximation (Eq. (A6.129))
$$ \binom{n}{i} p^i (1-p)^{n-i} \approx \frac{(n\,p)^i}{i!}\, e^{-n p} \qquad (7.12) $$
can be used. Substituting the approximate value obtained by Eq. (7.12) in Eqs. (7.8) and (7.9) leads to a Poisson distribution with parameters $m_1 = n\,p_1$ and $m_0 = n\,p_0$, which can be solved using a table of the $\chi^2$ distribution (Table A9.2). Alternatively, the curves of Fig. 7.3 provide graphical solutions, sufficiently good for practical applications. Exact solutions are in Table 7.3 (p. 301).
Example 7.2 Determine the sample size $n$ and the number of allowed defective items $c$ to test the null hypothesis $H_0: p < p_0 = 1\%$ against the alternative hypothesis $H_1: p > p_1 = 2\%$ with producer and consumer risks $\alpha = \beta = 0.1$ (which means $\alpha$ and $\beta \le 0.1$).
Solution For $\alpha = \beta = 0.1$, Table A9.2 yields $\nu = 30$ (value of $\nu$ for which $\chi^2_{\nu,q_1} / \chi^2_{\nu,q_2} = 2$ with $q_1 \ge 1-\alpha = 0.9$ and $q_2 \le \beta = 0.1$) and, with linear interpolation, $F(20.4) = 0.095 < \beta$ and $F(40.8) = 0.908 > 1-\alpha$ ($\nu = 28$ falls just short). Thus $c = \nu/2 - 1 = 14$ and $n = 20.4/(2 \cdot 0.01) = 1020$. The values of $c$ and $n$ according to Table 7.3 would be $c = 14$ and $n = 10.17/0.01 = 1017$. Using the graph of Fig. 7.3 yields practically the same result: $c = 14$, $m_0 = 10.2$ and $m_1 = 20.4$ for $\alpha = \beta = 0.1$. Both the analytical and graphical methods require a solution by successive approximation (choice of $c$ and check of conditions for $\alpha$ and $\beta$ by considering the ratio $p_1/p_0$).
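The successive approximation mentioned at the end of Example 7.2 can be automated under the Poisson approximation. The sketch below is illustrative only: the function names are hypothetical, and instead of a $\chi^2$ table it solves the Poisson sums by bisection. For each $c$ it computes the range of $m_0 = n\,p_0$ for which both risk conditions hold and returns the first $c$ with a non-empty range.

```python
from math import exp


def poisson_cdf(c, m):
    """Pr{no more than c events} for a Poisson distribution with parameter m."""
    term = total = exp(-m)
    for i in range(1, c + 1):
        term *= m / i
        total += term
    return total


def solve_m(c, target, lo=0.0, hi=1000.0):
    """Parameter m for which poisson_cdf(c, m) equals target (bisection;
    poisson_cdf decreases monotonically with m)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(c, mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def poisson_plan(p0, p1, alpha, beta, c_max=100):
    """Smallest c and admissible range of m0 = n*p0 (Poisson approximation)."""
    ratio = p1 / p0
    for c in range(c_max + 1):
        m0_min = solve_m(c, beta) / ratio   # consumer condition at m1 = n*p1
        m0_max = solve_m(c, 1.0 - alpha)    # producer condition at m0 = n*p0
        if m0_min <= m0_max:
            return c, m0_min, m0_max
    raise ValueError("no plan found up to c_max")
```

For the data of Example 7.2, `poisson_plan(0.01, 0.02, 0.1, 0.1)` returns `c = 14` together with a narrow interval of admissible `m0` values; any `n = m0 / p0` with `m0` in that interval satisfies both conditions, which is consistent with the slightly different values of `n` quoted in the solution.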
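7.1.2.2 Sequential Tests
The procedure for a sequential test is as follows (Appendix A8.3.1.2):
1. In a Cartesian coordinate system draw the acceptance line $y_1(n) = a\,n - b_1$ and the rejection line $y_2(n) = a\,n + b_2$, with
$$ a = \frac{\ln\frac{1-p_0}{1-p_1}}{\ln\frac{p_1}{p_0} + \ln\frac{1-p_0}{1-p_1}} , \qquad b_1 = \frac{\ln\frac{1-\alpha}{\beta}}{\ln\frac{p_1}{p_0} + \ln\frac{1-p_0}{1-p_1}} , \qquad b_2 = \frac{\ln\frac{1-\beta}{\alpha}}{\ln\frac{p_1}{p_0} + \ln\frac{1-p_0}{1-p_1}} . $$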
2. Select one item after another from the lot, test the item, enter the test result in the diagram drawn in step 1, and stop the test as soon as either the rejection or the acceptance line is crossed.
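The two steps can be sketched as follows. The slope and intercepts are those of the standard sequential probability ratio test for Bernoulli trials, which is assumed here to be the test meant by the text; the function names and the convention of representing each test result as a boolean are choices of this sketch.

```python
from math import log


def sequential_lines(p0, p1, alpha, beta):
    """Slope a and intercepts b1, b2 of the acceptance line
    y1(n) = a*n - b1 and the rejection line y2(n) = a*n + b2."""
    denom = log(p1 / p0) + log((1.0 - p0) / (1.0 - p1))
    a = log((1.0 - p0) / (1.0 - p1)) / denom
    b1 = log((1.0 - alpha) / beta) / denom
    b2 = log((1.0 - beta) / alpha) / denom
    return a, b1, b2


def sequential_test(results, p0, p1, alpha, beta):
    """Run the test on an iterable of booleans (True = defective item).

    Returns the decision ("accept", "reject" or "continue" if the data
    run out before a line is crossed) and the number of items tested.
    """
    a, b1, b2 = sequential_lines(p0, p1, alpha, beta)
    defects = n = 0
    for n, defective in enumerate(results, start=1):
        defects += bool(defective)
        if defects >= a * n + b2:   # rejection line reached
            return "reject", n
        if defects <= a * n - b1:   # acceptance line reached
            return "accept", n
    return "continue", n
```

With $p_0 = 1\%$, $p_1 = 2\%$ and $\alpha = \beta = 0.2$, a run of defect-free items crosses the acceptance line after 137 items, since $b_1 / a \approx 136.5$.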
Figure 7.3 Poisson distribution (results for Examples 7.2 ($c = 14$), 7.4 (7), 7.5 (0), 7.6 (2 & 0), 7.9 (6))
Figure A8.8 shows acceptance and rejection lines for $p_0 = 1\%$, $p_1 = 2\%$ and $\alpha = \beta = 0.2$. The advantage of the sequential test is that on average it requires a smaller sample size than the corresponding simple two-sided sampling plan (Example 7.10 or Fig. 7.8). A disadvantage is that the test duration (sample size) is random.
7.1.3 One-sided Sampling Plans for the Demonstration of a Defective Probability p
The two-sided sampling plans of Section 7.1.2 are fair in the sense that for $\alpha = \beta$, both producer and consumer run the same risk of making a false decision. In practical applications however, one-sided sampling plans are often used, i.e. only $p_0$ and $\alpha$ or $p_1$ and $\beta$ are specified. In these cases, the operating characteristic curve is not completely defined. For every value of $c$ ($c = 0, 1, \ldots$) a largest $n$ ($n = 1, 2, \ldots$) exists which satisfies inequality (7.8) for a given $p_0$ and $\alpha$, or a smallest $n$ exists which satisfies inequality (7.9) for a given $p_1$ and $\beta$. It can be shown that the operating characteristic curves become steeper as the value of $c$ increases (see e.g. Figs. 7.4 or A8.9). Hence, for small values of $c$, the producer (if $p_0$ and $\alpha$ are given) or the consumer (if $p_1$ and $\beta$ are given) can be favored. Figure 7.4 visualizes the reduction of the consumer risk ($\beta \approx 0.95$ for $p = 0.0065$) by increasing values of the defective probability $p$ or values of $c$, see Fig. 7.9 for a counterpart. When only $p_0$ and $\alpha$ or $p_1$ and $\beta$ are given, it is usual to set in these cases
$$ p_0 = AQL \quad \text{and} \quad p_1 = LTPD , \qquad (7.13) $$
respectively, where AQL is the Acceptable Quality Level and LTPD is the Lot Tolerance Percent Defective (Eqs. (A8.77) to (A8.82)). A large number of one-sided sampling plans for the demonstration of AQL values are given in national and international standards (IEC 60410, ISO 2859, MIL-STD-105, DIN 40080 [7.3]). Many of these plans have been established empirically. The following remarks can be useful when evaluating such plans:
1. AQL values are given in %.
2. The values for $n$ and $c$ are in general obtained using the Poisson approximation.
3. Not all values of $c$ are listed; the value of $\alpha$ often decreases with increasing $c$.
4. Sample size is related to lot size, and this relationship is empirical.
5. A distinction is made between reduced tests (level I), normal tests (level II) and tightened tests (level III); level II is normally used; transition from one level to another is often given empirically (e.g. transition from level II to level III is necessary if 2 out of 5 successive independent lots have been rejected, and a return to level II follows if 5 successive independent lots are passed).
6. The value of $\alpha$ is not given explicitly (for $c = 0$, for example, $\alpha$ is approximately 0.05 for level I, 0.1 for level II, and 0.2 for level III).
Figure 7.4 Operating characteristic curves ($\Pr\{\text{acceptance} \mid p\}$ as a function of $p$) for demonstration of an AQL = 0.65% with sample sizes $n$ = 20, 80, and 500 as per Table 7.1 (Code F: $n = 20$, $c = 0$; Code J: $n = 80$, $c = 1$; Code N: $n = 500$, $c = 7$; $\alpha \approx 0.11$ for $c = 0$, $\approx 0.09$ for $c = 1$, $\approx 0.03$ for $c = 7$)
Table 7.1 presents some test procedures for AQL values from IEC 60410 [7.3] and Fig. 7.4 shows the corresponding operating characteristic curves for AQL = 0.65% and sample size $n$ = 20, 80, and 500. Test procedures for demonstration of LTPD values with given (fixed) customer risk $\beta$ are for example in [3.11 (S-19500)]. They are often based on the Poisson approximation (Eq. (A6.129)) and can be easily established using a $\chi^2$-table (Appendix A9.2) or Fig. 7.3. For given $\beta$ and LTPD, the values of $n$ and $c$ can be obtained taking in Fig. 7.3
$$ \sum_{i=0}^{c} \frac{m^i}{i!}\, e^{-m} = \beta $$
and reading $m = n\,p = n \cdot LTPD$ for $c = 0, 1, 2, \ldots$ (Example: $\beta = 0.1$, LTPD = 2% yields $m = 3.9$ for $c = 1$, and from this $n = 3.9/0.02 = 195$; the procedure is thus: test 195 items and reject LTPD = 2% if more than 1 defect occurs.) In addition to the simple one-sided sampling plans described above, multiple one-sided sampling plans are often used to demonstrate AQL values. In a double one-sided sampling plan, the following procedure is used:
1. Take a first sample of size $n_1$ and accept definitely if no more than $c_1$ defects occur, but reject definitely if $d_1$ or more defects have occurred.
2. If after the first sample the number of defects is greater than $c_1$ but less than $d_1$, take a second sample of size $n_2$ and accept if there are in total (in the first and second sample) no more than $c_2$ defects; otherwise reject.
The operating characteristic curve or acceptance probability for a double one-sided sampling plan can be calculated as
$$ \Pr\{\text{acceptance} \mid p\} = \sum_{i=0}^{c_1} \binom{n_1}{i} p^i (1-p)^{n_1-i} + \sum_{i=c_1+1}^{d_1-1} \left[ \binom{n_1}{i} p^i (1-p)^{n_1-i} \sum_{j=0}^{c_2-i} \binom{n_2}{j} p^j (1-p)^{n_2-j} \right] . $$
Multiple one-sided sampling plans are also given in national and international standards, see for example IEC 60410 [7.3] for the following double one-sided sampling plan to demonstrate AQL = 1%

Lot size          n1    n2    c1   d1   c2
281 - 500         32    32    0    2    1
501 - 1,200       50    50    0    3    3
1,201 - 3,200     80    80    1    4    4
3,201 - 10,000    125   125   2    5    6

The advantage of multiple one-sided sampling plans is that on average they require smaller sample sizes than would be necessary for simple one-sided sampling plans. A disadvantage is that the test duration is not fixed in advance.
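The acceptance probability and the average sample size of a double one-sided sampling plan follow directly from the two-step procedure and can be evaluated as in the sketch below. The function names are hypothetical; the plan used in the usage note ($n_1 = n_2 = 32$, $c_1 = 0$, $d_1 = 2$, $c_2 = 1$) is the first row of the table above.

```python
from math import comb


def binom_pmf(i, n, p):
    """Pr{exactly i defects in a sample of size n | p}."""
    return comb(n, i) * p**i * (1.0 - p) ** (n - i)


def binom_cdf(c, n, p):
    """Pr{no more than c defects in a sample of size n | p} (0 if c < 0)."""
    return sum(binom_pmf(i, n, p) for i in range(c + 1))


def double_plan_accept(p, n1, n2, c1, d1, c2):
    """Pr{acceptance | p} for a double one-sided sampling plan."""
    prob = binom_cdf(c1, n1, p)  # accepted on the first sample
    for i in range(c1 + 1, d1):  # second sample required
        prob += binom_pmf(i, n1, p) * binom_cdf(c2 - i, n2, p)
    return prob


def average_sample_size(p, n1, n2, c1, d1):
    """Mean number of items tested: n1 plus n2 times the probability
    that the second sample is needed."""
    second = sum(binom_pmf(i, n1, p) for i in range(c1 + 1, d1))
    return n1 + n2 * second
```

For $p = 1\%$, `double_plan_accept(0.01, 32, 32, 0, 2, 1)` gives an acceptance probability of about 0.89 and `average_sample_size(0.01, 32, 32, 0, 2)` an average of about 39.5 tested items.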
Table 7.1 Test procedures for AQL demonstration (test level II, from IEC 60410 [7.3]). Rows: lot size $N$ (2-8, 9-15, 16-25, 26-50, 51-90, 91-150, 151-280, 281-500, 501-1200, 1.2k-3.2k, 3.2k-10k, 10k-35k, 35k-150k, 150k-500k, over 500k) with sample size code (A, B, C, D, E, F, G, H, J, K, L, M, N, P, Q) and sample size $n$; columns: AQL in % (0.04, 0.065, 0.10, 0.15, 0.25, 0.40, 0.65, 1.0, 1.5, 2.5, 4.0, 6.5); entries: $c$ = number of allowed defects; use the first sampling plan above for ↑ or below for ↓
7.2 Statistical Reliability Tests
Reliability tests are useful to evaluate the reliability achieved in a given item. Early initiation of such tests allows quick identification and cost-effective correction of weaknesses not discovered by reliability analyses. This supports a learning process, often related to a reliability growth program (Section 7.7). Since reliability tests are generally time-consuming and expensive, they must be coordinated with other tests. Test conditions should be as close as possible to those experienced in the field. As with quality control, a distinction is made between estimation and demonstration of a specific reliability figure. Section 7.2.1 uses results of Section 7.1 for reliability and availability testing for the case of a given (fixed) mission. In Section 7.2.2 a unified method for availability estimation and demonstration for the case of continuous operation is introduced. Section 7.2.3 deals carefully with estimation & demonstration of a constant failure rate $\lambda$ (or of MTBF for the case $MTBF = 1/\lambda$). In addition, maintainability tests are considered in Section 7.3, accelerated tests in Section 7.4, goodness-of-fit tests in Section 7.5, general reliability data analysis and trend tests in Section 7.6, and reliability growth in Section 7.7. To simplify notations, the index S, referring to system, is omitted ($R$, $PA$, $MTBF$, $\lambda$ for $R_{S0}$, $PA_S$, $MTBF_{S0}$, $\lambda_S$).
7.2.1 Reliability & Availability Estimation and Demonstration for the Case of a given fixed Mission
Reliability ($R$) and availability (asymptotic & steady-state point and average availability $PA = AA$) are often defined as success probability for a given (fixed) mission. Their estimation and demonstration can thus be performed as for an unknown probability $p$ (Section 7.1) by setting
$$ p = 1 - R \quad \text{or} \quad p = 1 - AA . $$
For a demonstration, the null hypothesis $H_0: p < p_0$ is converted to $H_0: R > R_0$ or $H_0: AA > AA_0$, which adheres better to the concept of reliability or availability. The same holds for any other reliability figure expressed as an unknown probability $p$. The above considerations hold for a given (fixed) mission, repeated for reliability tests as $n$ Bernoulli trials. However, for the case of continuous operation, estimation and demonstration of an availability can lead to a difficulty in defining the time points $t_1, t_2, \ldots, t_n$ at which the $n$ observations according to Eqs. (7.2) - (7.4) or (7.8) - (7.10) have to be performed. The case of continuous operation is considered in Section 7.2.2 for availability and Section 7.2.3 for reliability. Examples 7.3 - 7.6 illustrate some cases of reliability tests for a given fixed mission.
Example 7.3 In a reliability test 95 of 100 items pass. Give the confidence interval for $R$ at $\gamma = 0.9$ ($\beta_1 = \beta_2$).
Solution With $p = 1 - R$ and $\hat R = 0.95$ the confidence interval for $p$ follows from Fig. 7.1 as [0.03, 0.10]. The confidence interval for $R$ is then [0.90, 0.97]. (Eq. (7.5) leads to [0.901, 0.975] for $R$.)
Example 7.4 The reliability of a given subassembly was $R = 0.9$ and should have been improved through constructive measures. In a test of 100 subassemblies, 94 of them pass the test. Check with a type I error $\alpha = 20\%$ the hypothesis $H_0: R > 0.95$.
Solution For $p_0 = 1 - R_0 = 0.05$, $\alpha = 20\%$, and $n = 100$, Eq. (7.8) delivers $c = 7$ (see also the graphical solution from Fig. 7.3 with $m = n\,p_0 = 5$ and acceptance probability $\ge 1 - \alpha = 0.8$, yielding $\alpha \approx 0.15$ for $m = 5$ and $c = 7$). As just $k = 6$ subassemblies have failed the test, the hypothesis $H_0: R > 0.95$ can be accepted (must not be rejected) at the level $1 - \alpha = 0.8$. Supplementary result: Assuming as an alternative hypothesis $H_1: R < 0.90$, or $p > p_1 = 0.1$, the type II error $\beta$ can be calculated from Eq. (7.9) with $c = 7$ & $n = 100$ or graphically from Fig. 7.3 with $m = n\,p_1 = 10$, yielding $\beta \approx 0.2$.
Example 7.5 Determine the minimum number of tests $n$ that must be repeated to verify the hypothesis $H_0: R > R_1 = 0.95$ with a consumer risk $\beta = 0.1$. What is the allowed number of failures $c$?
Solution The inequality (7.9) must be fulfilled with $p_1 = 1 - R_1 = 0.05$ and $\beta = 0.1$; $n$ and $c$ must thus satisfy
$$ \sum_{i=0}^{c} \binom{n}{i} 0.05^{\,i} \cdot 0.95^{\,n-i} \le 0.1 . $$
The number of tests $n$ is a minimum for $c = 0$. From $0.95^n = 0.1$, it follows that $n = 45$ (calculation with the Poisson approximation (Eq. (7.12)) yields $n = 46$, graphical solution with Fig. 7.3 leads to $m = 2.3$ and then $n = m/p = 46$).
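Example 7.6 Continuing with Example 7.5, (i) find $n$ for $c = 2$ and (ii) how large would the producer risk be for $c = 0$ and $c = 2$ if the true reliability were $R = 0.97$?
Solution (i) From Eq. (7.9),
$$ \sum_{i=0}^{2} \binom{n}{i} 0.05^{\,i} \cdot 0.95^{\,n-i} \le 0.1 $$
and thus $n = 105$ (Fig. 7.3 yields $m = 5.3$ and $n = 106$; from Table A9.2, $\nu = 6$, $\chi^2_{6,\,0.9} = 10.645$ and $n = 107$).
(ii) The producer risk is
$$ \alpha = 1 - \sum_{i=0}^{c} \binom{n}{i} 0.03^{\,i} \cdot 0.97^{\,n-i} , $$
hence, $\alpha \approx 0.75$ for $c = 0$ and $n = 45$, $\alpha \approx 0.61$ for $c = 2$ and $n = 105$ (Fig. 7.3 yields $\alpha \approx 0.75$ for $c = 0$ and $m = 1.35$, $\alpha \approx 0.62$ for $c = 2$ and $m = 3.15$; from Table A9.2, $\alpha \approx 0.73$ for $\nu = 2$ and $\chi^2_{2,\alpha} = 2.7$, $\alpha \approx 0.61$ for $\nu = 6$ and $\chi^2_{6,\alpha} = 6.3$).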
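The binomial calculations of Examples 7.5 and 7.6 are easy to reproduce. The sketch below uses hypothetical function names and assumes, as in the examples, that the consumer condition is the inequality (7.9) and that the producer risk is one minus the acceptance probability at the true defective probability.

```python
from math import comb


def accept_prob(c, n, p):
    """Pr{no more than c failures in n trials | failure probability p}."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(c + 1))


def min_tests(c, p1, beta):
    """Smallest n for which the acceptance probability at p1 is <= beta."""
    n = c + 1
    while accept_prob(c, n, p1) > beta:
        n += 1
    return n


def producer_risk(c, n, p):
    """Probability of rejection when the true failure probability is p."""
    return 1.0 - accept_prob(c, n, p)
```

Here `min_tests(0, 0.05, 0.1)` and `min_tests(2, 0.05, 0.1)` give the sample sizes of the two examples, and `producer_risk(0, 45, 0.03)` and `producer_risk(2, 105, 0.03)` the corresponding producer risks for $R = 0.97$.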
7.2.2 Availability Estimation and Demonstration for the Case of Continuous Operation (asymptotic & steady-state)
Availability estimation & demonstration for a repairable item in continuous operation can be based on results given in Section 6.2 for the one-item repairable structure. A point estimate (with corresponding mean and variance) for the availability can be found for arbitrary distributions of failure-free and repair times (Section 7.2.2.3). However, interval estimation and demonstration tests can lead to some difficulties. A unified approach for estimating & demonstrating the asymptotic and steady-state point and average availability $PA = AA$ for the case of exponentially or Erlangian distributed failure-free and repair times is introduced in Appendices A8.2.2.4 & A8.3.1.4 (to simplify the notation, $PA = AA$ is used for $PA_S = AA_S$). Sections 7.2.2.1 and 7.2.2.2 deal with this approach. Only the case of exponentially distributed failure-free and repair times, i.e. constant failure and repair rates ($\lambda(x) = \lambda$, $\mu(x) = \mu$), is considered here; extension to Erlangian distributions is easy. Point and average unavailability converge for this case rapidly ($1 - PA_{S0}(t)$ and $1 - AA_{S0}(t)$ in Table 6.3) to the asymptotic & steady-state value $\overline{PA} = 1 - PA = 1 - AA = \lambda/(\lambda+\mu) \approx \lambda/\mu$. To simplify considerations, it will be assumed that the observed time interval $(0, t]$ is $\gg 1/\mu$, terminates by a repair, and exactly $k$ (or $n$) failure-free times $\tau_i$ and corresponding repair times $\tau_i'$ have occurred (see Section 7.2.2.3 for other possibilities). Furthermore, $\lambda \ll \mu$ will be assumed for the estimation, i.e.
$$ \overline{PA}_a = \lambda / \mu $$
is considered instead of $\overline{PA} = \lambda/(\lambda+\mu)$ in Section 7.2.2.1 (relative error of same magnitude as $\overline{PA}_a$). $\lambda/\mu$ is a probabilistic value of the asymptotic & steady-state unavailability and has its statistical counterpart in $DT/UT$, where $DT$ and $UT$ are the observed down and up times. The procedure given in Appendices A8.2.2.4 and A8.3.1.4 is based on the fact that the quantity $\mu \cdot DT / (\lambda \cdot UT)$ is distributed according to a Fisher distribution (F-distribution) with $\nu_1 = \nu_2 = 2k$ degrees of freedom. Section 7.2.2.1 deals with estimation and Section 7.2.2.2 with demonstration of $\overline{PA}$.
7.2.2.1 Availability Estimation
Having observed for a repairable item described by Fig. 6.2, with constant failure and repair rates $\lambda(x) = \lambda$ & $\mu(x) = \mu \gg \lambda$, an operating time $UT = \tau_1 + \ldots + \tau_k$ and a repair time $DT = \tau_1' + \ldots + \tau_k'$, the maximum likelihood point estimate for $\overline{PA}_a = \lambda/\mu$ is
$$ \hat{\overline{PA}}_a = \widehat{(\lambda/\mu)} = \frac{DT}{UT} , $$
unbiased being $(1 - 1/k)\,DT/UT$, $k > 1$ (Example A8.10). $\overline{PA}_a = \lambda/\mu$ is an approximation for $\overline{PA} = \overline{AA} = \lambda/(\lambda+\mu)$, sufficiently good for practical applications (relative error of same magnitude as $\overline{PA}_a$). For given $\beta_1$, $\beta_2$, $\gamma = 1 - \beta_1 - \beta_2$ ($0 < \beta_1 < 1 - \beta_2 < 1$), lower and upper confidence limits for $\overline{PA}_a$ are (Eq. (A8.65))
$$ \hat{\overline{PA}}_l = \hat{\overline{PA}}_a \,/\, F_{2k,\,2k,\,1-\beta_2} \quad \text{and} \quad \hat{\overline{PA}}_u = \hat{\overline{PA}}_a \cdot F_{2k,\,2k,\,1-\beta_1} , \qquad (7.17) $$
where $F_{2k,\,2k,\,1-\beta_2}$ & $F_{2k,\,2k,\,1-\beta_1}$ are the $1-\beta_2$ & $1-\beta_1$ quantiles of the Fisher (F) distribution (Appendix A9.4, [A9.3-A9.6]). Figure 7.5 gives the confidence limits $\hat{\overline{PA}}_l / \hat{\overline{PA}}_a$ and $\hat{\overline{PA}}_u / \hat{\overline{PA}}_a$ for $\beta_1 = \beta_2 = (1-\gamma)/2$, useful for practical applications (Example A8.8). One-sided confidence intervals are
$$ 0 \le \overline{PA} \le \hat{\overline{PA}}_u , \ \text{with } \gamma = 1 - \beta_1 , \quad \text{and} \quad \hat{\overline{PA}}_l \le \overline{PA} \le 1 , \ \text{with } \gamma = 1 - \beta_2 . \qquad (7.18) $$
Figure 7.5 Confidence limits $\hat{\overline{PA}}_l / \hat{\overline{PA}}_a$ and $\hat{\overline{PA}}_u / \hat{\overline{PA}}_a$ (Eq. (7.17)) as a function of $k$ for an unknown asymptotic & steady-state unavailability $\overline{PA} = 1 - PA = 1 - AA$ ($\hat{\overline{PA}}_a = DT/UT$ = maximum likelihood estimate for $\lambda/\mu$ ($UT = \tau_1 + \ldots + \tau_k$, $DT = \tau_1' + \ldots + \tau_k'$); $\gamma = 1 - \beta_1 - \beta_2$ = confidence level (here $\beta_1 = \beta_2 = (1-\gamma)/2$)); result for Example A8.8
Corresponding values for the availability can be obtained using PA = 1- PA. If failure free andlor repair times are Erlangian distributed (Eq. (A6.102)) with F 2 k , 2 k , l - ß 2 and F 2 k , 2 k + ~ - ß ,have t0 be replaced by F 2 k ß , . ; k ß h , l - B , and ßh (for unchanged MTTF & MTTR, see Example A8.11). 6, =D T ~ UT F 2 k ß h ,2 k p , remains valid. Results based only on DT are not free of pararneters (Section7.2.2.3).
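The confidence limits of Eq. (7.17) can be evaluated numerically. The following is a minimal sketch, not the author's implementation: it assumes constant failure and repair rates, uses only the Python standard library, and obtains the quantile of the F-distribution with $2k$ and $2k$ degrees of freedom from the fact that $F/(1+F)$ is then Beta$(k,k)$ distributed. The function names and the observed values (8 cycles, 20 h down, 4000 h up) are hypothetical.

```python
from math import comb

def f_cdf_equal_df(f, k):
    """CDF of the F-distribution with 2k and 2k degrees of freedom (integer k)."""
    # F/(1+F) is Beta(k, k) distributed; for integer k the regularized
    # incomplete beta function reduces to a binomial tail sum.
    x = f / (1.0 + f)
    m = 2 * k - 1
    return sum(comb(m, j) * x**j * (1.0 - x)**(m - j) for j in range(k, m + 1))

def f_quantile_equal_df(p, k, hi=1e4):
    """p-quantile of F(2k, 2k), found by bisection on the CDF."""
    lo = 0.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f_cdf_equal_df(mid, k) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def unavailability_limits(dt, ut, k, beta1, beta2):
    """Point estimate DT/UT and two-sided limits as in Eq. (7.17)."""
    est = dt / ut
    lower = est / f_quantile_equal_df(1.0 - beta2, k)
    upper = est * f_quantile_equal_df(1.0 - beta1, k)
    return est, lower, upper

# Hypothetical observation: k = 8 failure/repair cycles, DT = 20 h, UT = 4000 h
est, lower, upper = unavailability_limits(20.0, 4000.0, 8, 0.1, 0.1)
print(est, lower, upper)
```

For non-integer shape parameters (Erlangian case with general $\beta_\lambda$, $\beta_\mu$) the binomial shortcut no longer applies and a general F-quantile routine would be needed.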
7.2.2.2 Availability Demonstration

In the context of an acceptance test, demonstration of the asymptotic & steady-state point and average availability ($PA = AA$) is often required. For practical applications it is useful to work with the unavailability $\overline{PA} = 1 - PA$. The main concern of this test is to check a null hypothesis $H_0: \overline{PA} < \overline{PA}_0$ against an alternative hypothesis $H_1: \overline{PA} > \overline{PA}_1$ on the basis of the following agreement between producer and consumer:

The item should be accepted with a probability nearly equal to (but not less than) $1-\alpha$ if the true (unknown) unavailability $\overline{PA}$ is lower than $\overline{PA}_0$, but rejected with a probability nearly equal to (but not less than) $1-\beta$ if $\overline{PA}$ is greater than $\overline{PA}_1$ ($\overline{PA}_0$, $\overline{PA}_1 > \overline{PA}_0$, and $0 < \alpha < 1-\beta < 1$ are given (fixed) values).

$\overline{PA}_0$ is the specified unavailability and $\overline{PA}_1$ is the maximum acceptable unavailability. $\alpha$ is the allowed producer's risk (type I error), i.e. the probability of rejecting a true hypothesis $H_0: \overline{PA} < \overline{PA}_0$. $\beta$ is the allowed consumer's risk (type II error), i.e. the probability of accepting the hypothesis $H_0: \overline{PA} < \overline{PA}_0$ when the alternative hypothesis $H_1: \overline{PA} > \overline{PA}_1$ is true. Verification of the agreement stated above is a problem of statistical hypothesis testing (Appendix A8.3), and different approaches are possible. In the following, the method introduced in Appendix A8.3.1.4 is given (comparison with other methods is in Section 7.2.2.3). Assuming constant failure and repair rates $\lambda(x) = \lambda$ and $\mu(x) = \mu$, the procedure is as follows (see also [A8.28, A2.6 (IEC 61070)]):
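1. For given (fixed) $\overline{PA}_0$, $\overline{PA}_1$, $\alpha$, and $\beta$ ($0 < \alpha < 1-\beta < 1$), find the smallest integer $n$ ($1, 2, \dots$) which satisfies (Eq. (A8.91))

$$F_{2n,2n,1-\alpha} \cdot F_{2n,2n,1-\beta} \le \frac{\overline{PA}_1}{\overline{PA}_0} \cdot \frac{PA_0}{PA_1} = \frac{(1 - PA_1)\,PA_0}{(1 - PA_0)\,PA_1},$$

where $F_{2n,2n,1-\alpha}$ and $F_{2n,2n,1-\beta}$ are the $1-\alpha$ and $1-\beta$ quantiles of the F-distribution (Appendix A9.4), and compute the limiting value (Eq. (A8.92))

$$\delta = F_{2n,2n,1-\alpha}\,\overline{PA}_0 / PA_0 = F_{2n,2n,1-\alpha}\,(1 - PA_0)/PA_0 .$$

2. Observe $n$ failure-free times $t_1, \dots, t_n$ and the corresponding repair times $t_1', \dots, t_n'$, and

reject $H_0: \overline{PA} < \overline{PA}_0$, if $(t_1' + \dots + t_n')/(t_1 + \dots + t_n) > \delta$,
accept $H_0: \overline{PA} < \overline{PA}_0$, if $(t_1' + \dots + t_n')/(t_1 + \dots + t_n) \le \delta$.

Table 7.2 gives $n$ and $\delta$ for some values of $\overline{PA}_1/\overline{PA}_0$ used in practical applications. It must be noted that the test duration is not fixed in advance. However, results for fixed time sample plans are not free of parameters (remark to Eq. (7.22)).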
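Table 7.2 Number $n$ of failure-free times $\tau_1, \dots, \tau_n$ & corresponding repair (restoration) times $\tau_1', \dots, \tau_n'$, and limiting value $\delta$ of the observed ratio $(t_1' + \dots + t_n')/(t_1 + \dots + t_n)$ to demonstrate $\overline{PA} < \overline{PA}_0$ against $\overline{PA} > \overline{PA}_1$ for various values of $\alpha$ (producer risk), $\beta$ (consumer risk), and $\overline{PA}_1/\overline{PA}_0$

| | $\overline{PA}_1/\overline{PA}_0 = 2$ | $\overline{PA}_1/\overline{PA}_0 = 4$ | $\overline{PA}_1/\overline{PA}_0 = 6$ |
|---|---|---|---|
| $\alpha \approx \beta \lesssim 0.1$ | $n = 29$, $\delta = 1.41\,\overline{PA}_0/PA_0$ ($PA_0 > 0.99$)* | $n = 8$, $\delta = 1.93\,\overline{PA}_0/PA_0$ ($PA_0 > 0.99$)* | $n = 5$, $\delta = 2.32\,\overline{PA}_0/PA_0$ ($PA_0 > 0.98$)* |
| $\alpha \approx \beta \lesssim 0.2$ | $n = 13$, $\delta = 1.39\,\overline{PA}_0/PA_0$ ($PA_0 > 0.99$)* | $n = 4$, $\delta = 1.86\,\overline{PA}_0/PA_0$ ($PA_0 > 0.98$)* | $n = 3$, $\delta = 2.06\,\overline{PA}_0/PA_0$ ($PA_0 > 0.99$)* |

* a lower $n$ can be given (with corresponding $F_{2n,2n,1-\alpha}$) for $PA_0$ smaller than the limit given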
Corresponding values for the availability can be obtained using $PA = 1 - \overline{PA}$. If failure-free and/or repair times are Erlangian distributed (Eq. (A6.102)) with $\beta_\lambda$ and $\beta_\mu$, $F_{2n,2n,1-\alpha}$ and $F_{2n,2n,1-\beta}$ have to be replaced by $F_{2n\beta_\mu,\,2n\beta_\lambda,\,1-\alpha}$ and $F_{2n\beta_\lambda,\,2n\beta_\mu,\,1-\beta}$ (for unchanged MTTF & MTTR, see Example A8.11).
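The two-step demonstration procedure of this section can be sketched as follows. This is an illustrative sketch only, assuming constant failure and repair rates: it searches the smallest $n$ for which the product of the $1-\alpha$ and $1-\beta$ quantiles of the F-distribution with $2n$, $2n$ degrees of freedom does not exceed the ratio $\overline{PA}_1 PA_0 / (\overline{PA}_0 PA_1)$, and returns the factor $F_{2n,2n,1-\alpha}$ from which $\delta$ follows. All function names are hypothetical, and the quantile is computed with a standard-library shortcut valid for equal, even degrees of freedom.

```python
from math import comb

def f_cdf_equal_df(f, n):
    """CDF of F(2n, 2n): F/(1+F) is Beta(n, n), evaluated as a binomial tail sum."""
    x = f / (1.0 + f)
    m = 2 * n - 1
    return sum(comb(m, j) * x**j * (1.0 - x)**(m - j) for j in range(n, m + 1))

def f_quantile_equal_df(p, n, hi=1e4):
    """p-quantile of F(2n, 2n) by bisection."""
    lo = 0.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f_cdf_equal_df(mid, n) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def availability_demo_plan(ratio, alpha, beta, n_max=500):
    """Smallest n with F_{1-alpha} * F_{1-beta} <= ratio, plus F_{1-alpha}.

    ratio stands for (PA1_bar * PA0) / (PA0_bar * PA1), which is close to
    PA1_bar / PA0_bar when PA0 is near 1.
    """
    for n in range(1, n_max + 1):
        f_a = f_quantile_equal_df(1.0 - alpha, n)
        f_b = f_quantile_equal_df(1.0 - beta, n)
        if f_a * f_b <= ratio:
            return n, f_a
    raise ValueError("no plan found up to n_max")

def decide(down_times, up_times, delta):
    """Step 2: reject H0 if observed DT/UT exceeds delta, otherwise accept."""
    return "reject" if sum(down_times) / sum(up_times) > delta else "accept"

n, f_a = availability_demo_plan(ratio=4.0, alpha=0.1, beta=0.1)
pa0 = 0.995                       # hypothetical specified availability
delta = f_a * (1.0 - pa0) / pa0   # limiting value for the observed ratio
print(n, f_a, delta)
```

With the ratio taken as exactly 4 or 6 and $\alpha = \beta = 0.1$, the search reproduces the corresponding entries of Table 7.2.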
7.2.2.3 Further Availability Evaluation Methods (for Continuous Operation)

The approach introduced in Appendices A8.2.2.4 & A8.3.1.4 and given in Sections 7.2.2.1 & 7.2.2.2 yields an exact solution based on the Fisher distribution for estimating and demonstrating an availability $PA = AA$, obtained by investigating $DT/UT$ for exponentially or Erlangian distributed failure-free and repair times. Exponentially distributed failure-free times arise in many practical applications. The distribution of repair (restoration) times can often be approximated by an Erlang distribution (Eq. (A6.102) with $\beta > 3$). Generalization of the distribution of failure-free or repair times can lead to analytical difficulties. In the following, some alternative approaches for estimating and demonstrating an availability $PA = AA$ are briefly discussed and compared with the approach given in Sections 7.2.2.1 & 7.2.2.2 (item's behavior described by the alternating renewal process of Fig. 6.2).

A first possibility is to consider only the distribution of the down time $DT$ (total repair or restoration time) in a given time interval $(0, t]$. At the given (fixed) time point $t$, the item can be up or down. Eq. (6.33) with $t - x$ instead of $T_0$ gives the distribution function of $DT$. Moments of $DT$ have been investigated in [A7.29 (1957)]. Mean and variance of the unavailability $\overline{PA} = 1 - PA = E[DT/t]$ can thus be given for arbitrary distributions of failure-free and repair times. In particular, for the case of constant failure and repair rates ($\lambda(x) = \lambda$, $\mu(x) = \mu$) it holds that

$$\lim_{t \to \infty} E[DT/t] = \frac{\lambda}{\lambda+\mu} \approx \lambda/\mu \quad \text{and} \quad \lim_{t \to \infty} \mathrm{Var}[DT/t] = \frac{2\lambda\mu}{t\,(\lambda+\mu)^3} \approx 2\lambda/(t\mu^2). \qquad (7.22)$$

However, already for the case of constant failure and repair rates, results for interval estimation and demonstration tests are not free of parameters (function of $\mu$ [A8.28] or $\lambda$ [A8.18]). The use of the distribution of $DT$, or $DT/t$ for fixed $t$, would bring the advantage of a test duration $t$ fixed in advance, but results are not free of parameters and the method is thus of limited utility.

A second possibility is to assign to the state of the item an indicator $\zeta(t)$ taking values 1 for item up and 0 for item down (Boolean variable in Section 2.3.4). In this case it holds that $PA(t) = \Pr\{\zeta(t) = 1\}$, and thus $E[\zeta(t)] = PA(t)$ and $\mathrm{Var}[\zeta(t)] = E[\zeta(t)^2] - E^2[\zeta(t)] = PA(t)\,(1 - PA(t))$. Investigation on $PA(t)$ reduces to that on $\zeta(t)$, see e.g. [A7.4 (1962)]. In particular, estimation and demonstration of $PA(t)$ can be based on observations of $\zeta(t)$ at time points $t_1 < t_2 < \dots$. A basic problem here is the choice of the observation time points (randomly, at constant time intervals $\Delta = t_{i+1} - t_i$, or other). For the case of constant failure and repair rates ($\lambda$, $\mu$), Eq. (6.20) yields $PA(t) = PA_{S0}(t) = \mu/(\lambda+\mu) + (\lambda/(\lambda+\mu))\,e^{-(\lambda+\mu)t}$ (item new at $t = 0$). Convergence to $PA = AA = \mu/(\lambda+\mu) \approx 1 - \lambda/\mu$ is very fast. Furthermore, because of the constant failure rate, the joint availability is given by $JA_{S0}(t, t+\Delta) = PA_{S0}(t) \cdot PA_{S0}(\Delta)$ (Eqs. (6.34) & (6.35)). Estimation and demonstration for the case of observations at constant time intervals $\Delta$ can thus be reduced to the case of an unknown probability $p = \overline{PA}(\Delta) = 1 - PA(\Delta) \approx (1 - e^{-\mu\Delta})\,\lambda/\mu \approx \lambda\Delta$ for $\Delta \ll 1/\mu$, or $p = \overline{PA} = \overline{AA} = \lambda/(\lambda+\mu) \approx \lambda/\mu$ for $\Delta \gg 1/\mu$ (Section 7.1).

A further (empirical) possibility is to estimate (point estimation) and demonstrate $\lambda$ and $\mu$ separately (Section 7.2.3), and put the results in $\overline{PA} = \overline{AA} \approx \lambda/\mu$. For an (empirical) interval estimation, Chebyshev's inequality (Eq. (A6.49)), expressing (Eq. (7.22)) $\Pr\{|DT/t - \lambda/\mu| > \varepsilon\} \le 2\lambda/(t\mu^2\varepsilon^2)$, can be used. From a statistical point of view this yields

$$\Pr\{|DT/t - \overline{PA}| > \varepsilon\} = \Pr\{|\overline{PA} - \hat\lambda/\hat\mu| > \varepsilon\} \ge \Pr\{\overline{PA} - \hat\lambda/\hat\mu > \varepsilon\} = \Pr\{\overline{PA} > \hat\lambda/\hat\mu + \varepsilon\} = 1 - \Pr\{\overline{PA} \le \hat\lambda/\hat\mu + \varepsilon\},$$

with the left-hand side bounded by $2\hat\lambda/(t\hat\mu^2\varepsilon^2) = \beta_1 = 1 - \gamma$, or

$$\Pr\Big\{\overline{PA} \le \hat{\overline{PA}}_u = \hat\lambda/\hat\mu + \sqrt{2\hat\lambda/(t\hat\mu^2(1-\gamma))}\Big\} \ge \gamma,$$

by replacing $\varepsilon$ from the last equality, and with $t$ as test time [7.14].

The different methods can basically be discussed by comparing Fig. 7.5 with Fig. 7.6 and Table 7.2 with Table 7.3. Results based on the Fisher distribution yield broader confidence intervals and longer demonstration tests (this can be accepted, considering that $\lambda$ and $\mu$ are unknown and that for high availability values higher ratios $\hat{\overline{PA}}_{a_u}/\hat{\overline{PA}}_a$ or $\overline{PA}_1/\overline{PA}_0$ can be agreed), the advantage being exact knowledge of the involved errors ($\beta_1$, $\beta_2$) or risks ($\alpha$, $\beta$). However, for some aspects (test duration, possibility to verify maintainability with selected failures) it can become more appropriate to estimate and demonstrate $\lambda$ and $\mu$ separately.
7.2.3 Estimation & Demonstration of a Constant Failure Rate $\lambda$ (or of MTBF for the Case MTBF = $1/\lambda$)

A constant (time independent) failure rate $\lambda(x) = \lambda$ occurs in many practical applications, for nonrepairable items as well as for repairable items which are assumed as-good-as-new after repair ($x$ being the variable starting by $x = 0$ at the beginning of the failure-free time considered, as for interarrival times). $\lambda(x) = \lambda$ implies that failure-free times are independent and exponentially distributed with the same parameter $\lambda$ (Eq. (A6.81)). In this case, the reliability function is given by $R(x) = e^{-\lambda x}$, and for the mean time to failure, $MTTF = 1/\lambda$ holds for all failure-free times (Eq. (A6.84)). For the repairable case, MTBF (mean operating time between failures) is often used in practical applications instead of MTTF. However, $MTBF = 1/\lambda$ holds only for the particular case $\lambda(x) = \lambda$. To avoid misuses, MTBF is confined in this book to the case $MTBF = 1/\lambda$. A reason for the assumption of $\lambda(x) = \lambda$ is that, by neglecting repair times, the flow of failures constitutes a homogeneous Poisson process (Appendix A7.2.5). This property characterizes exponentially distributed failure-free times and highly simplifies investigations.

This section deals with the estimation and demonstration of a constant failure rate $\lambda$, or of MTBF for the case $MTBF = 1/\lambda$ (see Appendix A8 for basic considerations and Sections 7.5 - 7.7 for further results). In particular, the case of a given (fixed) cumulative operating time $T$ is considered, when repair times are neglected (immediate renewal) and individual failure-free times are assumed to be independent. Due to the relationship between exponentially distributed failure-free times and the homogeneous Poisson process (Eq. (A7.39)), as well as the additive property of Poisson processes, the fixed cumulative operating time $T$ can be partitioned in an arbitrary way from failure-free times of statistically identical items (Example 7.7); see the note to Table 7.3 for a practical rule. The following are some examples:
1. Operation of a single item that is immediately renewed after each failure (renewal time = 0); here, $T = t$ = calendar time.
2. Operation of $n$ identical items, each of them being immediately renewed after each failure (renewal time = 0); here, $T = n\,t$ ($n = 1, 2, \dots$).

As stated above, in the case of a constant failure rate $\lambda$ and immediate renewal, the failure process is a homogeneous Poisson process (HPP) with intensity $\lambda$ (for $n = 1$) or $n\lambda$ (for $n > 1$) over the fixed time interval $(0, t = T/n]$. Hence, the probability of $k$ failures occurring within the cumulative operating time $T$ is (Eq. (A7.41))

$$\Pr\{k \text{ failures within } T \mid \lambda\} = \frac{(\lambda T)^k}{k!}\,e^{-\lambda T}, \qquad n = 1, 2, \dots, \quad k = 0, 1, 2, \dots .$$

Statistical procedures for estimation and demonstration of a failure rate $\lambda$ can thus be based on the evaluation of the parameter ($m = \lambda T$) of a Poisson distribution. In addition to the case of a given (fixed) cumulative operating time $T$ and immediate renewal (discussed above and investigated in Sections 7.2.3.1 - 7.2.3.3), for which the number $k$ of failures in $T$ is a sufficient statistic and $\hat\lambda = k/T$ is an unbiased estimate for $\lambda$, further possibilities are known. Assuming $n$ identical items at $t = 0$ and labeling the individual failure times as $t_1^* < t_2^* < \dots$, the following cases are important for practical applications: +)

1. Fixed number $k$ of failures, the test is stopped at the $k$th failure and failed items are instantaneously renewed; an unbiased point estimate for $\lambda$ is ($k > 1$)

$$\hat\lambda = (k-1)/(n\,t_k^*). \qquad (7.23)$$

2. Fixed number $k$ of failures, the test is stopped at the $k$th failure and failed items are not renewed; an unbiased point estimate of the failure rate $\lambda$ is

$$\hat\lambda = (k-1)/[\,t_1^* + \dots + t_k^* + (n-k)\,t_k^*\,]. \qquad (7.24)$$

3. Fixed test time $t$, failed items are not renewed; a point estimate of the failure rate $\lambda$ (given $k$ items have failed in $(0, t]$) is

$$\hat\lambda = k/[\,t_1^* + \dots + t_k^* + (n-k)\,t\,]. \qquad (7.25)$$
Example 7.7
An item with constant failure rate $\lambda$ operates first for a fixed time $T_1$ and then for a fixed time $T_2$. Repair times are neglected. Give the probability that $k$ failures will occur in $T = T_1 + T_2$.

Solution
The item's behavior within each of the time periods $T_1$ and $T_2$ can be described by a homogeneous Poisson process with intensity $\lambda$. From Eq. (A7.39) it follows that

$$\Pr\{i \text{ failures in the time period } T_1 \mid \lambda\} = \frac{(\lambda T_1)^i}{i!}\,e^{-\lambda T_1}$$

and, because of the memoryless property of the homogeneous Poisson process,

$$\Pr\{k \text{ failures in } T = T_1 + T_2 \mid \lambda\} = \sum_{i=0}^{k} \frac{(\lambda T_1)^i}{i!}\,e^{-\lambda T_1}\,\frac{(\lambda T_2)^{k-i}}{(k-i)!}\,e^{-\lambda T_2} = e^{-\lambda T}\lambda^k \sum_{i=0}^{k} \frac{T_1^{\,i}\,T_2^{\,k-i}}{i!\,(k-i)!} = \frac{(\lambda T)^k}{k!}\,e^{-\lambda T}. \qquad (7.26)$$

The last part of Eq. (7.26) follows from the binomial expansion of $(T_1 + T_2)^k$. Eq. (7.26) shows that for $\lambda$ constant, the cumulative operating time $T$ can be partitioned in any arbitrary way (see the remark to Table 7.3 for a practical rule).

Supplementary result: The same procedure can be used to prove that the sum of two independent homogeneous Poisson processes with intensities $\lambda_1$ and $\lambda_2$ is a homogeneous Poisson process with intensity $\lambda_1 + \lambda_2$; in fact,

$$\Pr\{k \text{ failures in } (0, T] \mid \lambda_1, \lambda_2\} = \sum_{i=0}^{k} \frac{(\lambda_1 T)^i}{i!}\,e^{-\lambda_1 T}\,\frac{(\lambda_2 T)^{k-i}}{(k-i)!}\,e^{-\lambda_2 T} = \frac{((\lambda_1 + \lambda_2)\,T)^k}{k!}\,e^{-(\lambda_1 + \lambda_2)T}. \qquad (7.27)$$

This result can be extended to nonhomogeneous Poisson processes.

+) The symbol * is used to distinguish $t_1^*, t_2^*, \dots$ as arbitrary points on the time axis from $t_1, t_2, \dots$ as independent observations of a failure-free time $\tau$ (starting at $t = 0$ as in Fig. 1.1).
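The additivity shown in Example 7.7 can be checked numerically. The short sketch below (illustrative only; the failure rate and the two time periods are hypothetical values) compares the convolution sum with the closed-form Poisson probability for the total time $T_1 + T_2$.

```python
from math import exp, factorial

def poisson_pmf(k, m):
    """Probability of exactly k events for a Poisson distribution with parameter m."""
    return m**k / factorial(k) * exp(-m)

def pmf_by_convolution(k, lam, t1, t2):
    """Pr{k failures in T1 + T2}: sum over i failures in T1 and k - i in T2."""
    return sum(poisson_pmf(i, lam * t1) * poisson_pmf(k - i, lam * t2)
               for i in range(k + 1))

# Hypothetical values: lambda = 1e-3 per hour, T1 = 400 h, T2 = 600 h
lam, t1, t2 = 1e-3, 400.0, 600.0
for k in range(6):
    print(k, pmf_by_convolution(k, lam, t1, t2), poisson_pmf(k, lam * (t1 + t2)))
```

Both columns agree up to floating-point rounding, whatever partition of $T$ into $T_1$ and $T_2$ is chosen.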
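7.2.3.1 Estimation of a Constant Failure Rate $\lambda$ +) (or of MTBF for the Case MTBF = $1/\lambda$)

Let us consider an item with a constant failure rate $\lambda$. If during the given (fixed) cumulative operating time $T$ exactly $k$ failures have occurred, the maximum likelihood point estimate for the unknown parameter $\lambda$ follows from Eq. (A8.46) as

$$\hat\lambda = k/T, \quad \text{with } E[\hat\lambda] = \lambda \text{ and } \mathrm{Var}[\hat\lambda] = \lambda/T. \qquad (7.28)$$

For a given confidence level $\gamma = 1 - \beta_1 - \beta_2$ (with $0 < \beta_1 < 1 - \beta_2 < 1$) and $k > 0$, the lower $\hat\lambda_l$ and upper $\hat\lambda_u$ limits of the confidence interval for the failure rate $\lambda$ can be obtained from (Eqs. (A8.47) to (A8.51))

$$\sum_{i=k}^{\infty} \frac{(\hat\lambda_l T)^i}{i!}\,e^{-\hat\lambda_l T} = \beta_2 \quad \text{and} \quad \sum_{i=0}^{k} \frac{(\hat\lambda_u T)^i}{i!}\,e^{-\hat\lambda_u T} = \beta_1,$$

or from

$$\hat\lambda_l = \frac{\chi^2_{2k,\,\beta_2}}{2T} \quad \text{and} \quad \hat\lambda_u = \frac{\chi^2_{2(k+1),\,1-\beta_1}}{2T}, \qquad (7.29)$$

using the quantiles of the $\chi^2$-distribution (Table A9.2). For $k = 0$, Eq. (A8.49) yields

$$\hat\lambda_l = 0 \quad \text{and} \quad \hat\lambda_u = \frac{\ln(1/\beta_1)}{T}, \quad \text{with } \gamma = 1 - \beta_1. \qquad (7.30)$$

Figure 7.6 gives confidence limits $\hat\lambda_l/\hat\lambda$ and $\hat\lambda_u/\hat\lambda$ for $\beta_1 = \beta_2 = (1-\gamma)/2$, useful for practical applications. For the case $MTBF = 1/\lambda$, $\widehat{MTBF} = T/k$ is biased (unbiased is $\widehat{MTBF} = T/(k+1)$); $\widehat{MTBF}_l = 1/\hat\lambda_u$ and $\widehat{MTBF}_u = 1/\hat\lambda_l$ can be used in practical applications.

+) The case considered in Sections 7.2.3.1 to 7.2.3.3 corresponds to a sampling plan with $n$ elements with replacement and $k$ failures in the given (fixed) time interval $(0, T/n]$, Type I (time) censoring; the underlying process is a homogeneous Poisson process with constant intensity $n\lambda$.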
Figure 7.6 Confidence limits $\hat\lambda_l/\hat\lambda$, $\hat\lambda_u/\hat\lambda$ (with $\hat\lambda = k/T$) for an unknown constant failure rate $\lambda$ per Eqs. (7.28) & (7.29) ($T$ = given (fixed) cumulative operating time (time censoring); $k$ = number of failures during $T$; $\gamma = 1 - \beta_1 - \beta_2$ = confidence level (here $\beta_1 = \beta_2 = (1-\gamma)/2$)); for the case $MTBF = 1/\lambda$, it holds that $\widehat{MTBF} = 1/\hat\lambda$ (unbiased for $k \gg 1$) and $\widehat{MTBF}_l = 1/\hat\lambda_u$, $\widehat{MTBF}_u = 1/\hat\lambda_l$; Examples 7.8, 7.13 (plot not reproduced)
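Confidence limits $\hat\lambda_l$, $\hat\lambda_u$ can also be used to give one-sided confidence intervals:

$$\lambda \le \hat\lambda_u, \ \text{with } \beta_2 = 0 \text{ and } \gamma = 1 - \beta_1, \quad \text{or} \quad \lambda \ge \hat\lambda_l, \ \text{with } \beta_1 = 0 \text{ and } \gamma = 1 - \beta_2, \qquad (7.31)$$

i.e. $MTBF \ge \widehat{MTBF}_l = 1/\hat\lambda_u$ or $MTBF \le \widehat{MTBF}_u = 1/\hat\lambda_l$ for the case $MTBF = 1/\lambda$.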
Example 7.8
In testing a subassembly with constant failure rate $\lambda$, 4 failures occur during $T = 10^4$ cumulative operating hours. Find the confidence interval of $\lambda$ for a confidence level $\gamma = 0.8$ ($\beta_1 = \beta_2 = 0.1$).

Solution
From Fig. 7.6 it follows that for $k = 4$ and $\gamma = 0.8$, $\hat\lambda_l/\hat\lambda \approx 0.44$ and $\hat\lambda_u/\hat\lambda \approx 2$. With $T = 10^4$ h, $k = 4$, and $\hat\lambda = 4 \cdot 10^{-4}\,\mathrm{h}^{-1}$, the confidence limits are $\hat\lambda_l \approx 1.7 \cdot 10^{-4}\,\mathrm{h}^{-1}$ and $\hat\lambda_u \approx 8 \cdot 10^{-4}\,\mathrm{h}^{-1}$.
Supplementary result: The corresponding one-sided confidence interval is $\lambda \le 8 \cdot 10^{-4}\,\mathrm{h}^{-1}$ with $\gamma = 0.9$.
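The limits read from Fig. 7.6 in Example 7.8 can be cross-checked by solving the two Poisson sums for the confidence limits numerically. The sketch below is an illustration under the stated assumptions (time censoring, constant failure rate); it uses bisection on the Poisson distribution function rather than $\chi^2$ tables, and its function names are hypothetical.

```python
from math import exp, log

def poisson_cdf(c, m):
    """Pr{no more than c events} for a Poisson distribution with parameter m."""
    term = total = exp(-m)
    for i in range(1, c + 1):
        term *= m / i
        total += term
    return total

def solve_parameter(c, target, hi=1e3):
    """Find m such that poisson_cdf(c, m) = target (the CDF decreases with m)."""
    lo = 0.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(c, mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def failure_rate_limits(k, T, beta1, beta2):
    """Lower and upper confidence limits for lambda, given k failures in T."""
    # Lower limit: Pr{at least k failures} = beta2, i.e. CDF(k - 1) = 1 - beta2
    lower = 0.0 if k == 0 else solve_parameter(k - 1, 1.0 - beta2) / T
    # Upper limit: Pr{no more than k failures} = beta1
    upper = solve_parameter(k, beta1) / T
    return lower, upper

lam_l, lam_u = failure_rate_limits(k=4, T=1e4, beta1=0.1, beta2=0.1)
print(lam_l, lam_u)   # compare with 1.7e-4 and 8e-4 per hour in Example 7.8
```

For $k = 0$ the upper limit reduces to $\ln(1/\beta_1)/T$, which the same routine reproduces.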
In the above considerations (Eqs. (7.28) - (7.31)), the cumulative operating time $T$ was given (fixed), independent of the individual failure-free times and the number $n$ of items involved (Type I censoring). The situation is different when the number of failures $k$ is given (fixed), i.e. when the test is stopped at the occurrence of the $k$th failure (Type II censoring). Here, the cumulative operating time is a random variable (term $(k-1)/\hat\lambda$ of Eqs. (7.23) and (7.24)). Using the memoryless property of homogeneous Poisson processes, it can be shown that the quantities

$$n\,(t_i^* - t_{i-1}^*) \ \text{for renewal, and} \quad (n - i + 1)\,(t_i^* - t_{i-1}^*) \ \text{for no renewal}, \qquad (7.32)$$

with $i = 1, \dots, k$ and $t_0^* = 0$, are independent observations of a random variable distributed according to $F(x) = 1 - e^{-\lambda x}$. This is necessary and sufficient to prove that the $\hat\lambda$ given by Eqs. (7.23) and (7.24) are maximum likelihood estimates for $\lambda$. For confidence intervals, results of Appendix A8.2.2.3 can be used.

In some practical applications, a system's failure rate confidence limits are sought as a function of the components' failure rate confidence limits. Monte Carlo simulation can help. However, for series systems, constant failure rates $\lambda_1, \dots, \lambda_n$, time censoring, and same observation time $T$, Eqs. (2.19), (7.28), and (7.27) yield $\hat\lambda_S = \hat\lambda_1 + \dots + \hat\lambda_n$. Furthermore, for given fixed $T$, $2T\lambda_i$ (considered here as a random variable, see Appendix A8.2.2.2) has a $\chi^2$ distribution with $2(k_i + 1)$ degrees of freedom (Eq. (A8.48), Table A9.2); thus, $2T\lambda_S$ has a $\chi^2$ distribution with $\sum 2(k_i + 1)$ degrees of freedom. From a $\chi^2$ distribution table (Table A9.2) one can recognize that for $\Pr\{2T\lambda_i \le 2T\hat\lambda_{i_u}\} = \Pr\{\lambda_i \le \hat\lambda_{i_u}\} \ge 0.8 = \gamma$ ($i = 1, \dots, n$) and $\nu = 10, 20, \dots$ ($k_1 = \dots = k_n = 4$ as an example) one has $\Pr\{\lambda_S \le \hat\lambda_{1_u} + \dots + \hat\lambda_{n_u}\} \ge \gamma$ ($\gamma = 1 - e^{-3/2} \approx 0.777$ is given e.g. in [7.18]). Extension to different observation times $T_i$, series-parallel structures, or Erlangian distributed failure-free times is possible [7.18]. Estimation of $\lambda/\mu$ as approximation for an unavailability $\lambda/(\lambda+\mu)$ is given in Section 7.2.2.1.
7.2.3.2 Simple Two-sided Test for the Demonstration of a Constant Failure Rate $\lambda$ (or of MTBF for the Case MTBF = $1/\lambda$)

In the context of an acceptance test, demonstration of a constant failure rate $\lambda$ (or of MTBF for the case $MTBF = 1/\lambda$) is often required, not merely its estimation as in Section 7.2.3.1. The main concern of this test is to check a null hypothesis $H_0: \lambda < \lambda_0$ against an alternative hypothesis $H_1: \lambda > \lambda_1$, on the basis of the following agreement between producer and consumer:

Items should be accepted with a probability nearly equal to (but not less than) $1-\alpha$ if the true (unknown) $\lambda$ is less than $\lambda_0$, but rejected with a probability nearly equal to (but not less than) $1-\beta$ if $\lambda$ is greater than $\lambda_1$ ($\lambda_0$, $\lambda_1 > \lambda_0$, and $0 < \alpha < 1-\beta < 1$ are given (fixed) values).

$\lambda_0$ is the specified $\lambda$ and $\lambda_1$ is the maximum acceptable $\lambda$ ($1/m_0$ and $1/m_1$ in IEC 60605 [7.19], or $1/\theta_0$ and $1/\theta_1$ in MIL-STD-781 [7.22], for the case $MTBF = 1/\lambda$).
$\alpha$ is the allowed producer's risk (type I error), i.e. the probability of rejecting a true hypothesis $H_0: \lambda < \lambda_0$. $\beta$ is the allowed consumer's risk (type II error), i.e. the probability of accepting $H_0$ when the alternative hypothesis $H_1: \lambda > \lambda_1$ is true. Evaluation of the above agreement is a problem of statistical hypothesis testing (Appendix A8.3), and can be performed e.g. with a simple two-sided test or a sequential test. With the simple two-sided test (also known as the fixed length test), the cumulative operating time $T$ and the number of allowed failures $c$ during $T$ are fixed quantities. The procedure (test plan) is as follows:

1. From $\lambda_0$, $\lambda_1$, $\alpha$, $\beta$ determine the smallest integer $c$ and the value $T$ satisfying

$$\sum_{i=0}^{c} \frac{(\lambda_0 T)^i}{i!}\,e^{-\lambda_0 T} \ge 1 - \alpha \qquad (7.33)$$

and

$$\sum_{i=0}^{c} \frac{(\lambda_1 T)^i}{i!}\,e^{-\lambda_1 T} \le \beta. \qquad (7.34)$$

2. Perform a test with a total cumulative operating time $T$, determine the number of failures $k$ during the test, and

reject $H_0: \lambda < \lambda_0$, if $k > c$,
accept $H_0: \lambda < \lambda_0$, if $k \le c$. (7.35)

For the case $MTBF = 1/\lambda$, the above procedure can be used to test $H_0: MTBF > MTBF_0$ against $H_1: MTBF < MTBF_1$, by replacing $\lambda_0 = 1/MTBF_0$ and $\lambda_1 = 1/MTBF_1$.
Example 7.9
The following conditions have been specified for the demonstration (acceptance test) of the constant (time independent) failure rate $\lambda$ of an assembly: $\lambda_0 = 1/2000\,\mathrm{h}$ (specified $\lambda$), $\lambda_1 = 1/1000\,\mathrm{h}$ (maximum acceptable $\lambda$), producer risk $\alpha = 0.2$, consumer risk $\beta = 0.2$. Give: (i) the cumulative test time $T$ and the allowed number of failures $c$ during $T$; (ii) the probability of acceptance if the true failure rate $\lambda$ were $1/3000\,\mathrm{h}$.

Solution
(i) From Fig. 7.3, $c = 6$ and $m \approx 4.6$ for $\Pr\{\text{acceptance}\} \approx 0.82$, $c = 6$ and $m \approx 9.2$ for $\Pr\{\text{acceptance}\} \approx 0.19$; thus $c = 6$ and $T = 9200$ h. These values agree well with those obtained from Table A9.2 ($\nu = 14$) and are given in Table 7.3.
(ii) For $\lambda = 1/3000\,\mathrm{h}$, $T = 9200$ h, $c = 6$,

$$\Pr\{\text{acceptance} \mid \lambda = 1/3000\,\mathrm{h}\} = \Pr\{\text{no more than 6 failures in } T = 9200\,\mathrm{h} \mid \lambda = 1/3000\,\mathrm{h}\} = \sum_{i=0}^{6} \frac{3.07^{\,i}}{i!}\,e^{-3.07} \approx 0.96,$$

see also Fig. 7.3 for $m = 3.07$ and $c = 6$.
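The determination of $c$ and $T$ in step 1 of the fixed length test can be sketched numerically. The code below is an illustration, not the procedure used for Table 7.3: for each $c$ it computes the range of $\lambda_0 T$ for which both the producer's and the consumer's conditions hold, and returns the smallest $c$ with a non-empty range (the table then picks a value inside the range with $\alpha \approx \beta$). Function names are hypothetical.

```python
from math import exp

def poisson_cdf(c, m):
    """Pr{no more than c failures} for a Poisson distribution with parameter m."""
    term = total = exp(-m)
    for i in range(1, c + 1):
        term *= m / i
        total += term
    return total

def solve_parameter(c, target, hi=1e3):
    """Find m such that poisson_cdf(c, m) = target (the CDF decreases with m)."""
    lo = 0.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(c, mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def fixed_length_plan(ratio, alpha, beta, c_max=200):
    """Smallest c, with the admissible range of lambda0*T, for lambda1/lambda0 = ratio."""
    for c in range(c_max + 1):
        # Producer's condition: Pr{accept | lambda0} >= 1 - alpha
        m0_max = solve_parameter(c, 1.0 - alpha)
        # Consumer's condition: Pr{accept | lambda1} <= beta, with lambda1*T = ratio*lambda0*T
        m0_min = solve_parameter(c, beta) / ratio
        if m0_min <= m0_max:
            return c, m0_min, m0_max
    raise ValueError("no plan found up to c_max")

c, m0_min, m0_max = fixed_length_plan(ratio=2.0, alpha=0.2, beta=0.2)
print(c, m0_min, m0_max)                 # Example 7.9 uses c = 6, lambda0*T about 4.6
print(poisson_cdf(6, 9200.0 / 3000.0))   # part (ii) of Example 7.9
```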
Figure 7.7 Operating characteristic curve (acceptance probability curve), $\Pr\{\text{acceptance} \mid \lambda\} = \Pr\{\text{no more than } c \text{ failures in } T \mid \lambda\}$, as a function of $\lambda$ [$\mathrm{h}^{-1}$] for fixed $T$ and $c$ ($\lambda_0 = 1/2000\,\mathrm{h}$, $\lambda_1 = 1/1000\,\mathrm{h}$, $\alpha \approx \beta \approx 0.2$; $T = 9200$ h and $c = 6$ as per Table 7.3; see also Fig. 7.3) (holds for $MTBF_0 = 2000$ h and $MTBF_1 = 1000$ h, for the case $MTBF = 1/\lambda$) (plot not reproduced)
The graph of Fig. 7.7 visualizes the validity of the above agreement between producer and consumer (customer). It satisfies the inequalities (7.33) and (7.34), and is known as the operating characteristic curve (acceptance probability curve). For each value of $\lambda$, it gives the probability of having not more than $c$ failures during a cumulative operating time $T$. Since the operating characteristic curve as a function of $\lambda$ is monotonically decreasing, the risk for a false decision decreases for $\lambda < \lambda_0$ and $\lambda > \lambda_1$, respectively. It can be shown that the quantities $c$ and $\lambda_0 T$ depend only on $\alpha$, $\beta$, and the ratio $\lambda_1/\lambda_0$ (discrimination ratio). Table 7.3 gives $c$ and $\lambda_0 T$ for some values of $\alpha$, $\beta$, and $\lambda_1/\lambda_0$ useful for practical applications. For the case $MTBF = 1/\lambda$, Table 7.3 holds for testing $H_0: MTBF > MTBF_0$ against $H_1: MTBF < MTBF_1$, by replacing $\lambda_0 = 1/MTBF_0$ and $\lambda_1 = 1/MTBF_1$. Table 7.3 can also be used for the demonstration of an unknown probability $p$ (Eqs. (7.8) and (7.9)) in the case where the Poisson approximation applies. A large number of test plans are given in international standards [7.19 (61124)].

In addition to the simple two-sided test described above, a sequential test is often used (see Appendix A8.3.1.2 and Section 7.1.2.2 for basic considerations and Fig. 7.8 for an example). In this test, neither the cumulative operating time $T$ nor the number $c$ of allowed failures during $T$ are specified before the test begins. The number of failures is recorded as a function of the cumulative operating time (normalized to $1/\lambda_0$). As soon as the resulting staircase curve crosses the acceptance line or the rejection line, the test is stopped. Sequential tests offer the advantage that on average the test duration is shorter than with simple two-sided tests. Using Eq. (7.12) with $p_0 = 1 - e^{-\lambda_0 \delta t}$, $p_1 = 1 - e^{-\lambda_1 \delta t}$, $n = T/\delta t$, and $\delta t \to 0$ (continuous in time), the acceptance and rejection lines are obtained as
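acceptance line: $y_1(x) = a\,x - b_1$, (7.36)

rejection line: $y_2(x) = a\,x + b_2$, (7.37)

with $x = \lambda_0 T$, and

$$a = \frac{\lambda_1/\lambda_0 - 1}{\ln(\lambda_1/\lambda_0)}, \qquad b_1 = \frac{\ln((1-\alpha)/\beta)}{\ln(\lambda_1/\lambda_0)}, \qquad b_2 = \frac{\ln((1-\beta)/\alpha)}{\ln(\lambda_1/\lambda_0)}. \qquad (7.38)$$

Table 7.3 Number of allowed failures $c$ during the cumulative operating time $T$ and value of $\lambda_0 T$ to demonstrate $\lambda < \lambda_0$ against $\lambda > \lambda_1$ for various values of $\alpha$ (producer risk), $\beta$ (consumer risk), and $\lambda_1/\lambda_0$ (can be used to test $MTBF > MTBF_0$ against $MTBF < MTBF_1$ for the case $MTBF = 1/\lambda$ or, using $\lambda_0 T = n\,p_0$, to test $p < p_0$ against $p > p_1$ for an unknown probability $p$)

| | $\lambda_1/\lambda_0 = 1.5$ | $\lambda_1/\lambda_0 = 2$ | $\lambda_1/\lambda_0 = 3$ |
|---|---|---|---|
| $\alpha \approx \beta \lesssim 0.1$ | $c = 40$, $\lambda_0 T = 32.98$ ($\alpha \approx \beta \approx 0.098$) | $c = 14$*, $\lambda_0 T = 10.17$ ($\alpha \approx \beta \approx 0.093$) | $c = 5$, $\lambda_0 T = 3.12$ ($\alpha \approx \beta \approx 0.096$) |
| $\alpha \approx \beta \lesssim 0.2$ | $c = 17$, $\lambda_0 T = 14.33$ ($\alpha \approx \beta \approx 0.197$) | $c = 6$, $\lambda_0 T = 4.62$ ($\alpha \approx \beta \approx 0.185$) | $c = 2$, $\lambda_0 T = 1.47$ ($\alpha \approx \beta \approx 0.184$) |
| $\alpha \approx \beta \lesssim 0.3$ | $c = 6$, $\lambda_0 T = 5.41$ ($\alpha \approx \beta \approx 0.2997$) | $c = 2$, $\lambda_0 T = 1.85$ ($\alpha \approx \beta \approx 0.284$) | $c = 1$, $\lambda_0 T = 0.92$ ($\alpha \approx \beta \approx 0.236$) |

* $c = 13$ yields $\lambda_0 T = 9.48$ and $\alpha \approx \beta \approx 0.1003$; number of items under test $\approx T\lambda_0$, as a rule of thumb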
Sequential tests used in practical applications are given in international standards [7.19 (61124)]. To limit testing effort, restrictions are often placed on the test duration and the number of allowed failures. Figure 7.8 shows two truncated sequential test plans for $\alpha \approx \beta \approx 0.2$ and $\lambda_1/\lambda_0 = 1.5$ and 2, respectively. The lines defined by Eqs. (7.36) - (7.38) are shown dashed in Fig. 7.8a.
Example 7.10
Continuing with Example 7.9, give the expected test duration by assuming that the true $\lambda$ equals $\lambda_0$ and a sequential test as per Fig. 7.8 is used.

Solution
From Fig. 7.8 with $\lambda_1/\lambda_0 = 2$ it follows that $E[\text{test duration} \mid \lambda = \lambda_0] \approx 2.4/\lambda_0 = 4800$ h.
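The acceptance and rejection lines of the untruncated sequential test can be evaluated with a few lines of code. The sketch below assumes the usual Wald form for a constant failure rate, i.e. slope $a = (\lambda_1/\lambda_0 - 1)/\ln(\lambda_1/\lambda_0)$ and intercepts $b_1 = \ln((1-\alpha)/\beta)/\ln(\lambda_1/\lambda_0)$, $b_2 = \ln((1-\beta)/\alpha)/\ln(\lambda_1/\lambda_0)$, with $x = \lambda_0 T$; truncation as used in the standardized plans of Fig. 7.8 is not modeled, and the function names are hypothetical.

```python
from math import log

def sequential_test_lines(ratio, alpha, beta):
    """Slope a and intercepts b1, b2 of the acceptance and rejection lines."""
    d = log(ratio)                       # ln(lambda1 / lambda0)
    a = (ratio - 1.0) / d
    b1 = log((1.0 - alpha) / beta) / d
    b2 = log((1.0 - beta) / alpha) / d
    return a, b1, b2

def decision(k, x, a, b1, b2):
    """Decision after k failures at normalized cumulative time x = lambda0 * T."""
    if k <= a * x - b1:
        return "accept"
    if k >= a * x + b2:
        return "reject"
    return "continue"

a, b1, b2 = sequential_test_lines(ratio=2.0, alpha=0.2, beta=0.2)
print(a, b1, b2)
print(decision(0, 2.0, a, b1, b2))   # no failure up to lambda0*T = 2
print(decision(5, 1.0, a, b1, b2))   # 5 failures already at lambda0*T = 1
```

For $\alpha = \beta = 0.2$ and $\lambda_1/\lambda_0 = 2$ the intercepts are both equal to 2, so with zero failures the acceptance line is first reached at $\lambda_0 T = 2\ln 2 \approx 1.39$.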
Figure 7.8 a) Sequential test plan (number of failures versus normalized cumulative operating time) to demonstrate $\lambda < \lambda_0$ against $\lambda > \lambda_1$ for $\alpha \approx \beta \approx 0.2$ and $\lambda_1/\lambda_0 = 1.5$ (top), $\lambda_1/\lambda_0 = 2$ (down), as per IEC 61124 and MIL-HDBK-781 [7.19, 7.22] (dashed on the left are the lines given by Eqs. (7.36) - (7.38)); b) Expected test duration until acceptance (continuous) and operating characteristic curve (dashed) as a function of $\lambda_0/\lambda$ (can be used to test $MTBF > MTBF_0$ against $MTBF < MTBF_1$, for the case $MTBF = 1/\lambda$) (plots not reproduced)
7.2.3.3 Simple One-sided Test for the Demonstration of a Constant Failure Rate $\lambda$ (or of MTBF for the Case MTBF = $1/\lambda$)

Simple two-sided tests (Fig. 7.7) and sequential tests (Fig. 7.8) have the advantage that, for $\alpha = \beta$, producer and consumer run the same risk of making a false decision. However, in practical applications often only $\lambda_0$ and $\alpha$ or $\lambda_1$ and $\beta$, i.e. simple one-sided tests, are used. The considerations of Section 7.1.3 apply, and care should be taken with small values of $c$, as operating with $\lambda_0$ and $\alpha$ (or $\lambda_1$ and $\beta$) the producer (or consumer) can be favored. Figure 7.9 shows the operating characteristic curves for various values of $c$ as a function of $\lambda$ for the demonstration of $\lambda < 1/1000\,\mathrm{h}$ against $\lambda > 1/1000\,\mathrm{h}$ with consumer risk $\beta = 0.2$ for $\lambda = 1/1000\,\mathrm{h}$, and visualizes the reduction of the producer's risk ($\alpha = 0.8$ for $\lambda = 1/1000\,\mathrm{h}$) by decreasing $\lambda$ or increasing $c$ (counterpart of Fig. 7.4).
Figure 7.9 Operating characteristic curves (acceptance probability curves) for $\lambda_1 = 1/1000\,\mathrm{h}$, $\beta = 0.2$, and $c = 0$ ($T = 1610$ h), $c = 1$ ($T = 2995$ h), $c = 2$ ($T = 4280$ h), $c = 5$ ($T = 7905$ h), and $c = \infty$ ($T = \infty$) (holds for $MTBF_1 = 1000$ h, for the case $MTBF = 1/\lambda$) (plot not reproduced)
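The test times listed in the caption of Fig. 7.9 can be approximately reproduced by solving, for each allowed number of failures $c$, the condition that the acceptance probability at $\lambda_1$ equals the consumer risk $\beta$. The following sketch is illustrative (hypothetical function names, bisection instead of tables).

```python
from math import exp

def poisson_cdf(c, m):
    """Pr{no more than c failures} for a Poisson distribution with parameter m."""
    term = total = exp(-m)
    for i in range(1, c + 1):
        term *= m / i
        total += term
    return total

def one_sided_test_time(c, lambda1, beta, hi=1e3):
    """Cumulative operating time T with Pr{no more than c failures | lambda1} = beta."""
    lo = 0.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(c, mid) > beta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi) / lambda1

for c in (0, 1, 2, 5):
    T = one_sided_test_time(c, lambda1=1.0 / 1000.0, beta=0.2)
    # Acceptance probability if the true failure rate were only lambda1 / 2
    print(c, round(T), poisson_cdf(c, T / 2000.0))
```

The last column illustrates the point made in the text: for small $c$ the producer's risk remains large even when the true failure rate is well below $\lambda_1$.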
7.3 Statistical Maintainability Tests

Maintainability is generally expressed as a probability. In this case, results of Sections 7.1 and 7.2.1 can be used to estimate or demonstrate maintainability. However, estimation and demonstration of specific parameters, for instance MTTR (mean time to repair), is important for practical applications. If the underlying random values are exponentially distributed (constant repair rate $\mu$), the results of Section 7.2.3 for a constant failure rate $\lambda$ can be used. This section deals with the estimation and demonstration of an MTTR by assuming that the repair time is lognormally distributed (for Erlangian distributed repair times, results of Section 7.2.3 can be used, considering Eqs. (A6.102) & (A6.103)). To simplify the notation, realizations (observations) of a repair time $\tau'$ will be denoted by $t_1, \dots, t_n$ instead of $t_1', \dots, t_n'$.
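7.3.1 Estimation of an MTTR

Let $t_1, \dots, t_n$ be independent observations (realizations) of the repair time $\tau'$ of a given item. From Eqs. (A8.6) and (A8.10), the empirical mean and variance of $\tau'$ are given by

$$\hat{E}[\tau'] = \frac{1}{n}\sum_{i=1}^{n} t_i, \qquad (7.39)$$

$$\widehat{\mathrm{Var}}[\tau'] = \frac{1}{n-1}\sum_{i=1}^{n}\big(t_i - \hat{E}[\tau']\big)^2 = \frac{1}{n-1}\Big[\sum_{i=1}^{n} t_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n} t_i\Big)^2\Big]. \qquad (7.40)$$

For these estimates it holds that $E[\hat{E}[\tau']] = E[\tau'] = MTTR$, $\mathrm{Var}[\hat{E}[\tau']] = \mathrm{Var}[\tau']/n$, and $E[\widehat{\mathrm{Var}}[\tau']] = \mathrm{Var}[\tau']$ (Appendix A8.1.2). As stated above, the repair time $\tau'$ can often be assumed lognormally distributed with distribution function (Eq. (A6.110))

$$F(t) = \frac{1}{\sigma\sqrt{2\pi}}\int_0^t \frac{1}{y}\,e^{-(\ln(\lambda y))^2/(2\sigma^2)}\,dy = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\ln(\lambda t)/\sigma} e^{-x^2/2}\,dx, \qquad (7.41)$$

and with mean and variance given by (Eqs. (A6.112) and (A6.113))

$$E[\tau'] = MTTR = \frac{e^{\sigma^2/2}}{\lambda}, \qquad \mathrm{Var}[\tau'] = \frac{e^{2\sigma^2} - e^{\sigma^2}}{\lambda^2} = MTTR^2\,(e^{\sigma^2} - 1). \qquad (7.42)$$

From Eq. (7.41) one recognizes that $\ln \tau'$ is normally distributed with mean $\ln(1/\lambda)$ and variance $\sigma^2$. Using Eqs. (A8.24) and (A8.27), the maximum likelihood estimation of $\lambda$ and $\sigma^2$ is obtained from

$$\hat\lambda = \Big[\prod_{i=1}^{n} t_i\Big]^{-1/n} \quad \text{and} \quad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\big(\ln(\hat\lambda\,t_i)\big)^2. \qquad (7.43)$$

A point estimate for $\lambda$ and $\sigma$ can also be obtained by the method of quantiles. The idea is to substitute some particular quantiles with the corresponding empirical quantiles to obtain estimates for $\lambda$ or $\sigma$. For $t = 1/\lambda$, $\ln(\lambda t) = 0$ and $F(1/\lambda) = 0.5$; therefore, $1/\lambda$ is the 0.5 quantile (median) $t_{0.5}$ of the distribution function $F(t)$ given by Eq. (7.41). From the empirical 0.5 quantile $\hat{t}_{0.5} = \inf(t: \hat{F}_n(t) \ge 0.5)$, an estimate for $\lambda$ follows as

$$\hat\lambda = 1/\hat{t}_{0.5}. \qquad (7.44)$$

Moreover, $t = e^{\sigma}/\lambda$ yields $F(e^{\sigma}/\lambda) = 0.841$ (Table A9.1); thus $e^{\sigma}/\lambda = t_{0.841}$ is the 0.841 quantile of $F(t)$ given by Eq. (7.41). Using $\lambda = 1/t_{0.5}$ and $\sigma = \ln(\lambda\,t_{0.841}) = \ln(t_{0.841}/t_{0.5})$, an estimate for $\sigma$ is obtained as

$$\hat\sigma = \ln(\hat{t}_{0.841}/\hat{t}_{0.5}). \qquad (7.45)$$

Furthermore, considering $F(e^{-\sigma}/\lambda) = 1 - 0.841 = 0.159$, i.e. $t_{0.159} = e^{-\sigma}/\lambda$, it follows that $e^{2\sigma} = \lambda\,t_{0.841}/(\lambda\,t_{0.159})$, and thus Eq. (7.45) can be replaced by

$$\hat\sigma = \tfrac{1}{2}\ln(\hat{t}_{0.841}/\hat{t}_{0.159}). \qquad (7.46)$$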
The possibility of representing a lognormal distribution function as a straight line, to simplify the interpretation of data, is discussed in Section 7.5.1 (Fig. 7.14, Appendix A9.8.1). To obtain interval estimates for the parameters $\lambda$ and $\sigma$, note that the logarithm of a lognormally distributed variable is normally distributed with mean $\ln(1/\lambda)$ and variance $\sigma^2$. Applying the transformation $t_i \to \ln t_i$ to the individual observations $t_1, \dots, t_n$ and using the results known for the interval estimation of the parameters of a normal distribution [A6.1, A6.4], the confidence intervals

$$\Big[\frac{n\,\hat\sigma^2}{\chi^2_{n-1,\,(1+\gamma)/2}},\ \frac{n\,\hat\sigma^2}{\chi^2_{n-1,\,(1-\gamma)/2}}\Big] \qquad (7.47)$$

for $\sigma^2$, and

$$\big[\hat\lambda\,e^{-\varepsilon},\ \hat\lambda\,e^{\varepsilon}\big], \quad \text{with } \varepsilon = \frac{\hat\sigma}{\sqrt{n-1}}\,t_{n-1,\,(1+\gamma)/2}, \qquad (7.48)$$

for $\lambda$ can be found, with $\hat\lambda$ and $\hat\sigma$ as in Eq. (7.43). $\chi^2_{n-1,\,q}$ and $t_{n-1,\,q}$ are the $q$ quantiles of the $\chi^2$- and t-distribution with $n - 1$ degrees of freedom, respectively (Tables A9.2 and A9.3).
Example 7.11
Let 1.1, 1.3, 1.6, 1.9, 2.0, 2.3, 2.4, 2.7, 3.1, and 4.2 h be 10 independent observations (realizations) of a lognormally distributed repair time. Give the maximum likelihood estimate and, for $\gamma = 0.9$, the confidence interval for the parameters $\lambda$ and $\sigma^2$.
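The estimates asked for in Example 7.11 can be computed directly from the data. The sketch below is one possible evaluation, not the book's worked solution: it applies the maximum likelihood estimates to the logarithms of the observations and the normal-theory intervals $n\hat\sigma^2/\chi^2_{n-1,(1\pm\gamma)/2}$ for $\sigma^2$ and $\hat\lambda\,e^{\mp\varepsilon}$ with $\varepsilon = t_{n-1,(1+\gamma)/2}\,\hat\sigma/\sqrt{n-1}$ for $\lambda$. The three quantiles for 9 degrees of freedom are entered by hand from standard tables, which is an assumption of this sketch.

```python
from math import exp, log, sqrt

# Repair times in hours, as given in Example 7.11
times = [1.1, 1.3, 1.6, 1.9, 2.0, 2.3, 2.4, 2.7, 3.1, 4.2]
n = len(times)

# ln(t) is normally distributed with mean ln(1/lambda) and variance sigma^2
logs = [log(t) for t in times]
mean_log = sum(logs) / n

lam_hat = exp(-mean_log)                                   # ML estimate of lambda
sigma2_hat = sum((x - mean_log) ** 2 for x in logs) / n    # ML estimate of sigma^2

# Quantiles for n - 1 = 9 degrees of freedom and gamma = 0.9
# (values taken from standard chi-square and t tables; assumed here)
CHI2_9_Q05, CHI2_9_Q95, T_9_Q95 = 3.325, 16.919, 1.833

sigma2_low = n * sigma2_hat / CHI2_9_Q95
sigma2_high = n * sigma2_hat / CHI2_9_Q05

eps = T_9_Q95 * sqrt(sigma2_hat) / sqrt(n - 1)
lam_low, lam_high = lam_hat * exp(-eps), lam_hat * exp(eps)

# Plug-in estimate of MTTR = exp(sigma^2 / 2) / lambda
mttr_hat = exp(sigma2_hat / 2.0) / lam_hat

print(lam_hat, sigma2_hat)
print(sigma2_low, sigma2_high)
print(lam_low, lam_high)
print(mttr_hat)
```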
7.3.2 Demonstration of an MTTR

The demonstration of an MTTR (in an acceptance test) will be investigated here by assuming that the repair time is lognormally distributed with known $\sigma^2$ (method 1A of MIL-STD-471 [7.22]). A rule is sought to test the null hypothesis $H_0: MTTR = MTTR_0$ against the alternative hypothesis $H_1: MTTR = MTTR_1$ ($MTTR_1 > MTTR_0$) for given type I error $\alpha$ and type II error $\beta$ (Appendix A8.3). The procedure (test plan) is as follows:
306
7 Statistical Quality Control and Reliability Tests
1. From a and ß ( 0 < a < 1- ß < I), determine the quantiles tp and tl-, of the standard normal distribution (Table A9.1)
and M w , compute the sample size n (next highest integer)
From M%
n=
(tl-a MUR,,
- tP
(Mi'TR, - M
MTTR,)'
2
(e" -1).
~R,,)~
2. Perform $n$ independent repairs and record the observed repair times $t_1, ..., t_n$ (representative sample of repair times).
3. Compute $\hat E[\tau']$ according to Eq. (7.39) and reject $H_0$: MTTR = MTTR$_0$ if

$$\hat E[\tau'] > c = \mathrm{MTTR}_0\left(1 + t_{1-\alpha}\sqrt{(e^{\sigma^2} - 1)/n}\,\right), \tag{7.51}$$
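otherwise accept $H_0$.

The proof of the above rule implies a sample size $n > 10$, so that the quantity $\hat E[\tau']$ can be assumed to have a normal distribution with mean MTTR and variance $\mathrm{Var}[\tau']/n$ (Eqs. (A6.148), (A8.7), (A8.8)). Considering the type I and type II errors

$$\alpha = \Pr\{\hat E[\tau'] > c \mid \mathrm{MTTR} = \mathrm{MTTR}_0\}, \qquad \beta = \Pr\{\hat E[\tau'] \le c \mid \mathrm{MTTR} = \mathrm{MTTR}_1\},$$

and using Eqs. (A6.105) and (7.49), the relationship

$$c = \mathrm{MTTR}_0 + t_{1-\alpha}\sqrt{\mathrm{Var}_0[\tau']/n} = \mathrm{MTTR}_1 + t_\beta\sqrt{\mathrm{Var}_1[\tau']/n} \tag{7.52}$$

can be found, with $\mathrm{Var}_0[\tau'] = (e^{\sigma^2} - 1)\,\mathrm{MTTR}_0^2$ for $t_{1-\alpha}$ and $\mathrm{Var}_1[\tau'] = (e^{\sigma^2} - 1)\,\mathrm{MTTR}_1^2$ for $t_\beta$ according to Eq. (7.42). The sample size $n$ (Eq. (7.50)) follows then from Eq. (7.52), and the right-hand side of Eq. (7.51) is equal to the constant $c$ as per Eq. (7.52). The operating characteristic curve can be calculated from

$$\Pr\{\text{acceptance} \mid \mathrm{MTTR}\} = \Pr\{\hat E[\tau'] \le c \mid \mathrm{MTTR}\} = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{d} e^{-x^2/2}\,dx, \tag{7.53}$$

with

$$d = \frac{\mathrm{MTTR}_0}{\mathrm{MTTR}}\,t_{1-\alpha} - \left(1 - \frac{\mathrm{MTTR}_0}{\mathrm{MTTR}}\right)\sqrt{n/(e^{\sigma^2} - 1)}.$$

Replacing in $d$ the quantity $n/(e^{\sigma^2} - 1)$ from Eq. (7.50), one recognizes that the operating characteristic curve is independent of $\sigma^2$ (rounding of $n$ neglected).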
Example 7.12
Determine the rejection conditions (Eq. (7.51)) and the related operating characteristic curve for the demonstration of MTTR = MTTR$_0$ = 2 h against MTTR = MTTR$_1$ = 2.5 h with $\alpha = \beta = 0.1$. $\sigma^2$ is assumed to be 0.2.
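Solution
For $\alpha = \beta = 0.1$, Eq. (7.49) and Table A9.1 yield $t_{1-\alpha} \approx 1.28$ and $t_\beta \approx -1.28$. From Eq. (7.50) it follows that $n = 30$. The rejection condition is then given by

$$\hat E[\tau'] > 2\,\mathrm{h}\left(1 + 1.28\sqrt{(e^{0.2} - 1)/30}\,\right) \approx 2.22\,\mathrm{h}.$$

From Eq. (7.53), the operating characteristic curve follows as

$$\Pr\{\text{acceptance} \mid \mathrm{MTTR}\} = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{d} e^{-x^2/2}\,dx,$$

with $d \approx 25.84\,\mathrm{h}/\mathrm{MTTR} - 11.64$ (see graph).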
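The test plan of Eqs. (7.50), (7.51), and (7.53) can be checked numerically with a short sketch. The function names below are illustrative only, the normal quantiles come from Python's `statistics.NormalDist` rather than from Table A9.1, and the operating characteristic uses the same normal approximation as the text.

```python
import math
from statistics import NormalDist

def mttr_test_plan(mttr0, mttr1, alpha, beta, sigma2):
    """Sample size n (Eq. (7.50)) and rejection limit c (Eq. (7.51))."""
    t_1a = NormalDist().inv_cdf(1.0 - alpha)   # quantile t_{1-alpha}
    t_b = NormalDist().inv_cdf(beta)           # quantile t_beta (negative for beta < 0.5)
    w = math.exp(sigma2) - 1.0                 # Var[tau'] / MTTR^2 for the lognormal
    n = math.ceil((t_1a * mttr0 - t_b * mttr1) ** 2 / (mttr1 - mttr0) ** 2 * w)
    c = mttr0 * (1.0 + t_1a * math.sqrt(w / n))
    return n, c

def prob_acceptance(mttr, n, c, sigma2):
    """Operating characteristic Pr{acceptance | MTTR} as in Eq. (7.53)."""
    std = mttr * math.sqrt((math.exp(sigma2) - 1.0) / n)
    return NormalDist().cdf((c - mttr) / std)

# Data of the example: MTTR0 = 2 h, MTTR1 = 2.5 h, alpha = beta = 0.1, sigma^2 = 0.2
n, c = mttr_test_plan(2.0, 2.5, 0.1, 0.1, 0.2)
print("n =", n, " c =", round(c, 2), "h")
for mttr in (1.5, 2.0, 2.5, 3.0):
    print(mttr, round(prob_acceptance(mttr, n, c, 0.2), 3))
```

Because $n$ is rounded up to the next integer, the acceptance probability at MTTR$_1$ comes out slightly below $\beta$.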
[Graph: operating characteristic curve Pr{acceptance | MTTR} as a function of MTTR [h]]

7.4 Accelerated Testing
The failure rate $\lambda$ of electronic components lies typically between $10^{-10}$ and $10^{-7}\,\mathrm{h}^{-1}$, and that of assemblies in the range of $10^{-7}$ to $10^{-5}\,\mathrm{h}^{-1}$. With such figures, cost and scheduling considerations demand the use of accelerated testing for $\lambda$ estimation and demonstration, in particular if reliable field data are not available. An accelerated test is a test in which the applied stress is chosen to exceed that encountered in field operation, but still below the technological limits. The aim is to shorten the time to failure of the item considered while avoiding an alteration of the involved failure mechanism (genuine acceleration). In accelerated tests, failure mechanisms are assumed to be activated selectively by increased stress. The quantitative relationship between degree of activation and extent of stress, i.e. the acceleration factor $A$, is determined via specific tests. Generally it is assumed that the stress will not change the type of the failure-free time distribution function of the item under test, but only modify the parameters. In the following, this hypothesis is assumed to be valid (its verification should precede each statistical evaluation of data issued from accelerated tests).

Many electronic component failure mechanisms are activated through an increase in temperature. In calculating the acceleration factor $A$, the Arrhenius model can often be applied over a reasonably large temperature range (for instance 0 to 150°C for ICs). The Arrhenius model is based on the Arrhenius rate law [3.44], which states that the rate $v$ of a simple (first-order) chemical reaction depends on temperature $T$ as

$$v = v_0\,e^{-E_a/kT}. \tag{7.54}$$

$E_a$ and $v_0$ are parameters, $k$ is the Boltzmann constant ($k = 8.6\cdot10^{-5}$ eV/K), and $T$ the absolute temperature in Kelvin. $E_a$ is the activation energy and is expressed in eV. Assuming that the event considered (for example the diffusion between two liquids) occurs when the chemical reaction has reached a given threshold, and the reaction time dependence is given by a function $r(t)$, then the relationship between the times $t_1$ and $t_2$ necessary to reach at two temperatures $T_1$ and $T_2$ a given level of the chemical reaction considered can be expressed as

$$v_1\,r(t_1) = v_2\,r(t_2).$$

Furthermore, assuming $r(t) \sim t$, i.e. a linear time dependence, it follows that

$$v_1 t_1 = v_2 t_2.$$

Substituting in Eq. (7.54) and rearranging yields

$$\frac{t_1}{t_2} = \frac{v_2}{v_1} = e^{\frac{E_a}{k}\left(\frac{1}{T_1} - \frac{1}{T_2}\right)}.$$
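By transferring this deterministic model to the mean times to failure MTTF$_1$ and MTTF$_2$ or to the constant failure rates $\lambda_2$ and $\lambda_1$ (using MTTF = $1/\lambda$) of a given item at temperatures $T_1$ and $T_2$, it is possible to define an acceleration factor $A$

$$A = \frac{\mathrm{MTTF}_1}{\mathrm{MTTF}_2}, \qquad\text{or}\qquad A = \frac{\lambda_2}{\lambda_1} \ \text{ for constant failure rate}, \tag{7.55}$$

expressed by

$$A \approx e^{\frac{E_a}{k}\left(\frac{1}{T_1} - \frac{1}{T_2}\right)}. \tag{7.56}$$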
The right-hand side of Eq. (7.55) applies to the case of a constant (time independent but stress dependent) failure rate $\lambda(t) = \lambda$. In some cases, $A = t_{0.5,1}/t_{0.5,2}$ is assumed (Appendix A6.6.3). Eq. (7.56) can be reversed to give an estimate $\hat E_a$ for the activation energy $E_a$ based on $\hat\lambda_1$ and $\hat\lambda_2$ obtained empirically from two life tests at temperatures $T_1$ and $T_2$. To verify the model, at least three tests at $T_1$, $T_2$, and $T_3$ are necessary. Activation energy is highly dependent upon the particular failure mechanism involved (see e.g. Table 3.6 for some indications for semiconductor devices). High $E_a$ values lead to high acceleration factors, due to the assumed relations $v_1 t_1 = v_2 t_2$ and $v \sim e^{-E_a/kT}$. For ICs, global values of $E_a$ lie between 0.3 and 0.7 eV (Table 3.6), values which could basically be obtained empirically from the curves of the failure rate as a function of the junction temperature. It must be noted that the Arrhenius model does not hold for all electronic devices, nor for any temperature range. Figure 7.10 shows the acceleration factor $A$ from Eq. (7.56) as a function of $\theta_2$ in °C, for $\theta_1$ = 35 and 55°C and with $E_a$ as parameter ($\theta_i = T_i - 273$). In the case of a constant failure rate $\lambda$, the acceleration factor $A = \lambda_2/\lambda_1$ can be used as a multiplicative factor in the conversion of the cumulative operating time from stress $T_2$ to stress $T_1$ (Example 7.13). In practical applications, the acceleration factor $A$ often lies between 10 and some few hundreds, seldom above 1000 (Examples 7.13 and 7.14).
Figure 7.10 Acceleration factor $A$ according to the Arrhenius model (Eq. (7.56)) as a function of $\theta_2$ for $\theta_1$ = 35 and 55°C, and with $E_a$ in eV as parameter ($\theta_i = T_i - 273$)
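Instead of reading Fig. 7.10, the acceleration factor of Eq. (7.56) can be evaluated directly. The sketch below uses the rounded constants quoted in the text ($k = 8.6\cdot10^{-5}$ eV/K and $T = \theta + 273$); the function names are illustrative, and the second function simply reverses Eq. (7.56) to obtain an activation energy from two failure rates.

```python
import math

K_BOLTZMANN_EV = 8.6e-5   # eV/K, rounded value used in the text

def arrhenius_factor(theta1_c, theta2_c, ea_ev):
    """Acceleration factor A of Eq. (7.56) for temperatures given in deg C."""
    t1 = theta1_c + 273.0
    t2 = theta2_c + 273.0
    return math.exp(ea_ev / K_BOLTZMANN_EV * (1.0 / t1 - 1.0 / t2))

def activation_energy(theta1_c, theta2_c, lam1, lam2):
    """Eq. (7.56) reversed: E_a from failure rates lam1 (at theta1) and lam2 (at theta2)."""
    t1 = theta1_c + 273.0
    t2 = theta2_c + 273.0
    return K_BOLTZMANN_EV * math.log(lam2 / lam1) / (1.0 / t1 - 1.0 / t2)

a = arrhenius_factor(35.0, 130.0, 0.4)
print(round(a, 1))                                     # about 35 for 35 -> 130 deg C, 0.4 eV
print(round(activation_energy(35.0, 130.0, 1.0, a), 3))
```

A single pair of temperatures only determines $E_a$ if the Arrhenius model holds; as noted in the text, at least three temperatures are needed to verify the model itself.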
If the item under consideration exhibits more than one dominant failure mechanism or consists of elements $E_1, ..., E_n$ having different failure mechanisms, the series reliability model (Sections 2.2.6.1 and 2.3.6) can often be used to calculate the compound failure rate $\lambda_S(T_2)$ at temperature (stress) $T_2$ by considering the failure rates of the individual elements $\lambda_i(T_1)$ and the corresponding acceleration factors $A_i$

$$\lambda_S(T_2) = \sum_{i=1}^{n} A_i\,\lambda_i(T_1). \tag{7.57}$$
Example 7.13
Four failures have occurred during $10^7$ cumulative operating hours of a digital CMOS IC at a chip temperature of 130°C. Assuming $\theta_1$ = 35°C, a constant failure rate $\lambda$, and an activation energy $E_a$ = 0.4 eV, give the interval estimation of $\lambda$ for $\gamma = 0.8$.
Solution
For $\theta_1$ = 35°C, $\theta_2$ = 130°C, and $E_a$ = 0.4 eV it follows from Fig. 7.10 or Eq. (7.56) that $A \approx 35$. The cumulative operating time at 35°C is thus $T = 0.35\cdot10^9$ h and the point estimate for $\lambda$ is $\hat\lambda = k/T \approx 11.4\cdot10^{-9}\,\mathrm{h}^{-1}$. With $k = 4$ and $\gamma = 0.8$, it follows from Fig. 7.6 that $\hat\lambda_l/\hat\lambda \approx 0.43$ and $\hat\lambda_u/\hat\lambda \approx 2$; the confidence interval of $\lambda$ is therefore $[4.9, 22.8]\cdot10^{-9}\,\mathrm{h}^{-1}$.
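The arithmetic of Example 7.13 can be retraced in a few lines. The sketch takes the rounded acceleration factor and the two factors read from Fig. 7.6 as given in the example; it does not recompute the confidence limits from the chi-square distribution, and the variable names are illustrative.

```python
failures = 4                 # k, observed at 130 deg C
hours_at_130 = 1e7           # cumulative operating hours at 130 deg C
acceleration = 35.0          # A from Eq. (7.56), rounded as in the example

# Convert the cumulative operating time from 130 deg C to 35 deg C
equivalent_hours = acceleration * hours_at_130
lam_hat = failures / equivalent_hours          # point estimate in h^-1

# Factors for k = 4 and gamma = 0.8, as read from Fig. 7.6 in the example
lam_lower = 0.43 * lam_hat
lam_upper = 2.0 * lam_hat

print(f"lambda_hat = {lam_hat:.3e} 1/h")
print(f"interval   = [{lam_lower:.2e}, {lam_upper:.2e}] 1/h")
```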
Example 7.14
A PCB contains 10 metal film resistors with stress factor $S = 0.1$ and $\lambda(25°\mathrm{C}) = 0.2\cdot10^{-9}\,\mathrm{h}^{-1}$, 5 ceramic capacitors (class 1) with $S = 0.4$ and $\lambda(25°\mathrm{C}) = 0.8\cdot10^{-9}\,\mathrm{h}^{-1}$, 2 electrolytic capacitors (Al wet) with $S = 0.6$ and $\lambda(25°\mathrm{C}) = 6\cdot10^{-9}\,\mathrm{h}^{-1}$, and 4 ceramic-packaged linear ICs with $\Delta\theta_{JA}$ = 10°C and $\lambda(35°\mathrm{C}) = 20\cdot10^{-9}\,\mathrm{h}^{-1}$. Neglecting the contribution of the printed wiring and of the solder joints, give the failure rate of the PCB at a burn-in temperature $\theta_A$ of 80°C on the basis of failure rate relationships as given in Fig. 2.4.
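Solution
The resistor and capacitor acceleration factors can be obtained from Fig. 2.4 as

resistor: $A \approx 2.5/0.7 \approx 3.6$
ceramic capacitor (class 1): $A \approx 4.2/0.5 = 8.4$
electrolytic capacitor (Al wet): $A \approx 13.6/0.35 \approx 38.9$.

Using Eq. (2.4) for the ICs, it follows that $\lambda \sim \pi_T$. With $\theta_J$ = 35°C and 90°C, the acceleration factor for the linear ICs can then be obtained from Fig. 2.5 as $A \approx 7.5/0.8 \approx 9.4$. From Eq. (7.57), the failure rate of the PCB is then

$$\lambda(80°\mathrm{C}) \approx (10\cdot3.6\cdot0.2 + 5\cdot8.4\cdot0.8 + 2\cdot38.9\cdot6 + 4\cdot9.4\cdot20)\cdot10^{-9}\,\mathrm{h}^{-1} \approx 1.26\cdot10^{-6}\,\mathrm{h}^{-1}.$$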
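The series-model sum of Eq. (7.57) for the PCB of Example 7.14 can be written out as a small table-driven calculation. The sketch assumes that the reference failure rates are expressed in units of $10^{-9}\,\mathrm{h}^{-1}$ and uses the rounded acceleration factors from the solution; the dictionary layout is only an illustration.

```python
# part type: (number of parts, reference failure rate in 1e-9 h^-1, acceleration factor A)
parts = {
    "metal film resistor":            (10, 0.2, 3.6),
    "ceramic capacitor (class 1)":    (5, 0.8, 8.4),
    "electrolytic capacitor, Al wet": (2, 6.0, 38.9),
    "linear IC, ceramic package":     (4, 20.0, 9.4),
}

# Eq. (7.57): the compound failure rate is the sum of the accelerated element failure rates
contributions = {name: n * lam * a for name, (n, lam, a) in parts.items()}
lam_pcb = sum(contributions.values()) * 1e-9   # in h^-1 at 80 deg C ambient

for name, value in contributions.items():
    print(f"{name:32s} {value:7.1f} e-9 1/h")
print(f"PCB at burn-in temperature: {lam_pcb:.3e} 1/h")
```

The breakdown shows that the ICs and the electrolytic capacitors dominate the failure rate at the burn-in temperature.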
A further model for investigating the time scale reduction (time compression) resulting from an increase in temperature has been proposed by H. Eyring [3.44, 7.24]. The Eyring model defines the acceleration factor as

where $B$ is not necessarily an activation energy. Eyring also suggests the following model, which considers the influences of temperature $T$ and of a further stress $X$

Equation (7.59) is known as the generalized Eyring model. In this model, a function of the normalized variable $x = X/X_0$ can also be used instead of the quantity $X$ itself (for example $x^n$, $1/x^n$, $\ln x^n$, $\ln(1/x^n)$). $B$ is not necessarily an activation energy, $C$ and $D$ are constants. The generalized Eyring model led to accepted models for electromigration (Black), corrosion (Peck), and voltage stress (Kemeny)

where $j$ = current density, $RH$ = relative humidity, and $V$ = voltage, respectively (see also Eqs. (3.2)-(3.6) and Table 3.6). For failure mechanisms related to mechanical fatigue, Coffin-Manson simplified models [2.61, 2.72] (based on the inverse power law) can often be used, yielding for the number of cycles to failure

where $\Delta T$ refers to thermal cycles and $G$ refers to $g_{rms}$ values in vibration tests ($0.5 < \beta_T < 0.8$ and $0.7 < \beta_M < 0.9$ often occur in practical applications). For damage accumulation, Miner's hypothesis of independent damage increments [3.56] must be considered with care. Also known, for conductive filament formation, is Rudra's model [7.26]. Critical remarks on accelerated tests are given e.g. in [7.13, 7.15, 7.21]. Refinement of the above models is in progress, in particular for ULSI ICs, with emphasis on:
1. New failure mechanisms in oxide and package, as well as new externally induced failure mechanisms.
2. Identification and analysis of causes for early failures or premature wearout.
3. Development of physical models for failure mechanisms and of simplified models for reliability predictions in practical applications.
Such efforts will give better physical understanding of the component's failure rate.
In addition to the accelerated tests discussed above, a rough estimate of component life time can often be obtained through short-term tests under extreme stresses (HALT, HAST, etc.). Examples are humidity testing of plastic-packaged ICs at high pressure and nearly 100% RH, or tests of ceramic-packaged ICs at temperatures up to 350°C. Experience shows that under high stress, life time is often lognormally distributed, thus with a strong time dependence of the failure rate (see e.g. Table A6.1 and Appendix A6.10.5). Highly accelerated tests (HAST) can activate failure mechanisms which would not occur during normal operation, so care is necessary in extrapolating results to situations exhibiting lower stresses. Often, the purpose of such tests is to force (not only to activate) failures. They thus belong to the class of semi-destructive or destructive tests, often used at the qualification of prototypes to investigate possible failure modes. The same holds for step-stress accelerated tests (often used as life tests or in screening procedures), for which accumulation of damage can be more complex (see e.g. [7.20, 7.28]). A case-by-case investigation is mandatory for all tests of this kind.
7.5 Goodness-of-fit Tests
Let $t_1, ..., t_n$ be $n$ independent observations of a random variable $\tau$ distributed according to $F(t)$. A rule is sought to test the null hypothesis $H_0$: $F(t) = F_0(t)$, for a given type I error $\alpha$ (probability of rejecting a true hypothesis $H_0$), against a general alternative hypothesis $H_1$: $F(t) \ne F_0(t)$. Goodness-of-fit tests deal with such testing of hypotheses and are often based on the empirical distribution function (EDF), see Appendix A8.3 for an introduction. This section shows the use of Kolmogorov-Smirnov and chi-square tests (see p. 534 for some related tests). Trend tests are discussed in Section 7.6.
7.5.1 Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov test is based on the convergence for $n \to \infty$ of the empirical distribution function (Eq. (A8.1))

$$\hat F_n(t) = \begin{cases} 0 & \text{for } t < t_{(1)} \\ i/n & \text{for } t_{(i)} \le t < t_{(i+1)} \\ 1 & \text{for } t \ge t_{(n)} \end{cases} \tag{7.62}$$

to the true distribution function, and compares the experimentally obtained $\hat F_n(t)$ with the given (postulated) $F_0(t)$. $F_0(t)$ is assumed here to be known and continuous, $t_{(1)}, ..., t_{(n)}$ are the ordered observations. The procedure is as follows:
Figure 7.11 Largest deviation $y_{1-\alpha}$ between a postulated distribution function $F_0(t)$ and the corresponding empirical distribution function $\hat F_n(t)$ at the level $1-\alpha$ ($\Pr\{D_n \le y_{1-\alpha} \mid F_0(t)\ \text{true}\} = 1 - \alpha$)
1. Determine the largest deviation $D_n$ between $\hat F_n(t)$ and $F_0(t)$

$$D_n = \sup_{-\infty < t < \infty} |\hat F_n(t) - F_0(t)|. \tag{7.63}$$

2. From the given type I error $\alpha$ and the sample size $n$, use Table A9.5 or Fig. 7.11 to determine the critical value $y_{1-\alpha}$.
3. Reject $H_0$: $F(t) = F_0(t)$ if $D_n > y_{1-\alpha}$; otherwise accept $H_0$.

This procedure can be easily combined with a graphical evaluation of data. For this purpose, $\hat F_n(t)$ and the band $F_0(t) \pm y_{1-\alpha}$ are drawn using a probability chart on which $F_0(t)$ can be represented by a straight line. If $\hat F_n(t)$ leaves the band $F_0(t) \pm y_{1-\alpha}$, the hypothesis $H_0$: $F(t) = F_0(t)$ is to be rejected (note that the band width is not constant when using a probability chart). Probability charts are discussed in Appendix A8.1.3, examples are in Appendix A9.8 and Figs. 7.12-7.14. Example 7.15 (Fig. 7.12) shows a graphical evaluation of data for the case of a Weibull distribution, Example 7.16 (Fig. 7.13) investigates the distribution function of a population with early failures and a constant failure rate using a Weibull probability chart, and Example 7.17 (Fig. 7.14) uses the Kolmogorov-Smirnov test to check agreement with a lognormal distribution. If $F_0(t)$ is not completely known, a modification is necessary (Appendix A8.3.3).

Example 7.15
Accelerated life testing of a wet Al electrolytic capacitor leads to the following 13 ordered observations of lifetime: 59, 71, 153, 235, 347, 589, 837, 913, 1185, 1273, 1399, 1713, and 2567 h. (i) Draw the empirical distribution function of data on a Weibull probability chart. (ii) Assuming that the underlying distribution function is Weibull, determine $\hat\lambda$ and $\hat\beta$ graphically. (iii) The maximum likelihood estimation of $\lambda$ and $\beta$ yields $\hat\beta = 1.12$; calculate $\hat\lambda$ and compare the results of (iii) with (ii).
Solution
(i) Figure 7.12 presents the empirical distribution function $\hat F_n(t)$ on Weibull probability paper. (ii) The graphical determination of $\lambda$ and $\beta$ leads to (straight line (ii)) $\hat\lambda \approx 1/(840\,\mathrm{h})$ and $\hat\beta \approx 1.05$. (iii) With $\hat\beta \approx 1.12$, Eq. (A8.31) yields $\hat\lambda \approx 1/(908\,\mathrm{h})$ (straight line (iii)) (see also Example A8.11).
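Part (iii) of Example 7.15 can be reproduced numerically. The sketch assumes the Weibull parametrization $F(t) = 1 - e^{-(\lambda t)^\beta}$, for which the maximum likelihood condition for $\lambda$ at fixed $\beta$ reduces to $\hat\lambda = (n / \sum t_i^{\hat\beta})^{1/\hat\beta}$; whether this is literally the form printed as Eq. (A8.31) is an assumption here.

```python
# Ordered lifetimes in hours from Example 7.15
t = [59, 71, 153, 235, 347, 589, 837, 913, 1185, 1273, 1399, 1713, 2567]
beta_hat = 1.12          # ML estimate of the shape parameter, given in the example
n = len(t)

# ML estimate of lambda for F(t) = 1 - exp(-(lambda*t)**beta) with beta fixed
lam_hat = (n / sum(x ** beta_hat for x in t)) ** (1.0 / beta_hat)

print(f"lambda_hat = 1/({1.0 / lam_hat:.0f} h)")
```

Under the stated parametrization the result agrees with the value of about 1/(908 h) given in the solution.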
Figure 7.12 Empirical distribution function $\hat F_n(t)$ and estimated Weibull distribution functions ((ii) and (iii)) as per Example 7.15

Figure 7.13 Shape of a weighted sum of a Weibull distribution $F_a(t)$ and an exponential distribution $F_b(t)$ as per Example 7.16, useful to detect (describe) early failures (similar for wearout failures)
Example 7.16
Investigate the mixed distribution function $F(t) = 0.2\,[1 - e^{-(0.1t)^{0.5}}] + 0.8\,[1 - e^{-0.0005t}]$ on a Weibull probability chart (describing a possible early failure period).
Solution
The weighted sum of a Weibull distribution ($\beta = 0.5$, $\lambda = 0.1\,\mathrm{h}^{-1}$, and MTTF = 20 h) with an exponential distribution ($\lambda = 0.0005\,\mathrm{h}^{-1}$ and MTTF = MTBF = $1/\lambda$ = 2000 h) represents the distribution function of a population of items with failure rate

$$\lambda(t) = \frac{f(t)}{1 - F(t)} = \frac{0.01\,(0.1t)^{-0.5}\,e^{-(0.1t)^{0.5}} + 0.0004\,e^{-0.0005t}}{0.2\,e^{-(0.1t)^{0.5}} + 0.8\,e^{-0.0005t}},$$

i.e. with early failures up to about $t \approx 200$ h, see graph ($\lambda(t)$ is practically constant at $0.0005\,\mathrm{h}^{-1}$ for $t$ between 300 h and 400,000 h, so that for $t > 300$ h a constant failure rate can be assumed for practical purposes). Figure 7.13 gives the function $F(t)$ on a Weibull probability chart, showing the typical s-shape.

Example 7.17
Use the Kolmogorov-Smirnov test to verify, with a type I error $\alpha = 0.2$, whether the repair times defined by the observations $t_1, ..., t_{10}$ of Example 7.11 are distributed according to a lognormal distribution function with parameters $\lambda = 0.5\,\mathrm{h}^{-1}$ and $\sigma = 0.4$ (hypothesis $H_0$).
Solution
The lognormal distribution (Eq. (7.41)) with $\lambda = 0.5\,\mathrm{h}^{-1}$ and $\sigma = 0.4$ is represented by a straight line on Fig. 7.14 ($F_0(t)$). With $\alpha = 0.2$ and $n = 10$, Table A9.5 or Fig. 7.11 yields $y_{1-\alpha} \approx 0.323$ and thus the band $F_0(t) \pm 0.323$. Since the empirical distribution function $\hat F_n(t)$ does not leave the band $F_0(t) \pm 0.323$, the hypothesis $H_0$ can be accepted.
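The graphical check of Example 7.17 can be complemented by computing the Kolmogorov-Smirnov statistic $D_n$ directly. The sketch assumes the lognormal form $F_0(t) = \Phi(\ln(\lambda t)/\sigma)$, consistent with $\ln\tau$ having mean $\ln(1/\lambda)$ and variance $\sigma^2$, and takes the critical value 0.323 from the text. Since the empirical distribution function is a step function, the supremum is found by comparing $F_0$ with the step heights just before and at each observation.

```python
import math
from statistics import NormalDist

# Repair times in hours (Example 7.11) and postulated parameters (Example 7.17)
t = sorted([1.1, 1.3, 1.6, 1.9, 2.0, 2.3, 2.4, 2.7, 3.1, 4.2])
lam, sigma = 0.5, 0.4
n = len(t)

def lognormal_cdf(x):
    """Postulated F0(t) = Phi(ln(lambda * t) / sigma)."""
    return NormalDist().cdf(math.log(lam * x) / sigma)

# Largest deviation between the empirical step function and F0
d_n = max(
    max((i + 1) / n - lognormal_cdf(x), lognormal_cdf(x) - i / n)
    for i, x in enumerate(t)
)

critical = 0.323            # y_{1-alpha} for alpha = 0.2 and n = 10, as quoted in the text
accept = d_n <= critical
print(f"D_n = {d_n:.3f}, accept H0: {accept}")
```

The largest deviation occurs just below $t = 1.9$ h and stays well inside the band.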
7.5.2 Chi-square Test

The chi-square test ($\chi^2$ test) can be used for continuous or noncontinuous distribution functions $F_0(t)$ of $\tau$. Furthermore, $F_0(t)$ need not be completely known. For $F_0(t)$ completely known, the procedure is as follows:

1. Partition the definition range of the random variable $\tau$ into $k$ intervals (classes) $(a_1, a_2], (a_2, a_3], ..., (a_k, a_{k+1}]$; the choice of the classes must be made independently of the observations $t_1, ..., t_n$ (rule: $n p_i \ge 5$, with $p_i$ as in point 3).
2. Determine the number of observations $k_i$ in each class $(a_i, a_{i+1}]$, $i = 1, ..., k$ ($k_i$ = number of $t_j$ with $a_i < t_j \le a_{i+1}$).
3. Assuming the hypothesis $H_0$, compute the expected number of observations for each class $(a_i, a_{i+1}]$

$$n\,p_i = n\,(F_0(a_{i+1}) - F_0(a_i)), \qquad i = 1, ..., k. \tag{7.64}$$
Figure 7.14 Kolmogorov-Smirnov test to check the repair time distribution as per Example 7.17 (the distribution function with $\hat\lambda$ and $\hat\sigma$ from Example 7.11 is shown dashed for information only)
4. Compute the statistic

$$X_n^2 = \sum_{i=1}^{k} \frac{(k_i - n\,p_i)^2}{n\,p_i}. \tag{7.65}$$

5. For a given type I error $\alpha$, use Table A9.2 or Fig. 7.15 to determine the $(1-\alpha)$ quantile of the chi-square distribution with $k - 1$ degrees of freedom, $\chi^2_{k-1,1-\alpha}$.
6. Reject $H_0$: $F(t) = F_0(t)$ if $X_n^2 > \chi^2_{k-1,1-\alpha}$; otherwise accept $H_0$.
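If $F_0(t)$ is not completely known ($F_0(t) = F_0(t, \theta_1, ..., \theta_r)$, where $\theta_1, ..., \theta_r$ are unknown parameters), modify the above procedure after step 2 as follows:

3'. On the basis of the observations $k_i$ in each class $(a_i, a_{i+1}]$, $i = 1, ..., k$, determine the maximum likelihood estimates for the parameters $\theta_1, ..., \theta_r$ from the following system of $r$ algebraic equations

$$\sum_{i=1}^{k} \frac{k_i}{p_i(\theta_1, ..., \theta_r)}\,\frac{\partial p_i(\theta_1, ..., \theta_r)}{\partial\theta_j} = 0 \quad\text{for } \theta_j = \hat\theta_j, \qquad j = 1, ..., r, \tag{7.66}$$

with $p_i = F_0(a_{i+1}, \theta_1, ..., \theta_r) - F_0(a_i, \theta_1, ..., \theta_r) > 0$, $p_1 + ... + p_k = 1$, and $k_1 + ... + k_k = n$; for each class $(a_i, a_{i+1}]$, compute the expected number of observations, i.e.

$$n\,\hat p_i = n\,[F_0(a_{i+1}, \hat\theta_1, ..., \hat\theta_r) - F_0(a_i, \hat\theta_1, ..., \hat\theta_r)], \qquad i = 1, ..., k. \tag{7.67}$$

4'. Calculate the statistic

$$\hat X_n^2 = \sum_{i=1}^{k} \frac{(k_i - n\,\hat p_i)^2}{n\,\hat p_i}. \tag{7.68}$$

5'. For a given type I error $\alpha$, use Table A9.2 or Fig. 7.15 to determine the $(1-\alpha)$ quantile of the $\chi^2$ distribution with $k - 1 - r$ degrees of freedom.
6'. Reject $H_0$: $F(t) = F_0(t)$ if $\hat X_n^2 > \chi^2_{k-1-r,1-\alpha}$; otherwise accept $H_0$.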
Comparing the above two procedures, it can be noted that the number of degrees of freedom has been reduced from $k - 1$ to $k - 1 - r$, where $r$ is the number of parameters of $F_0(t)$ which have been estimated from the observations $t_1, ..., t_n$ using the multinomial distribution (Example A8.13).

Example 7.18
Let 160, 380, 620, 650, 680, 730, 750, 920, 1000, 1100, 1400, 1450, 1700, 2000, 2200, 2800, 3000, 4600, 4700, and 5000 h be 20 independent observations (realizations) of the failure-free time $\tau$ for a given assembly. Using the chi-square test for $\alpha = 0.1$ and the 4 classes (0, 500], (500, 1000], (1000, 2000], (2000, $\infty$), determine whether or not $\tau$ is exponentially distributed (hypothesis $H_0$: $F(t) = 1 - e^{-\lambda t}$, $\lambda$ unknown).
Solution
The given classes yield numbers of observations of $k_1 = 2$, $k_2 = 7$, $k_3 = 5$, and $k_4 = 6$. The point estimate of $\lambda$ is then given by Eq. (7.66) with $p_i = e^{-\lambda a_i} - e^{-\lambda a_{i+1}}$, yielding for $\hat\lambda$ the numerical solution $\hat\lambda \approx 0.562\cdot10^{-3}\,\mathrm{h}^{-1}$. Thus, the numbers of expected observations in each of the 4 classes are, according to Eq. (7.67), $n\hat p_1 = 4.899$, $n\hat p_2 = 3.699$, $n\hat p_3 = 4.90$, and $n\hat p_4 = 6.499$. From Eq. (7.68) it follows that $\hat X_{20}^2 = 4.70$ and from Table A9.2, $\chi^2_{2,0.9} = 4.605$. The hypothesis $H_0$: $F(t) = 1 - e^{-\lambda t}$ must be rejected since $\hat X_n^2 > \chi^2_{k-1-r,1-\alpha}$.
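The steps of Example 7.18 can be followed in code. The sketch below maximizes the grouped (multinomial) log-likelihood $\sum k_i \ln p_i(\lambda)$, which is equivalent to solving Eq. (7.66) for the single parameter $\lambda$, by a simple ternary search; the search interval and the use of ternary search are choices made for this illustration, and the quantile 4.605 is taken from the text.

```python
import math

obs = [160, 380, 620, 650, 680, 730, 750, 920, 1000, 1100,
       1400, 1450, 1700, 2000, 2200, 2800, 3000, 4600, 4700, 5000]
bounds = [0.0, 500.0, 1000.0, 2000.0, math.inf]     # class limits a_1, ..., a_5

# Step 2: observations per class (a_i, a_{i+1}]
k = [sum(1 for x in obs if lo < x <= hi) for lo, hi in zip(bounds, bounds[1:])]

def class_probs(lam):
    """p_i = exp(-lam * a_i) - exp(-lam * a_{i+1}) for the exponential hypothesis."""
    surv = [math.exp(-lam * a) for a in bounds]
    return [s0 - s1 for s0, s1 in zip(surv, surv[1:])]

def log_lik(lam):
    return sum(ki * math.log(pi) for ki, pi in zip(k, class_probs(lam)))

# Step 3': maximize the grouped log-likelihood (unimodal in lam) by ternary search
lo, hi = 1e-5, 1e-2
for _ in range(200):
    m1 = lo + (hi - lo) / 3.0
    m2 = hi - (hi - lo) / 3.0
    if log_lik(m1) < log_lik(m2):
        lo = m1
    else:
        hi = m2
lam_hat = (lo + hi) / 2.0

# Steps 4' to 6': expected counts, statistic, and decision with k - 1 - r = 2 degrees of freedom
n = len(obs)
expected = [n * p for p in class_probs(lam_hat)]
x2 = sum((ki - e) ** 2 / e for ki, e in zip(k, expected))
CHI2_2_090 = 4.605
reject = x2 > CHI2_2_090

print(k, f"lambda_hat = {lam_hat:.3e} 1/h", [round(e, 3) for e in expected])
print(f"X2 = {x2:.2f}, reject H0: {reject}")
```

The statistic exceeds the critical value only narrowly, so the decision is sensitive to the choice of classes.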
Figure 7.15 $(1-\alpha)$ quantile ($\alpha$ percentage point) of the chi-square distribution with $\nu$ degrees of freedom ($\chi^2_{\nu,1-\alpha}$)
7.6 Statistical Analysis of General Reliability Data
7.6.1 General considerations

In Sections 7.2 - 7.5, data were issued from a sample of a random variable $\tau$, i.e. they were $n$ statistically independent realizations (observations) $t_1, ..., t_n$ of a random variable $\tau$ distributed according to $F(t) = \Pr\{\tau \le t\}$, and belonging to one of the following equivalent situations:

1. Life times $t_1, ..., t_n$ of $n$ statistically identical and independent items, all starting at $t = 0$ when plotted on the time axis (e.g. as in Figs. 1.1, 7.12, 7.14).
2. Failure-free times separating successive failure occurrences of a repairable item (system) with negligible repair times and repaired (restored) as a whole to as-good-as-new at each repair; i.e., statistically identical and independent interarrival times with a common distribution function ($F(x)$), yielding a renewal process.
To this data structure belongs also the case considered in Example 7.19. A basically different situation arises when the observations are arbitrary points on the time axis, i.e. when considering a general point process. To distinguish this case, the involved random variables are labeled $\tau_1^*, \tau_2^*, ...$, with $t_1^*, t_2^*, ...$ for the corresponding realizations ($t_1^* < t_2^* < ...$ is assumed). This situation occurs in reliability tests when only the failed element in a system is repaired to as-good-as-new, and there is at least one element in the system which has a time dependent failure rate. Failure-free times (interarrival times, by assuming negligible repair times) are in this case neither independent nor equally distributed. Only the case of a series system with constant failure rates for all elements ($\lambda_1, ..., \lambda_n$) leads (if repaired elements are as-good-as-new) to a homogeneous Poisson process (Appendix A7.2.5), for which interarrival times are statistically independent random variables with the common distribution function $F(x) = 1 - e^{-(\lambda_1 + ... + \lambda_n)x}$ (Eqs. (2.19), (7.27)). Shortcomings because of neglecting this basic property are known, see e.g. [6.3, 7.11, A7.30].

Example 7.19
Let $F(t)$ be the distribution function of the failure-free time of a given item. Suppose that at $t = 0$ an unknown number $n$ of items are put into operation and that at the time $t_0$ exactly $k$ items have failed (no replacement or repair has been done). Give a point estimate for $n$.
Solution
Setting $p = F(t_0)$, the number $k$ of failures in $(0, t_0]$ is binomially distributed (Eq. (A6.120))

$$\Pr\{k \text{ failures in } (0, t_0]\} = p_k = \binom{n}{k} p^k (1-p)^{n-k}, \qquad\text{with } p = F(t_0). \tag{7.69}$$

An estimate for $n$ can be obtained using the maximum likelihood method, yielding (Eq. (A8.23)) $L = \binom{n}{k} p^k (1-p)^{n-k}$ and finally, with $\partial L/\partial n = 0$ for $n = \hat n$ ($n$ is the unknown parameter),

$$\hat n = k/p = k/F(t_0). \tag{7.70}$$

For Eq. (7.70), the approximation $\binom{n}{k} \approx (e^{-k}/k!)\,(n^n/(n-k)^{(n-k)})$ has been used (Stirling formula). The Poisson approximation $p_k \approx e^{-np}(np)^k/k!$ (Eq. (A6.129)) also yields $\hat n = k/p$.
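The quality of the approximation behind Eq. (7.70) can be checked by maximizing the exact binomial likelihood over integer $n$. The values $k = 4$ and $p = 0.3$ below are hypothetical and chosen only for illustration; the search range is an arbitrary choice that comfortably contains the maximum.

```python
import math

def likelihood(n, k, p):
    """Exact binomial likelihood of n for k failures observed with p = F(t0)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

k, p = 4, 0.3                      # hypothetical number of failures and F(t0)

# Integer n maximizing the exact likelihood, searched by brute force
n_exact = max(range(k, 200), key=lambda n: likelihood(n, k, p))
n_approx = k / p                   # Eq. (7.70)

print(n_exact, round(n_approx, 2))
```

In this illustration the exact maximizer is the integer part of $k/p$, so the estimate of Eq. (7.70) differs from it by less than one item.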
Easy to investigate when observing data on the time axis are cases involving nonhomogeneous Poisson processes (Sections 7.6.2, 7.6.3, 7.7, Appendix A7.8.2). For more general situations, difficulties can arise (except for some general results valid for stationary point processes (Appendices A7.8.3 - A7.8.5)), and the following basic rule should apply:

If neither a Poisson process (homogeneous or nonhomogeneous) nor a renewal process can be assumed for the underlying point process, care is necessary in identifying possible models; in any case, validation of model assumptions (from a physical and statistical point of view) should precede data analysis.
The homogeneous Poisson process (HPP), introduced in Appendix A7.2.5 as a particular case of a renewal process, is the simplest point process. It is memoryless and tools for a statistical investigation are known. Nonhomogeneous Poisson processes (NHPPs) are without aftereffect (Appendix A7.8.2) and for investigation purposes they can be transformed into an HPP (Eq. (A7.200)). Investigation on renewal processes (Appendix A7.2) can be reduced to that of independent random variables with a common distribution function (cases 1 and 2 above). However, disregarding the last part of the above general rule can lead to mistakes, even in the case of renewal processes or independent realizations of a random variable $\tau$. As an example, let us consider an item with two independent failure mechanisms, one appearing with constant failure rate $\lambda_0 = 10^{-3}\,\mathrm{h}^{-1}$ and the second (wearout) with a shifted Weibull distribution $F(t) = 1 - e^{-(\lambda(t-\psi))^\beta}$ with $\lambda = 10^{-2}\,\mathrm{h}^{-1}$, $\psi = 10^4$ h, and $\beta = 3$ ($t > \psi$, $F(t) = 0$ for $t \le \psi$). As case 2 in Eq. (A6.34), the failure-free time $\tau$ has the distribution function $F(t) = 1 - e^{-\lambda_0 t}$ for $0 \le t \le \psi$ and $F(t) = 1 - e^{-\lambda_0 t}\,e^{-(\lambda(t-\psi))^\beta}$ for $t > \psi$ (failure rate $\lambda(t) = \lambda_0$ for $t \le \psi$ and $\lambda(t) = \lambda_0 + \beta\lambda^\beta(t-\psi)^{\beta-1}$ for $t > \psi$, similar to a series model with independent elements (Eq. (2.17))). If the presence of the above two failure mechanisms is ignored and the test is stopped (censored) after $10^4$ h, the wrong conclusion can be drawn that the item has a constant failure rate of about $10^{-3}\,\mathrm{h}^{-1}$. Investigation of cases involving general point processes is beyond the scope of this book (only some general results are given in Appendices A7.8.3 - A7.8.5). A large number of ad hoc procedures are known in the literature, but they often only apply to specific situations and their use needs a careful validation of the assumptions stated with the model.
After some considerations on tests for nonhomogeneous Poisson processes in Section 7.6.2, Sections 7.6.3.1 and 7.6.3.2 deal with trend tests to check the assumption homogeneous Poisson process versus nonhomogeneous Poisson process with increasing or decreasing intensity. A heuristic test to distinguish a homogeneous Poisson process from a general monotonic trend is discussed in Section 7.6.3.3. However, as stated in the above general rule, the validity of a model should be checked also on the basis of physical considerations on the item considered. This holds in particular for the property without aftereffect, characterizing Poisson processes.
7.6.2 Tests for Nonhomogeneous Poisson Processes

A nonhomogeneous Poisson process (NHPP) is a point process whose count function $\nu(t)$ has unit jumps, independent increments (in nonoverlapping intervals), and satisfies for any $b > a \ge 0$ (Appendix A7.8.2)

$$\Pr\{k \text{ events in } (a, b]\} = \frac{(M(b) - M(a))^k}{k!}\,e^{-(M(b) - M(a))}, \qquad k = 0, 1, 2, ..., \quad 0 \le a < b. \tag{7.71}$$

$M(t)$ is the mean value function of the NHPP, giving the expected number of points (events) in $(0, t]$

$$M(t) = E[\nu(t)], \qquad M(0) = 0. \tag{7.72}$$

Assuming $M(t)$ derivable,

$$m(t) = dM(t)/dt \ge 0 \tag{7.73}$$

is the intensity of the NHPP and has for $\delta t \downarrow 0$ the following interpretation (Eq. (A7.89))

$$\Pr\{\text{one event in } (t, t + \delta t]\} = m(t)\,\delta t + o(\delta t). \tag{7.74}$$

Because of independent increments, the number of events (failures) in a time interval $(t, t + \theta]$ (Eq. (7.71) with $a = t$ and $b = t + \theta$) and the rest waiting time to the next event

$$\Pr\{\tau_R(t) > x\} = \Pr\{\text{no event in } (t, t + x]\} = e^{-(M(t+x) - M(t))}, \qquad x > 0, \tag{7.75}$$

are independent of the process development up to time $t$ (Eqs. (A7.193), (A7.196)). Thus, also the mean $E[\tau_R(t)]$ is independent of the process development up to time $t$, and given by (Eq. (A7.197))

$$E[\tau_R(t)] = \int_0^\infty e^{-(M(t+x) - M(t))}\,dx.$$

Furthermore, if $0 < \tau_1^* < \tau_2^* < ...$ are the occurrence times (arrival times) of the event considered (e.g. failures of a repairable system), measured from $t = 0$, it holds for $m(t) > 0$ that the quantities

$$\psi_1^* = M(\tau_1^*) < \psi_2^* = M(\tau_2^*) < ... \tag{7.76}$$

are the occurrence times of a homogeneous Poisson process with intensity one (Eq. (A7.200)). Moreover, for given (fixed) $t = T$ and $\nu(T) = n$, the occurrence times $0 < \tau_1^* < ... < \tau_n^* < T$ have the same distribution as if they were the order statistics of $n$ independent identically distributed random variables with density

$$m(t)/M(T), \qquad 0 < t < T, \tag{7.77}$$

and distribution function $M(t)/M(T)$ on $(0, T)$ (Eq. (A7.205)).
Equation (7.74) gives the unconditional probability for one event in $(t, t + \delta t]$. Thus, $m(t)$ refers to the occurrence of any one of the events considered. It corresponds to the renewal density $h(t)$ and the failure intensity $z(t)$, but differs basically from the failure rate $\lambda(t)$ (see remark to Eq. (A7.24)). Nonhomogeneous Poisson processes (NHPPs) are introduced in Appendix A7.8.2. Some examples are discussed in Section 7.7 with applications to reliability growth. Assuming that the underlying process is an NHPP, estimation of the model parameters (parameters of $m(t, \theta)$) can be performed using the maximum likelihood method on the basis of observed data $0 < t_1^* < t_2^* < ... < t_n^* < T$ (time censoring). Considering Eqs. (7.71) and (7.74), the likelihood function follows as (Eq. (7.102))

$$L = m(t_1^*)\,m(t_2^*) \cdots m(t_n^*)\,e^{-M(T)}$$

and delivers the maximum likelihood estimate $\hat\theta$ for the parameters $\theta$ of $m(t, \theta)$ by solving $\partial L/\partial\theta = 0$ for $\theta = \hat\theta$, where $\theta$ can be a vector (see e.g. Eq. (7.104) for the parameters $\alpha$ and $\beta$ of the NHPP with $m(t) = \alpha\beta t^{\beta-1}$). Using the property stated by Eq. (7.76), statistical tests for exponential distribution or for homogeneous Poisson processes (Appendices A8.2.2.2, A8.3.2, A8.3.3 and Sections 7.2.3, 7.5, 7.6.3.1) can be applied to NHPPs as well. Furthermore, using the property stated by Eq. (7.77), the goodness-of-fit tests introduced in Appendix A8.3.2 (Kolmogorov-Smirnov, Cramér - von Mises, chi-square) can be used to verify agreement of the observed data $t_1^*, ..., t_n^* < T$ with a postulated $M_0(t)$ ($t_1^*, t_2^*, ...$ are the observed values (realizations) of $\tau_1^*, \tau_2^*, ...$ and * is used to explicitly show that $t_1^*, t_2^*, ...$ are points on the time axis and not independent realizations of a random variable $\tau$, e.g. as in Figs. 1.1, 7.12, 7.14). For the Kolmogorov-Smirnov test, the procedure given in Section 7.5.1 applies with

$$\hat F_n(t) = \hat\nu(t)/\hat\nu(T) \qquad\text{and}\qquad F_0(t) = M_0(t)/M_0(T), \qquad 0 < t < T,$$

where $\hat\nu(t)$ is the observed number of events in $(0, t]$. More difficult is the situation when the assumption that the underlying model is an NHPP must also be verified by a statistical data analysis, for instance with a goodness-of-fit test. The problem is not completely solved. However, the property given by Eqs. (7.76) and (7.77) can be used for goodness-of-fit tests of the NHPP with incompletely specified (up to the parameters) mean value function $M_0(t)$. The chi-square test holds with the procedure given in Section 7.5.2 and Appendix A8.3.3. For a first evaluation, the Kolmogorov-Smirnov test (and tests based on a quadratic statistic) can be used taking half (randomly selected) of the observations $t_1^*, ..., t_n^*$ to estimate the parameters and continuing with the whole sample the procedure given in Section 7.5.1 for the goodness-of-fit test [A8.11, A8.31].
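The Kolmogorov-Smirnov check of observed arrival times against a postulated mean value function $M_0(t)$ can be sketched as follows. The sketch relies on the property of Eq. (7.77), i.e. it compares the empirical step function of the arrival times with $M_0(t)/M_0(T)$; the arrival times and the two postulated mean value functions are hypothetical and serve only as an illustration.

```python
def nhpp_ks_statistic(times, T, mean_value_fn):
    """Largest deviation between nu(t)/nu(T) and M0(t)/M0(T) for arrivals in (0, T]."""
    n = len(times)
    m_T = mean_value_fn(T)
    d = 0.0
    for i, t in enumerate(sorted(times)):
        f0 = mean_value_fn(t) / m_T
        # compare F0 with the step heights at and just before the i-th arrival
        d = max(d, (i + 1) / n - f0, f0 - i / n)
    return d

# Hypothetical data: 9 failures of a repairable system observed in (0, 10]
times = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

d_hpp = nhpp_ks_statistic(times, 10.0, lambda t: t)        # M0(t) = t, i.e. an HPP
d_pow = nhpp_ks_statistic(times, 10.0, lambda t: t ** 2)   # M0(t) = t^2, increasing intensity

print(round(d_hpp, 4), round(d_pow, 4))
```

The resulting value of $D_n$ is then compared with the critical value $y_{1-\alpha}$ from Table A9.5, exactly as in Section 7.5.1; here the evenly spaced arrivals fit the HPP hypothesis much better than the power-law alternative.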
7.6.3 Trend Tests

In reliability engineering one is often interested in testing whether there is a monotonic trend in the times between successive failures (interarrival times) of a repairable system with negligible repair (restoration) times, e.g. in order to detect the end of an early failure period or the beginning of a wearout period. Such tests extend the tests for exponentiality or for homogeneous Poisson processes introduced in Sections 7.2.3, 7.5, A8.2.1, A8.2.2, A8.3.2, and A8.3.3. If the underlying point process can be approximated by a renewal process, a graphical procedure can be used in detecting the presence of trends, see e.g. Fig. 7.13 for the case of early failures. In the case of a nonhomogeneous Poisson process (NHPP), a trend is given by an increasing or a decreasing intensity $m(t)$, e.g. $\beta > 1$ or $\beta < 1$ in Eq. (7.99). Trend tests can also be useful in investigating what kind of alternative should be considered when an assumption is to be made about the statistical properties of a given data set. However, trend tests check in general only a postulated hypothesis against a more or less general alternative hypothesis. Care is therefore necessary in drawing conclusions from this kind of statistical tests, and the basic rule given on p. 320 applies. In the following, some trend tests used in reliability data analysis are discussed, among them the Laplace test (see e.g. [A8.1] for greater details).
7.6.3.1 Tests of a HPP versus a NHPP with increasing intensity
The homogeneous Poisson process (HPP) is a point process whose count function $\nu(t)$ has stationary, independent Poisson distributed increments (Eqs. (A7.41)). Interarrival times in a HPP are independent and distributed according to the same exponential distribution $F(x) = 1 - e^{-\lambda x}$ (occurrence times are independent Gamma distributed). The parameter λ characterizes the HPP completely. λ is at the same time the intensity of the HPP and the failure rate λ(x) of all interarrival times, x starting at 0 at each occurrence time of the event considered (e.g. failure of a repairable system with negligible repair (restoration) times). This numerical equality has been the cause of misinterpretations and misuse in practical applications, see e.g. [6.3, 7.11, A7.30]. The homogeneous Poisson process has been introduced in Appendix A7.2.5 as a particular case of a renewal process. Considering $\nu(t)$ as the count function giving the number of events (failures) in (0, t], in Example A7.13 (Eq. (A7.213)) it is shown that:

For given (fixed) T and $\nu(T) = n$ (time censoring), the normalized arrival times $0 < \tau_1^*/T < ... < \tau_n^*/T$ of a homogeneous Poisson process (HPP) have the same distribution as if they were the order statistics of n independent identically uniformly distributed random variables on (0,1). (7.81)
Similar results hold for a NHPP (Eq. (A7.206)) :
7 Statistical Quality Control and Reliability Tests
For given (fixed) T and $\nu(T) = n$ (time censoring), the normalized arrival times $0 < M(\tau_1^*)/M(T) < ... < M(\tau_n^*)/M(T) < 1$ of a nonhomogeneous Poisson process (NHPP) with mean value function M(t) have the same distribution as if they were the order statistics of n independent identically uniformly distributed random variables on (0,1). (7.82)
With the above transformations, properties of the uniform distribution can be used to support statistical tests on homogeneous and nonhomogeneous Poisson processes. Let ω be a uniformly distributed random variable with density

$f_\omega(x) = 1$ on (0,1), $f_\omega(x) = 0$ outside (0,1), (7.83)

and distribution function $F_\omega(x) = x$ on (0,1). Mean and variance of ω are given by (Eqs. (A6.36) and (A6.44))

$E[\omega] = 1/2$ and $Var[\omega] = 1/12$. (7.84)
The sum $\omega_1 + ... + \omega_n$ of n independent random variables $\omega_i$, each distributed as ω, has mean n/2 and variance n/12. The distribution function of $\omega_1 + ... + \omega_n$ is defined on (0, n) and can be computed using Eq. (A7.12). It has been investigated in [A8.8], leading to the conclusion that it rapidly approaches a normal distribution as n increases. For practical applications one can assume that for given (fixed) T and $\nu(T) = n > 5$, the arrival times $0 < \tau_1^* < ... < \tau_n^* < T$ of a HPP are distributed according to

$\Pr\{ [(\sum_{i=1}^{n} \tau_i^*/T) - n/2] / \sqrt{n/12} \le x \} \approx \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-y^2/2} \, dy$. (7.85)
Equation (7.85) can be used to test a HPP (m(t) = λ) versus a NHPP with increasing density m(t) = dM(t)/dt. Using Eq. (7.85) and considering the observations (realizations) $t_1^* < t_2^* < ... < t_n^* < T$, the procedure is (Example 7.20):

1. Compute the statistic
   $[(\sum_{i=1}^{n} t_i^*/T) - n/2] / \sqrt{n/12}$. (7.86)
2. For given type I error α determine the critical value $t_{1-\alpha}$ (1-α quantile) from a table of the standard normal distribution (Tab. A9.1).
3. Reject the hypothesis H0: the underlying point process is a HPP, against H1: the underlying process is a NHPP with increasing density, at 1-α confidence, if $[(\sum_{i=1}^{n} t_i^*/T) - n/2] / \sqrt{n/12} > t_{1-\alpha}$; otherwise accept H0. (7.87)
A test based on Eqs. (7.86)-(7.87) is called Laplace test and was first introduced by Laplace as a test of randomness. From Eq. (7.87) one recognizes that $\sum t_i^*/T$ is a sufficient statistic (Appendix A8.2.1). It can be noted that in Point 2 above, α = Pr{reject H0 | H0 true} and $(\sum t_i^*/T - n/2)/\sqrt{n/12}$ tends to assume large values for H0 false (i.e. for m(t) increasing). For $T = t_n^*$ (failure censoring), Eq. (7.86) holds with n-1 (see e.g. [A8.1]).
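The Laplace procedure of Eqs. (7.86)-(7.87) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are chosen here for illustration, the critical value is computed with the standard library's `statistics.NormalDist` instead of being read from Table A9.1, and the input figures are those quoted in Example 7.20 (n = 8, T = 10,000 h, sum of failure times 43,000 h).

```python
from math import sqrt
from statistics import NormalDist


def laplace_statistic(sum_of_times, n, T):
    """Standardized statistic of Eq. (7.86) for time censoring at T."""
    return (sum_of_times / T - n / 2) / sqrt(n / 12)


def reject_hpp_increasing(sum_of_times, n, T, alpha=0.05):
    """Rule (7.87): reject H0 (HPP) if the statistic exceeds the (1 - alpha) normal quantile."""
    critical = NormalDist().inv_cdf(1 - alpha)  # about 1.64 for alpha = 0.05
    return laplace_statistic(sum_of_times, n, T) > critical


# Figures quoted in Example 7.20
stat = laplace_statistic(43_000, 8, 10_000)
decision = reject_hpp_increasing(43_000, 8, 10_000)
print(round(stat, 3), decision)  # 0.367 False
```

With these figures the statistic stays well below the critical value, so H0 is not rejected.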
A further possibility to test a HPP (m(t) = λ) versus a NHPP with increasing density m(t) = dM(t)/dt is to use the statistic

$\sum_{i=1}^{n} \ln(T/\tau_i^*)$. (7.88)
As shown in Example A7.13 (Eq. (A7.213), see also (7.81)), for given (fixed) T and $\nu(T) = n$, the normalized arrival times $0 < \tau_1^*/T < ... < \tau_n^*/T < 1$ of a HPP have the same distribution as if they were the order statistics of n independent identically uniformly distributed random variables on (0,1). Moreover, Example 7.21 shows for $\omega_i = \tau_i^*/T$ that $2\sum_{i=1}^{n} \ln(T/\tau_i^*)$ has a $\chi^2$ distribution (Eq. (A6.103)) with 2n degrees of freedom ($\tau_i^*$ has to be used instead of the observations $t_i^*$, see footnote on p. 504).
Example 7.20
In a reliability test, 8 failures have occurred in T = 10,000 h and $t_1^* + ... + t_8^* = 43,000$ h has been observed. Test with a risk α = 5% (at 95% confidence), using the rule (7.87), the hypothesis H0: the underlying point process is a HPP, against H1: the underlying process is a NHPP with increasing density.
Solution
From Table A9.1, $t_{0.95} = 1.64 > (4.3 - 4)/0.816 = 0.367$ and H0 cannot be rejected.

Example 7.21
Let the random variable ω be uniformly distributed on (0,1). Show that $\eta = -\ln(\omega)$ is distributed according to $F_\eta(t) = 1 - e^{-t}$ on $(0, \infty)$, and thus $2\sum_{i=1}^{n} -\ln(\omega_i) = 2\sum_{i=1}^{n} \eta_i = \chi^2_{2n}$.
Solution
Considering that for 0 < ω < 1, $-\ln(\omega)$ is a decreasing function defined on $(0, \infty)$, it follows that the events $\{\omega \le x\}$ and $\{\eta = -\ln(\omega) \ge -\ln(x)\}$ are equivalent. From this (see also Eq. (A6.31)), $x = \Pr\{\omega \le x\} = \Pr\{\eta > -\ln(x)\}$ and thus, using $-\ln(x) = t$, one obtains $\Pr\{\eta > t\} = e^{-t}$ and finally

$F_\eta(t) = \Pr\{\eta \le t\} = 1 - e^{-t}$. (7.90)

From Eqs. (A6.102)-(A6.104), it follows that $2\sum_{i=1}^{n} -\ln(\omega_i) = 2\sum_{i=1}^{n} \eta_i$ has a $\chi^2$ distribution with 2n degrees of freedom.
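The result of Example 7.21 can be checked numerically. The sketch below is a simple Monte Carlo experiment, given only as a plausibility check and not as a proof: it draws $2\sum -\ln(\omega_i)$ repeatedly and compares the sample mean and variance with the values 2n and 4n of a $\chi^2$ distribution with 2n degrees of freedom. The seed, the sample size, and the function name are arbitrary choices made for this illustration.

```python
import random
from math import log
from statistics import fmean, pvariance


def chi2_like_sample(n, rng):
    """One draw of 2 * sum(-ln(w_i)) for n independent uniform w_i."""
    # 1 - rng.random() lies in (0, 1], which avoids log(0)
    return 2 * sum(-log(1.0 - rng.random()) for _ in range(n))


rng = random.Random(12345)  # fixed seed, chosen arbitrarily for reproducibility
n = 8
draws = [chi2_like_sample(n, rng) for _ in range(20_000)]

sample_mean = fmean(draws)     # expected to be close to 2n = 16
sample_var = pvariance(draws)  # expected to be close to 4n = 32
print(sample_mean, sample_var)
```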
Example 7.22
In a reliability test, 8 failures have occurred in T = 10,000 h at 850, 1200, 2100, 3900, 4950, 5100, 8300, 9050 h. Test with a risk α = 5% (at 95% confidence), using the rule (7.92), the hypothesis H0: the underlying point process is a HPP, against the alternative hypothesis H1: the underlying process is a NHPP with increasing density.
Solution
From Table A9.2, $\chi^2_{16,\,0.05} = 7.96 < 2(\ln(T/t_1^*) + ... + \ln(T/t_8^*)) = 17.5$ and H0 cannot be rejected.
Thus, the statistic given by Eq. (7.88) can be used to test a HPP (m(t) = λ) versus a NHPP with increasing density m(t) = dM(t)/dt. Considering Eqs. (7.89) and (7.90), the test procedure is (Example 7.22):

1. Compute the statistic
   $2\sum_{i=1}^{n} \ln(T/t_i^*)$. (7.91)
2. For given type I error α determine the critical value $\chi^2_{2n,\,\alpha}$ (α quantile) from a table of the $\chi^2$ distribution (Table A9.2).
3. Reject the hypothesis H0: the underlying point process is a HPP, against H1: the underlying process is a NHPP with increasing density, at 1-α confidence, if $2\sum_{i=1}^{n} \ln(T/t_i^*) < \chi^2_{2n,\,\alpha}$; otherwise accept H0. (7.92)

From Eq. (7.92) one recognizes that $2\sum \ln(T/t_i^*)$ is a sufficient statistic (Appendix A8.2.1). It can be noted that in Point 2 above, α = Pr{reject H0 | H0 true} and $2\sum \ln(T/t_i^*)$ tends to assume small values for H0 false (i.e. for m(t) increasing). For $T = t_n^*$ (failure censoring), Eq. (7.91) holds with n-1 (see e.g. [A8.1]).
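The procedure based on Eqs. (7.91)-(7.92) can be sketched as follows, using the failure times of Example 7.22. The function names are illustrative only. Since the Python standard library offers no $\chi^2$ quantile function, the critical value is passed in as an argument, here the tabulated value 7.96 for 16 degrees of freedom and α = 0.05 quoted in Example 7.22.

```python
from math import log


def log_ratio_statistic(times, T):
    """Statistic 2 * sum(ln(T / t_i)) of Eq. (7.91), time censoring at T."""
    return 2 * sum(log(T / t) for t in times)


def reject_hpp_increasing(times, T, chi2_alpha_quantile):
    """Rule (7.92): reject H0 (HPP) if the statistic falls below the alpha quantile."""
    return log_ratio_statistic(times, T) < chi2_alpha_quantile


# Failure times in hours, as given in Example 7.22
times = [850, 1200, 2100, 3900, 4950, 5100, 8300, 9050]
stat = log_ratio_statistic(times, 10_000)
# 7.96 is the tabulated alpha quantile for 2n = 16 degrees of freedom, alpha = 0.05
decision = reject_hpp_increasing(times, 10_000, 7.96)
print(round(stat, 1), decision)  # 17.5 False
```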
7.6.3.2 Tests of a HPP versus a NHPP with decreasing intensity
Tests of a homogeneous Poisson process (HPP) versus a nonhomogeneous Poisson process (NHPP) with a decreasing intensity m(t) = dM(t)/dt can be deduced from those for increasing intensity given in Section 7.6.3.1. Equations (7.85) and (7.89) remain true. However, if the intensity is decreasing, most of the failures tend to occur before T/2 and the test procedure for the Laplace test has to be changed to (Example 7.23):

1. Compute the statistic
   $[(\sum_{i=1}^{n} t_i^*/T) - n/2] / \sqrt{n/12}$. (7.93)
2. For given type I error α determine the critical value $t_\alpha$ (α quantile) from a table of the standard normal distribution (Tab. A9.1).
3. Reject the hypothesis H0: the underlying point process is a HPP, against H1: the underlying process is a NHPP with decreasing density, at 1-α confidence, if $[(\sum_{i=1}^{n} t_i^*/T) - n/2] / \sqrt{n/12} < t_\alpha$; otherwise accept H0. (7.94)

From Eq. (7.93) one recognizes that $\sum t_i^*/T$ is a sufficient statistic (Appendix A8.2.1). It can be noted that in Point 2 above, α = Pr{reject H0 | H0 true} and $(\sum t_i^*/T - n/2)/\sqrt{n/12}$ tends to assume small values for H0 false (i.e. for m(t) decreasing). For $T = t_n^*$ (failure censoring), Eq. (7.93) holds with n-1 (see e.g. [A8.1]).
For the test according to the statistic (7.88), the test procedure is (Example 7.24):

1. Compute the statistic
   $2\sum_{i=1}^{n} \ln(T/t_i^*)$. (7.95)
2. For given type I error α determine the critical value $\chi^2_{2n,\,1-\alpha}$ (1-α quantile) from a table of the $\chi^2$ distribution (Table A9.2).
3. Reject the hypothesis H0: the underlying point process is a HPP, against H1: the underlying process is a NHPP with decreasing density, at 1-α confidence, if $2\sum_{i=1}^{n} \ln(T/t_i^*) > \chi^2_{2n,\,1-\alpha}$; otherwise accept H0. (7.96)

From Eq. (7.95) one recognizes that $2\sum \ln(T/t_i^*)$ is a sufficient statistic (Appendix A8.2.1). It can be noted that in Point 2 above, α = Pr{reject H0 | H0 true} and $2\sum \ln(T/t_i^*)$ tends to assume large values for H0 false (i.e. for m(t) decreasing). For $T = t_n^*$ (failure censoring), Eq. (7.95) holds with n-1 (see e.g. [A8.1]).
7.6.3.3 Heuristic Tests to distinguish between HPP and General Monotonic Trend
In some applications, little information is available about the underlying point process describing the occurrence of failures of a complex repairable system. As in the previous sections, it will be assumed that repair times are neglected. What is sought is a test to identify a monotonic trend of the failure intensity against a constant failure intensity given by a homogeneous Poisson process (HPP).
Example 7.23
Continuing Example 7.20, test using the rule (7.94) and the data of Example 7.20, with a risk α = 5% (at 95% confidence), the hypothesis H0: the underlying point process is a HPP, against the alternative hypothesis H1: the underlying process is a NHPP with decreasing density.
Solution
From Table A9.1, $t_{0.05} = -1.64 < 0.367$ and H0 cannot be rejected.

Example 7.24
Continuing Example 7.22, test using the rule (7.96) and the data of Example 7.22, with a risk α = 5% (at 95% confidence), the hypothesis H0: the underlying point process is a HPP, against the alternative hypothesis H1: the underlying process is a NHPP with decreasing density.
Solution
From Table A9.2, $\chi^2_{16,\,0.95} = 26.3 > 2(\ln(T/t_1^*) + ... + \ln(T/t_8^*)) = 17.5$ and H0 cannot be rejected.
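For the decreasing-intensity alternative only the direction of the comparison changes. The following sketch applies rules (7.94) and (7.96) to the figures of Examples 7.23 and 7.24. The function names are illustrative; the normal quantile comes from `statistics.NormalDist`, and the $\chi^2$ quantile 26.3 is the tabulated value quoted in Example 7.24.

```python
from math import log, sqrt
from statistics import NormalDist


def laplace_rejects_decreasing(sum_of_times, n, T, alpha=0.05):
    """Rule (7.94): reject H0 (HPP) if the Laplace statistic is below the alpha normal quantile."""
    stat = (sum_of_times / T - n / 2) / sqrt(n / 12)
    return stat < NormalDist().inv_cdf(alpha)  # about -1.64 for alpha = 0.05


def log_ratio_rejects_decreasing(times, T, chi2_upper_quantile):
    """Rule (7.96): reject H0 (HPP) if 2 * sum(ln(T / t_i)) exceeds the (1 - alpha) quantile."""
    stat = 2 * sum(log(T / t) for t in times)
    return stat > chi2_upper_quantile


# Example 7.23: n = 8, T = 10,000 h, sum of failure times 43,000 h
ex_7_23 = laplace_rejects_decreasing(43_000, 8, 10_000)
# Example 7.24: failure times of Example 7.22, tabulated quantile 26.3
ex_7_24 = log_ratio_rejects_decreasing(
    [850, 1200, 2100, 3900, 4950, 5100, 8300, 9050], 10_000, 26.3
)
print(ex_7_23, ex_7_24)  # False False
```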
Consider first investigations based on successive interarrival times. Such an investigation should be performed at the beginning of data analysis, also because it can quickly deliver a first indication of a possible monotonic trend (e.g. interarrival times become longer and longer or shorter and shorter). Moreover, if the underlying point process describing the occurrence of failures can be approximated by a renewal process (successive interarrival times are independent and identically distributed), procedures of Section 7.5 based on the empirical distribution function (EDF) have a great intuitive appeal and can be useful in testing for monotonic trends as well, see Examples 7.15-7.17 (Figs. 7.12-7.14). In particular, the graphical procedures given in Example 7.16 (Fig. 7.13) would allow the detection and quantification of an early failure period. The same would hold for a wearout period. Similar considerations hold if the involved point process can be approximated by a nonhomogeneous Poisson process (NHPP), see Sections 7.6.1-7.6.3 and 7.7. If a trend in successive interarrival times is recognized, but the underlying point process cannot be approximated by a renewal process or a NHPP, a further possibility is to consider the observed failure time points $t_1^*$
The mean value function Z(t) corresponds to the renewal function H(t) in a renewal process (Eq. (A7.15)); z(t) = dZ(t)/dt is the failure intensity and corresponds to the renewal density h(t) in a renewal process (Eqs. (A7.18), (A7.24)). For a homogeneous Poisson process, Z(t) takes the form (Eq. (A7.42)) $Z(t) = \lambda t$.
Each deviation from a straight line $Z(t) = a\,t$ is thus an indication for a possible trend (besides statistical deviations). As shown in Example A7.1 (Fig. A7.2) for a renewal process, early failures or wearout give a basically different shape of the underlying renewal function; a convex shape for the case of early failures and a concave shape for the case of wearout. This property can be used to recognize the presence of trends in a point process, by considering the shape of the associated empirical mean value function $\hat{Z}(t)$ given by Eq. (7.97). Such a procedure can help in detecting possible trends, but remains a rough evaluation (see Fig. A7.2 for the case of a renewal process). Care is thus necessary when extrapolating results, e.g. about the failure rate value after the early failure period or the percentage of early failures.
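A rough numerical version of this graphical check is sketched below. It rests on assumptions that are not taken from the text: a single observed system, so that the empirical mean value function is taken to be the cumulative number of failures (which may differ from the definition in Eq. (7.97)), and a reference straight line through the origin and the point (T, n). The function name and the use of the data of Example 7.22 are illustrative.

```python
def count_deviations(times, T):
    """Deviation of the cumulative count i from the line n * t_i / T at each failure time."""
    n = len(times)
    return [i - n * t / T for i, t in enumerate(sorted(times), start=1)]


# Failure times in hours, as given in Example 7.22 (T = 10,000 h)
times = [850, 1200, 2100, 3900, 4950, 5100, 8300, 9050]
deviations = count_deviations(times, 10_000)
print([round(d, 2) for d in deviations])
```

Deviations that keep the same sign over the whole observation period point to a cumulative count lying systematically on one side of the straight line. As stated above, this remains a rough evaluation and does not replace the tests of Sections 7.6.3.1 and 7.6.3.2.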
7.7 Reliability Growth
At the prototype qualification tests, the reliability of complex equipment or systems can be less than expected. Disregarding any imprecision of data or model used in calculating the predicted reliability (Chapter 2), such a discrepancy is often the consequence of weaknesses (errors and flaws) during design or manufacturing, for instance use of components or materials at their technological limits or with internal weaknesses, cooling problems, interface problems, transient phenomena, interference between hardware and software, assembly or soldering problems, damage during handling or testing, etc. Errors and flaws cause defects and systematic failures. Superimposed on these are early failures and failures with constant failure rate (wearout should not be present at this stage). A distinction between deterministic faults (defects and systematic failures) and random faults (early failures and failures with constant failure rate) is only possible with a cause analysis. Such an analysis is necessary to identify and eliminate causes of observed faults, i.e. change or redesign for defects and systematic failures, screening for early failures, and repair for failures with constant failure rate. Of course, defects and systematic failures can also be randomly distributed on the time axis, e.g. caused by a mission dependent time-limited overload, by software defects, or simply because of the system complexity. However, they still differ from failures, as they are basically independent of operating time (disregarding systematic failures which can appear only after a certain operating time, e.g. as for some cooling or software problems). The aim of a reliability growth program is the cost-effective improvement of an item's reliability through successful correction / elimination of the causes of design or production weaknesses.
Early failures should be precipitated with an appropriate screening (environmental stress screening, ESS, see Section 8.2 for electronic components, Section 8.3 for electronic assemblies, and Section 8.4 for cost aspects). Considering that flaws found during reliability growth are in general deterministic (defects and systematic failures), reliability growth is performed during prototype qualification tests and pilot production, seldom for series-produced items (Fig. 7.16). Stresses during reliability growth are often higher than those expected in the field (as for ESS). Furthermore, the statistical methods used to investigate reliability growth are in general basically different from those given in Section 7.2 for basic reliability tests (e.g. to estimate or demonstrate a constant failure rate λ). This is because during the reliability growth program, design and / or production changes are introduced in the item(s) considered and statistical evaluation is not restarted after a change.
[Figure 7.16 Qualitative visualization of a possible reliability growth. Vertical axis: Reliability; horizontal axis: Life-Cycle Phases (Design, Qualification, Production); marked points: Prototype, First series unit]
A large number of models have been proposed to describe reliability growth for hardware and software, see e.g. [5.58, 5.60, 7.31-7.47, A2.6 (61014 & 61164)], some of them on the basis of theoretical considerations. A practice oriented model, proposed by J.T. Duane [7.36] and refined as a statistical model by L.H. Crow [7.35 (1975)], known also as the AMSAA model, assumes that the flow of events (system failures) constitutes a nonhomogeneous Poisson process (NHPP) with intensity

$m(t) = \alpha \beta \, t^{\beta - 1}$ (7.99)

and mean value function

$M(t) = \alpha \, t^{\beta}$. (7.100)
M(t) gives the expected number of failures in (0, t]. m(t)δt is the probability for one failure (any one) in (t, t + δt] (Eq. (7.74)). It can be shown that for a NHPP, m(t) is equal to the failure rate λ(t) of the first occurrence time (Eq. (A7.209)). Comparing Eq. (7.99) with Eq. (A6.91) one recognizes that for the NHPP described by Eq. (7.99), the first occurrence time has a Weibull distribution. However, m(t) and λ(t) are fundamentally different (see the remark on p. 356) and, also for this reason, all other interarrival times do not follow a Weibull distribution and are neither independent nor identically distributed. Because of the distribution of the first occurrence time, the NHPP process described by Eq. (7.99) is often called Weibull process, causing great confusion. Also used is the term power law process. Nonhomogeneous Poisson processes are investigated in Appendix A7.8.2.
In the following it will be assumed that the underlying model is a NHPP. Verification of this assumption should also be based on physical considerations on the nature / causes of the defects and systematic failures involved, not only on statistical aspects. If the underlying process is a NHPP, estimation of the model parameters (α and β in the case of Eq. (7.99)) can easily be performed using observed data. Let us consider first the time censored case (Type I censoring) and assume that up to the given (fixed) time T, exactly n events have occurred at times $t_1^* < t_2^* < ... < t_n^* < T$. $t_1^*, t_2^*, ...$ are the realizations (observations) of the arrival times $\tau_1^*, \tau_2^*, ...$ and * indicates that $t_1^*, t_2^*, ...$ are points on the time axis and not independent realizations of a random variable τ with a given (fixed) distribution function (e.g. as in Figs. 1.1, 7.12, 7.14). Considering the main property of a NHPP, i.e. that the numbers of events in nonoverlapping intervals are independent and distributed according to

$\Pr\{k \text{ events in } (a, b]\} = \frac{(M(b) - M(a))^k}{k!} \, e^{-(M(b) - M(a))}$, (7.101)

with M(0) = 0 and 0 ≤ a < b (Eq. (A7.195)), and the interpretation of the intensity m(t) given by Eq. (7.74) or Eq. (A7.194), the following likelihood function (Eq. (A8.24)) can be found for the parameter estimation of the intensity m(t)

$L(t_1^*, ..., t_n^*) = e^{-M(T)} \prod_{i=1}^{n} m(t_i^*)$. (7.102)
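Equation (7.102) considers no event (k = 0 in Eq. (7.101)) in each of the nonoverlapping intervals $(0, t_1^*), (t_1^*, t_2^*), ..., (t_n^*, T)$ and applies to an arbitrary NHPP. For the Duane model it follows that

$L(t_1^*, ..., t_n^*) = \alpha^n \beta^n \, e^{-\alpha T^{\beta}} \prod_{i=1}^{n} (t_i^*)^{\beta - 1}$. (7.103)

The maximum likelihood estimates $\hat{\alpha}$ and $\hat{\beta}$ of the parameters α and β are then obtained from

$\partial \ln L / \partial \alpha = 0$ and $\partial \ln L / \partial \beta = 0$ at $\alpha = \hat{\alpha}$, $\beta = \hat{\beta}$,

yielding

$\hat{\beta} = n / \sum_{i=1}^{n} \ln(T / t_i^*)$ and $\hat{\alpha} = n / T^{\hat{\beta}}$. (7.104)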
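An estimate for the intensity of the underlying nonhomogeneous Poisson process is

$\hat{m}(t) = \hat{\alpha} \hat{\beta} \, t^{\hat{\beta} - 1}$. (7.105)

With known values for $\hat{\alpha}$ and $\hat{\beta}$, Eq. (7.105) can be used to extrapolate the attainable intensity if the reliability growth process were to be continued with the same statistical properties for a further time span Δ after T, yielding

$\hat{m}(T + \Delta) = \hat{\alpha} \hat{\beta} \, (T + \Delta)^{\hat{\beta} - 1}$, (7.106)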
see Example 7.25 for a practical application. In the case of event censoring, i.e. when the test is stopped at the occurrence of the nth event (Type II censoring), Eq. (7.104) holds with $t_n^*$ instead of T and n - 1 instead of n. Interval estimation for the parameters α and β can be found, see e.g. [A8.1]. For goodness-of-fit tests one can consider the property of nonhomogeneous Poisson processes that, for given (fixed) T and knowing that n events have been observed in (0, T], i.e. for given T and $\nu(T) = n$, the occurrence times $0 < \tau_1^* < ... < \tau_n^* < T$ have the same distribution as if they were the order statistics of n independent and identically distributed random variables with density m(t)/M(T), 0 < t < T (Eq. (A7.205)). For example, the Kolmogorov-Smirnov test (Section 7.5) can be used with $\hat{F}_n(t) = \hat{\nu}(t)/\nu(T)$ (Eq. (7.79)) and $F_0(t) = M_0(t)/M_0(T)$ (Eq. (7.80)), see also Appendices A7.8.2 and A8.3.2. Furthermore it holds that if $\tau_1^* < \tau_2^* < ...$ are the occurrence times of a NHPP, then $M(\tau_1^*) < M(\tau_2^*) < ...$ are the occurrence times in a homogeneous Poisson process (HPP) with intensity one (Eq. (A7.200)). Results for independent and identically distributed random variables, for HPP or for
Example 7.25
During the reliability growth program of a complex equipment, the following data was gathered: T = 1200 h, n = 8 and $\sum \ln(T/t_i^*) = 20$. Assuming that the underlying process can be described by a Duane model, estimate the intensity at t = 1200 h and the value attainable at T + Δ = 3000 h if the reliability growth would continue with the same statistical properties.
Solution
With T = 1200 h, n = 8 and $\sum \ln(T/t_i^*) = 20$, it follows from Eq. (7.104) that $\hat{\beta} = 0.4$ and $\hat{\alpha} = 0.47$. From Eq. (7.105), the estimate for the intensity leads to $\hat{m}(1200) = 2.67 \cdot 10^{-3}\,\mathrm{h}^{-1}$ ($\hat{M}(1200) = 8$). The attainable intensity after an extension of the program for reliability growth by 1800 h is given by Eq. (7.105) as $\hat{m}(3000) = 1.54 \cdot 10^{-3}\,\mathrm{h}^{-1}$.
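The figures of Example 7.25 can be reproduced with a short script. The sketch assumes the usual power-law form of the Duane (AMSAA) model, $m(t) = \alpha\beta t^{\beta-1}$ and $M(t) = \alpha t^{\beta}$, and the time-censored maximum likelihood estimates $\hat{\beta} = n/\sum \ln(T/t_i^*)$, $\hat{\alpha} = n/T^{\hat{\beta}}$; these closed forms are assumptions of this illustration, chosen because they reproduce the values quoted in the example. Function and variable names are hypothetical.

```python
def duane_ml_estimates(n, T, sum_log_ratios):
    """ML estimates (alpha, beta) for a power-law NHPP, time censoring at T.

    sum_log_ratios is the sum of ln(T / t_i) over the n observed failure times.
    """
    beta = n / sum_log_ratios
    alpha = n / T ** beta
    return alpha, beta


def intensity(t, alpha, beta):
    """Power-law intensity m(t) = alpha * beta * t**(beta - 1)."""
    return alpha * beta * t ** (beta - 1)


# Data of Example 7.25: T = 1200 h, n = 8, sum of ln(T / t_i) = 20
alpha_hat, beta_hat = duane_ml_estimates(n=8, T=1200, sum_log_ratios=20)
m_1200 = intensity(1200, alpha_hat, beta_hat)
m_3000 = intensity(3000, alpha_hat, beta_hat)
expected_failures_at_T = alpha_hat * 1200 ** beta_hat  # estimate of M(T), equals n

print(round(beta_hat, 2), round(alpha_hat, 2))  # 0.4 0.47
print(f"{m_1200:.2e} {m_3000:.2e}")             # 2.67e-03 1.54e-03
```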
exponential distribution function can thus be used. Important is also that the mean value of the random time $\tau_R(t)$ from an arbitrary (fixed) time point t ≥ 0 to the next failure is independent of the process development up to the time t and is given by (Eq. (A7.197))

$E[\tau_R(t)] = \int_0^\infty \Pr\{\text{no event in } (t, t + x]\} \, dx = \int_0^\infty e^{-(M(t+x) - M(t))} \, dx$, (7.107)

yielding, for instance,

$E[\tau_R(t)] = 1/\lambda$, t given (fixed), x > 0, (7.108)

for $M(t + x) = M(t) + \lambda x$, i.e. $m(t + x) = \lambda$ for t given (fixed) and x > 0, and

$E[\tau_R(t)] = \Gamma(1 + 1/\beta) / \alpha^{1/\beta}$, t given (fixed), x > 0, (7.109)

for $M(t + x) = M(t) + \alpha x^{\beta}$ (Appendix A9.6 or Eq. (A6.92) with $\lambda = \alpha^{1/\beta}$).

The Duane model often applies to electronic, electromechanical, and mechanical equipment and systems. It can also be used to describe the occurrence of software defects (dynamic defects). However, other models have been discussed in the literature especially for software (Section 5.3.4). Among these, the logarithmic Poisson model, which assumes a nonhomogeneous Poisson process with intensity
For the logarithmic Poisson model, m(t) is monotonically decreasing with $m(0) < \infty$ and $m(\infty) = 0$. Considering M(0) = 0, it follows that
Models combining in a multiplicative way two possible mean value functions M(t) have been investigated in [7.33] by assuming

$M(t) = a \ln(1 + t/b) \cdot (1 - e^{-t/b})$ and $M(t) = \alpha t^{\beta} \cdot [1 - (1 + t/\gamma) \, e^{-t/\gamma}]$, (7.112)

with a, b, α, γ > 0, 0 < β < 1, t ≥ 0. In both cases, the intensity m(t) grows from 0 to a maximum, from which it goes to 0 with a shape similar to that of the models given by Eq. (7.110). The models described by Eqs. (7.100), (7.111), and (7.112) are based on nonhomogeneous Poisson processes, thus satisfying the properties discussed in Appendix A7.8.2. Although appealing, nonhomogeneous Poisson processes (NHPP) cannot solve all reliability growth modeling problems, basically because of their intrinsic simplicity related to the assumption of independent increments. The consequence
of this assumption is that the NHPP is a process without aftereffect for which the waiting time to the next event from an arbitrary time point t is independent of the process development up to time t (Eq. (7.107) or Eq. (A7.197)). Furthermore, the first occurrence time $\tau_1^*$ characterizes the NHPP (Eq. (A7.201)). In particular, a NHPP cannot be used to estimate the number of defects present in a software package, see e.g. [A7.30] for further comments. In general, it is not possible to fix a priori the model to be used in a given situation. For hardware as well as for software, a physical motivation of the model, based on failure or defect causes / mechanisms, can help in such a choice. Having the "best model", the next step should be to verify that the assumptions made are compatible with the model and after that to check the compatibility with data. Misuses or misinterpretations can occur, often because of dependencies between the involved random variables.
8 Quality and Reliability Assurance During the Production Phase (Basic Considerations)
Reliability assurance has to be continued during the production phase, coordinated with other quality assurance activities. This applies in particular to monitoring and controlling production processes, item configuration, in-process and final tests, screening procedures, and collection, analysis & correction of defects and failures. The last measure leads to a learning process whose purpose is to optimize the quality of manufacture, taking into account cost and time schedule limitations. This chapter introduces some basic aspects of quality and reliability assurance during production, discusses test and screening procedures for electronic components and assemblies, introduces the concept of cost optimization related to a test strategy and develops it for a cost optimized test and screening strategy at the incoming inspection. For greater details on qualification & monitoring of production processes one may refer to [7.1-7.5, 8.1-8.15]. Models for reliability growth are discussed in Section 7.7.
8.1 Basic Activities
The quality and reliability level achieved during the design and development phase must be retained during production (pilot and series production). The following basic activities support this purpose.

1. Management of the item's configuration (review and release of the production documentation, control and accounting of changes and modifications).
2. Selection and qualification of production facilities and processes.
3. Monitoring and control of the production procedures (assembling, testing, transportation, storage, etc.).
4. Protection against damage during production (electrostatic discharge (ESD), mechanical, thermal or electrical stresses).
5. Systematic collection, analysis, and correction of defects and failures occurring during the item's production or testing (back to the root cause).
6. Quality and reliability assurance during procurement (documentation, incoming inspection, supplier audits).
7. Calibration of measurement and testing equipment.
8. Performance of in-process and final tests (functional and environmental).
9. Screening of critical components and assemblies.
10. Realization of a test and screening strategy (optimization of the cost and time schedule for testing and screening).

Configuration management, monitoring of corrective actions, and some important aspects of statistical quality control and reliability tests have been considered in Section 1.3, Chapter 7, and Appendices A3-A5. The following sections present test and screening procedures for electronic components and assemblies, introduce the concept of test and screening strategy, and develop it for a cost optimized test and screening strategy at the incoming inspection. Although focused on electronic systems, many of the considerations given below apply to mechanical systems as well. For greater details on qualification & monitoring of production processes one may refer to [7.1-7.5, 8.1-8.15] (see also Section 7.7 for reliability growth).
8.2 Testing and Screening of Electronic Components

8.2.1 Testing of Electronic Components
Most electronic components are tested today by the end user only on a sampling basis. To be cost effective, sampling plans should take into consideration the quality assurance effort of the component's manufacturer, in particular the confidence which can be given to the data furnished by the manufacturer. In critical cases, the sample should be large enough to allow acceptance of more than 2 defective components (Sections 7.1.3, 3.1.4). 100% incoming inspection can be necessary for components used in high reliability and / or safety equipment and systems, new components, components with important changes in design or manufacturing, or for some critical components like power semiconductors, mixed-signal ICs, and complex logic ICs used at the limits of their dynamic parameters. This holds as long as the fraction of defective items remains over a certain limit, fixed by technical and cost considerations. Advantages of a 100% incoming inspection of electronic components are:

1. Quick detection of all relevant defects.
2. Reduction of the number of defective populated printed circuit boards (PCBs).
3. Simplification of the tests at PCB level.
4. Replacement of the defective components by the supplier.
5. Protection against quality changes from lot to lot, or within the same lot.
Despite such advantages, different kinds of damage (overstress during testing, assembling, soldering) can cause problems at PCB level. The defective probability p (fraction of defective items) lies for today's established components in the range of a few ppm (parts per million) for passive components up to thousands of ppm for complex active components. In defining a test strategy, a possible change of p from lot to lot or within the same lot should also be considered. An example of a test procedure for electronic components is given in Section 3.2.1 for VLSI ICs. Test strategies with cost considerations are developed in Section 8.4.
8.2.2 Screening of Electronic Components
Electronic components new on the market, produced in small series, subjected to an important redesign, or manufactured with insufficiently stable process parameters can exhibit early failures, i.e. failures during the first operating hours (generally up to some few thousand hours). Because of high replacement cost at equipment level or in the field, components exhibiting early failures should be eliminated before they are mounted on printed circuit boards. Defining a cost-effective screening strategy is difficult for at least the following two reasons:

1. It may activate failure mechanisms that would not appear in field operation.
2. It could introduce damage (ESD, transients) which may be the cause of further early failures.

Ideally, screening should be performed by skilled personnel, be focused on the failure mechanisms which have to be activated, and not cause damage or alteration. Experience on a large number of components shows that for established technologies and stable process parameters, thermal cycles for discrete (in particular power) devices and burn-in for ICs are the most effective steps to precipitate early failures. Table 8.1 gives possible screening procedures for electronic components used in high reliability or safety equipment and systems. Screening procedures and sequences are given in national and international standards [8.27, 8.32]. The following is an example of a screening procedure for ICs in hermetic packages for high reliability or safety applications:
1. High-temperature storage: The purpose of high temperature storage is the stabilization of the thermodynamic equilibrium and thus of the IC electrical parameters. Failure mechanisms related to surface problems (contamination, oxidation, contacts) are activated. The ICs are placed on a metal tray (pins on the tray to avoid thermal voltage stresses) in an oven at 150°C for 24 h. Should solderability be a problem, a protective atmosphere (N2) can be used.
2. Thermal cycles: The purpose of thermal cycles is to test the ICs' ability to endure rapid temperature changes; this activates failure mechanisms related to mechanical stresses caused by mismatch in expansion coefficients of the
Table 8.1 Example of test and screening procedures for electronic components used in high reliability or safety equipment and systems (apply in part to SMD)

Component: Sequence

Resistors: Visual inspection, 20 thermal cycles for resistor networks (-40 / +125°C)*, 48 h steady-state burn-in at 100°C and 0.6 $P_N$*, el. test at 25°C*

Capacitors
  Film: Visual inspection, 48 h steady-state burn-in at 0.9 $\theta_{max}$ and $U_N$*, el. test at 25°C (C, tan δ, $R_{is}$)*, measurement of $R_{is}$ at 70°C*
  Ceramic: Visual inspection, 20 thermal cycles ($\theta_{extr}$)*, 48 h steady-state burn-in at $U_N$ and 0.9 $\theta_{max}$*, el. test at 25°C (C, tan δ, $R_{is}$)*, measurement of $R_{is}$ at 70°C
  Tantalum (solid): Visual inspection, 10 thermal cycles ($\theta_{extr}$)*, 48 h steady-state burn-in at $U_N$ and 0.9 $\theta_{max}$ (low $Z_Q$)*, el. test at 25°C (C, tan δ, $I_r$)*, measurement of $I_r$ at 70°C*
  Aluminum: Visual inspection, forming (as necessary), 48 h steady-state burn-in at $U_N$ and 0.9 $\theta_{max}$*, el. test at 25°C (C, tan δ, $I_r$)*, measurement of $I_r$ at 70°C*

Diodes (Si): Visual inspection, 30 thermal cycles (-40 / +125°C)*, 48 h reverse bias burn-in at 125°C*, el. test at 25°C ($I_r$, $U_F$, $U_{R\,min}$)*, seal test (fine/gross leak)*+

Transistors (Si): Visual inspection, 20 thermal cycles (-40 / +125°C)*, 50 power cycles (25 / 125°C, ca. 1 min on / 2 min off) for power elements*, el. test at 25°C ($\beta$, $I_{CEO}$, $U_{CEO\,min}$)*, seal test (fine/gross leak)*+

Optoelectronic
  LED, IRED: Visual inspection, 72 h high temp. storage at 100°C*, 20 thermal cycles (-20 / +80°C)*, el. test at 25°C ($U_F$, $U_{R\,min}$)*, seal test (fine/gross leak)*+
  Optocoupler: Visual inspection, 20 thermal cycles (-25 / 100°C), 72 h reverse bias burn-in (HTRB) at 85°C*, el. test at 25°C ($I_C / I_F$, $U_F$, $U_{R\,min}$, $U_{CE\,sat}$, $I_{CEO}$), seal test (fine/gross leak)*+

Digital ICs
  BiCMOS: Visual inspection, reduced el. test at 25°C, 48 h dyn. burn-in at 125°C*, el. test at 70°C*, seal test (fine/gross leak)*+
  MOS (VLSI): Visual inspection, reduced el. test at 25°C (rough functional test, $I_{DD}$), 72 h dyn. burn-in at 125°C*, el. test at 70°C*, seal test (fine/gross leak)*+
  CMOS (VLSI): Visual inspection, reduced el. test at 25°C (rough functional test, $I_{DD}$), 48 h dyn. burn-in at 125°C*, el. test at 70°C*, seal test (fine/gross leak)*+
  EPROM, EEPROM (>1M): Visual inspection, programming (CHB), high temp. storage (48 h / 125°C), erase, programming (inv. CHB), high temp. storage (48 h / 125°C), erase, el. test at 70°C, seal test (fine/gross leak)*+

Linear ICs: Visual inspection, reduced el. test at 25°C (rough functional test, $I_{CC}$, offsets), 20 thermal cycles (-40 / +125°C)*, 96 h reverse bias burn-in (HTRB) at 125°C with red. el. test at 25°C*, el. test at 70°C*, seal test (fine/gross leak)*+

Hybrid ICs: Visual inspection, high temp. storage (24 h / 125°C), 20 thermal cycles (-40 / +125°C), constant acceleration (2,000 to 20,000 $g_n$ / 60 s)*, red. el. test at 25°C, 96 h dynamic burn-in at 85 to 125°C, el. test at 25°C, seal test (fine/gross leak)*+
8.2 Testing and Screening of Electronic Components
material used. Thermal cycles are generally performed air to air in a two-chamber oven (transfer from the low to the high temperature chamber and vice versa using a lift). The ICs are placed on a metal tray (pins on the tray to avoid thermal voltage stresses) and subjected to at least 10 thermal cycles from -65 to +150°C (transfer time ≤ 1 min, time to reach the specified temperature ≤ 15 min, dwell time at the temperature extremes ≥ 10 min). Should solderability be a problem, a protective atmosphere (N2) can be used.
3. Constant acceleration: The purpose of the constant acceleration is to check the mechanical stability of die-attach, bonding, and package. This step is only performed for ICs in hermetic packages, when used in critical applications. The ICs are placed in a centrifuge and subjected to an acceleration of 30,000 $g_n$ (≈ 300,000 m/s²) for 60 seconds (generally z-axis only).
4. Burn-in: Burn-in is a relatively expensive but efficient screening step that provokes for ICs up to 80% of the chip-related and 30% of the package-related early failures. The ICs are placed in an oven at 125°C for 24 to 168 h and are operated statically or dynamically at this temperature (cooling under power at the end of burn-in is often required). Ideally, ICs should operate with electrical signals as in the field. The consequence of the high burn-in temperature is a time acceleration factor $A$, often given by the Arrhenius model (Eq. (7.56))
$$A = \frac{\lambda_2}{\lambda_1} = e^{\frac{E_a}{k}\left(\frac{1}{T_1} - \frac{1}{T_2}\right)},$$
where $E_a$ is the activation energy, $k$ the Boltzmann constant ($8.6 \cdot 10^{-5}$ eV/K), and $\lambda_1$ and $\lambda_2$ are the failure rates at chip temperatures $T_1$ and $T_2$ (in K), respectively; see Fig. 7.10 for a graphical representation. The activation energy $E_a$ varies according to the failure mechanisms involved. Global average values for ICs lie between 0.3 and 0.7 eV. Using Eq. (7.56), the burn-in duration can be calculated for a given application. For instance, if the period of early failures is 3,000 h, $\theta_1$ = 55°C, and $\theta_2$ = 130°C (junction temperature in °C), the effective burn-in duration would be about 50 h for $E_a$ = 0.65 eV and 200 h for $E_a$ = 0.4 eV. It is often difficult to decide whether a static or a dynamic burn-in is more effective. Should surface, oxide, and metallization problems be dominant, a static burn-in is better. On the other hand, a dynamic burn-in activates practically all failure mechanisms. It is therefore important to make such a choice on the basis of practical experience.
5. Seal: A seal test is performed to check the seal integrity of the cavity around the chip in hermetically packaged ICs. It begins with the fine leak test: ICs are placed in a vacuum (1 h at 0.5 mmHg) and then stored in a helium atmosphere under pressure (ca. 4 h at 5 atm); after a waiting period in open air (30 min), helium leakage is measured with the help of a specially
calibrated mass spectrometer (required sensitivity approx. $10^{-8}$ atm cm³/s, depending on the cavity volume). After the fine leak test, ICs are tested for gross leak: ICs are placed in a vacuum (1 h at 5 mmHg) and then stored under pressure (2 h at 5 atm) in fluorocarbon FC-72; after a short waiting period in open air (2 min), the ICs are immersed in a fluorocarbon indicator bath (FC-40) at 125°C; a continuous stream of small bubbles or two large bubbles from the same place within 30 s indicates a defect.
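The burn-in duration example of step 4 can be retraced numerically. The following is a minimal sketch, assuming the Arrhenius form A = exp[(Ea/k)(1/T1 - 1/T2)] and a burn-in duration equal to the early-failure period divided by A; the function names and the 273.15 K offset are choices made for this illustration, not taken from the text.

```python
import math

BOLTZMANN_EV_PER_K = 8.6e-5  # Boltzmann constant in eV/K, as quoted in the text


def arrhenius_acceleration(ea_ev, theta1_c, theta2_c):
    """Acceleration factor A = exp[(Ea/k) * (1/T1 - 1/T2)].

    theta1_c, theta2_c: chip (junction) temperatures in deg C (field, burn-in).
    """
    t1_k = theta1_c + 273.15  # convert to kelvin
    t2_k = theta2_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t1_k - 1.0 / t2_k))


def burn_in_hours(early_failure_period_h, ea_ev, theta1_c, theta2_c):
    """Burn-in time equivalent to the early-failure period in the field."""
    return early_failure_period_h / arrhenius_acceleration(ea_ev, theta1_c, theta2_c)


if __name__ == "__main__":
    # Values from the example: 3,000 h early-failure period, 55 and 130 deg C.
    for ea in (0.65, 0.4):
        a = arrhenius_acceleration(ea, 55.0, 130.0)
        hours = burn_in_hours(3000.0, ea, 55.0, 130.0)
        print(f"Ea = {ea} eV: A = {a:.1f}, burn-in = {hours:.0f} h")
```

The results are of the same order as the rounded 50 h and 200 h quoted in the example; the strong dependence on Ea illustrates why the choice of activation energy dominates the estimate.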
8.3 Testing and Screening of Electronic Assemblies
Electrical testing of electronic assemblies, for instance populated printed circuit boards (PCBs), can basically be performed in one of the following ways:
1. Functional test within the assembly or unit in which the PCB is used.
2. Functional test with the help of functional test equipment.
3. In-circuit test followed by a functional test with the assembly or unit in which the PCB is used.
The first method is useful for small series production. It assumes that components have been tested (or are of sufficient quality) and that automatic or semi-automatic isolation of defects on the PCB is possible. The second method is suitable for large series production, in particular from the point of view of protection against damage (ESD, backdriving, mechanical stresses), but can be expensive. The third and most commonly used method assumes the availability of appropriate in-circuit test equipment. With such equipment, each component is electrically isolated and tested statically or quasi-statically. This can be sufficient for passive components and discrete semiconductors, as well as for SSI and MSI ICs, but it cannot replace an electrical test at the incoming inspection for LSI and VLSI ICs (functional tests on in-circuit test equipment are limited to a few 100 kHz, and dynamic tests (Fig. 3.4) are not possible). Thus, even if in-circuit testing is used, incoming inspection of critical components should not be omitted. A further disadvantage of in-circuit testing is that the outputs of an IC can be forced to a LOW or a HIGH state. This stress (backdriving) is generally short (50 ns), but may be sufficient to cause damage to the IC in question. In spite of this, and of some other problems (polarity of electrolytic capacitors, paralleled components, tolerance of analog devices), in-circuit testing is today the most effective means to test populated printed circuit boards (PCBs), on account of its good defect isolation capability. Because of the large number of components and solder joints involved, the defective probability of a PCB can be relatively high even in stable production conditions. Experience shows that for a PCB with about 500 components and 3,000 solder
8.4 Test and Screening Strategies, Economic Aspects
joints, the following indicative values can be expected (see e.g. Table 1.3 for a fault report form): 0.5 to 2% defective PCBs (often 3/4 because of assembling and 1/4 because of components), 1.5 defects per defective PCB (mean value). Considering such figures, it is important to remember that defective PCBs are often reworked and that a repair or rework can have a negative influence on the quality and reliability of a PCB. Screening populated printed circuit boards (PCBs) or assemblies with higher integration level is generally a difficult task, because of the many different technologies involved. Experience on a large number of PCBs [3.76] leads to the following screening procedure, which can be recommended for PCBs of standard technology used in high reliability applications (to a limited extent also for mixed technology):
1. Visual inspection and reduced electrical test.
2. 100 thermal cycles between 0°C and +80°C, with temperature gradient ≤ 5°C/min (within the components), dwell time ≥ 10 min, and power off during cooling (gradient ≥ 20°C/min only if this also occurs in the field and is compatible with the PCB technology).
3. 15 min random vibration at 2 $g_{rms}$, 20 - 500 Hz (to be performed if significant vibrations occur in the field).
4. 48 h run-in at ambient temperature, with periodic power on/off switching.
5. Final electrical and functional test.
Careful investigations on SMT assemblies down to pitch 0.3 mm [3.79, 3.80, 3.89] have shown that basically two different deformation mechanisms can be present in tin-based solder joints (see Section 3.4): grain boundary sliding at rather low temperature (or thermal) gradients and low stiffness of the structure component-PCB, and dislocation climbing at higher temperature gradients and high stiffness (e.g. for leadless ceramic components). For this reason, screening of populated PCBs in SMT should be avoided if the temperature gradient occurring in the field is not known. Preventive actions, to build in quality and reliability during manufacturing, have to be preferred here. The above procedure can be considered as an environmental stress screening (ESS), often performed on a 100% basis in a series production of PCBs used in high reliability or safety applications to provoke early failures. It can serve as a basis for screening at higher integration levels. Thermal cycles can be combined with power on/off switching or vibration to increase effectiveness. However, in general a screening strategy for PCBs (or at higher integration level) should be established on a case-by-case basis, and be periodically reconsidered (reduced or even canceled if the percentage of early failures drops below a given value, 1% for instance).
Burn-in at assembly level can be used in the context of a reliability test to validate a predicted assembly failure rate $\lambda_S$. Assuming that the assembly consists of elements $E_1, ..., E_n$ in series, with failure rates $\lambda_1(T_1), ..., \lambda_n(T_1)$ at temperature $T_1$ and acceleration factors $A_1, ..., A_n$ for a stress at temperature $T_2$, the assembly failure rate $\lambda_S(T_2)$ at temperature $T_2$ (stress) can be calculated from (Eq. (7.57))
$$\lambda_S(T_2) = \sum_{i=1}^{n} A_i \, \lambda_i(T_1).$$
Comparison of the predicted failure rate $\lambda_S(T_2)$ with real data can be performed by submitting the assembly to a burn-in at temperature $T_2$ and evaluating the experimentally obtained failure rate (Section 7.2.3). However, because of the many different technologies often used in an assembly (e.g. populated PCB), $T_2$ is generally chosen < 100°C.
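The series-model prediction described above can be sketched in a few lines. This is only an illustration: the function name and the element values (failure rates and acceleration factors) are hypothetical and not taken from the text.

```python
def assembly_failure_rate(rates_at_t1, acceleration_factors):
    """Failure rate of a series assembly at stress temperature T2.

    rates_at_t1: element failure rates lambda_i(T1), e.g. in 1/h.
    acceleration_factors: A_i for each element between T1 and T2.
    Returns sum_i A_i * lambda_i(T1).
    """
    if len(rates_at_t1) != len(acceleration_factors):
        raise ValueError("one acceleration factor is required per element")
    return sum(a * lam for a, lam in zip(acceleration_factors, rates_at_t1))


if __name__ == "__main__":
    # Hypothetical elements: failure rates at T1 in 1/h and their acceleration factors.
    rates = [20e-9, 50e-9, 5e-9]
    factors = [10.0, 30.0, 2.0]
    print(f"lambda_S(T2) = {assembly_failure_rate(rates, factors):.3e} 1/h")
```

Because each element carries its own acceleration factor, the assembly as a whole has no single activation energy; the element with the largest product of failure rate and acceleration factor dominates the result.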
8.4 Test and Screening Strategies, Economic Aspects
8.4.1 Basic Considerations
In view of the optimization of cost associated with testing and screening during production, each manufacturer of high-performance equipment and systems is confronted with the following question: What is the most cost-effective approach to eliminate all defects, systematic failures, and early failures prior to shipment to the customer? The answer to this question depends essentially on the level of quality, reliability, and safety required for the item considered, the consequence of a defect or a failure, the effectiveness of each test or screening step, as well as on the direct and deferred cost involved (warranty cost for instance). A test and screening strategy should thus be tailored to the item considered, in particular to its complexity, technology, and production procedures, but also to the facilities and skill of the manufacturer. In setting up such a strategy, the following aspects must be considered:
1. Cost equations should include deferred cost (for instance, warranty cost and cost for loss of image).
2. Testing and screening should begin at the lowest level of integration and be selective, i.e. consider the effectiveness of each test or screening step.
3. Qualification tests on prototypes are important to eliminate defects and systematic failures; they should include performance, environmental, and reliability tests.
4. Testing and screening should be carefully planned to allow high interpretability of the results, and be supported by a quality data reporting system (Fig. 1.8).
5. Testing and screening strategy should be discussed early in the design phase, during design reviews.
Figure 8.1 can be used as a starting point for the development of a test and screening strategy at the assembly level. A basic relationship between test strategy and cost is illustrated in the example of Fig. 8.2, in which two different strategies are compared. Both cases in Fig. 8.2 deal with the production of a stated quantity of equipment or systems for which a total of 100,000 ICs of a given type are necessary. The ICs are delivered with a defective probability p = 0.5%. During production, additional defects occur as a result of incorrect handling, mounting, etc., with probabilities of 0.01% at
[Figure 8.1: flow chart with the steps Incoming inspection; PCB assembling and soldering; In-circuit test; Functional test; Unit assembling and testing; Storage, shipping, use]
Figure 8.1 Flow chart as a basis for the development of a test and screening strategy for electronic assemblies (e.g. populated printed circuit boards (PCBs))
the incoming inspection, 0.1% at assembly level, and 0.01% at equipment level. The cost of eliminating a defective IC is assumed to be $2 (US$) at the incoming inspection, $20 at assembly level, $200 at equipment level, and $2,000 during warranty. The two test strategies differ in the probability (DPr) of detecting/recognizing and eliminating a defect. This probability is, for the four levels, 0.1, 0.9, 0.8, 1.0 in the first strategy and 0.95, 0.9, 0.8, 1.0 in the second strategy. It is assumed, in this example, that the additional cost to improve the detection probability at incoming inspection (+$20,000) is partly compensated by the savings in the test at the assembly level (-$10,000). As Fig. 8.2 shows, the total cost of the second test strategy is (for this example) lower by $21,900 than that of the first one. In all considerations of this kind, numbers of defects and cost are expected values (means of random variables). The use of arithmetic means in the example of Fig. 8.2, on the basis of 100,000 ICs at the input, is for convenience only.
[Figure 8.2: flow diagrams for the two strategies over the levels incoming inspection, assembly, equipment, and warranty. Values recoverable from the figure:
Strategy a: defects at the input 500 (0.5%); added defects 10 (0.01%), 100 (0.1%), 10 (0.01%); DPr = 0.1, 0.9, 0.8, 1; discovered defects 51, 503, 53, 13.
Strategy b: same defective probabilities; DPr = 0.95, 0.9, 0.8, 1; discovered defects 485, 113, 18, 4; defects cost (in 1000 US$) 1, 2.3, 3.6, 8; deferred cost (in 1000 US$) (+20), (-10), -, -; Σ = 24,900 US$.]
Figure 8.2 Comparison between two possible test strategies (figures for defects and cost have to be considered as expected values): a) Emphasis on assembly test; b) Emphasis on incoming inspection (DPr = detection/recognition probability)
Models like that of Fig. 8.2 can be used to identify weak points in the production process (e.g. with respect to the defective probabilities at the different production steps) or to evaluate the effectiveness of additional measures introduced to decrease quality cost.
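A model of this kind is straightforward to evaluate programmatically. The sketch below propagates expected numbers of defects through the four levels (incoming inspection, assembly, equipment, warranty) using the probabilities and unit costs stated in the example; the function and variable names are illustrative assumptions. Fig. 8.2 rounds the number of defects at each level, so the unrounded totals computed here differ slightly from the values in the figure.

```python
def expected_cost(n_items, p_input, p_added, detection_prob, unit_cost, extra_cost=0.0):
    """Expected cost of a test strategy over successive levels.

    p_input: defective probability of the delivered items.
    p_added: probability of an additional defect introduced at each level.
    detection_prob: DPr at each level (1.0 at the last level, warranty).
    unit_cost: cost of eliminating one defect at each level.
    extra_cost: net additional cost of the strategy itself (hypothetical helper argument).
    Returns (total expected cost, list of expected discovered defects per level).
    """
    defects = n_items * p_input
    total = extra_cost
    discovered = []
    for p_new, dpr, cost in zip(p_added, detection_prob, unit_cost):
        defects += n_items * p_new      # defects introduced at this level
        found = defects * dpr           # expected number discovered here
        discovered.append(found)
        total += found * cost
        defects -= found                # undiscovered defects move on
    return total, discovered


P_ADDED = (0.0001, 0.001, 0.0001, 0.0)      # 0.01%, 0.1%, 0.01%, none in warranty
UNIT_COST = (2.0, 20.0, 200.0, 2000.0)      # US$ per eliminated defect

cost_a, found_a = expected_cost(100_000, 0.005, P_ADDED, (0.1, 0.9, 0.8, 1.0), UNIT_COST)
cost_b, found_b = expected_cost(100_000, 0.005, P_ADDED, (0.95, 0.9, 0.8, 1.0), UNIT_COST,
                                extra_cost=20_000 - 10_000)

if __name__ == "__main__":
    print("strategy a:", round(cost_a), [round(f) for f in found_a])
    print("strategy b:", round(cost_b), [round(f) for f in found_b])
    print("saving of b over a:", round(cost_a - cost_b))
```

Varying one detection probability or one defective probability at a time in such a script is a simple way to see which production step contributes most to the total cost.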
8.4.2 Quality Cost Optimization at Incoming Inspection Level
In this section, optimization of quality cost in the context of a testing and screening strategy is considered for the choice between a 100% incoming inspection and an incoming inspection on a sampling basis, whichever is more cost effective. Two cases will be distinguished: incoming inspection without screening (test only, illustrated by Fig. 8.3 and Fig. 8.4) and incoming inspection with screening (test and screening, illustrated by Fig. 8.5 and Fig. 8.6). The following notation is used:
$A_t$ = probability of acceptance at the sampling test, i.e. probability of having no more than $c$ defective components in a sample of size $n$ (function of $p_d$, given by Eq. (A6.121) with $p = p_d$ and $k = c$; see also Fig. 7.2 for a graphical solution using the Poisson approximation)
$A_s$ = same as $A_t$, but for screening (screening with test)
$c_d$ = deferred cost per defective component
$c_f$ = deferred cost per component with early failure
$c_r$ = replacement cost per component at the incoming inspection
$c_t$ = testing cost per component (test only)
$c_s$ = screening cost per component ($c_s$ includes cost for screening and for test)
$C_t$ = expected value (mean) of the total cost (direct and deferred) for incoming inspection without screening (test only) of a lot of $N$ components
$C_s$ = expected value (mean) of the total cost (direct and deferred) for incoming inspection with screening (screening with test) of a lot of $N$ components
$n$ = sample size
$N$ = lot size
$p_d$ = defective probability (defects are recognized at the test)
$p_f$ = probability for an early failure (early failures are precipitated by the screening)
[Figure 8.3: flow chart. Recoverable labels: Lot of size N; Assembly, test, use; Deferred cost: $A_t \, p_d \, (N - n) \, c_d$]
Figure 8.3 Model for quality cost optimization (direct and deferred cost) at the incoming inspection without screening of a lot of N components (all cost are expected values, see Fig. 8.5 for screening)
Consider first the incoming inspection without screening (test only). The corresponding model is shown in Fig. 8.3. From Fig. 8.3, the following cost equation can be established for the expected value (mean) of the total cost $C_t$
$$C_t = n\,(c_t + p_d c_r) + (N - n)\,A_t\,p_d\,c_d + (N - n)(1 - A_t)(c_t + p_d c_r) = N\,(c_t + p_d c_r) + (N - n)\,A_t\,[\,p_d c_d - (c_t + p_d c_r)\,]. \tag{8.1}$$
Investigating Eq. (8.1) leads to the following cases:
1. For $p_d = 0$, $A_t = 1$ and thus
$$C_t = n\,c_t. \tag{8.2}$$
2. For a 100% incoming inspection, $n = N$ and thus
$$C_t = N\,(c_t + p_d c_r). \tag{8.3}$$
3. For
$$p_d < \frac{c_t}{c_d - c_r} \tag{8.4}$$
it follows
$$C_t < N\,(c_t + p_d c_r)$$
and thus a sampling test is more cost effective.
4. For
$$p_d > \frac{c_t}{c_d - c_r} \tag{8.5}$$
it follows
$$C_t > N\,(c_t + p_d c_r)$$
and thus a 100% incoming inspection is more cost effective.

[Figure 8.4: flow chart. Recoverable labels: Lot of size N; Empirical values for $p_d$, $c_t$, $c_d$, $c_r$; Sample of size n; inspection (test)]
Figure 8.4 Practical realization of the procedure described by the model of Fig. 8.3

The practical realization of the procedure according to the model of Fig. 8.3 is given in Fig. 8.4. The sample of size $n$, tested instead of the 100% incoming inspection if inequality (8.4) is fulfilled, is used to verify the value of $p_d$, which for the actual lot can differ from the assumed one. A table of AQL values (Table 7.1) can be used to determine values for $n$ and $c$ of the sampling plan, with AQL = $p_d$ in uncritical cases and AQL < $p_d$ if a reduction of the risk of deferred cost is desired.
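The trade-off between sampling and 100% incoming inspection can be explored numerically. The sketch below is based on an assumed cost structure, labeled here as an assumption: the sample of size n is tested and its defective components replaced; if the lot is accepted (probability A_t, taken as binomial), the remaining N - n components are used untested and cause the deferred cost A_t p_d (N - n) c_d shown in Fig. 8.3; if it is rejected, the remainder is tested 100%. Function names and the numerical values in the usage part are hypothetical.

```python
from math import comb


def acceptance_probability(n, c, p_d):
    """Probability of at most c defective components in a sample of size n (binomial)."""
    return sum(comb(n, i) * p_d**i * (1.0 - p_d) ** (n - i) for i in range(c + 1))


def total_cost_test_only(lot_size, n, c, p_d, c_t, c_r, c_d):
    """Expected total cost for incoming inspection without screening (assumed model)."""
    a_t = acceptance_probability(n, c, p_d)
    cost_per_tested = c_t + p_d * c_r           # test plus expected replacement
    return (n * cost_per_tested                                   # sample test
            + (lot_size - n) * a_t * p_d * c_d                    # accepted: deferred cost
            + (lot_size - n) * (1.0 - a_t) * cost_per_tested)     # rejected: 100% test


def sampling_is_cheaper(p_d, c_t, c_r, c_d):
    """Decision rule that follows from the assumed model: p_d < c_t / (c_d - c_r)."""
    return p_d < c_t / (c_d - c_r)


if __name__ == "__main__":
    # Hypothetical values: lot of 1000, sample of 50, acceptance number 0.
    N, n, c = 1000, 50, 0
    c_t, c_r, c_d = 0.5, 1.0, 100.0
    for p_d in (0.001, 0.02):
        sampling = total_cost_test_only(N, n, c, p_d, c_t, c_r, c_d)
        full = N * (c_t + p_d * c_r)
        print(p_d, round(sampling, 1), round(full, 1), sampling_is_cheaper(p_d, c_t, c_r, c_d))
```

With these illustrative numbers the threshold is about 0.5%, so sampling is cheaper for the lower defective probability and 100% inspection for the higher one.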
[Figure 8.5: flow chart. Recoverable labels: Lot of size N; Sample of size n (screening with el. test); Screening; Test; Accept?; El. test (without screening) of the remaining (N - n) components]
Figure 8.5 Model for quality cost optimization (direct and deferred cost) at the incoming inspection with screening of a lot of N components (all cost are expected values; screening includes test)
As a second case, let us consider the situation of an incoming inspection with screening (Section 8.2). Figure 8.5 gives the corresponding model and leads to the cost equation (8.6). The same considerations as with Eqs. (8.2) - (8.5) lead to the conclusion that if inequality (8.7) holds, then a sampling screening (with test) is more cost effective than a 100% screening. The practical realization of the procedure according to the model of Fig. 8.5 is given in Fig. 8.6. As in Fig. 8.4, the sample of size $n$, screened instead of the 100% screening if inequality (8.7) is fulfilled, is used to verify the values of $p_f$ and $p_d$, which for the actual lot can differ from the assumed ones.
[Figure 8.6: flow chart. Recoverable labels: Sample test of size n; 100% incoming inspection without screening (test only); 100% incoming inspection with screening (screening with test); Assembly, test, use]
Figure 8.6 Practical realization of the procedure described by Fig. 8.5 (screening includes test)
The lower part on the left-hand side of Fig. 8.6 is identical to Fig. 8.4. The first inequality in Fig. 8.6 follows from inequality (8.7) with the assumption
The second inequality in Fig. 8.6 refers to the cost for incoming inspection without screening (inequality (8.4)).
8.4.3 Procedure to handle first deliveries
Components, materials, and externally manufactured subassemblies or assemblies should be submitted at the first delivery to an appropriate selection procedure. Part of this procedure can be performed in cooperation with the manufacturer to avoid duplication of efforts. Figure 8.7 gives the basic structure of such a procedure; see Sections 3.2 and 3.4 for some examples of qualification tests for components and assemblies.
[Figure 8.7: flow chart. Recoverable labels: Qualification test; appropriate?; Reject; First lot; 100% incoming inspection; experience; components and materials]
Figure 8.7 Selection procedure for non qualified components and materials
A1 Terms and Definitions
This appendix defines and comments on the terms most commonly used in reliability engineering (Fig. A1.1). Table 5.4 extends this appendix to software quality (see also [A1.5 (610)]). Attention has been paid to the adherence to relevant international standards (ISO, IEC) and recent trends [A1.1 - A1.8].
[Figure A1.1: tree diagram listing the terms System, Systems Engineering, Concurrent Engineering, Cost Effectiveness, Quality; Capability; Availability, Dependability; Reliability; Item; Required Function, Mission Profile; Reliability Block Diagram, Redundancy; MTTF, MTBF; Failure, Failure Rate, Failure Intensity, Derating; FMEA, FMECA, FTA; Reliability Growth, Environmental Stress Screening, Burn-in; Maintainability; Preventive Maintenance, MTTPM, MTBUR; Corrective Maintenance, MTTR; Logistic Support; Fault; Defect, Nonconformity; Systematic Failure; Failure; Safety; Quality Management, Total Quality Management (TQM); Quality Assurance; Configuration Management, Design Review; Quality Test; Quality Control during Production; Quality Data Reporting System; Life Time, Useful Life; Life-Cycle Cost, Value Engineering, Value Analysis; Product Assurance, Product Liability]
Figure A1.1 Terms most commonly used in reliability engineering
Availability, Point Availability (A(t), PA(t))
Probability that the item is in a state to perform the required function at a given instant of time.
Instantaneous availability is often used. The use of A(t) should be avoided, to prevent confusion with other kinds of availability (e.g. average availability AA(t), mission availability MA($T_0$, $t_0$), and work-mission availability WMA($T_0$, $x$) in Section 6.2). A qualitative definition, focused on ability, is also possible. The term item stands for a structural unit of arbitrary complexity. Computation generally assumes continuous operation (item down only for repair), renewal at failure (good-as-new after repair), and ideal human factors & logistic support. For an item with more than one element, good-as-new after repair refers in this book to the repaired element in the reliability block diagram. This assumption is valid for the whole item (system) only in the case of constant failure rates for all elements. Assuming renewal for the whole item, the asymptotic & steady-state value of the point availability can be expressed by PA = MTTF / (MTTF + MTTR). PA is also the asymptotic & steady-state value of the average availability AA (often given as availability A).
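As a small numerical illustration of the steady-state expression PA = MTTF / (MTTF + MTTR), the following sketch uses hypothetical values for MTTF and MTTR; the function name is an assumption made for this example.

```python
def steady_state_availability(mttf_h, mttr_h):
    """Asymptotic & steady-state point availability PA = MTTF / (MTTF + MTTR)."""
    if mttf_h <= 0 or mttr_h < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_h / (mttf_h + mttr_h)


if __name__ == "__main__":
    # Hypothetical item: MTTF = 5,000 h, MTTR = 5 h.
    pa = steady_state_availability(5000.0, 5.0)
    print(f"PA = {pa:.6f}, mean unavailability = {1.0 - pa:.2e}")
```

For MTTR much smaller than MTTF, the unavailability 1 - PA is approximately MTTR / MTTF, which is often the more convenient figure to quote.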
Burn-in (nonrepairable items)
Type of screening test while the item is in operation.
For electronic devices, stresses during burn-in are often constant higher ambient temperature (e.g. 125°C for ICs) and constant higher supply voltage. Burn-in can be considered as a part of a screening procedure, performed on a 100% basis to provoke early failures and to stabilize the characteristics of the item. Often it can be used as an accelerated reliability test to investigate the item's failure rate.

Burn-in (repairable items)
Process of increasing the reliability of hardware by employing functional operation of every item in a prescribed environment with corrective maintenance during the early failure period.
The term run-in is often used instead of burn-in. The stress conditions have to be chosen as near as possible to those expected in field operation. Flaws detected during burn-in can be deterministic (defects or systematic failures) during the pilot production (reliability growth), but should be attributable only to early failures (randomly distributed) during the series production.

Capability
Ability to meet a service demand of given quantitative characteristics under given internal conditions.
Performance (technical performance) is often used instead of capability.
Concurrent Engineering
Systematic approach to reduce the time to develop, manufacture, and market the item, essentially by integrating production activities into the design & development phase.
Concurrent engineering is achieved through intensive teamwork between all engineers involved in the design, production, and marketing of the item. It has a positive influence on the optimization of life-cycle cost.

Configuration Management
Procedure to specify, describe, audit, and release the configuration of the item, as well as to control it during modifications or changes.
Configuration includes all of the item's functional and physical characteristics as given in the documentation (to specify, build, test, accept, operate, maintain, and logistically support the item) and as present in the hardware and/or software. In practical applications, it is useful to subdivide configuration management into configuration identification, auditing, control (design reviews), and accounting. Configuration management is of particular importance during the design & development phase.

Corrective Maintenance
Maintenance carried out after failure to restore the required function.
Corrective maintenance is also known as repair and can include any or all of the following steps: recognition, isolation (localization & diagnosis), elimination or removal (disassemble, remove, replace, reassemble), and function checkout. Repair is used in this book as a synonym for restoration. To simplify computation it is generally assumed that the repaired element in the reliability block diagram is as-good-as-new after each repair (also including a possible environmental stress screening of the spare parts). This assumption applies to the whole item (equipment or system) if all elements of the item (which have not been renewed) have constant failure rates (see failure rate for further comments).

Cost Effectiveness
Measure of the ability of the item to meet a service demand of stated quantitative characteristics, with the best possible usefulness to life-cycle cost ratio.
System effectiveness is often used instead of cost effectiveness.
Defect
Nonfulfillment of a requirement related to an intended or specified use.
From a technical point of view, a defect is similar to a nonconformity, however not necessarily from a legal point of view (in relation to product liability, nonconformity should be preferred). Defects do not need to influence the item's functionality. They are caused by flaws (errors, mistakes) during design, development, production, or installation. The term defect should be preferred to that of error, which is a cause. Unlike failures, which always appear in time (randomly distributed), defects are present at t = 0. However, some defects can only be recognized when the item is operating and are referred to as dynamic defects (e.g. in software). Similar to defects, with regard to causes, are systematic failures (e.g. cooling problem); however, they are often not present at t = 0.

Dependability
Collective term used to describe the availability performance and its influencing factors (reliability, maintainability, and logistic support).
Dependability is used generally in a qualitative sense, often defined as ability to provide the required function when demanded.

Derating
Designed reduction of stress from the rated value to enhance reliability.
The stress factor S expresses the ratio of actual to rated stress under normal operating conditions (generally at 25°C ambient temperature). Designed is used as a synonym for deliberate.

Design Review
Independent examination of the design to identify shortcomings that could affect the fitness for purpose, reliability, maintainability or maintenance support requirements of the item.
Design reviews are an important tool for quality assurance and TQM during the design and development of hardware and software (Tables A3.3, 5.3, 5.5, 2.8, 4.3, Appendix A4). An important objective of design reviews is to decide about continuation or stopping of the project considered, on the basis of objective considerations and feasibility check (Tables A3.3 and 5.3, Fig. 1.6).

Environmental Stress Screening (ESS)
Test or set of tests intended to remove defective items, or those likely to exhibit early failures.
ESS is a screening procedure often performed at assembly (PCB) or equipment level on a 100% basis to find defects and systematic failures during the pilot production (reliability growth), or to provoke early failures in a series production. For electronic items, it consists generally of temperature cycles
andlor random vibrations. Stresses are in general higher than in field operation, but not so high as to stimulate new failure mechanisms. Experience shows that to be cost effective, ESS has to be tailored to the item and production processes. At component level, the term screening is often used.
Failure Termination of the ability to perform the required function. Failures should be considered (classified) with respect to the mode, cause, effect, and mechanism. The cause of a failure can be intrinsic (early failure, failure with constant failure rate, wearont) or extrinsic (systematic failures, i. e. failures resulting from errors or mistakes in design, production, or operation which are deterministic and has to be considered as defects). The effect (consequence) of a failure is often different if considered on the directly affected item or on a higher level. A failure is an event appearing in time (randomly distributed), in contrast to a fault which is a state.
Failure Intensity ($z(t)$) Limit, if it exists, of the ratio of the mean number of failures of a repairable item within time interval $(t, t+\delta t]$ to $\delta t$, when $\delta t \to 0$. At system level, $z_S(t)$ is used. Failure intensity applies for repairable items, in particular when repair times are neglected and failure occurrence is considered on the time axis (arrival times). It has been investigated for Poisson processes (homogeneous ($z(t) = \lambda$) & nonhomogeneous ($z(t) = m(t)$)) and renewal processes ($z(t) = h(t)$) (Appendices A7.2, A7.8). For practical applications it holds that $z(t)\,\delta t = \Pr\{\nu(t+\delta t) - \nu(t) = 1\}$, with $\nu(t)$ = number of failures in $(0, t]$ (Eq. (A7.229)). See also failure rate.
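As an illustration of the homogeneous Poisson case $z(t) = \lambda$, the following minimal sketch simulates arrival times with exponentially distributed interarrival times and compares the observed number of failures per unit time with $\lambda$. The function names, the numerical value of $\lambda$, the observation horizon, and the random seed are assumptions chosen for the example only.

```python
import random


def simulate_hpp_arrivals(lam, horizon, rng):
    """Arrival times in (0, horizon] of a homogeneous Poisson process
    with intensity lam (exponentially distributed interarrival times)."""
    arrivals = []
    t = rng.expovariate(lam)
    while t <= horizon:
        arrivals.append(t)
        t += rng.expovariate(lam)
    return arrivals


def empirical_intensity(arrivals, t, dt):
    """Number of failures observed in (t, t + dt], divided by dt."""
    count = sum(1 for a in arrivals if t < a <= t + dt)
    return count / dt


# Hypothetical values, chosen only for illustration.
rng = random.Random(12345)
lam = 0.01            # failures per hour
horizon = 200_000.0   # hours of observation

arrivals = simulate_hpp_arrivals(lam, horizon, rng)
z_hat = empirical_intensity(arrivals, 0.0, horizon)
print(f"observed failures per hour: {z_hat:.5f} (intensity used: {lam})")
```

Averaging over one long interval is justified here only because the intensity is constant; for a time-dependent $z(t)$ the mean over many realizations in a short interval would have to be used instead.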
Failure Modes and Effects Analysis (FMEA) Qualitative method of analysis that involves the study of possible failure modes and faults in subitems, and their effects on the ability of the item to provide the required function. See FMECA for comments.
Failure Modes, Effects, and Criticality Analysis (FMECA) Quantitative or qualitative method of analysis that involves failure modes and effects analysis together with a consideration of the probability of the failure mode occurrence and the severity of the effects. The goal of an FMEA or FMECA is to identify all potential hazards and to analyze the possibilities of reducing their effect and/or occurrence probability. All possible failure modes and faults with the corresponding causes have to be considered bottom-up from lowest to highest integration level of the item considered. Often one distinguishes between design and production (process) FMEA or FMECA. FMECA can be used for fault modes, effects, and criticality analysis (same for FMEA).
Failure Rate ($\lambda(t)$) Limit, if it exists, of the ratio of the conditional probability that the failure occurs within time interval $(t, t+\delta t]$ to $\delta t$, when $\delta t \downarrow 0$, given that the item was new at $t = 0$ and did not fail in the interval $(0, t]$. At system level, $\lambda_S(t)$ is used. The failure rate applies in particular for nonrepairable items. In this case, if $\tau$ is the item failure-free time, with distribution function $F(t) = \Pr\{\tau \le t\}$, with $F(0) = 0$ and density $f(t)$, the failure rate $\lambda(t)$ follows as (Eq. (A6.25), $R(t) = 1 - F(t)$)
$$\lambda(t) = \lim_{\delta t \downarrow 0} \frac{1}{\delta t} \Pr\{t < \tau \le t + \delta t \mid \tau > t\} = \frac{f(t)}{1 - F(t)} = -\frac{dR(t)/dt}{R(t)}. \qquad \text{(A1.1)}$$
Considering $R(0) = 1$, Eq. (A1.1) yields $R(t) = e^{-\int_0^t \lambda(x)\,dx}$ and thus $R(t) = e^{-\lambda t}$ for $\lambda(t) = \lambda$. This important result characterizes the memoryless property of the exponential distribution $F(t) = 1 - e^{-\lambda t}$, expressed by Eq. (A1.1) for $\lambda(t) = \lambda$. Only for $\lambda(t) = \lambda$ can one estimate the failure rate $\lambda$ by $\hat{\lambda} = k/T$, where $T$ is the given (fixed) cumulative operating time and $k > 0$ the total number of failures during $T$ (Eq. (7.28)). Figure 1.2 shows a typical shape of $\lambda(t)$. However, considering Eq. (A1.1), the failure rate can be defined also for repairable items which are as-good-as-new after repair (restoration), taking instead of $t$ the variable $x$ starting by $x = 0$ at each repair (as for interarrival times). This is important when investigating repairable systems (Chapter 6), e.g. with constant failure & repair rates. If a repairable system cannot be restored to be as-good-as-new after repair (with respect to the state considered), i.e. if at least one element with time dependent failure rate has not been renewed at every repair, failure intensity $z(t)$ has to be used. It is thus important to distinguish between failure rate $\lambda(t)$ and failure intensity $z(t)$ or intensity ($h(t)$ or $m(t)$ for a renewal or Poisson process). $z(t)$, $h(t)$, $m(t)$ are unconditional densities (Eqs. (A7.229), (A7.24), (A7.194)) and differ basically from $\lambda(t)$, which is a conditional density. This distinction is important also for the case of a homogeneous Poisson process, for which $z(t) = h(t) = m(t) = \lambda$ holds for the intensity and $\lambda(x) = \lambda$ holds for the interarrival times ($x$ starting by 0 at each interarrival time, Eq. (A7.38)). To reduce ambiguities, force of mortality has been suggested for $\lambda(t)$ [6.3, A7.30].
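The relation between failure rate and reliability function given above can be checked numerically. The sketch below, given only as an illustration, integrates an arbitrary failure rate with the trapezoidal rule and compares the result with the closed forms for a constant failure rate and for a Weibull-type failure rate $\lambda(x) = \beta\lambda(\lambda x)^{\beta - 1}$. All function names and numerical values are assumptions made for the example.

```python
import math


def reliability_from_failure_rate(failure_rate, t, steps=10_000):
    """R(t) = exp(-integral from 0 to t of failure_rate(x) dx),
    with the integral approximated by the trapezoidal rule."""
    if t <= 0.0:
        return 1.0
    h = t / steps
    total = 0.5 * (failure_rate(0.0) + failure_rate(t))
    for i in range(1, steps):
        total += failure_rate(i * h)
    return math.exp(-total * h)


def estimate_constant_failure_rate(k, cumulative_time):
    """Point estimate k / T; meaningful only for a constant failure rate."""
    return k / cumulative_time


# Hypothetical parameters, chosen only for illustration.
LAM = 1e-4                 # constant failure rate, per hour
LAM_W, BETA = 1e-4, 2.0    # Weibull-type parameters


def constant_rate(x):
    return LAM


def weibull_rate(x):
    return BETA * LAM_W * (LAM_W * x) ** (BETA - 1.0)


t = 5000.0
r_const = reliability_from_failure_rate(constant_rate, t)
r_weibull = reliability_from_failure_rate(weibull_rate, t)
lam_hat = estimate_constant_failure_rate(k=4, cumulative_time=50_000.0)

print(r_const, math.exp(-LAM * t))               # numerical vs. closed form
print(r_weibull, math.exp(-((LAM_W * t) ** BETA)))
print(lam_hat)
```

For the constant case the numerical result reproduces $e^{-\lambda t}$; the estimate $k/T$ is shown separately because it has no meaning for the time-dependent case.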
Fault State characterized by an inability to perform the required function due to an internal reason. A fault is a state and can be a defect or a failure, having thus as possible cause an error (for defects or systematic failures) or a failure mechanism (for failures).
Fault Tree Analysis (FTA) Analysis utilizing fault trees to determine which faults of subitems, or external events, or combination thereof, may result in item faults. FTA is a top-down approach, which allows the inclusion of external causes more easily than an FMEA/FMECA. However, it does not necessarily go through all possible fault modes. Combination of FMEA/FMECA with FTA leads to a causes-to-effects chart, showing the logical relationship between identified causes and their single or multiple consequences. A graphical description of cause-to-effect relationships is the cause-to-effect diagram (fishbone or Ishikawa diagram).
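For a small fault tree, the probability of the top event can be sketched as follows. The example assumes statistically independent basic events, each appearing only once in the tree; the tree itself, the event names, and the probabilities are hypothetical.

```python
from math import prod


def and_gate(probabilities):
    """Output fault occurs only if all (independent) input faults occur."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output fault occurs if at least one (independent) input fault occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical tree: top = OR(power loss, AND(pump A fails, pump B fails)).
p_power_loss = 0.001
p_pump_a = 0.02
p_pump_b = 0.02

p_both_pumps = and_gate([p_pump_a, p_pump_b])
p_top = or_gate([p_power_loss, p_both_pumps])

print(f"both pumps failed: {p_both_pumps:.6f}")
print(f"top event:         {p_top:.7f}")
```

With repeated basic events or dependent faults this simple gate-by-gate evaluation is no longer valid, and minimal cut sets or a state-based model would be needed.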
Item Part, component, device, functional unit, subsystem or system that can be individually described and considered. An item is a functional or structural unit, generally considered as an entity for investigations. It can consist of hardware and/or software and include human resources.
Life Cycle Cost (LCC) Sum of the cost for acquisition, operation, maintenance, and disposal or recycling of the item. Life-cycle costs also have to consider the effects on the environment of the production, use, and disposal or recycling of the item considered (sustainable development). Their optimization uses cost effectiveness or systems engineering tools and can be positively influenced by concurrent engineering.
Lifetime Time span between initial operation and failure of a nonrepairable item.
Logistic Support All activities undertaken to provide effective and economical use of the item during its operating phase. An emerging aspect related to logistic support is that of obsolescence management, i.e. how to assure operation over e.g. 20 years when components needed for maintenance are no longer manufactured.
Maintainability Probability that a given maintenance action, performed under stated conditions and using stated procedures and resources, can be carried out within a stated time interval. Maintainability is a characteristic of the item and refers to preventive and corrective maintenance. A qualitative definition, focused on ability, is also possible. In specifying or evaluating maintainability, it is important to consider the logistic support available (procedures, personnel, spare parts, etc.).
Mission Profile Specific task which must be fulfilled by the item during a stated time under given conditions. The mission profile defines the required function and the environmental conditions as a function of time. A system with a variable required function is termed a phased-mission system.
MTBF Mean operating time between failures. At system level, $MTBF_S$ is used. MTBF applies for repairable items. However, for practical applications it is important to recognize that successive operating times between system failures have the same mean (expected value) only if they are independent and have a common distribution function, i.e. if the system is as-good-as-new after each repair at system level. If only the failed element is restored to as-good-as-new after repair and at least one nonrestored element has a time dependent failure rate, successive operating times between system failures are neither independent nor have a common distribution. Only the case of a series-system with constant failure rates $\lambda_1, \ldots, \lambda_n$ for all elements $E_1, \ldots, E_n$ leads to a homogeneous Poisson process, for which successive interarrival times (operating times between system failures) are independent and exponentially distributed with common distribution function $F(x) = 1 - e^{-(\lambda_1 + \ldots + \lambda_n)x} = 1 - e^{-\lambda_S x}$ and mean $MTBF_S = 1/\lambda_S$ (repaired elements are assumed as-good-as-new, yielding system as-good-as-new because of the constant failure rates $\lambda_1, \ldots, \lambda_n$). This result holds approximately also for systems with redundancy (see Eq. (6.93) and comments with MTTF). For all these reasons, and also because of the estimate $\widehat{MTBF} = T/k$, often used in practical applications, MTBF should be confined to repairable systems with constant failure rates for all elements. Shortcomings because of neglecting this basic property are known, see e.g. [6.3, 7.11, A7.30]. As in the previous editions of this book, $MTBF_S$ will be reserved for the case $MTBF_S = 1/\lambda_S$. For Markov and semi-Markov models, $MUT_S$ is used.
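The series-system case with constant failure rates can be illustrated numerically. The sketch below is a minimal example under exactly that assumption; the element failure rates, the function names, and the observed values $T$ and $k$ are hypothetical.

```python
import math


def series_failure_rate(element_rates):
    """Failure rate of a series system with constant element failure rates."""
    return sum(element_rates)


def interarrival_cdf(x, lam_s):
    """F(x) = 1 - exp(-lam_s * x) for the operating times between failures."""
    return 1.0 - math.exp(-lam_s * x)


def mtbf_estimate(cumulative_time, k):
    """Estimate T / k; meaningful only for constant failure rates."""
    return cumulative_time / k


# Hypothetical element failure rates, per hour.
element_rates = [2e-6, 5e-6, 3e-6]

lam_s = series_failure_rate(element_rates)
mtbf_s = 1.0 / lam_s

print(f"system failure rate: {lam_s:.2e} per hour")
print(f"MTBF_S:              {mtbf_s:.0f} hours")
print(f"F(MTBF_S):           {interarrival_cdf(mtbf_s, lam_s):.4f}")
print(f"estimate T/k:        {mtbf_estimate(50_000.0, 4):.0f} hours")
```

Note that the probability of a failure before $MTBF_S$ is $1 - e^{-1}$, i.e. more than one half, which is a frequent source of misinterpretation of MTBF figures.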
MTTF Mean time to failure. At system level, $MTTF_S$ is used. MTTF is the mean (expected value) of the item failure-free time $\tau$. It can be computed from the reliability function $R(t)$ as $MTTF = \int_0^\infty R(t)\,dt$, with $T_L$ as the upper limit of the integral if the lifetime is limited to $T_L$ ($R(t) = 0$ for $t > T_L$). MTTF applies for both nonrepairable and repairable items if one assumes that after repair the item is as-good-as-new (p. 40). At system level, this occurs (with respect to the state considered) only if the repaired element is as-good-as-new and all nonrepaired elements have constant failure rates. To include all situations for this case, $MTTF_{Si}$ is used in Chapter 6 ($S$ stands for system and $i$ for the state occupied (entered for a semi-Markov process) at the time at which the repair (restoration) is terminated, see e.g. Table 6.2). When dealing with failure-free and repair times, the variable $x$ starting by $x = 0$ after each repair (restoration) has to be used instead of $t$ (as for interarrival times). See p. 40 for further comments. An unbiased, empirical estimate for MTTF is $\widehat{MTTF} = (t_1 + \ldots + t_n)/n$, where $t_1, \ldots, t_n$ are observed failure-free times of $n$ statistically identical and independent items.
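The computation of MTTF from the reliability function, and the empirical estimate from observed failure-free times, can be sketched as follows. The Weibull-type reliability function $R(t) = e^{-(\lambda t)^\beta}$, its parameters, the truncation point of the integral, and the sample data are assumptions made for the example; the closed form $\Gamma(1 + 1/\beta)/\lambda$ is used only as a check.

```python
import math


def mttf_from_reliability(reliability, upper_limit, steps=100_000):
    """Trapezoidal approximation of the integral of R(t) over
    [0, upper_limit]; upper_limit must be large enough for
    R(upper_limit) to be negligible (or equal to T_L)."""
    h = upper_limit / steps
    total = 0.5 * (reliability(0.0) + reliability(upper_limit))
    for i in range(1, steps):
        total += reliability(i * h)
    return total * h


def empirical_mttf(failure_free_times):
    """Arithmetic mean of observed failure-free times."""
    return sum(failure_free_times) / len(failure_free_times)


# Hypothetical Weibull-type parameters.
LAM, BETA = 1e-3, 2.0


def weibull_reliability(t):
    return math.exp(-((LAM * t) ** BETA))


mttf_numeric = mttf_from_reliability(weibull_reliability, upper_limit=10_000.0)
mttf_closed = math.gamma(1.0 + 1.0 / BETA) / LAM

print(f"numerical MTTF:   {mttf_numeric:.3f} hours")
print(f"closed-form MTTF: {mttf_closed:.3f} hours")
print(f"empirical MTTF:   {empirical_mttf([100.0, 200.0, 300.0]):.1f} hours")
```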
MTTPM Mean time to preventive maintenance. See MTTR for comments.
MTBUR Mean time between unscheduled removals.
MTTR Mean time to repair. At system level, $MTTR_S$ is used. Repair is used in this book as a synonym for restoration. MTTR is the mean (expected value) of the item repair time. It can be computed from the distribution function $G(t)$ of the repair time as $MTTR = \int_0^\infty (1 - G(t))\,dt$. In specifying or evaluating MTTR, it is necessary to consider the logistic support available for repair (procedures, personnel, spare parts, test facilities). Repair time is often lognormally distributed. However, for reliability or availability computation of repairable equipment and systems, a constant repair rate $\mu$ (i.e. exponentially distributed repair times with $\mu = 1/MTTR$) can be used in general to get valid approximate results, as long as $MTTR \ll MTTF$ holds for each element in the reliability block diagram (Examples 6.7, 6.8, 6.9). An unbiased, empirical estimate of MTTR is $\widehat{MTTR} = (t_1 + \ldots + t_n)/n$, where $t_1, \ldots, t_n$ are observed repair times of $n$ statistically identical and independent items.
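For lognormally distributed repair times, the integral for MTTR can be evaluated numerically and used to obtain the constant repair rate of the exponential approximation. In the sketch below the lognormal distribution is parametrized by its median and by $\sigma$; this parametrization, the numerical values, and the function names are assumptions made for the example.

```python
import math


def lognormal_cdf(t, median, sigma):
    """Distribution function G(t) of a lognormally distributed repair time."""
    if t <= 0.0:
        return 0.0
    z = math.log(t / median) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def mttr_from_cdf(cdf, upper_limit, steps=200_000):
    """Trapezoidal approximation of the integral of 1 - G(t)
    over [0, upper_limit]."""
    h = upper_limit / steps
    total = 0.5 * ((1.0 - cdf(0.0)) + (1.0 - cdf(upper_limit)))
    for i in range(1, steps):
        total += 1.0 - cdf(i * h)
    return total * h


# Hypothetical repair time distribution: median 1 hour, sigma = 0.5.
MEDIAN_H = 1.0
SIGMA = 0.5


def repair_cdf(t):
    return lognormal_cdf(t, MEDIAN_H, SIGMA)


mttr_numeric = mttr_from_cdf(repair_cdf, upper_limit=50.0)
mttr_closed = MEDIAN_H * math.exp(SIGMA ** 2 / 2.0)  # lognormal mean
mu = 1.0 / mttr_numeric  # repair rate of the exponential approximation

print(f"numerical MTTR:   {mttr_numeric:.4f} hours")
print(f"closed-form MTTR: {mttr_closed:.4f} hours")
print(f"repair rate mu:   {mu:.4f} per hour")
```

The exponential approximation keeps the mean of the repair time but not its shape, which is why it is acceptable only when the repair times are short compared with the failure-free times.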
Nonconformity Nonfulfillment of a specified requirement. From a technical point of view, nonconformity is close to defect, however not necessarily from a legal point of view. In relation to product liability, nonconformity should be preferred.
Preventive Maintenance Maintenance carried out to reduce the probability of failure or degradation. The aim of preventive maintenance must also be to detect and remove hidden failures, i.e. nonrecognized failures in redundant elements. To simplify computation it is generally assumed that the element in the reliability block diagram for which a preventive maintenance has been performed is as-good-as-new after each preventive maintenance. This assumption applies to the whole item (equipment or system) if all components of the item (which have not been renewed) have constant failure rates. Preventive maintenance is generally performed at scheduled time intervals.
Product Assurance All planned and systematic activities necessary to reach specified targets for the reliability, maintainability, availability, and safety of the item, as well as to provide adequate confidence that the item will meet all given requirements. The concept of product assurance is used in particular in aerospace programs. It includes quality assurance as well as reliability, maintainability, availability, safety, and logistic support engineering.
Product Liability Generic term used to describe the onus on a producer or others to make restitution for loss related to personal injury, property damage, or other harm caused by the product. The manufacturer (producer) has to specify a safe operational mode for the product (item). If strict liability applies, the manufacturer has to demonstrate (at a claim) that the product was free from defects when it left the production plant. This holds in the USA and partially also in Europe [1.8]. However, in Europe the causality between damage and defect still has to be demonstrated by the user and the limitation period is short (often 3 years after the identification of the damage, defect, and manufacturer, or 10 years after the appearance of the product on the market). One can expect that liability will more than before consider faults (defects & failures) and cover software as well. Product liability forces producers to place greater emphasis on quality assurance/management.
Quality Degree to which a set of inherent characteristics fulfills requirements. This definition, given also in the ISO 9000:2000 Standard [A1.6, A2.9], follows closely the traditional definition of quality (fitness for use) and applies to products and services as well.
Quality Assurance All the planned and systematic activities needed to provide adequate confidence that quality requirements will be fulfilled. Quality assurance is a part of quality management, as per ISO 9000:2000. It refers to hardware and software as well, and includes configuration management, quality tests, quality control during production, quality data reporting systems, and software quality (Fig. 1.3). For complex equipment and systems, quality assurance activities are coordinated by a quality assurance program (Appendix A3). An important target for quality assurance is to achieve the quality requirements with a minimum of cost and time. Concurrent engineering also strives to shorten the time to develop and market the product.
Quality Control During Production Control of the production processes and procedures to reach a stated quality of manufacturing.
Quality Data Reporting System System to collect, analyze, and correct all defects and failures occurring during production and testing of the item, as well as to evaluate and feed back the corresponding quality and reliability data.
A quality data reporting system is generally computer aided. Analysis of defects and failures must be traced to the cause in order to determine the best corrective action necessary to avoid repetition of the same problem. The quality data reporting system should also remain active during the operating phase. A quality data reporting system is important to monitor reliability growth.
Quality Management Coordinated activities to direct and control an organization with regard to quality. Organization is defined as group of people and facilities (e.g. a company) with an arrangement of responsibilities, authorities, and relationships [A1.6].
Quality Test Test to verify whether the item conforms to specified requirements. Quality tests include incoming inspections, qualification tests, production tests, and acceptance tests. They also cover reliability, maintainability, and safety aspects. To be cost effective, quality tests must be coordinated and integrated in a test (and screening) strategy. The terms test and inspection are often used for quality test.
Redundancy Existence of more than one means for performing the required function. For hardware, distinction is made between active (hot, parallel), warm (lightly loaded), and standby (cold) redundancy. Redundancy does not necessarily imply a duplication of hardware; it can for instance be implemented at the software level or as a time redundancy. To avoid common mode failures, redundant elements should be realized independently from each other. Should the redundant elements fulfill only a part of the required function, a pseudo redundancy is present.
Reliability ($R$, $R(t)$) Probability that the required function will be provided under given conditions for a given time interval. According to the above definition, reliability is a characteristic of the item, generally designated by $R$ for the case of a fixed mission and $R(t)$ for a mission with $t$ as a parameter. At system level $R_{Si}(t)$ is used, where $S$ stands for system and $i$ for the state entered at $t = 0$ (Table 6.2). A qualitative definition, focused on ability, is also possible. Reliability gives the probability that no operational interruption at item (system) level will occur during a stated mission, say of duration $T$. This does not mean that redundant parts may not fail; such parts can fail and be repaired. Thus, the concept of reliability applies for nonrepairable as well as for repairable items. Should $T$ be considered as a variable $t$, the reliability function is given by $R(t)$. If $\tau$ is the failure-free time, distributed according to $F(t)$, with $F(0) = 0$, then $R(t) = \Pr\{\tau > t\} = 1 - F(t)$. The concept of reliability can also be used for processes or services, although modeling human aspects can lead to some difficulties.
Reliability Block Diagram Block diagram showing how failures of subitems, represented by the blocks, can result in a failure of the item. The reliability block diagram (RBD) is an event diagram. It answers the question: Which elements of the item are necessary to fulfill the required function and which ones can fail without affecting it? The elements (blocks in the RBD) which must operate are connected in series (the ordering of these elements is not relevant for reliability computation) and the elements which can fail (redundant elements) are connected in parallel. Elements which are not relevant (used) for the required function are removed from the RBD and put into a reference list, after having verified (FMEA) that their failure does not affect elements involved in the required function. In a reliability block diagram, redundant elements still appear in parallel, irrespective of the failure mode. However, only one failure mode (e.g. short, open) and two states (good, failed) can be considered for each element.
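The series and parallel connections of a reliability block diagram translate directly into a computation, provided the elements are statistically independent, nonrepairable during the mission, and in active redundancy. The following sketch makes these assumptions; the structure, the element names, and the reliability values are hypothetical.

```python
from math import prod


def series(reliabilities):
    """All elements must operate: product of the element reliabilities."""
    return prod(reliabilities)


def parallel(reliabilities):
    """Active redundancy: the group fails only if every element fails."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical structure: E1 in series with a 1-out-of-2 redundancy (E2, E3).
r_e1 = 0.99
r_e2 = 0.90
r_e3 = 0.90

r_redundant_group = parallel([r_e2, r_e3])
r_system = series([r_e1, r_redundant_group])

print(f"redundant group: {r_redundant_group:.4f}")
print(f"system:          {r_system:.4f}")
```

Because a product does not depend on the order of its factors, the ordering of the series elements has no influence on the result, as stated above.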
Reliability Growth Progressive improvement of a reliability measure with time. Flaws (errors, mistakes) detected during a reliability growth program are in general deterministic (defects or systematic failures) and present in every item of a given lot. Reliability growth is thus often performed during the pilot production, seldom for series-produced items. Similarly to environmental stress screening (ESS), stresses during reliability growth often exceed those expected in field operation, but are not so high as to stimulate new failure mechanisms. Models for reliability growth can also often be used to investigate the occurrence of defects in software. Even if software defects often appear in time (dynamic defects), the term software reliability should be avoided (software quality should be preferred).
Required Function Function or combination of functions of an item which is considered necessary to provide a given service. The definition of the required function is the starting point for every reliability analysis, as it defines failures. However, difficulties can appear with complex items (systems). For practical purposes, parameters should be specified with tolerances.
Safety Ability of the item to cause neither injury to persons, nor significant material damage or other unacceptable consequences. Safety expresses freedom from unacceptable risk of harm. In practical applications, it is useful to subdivide safety into accident prevention (the item is safe to work with while it is operating correctly) and technical safety (the item has to remain safe even if a failure occurs). Technical safety can be defined as the probability that the item will not cause injury to persons, significant material damage or other unacceptable consequences above a given (fixed) level for a stated time interval, when operating under given conditions. Methods and procedures used to investigate technical safety are similar to those used for reliability analyses, however with emphasis on fault/failure effects.
System Set of interrelated items considered as a whole for a defined purpose. A system generally includes hardware, software, services, and personnel (for operation and support) to the degree that it can be considered self-sufficient in its intended operational environment. For computations, ideal conditions for human factors and logistic support are often assumed, leading to a technical system (for simplicity, the term system is often used instead of technical system). Elements of a system are e.g. components, assemblies, equipment, and subsystems, for hardware. For maintenance purposes, systems are partitioned into independent line replaceable units (LRUs), i.e. spare parts at equipment or system level. The term item is used for a functional or structural unit of arbitrary complexity that is in general considered as an entity for investigations.
Systematic Failure Failure related in a deterministic way to a certain cause inherent in the design, manufacturing, operation or maintenance processes. Systematic failures are also known as dynamic defects, for instance in software quality, and have a deterministic character. However, because of the item complexity they can appear as if they were randomly distributed in time.
Systems Engineering Application of the mathematical and physical sciences to develop systems that utilize resources economically for the benefit of society. TQM and concurrent engineering can help to optimize systems engineering.
Total Quality Management (TQM) Management approach of an organization centered on quality, based on the participation of all its members, and aiming at long-term success through customer satisfaction, and benefits to all members of the organization and to society. Within TQM, everyone involved in the product (directly during development, production, installation, and servicing, or indirectly with management or staff activity) is jointly responsible for the quality of that product.
Useful Life Time interval starting when the item is first put into operation and ending when a limiting state is reached. The limiting state can be an unacceptable failure intensity or other. Typical values for useful life are 3 to 6 years for commercial applications, 5 to 15 years for military installations, and 10 to 30 years for distribution or power systems (see also Lifetime).
Value Analysis Optimization of the configuration of the item as well as of the production processes and procedures to provide the required item characteristics at the lowest possible cost without loss of capability, reliability, maintainability, or safety.
Value Engineering Application of value analysis methods during the design phase to optimize the life-cycle cost of the item.
A2 Quality and Reliability Standards
Besides quantitative reliability requirements, such as $MTBF = 1/\lambda$, MTTR, and availability, customers often require a quality assurance/management system and for complex items also the realization of a quality and reliability assurance program. Such general requirements are covered by national and international standards, the most important of which are briefly discussed in this appendix. The term management is used explicitly where the organization (company) is involved as a whole, as per ISO 9000:2000 and TQM. A basic procedure for setting up and realizing quality and reliability requirements for complex equipment and systems, with the corresponding quality and reliability assurance program, is discussed in Appendix A3.
A2.1 Introduction
Customer requirements for quality and reliability can be quantitative or qualitative. As with performance parameters, quantitative reliability requirements are given in system specifications or contracts. They fix targets for reliability, maintainability, availability, and safety (as necessary) along with associated specifications for required function, operating conditions, logistic support, and criteria for acceptance tests. Qualitative requirements are in national or international standards and generally deal with a quality management system. Depending upon the field of application (aerospace, defense, nuclear, or industrial), these requirements may be more or less stringent. Objectives of such standards are in particular:
1. Harmonization of quality management systems and of terms & definitions.
2. Enhancement of customer satisfaction.
3. Standardization of configuration, operating conditions, logistic support, test procedures, and selection/qualification criteria for components, materials, and production processes.
Important standards for quality management systems are given in Table A2.1, see [A2.1 - A2.13] for a comprehensive list. Some of the standards in Table A2.1 are briefly discussed in the following sections.
A2.2 General Requirements in the Industrial Field
In the industrial field, the ISO 9000:2000 family of standards [A2.9] supersedes the ISO 9000:1994 family and opens a new era in quality management requirements. The previous 9001 - 9004 are substituted by 9001:2000 and 9004:2000. The ISO 8402, on definitions, is substituted by the ISO 9000:2000. Many definitions have been revised and the structure and content of 9001:2000 and 9004:2000 are new, and adhere better to the industrial needs and to the concept depicted in Fig. 1.3. Eight basic quality management principles have been identified and considered in the ISO 9000:2000 family: Customer Focus, Leadership, Involvement of People, Process Approach, System Approach to Management, Continuous Improvement, Factual Approach to Decision Making, and Mutually Beneficial Supplier Relationships. ISO 9000:2000 describes fundamentals of quality management systems and specifies the terminology involved. ISO 9001:2000 specifies requirements for a quality management system that an organization (company) needs to demonstrate its ability to provide products that satisfy customer and applicable regulatory requirements. It focuses on four main chapters: Management Responsibility, Resource Management, Product and/or Service Realization, and Measurement. A quality management system must ensure that everyone involved with a product (whether in its development, production, installation, or servicing, as well as in a management or staff function) shares responsibility for the quality of that product, in accordance with TQM. At the same time, the system must be cost effective and contribute to a reduction of the time to market. Thus, bureaucracy must be avoided and such a system must cover all aspects related to quality, reliability, maintainability, availability, and safety, including management, organization, planning, and engineering activities. Customers today expect that only items with agreed requirements will be delivered.
ISO 9004:2000 provides guidelines that consider efficiency and effectiveness of the quality management system. The ISO 9000:2000 family deals with a broad class of products and services (technical and non-technical); its content is thus lacking in details, compared with application specific standards used e.g. in railway, aerospace, defense, and nuclear industries (Appendix A2.3). It has been accepted as national standards in many countries, and international recognition of certification has been partly achieved. Dependability aspects, focusing on reliability, maintainability, and logistic support of systems are considered in IEC Standards, in particular IEC 60300 for global requirements and IEC 60605, 60706, 60812, 60863, 61025, 61078, 61124, 61163, 61164, 61165, 61508, and 61709 for specific procedures, see [A2.6] for a comprehensive list. IEC 60300 deals with dependability programs (management, task descriptions, application guides). Reliability tests for constant failure rate $\lambda$ (or of MTBF for the case $MTBF = 1/\lambda$) are considered in IEC 61124. Maintainability aspects are in IEC 60706 and safety aspects in IEC 61508.
Table A2.1 Standards for quality and reliability assurance/management of equipment and systems

Industrial
ISO 9000:2000 (Int., 2000): Quality management systems - Fundamentals and vocabulary
ISO 9001:2000 (Int.): Quality management systems - Requirements
ISO 9004:2000 (Int.): Quality management systems - Guidelines for performance improvement
IEC 60300 (Int.): Dependability management (-1: Program management, -2: Program element tasks, -3: Application guides)
IEC 60605 (Int., 1986-06): Equipment reliability testing (-2: Test cycles, -3: Test conditions, -4: Point and interval estimates, -6: Test for constant failure rate)
IEC 60706 (Int., 1994-06): Guide on maintainability of equipment (-1: Maint. program, -2: Analysis, -3: Data evaluation, -4: Support planning, -5: Diagnostic, -6: Statistical methods)
IEC 61124 (Int., 2006): Reliability testing - Compliance tests for constant failure rate and constant failure intensity (supersedes IEC 60605-7)
IEC (Int.): 60068, 60319, 60410, 60447, 60721, 60749, 60812, 60863, 61000, 61014, 61025, 61070, 61078, 61123, 61160, 61163, 61164, 61165, 61508, 61649, 61650, 61703, 61709, 61710, 61882, 62198
IEEE Std 1332 (Int., 1998): IEEE Standard Reliability Program for the Development and Production of Electronic Systems and Equipment (see also 1413)
EN 50126: Railway Applications - RAMS Specification & Demonstration
Product Liability

Software Quality
IEEE/ANSI (Int., 1987-98): IEEE Software Eng. Standards Vol. 1 - 4, 1999 (in particular 610, 730, 1028, 1045, 1062, 1465 (ISO/IEC 12119))
IEC, ISO/IEC: IEC 61713 (2000) and ISO/IEC 12119 (1998), 12207 (1995)

Defense
MIL-Q-9858 (USA, 1963): Quality Program Requirements (ed. A)
MIL-STD-785 (USA, 1980): Rel. Program for Systems and Eq. Devel. and Prod. (ed. B)
MIL-STD-781 (USA, 1986): Rel. Testing for Eng. Devel., Qualif. and Prod. (ed. D)
MIL-STD-470 (USA, 1983): Maintainability Program for Systems and Equip. (ed. A)
AQAP-1 (NATO, 1984): NATO Req. for an Industrial Quality Control System (ed. 3)

Aerospace
NHB-5300.4 (NASA) (USA, 1974): Safety, Reliability, Maintainability, and Quality Provisions for the Space Shuttle Program (1D-1)
ECSS (ESA) (Europe, 1996): European Cooperation for Space Standardization; ECSS-E: Engineering (-00, -10); ECSS-M: Project Management (-00, -10, -20, -30, -40, -50, -60, -70); ECSS-Q: Product Assurance (-00, -20, -30, -40, -60, -70, -80)
prEN 9100-2003 (Europe, 2003): Quality Management System
For electronic equipment & systems, IEEE Std 1332-1998 [A2.7] has been issued as a guide to a reliability program for the development and production phases. This document gives in a short form the basic requirements, putting an accent on an active cooperation between supplier (manufacturer) and customer, and focusing on three main aspects: Determination of the Customer's Requirements, Determination of a Process that satisfies the Customer's Requirements, and Assurance that the Customer's Requirements are met. Examples of comprehensive requirements for industry application are given in [A2.2, A2.3]. Software aspects are considered in IEEE Software Engineering Standards [A2.8]. Requirements for product liability are given in national and international directives, see for instance [1.8].
A2.3 Requirements in the Aerospace, Railway, Defense, and Nuclear Fields
Requirements in space and railway fields generally combine the aspects of quality, reliability, maintainability, safety, and software quality in a Product Assurance or RAMS document, well conceived in its structure & content [A2.3 - A2.5, A2.12]. In the railway field, EN 50126 [A2.3] requires a RAMS program with particular emphasis on safety aspects. The situation is similar in the avionics field, where EN 9100-2003 [A2.4] has been issued, reinforcing the requirements of the ISO 9000 family. It can be expected that space and avionics will unify standards in an Aerospace Series. MIL-Standards have played an important role in the last 30 years, in particular MIL-Q-9858 and MIL-STD-470, -471, -781, -785 & -882 [A2.10]. MIL-Q-9858 (first ed. 1959) was the basis for many quality assurance standards. However, as it does not cover specific aspects of reliability, maintainability, and safety, MIL-STD-785, -470, and -882 were issued. MIL-STD-785 requires the realization of a reliability program; tasks are carefully described and the program has to be tailored to satisfy user needs. $MTBF = 1/\lambda$ acceptance procedures are in MIL-STD-781. MIL-STD-470 requires the realization of a maintainability program, with emphasis on design rules, design reviews, and FMEA/FMECA. Maintainability demonstration is covered by MIL-STD-471. MIL-STD-882 requires the realization of a safety program, in particular the analysis of all potential hazards. For NATO countries, AQAP Requirements were issued starting 1968. MIL Standards have lost importance. However, they can still be useful in developing procedures for industrial applications. The nuclear field has its own specific, well established standards with emphasis on safety aspects, design reviews, configuration accounting, qualification of components/materials/production processes, quality control during production, and tests.
A3 Definition and Realization of Quality and Reliability Requirements
In defining quality and reliability requirements, it is important that market needs, life cycle cost aspects, time to market, as well as development and production risks (for instance when using new technologies) are considered with care. For complex equipment and systems with high quality and reliability requirements, the realization of such requirements is best achieved with a quality and reliability assurance program, integrated in the project activities and performed without bureaucracy. Such a program (plan if time schedule is considered) defines the project specific activities for quality and reliability assurance and assigns responsibilities for their realization in agreement with TQM. This appendix discusses first important aspects in defining quality and reliability requirements and then the content of a quality and reliability assurance program for complex equipment and systems with high quality and reliability requirements for the case in which tailoring is not mandatory. For less stringent requirements, tailoring is necessary to meet real needs and to be cost and time effective. Software specific quality assurance aspects are considered in Section 5.3. Examples of checklists for design reviews are in Appendix A4, requirements for a quality data reporting system in Appendix A5.
A3.1 Definition of Quality and Reliability Requirements
In defining quantitative, project specific, quality and reliability requirements, attention has to be paid to the actual possibility to realize them as well as to demonstrate them at a final or acceptance test. These requirements are derived from customer or market needs, taking care of limitations given by technical, cost, and ecological aspects. This section deals with some important considerations in setting MTBF, MTTR, and steady-state availability (PA = AA) requirements. MTBF is used for MTBF = $1/\lambda$, where $\lambda$ is the constant (time independent) failure rate of the item considered. Tentative targets for MTBF, MTTR, and PA are set by considering: operational requirements relating to reliability, maintainability, and availability; allowed logistic support; required function and expected environmental conditions; experience with similar equipment or systems; possibility for redundancy at higher integration level; requirements for life-cycle cost, dimensions, weight, power consumption, etc.; ecological consequences (sustainability).
Typical figures for failure rates $\lambda$ of electronic assemblies are between $100$ and $1{,}000 \cdot 10^{-9}\,\mathrm{h}^{-1}$ at an ambient temperature $\theta_A$ of 40°C and with a duty cycle $d$ of 0.3; see Table A3.1 for some examples. The duty cycle $d$ ($0 < d \le 1$) gives the mean of the ratio between operational time and calendar time for the item considered. Assuming a constant failure rate $\lambda$ and no reliability degradation caused by power on/off, an equivalent failure rate

$$\lambda_d = d \cdot \lambda$$

can be used for practical purposes. Often it can be useful to operate with the mean expected number of failures per year and 100 items

$$m_{\%/y} \approx \lambda_d \cdot 10^{6}\,\mathrm{h}$$

(one year has about 8,760 h, so that 100 items accumulate roughly $10^{6}$ h per year). $m_{\%/y} < 1$ is a good target for equipment and can influence acquisition cost. Tentative targets are refined successively by performing rough analysis and comparative studies (definition of goals down to assembly level can be necessary at this time, see Eq. (2.71)). For acceptance testing (demonstration) of an MTBF for the case MTBF = $1/\lambda$, the following data are important (Sections 7.2.3.2 and 7.2.3.3):
1. $\mathrm{MTBF}_0$ = specified MTBF and/or $\mathrm{MTBF}_1$ = minimum acceptable MTBF.
2. Required function (mission profile).
3. Environmental conditions (thermal, mechanical, climatic).
4. Allowed producer's and/or consumer's risks ($\alpha$ and/or $\beta$).
5. Cumulative operating time $T$ and number $c$ of allowed failures during $T$ (acceptance conditions).
6. Number of systems under test ($T/\mathrm{MTBF}_0$ as a rule of thumb).
7. Parameters which should be tested and frequency of measurement.
8. Failures which should be ignored for the MTBF acceptance test.
9. Maintenance and screening before the acceptance test.
10. Maintenance procedures during the acceptance test.
11. Form and content of test protocols and reports.
12. Actions in the case of a negative test result.

Table A3.1 Indicative values of failure rates $\lambda$ and mean expected number $m_{\%/y}$ of failures per year and 100 items for a duty cycle d = 30% and d = 100% ($\theta_A$ = 40°C); failure rates in $10^{-9}\,\mathrm{h}^{-1}$

                                           d = 30%                d = 100%
                                           lambda_d    m_%/y      lambda      m_%/y
Telephone exchanger                        2,000       2          6,000       6
Telephone receiver (multifunction)         200         0.2        600         0.6
Photocopier incl. mechanical parts         30,000      30         100,000     100
Personal computer                          3,000       3
Radar equipment (ground mobile)            300,000     300        900,000     900
Control card for autom. process control    300         0.3        900         0.9
Mainframe computer system                  -           -          20,000      20

For acceptance testing (demonstration) of an MTTR, the following data are important (Section 7.3.2):
1. Quantitative requirements (MTTR, variance, quantile).
2. Test conditions (environment, personnel, tools, external support, spare parts).
3. Number and extent of repairs to be undertaken (simulated/introduced failures).
4. Allocation of the repair time (diagnostic, repair, functional test, logistic time).
5. Acceptance conditions (number of repairs and observed empirical MTTR).
6. Form and content of test protocols and reports.
7. Actions in the case of a negative test result.
Availability usually follows from the relationship PA = MTBF/(MTBF + MTTR). However, specific test procedures for PA = AA are given in Section 7.2.2.
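As a numerical illustration of the quantities used in this section, the following Python sketch evaluates the equivalent failure rate for a given duty cycle (taken here as d times the failure rate), the mean expected number of failures per year and 100 items, the steady-state availability, and the probability of passing an MTBF acceptance test with cumulative operating time T and c allowed failures. The function names and all numerical inputs are illustrative assumptions, not values from the text. The acceptance probability assumes a constant failure rate, so that the number of failures in T is Poisson distributed with mean T/MTBF.

```python
import math

HOURS_PER_YEAR = 8760.0  # calendar hours in a non-leap year


def equivalent_failure_rate(lam, duty_cycle):
    """Equivalent failure rate d * lambda (constant lambda, no on/off degradation)."""
    if not 0.0 < duty_cycle <= 1.0:
        raise ValueError("duty cycle must satisfy 0 < d <= 1")
    return duty_cycle * lam


def failures_per_year_per_100_items(lam_d):
    """Mean expected number of failures per year and 100 items (lam_d in 1/h)."""
    return lam_d * HOURS_PER_YEAR * 100.0


def steady_state_availability(mtbf, mttr):
    """PA = MTBF / (MTBF + MTTR), both in the same time unit."""
    return mtbf / (mtbf + mttr)


def acceptance_probability(mtbf_true, cum_time, max_failures):
    """Probability of at most max_failures failures in cum_time (Poisson model)."""
    mean = cum_time / mtbf_true
    return sum(
        math.exp(-mean) * mean**i / math.factorial(i)
        for i in range(max_failures + 1)
    )


if __name__ == "__main__":
    # Hypothetical item: lambda = 1000e-9 per hour, duty cycle 0.3
    lam_d = equivalent_failure_rate(1000e-9, 0.3)
    print("lambda_d [1/h]:", lam_d)
    print("failures per year and 100 items:", failures_per_year_per_100_items(lam_d))
    print("PA:", steady_state_availability(mtbf=1000.0, mttr=2.0))

    # Hypothetical test plan: T = 3000 h, c = 2, MTBF0 = 2000 h, MTBF1 = 1000 h
    producer_risk = 1.0 - acceptance_probability(2000.0, 3000.0, 2)
    consumer_risk = acceptance_probability(1000.0, 3000.0, 2)
    print("producer's risk:", producer_risk, "consumer's risk:", consumer_risk)
```

With such a sketch, the effect of changing T or c on the producer's and consumer's risks can be explored before the acceptance conditions are fixed in a contract; the procedures of Chapter 7 remain the reference for an actual test plan.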
A3.2 Realization of Quality and Reliability Requirements for Complex Equipment and Systems
For complex items, in particular at equipment and system level, quality and reliability targets are best achieved with a quality and reliability assurance program, integrated in the project activities and performed without bureaucracy. In such a program, project specific tasks and activities are clearly described and assigned. Table A3.2 can be used as a checklist in defining the content of a quality and reliability assurance program for complex equipment and systems with high quality and reliability requirements, when tailoring is not mandatory (see also [A2.8 (730-2002)] and Section 5.3 for software specific quality assurance aspects). Table A3.2 is a refinement of Table 1.2 and shows a possible task assignment in a company as per Fig. 1.7. Depending on the item technology and complexity, or because of tailoring, Table A3.2 is to be shortened or extended. The given responsibilities for tasks (R, C, I) can be modified to reflect the company's personnel situation. For a comprehensive description of reliability assurance tasks see e.g. [A2.6 (60300), A2.10 (785), A3.1].
Table A3.2 Example of tasks and tasks assignment for quality and reliability assurance of complex equipment and systems with high quality and reliability requirements, when tailoring is not mandatory (see also Section 5.3 for software specific quality assurance aspects)

Example of tasks and tasks assignment for quality and reliability assurance, in agreement with Fig. 1.7 and TQM (checklist for the preparation of a quality and reliability assurance program). R stands for responsibility, C for cooperation (must cooperate), I for information (can cooperate). [Assignment columns M, R&D, P, Q&R: the individual R, C, I entries are not reproduced.]

1. Customer and market requirements
   1 Evaluation of delivered equipment and systems
   2 Determination of market and customer demands and real needs
   3 Customer support
2. Preliminary analyses
   1 Definition of tentative quantitative targets for reliability, maintainability, availability, safety, and quality level
   2 Rough analyses and identification of potential problems
   3 Comparative investigations
3. Quality and reliability aspects in specifications, quotations, contracts, etc.
   1 Definition of the required function
   2 Determination of external environmental stresses
   3 Definition of realistic quantitative targets for reliability, maintainability, availability, safety, and quality level
   4 Specification of test and acceptance criteria
   5 Identification of the possibility to obtain field data
   6 Cost estimate for quality and reliability assurance activities
4. Quality and reliability assurance program
   1 Preparation
   2 Realization
     - design and evaluation
     - production
5. Reliability and maintainability analyses
   1 Specification of the required function for each element
   2 Determination of environmental, functional, and time-dependent stresses (detailed operating conditions)
   3 Assessment of derating factors
   4 Reliability and maintainability allocation
   5 Preparation of reliability block diagrams
     - assembly level
     - system level
   6 Identification and analysis of reliability weaknesses (FMEA/FMECA, FTA, worst-case, drift, stress-strength analyses)
     - assembly level
     - system level
   7 Carrying out comparative studies
     - assembly level
     - system level
   8 Reliability improvement through redundancy
     - assembly level
     - system level
   9 Identification of components with limited lifetime
   10 Elaboration of the maintenance concept
   11 Elaboration of a test and screening strategy
   12 Analysis of maintainability
   13 Elaboration of mathematical models
   14 Calculation of the predicted reliability and maintainability
     - assembly level
     - system level
   15 Reliability and availability calculation at system level
6. Safety and human factor analyses
   1 Analysis of safety (avoidance of liability problems)
     - accident prevention
     - technical safety
       • identification and analysis of critical failures and of risk situations (FMEA/FMECA, FTA, etc.), at assembly level and at system level
       • theoretical investigations
   2 Analysis of human factors (man-machine interface)
7. Selection and qualification of components and materials
   1 Updating of the list of preferred components and materials
   2 Selection of non-preferred components and materials
   3 Qualification of non-preferred components and materials
     - planning
     - realization
     - analysis of test results
   4 Screening of components and materials
8. Supplier selection and qualification
   1 Supplier selection
     - purchased components and materials
     - external production
   2 Supplier qualification (quality and reliability)
     - purchased components and materials
     - external production
   3 Incoming inspections
     - planning
     - realization
     - analysis of test results
     - decision on corrective actions
       • purchased components and materials
       • external production
9. Project-dependent procedures and work instructions
   1 Reliability guidelines
   2 Maintainability guidelines
   3 Safety guidelines
   4 Other procedures, rules, and work instructions
     - for development
     - for production
   5 Compliance monitoring
10. Configuration management
   1 Planning and monitoring
   2 Realization
     - configuration identification
       • during design
       • during production
       • during use (warranty period)
     - configuration auditing (design reviews, Tables A3.3, 5.3, 5.5)
     - configuration control (evaluation, coordination, and release or rejection of changes and modifications)
       • during design
       • during production
       • during use (warranty period)
     - configuration accounting
11. Prototype qualification tests
   1 Planning
   2 Realization
   3 Analysis of test results
   4 Special tests for reliability, maintainability, and safety
12. Quality control during production
   1 Selection and qualification of processes and procedures
   2 Production planning
   3 Monitoring of production processes
13. In-process tests
   1 Planning
   2 Realization
14. Final and acceptance tests
   1 Environmental tests and/or screening of series-produced items
     - planning
     - realization
     - analysis of test results
   2 Final and acceptance tests
     - planning
     - realization
     - analysis of test results
   3 Procurement, maintenance, and calibration of test equipment
15. Quality data reporting system
   1 Data collection
   2 Decision on corrective actions
     - during prototype qualification
     - during in-process tests
     - during final and acceptance tests
     - during use (warranty period)
   3 Realization of corrective actions on hardware or software (repair, rework, waiver, scrap)
   4 Implementation of the changes in the documentation (technical, production, customer)
   5 Data compression, processing, storage, and feedback
   6 Monitoring of the quality data reporting system
16. Logistic support
   1 Supply of special tools and test equipment for maintenance
   2 Preparation of customer documentation
   3 Training of operating and maintenance personnel
   4 Determination of the required number of spare parts, maintenance personnel, etc.
   5 After-sales (after market) support
17. Coordination and monitoring
   3 Planning and realization of quality audits
     - project-specific
     - project-independent
   4 Information feedback
18. Quality cost
   1 Collection of quality cost
   2 Cost analysis and initiation of appropriate actions
   3 Preparation of periodic and special reports
   4 Evaluation of the efficiency of quality and reliability assurance
19. Concepts, methods, and general procedures (quality and reliability)
   1 Development of concepts
   2 Investigation of methods
   3 Preparation and updating of the quality handbook
   4 Development of software packages
   5 Collection, evaluation, and distribution of data, experience, and know-how
20. Motivation and training
   1 Planning
   2 Preparation of courses and documentation
   3 Realization of the motivation and training program
A3.3 Elements of a Quality and Reliability Assurance Program
The basic elements of a quality and reliability assurance program, as defined in Appendix A3.2, can be summarized as follows:
1. Project organization, planning, and scheduling
2. Quality and reliability requirements
3. Reliability and safety analysis
4. Selection and qualification of components, materials, and processes
5. Configuration management
6. Quality tests
7. Quality data reporting system
These elements are discussed in this section for the case of complex equipment and systems with high quality and reliability requirements, when tailoring is not mandatory. In addition, Appendix A4 gives a catalog of questions to generate checklists for design reviews and Appendix A5 specifies the requirements for a quality data reporting system. For software specific quality assurance aspects one can refer to Section 5.3. As suggested in task 4 of Table A3.2, the realization of a quality and reliability assurance program should be the responsibility of the project manager. It is often useful to start with a quality and reliability program for the development phase, covering items 1 to 5 of the above list, and continue with the production phase for items 5 to 7.
A3.3.1 Project Organization, Planning, and Scheduling
A clearly defined project organization and planning is necessary for the realization of a quality and reliability assurance program. Organization and planning must also satisfy modern needs for cost management and concurrent engineering. The system specification is the basic document for all considerations at project level. The following is a typical outline for system specifications:
1. State of the art, need for a new product
2. Target to be achieved
3. Cost, time schedule
4. Market potential (turnover, price, competition)
5. Technical performance
6. Environmental conditions
7. Operational capabilities (reliability, maintainability, availability, logistic support)
8. Quality and reliability
9. Special aspects (new technologies, patents, value engineering, etc.)
10. Appendices
The organization of a project begins with the definition of the main task groups. The following groups are usual for a complex system: Project Management, System Engineering, Life-Cycle Cost, Quality and Reliability Assurance, Assembly Design, Prototype Qualification Tests, Production, Assembly and Final Testing. Project organization, task lists, task assignment, and milestones can be derived from the task groups, allowing the quantification of the personnel, material, and financial resources needed for the project. The quality and reliability assurance program must require that the project is clearly and suitably organized and planned.
A3.3.2 Quality and Reliability Requirements
The most important steps in defining quality and reliability targets for complex equipment and systems have been discussed in Appendix A3.1.
A3.3.3 Reliability and Safety Analysis
Reliability and safety analyses include failure rate analysis, failure mode analysis (FMEA/FMECA, FTA), sneak circuit analysis (to identify latent paths which can cause unwanted functions or inhibit desired functions, while all components are functioning properly), evaluation of concrete possibilities to improve reliability and safety (derating, screening, redundancy), as well as comparative studies; see Chapters 2 - 6 for methods and tools. The quality and reliability assurance program must show what is actually being done for the project considered. For instance, it should be able to supply answers to the following questions:
1. Which derating rules are considered?
2. How are the actual component-level operating conditions determined?
3. Which failure rate data are used? Which are the associated factors ($\pi_E$ & $\pi_Q$)?
4. Which tool is used for failure mode analysis? To which items does it apply?
5. Which kind of comparative studies will be performed?
6. Which design guidelines for reliability, maintainability, safety, and software quality are used? How will their adherence be verified?
Additionally, interfaces to the selection and qualification of components and materials, design reviews, test and screening strategies, reliability tests, quality data reporting system, and subcontractor activities must be shown. The data used for component failure rate calculation should be critically evaluated (source, present relevance, assumed environmental and quality factors $\pi_E$ & $\pi_Q$).
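The question about failure rate data and the associated factors can be made concrete with a small bookkeeping sketch. The Python fragment below keeps, for each component, a reference failure rate, the assumed environmental and quality factors, and the data source, and sums the predicted failure rates for a series structure with independent elements and constant failure rates. The purely multiplicative model is a deliberate simplification (handbook models contain further factors, e.g. for temperature and electrical stress), and the class name, function names, and numerical values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    lambda_ref: float  # reference failure rate in 1e-9 per hour (hypothetical)
    pi_e: float        # assumed environmental factor
    pi_q: float        # assumed quality factor
    source: str        # origin of the failure rate data ("" if undocumented)


def predicted_failure_rate(component):
    """Simplified model: reference failure rate scaled by pi_E and pi_Q."""
    return component.lambda_ref * component.pi_e * component.pi_q


def assembly_failure_rate(components):
    """Series structure with independent elements: failure rates add."""
    return sum(predicted_failure_rate(c) for c in components)


def missing_source(components):
    """Names of components whose failure rate data source is not documented."""
    return [c.name for c in components if not c.source.strip()]


if __name__ == "__main__":
    parts = [
        Component("R1", 1.0, 2.0, 1.0, "handbook"),
        Component("U1", 20.0, 2.0, 1.5, "field data"),
        Component("C7", 3.0, 2.0, 1.0, ""),
    ]
    print("assembly failure rate [1e-9/h]:", assembly_failure_rate(parts))
    print("components without documented source:", missing_source(parts))
```

Keeping the source next to each value makes the critical evaluation asked for above a routine check rather than a separate investigation.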
A3.3.4 Selection and Qualification of Components, Materials, and Manufacturing Processes
Components, materials, and production processes have a great impact on product quality and reliability. They must be carefully selected and qualified. Examples of qualification tests on electronic components and assemblies are given in Chapter 3. For production processes one may refer e.g. to [8.1 - 8.15]. The quality and reliability assurance program should state how components, materials, and processes are (or have already been) selected and qualified. For instance, the following questions should be answered:
1. Does a list of preferred components and materials exist? Will critical components be available on the market at least for the required production and warranty time?
2. How will obsolescence problems be solved?
3. Under what conditions can a designer use nonqualified components/materials?
4. How are new components selected? What is the qualification procedure?
5. How have the standard manufacturing processes been qualified?
6. How are special manufacturing processes qualified?
Special manufacturing processes are those whose quality cannot be tested directly on the product, which have high requirements with respect to reproducibility, or which can have an important negative effect on the product quality or reliability.
A3.3.5 Configuration Management
Configuration management is an important tool for quality assurance, in particular during design and development. Within a project, it is often subdivided into configuration identification, auditing, control, and accounting. The identification of an item is recorded in its documentation. A possible documentation outline for complex equipment and systems is given in Fig. A3.1. Configuration auditing is done via design reviews (often also termed gate reviews), the aim of which is to assure/verify that the system will meet all requirements. In a design review, all aspects of design and development (selection and use of components and materials, dimensioning, interfaces, etc.), production (manufacturability, testability, reproducibility), reliability, maintainability, safety, patent regulations, value engineering, and value analysis are critically examined with the help of checklists. The most important design reviews are described in Table A3.3. For complex systems a review of the first production unit (FCA/PCA) is often required. A further important objective of design reviews is to decide about continuation or stopping of the project considered on the basis of objective considerations and feasibility check (Tables A3.3 and 5.3 & Fig. 1.6). A week before the design review, participants should present project specific checklists, see Appendix A4 and Tables 2.8 & 4.3 for some suggestions. Design reviews are chaired by the project manager and should be cochaired by the project quality and reliability assurance manager. For complex equipment and systems, the review team may vary according to the following list: project manager, project quality and reliability assurance manager, design engineers, representatives from production and marketing, independent design engineer or external expert, customer representatives (if appropriate).

DOCUMENTATION
  Project documentation: system specifications; quotations, requests; interface documentation; planning and control documentation; concepts/strategies (maintenance, test); analysis reports; standards, handbooks, general rules
  Technical documentation: work breakdown structures; drawings; schematics; part lists; wiring plans; specifications; purchasing doc.; handling/transportation/storage/packaging doc.
  Production documentation: operations plans/records; production procedures; tool documentation; assembly documentation; test procedures; test reports; documents pertaining to the quality data reporting system
  Customer documentation: customer system specifications; operating and maintenance manuals; spare part catalog
Fig. A3.1 Possible documentation outline for complex equipment and systems
Configuration control includes evaluation, coordination, and release or rejection of all proposed changes and modifications. Changes occur as a result of defects or failures; modifications are triggered by a revision of the system specifications. Configuration accounting ensures that all approved changes and modifications have been implemented and recorded. This calls for a defined procedure, as changes/modifications must be realized in hardware, software, and documentation. A one-to-one correspondence between hardware or software and documentation is important during all life-cycle phases of a product. Complete records over all life-cycle phases become necessary if traceability is explicitly required, as e.g. in the aerospace or nuclear field. Partial traceability can also be required for products which are critical with respect to safety, or because of product liability. Referring to configuration management, the quality and reliability assurance program should for instance answer the following questions:
1. Which documents will be produced by whom, when, and with what content?
2. Are document contents in accordance with quality and reliability requirements?
3. Is the release procedure for technical and production documentation compatible with quality requirements?
4. Are the procedures for changes/modifications clearly defined?
5. How is compatibility (upward and/or downward) assured?
6. How is configuration accounting assured during production?
7. Which items are subject to traceability requirements?
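The requirement that every approved change be realized in hardware, software, and documentation lends itself to a simple consistency check. The following Python sketch is one possible way to list the approved changes that are not yet fully implemented; the record layout, the identifiers, and the function name are hypothetical and serve only to illustrate the idea of configuration accounting.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    change_id: str
    approved: bool
    affects: frozenset  # artifacts touched: "hardware", "software", "documentation"
    implemented_in: set = field(default_factory=set)


def open_items(changes):
    """Map each approved change to the affected artifacts not yet updated."""
    result = {}
    for change in changes:
        if not change.approved:
            continue  # rejected or pending changes are not accounted for here
        missing = set(change.affects) - change.implemented_in
        if missing:
            result[change.change_id] = sorted(missing)
    return result


if __name__ == "__main__":
    log = [
        Change("C-1", True, frozenset({"hardware", "documentation"}), {"hardware"}),
        Change("C-2", True, frozenset({"software"}), {"software"}),
        Change("C-3", False, frozenset({"hardware"})),
    ]
    # Changes listed here break the correspondence between item and documentation
    print(open_items(log))
```

An empty result indicates that, for the recorded changes, the correspondence between hardware or software and documentation has been maintained.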
A3.3.6 Quality Tests
Quality tests are necessary to verify whether an item conforms to specified requirements. Such tests cover performance, reliability, maintainability, and safety aspects, and include incoming inspections, qualification tests, production tests, and acceptance tests. To optimize cost and time schedule, tests should be integrated in a test (and screening) strategy at system level. Methods for statistical quality control and reliability tests are given in Chapter 7. Qualification tests and screening procedures are discussed in Sections 3.2 - 3.4 and 8.2 - 8.3. Basic considerations for test and screening strategies with cost considerations are in Section 8.4. Some aspects of testing software are discussed in Section 5.3. Reliability growth is investigated in Section 7.7. The quality and reliability assurance program should for instance answer the following questions:
1. What are the test and screening strategies at system level?
2. How were subcontractors selected, qualified, and monitored?
3. What is specified in the procurement documentation?
4. How is the incoming inspection performed?
5. Which components and materials are 100% tested? Which are 100% screened? What are the procedures for screening?
6. How are prototypes qualified? Who decides on test results?
7. How are production tests performed? Who decides on test results?
8. Which procedures are applied to defective or failed items?
9. What are the instructions for handling, transportation, storage, and shipping?
A3.3.7 Quality Data Reporting System
Starting at the prototype qualification tests, all defects and failures should be systematically collected, analyzed, and corrected. Analysis should go back to the cause of the fault, in order to find those actions most appropriate for avoiding repetition of the same problem. The concept of a quality data reporting system is illustrated in Fig. 1.8 and applies basically to hardware and software; detailed requirements are given in Appendix A5.

Table A3.3 Design reviews during definition, design, and development of complex equipment and systems

System Design Review (SDR)
  When: At the end of the definition phase
  Goal:
    - Critical review of the system specifications on the basis of results from market research, rough analysis, comparative studies, patent situation, etc.
    - Feasibility check
  Input:
    - Item list
    - System specifications (draft)
    - Documentation (analyses, reports, etc.)
    - Checklists (one for each participant)*
  Output:
    - System specifications
    - Proposal for the design phase
    - Interface definitions
    - Rough maintenance and logistic support concept
    - Report

Preliminary Design Reviews (PDR)
  When: During the design phase, each time an assembly has been developed
  Goal:
    - Critical review of all documents belonging to the assembly under consideration (calculations, schematics, parts lists, test specifications, etc.)
    - Comparison of the target achieved with the system specifications requirements
    - Checking interfaces to other assemblies
    - Feasibility check
  Input:
    - Item list
    - Documentation (analyses, schematics, drawings, parts lists, test specifications, work breakdown structure, interface specifications, etc.)
    - Reports of relevant earlier design reviews
    - Checklists (one for each participant)*
  Output:
    - Reference configuration (baseline) of the assembly considered
    - List of deviations from the system specifications
    - Report

Critical Design Review (CDR)
  When: At the end of prototype qualification tests
  Goal:
    - Critical comparison of prototype qualification test results with system requirements
    - Formal review of the correspondence between technical documentation and prototype
    - Verification of manufacturability, testability, and reproducibility
    - Feasibility check
  Input:
    - Item list
    - Technical documentation
    - Testing plan and procedures for prototype qualification tests
    - Results of prototype qualification tests
    - List of deviations from the system requirements
    - Maintenance concept
    - Checklists (one for each participant)*
  Output:
    - List of the final deviations from the system specs.
    - Qualified and released prototypes
    - Frozen technical documentation
    - Revised maintenance concept
    - Production proposal
    - Report

* See Appendix A4 for a possible catalog of questions to generate project specific checklists and Table 5.5 for software specific aspects; gate review is often used instead of design review.
The quality and reliability assurance program should for instance answer the following questions:
1. How is the collection of defect and failure data carried out? At which project phase does it start?
2. How are defects and failures analyzed?
3. Who carries out corrective actions? Who monitors their realization? Who checks the final configuration?
4. How is evaluation and feedback of quality and reliability data organized?
5. Who is responsible for the quality data reporting system? Does production have its own locally limited version of such a system? How does this system interface with the company's quality data reporting system?
A4 Checklists for Design Reviews
In a design review, all aspects of design, development, production, reliability, maintainability, safety, patent regulations, and value engineering/value analysis are critically examined with the help of checklists. The most important design reviews are described in Table A3.3 (see Table 5.5 for software specific aspects). A further objective of design reviews is to decide about continuation or stopping of the project on the basis of objective considerations and feasibility check (Tables A3.3 and 5.3 & Fig. 1.6). This appendix gives a catalog of questions which can be used to generate project specific checklists for design reviews for complex equipment and systems with high quality and reliability requirements, when tailoring is not mandatory.
A4.1 System Design Review
1. What experience exists with similar equipment or systems?
2. What are the goals for performance (capability), reliability, maintainability, availability, and safety? How have they been defined? Which mission profile (required function and environmental conditions) is applicable?
3. Are the requirements realistic? Do they correspond to a market need?
4. What tentative allocation of reliability and maintainability down to assembly/unit level was undertaken?
5. What are the critical items? Are potential problems to be expected (new technologies, interfaces)?
6. Have comparative studies been done? What are the results?
7. Are interference problems (external or internal EMC) to be expected?
8. Are there potential safety/liability problems?
9. Is there a maintenance concept? Do special ergonomic requirements exist?
10. Are there special software requirements?
11. Has the patent situation been verified? Are licenses necessary?
12. Are there estimates of life-cycle cost? Have these been optimized with respect to reliability and maintainability requirements?
13. Is there a feasibility study? Where does the competition stand? Has development risk been assessed?
14. Is the project time schedule realistic? Can the system be marketed at the right time?
15. Can supply problems be expected during production ramp-up?
A4.2 Preliminary Design Reviews
a) General
1. Is the assembly/unit under consideration a new development or only a change/modification? Can existing items (e.g. subassemblies) be used?
2. Is there experience with similar assemblies/units? What were the problems?
3. Is there redundancy in hardware and/or software?
4. Have customer and market demands changed since the beginning of development? Can individual requirements be reduced?
5. Can the chosen solution be further simplified?
6. Are there patent problems? Do licenses have to be purchased?
7. Have expected cost and deadlines been met? Was value engineering used?
b) Performance Parameters
1. How have the main performance parameters of the assembly/unit under consideration been defined? How was their fulfillment verified (calculations, simulation, tests)?
2. Have worst case situations been considered in calculations/simulations?
3. Have interference problems (EMC) been solved?
4. Have applicable standards been observed during design and development?
5. Have interface problems with other assemblies/units been solved?
6. Have prototypes been adequately tested in the laboratory?
c) Environmental Conditions
1. Have environmental conditions been defined? As a function of time? Were these consistently used to determine component operating conditions?
2. How was EMC interference determined? Has its influence been taken into account in worst case calculations/simulations?
d) Components and Materials
1. Which components and materials do not appear in the preferred lists? For what reasons? How were these components and materials qualified?
2. Are incoming inspections necessary? For which components and materials? How and by whom will they be performed?
3. Which components and materials were screened? How and by whom will screening be performed?
4. Are suppliers guaranteed for series production? Is there at least one second source for each component and material? Have requirements for quality, reliability, and safety been met?
5. Are obsolescence problems to be expected? How will they be solved?
e) Reliability See Table 2.8.
f) Maintainability See Table 4.3.

g) Safety
1. Have applicable standards concerning accident prevention been observed?
2. Has safety been considered with regard to external causes (natural catastrophe, sabotage, etc.)?
3. Has an FMEA/FMECA or similar cause-to-effects analysis been performed? Are there failure modes with critical or even catastrophic consequences? Can these be avoided? Have all single-point failures been identified? Can these be avoided?
4. Has a fail-safe analysis been performed? What were the results?
5. What safety tests are planned? Are they sufficient?
6. Have safety aspects been dealt with adequately in the documentation?

h) Human Factors, Ergonomics
1. Have operating and maintenance sequences been defined with regard to the training level of operators and maintenance personnel?
2. Have ergonomic factors been taken into account in defining operating sequences?
3. Has the man-machine interface been sufficiently considered?
i) Standardization
1. Have standard components and materials been used wherever possible?
2. Has item exchangeability been considered during design and construction?

j) Configuration
1. Is the technical documentation (schematics, drawings, etc.) complete, error-free, and does it reflect the present state of the project?
2. Have all interface problems between assemblies/units been solved?
3. Can the technical documentation be frozen and considered as reference documentation (baseline)?
4. How is compatibility (upward and/or downward) assured?

k) Production and Testing
1. Which qualification tests are foreseen for prototypes? Have reliability, maintainability, and safety aspects been considered sufficiently in these tests?
2. Have all questions been answered regarding manufacturability, testability, and reproducibility?
3. Are special production processes necessary? Were they qualified? What were the results?
4. Are special transport, packaging, or storage problems to be expected?
A4.3
Critical Design Review (System Level)
a) Technical Aspects
1. Does the documentation allow an exhaustive and correct interpretation of test procedures and results? Has the technical documentation been frozen? Has conformance with present hardware and software been checked?
2. Are test specifications and procedures complete? In particular, are conditions for functional, environmental, reliability, and safety tests clearly defined?
3. Have fault criteria been defined for critical parameters? Is an indirect measurement planned for those parameters which cannot be measured accurately enough during tests?
4. Has a representative mission profile, with the corresponding required function, been clearly defined for reliability tests?
5. Have test criteria for maintainability been defined? Which failures were simulated/introduced? How have personnel and material conditions been fixed?
6. Have test criteria for safety been defined (accident prevention and technical safety)?
7. Have ergonomic aspects been checked? How?
8. Can packaging, transport, and storage cause problems?
9. Have defects and failures been systematically analyzed (mode, cause, effect)? Has the usefulness of corrective actions been verified? How? Also with respect to cost?
10. Have all deviations been recorded? Can they be accepted?
11. Does the system still satisfy customer/market needs?
12. Are manufacturability and reproducibility guaranteed within the framework of a production environment?

b) Formal Aspects
1. Is the technical documentation complete?
2. Has the technical documentation been checked for correctness? For coherency?
3. Is uniqueness in numbering guaranteed? Even in the case of changes?
4. Is hardware labeling appropriate? Does it satisfy production and maintenance requirements?
5. Has conformance between prototype and documentation been checked?
6. Is the maintenance concept mature? Are spare parts having a different change status fully interchangeable?
7. Are production tests sufficient from today's point of view?
A5 Requirements for Quality Data Reporting Systems
A quality data reporting system is a system to collect, analyze, and correct all defects and failures occurring during production and testing of an item, as well as to evaluate and feed back the corresponding quality and reliability data (Fig. 1.8). The system is generally computer-aided. Analysis of failures and defects must go back to the root cause in order to determine the most appropriate action necessary to avoid repetition of the same problem. The quality data reporting system applies basically to hardware and software. It should remain active during the operating phase, at least for the warranty time. This appendix summarizes the requirements for a computer-aided quality data reporting system for complex equipment and systems.

a) General Requirements
1. Up-to-dateness, completeness, and utility of the delivered information must be the primary concern (best compromise).
2. A high level of usability (user friendliness) and minimal manual intervention should be a goal.
3. Procedures and responsibilities should be clearly defined (several levels depending upon the consequence of defects or failures).
4. The system should be flexible and easily adaptable to new needs.

b) Requirements Relevant to Data Collection
1. All data concerning defects and failures (relevant to quality, reliability, maintainability, and safety) have to be collected, from the beginning of prototype qualification tests to (at least) the end of the warranty time.
2. Data collection forms should
- be preferably 8" × 11" or A4 format
- be project-independent and easy to fill in
- ensure that only the relevant information is entered and answers the questions: what, where, when, why, and how?
- have a separate field (20-30%) for free-format input for comments (requests for analysis, logistic information, etc.); these comments do not need to be processed and should be easily separable from the fixed portion of the form.
3. Description of the symptom (mode), analysis (cause, effect), and corrective action undertaken should be recorded in clear text and coded at data entry by trained personnel.
4. Data collection can be carried out in different ways
- at a single reporting location (adequate for simple problems which can be solved directly at the reporting location)
- from different reporting locations which report the fault (defect or failure), analysis result, and corrective action separately.
Operating reliability, maintainability, or logistic data can also be reported.
5. Data collection forms should be entered into the computer daily (on line if possible), so that corrective actions can be quickly initiated (for field data, a weekly or monthly entry can be sufficient for many purposes).

c) Requirements for Analysis
1. The cause should be found for each defect or failure
- at the reporting location, in the case of simple problems
- by a fault review board, in critical cases.
2. Failures (and defects) should be classified according to
mode
- sudden failure (short, open, fracture, etc.)
- gradual failure (drift, wearout, etc.)
- intermittent failures, others if needed
cause
- intrinsic (inherent weaknesses, wearout, or some other intrinsic cause)
- extrinsic (systematic failure, i.e. misuse, mishandling, design, or manufacturing failure)
- secondary failure
effect
- irrelevant
- partial failure
- complete failure
- critical failure (safety problem).
3. Consequence of the analysis (repair, rework, change, scrapping) must be reported.

d) Requirements for Corrective Actions
1. Every record is considered pending until the necessary corrective action has been successfully completed and certified.
2. The quality data reporting system must monitor all corrective actions.
3. Procedures and responsibilities pertaining to corrective action have to be defined (simple cases usually solved by the reporting location). 4. The reporting location must be informed about a completed corrective action. e) Requirements Related to Data Processing, Feedback, and Storage 1. Adequate coding must allow data compression and simplify data processing. 2. Up-to-date information should be available on-line. 3. Problem-dependent and periodic data evaluation must be possible. 4. At the end of a project, relevant information should be stored for comparative investigations.
f) Requirements Related to Compatibility with other Software Packages
1. Compatibility with the company's configuration management and data banks should be assured.
2. Data transfer with the following external software packages should be assured
- important reliability data banks
- quality data reporting systems of subsidiary companies
- quality data reporting systems of large contractors.

The effort required for implementing a quality data reporting system as described above can take 5 to 10 man-years for a medium-sized company. Competence for operation and maintenance of the quality data reporting system should be with the company's quality and reliability assurance department. The priority for the realization of corrective actions is project specific and should be fixed by the project manager. Major problems (defects and failures) should be discussed periodically by a fault review board chaired by the company's quality and reliability assurance manager, which should have, in critical cases defined in the company's quality assurance handbook, the competence to take go/no-go decisions.
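The classification of requirement c) and the "pending until certified" rule of requirement d) can be pictured as a small record type. The following Python sketch is only an illustration under assumptions: the class names, field names, and sample values are hypothetical and are not prescribed by this appendix.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    SUDDEN = "sudden"              # short, open, fracture, etc.
    GRADUAL = "gradual"            # drift, wearout, etc.
    INTERMITTENT = "intermittent"
    OTHER = "other"


class Cause(Enum):
    INTRINSIC = "intrinsic"
    EXTRINSIC = "extrinsic"
    SECONDARY = "secondary"


class Effect(Enum):
    IRRELEVANT = "irrelevant"
    PARTIAL = "partial"
    COMPLETE = "complete"
    CRITICAL = "critical"          # safety problem


@dataclass
class FaultRecord:
    """One defect or failure report (hypothetical field layout)."""
    item: str                      # what
    location: str                  # where
    date: str                      # when
    mode: Mode                     # coded symptom
    cause: Optional[Cause] = None  # why, filled in after analysis
    effect: Optional[Effect] = None
    corrective_action: str = ""    # how the problem was corrected
    action_certified: bool = False
    comments: str = ""             # free-format field, not processed

    @property
    def pending(self) -> bool:
        """A record stays pending until the corrective action is certified."""
        return not self.action_certified


# Hypothetical sample entry.
record = FaultRecord(item="PCB 17", location="incoming inspection",
                     date="2007-01-15", mode=Mode.SUDDEN)
record.cause = Cause.INTRINSIC
record.effect = Effect.PARTIAL
print(record.pending)
```

Keeping the coded fields as enumerations and the comments as a separate free-text field mirrors the separation between the fixed and the free-format portion of the data collection form.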
A6 Basic Probability Theory
In many practical situations, experiments have a random outcome, i.e., the results cannot be predicted exactly, although the same experiment is repeated under identical conditions. Examples in reliability engineering are the failure-free time of a given system, the repair time of equipment, the inspection of a given item during production, etc. Experience shows that as the number of repetitions of the same experiment increases, certain regularities appear regarding the occurrence of the event considered. Probability theory is a mathematical discipline which investigates the laws describing such regularities. The assumption of unlimited repeatability of the same experiment is basic to probability theory. This assumption permits the introduction of the concept of probability for an event starting from the properties of the relative frequency of its occurrence in a long series of trials. The axiomatic theory of probability, introduced in 1933 by A.N. Kolmogorov [A6.10], established probability theory as a mathematical discipline. In reliability analysis, probability theory allows the investigation of the probability that a given item will operate failure-free for a stated period of time under given conditions, i.e. the calculation of the item's reliability on the basis of a mathematical model. The rules necessary for such calculations are presented in Sections A6.1 - A6.4. The following sections are devoted to the concept of random variables, necessary to investigate reliability as a function of time and as a basis for stochastic processes (Appendix A7) and mathematical statistics (Appendix A8). This appendix is a compendium of probability theory, consistent from a mathematical point of view but still with reliability engineering applications in mind. Selected examples illustrate the practical aspects.
A6.1
Field of Events
As introduced in 1933 by A.N. Kolmogorov [A6.10], the mathematical model of an experiment with random outcome is a triplet $[\Omega, \mathcal{F}, \Pr]$, also called probability space. $\Omega$ is the sample space, $\mathcal{F}$ the event field, and $\Pr$ the probability of each element of $\mathcal{F}$. $\Omega$ is a set containing as elements all possible outcomes of the experiment considered. Hence $\Omega = \{1, 2, 3, 4, 5, 6\}$ if the experiment consists of a single throw of a die, and $\Omega = [0, \infty)$ in the case of failure-free times of an item. The
elements of $\Omega$ are called elementary events and are represented by $\omega$. If the logical statement "the outcome of the experiment is a subset $A$ of $\Omega$" is identified with the subset $A$ itself, combinations of statements become equivalent to operations with subsets of $\Omega$. If the sample space $\Omega$ is finite or countable, a probability can be assigned to every subset of $\Omega$. In this case, the event field $\mathcal{F}$ contains all subsets of $\Omega$ and all combinations of them. If $\Omega$ is continuous, restrictions are necessary. The event field $\mathcal{F}$ is thus a system of subsets of $\Omega$ to each of which a probability has been assigned according to the situation considered. Such a field is called a $\sigma$-field ($\sigma$-algebra) and has the following properties:
1. $\Omega$ is an element of $\mathcal{F}$.
2. If $A$ is an element of $\mathcal{F}$, its complement $\overline{A}$ is also an element of $\mathcal{F}$.
3. If $A_1, A_2, \ldots$ are elements of $\mathcal{F}$, the countable union $A_1 \cup A_2 \cup \ldots$ is also an element of $\mathcal{F}$.
From the first two properties it follows that the empty set $\emptyset$ belongs to $\mathcal{F}$. From the last two properties and De Morgan's law one recognizes that the countable intersection $A_1 \cap A_2 \cap \ldots$ also belongs to $\mathcal{F}$. In probability theory, the elements of $\mathcal{F}$ are called (random) events. The most important operations on events are the union, the intersection, and the complement:
1. The union of a finite or countable sequence $A_1, A_2, \ldots$ of events is an event which occurs if at least one of the events $A_1, A_2, \ldots$ occurs; it will be denoted by $A_1 \cup A_2 \cup \ldots$ or by $\bigcup_i A_i$.
2. The intersection of a finite or countable sequence $A_1, A_2, \ldots$ of events is an event which occurs if each one of the events $A_1, A_2, \ldots$ occurs; it will be denoted by $A_1 \cap A_2 \cap \ldots$ or by $\bigcap_i A_i$.
3. The complement of an event $A$ is an event which occurs if and only if $A$ does not occur; it is denoted by $\overline{A}$, with $\overline{A} = \{\omega : \omega \notin A\} = \Omega \setminus A$, $A \cup \overline{A} = \Omega$, $A \cap \overline{A} = \emptyset$.
Important properties of set operations are:
Commutative law: $A \cup B = B \cup A$; $A \cap B = B \cap A$
Associative law: $A \cup (B \cup C) = (A \cup B) \cup C$; $A \cap (B \cap C) = (A \cap B) \cap C$
Distributive law: $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$; $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$
Complement law: $A \cap \overline{A} = \emptyset$; $A \cup \overline{A} = \Omega$
Idempotent law: $A \cup A = A$; $A \cap A = A$
De Morgan's law: $\overline{A \cup B} = \overline{A} \cap \overline{B}$; $\overline{A \cap B} = \overline{A} \cup \overline{B}$
Identity law: $\overline{\overline{A}} = A$; $A \cup (\overline{A} \cap B) = A \cup B$.
The sample space $\Omega$ is also called the sure event and $\emptyset$ is the impossible event. The events $A_1, A_2, \ldots$ are mutually exclusive if $A_i \cap A_j = \emptyset$ holds for any $i \neq j$. The events $A$ and $B$ are equivalent if either they occur together or neither of them occurs; equivalent events have the same probability. In the following, events will be mainly enclosed in braces { }.
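For a finite sample space the set laws listed above can be checked by brute force. The following Python sketch does this for the single throw of a die, with the event field taken as the set of all subsets of the sample space; the helper names are assumptions made for this illustration only.

```python
from itertools import combinations

# Sample space for a single throw of a die.
OMEGA = frozenset(range(1, 7))


def complement(a):
    """Complement of event a with respect to OMEGA."""
    return OMEGA - a


def all_events(omega):
    """All subsets of a finite sample space (the event field)."""
    items = sorted(omega)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]


def de_morgan_holds(a, b):
    """Check both forms of De Morgan's law for the events a and b."""
    return (complement(a | b) == complement(a) & complement(b)
            and complement(a & b) == complement(a) | complement(b))


def identity_law_holds(a, b):
    """Check A u (not-A n B) = A u B."""
    return a | (complement(a) & b) == a | b


events = all_events(OMEGA)
print(len(events))  # number of events in the event field
print(all(de_morgan_holds(a, b) and identity_law_holds(a, b)
          for a in events for b in events))
```

An exhaustive check of this kind is of course no proof for general sample spaces; it only illustrates the laws on one small example.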
A6.2
Concept of Probability
Let us assume that 10 (random) samples of size n = 100 were taken from a large and homogeneous lot of populated printed circuit boards (PCBs), for incoming inspection. Examination yielded the following results:

Sample number:          1  2  3  4  5  6  7  8  9  10
No. of defective PCBs:  6  5  1  3  4  0  3  4  5  7
For 1000 repetitions of the "testing a PCB" experiment, the relative frequency of the occurrence of event {PCB defective} is
$$\frac{38}{1000} = 0.038.$$
It is intuitively appealing to consider 0.038 as the probability of the event {PCB defective}. As shown below, 0.038 is a reasonable estimate of this probability (on the basis of the experimental observations made). Relative frequencies of the occurrence of events have the property that if $n$ is the number of trial repetitions and $n(A)$ the number of those trial repetitions in which the event $A$ occurred, then
$$\hat{p}_n(A) = \frac{n(A)}{n}$$
is the relative frequency of the occurrence of $A$, and the following rules apply:
1. R1: $\hat{p}_n(A) \geq 0$.
2. R2: $\hat{p}_n(\Omega) = 1$.
3. R3: if the events $A_1, \ldots, A_m$ are mutually exclusive, then $n(A_1 \cup \ldots \cup A_m) = n(A_1) + \ldots + n(A_m)$ and $\hat{p}_n(A_1 \cup \ldots \cup A_m) = \hat{p}_n(A_1) + \ldots + \hat{p}_n(A_m)$.
Experience shows that for a second group of $n$ trials, the relative frequency $\hat{p}_n(A)$ can be different from that of the first group. $\hat{p}_n(A)$ also depends on the number of trials $n$. On the other hand, experiments have confirmed that with increasing $n$, the value $\hat{p}_n(A)$ converges toward a fixed value $p(A)$, see Fig. A6.1 for an example. It therefore seems reasonable to designate the limiting value $p(A)$ as the probability $\Pr\{A\}$ of the event $A$, with $\hat{p}_n(A)$ as an estimate of $\Pr\{A\}$. Although intuitive, such a definition of probability would lead to problems in the case of continuous (non-denumerable) sample spaces. Since Kolmogorov's work [A6.10], the probability $\Pr\{A\}$ has been defined as a function on the event field $\mathcal{F}$ of subsets of $\Omega$. The following axioms hold for this function:
Figure A6.1 Example of relative frequency k/n of "heads" when tossing a symmetric coin n times
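The behavior shown in Fig. A6.1 can be imitated with a pseudo-random simulation. The sketch below is an illustration only: the function name, the seed, and the number of trials are assumptions chosen for the example, and a pseudo-random generator stands in for the physical coin.

```python
import random


def relative_frequencies(n_trials, p_heads=0.5, seed=1):
    """Running relative frequency k/n of heads for n = 1, ..., n_trials."""
    rng = random.Random(seed)
    k = 0          # number of heads observed so far
    freqs = []
    for n in range(1, n_trials + 1):
        if rng.random() < p_heads:
            k += 1
        freqs.append(k / n)
    return freqs


freqs = relative_frequencies(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(freqs[n - 1], 4))
```

For small n the printed values typically fluctuate noticeably; for large n they settle close to 0.5, in agreement with the regularity described above.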
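1. Axiom 1: For each $A \in \mathcal{F}$, $\Pr\{A\} \geq 0$.
2. Axiom 2: $\Pr\{\Omega\} = 1$.
3. Axiom 3: If events $A_1, A_2, \ldots$ are mutually exclusive, then
$$\Pr\Big\{\bigcup_{i=1}^{\infty} A_i\Big\} = \sum_{i=1}^{\infty} \Pr\{A_i\}.$$
Axiom 3 is equivalent to the following statements taken together:
4. Axiom 3': For any finite collection of mutually exclusive events, $\Pr\{A_1 \cup \ldots \cup A_n\} = \Pr\{A_1\} + \ldots + \Pr\{A_n\}$.
5. Axiom 3'': If events $A_1, A_2, \ldots$ are increasing, i.e. $A_n \subseteq A_{n+1}$, $n = 1, 2, \ldots$, then
$$\lim_{n \to \infty} \Pr\{A_n\} = \Pr\Big\{\bigcup_{i=1}^{\infty} A_i\Big\}.$$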
The relationships between Axiom 1 and R1, and between Axiom 2 and R2 are obvious. Axiom 3 postulates the total additivity of the set function $\Pr\{A\}$. Axiom 3' corresponds to R3. Axiom 3'' implies a continuity property of the set function $\Pr\{A\}$ which cannot be derived from the properties of $\hat{p}_n(A)$, but which is of great importance in probability theory. It should be noted that the interpretation of the probability of an event as the limit of the relative frequency of occurrence of this event in a long series of trial repetitions appears as a theorem within probability theory (law of large numbers, Eqs. (A6.144) and (A6.146)). From Axioms 1 to 3 it follows that: $\Pr\{\emptyset\} = 0$, $\Pr\{A\} \leq \Pr\{B\}$ if $A \subseteq B$, $\Pr\{\overline{A}\} = 1 - \Pr\{A\}$, $0 \leq \Pr\{A\} \leq 1$.
When modeling an experiment with random outcome by means of the probability space $[\Omega, \mathcal{F}, \Pr]$, the difficulty is often in the determination of the probabilities $\Pr\{A\}$ for every $A \in \mathcal{F}$. The structure of the experiment can help here. Besides the statistical probability, defined as the limit for $n \to \infty$ of the relative frequency $k/n$, the following rules can be used if one assumes that all elementary events $\omega$ have the same chance of occurrence:
1. Classical probability (discrete uniform distribution): If $\Omega$ is a finite set and $A$ a subset of $\Omega$, then
$$\Pr\{A\} = \frac{\text{number of elements in } A}{\text{number of elements in } \Omega} \quad\text{or}\quad \Pr\{A\} = \frac{\text{number of favorable outcomes}}{\text{number of possible outcomes}}. \tag{A6.2}$$
2. Geometric probability (spatial uniform distribution): If $\Omega$ is a set in the plane $\mathbb{R}^2$ of finite area and $A$ a subset of $\Omega$, then
$$\Pr\{A\} = \frac{\text{area of } A}{\text{area of } \Omega}. \tag{A6.3}$$
It should be noted that the geometric probability can also be defined if $\Omega$ is a part of the Euclidean space having a finite area. Examples A6.1 and A6.2 illustrate the use of Eqs. (A6.2) and (A6.3).
Example A6.1 From a shipment containing 97 good and 3 defective ICs, one IC is randomly selected. What is the probability that it is defective?
Solution From Eq. (A6.2),
$$\Pr\{\text{IC defective}\} = \frac{3}{100}.$$

Example A6.2 Maurice and Matthew wish to meet between 8:00 and 9:00 a.m. according to the following rules: 1) They come independently of each other and each will wait 12 minutes. 2) The time of arrival is equally distributed between 8:00 and 9:00 a.m. What is the probability that they will meet?

Solution Equation (A6.3) can be applied and leads to, see graph,
$$\Pr\{\text{Matthew meets Maurice}\} = \frac{1 - 2 \cdot \frac{0.8 \cdot 0.8}{2}}{1} = 0.36.$$
[Graph of the arrival times; axis label: Arrival of Matthew]
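The result of Example A6.2 can be cross-checked numerically. The Python sketch below compares the geometric probability (unit square minus two triangles with legs of 0.8 h) with a simple Monte Carlo estimate; function names, seed, and number of trials are assumptions made for this illustration.

```python
import random

WAIT = 12 / 60  # waiting time in hours


def exact_meeting_probability(wait=WAIT):
    """Unit square minus two right triangles with legs (1 - wait)."""
    return 1 - (1 - wait) ** 2


def simulated_meeting_probability(n_trials=200_000, wait=WAIT, seed=1):
    """Fraction of trials in which the two arrival times differ by at most wait."""
    rng = random.Random(seed)
    hits = sum(abs(rng.random() - rng.random()) <= wait
               for _ in range(n_trials))
    return hits / n_trials


print(round(exact_meeting_probability(), 4))
print(round(simulated_meeting_probability(), 4))
```

The two arrival times are drawn independently and uniformly on the interval of one hour, which corresponds to rules 1) and 2) of the example.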
Another way to determine probabilities is to calculate them from other probabilities which are known. This involves paying attention to the structure of the experiment and application of the rules of probability theory (Appendix A6.4). For example, the predicted reliability of a system can be calculated from the reliability of its elements and the system's structure. However, there is often no alternative to determining probabilities as the limits of relative frequencies, with the aid of statistical methods (Appendices A6.11 and A8).
A6.3
Conditional Probability, Independence
The concept of conditional probability is of great importance in practical applications. It is not difficult to accept that the information "event $A$ has occurred in an experiment" can modify the probabilities of other events. These new probabilities are defined as conditional probabilities and denoted by $\Pr\{B \mid A\}$. If for example $A \subseteq B$, then $\Pr\{B \mid A\} = 1$, which is in general different from the original unconditional probability $\Pr\{B\}$. The concept of conditional probability $\Pr\{B \mid A\}$ of the event $B$ under the condition "event $A$ has occurred" is introduced here using the properties of relative frequency. Let $n$ be the total number of trial repetitions and let $n(A)$, $n(B)$, and $n(A \cap B)$ be the number of occurrences of $A$, $B$, and $A \cap B$, respectively, with $n(A) > 0$ assumed. When considering only the $n(A)$ trials (trials in which $A$ occurs), then $B$ occurs in these $n(A)$ trials exactly when it occurred together with $A$ in the original trial series, i.e. $n(A \cap B)$ times. The relative frequency of $B$ in the trials with the information "$A$ has occurred" is therefore
$$\frac{n(A \cap B)}{n(A)} = \frac{n(A \cap B)/n}{n(A)/n} = \frac{\hat{p}_n(A \cap B)}{\hat{p}_n(A)}. \tag{A6.4}$$
Equation (A6.4) leads to the following definition of the conditional probability $\Pr\{B \mid A\}$ of an event $B$ under the condition $A$, i.e. assuming that $A$ has occurred,
$$\Pr\{B \mid A\} = \frac{\Pr\{A \cap B\}}{\Pr\{A\}}, \qquad \Pr\{A\} > 0. \tag{A6.5}$$
From Eq. (A6.5) it follows that
$$\Pr\{A \cap B\} = \Pr\{A\}\,\Pr\{B \mid A\} = \Pr\{B\}\,\Pr\{A \mid B\}. \tag{A6.6}$$
Using Eq. (A6.5), probabilities $\Pr\{B \mid A\}$ will be defined for all $B \in \mathcal{F}$. $\Pr\{B \mid A\}$ is
a function of $B$ which satisfies Axioms 1 to 3 of Appendix A6.2, obviously with $\Pr\{A \mid A\} = 1$. The information "event $A$ has occurred" thus leads to a new probability space $[A, \mathcal{F}_A, \Pr_A]$, where $\mathcal{F}_A$ consists of events of the form $A \cap B$, with $B \in \mathcal{F}$ and $\Pr_A\{B\} = \Pr\{B \mid A\}$, see Example A6.5. It is reasonable to define the events $A$ and $B$ as independent if the information "event $A$ has occurred" does not influence the probability of the occurrence of event $B$, i.e. if
$$\Pr\{B \mid A\} = \Pr\{B\}. \tag{A6.7}$$
However, when considering Eq. (A6.6), another definition, with symmetry in $A$ and $B$, is obtained, where $\Pr\{A\} > 0$ is not required. Two events $A$ and $B$ are independent if and only if
$$\Pr\{A \cap B\} = \Pr\{A\}\,\Pr\{B\}. \tag{A6.8}$$
The events $A_1, \ldots, A_n$ are (stochastically) independent if for each $k$ ($1 < k \leq n$) and any selection of distinct $i_1, \ldots, i_k \in \{1, \ldots, n\}$
$$\Pr\{A_{i_1} \cap \ldots \cap A_{i_k}\} = \Pr\{A_{i_1}\} \cdots \Pr\{A_{i_k}\} \tag{A6.9}$$
holds.
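That the product rule must hold for every selection of events, and not only for pairs, can be seen on a small example. The Python sketch below uses two tosses of a symmetric coin (an example chosen for this illustration, not taken from the text) and checks the product rule for all sub-collections; the helper names are assumptions.

```python
from fractions import Fraction
from itertools import combinations, product

# Two tosses of a symmetric coin; the four outcomes are equally likely.
OMEGA = list(product("HT", repeat=2))


def pr(event):
    """Classical probability of a subset of OMEGA."""
    return Fraction(len(event), len(OMEGA))


def cond_pr(b, a):
    """Conditional probability Pr{B | A} = Pr{A n B} / Pr{A}."""
    return pr(a & b) / pr(a)


def independent(events):
    """Check the product rule for every sub-collection of 2 or more events."""
    for k in range(2, len(events) + 1):
        for subset in combinations(events, k):
            joint = set(OMEGA)
            prod = Fraction(1)
            for e in subset:
                joint &= e
                prod *= pr(e)
            if pr(joint) != prod:
                return False
    return True


A = {w for w in OMEGA if w[0] == "H"}    # first toss shows heads
B = {w for w in OMEGA if w[1] == "H"}    # second toss shows heads
C = {w for w in OMEGA if w[0] == w[1]}   # both tosses show the same side

print(cond_pr(B, A) == pr(B))            # A and B are independent
print(independent([A, B]), independent([A, C]), independent([B, C]))
print(independent([A, B, C]))            # pairwise independence is not enough
```

Here every pair of events satisfies the product rule, but the three events together do not, since the intersection of all three is {HH} with probability 1/4 instead of 1/8.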
A6.4
Fundamental Rules of Probability Theory
The probability calculation of event combinations is based on the fundamental rules of probability theory introduced in this section.
A6.4.1 Addition Theorem for Mutually Exclusive Events
The events $A$ and $B$ are mutually exclusive if the occurrence of one event excludes the occurrence of the other, formally $A \cap B = \emptyset$. Considering a component which can fail due to a short or an open circuit, the events "failure occurs due to a short circuit" and "failure occurs due to an open circuit" are mutually exclusive. Application of Axiom 3 (Appendix A6.2) leads to
$$\Pr\{A \cup B\} = \Pr\{A\} + \Pr\{B\}. \tag{A6.10}$$
Equation (A6.10) is considered a theorem by tradition only; indeed, it is a particular case of Axiom 3 in Appendix A6.2.

Example A6.3 A shipment of 100 diodes contains 3 diodes with shorts and 2 diodes with opens. If one diode is randomly selected from the shipment, what is the probability that it is defective?
Solution From Eqs. (A6.10) and (A6.2),
$$\Pr\{\text{diode defective}\} = \frac{3}{100} + \frac{2}{100} = \frac{5}{100}.$$

If the events $A_1, A_2, \ldots$ are mutually exclusive ($A_i \cap A_j = \emptyset$ for all $i \neq j$), they are also totally exclusive. According to Axiom 3 it follows that
$$\Pr\{A_1 \cup A_2 \cup \ldots\} = \sum_i \Pr\{A_i\}.$$
A6.4.2 Multiplication Theorem for Two Independent Events
The events $A$ and $B$ are independent if the information about occurrence (or nonoccurrence) of one event has no influence on the probability of occurrence of the other event. In this case Eq. (A6.8) applies
$$\Pr\{A \cap B\} = \Pr\{A\}\,\Pr\{B\}.$$

Example A6.4 A system consists of two elements $E_1$ and $E_2$ necessary to fulfill the required function. The failure of one element has no influence on the other. $R_1 = 0.8$ is the reliability of $E_1$ and $R_2 = 0.9$ is that of $E_2$. What is the reliability $R_S$ of the system?
Solution Considering the assumed independence between the elements $E_1$ and $E_2$ and the definition of $R_1$, $R_2$, and $R_S$ as $R_1 = \Pr\{E_1 \text{ fulfills the required function}\}$, $R_2 = \Pr\{E_2 \text{ fulfills the required function}\}$, and $R_S = \Pr\{E_1 \text{ fulfills the required function} \cap E_2 \text{ fulfills the required function}\}$, one obtains from Eq. (A6.8)
$$R_S = R_1 R_2 = 0.8 \cdot 0.9 = 0.72.$$
A6.4.3 Multiplication Theorem for Arbitrary Events
For arbitrary events $A$ and $B$, with $\Pr\{A\} > 0$ and $\Pr\{B\} > 0$, Eq. (A6.6) applies
$$\Pr\{A \cap B\} = \Pr\{A\}\,\Pr\{B \mid A\} = \Pr\{B\}\,\Pr\{A \mid B\}.$$

Example A6.5 2 ICs are randomly selected from a shipment of 95 good and 5 defective ICs. What is the probability of having (i) no defective ICs, and (ii) exactly one defective IC?
Solution (i) From Eqs. (A6.6) and (A6.2),
$$\Pr\{\text{first IC good} \cap \text{second IC good}\} = \frac{95}{100} \cdot \frac{94}{99} = 0.902.$$
(ii) $\Pr\{\text{exactly one defective IC}\} = \Pr\{(\text{first IC good} \cap \text{second IC defective}) \cup (\text{first IC defective} \cap \text{second IC good})\}$; from Eqs. (A6.6) and (A6.2),
$$\Pr\{\text{exactly one defective IC}\} = \frac{95}{100} \cdot \frac{5}{99} + \frac{5}{100} \cdot \frac{95}{99} = 0.096.$$
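The arithmetic of Example A6.5 can be reproduced with exact fractions. The following Python sketch is a minimal illustration; the function name and the returned dictionary layout are assumptions made for this example.

```python
from fractions import Fraction


def draw_two(good, defective):
    """Probabilities of 0, 1, or 2 defective items when 2 items are drawn
    without replacement, using Pr{A n B} = Pr{A} Pr{B | A}."""
    n = good + defective
    p_gg = Fraction(good, n) * Fraction(good - 1, n - 1)
    p_gd = Fraction(good, n) * Fraction(defective, n - 1)
    p_dg = Fraction(defective, n) * Fraction(good, n - 1)
    p_dd = Fraction(defective, n) * Fraction(defective - 1, n - 1)
    # "good then defective" and "defective then good" are mutually exclusive.
    return {0: p_gg, 1: p_gd + p_dg, 2: p_dd}


probs = draw_two(95, 5)
for k, p in probs.items():
    print(k, p, round(float(p), 3))
```

The three probabilities sum to 1, since the events "0, 1, or 2 defective ICs" are mutually exclusive and cover all possible outcomes.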
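Generalization of Eq. (A6.6) leads to the multiplication theorem
$$\Pr\{A_1 \cap \ldots \cap A_n\} = \Pr\{A_1\}\,\Pr\{A_2 \mid A_1\}\,\Pr\{A_3 \mid A_1 \cap A_2\} \cdots \Pr\{A_n \mid A_1 \cap \ldots \cap A_{n-1}\}.$$
Here, $\Pr\{A_1 \cap \ldots \cap A_{n-1}\} > 0$ is assumed. An important special case arises when the events $A_1, \ldots, A_n$ are (stochastically) independent; in this case Eq. (A6.9) yields
$$\Pr\{A_1 \cap \ldots \cap A_n\} = \Pr\{A_1\} \cdots \Pr\{A_n\} = \prod_{i=1}^{n} \Pr\{A_i\}.$$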
A6.4.4 Addition Theorem for Arbitrary Events The probability of occurrence of at least one of the (possibly non-exclusive) events A and B is given by
$$\Pr\{A \cup B\} = \Pr\{A\} + \Pr\{B\} - \Pr\{A \cap B\}. \tag{A6.13}$$
To prove this theorem, consider Axiom 3 (Appendix A6.2) and the partitioning of the events $A \cup B$ and $B$ into mutually exclusive events ($A \cup B = A \cup (\overline{A} \cap B)$ and $B = (A \cap B) \cup (\overline{A} \cap B)$).
Example A6.6 To increase the reliability of a system, 2 machines are used in active (parallel) redundancy. The reliability of each machine is 0.9 and each machine operates and fails independently of the other. What is the system's reliability?
Solution From Eqs. (A6.13) and (A6.8),
$$\Pr\{\text{the first machine fulfills the required function} \cup \text{the second machine fulfills the required function}\} = 0.9 + 0.9 - 0.9 \cdot 0.9 = 0.99.$$
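Examples A6.4 and A6.6 differ only in how the two independent events are combined. The Python sketch below puts both calculations side by side; the function names are assumptions, and the parallel case is written through the complement (the system fails only if all elements fail), which for two elements gives the same value as Eq. (A6.13).

```python
from math import prod


def series_reliability(reliabilities):
    """All independent elements are required: product rule."""
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """At least one of the independent elements must work."""
    return 1 - prod(1 - r for r in reliabilities)


def addition_theorem(r1, r2):
    """Pr{A u B} = Pr{A} + Pr{B} - Pr{A}Pr{B} for independent A and B."""
    return r1 + r2 - r1 * r2


print(round(series_reliability([0.8, 0.9]), 4))    # Example A6.4
print(round(parallel_reliability([0.9, 0.9]), 4))  # Example A6.6
print(round(addition_theorem(0.9, 0.9), 4))
```

The complement form extends directly to more than two redundant elements, whereas the addition theorem then requires the inclusion/exclusion terms.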
The addition theorem can be generalized to $n$ arbitrary events. For $n = 3$ one obtains
$$\begin{aligned}\Pr\{A \cup B \cup C\} &= \Pr\{A \cup (B \cup C)\} = \Pr\{A\} + \Pr\{B \cup C\} - \Pr\{A \cap (B \cup C)\} \\ &= \Pr\{A\} + \Pr\{B\} + \Pr\{C\} - \Pr\{B \cap C\} - \Pr\{A \cap B\} - \Pr\{A \cap C\} + \Pr\{A \cap B \cap C\}.\end{aligned} \tag{A6.14}$$
In general, $\Pr\{A_1 \cup \ldots \cup A_n\}$ can be obtained by the so-called inclusion/exclusion method
$$\Pr\{A_1 \cup \ldots \cup A_n\} = \sum_{k=1}^{n} (-1)^{k+1} S_k$$
with
$$S_k = \sum_{1 \leq i_1 < \ldots < i_k \leq n} \Pr\{A_{i_1} \cap \ldots \cap A_{i_k}\}.$$
It can be shown that $S = \Pr\{A_1 \cup \ldots \cup A_n\} \leq S_1$, $S \geq S_1 - S_2$, $S \leq S_1 - S_2 + S_3$, etc. Although the upper bounds do not necessarily decrease and the lower bounds do not necessarily increase, a good approximation for $S$ often results from only a few $S_i$. For a further investigation one can use the Fréchet theorem $S_{k+1} \leq S_k (n-k)/(k+1)$, which follows from $S_{k+1} = S_k \binom{n}{k+1} / \binom{n}{k} = S_k (n-k)/(k+1)$ for $A_1 = A_2 = \ldots = A_n$.
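The inclusion/exclusion sums and the alternating bounds can be evaluated directly on a small finite example. In the Python sketch below, the sample space of 12 equally likely outcomes and the three events are hypothetical choices made only for illustration, as are the helper names.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical sample space with 12 equally likely outcomes.
OMEGA = frozenset(range(1, 13))


def pr(event):
    """Classical probability of a subset of OMEGA."""
    return Fraction(len(event), len(OMEGA))


def s_terms(events):
    """S_k = sum of Pr{A_i1 n ... n A_ik} over all selections of k events."""
    terms = []
    for k in range(1, len(events) + 1):
        total = Fraction(0)
        for subset in combinations(events, k):
            total += pr(frozenset.intersection(*subset))
        terms.append(total)
    return terms


def partial_sums(terms):
    """Alternating partial sums S1, S1 - S2, S1 - S2 + S3, ..."""
    sums, acc = [], Fraction(0)
    for k, s_k in enumerate(terms, start=1):
        acc += (-1) ** (k + 1) * s_k
        sums.append(acc)
    return sums


events = [frozenset({1, 2, 3, 4}),
          frozenset({3, 4, 5, 6}),
          frozenset({4, 6, 7, 8})]
terms = s_terms(events)
bounds = partial_sums(terms)
exact = pr(frozenset.union(*events))

print(terms)   # S1, S2, S3
print(bounds)  # upper bound, lower bound, exact value
print(exact)
```

For these events the first partial sum lies above and the second below the probability of the union, and the full alternating sum reproduces it exactly.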
A6.4.5 Theorem of Total Probability
Let $A_1, A_2, \ldots$ be mutually exclusive events ($A_i \cap A_j = \emptyset$ for all $i \neq j$), $\Omega = A_1 \cup A_2 \cup \ldots$, and $\Pr\{A_i\} > 0$, $i = 1, 2, \ldots$. For an arbitrary event $B$ one has $B = B \cap \Omega = B \cap (A_1 \cup A_2 \cup \ldots) = (B \cap A_1) \cup (B \cap A_2) \cup \ldots$, where the events $B \cap A_1, B \cap A_2, \ldots$ are mutually exclusive. Use of Axiom 3 (Appendix A6.2) and Eq. (A6.6) yields
$$\Pr\{B\} = \sum_i \Pr\{B \cap A_i\} = \sum_i \Pr\{A_i\}\,\Pr\{B \mid A_i\}. \tag{A6.17}$$
Equation (A6.17) expresses the theorem (or formula) of total probability.
Example A6.7 ICs are purchased from 3 suppliers ($A_1$, $A_2$, $A_3$) in quantities of 1000, 600, and 400 pieces, respectively. The probabilities for an IC to be defective are 0.006 for $A_1$, 0.02 for $A_2$, and 0.03 for $A_3$. The ICs are stored in a common container disregarding their source. What is the probability that one IC randomly selected from the stock is defective?
Solution From Eqs. (A6.17) and (A6.2),
$$\Pr\{\text{the selected IC is defective}\} = \frac{1000}{2000} \cdot 0.006 + \frac{600}{2000} \cdot 0.02 + \frac{400}{2000} \cdot 0.03 = 0.015.$$
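Equations (A6.17) and (A6.6) lead to Bayes theorem, which allows calculation of the a posteriori probability $\Pr\{A_k \mid B\}$, $k = 1, 2, \ldots$, as a function of the a priori probabilities $\Pr\{A_i\}$,
$$\Pr\{A_k \mid B\} = \frac{\Pr\{A_k\}\,\Pr\{B \mid A_k\}}{\sum_i \Pr\{A_i\}\,\Pr\{B \mid A_i\}}. \tag{A6.18}$$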
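The theorem of total probability and Bayes theorem translate into a few lines of code. The Python sketch below applies them to the supplier data of Example A6.7, using exact fractions; the function and variable names are assumptions made for this illustration.

```python
from fractions import Fraction


def total_probability(priors, likelihoods):
    """Pr{B} = sum over i of Pr{A_i} Pr{B | A_i}."""
    return sum(p * l for p, l in zip(priors, likelihoods))


def posterior(priors, likelihoods):
    """Pr{A_k | B} for every k, by Bayes theorem."""
    pr_b = total_probability(priors, likelihoods)
    return [p * l / pr_b for p, l in zip(priors, likelihoods)]


# Data of Example A6.7: shares of the three suppliers in the stock
# and their probabilities of delivering a defective IC.
priors = [Fraction(1000, 2000), Fraction(600, 2000), Fraction(400, 2000)]
likelihoods = [Fraction(6, 1000), Fraction(2, 100), Fraction(3, 100)]

print(float(total_probability(priors, likelihoods)))
print([float(p) for p in posterior(priors, likelihoods)])
```

The a posteriori probabilities sum to 1, because the events "IC from supplier $A_i$" are mutually exclusive and exhaust the stock.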
Example A6.8 Let the IC as selected in Example A6.7 be defective. What is the probability that it is from supplier $A_1$?
Solution From Eq. (A6.18),
$$\Pr\{\text{IC from } A_1 \mid \text{IC defective}\} = \frac{(1000/2000) \cdot 0.006}{0.015} = 0.2.$$

A6.5
Random Variables, Distribution Functions
If the result of an experiment with a random outcome is a (real) number, then the underlying quantity is a (real) random variable. For example, the number appearing when throwing a die is a random variable taking on values in $\{1, \ldots, 6\}$. Random variables are designated hereafter with Greek letters $\tau$, $\xi$, $\zeta$, etc. The triplet $[\Omega, \mathcal{F}, \Pr]$ introduced in Appendix A6.2 becomes $[\mathbb{R}, \mathcal{B}, \Pr]$, where $\mathbb{R} = (-\infty, \infty)$ and $\mathcal{B}$ is the smallest event field containing all (semi) intervals $(a, b]$ with $a < b$. The probabilities $\Pr\{A\} = \Pr\{\tau \in A\}$, $A \in \mathcal{B}$, define the distribution law of the random variable $\tau$. Among the many possibilities to characterize this distribution law, the most frequently used is to define
$$F(t) = \Pr\{\tau \leq t\}. \tag{A6.19}$$
$F(t)$ is called the distribution function of the random variable $\tau$ +). For each $t$, $F(t)$ gives the probability that the random variable will assume a value smaller than or equal to $t$. Because for $s > t$ one has $\{\tau \leq t\} \subseteq \{\tau \leq s\}$, $F(t)$ is a nondecreasing function. Moreover, $F(-\infty) = 0$ and $F(\infty) = 1$. If $\Pr\{\tau = t_0\} > 0$ holds, then $F(t)$ has a jump of height $\Pr\{\tau = t_0\}$ at $t_0$. It follows from the above definition and Axiom 3'' (Appendix A6.2) that $F(t)$ is continuous from the right. Due to Axiom 2, $F(t)$ can have at most a countable number of jumps. The probability that the random variable $\tau$ takes on a value within the interval $(a, b]$ is given by
$$\Pr\{a < \tau \leq b\} = F(b) - F(a).$$
The following classes of random variables are of particular importance:

1. Discrete random variables: A random variable $\tau$ is discrete if it can only assume a finite or countable number of values, i.e. if there is a sequence $t_1, t_2, \ldots$ such that
$$p_k = \Pr\{\tau = t_k\}, \qquad \text{with} \quad \sum_k p_k = 1. \tag{A6.20}$$
A discrete random variable is best described by a table

Values of $\tau$:   $t_1$   $t_2$   ...
Probabilities:      $p_1$   $p_2$   ...

The distribution function $F(t)$ of a discrete random variable $\tau$ is a step function
$$F(t) = \sum_{k:\, t_k \leq t} p_k.$$
If the sequence $t_1, t_2, \ldots$ is ordered so that $t_k < t_{k+1}$, then
$$F(t) = \sum_{j \leq k} p_j, \qquad \text{for } t_k \leq t < t_{k+1}.$$
If only the value $k = 1$ occurs in Eqs. (A6.21), $\tau$ is a constant ($\tau = t_1 = C$). A constant $C$ can thus be regarded as a random variable with distribution function
$$F(t) = \begin{cases} 0 & \text{for } t < C \\ 1 & \text{for } t \geq C. \end{cases}$$
An important special case of discrete random variables is that of arithmetic random variables. The random variable $\tau$ is arithmetic if it can take the values $\ldots, -\Delta t, 0, \Delta t, \ldots$, with probabilities
$$p_k = \Pr\{\tau = k\,\Delta t\}, \qquad k = \ldots, -1, 0, 1, \ldots.$$
+) From a mathematical point of view, the random variable $\tau$ is defined as a measurable mapping of $\Omega$ onto the axis of real numbers $\mathbb{R} = (-\infty, \infty)$, i.e. a mapping such that for each real value $x$ the set of $\omega$ for which $\{\tau = \tau(\omega) \leq x\}$ belongs to $\mathcal{F}$; the distribution function of $\tau$ is then obtained by setting $F(t) = \Pr\{\tau \leq t\} = \Pr\{\omega : \tau(\omega) \leq t\}$.
2. Continuous random variables: The random variable $\tau$ is absolutely continuous if a function $f(t) \geq 0$ exists such that
$$F(t) = \Pr\{\tau \leq t\} = \int_{-\infty}^{t} f(x)\,dx.$$
$f(t)$ is called (probability) density of the random variable $\tau$ and satisfies the condition
$$\int_{-\infty}^{\infty} f(t)\,dt = 1.$$
The distribution function $F(t)$ and the density $f(t)$ are related (almost everywhere) by (see Fig. A6.2 for an example)
$$f(t) = \frac{dF(t)}{dt}.$$
Mixed distribution functions, exhibiting both jumps and continuous growth, can occur in some applications. These distribution functions can generally be represented by a mixture (weighted sum) of discrete and continuous distribution functions (Eq. (A6.34)).
Figure A6.2 Relationship between the distribution function $F(t)$ and the density $f(t)$ for a continuous random variable $\tau > 0$
In reliability theory, $\tau > 0$ denotes (in this book) the failure-free time (failure-free operating time) of an item, distributed according to $F(t) = \Pr\{\tau \le t\}$ with $F(0) = 0$. The reliability function (survival function) $R(t)$ gives the probability that the item considered will operate failure-free in $(0, t]$; thus,
$$ F(t) = \Pr\{\tau \le t\}, \quad R(t) = \Pr\{\tau > t\} = 1 - F(t), \quad \tau > 0,\ F(0) = 0,\ R(0) = 1. $$    (A6.24)
The failure rate $\lambda(t)$ of an item exhibiting a continuous failure-free time $\tau$ is defined as
$$ \lambda(t) = \lim_{\delta t \downarrow 0} \frac{1}{\delta t} \Pr\{t < \tau \le t + \delta t \mid \tau > t\}. $$
Calculation leads to (Eq. (A6.5) and Fig. A6.3)
$$ \lambda(t) = \lim_{\delta t \downarrow 0} \frac{1}{\delta t} \cdot \frac{\Pr\{t < \tau \le t + \delta t \,\cap\, \tau > t\}}{\Pr\{\tau > t\}} = \lim_{\delta t \downarrow 0} \frac{1}{\delta t} \cdot \frac{\Pr\{t < \tau \le t + \delta t\}}{\Pr\{\tau > t\}}, $$
and thus, assuming $F(t)$ derivable,
$$ \lambda(t) = \frac{f(t)}{1 - F(t)} = -\frac{dR(t)/dt}{R(t)}. $$    (A6.25)
It is important to distinguish between the failure rate $\lambda(t)$, as conditional density for failure in $(t, t + \delta t]$ given that the item was new at $t = 0$ and has not failed in $(0, t]$, and the density $f(t)$, as unconditional density for failure in $(t, t + \delta t]$ given only that the item was new at $t = 0$ (assumed with $F(0) = 0$). The failure rate $\lambda(t)$ applies in particular to nonrepairable items. However, considering Eq. (A6.25) it can also be defined for repairable items which are as-good-as-new after repair (renewal), taking instead of $t$ the variable $x$ starting by $x = 0$ at each renewal (as for interarrival times). If a repairable item cannot be restored to be as-good-as-new after repair, the failure intensity $z(t)$ (Eq. (A7.228)) has to be used (see p. 356 for a discussion). Considering $R(0) = 1$, Eq. (A6.25) yields
$$ R(t) = e^{-\int_0^t \lambda(x)\,dx}. $$    (A6.26)
Thus, $\lambda(t)$ completely defines the reliability function $R(t)$. For practical applications it can be useful to know the probability for failure-free operation in $(0, t]$ given that the item has already operated a failure-free time $x_0 > 0$
$$ \Pr\{\tau > t + x_0 \mid \tau > x_0\} = R(t, x_0) = \frac{R(t + x_0)}{R(x_0)} = e^{-\int_{x_0}^{t + x_0} \lambda(x)\,dx}. $$    (A6.27)
Figure A6.3 Visual aid to compute the failure rate $\lambda(t)$ ($\lambda(x)$ for interarrival times)
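The link between failure rate, reliability function, and conditional reliability (Eq. (A6.27)) can be illustrated numerically. The following is a minimal sketch, not part of the original text: the function names and the linear failure rate `linear_rate` are hypothetical choices made only for illustration, and the integral of the failure rate is approximated with the trapezoidal rule.

```python
import math

def reliability_from_rate(rate, t, steps=10_000):
    """R(t) = exp(-integral_0^t rate(x) dx), using the trapezoidal rule."""
    if t <= 0:
        return 1.0
    h = t / steps
    area = 0.5 * (rate(0.0) + rate(t)) + sum(rate(i * h) for i in range(1, steps))
    return math.exp(-area * h)

def conditional_reliability(rate, t, x0):
    """R(t, x0) = R(t + x0) / R(x0): survive a further t, given survival up to x0."""
    return reliability_from_rate(rate, t + x0) / reliability_from_rate(rate, x0)

def linear_rate(t):
    """Hypothetical increasing failure rate 2e-6 * t (per hour), for illustration only."""
    return 2e-6 * t

r_new = reliability_from_rate(linear_rate, 500.0)           # new item, 500 h
r_aged = conditional_reliability(linear_rate, 500.0, 1000.0)  # item already 1000 h old
```

With this increasing failure rate, `r_aged` is smaller than `r_new`, which is the aging behavior discussed in the text.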
From Eq. (A6.27) it follows that
$$ -\frac{dR(t, x_0)/dt}{R(t, x_0)} = \lambda(t, x_0) = \lambda(t + x_0) \quad \text{and} \quad E[\tau - x_0 \mid \tau > x_0] = \frac{\int_{x_0}^{\infty} R(x)\,dx}{R(x_0)}. $$    (A6.28)
From the left-hand side of Eq. (A6.28) one recognizes that the conditional failure rate $\lambda(t, x_0)$ at time $t$, given that the item has operated failure-free a time $x_0$, is the failure rate at time $t + x_0$ ($= \lambda$ for $\lambda(x) = \lambda$). This leads to the concept of bad-as-old used in some considerations on repairable systems [6.3] (see also p. 497). Important conclusions as to the aging behavior of an item can be drawn from the shape of its failure rate. For $\lambda(t)$ nondecreasing, it follows for $u < s$ and $t > 0$ that
$$ \Pr\{\tau > t + u \mid \tau > u\} \ge \Pr\{\tau > t + s \mid \tau > s\}. $$    (A6.29)
For an item with increasing failure rate, inequality (A6.29) shows that the probability of surviving a further period $t$ decreases as a function of the achieved age, i.e. the item ages. The contrary holds for an item with decreasing failure rate. No aging exists in the case of a constant failure rate, i.e. for $R(t) = e^{-\lambda t}$, yielding (memoryless property of the exponential distribution)
$$ \Pr\{\tau > t + x_0 \mid \tau > x_0\} = R(t, x_0) = R(t) = e^{-\lambda t}. $$    (A6.30)
For an arithmetic random variable, the failure rate is defined as
$$ \lambda(k) = \Pr\{\tau = k\,\Delta t \mid \tau > (k-1)\,\Delta t\} = \frac{p_k}{\sum_{i \ge k} p_i}, \quad k = 1, 2, \ldots. $$
The following concepts are important to reliability theory (see also Eqs. (A6.78) & (A6.79) for the minimum $\tau_{\min}$ and maximum $\tau_{\max}$ of a set of random variables $\tau_1, \ldots, \tau_n$):
1. Function of a random variable: If $u(x)$ is a monotonically increasing function and $\tau$ a continuous random variable with distribution function $F_\tau(t)$, then $\Pr\{\tau \le t\} = \Pr\{\eta = u(\tau) \le u(t)\}$, and the random variable $\eta = u(\tau)$ has distribution function
$$ F_\eta(t) = \Pr\{\eta = u(\tau) \le t\} = \Pr\{\tau \le u^{-1}(t)\} = F_\tau(u^{-1}(t)), $$    (A6.31)
where $u^{-1}$ is the inverse function of $u$ (Example A6.17). If $du(t)/dt$ exists,
$$ f_\eta(t) = f_\tau(u^{-1}(t)) \cdot \frac{du^{-1}(t)}{dt}. $$
(For $u(\tau)$ monotonically decreasing, $|du^{-1}(t)/dt|$ has to be used for $f_\eta(t)$.)
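2. Distribution with random parameter: If the distribution function of $\tau$ depends on a parameter $\delta$ with density $f_\delta(x)$, then for $\tau$ it holds that
$$ F(t) = \Pr\{\tau \le t\} = \int_{-\infty}^{\infty} F(t, x)\, f_\delta(x)\,dx. $$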
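3. Truncated distribution: In some practical applications it can be assumed that realizations $\le a$ or $> b$ of a random variable $\tau$ with distribution function $F(t)$ are discarded (e.g. lifetimes $\le 0$). For a truncated random variable it holds that
$$ F(t \mid a < \tau \le b) = \begin{cases} 0 & \text{for } t \le a, \\ \dfrac{F(t) - F(a)}{F(b) - F(a)} & \text{for } a < t \le b, \\ 1 & \text{for } t > b. \end{cases} $$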
4. Mixture of distributions: In many practical applications the situation arises in which two or more failure mechanisms have to be considered for a given item. The following are some examples for the case of two failure mechanisms (e.g. early failures and wearout, early failures and constant failure rate, etc.) appearing with distribution functions $F_1(t)$ and $F_2(t)$, respectively:
• for any given item, only early failures (with probability $p$) or wearout (with probability $1 - p$) can appear,
• both failure mechanisms can appear in any item,
• a percentage $p$ will show both failure mechanisms and $1 - p$ only one failure mechanism, e.g. wearout governed by $F_2(t)$.
The distribution function $F(t)$ of the failure-free time is in these cases:
$$ F(t) = p\,F_1(t) + (1 - p)\,F_2(t), $$
$$ F(t) = 1 - (1 - F_1(t))(1 - F_2(t)) = F_1(t) + F_2(t) - F_1(t)\,F_2(t), $$
$$ F(t) = p\,[F_1(t) + F_2(t) - F_1(t)\,F_2(t)] + (1 - p)\,F_2(t). $$    (A6.34)
The first case gives a mixture with weights $p$ and $1 - p$ (Example 7.16). The second case corresponds to a series model with two independent elements (Eq. (2.17)). The third case is a combination of both previous cases. The main properties of the distribution functions frequently used in reliability theory are summarized in Table A6.1 and discussed in Appendix A6.10.
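The three ways of combining two failure mechanisms described above can be compared with a few lines of code. This is an illustrative sketch only: the Weibull form $1 - e^{-(\lambda t)^\beta}$ is the one introduced in Appendix A6.10.2, while the parameter values and the names `F_early` and `F_wearout` are hypothetical.

```python
import math

def weibull_cdf(lam, beta):
    """Return the function t -> 1 - exp(-(lam*t)**beta) (zero for t <= 0)."""
    def F(t):
        return 1.0 - math.exp(-((lam * t) ** beta)) if t > 0 else 0.0
    return F

def mixture_cdf(t, p, F1, F2):
    """Case 1: each item shows either mechanism 1 (prob. p) or mechanism 2."""
    return p * F1(t) + (1.0 - p) * F2(t)

def series_cdf(t, F1, F2):
    """Case 2: both mechanisms act independently in every item."""
    return 1.0 - (1.0 - F1(t)) * (1.0 - F2(t))

def combined_cdf(t, p, F1, F2):
    """Case 3: a fraction p shows both mechanisms, the rest only mechanism 2."""
    return p * series_cdf(t, F1, F2) + (1.0 - p) * F2(t)

# Hypothetical parameters: early failures (beta < 1) and wearout (beta > 1)
F_early = weibull_cdf(lam=1e-3, beta=0.5)
F_wearout = weibull_cdf(lam=1e-4, beta=3.0)
```

For `p = 1` the third case reduces to the series model, and for `p = 0` to the wearout distribution alone.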
A6.6 Numerical Parameters of Random Variables
For a rough characterization of a random variable $\tau$, some typical values such as the expected value (mean), variance, and median can be used.
A6.6.1 Expected Value (Mean)
For a discrete random variable $\tau$ taking values $t_1, t_2, \ldots$, with probabilities $p_1, p_2, \ldots$, the expected value or mean $E[\tau]$ is given by
$$ E[\tau] = \sum_k t_k\, p_k, $$    (A6.35)
provided the series converges absolutely. If $\tau$ only takes the values $t_1, \ldots, t_m$, Eq. (A6.35) can be heuristically explained as follows. Consider $n$ repetitions of a trial whose outcome is $\tau$ and assume that $k_1$ times the value $t_1$, ..., $k_m$ times the value $t_m$ has been observed ($n = k_1 + \ldots + k_m$); the arithmetic mean of the observed values is
$$ \frac{t_1 k_1 + \ldots + t_m k_m}{n} = t_1 \frac{k_1}{n} + \ldots + t_m \frac{k_m}{n}. $$
As $n \to \infty$, $k_i / n$ converges to $p_i$ (Eq. (A6.146)), and the arithmetic mean obtained above tends towards the expected value $E[\tau]$ given by Eq. (A6.35). For this reason, the terms expected value and mean are often used for the same quantity $E[\tau]$. From Eq. (A6.35), the mean of a constant $C$ is the constant itself, i.e. $E[C] = C$. The mean of a continuous random variable $\tau$ with density $f(t)$ is given by
$$ E[\tau] = \int_{-\infty}^{\infty} t\, f(t)\,dt, $$    (A6.36)
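provided the integral converges absolutely. For positive continuous random variables, Eq. (A6.36) reduces to
$$ E[\tau] = \int_0^{\infty} t\, f(t)\,dt, $$    (A6.37)
which, for $E[\tau] < \infty$, can be expressed (Example A6.9) as
$$ E[\tau] = \int_0^{\infty} (1 - F(t))\,dt = \int_0^{\infty} R(t)\,dt. $$    (A6.38)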
Example A6.9
Prove the equivalence of Eqs. (A6.37) and (A6.38).
Solution
$R(t) = 1 - F(t) = \int_t^{\infty} f(x)\,dx$ yields
$$ \int_0^{\infty} R(t)\,dt = \int_0^{\infty} \Big( \int_t^{\infty} f(x)\,dx \Big)\,dt. $$
Changing the order of integration it follows that (see graph)
$$ \int_0^{\infty} R(t)\,dt = \int_0^{\infty} \Big( \int_0^{x} dt \Big) f(x)\,dx = \int_0^{\infty} x\, f(x)\,dx. $$
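The equivalence proved in Example A6.9 can also be checked numerically. The sketch below is an illustration under stated assumptions: it uses a Weibull distribution (Appendix A6.10.2) with the hypothetical parameters `LAM = 1` and `BETA = 2`, truncates the integrals at an upper limit of 10 where the integrands are negligible, and uses a simple trapezoidal rule.

```python
import math

def trapezoid(g, a, b, steps=20_000):
    """Composite trapezoidal rule for the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * (0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, steps)))

LAM, BETA = 1.0, 2.0  # hypothetical Weibull parameters

def weibull_density(t):
    return BETA * LAM * (LAM * t) ** (BETA - 1) * math.exp(-((LAM * t) ** BETA))

def weibull_reliability(t):
    return math.exp(-((LAM * t) ** BETA))

# Mean as integral of t*f(t) and as integral of R(t)
mean_from_density = trapezoid(lambda t: t * weibull_density(t), 0.0, 10.0)
mean_from_reliability = trapezoid(weibull_reliability, 0.0, 10.0)
```

Both integrals agree with each other and with the closed form $\Gamma(1 + 1/\beta)/\lambda$ for the Weibull mean.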
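Table A6.1 Distribution functions used in reliability analysis (with $x$ instead of $t$ for interarrival times)

Columns: Name | Distribution function $F(t) = \Pr\{\tau \le t\}$ | Density $f(t) = dF(t)/dt$ | Parameter range | Failure rate $\lambda(t) = f(t)/(1 - F(t))$ | Mean $E[\tau]$ | Properties

Exponential | Properties: memoryless, $\Pr\{\tau > t + x_0 \mid \tau > x_0\} = \Pr\{\tau > t\} = e^{-\lambda t}$
Weibull | Properties: monotonic failure rate, increasing for $\beta > 1$ ($\lambda(0) = 0$, $\lambda(\infty) = \infty$), decreasing for $\beta < 1$ ($\lambda(0) = \infty$, $\lambda(\infty) = 0$)
Gamma | Properties: Laplace transform exists, $\tilde f(s) = \lambda^\beta / (s + \lambda)^\beta$; monotonic failure rate with $\lambda(\infty) = \lambda$; exponential for $\beta = 1$, Erlangian for $\beta = n = 2, 3, \ldots$ (sum of $n$ exponentially distributed random variables)
Chi-square ($\chi^2$) | Parameter range: $t > 0$, $F(0) = 0$, $\nu = 1, 2, \ldots$ (degrees of freedom) | Properties: Gamma with $\beta = \nu/2$ and $\lambda = 1/2$
Normal | Distribution function: $\frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{t} e^{-(x - m)^2 / (2\sigma^2)}\,dx$
Lognormal | Properties: $\ln\tau$ has a normal distribution; $F(t) = \Phi(\ln(\lambda t)/\sigma)$
Binomial | Failure rate: not relevant | Properties: $p_i = \Pr\{i \text{ successes in } n \text{ Bernoulli trials}\}$ ($n$ independent trials with $\Pr\{A\} = p$); random sample with replacement
Poisson | Failure rate: not relevant
Geometric | Distribution function: $\Pr\{\zeta \le k\} = \sum_{i=1}^{k} p_i = 1 - (1 - p)^k$, with $p_i = p\,(1 - p)^{i-1}$ | Properties: memoryless, $\Pr\{\zeta > i + j \mid \zeta > i\} = (1 - p)^j$; $p_i = \Pr\{\text{first success in a sequence of Bernoulli trials occurs at the } i\text{th trial}\}$
Hypergeometric | Distribution function: $\Pr\{\zeta \le k\} = \sum_{i=0}^{k} \binom{K}{i}\binom{N - K}{n - i} \big/ \binom{N}{n}$ | Failure rate: not relevant | Properties: random sample without replacement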
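For the expected value of the random variable $\eta = u(\tau)$
$$ E[\eta] = \sum_k u(t_k)\, p_k \quad \text{or} \quad E[\eta] = \int_{-\infty}^{\infty} u(t)\, f(t)\,dt $$    (A6.39)
holds, provided that series and integral converge absolutely. Two particular cases of Eq. (A6.39) are:
1. $u(x) = Cx$,
$$ E[C\tau] = \int_{-\infty}^{\infty} C\,t\,f(t)\,dt = C\,E[\tau]. $$    (A6.40)
2. $u(x) = x^k$, which leads to the $k$th moment of $\tau$,
$$ E[\tau^k] = \int_{-\infty}^{\infty} t^k f(t)\,dt, \quad k = 1, 2, \ldots. $$    (A6.41)
Further important properties of the mean are given by Eqs. (A6.68) and (A6.69).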
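A6.6.2 Variance
The variance of a random variable $\tau$ is a measure of the spread (or dispersion) of the random variable around its mean $E[\tau]$. Variance is defined as
$$ \mathrm{Var}[\tau] = E[(\tau - E[\tau])^2], $$    (A6.42)
and can be calculated as
$$ \mathrm{Var}[\tau] = \sum_k (t_k - E[\tau])^2\, p_k $$    (A6.43)
for a discrete random variable, and as
$$ \mathrm{Var}[\tau] = \int_{-\infty}^{\infty} (t - E[\tau])^2 f(t)\,dt $$    (A6.44)
for a continuous random variable. In both cases,
$$ \mathrm{Var}[\tau] = E[\tau^2] - (E[\tau])^2. $$    (A6.45)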
If $E[\tau]$ or $\mathrm{Var}[\tau]$ is infinite, $\tau$ is said to have an infinite variance. For arbitrary constants $C$ and $A$, Eqs. (A6.45) and (A6.40) yield
$$ \mathrm{Var}[C\tau - A] = C^2\, \mathrm{Var}[\tau] $$
and $\mathrm{Var}[C] = 0$. The quantity
$$ \sigma = \sqrt{\mathrm{Var}[\tau]} $$
is the standard deviation of $\tau$ and, for $\tau \ge 0$,
$$ \kappa = \sigma / E[\tau] $$
is the coefficient of variation of $\tau$. The random variable $(\tau - E[\tau])/\sigma$ has mean 0 and variance 1, and is a standardized random variable. A good understanding of the variance as a measure of dispersion is given by the Chebyshev inequality, which states (Example A6.10) that for every $\varepsilon > 0$
$$ \Pr\{|\tau - E[\tau]| > \varepsilon\} \le \frac{\mathrm{Var}[\tau]}{\varepsilon^2}. $$    (A6.49)
The Chebyshev inequality (known also as Bienaymé-Chebyshev inequality) is more useful in proving convergence than as an approximation. Further important properties of the variance are given by Eqs. (A6.70) and (A6.71).
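Example A6.10
Prove the Chebyshev inequality for a continuous random variable (Eq. (A6.49)).
Solution
For a continuous random variable $\tau$ with density $f(t)$, the definition of the variance implies
$$ \Pr\{|\tau - E[\tau]| > \varepsilon\} = \int_{|t - E[\tau]| > \varepsilon} f(t)\,dt \le \int_{|t - E[\tau]| > \varepsilon} \frac{(t - E[\tau])^2}{\varepsilon^2} f(t)\,dt \le \int_{-\infty}^{\infty} \frac{(t - E[\tau])^2}{\varepsilon^2} f(t)\,dt = \frac{\mathrm{Var}[\tau]}{\varepsilon^2}, $$
which proves Eq. (A6.49).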
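How conservative the Chebyshev bound is can be seen by comparing it with an exact probability. The sketch below assumes, purely for illustration, an exponentially distributed random variable with rate `lam` (mean $1/\lambda$, variance $1/\lambda^2$); the function names are hypothetical.

```python
import math

def chebyshev_bound(variance, eps):
    """Upper bound Var/eps^2 for Pr{|tau - E[tau]| > eps}, capped at 1."""
    return min(1.0, variance / eps ** 2)

def exact_exponential_deviation(lam, eps):
    """Pr{|tau - 1/lam| > eps} for tau exponentially distributed with rate lam."""
    mean = 1.0 / lam
    upper = math.exp(-lam * (mean + eps))  # Pr{tau > mean + eps}
    # Pr{tau < mean - eps}; zero when mean - eps is not positive
    lower = 1.0 - math.exp(-lam * (mean - eps)) if eps < mean else 0.0
    return lower + upper

lam = 1.0
comparison = [
    (eps, exact_exponential_deviation(lam, eps), chebyshev_bound(1.0 / lam ** 2, eps))
    for eps in (0.5, 1.0, 2.0, 3.0)
]
```

In every row of `comparison` the exact probability lies below the bound, often by a wide margin, which is consistent with the remark that the inequality is more useful for proving convergence than as an approximation.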
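Generalization of the exponent in Eqs. (A6.43) and (A6.44) leads to the $k$th central moment of $\tau$
$$ E[(\tau - E[\tau])^k] = \int_{-\infty}^{\infty} (t - E[\tau])^k f(t)\,dt, \quad k = 3, 4, \ldots. $$    (A6.50)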
A6.6.3 Modal Value, Quantile, Median
In addition to the moments discussed in Appendices A6.6.1 and A6.6.2, the modal value, quantile, and median are defined as follows:
1. For a continuous random variable $\tau$, the modal value is the value of $t$ for which $f(t)$ reaches its maximum; the distribution of $\tau$ is multimodal if $f(t)$ exhibits more than one maximum.
2. The $q$ quantile is the value $t_q$ for which $F(t)$ reaches the value $q$, $t_q = \inf\{t : F(t) \ge q\}$; in general, $F(t_q) = q$ for a continuous random variable ($t_p$, for which $1 - F(t_p) = Q(t_p) = p$, is termed percentage point).
3. The 0.5 quantile ($t_{0.5}$) is the median.
A6.7 Multidimensional Random Variables, Conditional Distributions
Multidimensional random variables (random vectors) are often required in reliability and availability investigations of repairable systems. For random vectors, the outcome of an experiment is an element of the $n$-dimensional space $\mathbb{R}^n$. The probability space $[\Omega, \mathcal{F}, \Pr]$ introduced in Appendix A6.1 becomes $[\mathbb{R}^n, \mathcal{B}^n, \hat{\Pr}]$, where $\mathcal{B}^n$ is the smallest event field which contains all "intervals" of the form $(a_1, b_1] \times \ldots \times (a_n, b_n] = \{(t_1, \ldots, t_n) : t_i \in (a_i, b_i],\ i = 1, \ldots, n\}$. Random vectors are designated by Greek letters with an arrow ($\vec\tau = (\tau_1, \ldots, \tau_n)$, $\vec\xi = (\xi_1, \ldots, \xi_n)$, etc.). The probabilities $\Pr\{A\} = \Pr\{\vec\tau \in A\}$, with $A \in \mathcal{B}^n$, define the distribution law of $\vec\tau$. The function
$$ F(t_1, \ldots, t_n) = \Pr\{\tau_1 \le t_1, \ldots, \tau_n \le t_n\}, $$    (A6.51)
where $\{\tau_1 \le t_1, \ldots, \tau_n \le t_n\} = \{(\tau_1 \le t_1) \cap \ldots \cap (\tau_n \le t_n)\}$, is the distribution function of the random vector $\vec\tau$, known as joint distribution function of $\tau_1, \ldots, \tau_n$. $F(t_1, \ldots, t_n)$ is:
• monotonically nondecreasing in each variable,
• zero (in the limit) if at least one variable goes to $-\infty$,
• one (in the limit) if all variables go to $\infty$,
• continuous from the right in each variable,
• such that the probabilities $\Pr\{a_1 < \tau_1 \le b_1, \ldots, a_n < \tau_n \le b_n\}$, calculated for arbitrary $a_1, \ldots, a_n, b_1, \ldots, b_n$ with $a_i < b_i$, are not negative; for example, $n = 2$ yields
$$ \Pr\{a_1 < \tau_1 \le b_1,\ a_2 < \tau_2 \le b_2\} = F(b_1, b_2) - F(a_1, b_2) - F(b_1, a_2) + F(a_1, a_2). $$
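It can be shown that every component $\tau_i$ of $\vec\tau = (\tau_1, \ldots, \tau_n)$ is a random variable with distribution function, marginal distribution function,
$$ F_i(t_i) = \Pr\{\tau_i \le t_i\} = F(\infty, \ldots, \infty, t_i, \infty, \ldots, \infty). $$    (A6.52)
The components $\tau_1, \ldots, \tau_n$ of $\vec\tau$ are (stochastically) independent if and only if, for any $n$ and $n$-tuple $(t_1, \ldots, t_n) \in \mathbb{R}^n$,
$$ F(t_1, \ldots, t_n) = \prod_{i=1}^{n} F_i(t_i). $$    (A6.53)
It can be shown that Eq. (A6.53) is equivalent to
$$ \Pr\Big\{ \bigcap_{i=1}^{n} (\tau_i \in B_i) \Big\} = \prod_{i=1}^{n} \Pr\{\tau_i \in B_i\} $$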
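for every $B_i \in \mathcal{B}^n$.
The random vector $\vec\tau = (\tau_1, \ldots, \tau_n)$ is absolutely continuous if a function $f(x_1, \ldots, x_n) \ge 0$ exists such that for any $n$ and $n$-tuple $t_1, \ldots, t_n$
$$ F(t_1, \ldots, t_n) = \int_{-\infty}^{t_1} \cdots \int_{-\infty}^{t_n} f(x_1, \ldots, x_n)\,dx_1 \ldots dx_n. $$    (A6.54)
$f(x_1, \ldots, x_n)$ is the density of $\vec\tau$, known also as joint density of $\tau_1, \ldots, \tau_n$, and satisfies the condition
$$ \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, \ldots, x_n)\,dx_1 \ldots dx_n = 1. $$
For any subset $A \in \mathcal{B}^n$, it follows that
$$ \Pr\{(\tau_1, \ldots, \tau_n) \in A\} = \int \cdots \int_{A} f(t_1, \ldots, t_n)\,dt_1 \ldots dt_n. $$    (A6.55)
The density of $\tau_i$, marginal density, can be obtained from $f(t_1, \ldots, t_n)$ as
$$ f_i(t_i) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(t_1, \ldots, t_n)\,dt_1 \ldots dt_{i-1}\,dt_{i+1} \ldots dt_n. $$    (A6.56)
The components $\tau_1, \ldots, \tau_n$ of a continuous random vector $\vec\tau$ are (stochastically) independent if and only if, for any $n$ and $n$-tuple $t_1, \ldots, t_n \in \mathbb{R}^n$,
$$ f(t_1, \ldots, t_n) = \prod_{i=1}^{n} f_i(t_i). $$    (A6.57)
For a two dimensional continuous random vector $\vec\tau = (\tau_1, \tau_2)$, the function
$$ f_2(t_2 \mid t_1) = \frac{f(t_1, t_2)}{f_1(t_1)} $$    (A6.58)
is the conditional density of $\tau_2$ under the condition $\tau_1 = t_1$, with $f_1(t_1) > 0$. Similarly, $f_1(t_1 \mid t_2) = f(t_1, t_2) / f_2(t_2)$ is the conditional density for $\tau_1$ given $\tau_2 = t_2$, with $f_2(t_2) > 0$. For the marginal density of $\tau_2$ it follows that
$$ f_2(t_2) = \int_{-\infty}^{\infty} f(t_1, t_2)\,dt_1 = \int_{-\infty}^{\infty} f_1(t_1)\, f_2(t_2 \mid t_1)\,dt_1. $$    (A6.59)
Therefore, for any $A \in \mathcal{B}^1$
$$ \Pr\{\tau_2 \in A\} = \int_A f_2(t_2)\,dt_2 = \int_A \Big( \int_{-\infty}^{\infty} f_1(t_1)\, f_2(t_2 \mid t_1)\,dt_1 \Big)\,dt_2, $$    (A6.60)
and in particular
$$ F_2(t) = \Pr\{\tau_2 \le t\} = \int_{-\infty}^{t} f_2(t_2)\,dt_2 = \int_{-\infty}^{t} \Big( \int_{-\infty}^{\infty} f_1(t_1)\, f_2(t_2 \mid t_1)\,dt_1 \Big)\,dt_2. $$    (A6.61)
Equations (A6.58) & (A6.59) lead to the Bayes theorem for continuous random variables
$$ f_2(t_2 \mid t_1) = \frac{f_2(t_2)\, f_1(t_1 \mid t_2)}{\int_{-\infty}^{\infty} f_2(t_2)\, f_1(t_1 \mid t_2)\,dt_2}, $$
used in Bayesian statistics.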
A6.8 Numerical Parameters of Random Vectors
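Let $\vec\tau = (\tau_1, \ldots, \tau_n)$ be a random vector, and $u$ a real-valued function in $\mathbb{R}^n$. The expected value or mean of the random variable $u(\vec\tau)$ is
$$ E[u(\vec\tau)] = \sum_{k_1} \cdots \sum_{k_n} u(t_{1,k_1}, \ldots, t_{n,k_n})\, p(k_1, \ldots, k_n) $$    (A6.62)
for the discrete case, where $p(k_1, \ldots, k_n) = \Pr\{\tau_1 = t_{1,k_1} \cap \ldots \cap \tau_n = t_{n,k_n}\}$, and
$$ E[u(\vec\tau)] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} u(t_1, \ldots, t_n)\, f(t_1, \ldots, t_n)\,dt_1 \ldots dt_n $$    (A6.63)
for the continuous case, assuming that series and integral converge absolutely. The conditional expected value of $\tau_2$ given $\tau_1 = t_1$ follows, in the continuous case, from Eqs. (A6.36) and (A6.58) as
$$ E[\tau_2 \mid \tau_1 = t_1] = \int_{-\infty}^{\infty} t_2\, f_2(t_2 \mid t_1)\,dt_2. $$    (A6.64)
Thus the unconditional expected value of $\tau_2$ can be obtained from
$$ E[\tau_2] = \int_{-\infty}^{\infty} E[\tau_2 \mid \tau_1 = t_1]\, f_1(t_1)\,dt_1. $$    (A6.65)
Equation (A6.65) is known as the formula of total expectation and is useful in practical applications.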
A6.8.1 Covariance Matrix, Correlation Coefficient
Assuming for $\vec\tau = (\tau_1, \ldots, \tau_n)$ that $\mathrm{Var}[\tau_i] < \infty$, $i = 1, \ldots, n$, an important rough characterization of a random vector is the covariance matrix $|a_{ij}|$, where
$$ a_{ij} = \mathrm{Cov}[\tau_i, \tau_j] = E[(\tau_i - E[\tau_i])(\tau_j - E[\tau_j])] $$
are given in the continuous case by
$$ a_{ij} = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} (t_i - E[\tau_i])(t_j - E[\tau_j])\, f(t_1, \ldots, t_n)\,dt_1 \ldots dt_n. $$    (A6.66)
The diagonal elements of the covariance matrix are the variances of the components $\tau_i$, $i = 1, \ldots, n$. Elements outside the diagonal give a measure of the degree of dependency between components (obviously $a_{ij} = a_{ji}$). For $\tau_i$ independent of $\tau_j$, $a_{ij} = a_{ji} = 0$ holds.
For a two dimensional random vector $\vec\tau = (\tau_1, \tau_2)$, the quantity
$$ \rho = \frac{\mathrm{Cov}[\tau_1, \tau_2]}{\sigma_1\, \sigma_2} $$    (A6.67)
is the correlation coefficient of the random variables $\tau_1$ and $\tau_2$, provided $\sigma_i = \sqrt{\mathrm{Var}[\tau_i]} < \infty$, $i = 1, 2$.
The main properties of the correlation coefficient are:
1. $|\rho| \le 1$,
2. if $\tau_1$ and $\tau_2$ are independent, then $\rho = 0$,
3. $\rho = \pm 1$ if and only if $\tau_1$ and $\tau_2$ are linearly dependent.
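A6.8.2 Further Properties of Expected Value and Variance
Let $\tau_1, \ldots, \tau_n$ be arbitrary random variables (components of a random vector $\vec\tau$) having finite variances and $C_1, \ldots, C_n$ constants. From Eqs. (A6.62) or (A6.63) and (A6.40) it follows that
$$ E[C_1 \tau_1 + \ldots + C_n \tau_n] = C_1 E[\tau_1] + \ldots + C_n E[\tau_n]. $$    (A6.68)
If $\tau_1$ and $\tau_2$ are independent then, from Eq. (A6.63) and Eq. (A6.45),
$$ E[\tau_1 \tau_2] = E[\tau_1]\,E[\tau_2] \quad \text{and} \quad \mathrm{Var}[\tau_1 \tau_2] = E[\tau_1^2]\,E[\tau_2^2] - E^2[\tau_1]\,E^2[\tau_2]. $$    (A6.69)
The variance of a sum of independent random variables $\tau_1, \ldots, \tau_n$ is obtained from Eqs. (A6.62) or (A6.63) and (A6.69) as
$$ \mathrm{Var}[\tau_1 + \ldots + \tau_n] = \mathrm{Var}[\tau_1] + \ldots + \mathrm{Var}[\tau_n]. $$    (A6.70)
For a sum of arbitrary random variables $\tau_1, \ldots, \tau_n$, the variance can be obtained for $i, j \in \{1, \ldots, n\}$ as
$$ \mathrm{Var}[\tau_1 + \ldots + \tau_n] = \sum_{j=1}^{n} \mathrm{Var}[\tau_j] + \sum_{i \ne j} \mathrm{Cov}[\tau_i, \tau_j]. $$    (A6.71)

A6.9 Distribution of the Sum of Independent Positive Random Variables and of $\tau_{\min}$, $\tau_{\max}$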
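Let $\tau_1$ and $\tau_2$ be independent non-negative arithmetic random variables with $a_i = \Pr\{\tau_1 = i\}$, $b_i = \Pr\{\tau_2 = i\}$, $i = 0, 1, \ldots$. Obviously, $\tau_1 + \tau_2$ is also arithmetic, and therefore
$$ c_k = \Pr\{\tau_1 + \tau_2 = k\} = \sum_{i=0}^{k} a_i\, b_{k-i}, \quad k = 0, 1, \ldots. $$    (A6.72)
The sequence $c_0, c_1, \ldots$ is the convolution of the sequences $a_0, a_1, \ldots$ and $b_0, b_1, \ldots$. Now, let $\tau_1$ and $\tau_2$ be two independent positive continuous random variables with distribution functions $F_1(t)$, $F_2(t)$ and densities $f_1(t)$, $f_2(t)$, respectively ($F_1(0) = F_2(0) = 0$). Using Eq. (A6.55), it can be shown (Example A6.11 and Fig. A6.4) that for the distribution of $\eta = \tau_1 + \tau_2$
$$ F_\eta(t) = \Pr\{\eta \le t\} = \int_0^t f_1(x)\, F_2(t - x)\,dx $$    (A6.73)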
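holds, and
$$ f_\eta(t) = \int_0^t f_1(x)\, f_2(t - x)\,dx. $$    (A6.74)
Figure A6.4 Visual aid to compute the distribution of $\eta = \tau_1 + \tau_2$ ($\tau_1, \tau_2 > 0$)
The extension to two independent continuous random variables $\tau_1$ and $\tau_2$ defined over $(-\infty, \infty)$ leads to
$$ F_\eta(t) = \int_{-\infty}^{\infty} f_1(x)\, F_2(t - x)\,dx \quad \text{and} \quad f_\eta(t) = \int_{-\infty}^{\infty} f_1(x)\, f_2(t - x)\,dx. $$
The right-hand side of Eq. (A6.74) represents the convolution of the densities $f_1(t)$ and $f_2(t)$, and will be denoted by
$$ \int_0^t f_1(x)\, f_2(t - x)\,dx = f_1(t) * f_2(t). $$    (A6.75)
The Laplace transform (Appendix A9.7) of $f_\eta(t)$ is thus the product of the Laplace transforms of $f_1(t)$ and $f_2(t)$
$$ \tilde f_\eta(s) = \tilde f_1(s)\, \tilde f_2(s). $$    (A6.76)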
Example A6.11
Prove Eq. (A6.74).
Solution
Let $\tau_1$ and $\tau_2$ be two independent positive and continuous random variables with distribution functions $F_1(t)$, $F_2(t)$ and densities $f_1(t)$, $f_2(t)$, respectively ($F_1(0) = F_2(0) = 0$). From Eq. (A6.55) with $f(x, y) = f_1(x)\, f_2(y)$ it follows that (see also the graph)
$$ F_\eta(t) = \Pr\{\eta = \tau_1 + \tau_2 \le t\} = \iint_{x + y \le t} f_1(x)\, f_2(y)\,dx\,dy = \int_0^t \Big( \int_0^{t - x} f_2(y)\,dy \Big) f_1(x)\,dx = \int_0^t F_2(t - x)\, f_1(x)\,dx, $$
which proves Eq. (A6.73). Eq. (A6.74) follows with $F_2(0) = 0$ (Eq. (A6.74) follows also from Eq. (A6.65)).
Sums of positive random variables occur in reliability theory when investigating repairable systems (e.g. Example 6.12). For $n \ge 2$, the density $f_\eta(t)$ of $\eta = \tau_1 + \ldots + \tau_n$ for independent positive continuous random variables $\tau_1, \ldots, \tau_n$ follows as
$$ f_\eta(t) = f_1(t) * \ldots * f_n(t). $$    (A6.77)
Example A6.12
Two machines are used to increase the reliability of a system. The first is switched on at time $t = 0$, and the second at the time of failure of the first one (standby redundancy). The failure-free times of the machines, denoted by $\tau_1$ and $\tau_2$, are independent and exponentially distributed with parameter $\lambda$ (Eq. (A6.81)). What is the reliability function of the system?
Solution
From $R_S(t) = \Pr\{\tau_1 + \tau_2 > t\} = 1 - \Pr\{\tau_1 + \tau_2 \le t\}$ and Eq. (A6.73) it follows that
$$ R_S(t) = 1 - \int_0^t \lambda e^{-\lambda x} \big(1 - e^{-\lambda (t - x)}\big)\,dx = e^{-\lambda t} + \lambda t\, e^{-\lambda t}. $$
$R_S(t)$ gives the probability for no failures ($e^{-\lambda t}$) or exactly one failure ($\lambda t\, e^{-\lambda t}$) in $(0, t]$.
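The result of Example A6.12 can be cross-checked by evaluating the convolution integral of Eq. (A6.73) numerically. The following is a sketch with hypothetical function names and an arbitrarily chosen rate; the integral is approximated with the trapezoidal rule and compared with the closed form $e^{-\lambda t}(1 + \lambda t)$.

```python
import math

def standby_reliability_numeric(lam, t, steps=2_000):
    """1 - F_eta(t), with F_eta(t) = integral_0^t f1(x) * F2(t - x) dx."""
    if t <= 0:
        return 1.0
    h = t / steps

    def g(x):
        # density of the first machine times distribution function of the second
        return lam * math.exp(-lam * x) * (1.0 - math.exp(-lam * (t - x)))

    integral = h * (0.5 * (g(0.0) + g(t)) + sum(g(i * h) for i in range(1, steps)))
    return 1.0 - integral

def standby_reliability_closed(lam, t):
    """Closed form for two identical exponential machines in standby redundancy."""
    return math.exp(-lam * t) * (1.0 + lam * t)

lam = 1e-3  # hypothetical failure rate, per hour
numeric = standby_reliability_numeric(lam, 1000.0)
closed = standby_reliability_closed(lam, 1000.0)
```

For $\lambda t = 1$ both values are close to $2/e$, compared with $1/e$ for a single machine.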
Other important distribution functions for reliability analyses are the minimum $\tau_{\min}$ and the maximum $\tau_{\max}$ of a finite set of positive, independent random variables $\tau_1, \ldots, \tau_n$; for instance, as failure-free time of a series or a 1-out-of-n parallel system, respectively. If $\tau_1, \ldots, \tau_n$ are independent positive random variables with distribution functions $F_i(t) = \Pr\{\tau_i \le t\}$, $i = 1, \ldots, n$, then
$$ \Pr\{\tau_{\min} > t\} = \Pr\{\tau_1 > t \cap \ldots \cap \tau_n > t\} = \prod_{i=1}^{n} (1 - F_i(t)), $$    (A6.78)
and
$$ \Pr\{\tau_{\max} \le t\} = \Pr\{\tau_1 \le t \cap \ldots \cap \tau_n \le t\} = \prod_{i=1}^{n} F_i(t). $$    (A6.79)
It can be noted that the failure rate related to $\tau_{\min}$ is given by
$$ \lambda_S(t) = \lambda_1(t) + \ldots + \lambda_n(t), $$    (A6.80)
where $\lambda_i(t)$ is the failure rate related to $F_i(t)$. The distribution of $\tau_{\min}$ leads for $F_1(t) = \ldots = F_n(t)$ and $n \to \infty$ to the Weibull distribution [A6.8]. For the mixture of distribution functions one refers to the considerations given by Eqs. (A6.34) & (2.15).
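The product formulas for the minimum and the maximum translate directly into code. The sketch below is illustrative only: the helper names and the three failure rates are hypothetical, and exponentially distributed elements are assumed so that the series result can be compared with $e^{-(\lambda_1 + \ldots + \lambda_n) t}$.

```python
import math

def exponential_cdf(lam):
    """Return the function t -> 1 - exp(-lam*t) (zero for t <= 0)."""
    return lambda t: (1.0 - math.exp(-lam * t)) if t > 0 else 0.0

def series_reliability(t, cdfs):
    """Pr{tau_min > t}: product of the element reliabilities 1 - F_i(t)."""
    return math.prod(1.0 - F(t) for F in cdfs)

def parallel_cdf(t, cdfs):
    """Pr{tau_max <= t}: product of the element distribution functions F_i(t)."""
    return math.prod(F(t) for F in cdfs)

rates = [1e-4, 2e-4, 5e-4]  # hypothetical failure rates, per hour
elements = [exponential_cdf(r) for r in rates]

r_series = series_reliability(1000.0, elements)
r_parallel = 1.0 - parallel_cdf(1000.0, elements)
```

Here `r_series` corresponds to a series structure (all elements must survive) and `r_parallel` to a 1-out-of-n parallel structure (at least one element must survive).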
A6.10 Distribution Functions used in Reliability Analysis
This section introduces the most important distribution functions used in reliability analysis; see Table A6.1 for a summary. The variable $t$, used here for convenience, applies in particular to nonrepairable items. For interarrival times (e.g. when considering repairable systems), $x$ has to be used instead of $t$.
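A6.10.1 Exponential Distribution
A continuous positive random variable $\tau$ has an exponential distribution if
$$ F(t) = 1 - e^{-\lambda t}, \quad t \ge 0,\ F(t) = 0 \text{ for } t < 0;\ \lambda > 0. $$    (A6.81)
The density is given by
$$ f(t) = \lambda e^{-\lambda t}, \quad t \ge 0,\ f(t) = 0 \text{ for } t < 0;\ \lambda > 0, $$    (A6.82)
and the failure rate (Eq. (A6.25)) by
$$ \lambda(t) = \lambda. $$    (A6.83)
The mean and the variance can be obtained from Eqs. (A6.38) and (A6.44) as
$$ E[\tau] = \frac{1}{\lambda} $$    (A6.84)
and
$$ \mathrm{Var}[\tau] = \frac{1}{\lambda^2}. $$    (A6.85)
The Laplace transform of $f(t)$ is, according to Table A9.7,
$$ \tilde f(s) = \frac{\lambda}{s + \lambda}. $$    (A6.86)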
Example A6.13
The failure-free time $\tau$ of an assembly is exponentially distributed with $\lambda = 10^{-5}\,\mathrm{h}^{-1}$. What is the probability of $\tau$ being (i) over 2,000 h, (ii) over 20,000 h, (iii) over 100,000 h, (iv) between 20,000 h and 100,000 h?
Solution
From Eqs. (A6.81), (A6.24) and (A6.19) one obtains
(i) $\Pr\{\tau > 2{,}000\,\mathrm{h}\} = e^{-0.02} \approx 0.98$,
(ii) $\Pr\{\tau > 20{,}000\,\mathrm{h}\} = e^{-0.2} \approx 0.819$,
(iii) $\Pr\{\tau > 100{,}000\,\mathrm{h}\} = \Pr\{\tau > 1/\lambda = E[\tau]\} = e^{-1} \approx 0.368$,
(iv) $\Pr\{20{,}000\,\mathrm{h} < \tau \le 100{,}000\,\mathrm{h}\} = e^{-0.2} - e^{-1} \approx 0.451$.
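The four probabilities of Example A6.13 can be reproduced with a short script. This is a minimal sketch; the names `LAM`, `reliability`, and `results` are chosen here for illustration.

```python
import math

LAM = 1e-5  # failure rate in 1/h, as given in Example A6.13

def reliability(t):
    """R(t) = Pr{tau > t} = exp(-LAM * t) for the exponential distribution."""
    return math.exp(-LAM * t)

results = {
    "(i)": reliability(2_000),
    "(ii)": reliability(20_000),
    "(iii)": reliability(100_000),
    "(iv)": reliability(20_000) - reliability(100_000),  # F(b) - F(a) = R(a) - R(b)
}

for label, value in results.items():
    print(label, round(value, 3))
```

Case (iv) uses the fact that the probability of an interval $(a, b]$ is $F(b) - F(a) = R(a) - R(b)$.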
For an exponential distribution, the failure rate is constant (time-independent) and equal to $\lambda$. This important property is a characteristic of the exponential distribution and does not appear with any other continuous distribution. It greatly simplifies calculation because of the following properties:
1. Memoryless property: Assuming that the failure-free time is exponentially distributed and knowing that the item is functioning at the present time, its behavior in the future will not depend on how long it has already been operating. In particular, the probability that it will fail in the next time interval $\delta t$ is constant and equal to $\lambda\,\delta t$. This is a consequence of Eq. (A6.30)
$$ \Pr\{\tau > t + x_0 \mid \tau > x_0\} = e^{-\lambda t}. $$    (A6.87)
2. Constant failure rate at system level: If a system without redundancy consists of elements $E_1, \ldots, E_n$ and the failure-free times $\tau_1, \ldots, \tau_n$ of these elements are independent and exponentially distributed with parameters $\lambda_1, \ldots, \lambda_n$ then, according to Eq. (A6.78), the system failure rate is also constant (time-independent) and equal to the sum of the failure rates of its elements
$$ R_S(t) = e^{-\lambda_1 t} \cdots e^{-\lambda_n t} = e^{-\lambda_S t}, \quad \text{with } \lambda_S = \lambda_1 + \ldots + \lambda_n. $$    (A6.88)
It should be noted that the expression $\lambda_S = \sum \lambda_i$ is a characteristic of the series model with independent elements, and also remains valid for time-dependent failure rates $\lambda_i = \lambda_i(t)$, see Eqs. (A6.80) and (2.18).
A6.10.2 Weibull Distribution
The Weibull distribution can be considered as a generalization of the exponential distribution. A continuous positive random variable $\tau$ has a Weibull distribution if
$$ F(t) = 1 - e^{-(\lambda t)^\beta}, \quad t \ge 0,\ F(t) = 0 \text{ for } t < 0;\ \lambda, \beta > 0. $$    (A6.89)
The density is given by
$$ f(t) = \beta \lambda (\lambda t)^{\beta - 1} e^{-(\lambda t)^\beta}, \quad t \ge 0,\ f(t) = 0 \text{ for } t < 0;\ \lambda, \beta > 0, $$    (A6.90)
and the failure rate (Eq. (A6.25)) by
$$ \lambda(t) = \beta \lambda (\lambda t)^{\beta - 1}. $$    (A6.91)
$\lambda$ is the scale parameter ($F(t)$ depends on $\lambda t$ only) and $\beta$ the shape parameter. $\beta = 1$ yields the exponential distribution. For $\beta > 1$, the failure rate $\lambda(t)$ increases monotonically, with $\lambda(0) = 0$ and $\lambda(\infty) = \infty$. For $\beta < 1$, $\lambda(t)$ decreases monotonically, with $\lambda(0) = \infty$ and $\lambda(\infty) = 0$. The mean and the variance are given by
$$ E[\tau] = \frac{\Gamma(1 + 1/\beta)}{\lambda} $$    (A6.92)
and
$$ \mathrm{Var}[\tau] = \frac{\Gamma(1 + 2/\beta) - \Gamma^2(1 + 1/\beta)}{\lambda^2}, $$    (A6.93)
where
$$ \Gamma(z) = \int_0^{\infty} x^{z - 1} e^{-x}\,dx $$    (A6.94)
is the complete gamma function (Appendix A9.6). The coefficient of variation $\kappa = \sqrt{\mathrm{Var}[\tau]} / E[\tau] = \sigma / E[\tau]$ is plotted in Fig. 4.5. For a given $E[\tau]$, the density of the Weibull distribution becomes peaked with increasing $\beta$. An analytical expression for the Laplace transform of the Weibull distribution function does not exist.
For a system without redundancy (series model) whose elements have independent failure-free times $\tau_1, \ldots, \tau_n$ distributed according to Eq. (A6.89), the reliability function is given by
$$ R_S(t) = \big( e^{-(\lambda t)^\beta} \big)^n = e^{-(\lambda' t)^\beta}, $$    (A6.95)
with $\lambda' = \lambda\, n^{1/\beta}$. Thus, the failure-free time of the system has a Weibull distribution with parameters $\lambda'$ and $\beta$.
The Weibull distribution with $\beta > 1$ often occurs in applications as a distribution of the failure-free time of components which are subject to wearout and/or fatigue (lamps, relays, mechanical components, etc.). It was introduced by W. Weibull in 1951, related to investigations on fatigue in metals [A6.20]. B.W. Gnedenko showed that a Weibull distribution occurs as one of the extreme value distributions for the smallest of $n$ ($n \to \infty$) independent random variables with the same distribution function (Weibull-Gnedenko distribution [A6.7, A6.8]).
The Weibull distribution is often given with the parameter $\alpha = \lambda^\beta$ instead of $\lambda$ or also with three parameters
$$ F(t) = 1 - e^{-(\lambda (t - \psi))^\beta}, \quad t \ge \psi,\ F(t) = 0 \text{ for } t < \psi;\ \lambda, \beta > 0. $$    (A6.96)
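The Weibull mean, variance, and failure rate, and the scale parameter of a series system of identical Weibull elements, are easy to evaluate with the gamma function from the standard library. The sketch below assumes the parametrization $F(t) = 1 - e^{-(\lambda t)^\beta}$; the function names are hypothetical.

```python
import math

def weibull_mean(lam, beta):
    """E[tau] = Gamma(1 + 1/beta) / lam."""
    return math.gamma(1.0 + 1.0 / beta) / lam

def weibull_variance(lam, beta):
    """Var[tau] = (Gamma(1 + 2/beta) - Gamma(1 + 1/beta)^2) / lam^2."""
    g1 = math.gamma(1.0 + 1.0 / beta)
    g2 = math.gamma(1.0 + 2.0 / beta)
    return (g2 - g1 ** 2) / lam ** 2

def weibull_failure_rate(t, lam, beta):
    """lambda(t) = beta * lam * (lam*t)^(beta - 1), for t > 0."""
    return beta * lam * (lam * t) ** (beta - 1.0)

def series_scale(lam, beta, n):
    """Scale parameter lam' = lam * n^(1/beta) of n identical elements in series."""
    return lam * n ** (1.0 / beta)
```

For `beta = 1` the functions reduce to the exponential case (mean `1/lam`, variance `1/lam**2`, constant failure rate `lam`).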
Example A6.14
Show that for a three parameter Weibull distribution, also the time scale parameter $\psi$ can be determined (graphically) on a Weibull probability chart, e.g. for the empirical evaluation of data.
Solution
In the system of coordinates $\log_{10}(t)$ and $\log_{10}\log_{10}(1/(1 - F(t)))$ the two parameter Weibull distribution function (Eq. (A6.89)) appears as a straight line, allowing a graphical determination of $\lambda$ and $\beta$ (see Eq. (A8.16) and Fig. A8.2). The three parameter Weibull distribution (Eq. (A6.96)) leads to a concave curve. In this case, for two arbitrary points $t_1$ and $t_2 > t_1$ it holds for the mean point on the scale $\log_{10}\log_{10}(1/(1 - F(t)))$, defining $t_m$, that $\log_{10}(t_2 - \psi) + \log_{10}(t_1 - \psi) = 2\log_{10}(t_m - \psi)$, see Eq. (A8.16), the identity $a + (b - a)/2 = (a + b)/2$, and Fig. A8.2. From this, $(t_2 - \psi)(t_1 - \psi) = (t_m - \psi)^2$ and
$$ \psi = \frac{t_1 t_2 - t_m^2}{t_1 + t_2 - 2 t_m}, $$
as function of $t_1$, $t_2$, $t_m$.
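The three-point relation of Example A6.14 can be tried out on synthetic values. In the sketch below, `weibull_location` is a hypothetical helper implementing $\psi = (t_1 t_2 - t_m^2)/(t_1 + t_2 - 2 t_m)$; the test values are generated from an assumed true location parameter of 100, for which the midpoint on the logarithmic scale satisfies $t_m - \psi = \sqrt{(t_1 - \psi)(t_2 - \psi)}$.

```python
import math

def weibull_location(t1, t2, tm):
    """Location parameter psi from two points t1 < t2 and the point tm whose
    ordinate on the Weibull chart lies midway between those of t1 and t2."""
    denom = t1 + t2 - 2.0 * tm
    if denom == 0:
        raise ValueError("t1 + t2 - 2*tm is zero; psi cannot be determined")
    return (t1 * t2 - tm ** 2) / denom

# Synthetic check with a hypothetical true value psi = 100
psi_true = 100.0
t1, t2 = 150.0, 600.0
tm = psi_true + math.sqrt((t1 - psi_true) * (t2 - psi_true))
psi_est = weibull_location(t1, t2, tm)
```

With real data, `tm` would be read from the chart, so the accuracy of the estimate depends on how well the curve through the plotted points is drawn.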
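A6.10.3 Gamma Distribution, Erlangian Distribution, and $\chi^2$ Distribution
A continuous positive random variable $\tau$ has a Gamma distribution if
$$ F(t) = \Pr\{\tau \le t\} = \frac{1}{\Gamma(\beta)} \int_0^{\lambda t} x^{\beta - 1} e^{-x}\,dx = \frac{\gamma(\beta, \lambda t)}{\Gamma(\beta)}, \quad t \ge 0,\ F(t) = 0 \text{ for } t < 0;\ \lambda, \beta > 0. $$    (A6.97)
$\Gamma$ is the complete Gamma function defined by Eq. (A6.94). $\gamma$ is the incomplete Gamma function (Appendix A9.6). The density of the Gamma distribution is given by
$$ f(t) = \lambda\, \frac{(\lambda t)^{\beta - 1}}{\Gamma(\beta)}\, e^{-\lambda t}, \quad t \ge 0,\ f(t) = 0 \text{ for } t < 0;\ \lambda, \beta > 0, $$    (A6.98)
and the failure rate is calculated from $\lambda(t) = f(t)/(1 - F(t))$. $\lambda(t)$ is constant (time-independent) for $\beta = 1$, monotonically decreasing for $\beta < 1$ and monotonically increasing for $\beta > 1$. However, in contrast to the Weibull distribution, $\lambda(t)$ always converges to $\lambda$ for $t \to \infty$, see Table A6.1 for an example. A Gamma distribution with $\beta < 1$ mixed with a three-parameter Weibull distribution (Eq. (A6.34), case 1) can be used as an approximation to the distribution function for an item with failure rate as the bathtub curve given in Fig. 1.2. The mean and the variance are given by
$$ E[\tau] = \frac{\beta}{\lambda} $$    (A6.99)
and
$$ \mathrm{Var}[\tau] = \frac{\beta}{\lambda^2}. $$    (A6.100)
The Laplace transform (Table A9.7) of the Gamma distribution density is
$$ \tilde f(s) = \frac{\lambda^\beta}{(s + \lambda)^\beta}. $$    (A6.101)
From Eqs. (A6.101) and (A6.76), it follows that the sum of two independent Gamma-distributed random variables with parameters $\lambda, \beta_1$ and $\lambda, \beta_2$ has a Gamma distribution with parameters $\lambda, \beta_1 + \beta_2$.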
Example A6.15
Let the random variables $\tau_1$ and $\tau_2$ be independent and distributed according to a Gamma distribution with the parameters $\lambda$ and $\beta$. Determine the density of the sum $\eta = \tau_1 + \tau_2$.
Solution
According to Eq. (A6.98), $\tau_1$ and $\tau_2$ have density $f(t) = \lambda (\lambda t)^{\beta - 1} e^{-\lambda t} / \Gamma(\beta)$. The Laplace transform of $f(t)$ is $\tilde f(s) = \lambda^\beta / (s + \lambda)^\beta$ (Table A9.7). From Eq. (A6.76), the Laplace transform of the density of $\eta = \tau_1 + \tau_2$ follows as $\tilde f_\eta(s) = \lambda^{2\beta} / (s + \lambda)^{2\beta}$. The random variable $\eta = \tau_1 + \tau_2$ thus has a Gamma distribution with parameters $\lambda$ and $2\beta$ (generalization to $n > 2$ is immediate).

For $\beta = n = 2, 3, \ldots$, the Gamma distribution given by Eq. (A6.97) leads to an Erlangian distribution with parameters $\lambda$ and $n$. Taking into account Eq. (A6.77) and comparing the Laplace transform of the exponential distribution $\lambda / (s + \lambda)$ with that of the Erlangian distribution $(\lambda / (s + \lambda))^n$ leads to the following conclusion: If $\tau$ is Erlang distributed with parameters $\lambda$ and $n$, then $\tau$ can be considered as the sum of $n$ independent, exponentially distributed random variables with parameter $\lambda$, i.e. $\tau = \tau_1 + \ldots + \tau_n$ with $\Pr\{\tau_i \le t\} = 1 - e^{-\lambda t}$, $i = 1, \ldots, n$.
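The Erlangian distribution can be obtained by partial integration of the right-hand side of Eq. (A6.97), with $\beta = n$. This leads to (see also Appendices A9.2 & A9.6)
$$ F(t) = \Pr\{\tau_1 + \ldots + \tau_n \le t\} = 1 - \sum_{i=0}^{n-1} \frac{(\lambda t)^i}{i!}\, e^{-\lambda t}, \quad t \ge 0,\ F(t) = 0 \text{ for } t < 0;\ \lambda > 0,\ n \ge 1. $$    (A6.102)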
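The Erlangian distribution function can be evaluated with the finite sum that results from the partial integration, $F(t) = 1 - \sum_{i=0}^{n-1} \frac{(\lambda t)^i}{i!} e^{-\lambda t}$. The following sketch uses a hypothetical function name and arbitrary parameter values.

```python
import math

def erlang_cdf(t, lam, n):
    """Distribution function of the sum of n independent exponential
    random variables with rate lam (Erlangian distribution)."""
    if t <= 0:
        return 0.0
    x = lam * t
    partial_sum = sum(x ** i / math.factorial(i) for i in range(n))
    return 1.0 - math.exp(-x) * partial_sum

# n = 1 is the exponential distribution; n = 2 matches the standby
# redundancy of Example A6.12, where R_S(t) = exp(-lam*t) * (1 + lam*t)
values = [erlang_cdf(1000.0, 1e-3, n) for n in (1, 2, 3)]
```

For fixed `t` and `lam`, the values decrease with `n`, since the sum of more failure-free times is less likely to stay below `t`.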
From Example A6.15, if failure-free times are Erlangian distributed with parameters $(n, \lambda)$, the sum of $k$ failure-free times is Erlangian distributed with parameters $(kn, \lambda)$.
For $\lambda = 1/2$ and $\beta = \nu/2$, $\nu = 1, 2, \ldots$, the Gamma distribution given by Eq. (A6.97) is a chi-square distribution ($\chi^2$ distribution) with $\nu$ degrees of freedom. The corresponding random variable is denoted $\chi^2_\nu$. The chi-square distribution with $\nu$ degrees of freedom is thus given by (see also Appendix A9.2)
$$ F(t) = \Pr\{\chi^2_\nu \le t\} = \frac{1}{2^{\nu/2}\, \Gamma(\nu/2)} \int_0^t x^{\nu/2 - 1} e^{-x/2}\,dx, \quad t > 0,\ F(0) = 0;\ \nu = 1, 2, \ldots. $$    (A6.103)
From Eqs. (A6.97), (A6.102), and (A6.103) it follows that, for $\tau_1, \ldots, \tau_n$ independent and exponentially distributed with parameter $\lambda$,
$$ 2\lambda (\tau_1 + \ldots + \tau_n) $$
has a $\chi^2$ distribution with $\nu = 2n$ degrees of freedom. If $\xi_1, \ldots, \xi_n$ are independent, normally distributed random variables with mean $m$ and variance $\sigma^2$, then
$$ \sum_{i=1}^{n} \Big( \frac{\xi_i - m}{\sigma} \Big)^2 = \frac{1}{\sigma^2} \sum_{i=1}^{n} (\xi_i - m)^2 $$
is $\chi^2$ distributed with $n$ degrees of freedom. The above considerations show the importance of the $\chi^2$ distribution in mathematical statistics. The $\chi^2$ distribution is also used to compute the Poisson distribution (Eq. (A6.102) with $n = \nu/2$ and $\lambda = 1/2$ or Eq. (A6.126) with $k = \nu/2 - 1$ and $m = t/2$, see also Table A9.2).
A6.10.4 Normal Distribution
A widely used distribution function, in theory and practice, is the normal distribution, or Gaussian distribution. The random variable $\tau$ has a normal distribution if
$$ F(t) = \Pr\{\tau \le t\} = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{t} e^{-\frac{(y - m)^2}{2\sigma^2}}\,dy, \quad -\infty < t, m < \infty;\ \sigma > 0. $$
The density of the normal distribution is given by
$$ f(t) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(t - m)^2}{2\sigma^2}}, \quad -\infty < t, m < \infty;\ \sigma > 0. $$
The failure rate is calculated from $\lambda(t) = f(t)/(1 - F(t))$. The mean and variance are $E[\tau] = m$ and $\mathrm{Var}[\tau] = \sigma^2$, respectively. The density of the normal distribution is symmetric with respect to the line $x = m$. Its width depends upon the variance. The area under the density curve is equal to (Table A9.1, [A9.1], see also Appendix A9.6 for the Poisson's integral)
0.6827 for the interval $m \pm \sigma$,
0.95450 for the interval $m \pm 2\sigma$,
0.99730 for the interval $m \pm 3\sigma$,
0.999533 for the interval $m \pm 3.5\sigma$,
0.9999367 for the interval $m \pm 4\sigma$,
0.9999932 for the interval $m \pm 4.5\sigma$,
0.99999943 for the interval $m \pm 5\sigma$,
0.9999999980 for the interval $m \pm 6\sigma$.
A normally distributed random variable takes values in $(-\infty, +\infty)$. However, for $m > 3\sigma$ it is often possible to consider it as a positive random variable in practical applications. $m \pm 6\sigma$ is frequently used as a sharp limit for controlling the process quality (6-$\sigma$ approach). Assuming to accept a shift of the mean of $1.5\sigma$ in the manufacturing process, the 6-$\sigma$ approach refers in this case to $m \pm 4.5\sigma$ with respect to the basic quantity, yielding 3.4 ppm right and 3.4 ppm left of the sharp limit. If $\tau$ has a normal distribution with parameters $m$ and $\sigma^2$, $(\tau - m)/\sigma$ is normally distributed with parameters 0 and 1, which is the standard normal distribution $\Phi(t)$
$$ \Phi(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^2/2}\,dx. $$
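The areas under the normal density for symmetric intervals around the mean, and the 3.4 ppm figure of the 6-$\sigma$ approach, can be recomputed with the error function of the standard library. The sketch below relies on the identity $\Pr\{|\tau - m| \le k\sigma\} = \mathrm{erf}(k/\sqrt{2})$; the function names are hypothetical.

```python
import math

def normal_coverage(k):
    """Area under the normal density within m - k*sigma and m + k*sigma."""
    return math.erf(k / math.sqrt(2.0))

def upper_tail_ppm(k):
    """Probability of exceeding m + k*sigma, in parts per million."""
    return 0.5 * math.erfc(k / math.sqrt(2.0)) * 1e6

coverage = {k: normal_coverage(k) for k in (1, 2, 3, 4, 5, 6)}
tail_at_4_5_sigma = upper_tail_ppm(4.5)
```

Using `erfc` for the tail avoids the loss of precision that `1 - erf(...)` would cause for large `k`.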
If $\tau_1$ and $\tau_2$ are (stochastically) independent, normally distributed random variables with parameters $m_1, \sigma_1^2$, and $m_2, \sigma_2^2$, $\eta = \tau_1 + \tau_2$ is normally distributed with parameters $m_1 + m_2$, $\sigma_1^2 + \sigma_2^2$ (Example A6.16). This rule can be generalized to the sum of $n$ independent normally distributed random variables, and extended to dependent normally distributed random variables (Example A6.16).
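Example A6.16
Let the random variables $\tau_1$ and $\tau_2$ be (stochastically) independent and normally distributed with means $m_1$ and $m_2$ and variances $\sigma_1^2$ and $\sigma_2^2$. Give the density of the sum $\eta = \tau_1 + \tau_2$.
Solution
According to Eq. (A6.74), extended to random variables defined over $(-\infty, \infty)$, the density of $\eta = \tau_1 + \tau_2$ follows as
$$ f_\eta(t) = \frac{1}{2\pi \sigma_1 \sigma_2} \int_{-\infty}^{\infty} e^{-\left( \frac{(x - m_1)^2}{2\sigma_1^2} + \frac{(t - x - m_2)^2}{2\sigma_2^2} \right)}\,dx. $$
Setting $u = x - m_1$, $v = t - m_1 - m_2$, and considering
$$ \frac{u^2}{\sigma_1^2} + \frac{(v - u)^2}{\sigma_2^2} = \left[ \frac{u \sqrt{\sigma_1^2 + \sigma_2^2}}{\sigma_1 \sigma_2} - \frac{v\, \sigma_1}{\sigma_2 \sqrt{\sigma_1^2 + \sigma_2^2}} \right]^2 + \frac{v^2}{\sigma_1^2 + \sigma_2^2}, $$
the result
$$ f_\eta(t) = \frac{1}{\sqrt{2\pi (\sigma_1^2 + \sigma_2^2)}}\, e^{-\frac{(t - m_1 - m_2)^2}{2(\sigma_1^2 + \sigma_2^2)}} $$
is obtained. Thus the sum of independent normally distributed random variables is also normally distributed with mean $m_1 + m_2$ and variance $\sigma_1^2 + \sigma_2^2$. If $\tau_1$ and $\tau_2$ are not (stochastically) independent, the distribution function of $\tau_1 + \tau_2$ is still a normal distribution with $m = m_1 + m_2$, but with variance $\sigma^2 = \sigma_1^2 + \sigma_2^2 + 2\rho\, \sigma_1 \sigma_2$, where $\rho$ is the correlation coefficient (Eq. (A6.67)).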
The normal distribution often occurs in practical applications, also because the distribution function of the sum of a large number of (stochastically) independent random variables converges under weak conditions to a normal distribution (central limit theorem, Eq. (A6.148)).
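A6.10.5 Lognormal Distribution
A continuous positive random variable τ has a lognormal distribution if its logarithm is normally distributed (Example A6.17). For the lognormal distribution,
\[ F(t) = \frac{1}{\sigma\sqrt{2\pi}} \int_0^t \frac{1}{y}\, e^{-(\ln \lambda y)^2 / 2\sigma^2}\, dy = \Phi\!\left( \frac{\ln(\lambda t)}{\sigma} \right), \qquad t > 0, \quad F(t) = 0 \text{ for } t \le 0; \quad \lambda, \sigma > 0 . \]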
The density is given by
\[ f(t) = \frac{1}{t\,\sigma\sqrt{2\pi}}\, e^{-(\ln \lambda t)^2 / 2\sigma^2}, \qquad t > 0, \quad f(t) = 0 \text{ for } t \le 0; \quad \lambda, \sigma > 0 . \tag{A6.111} \]
The failure rate is calculated from λ(t) = f(t)/(1 − F(t)), see Table A6.1 for an example. The mean and the variance of τ are (Problem A6.6 in Appendix A11)
\[ E[\tau] = \frac{e^{\sigma^2/2}}{\lambda} \]
and
\[ \mathrm{Var}[\tau] = \frac{e^{2\sigma^2} - e^{\sigma^2}}{\lambda^2} , \]
respectively. The density of the lognormal distribution is practically zero in a neighborhood of the origin, increases rapidly to a maximum, and decreases quickly (Fig. 4.2). It often applies as a model for repair times (Section 4.1) or for lifetimes in accelerated reliability tests (Section 7.4), and appears when a large number of (stochastically) independent random variables are combined in a multiplicative way. It is also the limit distribution for n → ∞ of x_n, when x_{n+1} = (1 + ε_n) x_n, where ε_n is a random variable independent of x_n [A6.9, 6.19]. The notation with m or a = −ln(λ) is often used. It must also be noted that σ² = Var[ln τ] and m = ln(1/λ) = E[ln τ] (Example A6.17).
Example A6.17 Show that the logarithm of a lognormally distributed random variable is normally distributed.
Solution For
\[ f_\tau(t) = \frac{1}{t\,\sigma\sqrt{2\pi}}\, e^{-(\ln t + \ln \lambda)^2 / 2\sigma^2} \]
and η = ln τ, Equation (A6.31) yields (u(t) = ln t and u^{-1}(t) = e^t)
\[ f_\eta(t) = \frac{1}{e^t\,\sigma\sqrt{2\pi}}\, e^{-(t + \ln \lambda)^2 / 2\sigma^2}\, e^t = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(t-m)^2 / 2\sigma^2} , \]
with m = ln(1/λ). This method can be used for other transformations, for example:
(i) u(t) = e^t; u^{-1}(t) = ln(t): Normal distribution → Lognormal distribution,
(ii) u(t) = ln(t); u^{-1}(t) = e^t: Lognormal distribution → Normal distribution,
(iii) u(t) = t^β; u^{-1}(t) = t^{1/β}: Weibull distribution → Exponential distribution,
(iv) u(t) = t^{1/β}; u^{-1}(t) = t^β: Exponential distribution → Weibull distribution,
(v) u(t) = F_η^{-1}(t); u^{-1}(t) = F_η(t): Uniform distribution on (0, 1) → F_η(t),
(vi) u(t) = F_η(t); u^{-1}(t) = F_η^{-1}(t): F_η(t) → Uniform distribution on (0, 1),
(vii) η = C·τ; τ = η/C: F_η(t) = F_τ(t/C) and f_η(t) = f_τ(t/C)/C.
In Monte Carlo simulations, more elaborate algorithms than F_η^{-1}(t) are often used.
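Cases (i), (iv), and (v) of Example A6.17 are the basis of simple random-number generators. The sketch below illustrates them with Python's standard library only; the function names are hypothetical, and the Weibull parametrization F(t) = 1 − exp(−(λt)^β) is an assumption made for this illustration.

```python
import math
import random


def sample_exponential(lam: float, rng: random.Random) -> float:
    """Case (v) with F(t) = 1 - exp(-lam*t), i.e. F^{-1}(u) = -ln(1 - u)/lam."""
    # rng.random() lies in [0, 1), so 1 - u lies in (0, 1] and the log is defined.
    return -math.log(1.0 - rng.random()) / lam


def sample_weibull(lam: float, beta: float, rng: random.Random) -> float:
    """Case (iv): beta-th root of a unit exponential sample, scaled by 1/lam.

    Assumed parametrization: F(t) = 1 - exp(-(lam*t)**beta).
    """
    return sample_exponential(1.0, rng) ** (1.0 / beta) / lam


def sample_lognormal(lam: float, sigma: float, rng: random.Random) -> float:
    """Case (i): exp of a normal variable with mean ln(1/lam) and std. deviation sigma."""
    return math.exp(rng.gauss(-math.log(lam), sigma))


rng = random.Random(42)
n = 20_000
mean_lognormal = sum(sample_lognormal(2.0, 0.5, rng) for _ in range(n)) / n
print(f"empirical lognormal mean: {mean_lognormal:.4f}")
print(f"exp(sigma^2/2)/lambda:    {math.exp(0.5**2 / 2) / 2.0:.4f}")
```

The empirical lognormal mean should lie close to e^{σ²/2}/λ, the mean stated for the lognormal distribution in Appendix A6.10.5.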
A6.10.6 Uniform Distribution
A random variable τ is uniformly distributed in the interval (a, b) if it has the distribution function
\[ F(t) = \Pr\{\tau \le t\} = \begin{cases} 0 & \text{for } t \le a \\ \dfrac{t-a}{b-a} & \text{for } a < t < b \\ 1 & \text{for } t \ge b . \end{cases} \]
The density is then given by
\[ f(t) = \frac{1}{b-a} \qquad \text{for } a < t < b . \]
The uniform distribution is a particular case of the geometric probability introduced by Eq. (A6.3), for ℝ¹ instead of ℝ². Because of the property mentioned by case (v) of Example A6.17, the uniform distribution in the interval (0, 1) plays an important role in simulation problems.
A6.10.7 Binomial Distribution
Consider a trial in which the only outcomes are either a given event A or its complement Ā. These outcomes can be represented by a random variable of the form
\[ \delta = \begin{cases} 1 & \text{if } A \text{ occurs} \\ 0 & \text{otherwise.} \end{cases} \tag{A6.115} \]
δ is called a Bernoulli variable. If
\[ \Pr\{\delta = 1\} = p \quad \text{and} \quad \Pr\{\delta = 0\} = 1 - p , \]
then
\[ E[\delta] = 1 \cdot p + 0 \cdot (1-p) = p \]
and
\[ \mathrm{Var}[\delta] = E[\delta^2] - E^2[\delta] = p - p^2 = p(1-p) . \]
An infinite sequence of independent Bernoulli variables δ1, δ2, ... with the same probability Pr{δi = 1} = p, i ≥ 1, is called a Bernoulli model or a sequence of Bernoulli trials. The sequence δ1, δ2, ... describes, for example, the model of the repeated sampling of a component from a lot of size N, with K defective components (p = K/N) such that the component is returned to the lot after testing (sample with replacement). The random variable
\[ \zeta = \delta_1 + \dots + \delta_n \tag{A6.119} \]
is the number of ones occurring in n Bernoulli trials. The distribution of ζ is given by
\[ p_k = \Pr\{\zeta = k\} = \binom{n}{k} p^k (1-p)^{n-k}, \qquad k = 0, \dots, n, \quad 0 < p < 1 . \tag{A6.120} \]
Equation (A6.120) is the binomial distribution. ζ is obviously an arithmetic random variable taking on values in {0, 1, ..., n} with probabilities p_k. To prove Eq. (A6.120), consider that
\[ p^k (1-p)^{n-k} = \Pr\{\delta_1 = 1 \cap \dots \cap \delta_k = 1 \cap \delta_{k+1} = 0 \cap \dots \cap \delta_n = 0\} \]
is the probability of the event A occurring in the first k trials and not occurring in the n − k following trials (δ1, ..., δn are independent); furthermore, in n trials there are
\[ \binom{n}{k} = \frac{n!}{k!\,(n-k)!} \]
different possibilities of occurrence of k ones and n − k zeros; the addition theorem (Eq. (A6.11)) then leads to Eq. (A6.120).
Example A6.18 A populated printed circuit board (PCB) contains 30 ICs. These are taken from a shipment in which the probability of each IC being defective is constant and equal to 1%. What are the probabilities that the PCB contains (i) no defective ICs, (ii) exactly one defective IC, and (iii) more than one defective IC?
Solution From Eq. (A6.120) with p = 0.01,
(i) p_0 = 0.99^30 ≈ 0.74,
(ii) p_1 = 30 · 0.01 · 0.99^29 ≈ 0.224,
(iii) p_2 + ... + p_30 = 1 − p_0 − p_1 ≈ 0.036.
Knowing p_i and assuming C_i = cost for i repairs (because of i defective ICs), it is easy to calculate the mean C of the total cost caused by the defective ICs (C = p_1 C_1 + ... + p_30 C_30) and thus to develop a test strategy based on cost considerations (Section 8.4).
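The numbers of Example A6.18 can be reproduced directly from Eq. (A6.120). The following sketch uses only the Python standard library; the helper names and the linear cost model in the last lines (cost proportional to the number of repairs) are assumptions made for illustration.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """p_k = Pr{zeta = k} for n Bernoulli trials with Pr{A} = p (Eq. (A6.120))."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def mean_cost(n: int, p: float, cost: list) -> float:
    """Mean total cost C = p_1*C_1 + ... + p_n*C_n, with cost[i] the cost for i repairs."""
    return sum(binomial_pmf(i, n, p) * cost[i] for i in range(1, n + 1))


N_ICS, P_DEFECTIVE = 30, 0.01
p0 = binomial_pmf(0, N_ICS, P_DEFECTIVE)
p1 = binomial_pmf(1, N_ICS, P_DEFECTIVE)
p_more = 1.0 - p0 - p1
print(f"(i)   no defective IC:       {p0:.3f}")
print(f"(ii)  exactly one defective: {p1:.3f}")
print(f"(iii) more than one:         {p_more:.3f}")

# Hypothetical cost model: each repair costs one unit, so C_i = i.
unit_costs = [float(i) for i in range(N_ICS + 1)]
print(f"mean cost with C_i = i: {mean_cost(N_ICS, P_DEFECTIVE, unit_costs):.3f}")
```

With C_i = i the mean cost reduces to n p = 0.3, which gives a quick plausibility check of the summation.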
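For the random variable ζ defined by Eq. (A6.119) it follows that
\[ \Pr\{\zeta \le k\} = \sum_{i=0}^{k} \binom{n}{i} p^i (1-p)^{n-i}, \qquad k = 0, \dots, n, \quad 0 < p < 1 , \]
\[ E[\zeta] = n p , \]
\[ \mathrm{Var}[\zeta] = n p (1-p) . \]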
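Example A6.19 Give mean and variance of a binomially distributed random variable with parameters n and p.
Solution Considering the independence of δ1, ..., δn, the definition of ζ (Eq. (A6.119)), and from Eqs. (A6.117), (A6.118), (A6.68), and (A6.70) it follows that
\[ E[\zeta] = E[\delta_1] + \dots + E[\delta_n] = n p \]
and
\[ \mathrm{Var}[\zeta] = \mathrm{Var}[\delta_1] + \dots + \mathrm{Var}[\delta_n] = n p (1-p) . \]
A further demonstration follows, as for Example A6.20, by considering that
\[ \sum_{k=1}^{n} k \binom{n}{k} p^k (1-p)^{n-k} = n p \sum_{k=1}^{n} \binom{n-1}{k-1} p^{k-1} (1-p)^{n-k} = n p . \]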
For large n, the binomial distribution converges to the normal distribution (Eq. (A6.149)). The convergence is good for min(np, n(1 − p)) ≥ 5. For small values of p, the Poisson approximation (Eq. (A6.129)) can be used. Calculations of Eq. (A6.120) can be based upon the relationship between the binomial and the beta or the Fisher distribution (Appendix A9.4). Generalization of Eq. (A6.120) for the case where one of the events A_1, ..., A_m can occur with probability p_1, ..., p_m at every trial leads to the multinomial distribution
\[ \Pr\{\text{in } n \text{ trials } A_1 \text{ occurs } k_1 \text{ times}, \dots, A_m \text{ occurs } k_m \text{ times}\} = \frac{n!}{k_1! \cdots k_m!}\, p_1^{k_1} \cdots p_m^{k_m} , \]
with k_1 + ... + k_m = n and p_1 + ... + p_m = 1.
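A6.10.8 Poisson Distribution
The arithmetic random variable ζ has a Poisson distribution if
\[ p_k = \Pr\{\zeta = k\} = \frac{m^k}{k!}\, e^{-m}, \qquad k = 0, 1, \dots, \quad m > 0 , \tag{A6.125} \]
and thus
\[ \Pr\{\zeta \le k\} = \sum_{i=0}^{k} \frac{m^i}{i!}\, e^{-m}, \qquad k = 0, 1, \dots, \quad m > 0 . \]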
The mean and the variance of ζ are
\[ E[\zeta] = m \quad \text{and} \quad \mathrm{Var}[\zeta] = m . \]
The Poisson distribution often occurs in connection with exponentially distributed failure-free times. In fact, Eq. (A6.125) with m = λt gives the probability of k failures in the time interval (0, t], given λ and t (Eq. (A7.41)). The Poisson distribution is also used as an approximation of the binomial distribution for n → ∞ and p → 0 such that np = m < ∞. To prove this convergence, called the Poisson approximation, set m = np; Eq. (A6.120) then yields
\[ p_k = \frac{n!}{k!\,(n-k)!} \left(\frac{m}{n}\right)^{k} \left(1 - \frac{m}{n}\right)^{n-k} = \frac{n(n-1)\cdots(n-k+1)}{n^k} \cdot \frac{m^k}{k!} \left(1 - \frac{m}{n}\right)^{n-k} = 1 \cdot \left(1 - \frac{1}{n}\right) \cdots \left(1 - \frac{k-1}{n}\right) \cdot \frac{m^k}{k!} \left(1 - \frac{m}{n}\right)^{n-k} , \]
from which (for k < ∞ and m = np < ∞) it follows that
\[ \lim_{n \to \infty} p_k = \frac{m^k}{k!}\, e^{-m}, \qquad m = n p . \tag{A6.129} \]
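The quality of the Poisson approximation (Eq. (A6.129)) can be inspected by comparing it with the exact binomial probabilities of Eq. (A6.120). The sketch below does this for n = 5,000 and p = 0.001, the values used later in Example A6.22; the function names are illustrative.

```python
from math import comb, exp, factorial


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Exact probability of k ones in n Bernoulli trials (Eq. (A6.120))."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def poisson_pmf(k: int, m: float) -> float:
    """Poisson probability with parameter m (Eq. (A6.125))."""
    return m**k / factorial(k) * exp(-m)


n, p = 5000, 0.001
m = n * p  # Poisson parameter m = n*p = 5
print(" k   Poisson    binomial")
for k in range(0, 11):
    print(f"{k:2d}   {poisson_pmf(k, m):.5f}    {binomial_pmf(k, n, p):.5f}")
```

For these values the two columns differ only in the fifth decimal place, which is consistent with n large and p small.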
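Using partial integration one can show that
\[ \sum_{i=0}^{k} \frac{m^i}{i!}\, e^{-m} = 1 - \frac{1}{k!} \int_0^{m} y^k e^{-y}\, dy = 1 - \frac{1}{k!\, 2^{k+1}} \int_0^{2m} x^k e^{-x/2}\, dx . \tag{A6.130} \]
The right-hand side of Eq. (A6.130) is a special case of the chi-square distribution (Eq. (A6.103) with ν/2 = k + 1 and t = 2m). A table of the chi-square distribution can then be used for numerical evaluation of the Poisson distribution (Table A9.2).
Example A6.20 Give mean and variance of a Poisson-distributed random variable.
Solution From Eqs. (A6.35) and (A6.125),
\[ E[\zeta] = \sum_{k=0}^{\infty} k\, \frac{m^k}{k!}\, e^{-m} = m \sum_{k=1}^{\infty} \frac{m^{k-1}}{(k-1)!}\, e^{-m} = m\, e^{-m} e^{m} = m . \]
Similarly, from Eqs. (A6.45), (A6.41), (A6.125), and considering k² = k(k − 1) + k,
\[ \mathrm{Var}[\zeta] = \sum_{k=0}^{\infty} k^2\, \frac{m^k}{k!}\, e^{-m} - m^2 = \sum_{k=0}^{\infty} [k(k-1) + k]\, \frac{m^k}{k!}\, e^{-m} - m^2 = m^2 \sum_{k=2}^{\infty} \frac{m^{k-2}}{(k-2)!}\, e^{-m} + m - m^2 = m . \]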
A6.10.9 Geometric Distribution
Let δ1, δ2, ... be a sequence of independent Bernoulli variables resulting from Bernoulli trials. The arithmetic random variable ζ defining the number of trials to the first occurrence of the event A has a geometric distribution given by
\[ p_k = \Pr\{\zeta = k\} = p\,(1-p)^{k-1}, \qquad k = 1, 2, \dots, \quad 0 < p < 1 . \tag{A6.131} \]
Equation (A6.131) follows from the definition of Bernoulli variables δi (Eq. (A6.115))
\[ p_k = \Pr\{\zeta = k\} = \Pr\{\delta_1 = 0 \cap \dots \cap \delta_{k-1} = 0 \cap \delta_k = 1\} = (1-p)^{k-1} p . \]
The geometric distribution is the only discrete distribution which exhibits the memoryless property, as does the exponential distribution for the continuous case. In fact, from Pr{ζ > k} = Pr{δ1 = 0 ∩ ... ∩ δk = 0} = (1 − p)^k and, for any k and j > 0, it follows that
\[ \Pr\{\zeta > k + j \mid \zeta > k\} = \frac{(1-p)^{k+j}}{(1-p)^{k}} = (1-p)^{j} . \]
The failure rate is time independent and given by
\[ \lambda(k) = \Pr\{\zeta = k \mid \zeta > k-1\} = \frac{p\,(1-p)^{k-1}}{(1-p)^{k-1}} = p . \]
For the distribution function of the random variable ζ defined by Eq. (A6.131) one obtains
\[ \Pr\{\zeta \le k\} = \sum_{i=1}^{k} p_i = 1 - \Pr\{\zeta > k\} = 1 - (1-p)^{k} . \tag{A6.133} \]
Mean and variance are then (with \( \sum_{n=1}^{\infty} n x^n = x/(1-x)^2 \) and \( \sum_{n=1}^{\infty} n^2 x^n = x(1+x)/(1-x)^3 \), |x| < 1)
\[ E[\zeta] = \frac{1}{p} \]
and
\[ \mathrm{Var}[\zeta] = \frac{1-p}{p^2} . \]
If Bernoulli trials are carried out at regular intervals Δt, then Eq. (A6.133) provides the distribution function of the number of time units Δt between successive occurrences of the event A under consideration; for example, breakdown of a capacitor, interference pulse in a digital network, etc. Often the geometric distribution is considered with p_k = p(1 − p)^k, k = 0, 1, ...; in this case E[ζ] = (1 − p)/p and Var[ζ] = (1 − p)/p².
A6.10.10 Hypergeometric Distribution
The hypergeometric distribution describes the model of a random sample without replacement. For example, if it is known that there are exactly K defective components in a lot of size N, then the probability of finding k defective components in a random sample of size n is given by
\[ p_k = \Pr\{\zeta = k\} = \frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}}, \qquad k = 0, \dots, \min(K, n) . \tag{A6.136} \]
Equation (A6.136) defines the hypergeometric distribution. Since for fixed n and k (0 ≤ k ≤ n)
\[ \lim_{N \to \infty} \Pr\{\zeta = k\} = \binom{n}{k} p^k (1-p)^{n-k}, \qquad \text{with } p = \frac{K}{N} , \]
the hypergeometric distribution can, for large N, be approximated by the binomial distribution with p = K/N. For the random variable ζ defined by Eq. (A6.136) it holds that
\[ \Pr\{\zeta \le k\} = \sum_{i=0}^{k} \frac{\binom{K}{i} \binom{N-K}{n-i}}{\binom{N}{n}}, \qquad E[\zeta] = n\, \frac{K}{N} , \]
and
\[ \mathrm{Var}[\zeta] = \frac{K\, n\, (N-K)(N-n)}{N^2 (N-1)} . \]
A6.11 Limit Theorems
Limit theorems are of great importance in practical applications because they can be used to find approximate expressions with the help of known (tabulated) distributions. Two important cases will be discussed in this section, the law of large numbers and the central limit theorem. The law of large numbers provides
additional justification for the construction of probability theory on the basis of relative frequencies. The central limit theorem shows that the normal distribution can be used as an approximation in many practical situations.
A6.11.1 Law of Large Numbers
Two notions used with the law of large numbers are convergence in probability and convergence with probability one. Let ξ1, ξ2, ..., and ξ be random variables on a probability space [Ω, F, Pr]. ξ_n converge in probability to ξ if for arbitrary ε > 0
\[ \lim_{n \to \infty} \Pr\{ |\xi_n - \xi| > \varepsilon \} = 0 \tag{A6.140} \]
holds. ξ_n converge to ξ with probability one if
\[ \Pr\{ \lim_{n \to \infty} \xi_n = \xi \} = 1 . \tag{A6.141} \]
The convergence with probability one is also called convergence almost sure (a.s.). An equivalent condition for Eq. (A6.141) is
\[ \lim_{n \to \infty} \Pr\{ \sup_{k \ge n} |\xi_k - \xi| > \varepsilon \} = 0 , \]
for any ε > 0. This clarifies the difference between Eq. (A6.140) and the stronger condition given by Eq. (A6.141). Let us now consider an infinite sequence of Bernoulli trials (Eqs. (A6.115), (A6.119), and (A6.120)), with parameter p = Pr{A}, and let S_n be the number of occurrences of the event A in n trials
\[ S_n = \delta_1 + \dots + \delta_n . \]
The quantity S_n/n is the relative frequency of the occurrence of A in n independent trials. The weak law of large numbers states that for every ε > 0,
\[ \lim_{n \to \infty} \Pr\Big\{ \Big| \frac{S_n}{n} - p \Big| > \varepsilon \Big\} = 0 . \tag{A6.144} \]
Equation (A6.144) is a direct consequence of Chebyshev's inequality (Eq. (A6.49)). Similarly, for a sequence of independent identically distributed random variables τ1, ..., τn with mean E[τi] = a and variance Var[τi] = σ² < ∞ (i = 1, ..., n),
\[ \lim_{n \to \infty} \Pr\Big\{ \Big| \frac{1}{n} \sum_{i=1}^{n} \tau_i - a \Big| > \varepsilon \Big\} = 0 . \tag{A6.145} \]
According to Eq. (A6.144), the sequence S_n/n converges in probability to p = Pr{A}. Moreover, according to Eq. (A6.145), the arithmetic mean (t1 + ... + tn)/n of n independent observations of the random variable τ (with a
finite variance) converges in probability to E[τ]. Therefore, p̂ = S_n/n and â = (t1 + ... + tn)/n are consistent estimates of p = Pr{A} and a = E[τ], respectively (Appendix A8.1 and A8.2). Equation (A6.145) is also a direct consequence of Chebyshev's inequality (Eq. (A6.49)). A firmer statement than the weak law of large numbers is given by the strong law of large numbers,
\[ \Pr\Big\{ \lim_{n \to \infty} \frac{S_n}{n} = p \Big\} = 1 . \tag{A6.146} \]
According to Eq. (A6.146), the relative frequency S_n/n converges with probability one (a.s.) to p = Pr{A}. Similarly, for a sequence of independent identically distributed random variables τ1, ..., τn with mean E[τi] = a < ∞ and variance Var[τi] = σ² < ∞ (i = 1, 2, ...),
\[ \Pr\Big\{ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \tau_i = a \Big\} = 1 . \tag{A6.147} \]
The proof of the strong law of large numbers (A6.146) and (A6.147) is more laborious than that of the weak law of large numbers, see e.g. [A6.6 (Vol. II), A6.7].
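The convergence of the relative frequency S_n/n to p can be illustrated by simulation, together with the bound Pr{|S_n/n − p| > ε} ≤ p(1 − p)/(nε²) that Chebyshev's inequality gives for Bernoulli trials. The following is a minimal sketch with hypothetical function names and an arbitrary choice of p and ε.

```python
import random


def relative_frequency(n: int, p: float, rng: random.Random) -> float:
    """S_n / n for n simulated Bernoulli trials with Pr{A} = p."""
    return sum(rng.random() < p for _ in range(n)) / n


def chebyshev_bound(n: int, p: float, eps: float) -> float:
    """Upper bound on Pr{|S_n/n - p| > eps}, using Var[S_n/n] = p(1-p)/n."""
    return min(1.0, p * (1.0 - p) / (n * eps**2))


rng = random.Random(2024)
p, eps = 0.3, 0.02
for n in (100, 1_000, 10_000, 100_000):
    freq = relative_frequency(n, p, rng)
    print(f"n = {n:6d}: S_n/n = {freq:.4f}, Chebyshev bound = {chebyshev_bound(n, p, eps):.4f}")
```

The bound is crude for small n (it is capped at 1 here) but shrinks as 1/n, which is all the weak law requires.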
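A6.11.2 Central Limit Theorem
Let τ1, τ2, ... be independent, identically distributed random variables with mean E[τi] = a < ∞ and variance Var[τi] = σ² < ∞, i = 1, 2, .... For every t < ∞,
\[ \lim_{n \to \infty} \Pr\Big\{ \frac{\big(\sum_{i=1}^{n} \tau_i\big) - n a}{\sigma \sqrt{n}} \le t \Big\} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^2/2}\, dx . \tag{A6.148} \]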
Equation (A6.148) is the central limit theorem. It says that for large values of n, the distribution function of the sum τ1 + ... + τn can be approximated by the normal distribution with mean E[τ1 + ... + τn] = n E[τi] = na and variance Var[τ1 + ... + τn] = n Var[τi] = nσ². The central limit theorem is of great theoretical and practical importance in probability theory and mathematical statistics. It includes the integral Laplace theorem (also known as the De Moivre-Laplace theorem) for the case where τi = δi are Bernoulli variables,
\[ \lim_{n \to \infty} \Pr\Big\{ \frac{\big(\sum_{i=1}^{n} \delta_i\big) - n p}{\sqrt{n p (1-p)}} \le t \Big\} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^2/2}\, dx . \tag{A6.149} \]
\( \sum_{i=1}^{n} \delta_i \) is the random variable ζ in Eq. (A6.120) for the binomial distribution, i.e.
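it is the total number of occurrences of the event considered in n Bernoulli trials. From Eq. (A6.149) it follows that
\[ \Pr\Big\{ \frac{\sum_{i=1}^{n} \delta_i}{n} - p \le \varepsilon \Big\} \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{n\varepsilon / \sqrt{n p (1-p)}} e^{-x^2/2}\, dx , \qquad n \to \infty , \]
or, for each given ε > 0,
\[ \Pr\Big\{ \Big| \frac{\sum_{i=1}^{n} \delta_i}{n} - p \Big| \le \varepsilon \Big\} \to \frac{2}{\sqrt{2\pi}} \int_{0}^{n\varepsilon / \sqrt{n p (1-p)}} e^{-x^2/2}\, dx , \qquad n \to \infty . \tag{A6.150} \]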
Setting the right-hand side of Eq. (A6.150) equal to γ allows determination of the number of trials n, for given γ, p, and ε, which are necessary to fulfill the inequality |(δ1 + ... + δn)/n − p| ≤ ε with a probability γ. This result is important for reliability investigations using Monte Carlo simulations, see also Eq. (A6.152). The central limit theorem can be generalized under weak conditions to the sum of independent random variables with different distribution functions [A6.6 (Vol. II), A6.7], the meaning of these conditions being that each individual standardized random variable (τi − E[τi])/√Var[τi] provides a small contribution to the standardized sum (Lindeberg conditions). Examples A6.21 to A6.23 give some applications of the central limit theorem.
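Example A6.21 The series production of a given assembly requires 5,000 ICs of a particular type. 0.5% of these ICs are defective. How many ICs must be bought in order to be able to produce the series with a probability of γ = 0.99?
Solution Setting p = Pr{IC defective} = 0.005, the minimum value of n satisfying
\[ \Pr\Big\{ \Big( n - \sum_{i=1}^{n} \delta_i \Big) \ge 5{,}000 \Big\} = \Pr\Big\{ \sum_{i=1}^{n} \delta_i \le n - 5{,}000 \Big\} \ge 0.99 = \gamma \]
must be found. Rearrangement of Eq. (A6.149) and considering t = t_γ leads to
\[ \lim_{n \to \infty} \Pr\Big\{ \sum_{i=1}^{n} \delta_i \le t_\gamma \sqrt{n p (1-p)} + n p \Big\} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t_\gamma} e^{-x^2/2}\, dx = \gamma , \]
where t_γ denotes the γ quantile of the standard normal distribution Φ(t) given by Eq. (A6.109) or Table A9.1. For γ = 0.99 one obtains from Table A9.1 t_γ = t_0.99 = 2.33. With p = 0.005, it follows that
\[ n - 5{,}000 = 2.33 \sqrt{n \cdot 0.005 \cdot 0.995} + 0.005\, n . \]
Thus, n = 5,037 ICs must be bought (if only 5,025 = 5,000 + 5,000 · 0.005 ICs were ordered, then t_γ = 0 and γ = 0.5).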
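The value n = 5,037 of Example A6.21 can be checked by searching for the smallest n that satisfies n − 5,000 ≥ n p + t_γ √(n p (1 − p)). The sketch below does this by simple iteration; the function name and its parameters are hypothetical, and the search relies on the normal approximation of Eq. (A6.149), not on the exact binomial distribution.

```python
from math import sqrt
from statistics import NormalDist


def min_units_to_buy(required: int, p: float, t_gamma: float) -> int:
    """Smallest n with n - required >= n*p + t_gamma*sqrt(n*p*(1-p)).

    The right-hand side is the normal-approximation bound on the number of
    defective units that is not exceeded with probability gamma.
    """
    n = required
    while n - required < n * p + t_gamma * sqrt(n * p * (1.0 - p)):
        n += 1
    return n


# Tabulated quantile used in the example (t_0.99 = 2.33).
print(min_units_to_buy(5000, 0.005, 2.33))
# Same computation with the quantile taken from the standard library.
print(min_units_to_buy(5000, 0.005, NormalDist().inv_cdf(0.99)))
```

Because the left-hand side grows with slope 1 in n and the right-hand side much more slowly, the first n that satisfies the inequality is also the minimum.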
Example A6.22 Electronic components are delivered with a defective probability p = 0.1%. (i) How large is the probability of having exactly 8 defective components in a (homogeneous) lot of size n = 5,000? (ii) In which interval [k1, k2] around the mean value np = 5 will the number of defective components lie in a lot of size n = 5,000 with a probability γ as near as possible to 0.95?
Solution (i) The use of the Poisson approximation (Eq. (A6.129)) leads to
\[ p_8 \approx \frac{5^8}{8!}\, e^{-5} \approx 0.06528 , \]
the exact value (obtained with Eq. (A6.120)) being 0.06527. For comparison, the following are the values of p_k obtained with the Poisson approximation (Eq. (A6.129)) in the first row and the exact values from Eq. (A6.120) in the second row
(ii) From the above table one recognizes that the interval [k1, k2] = [1, 9] is centered on the mean value np = 5 and satisfies the condition "γ as near as possible to 0.95" (γ = p_1 + p_2 + ... + p_9 ≈ 0.96). A good approximation for k1 and k2 can also be obtained using Eq. (A6.151) to determine ε = (k2 − k1)/2n for given p, n, and t_{(1+γ)/2}
\[ \varepsilon = \frac{k_2 - k_1}{2 n} = \frac{\sqrt{n p (1-p)}}{n}\; t_{(1+\gamma)/2} , \tag{A6.151} \]
where t_{(1+γ)/2} is the (1 + γ)/2 quantile of the standard normal distribution Φ(t) (Eq. (A6.109)). Equation (A6.151) is a consequence of Eq. (A6.150) by considering that
\[ \frac{2}{\sqrt{2\pi}} \int_0^{A} e^{-x^2/2}\, dx = \gamma \quad \text{yields} \quad \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{A} e^{-x^2/2}\, dx = 0.5 + \frac{\gamma}{2} = \frac{1+\gamma}{2} , \]
from which
\[ \frac{n \varepsilon}{\sqrt{n p (1-p)}} = A = t_{(1+\gamma)/2} . \]
With γ = 0.95, t_{(1+γ)/2} = t_0.975 = 1.96 (Table A9.1), n = 5,000, and p = 0.001 one obtains nε = 4.38, yielding k1 = np − nε = 0.62 (≥ 0) and k2 = np + nε = 9.38 (≤ n). The same solution is also given by Eq. (A8.45)
\[ k_{2,1} = n p \pm b \sqrt{n p (1-p)} , \]
considering b = t_{(1+γ)/2}.
Example A6.23 As an example belonging to both probability theory and statistics, determine the number n of trials necessary to estimate an unknown probability p within a given interval ±ε at a given probability γ (e.g. for a Monte Carlo simulation).
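Solution From Eq. (A6.150) it follows that for n → ∞
\[ \Pr\Big\{ \Big| \frac{\sum_{i=1}^{n} \delta_i}{n} - p \Big| \le \varepsilon \Big\} \approx \frac{2}{\sqrt{2\pi}} \int_{0}^{n\varepsilon / \sqrt{n p (1-p)}} e^{-x^2/2}\, dx = \gamma . \]
Therefore,
\[ \frac{1}{\sqrt{2\pi}} \int_{0}^{n\varepsilon / \sqrt{n p (1-p)}} e^{-x^2/2}\, dx = \frac{\gamma}{2} \quad \text{yields} \quad \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{n\varepsilon / \sqrt{n p (1-p)}} e^{-x^2/2}\, dx = 0.5 + \frac{\gamma}{2} = \frac{1+\gamma}{2} , \]
and thus nε/√(np(1 − p)) = t_{(1+γ)/2}, from which
\[ n = \Big( \frac{t_{(1+\gamma)/2}}{\varepsilon} \Big)^{2} p\,(1-p) , \tag{A6.152} \]
where t_{(1+γ)/2} is the (1 + γ)/2 quantile of the standard normal distribution Φ(t) (Eq. (A6.109), Appendix A9.1). The number of trials n depends on the value of p and is a maximum (n_max) for p = 0.5. The following table gives n_max for different values of ε and γ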
Equation (A6.152) has been established by assuming that p is known. Thus, ε refers to the number of observations in n trials (2εn = k2 − k1 as per Eq. (A8.45) with b = t_{(1+γ)/2}). However, the meaning of Eq. (A8.45) can be reversed by assuming that the number k of realizations in n trials is known. In this case, for n large and p or (1 − p) not very small, ε refers to the width of the confidence interval for p (2ε = p̂_u − p̂_l as per Eq. (A8.43) with k(1 − k/n) >> b²/4 and thus also n >> b²). The two considerations yielding a relation of the form given by Eq. (A6.152) are basically different (probability theory and statistics) and agree only because of n → ∞ (see also the remarks on pp. 508 and 520). For n, p, or (1 − p) small, the binomial distribution has to be used (Eqs. (A8.37) and (A8.38)).
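Equation (A6.152) is easy to evaluate numerically, for instance to size a Monte Carlo simulation. The following sketch is an illustration under stated assumptions: the function name is hypothetical, the quantile is taken from Python's `statistics.NormalDist`, and the result is rounded up to the next integer (a choice made here, not necessarily the rounding used in the book's table).

```python
from math import ceil
from statistics import NormalDist


def trials_needed(eps: float, gamma: float, p: float = 0.5) -> int:
    """Number of trials n = (t/eps)^2 * p*(1-p) from Eq. (A6.152), rounded up.

    t is the (1+gamma)/2 quantile of the standard normal distribution.
    The default p = 0.5 gives the worst case n_max.
    """
    t = NormalDist().inv_cdf((1.0 + gamma) / 2.0)
    return ceil((t / eps) ** 2 * p * (1.0 - p))


for gamma in (0.8, 0.9, 0.95):
    for eps in (0.05, 0.025, 0.01):
        print(f"gamma = {gamma:4.2f}, eps = {eps:5.3f}: n_max = {trials_needed(eps, gamma)}")
```

Halving ε multiplies the required number of trials by four, which is the practical limitation of crude Monte Carlo estimation of small probabilities.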
A7 Basic Stochastic-Processes Theory
Stochastic processes are a powerful tool for investigating reliability and availability of repairable equipment and systems. A stochastic process can be considered as a family of time-dependent random variables or as a random function in time, and thus has a theoretical foundation based on probability theory (Appendix A6). The use of stochastic processes allows analysis of the influence of the failure-free and repair time distributions of elements, as well as of the system's structure, repair strategy, and logistic support, on the reliability and availability of a given system. Considering applications given in Chapter 6, this appendix mainly deals with regenerative stochastic processes with a finite state space, to which belong renewal processes, Markov processes, semi-Markov processes, and semi-regenerative processes, including reward and frequency/duration aspects. However, because of their importance, some nonregenerative processes (in particular the nonhomogeneous Poisson process) are introduced in Appendix A7.8. This appendix is a compendium of the theory of stochastic processes, consistent from a mathematical point of view but still with reliability engineering applications in mind. Selected examples illustrate the practical aspects.
A7.1 Introduction
Stochastic processes are mathematical models for random phenomena evolving over time, such as the time behavior of a repairable system or the noise voltage of a diode. They are designated hereafter by Greek letters ξ(t), ζ(t), η(t), ν(t), etc. To introduce the concept of stochastic process, consider the time behavior of a system subject to random influences and let T be the time interval of interest, e.g. T = [0, ∞). The set of possible states of the system, i.e. the state space, is assumed to be a subset of the set of real numbers. The state of the system at a given time t0 is thus a random variable ξ(t0). The random variables ξ(t), t ∈ T, may be arbitrarily coupled together. However, for any n = 1, 2, ..., and arbitrary values t1, ..., tn ∈ T, the existence of the n-dimensional distribution function (Eq. (A6.51))
\[ F(x_1, \dots, x_n, t_1, \dots, t_n) = \Pr\{ \xi(t_1) \le x_1, \dots, \xi(t_n) \le x_n \} \tag{A7.1} \]
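is assumed. ξ(t1), ..., ξ(tn) are thus the components of a random vector ξ(t). It can be shown that the family of n-dimensional distribution functions (Eq. (A7.1)) satisfies the consistency condition
\[ F(x_1, \dots, x_k, \infty, \dots, \infty, t_1, \dots, t_k, t_{k+1}, \dots, t_n) = F(x_1, \dots, x_k, t_1, \dots, t_k), \qquad k < n , \]
and the symmetry condition
\[ F(x_{i_1}, \dots, x_{i_n}, t_{i_1}, \dots, t_{i_n}) = F(x_1, \dots, x_n, t_1, \dots, t_n) \]
for every permutation (i_1, ..., i_n) of (1, ..., n).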
Conversely, if a family of distribution functions F(x1, ..., xn, t1, ..., tn) satisfying the above consistency and symmetry conditions is given, then according to a theorem of A.N. Kolmogorov [A6.10], a distribution law on a suitable event field of the space ℝ^T consisting of all real functions on T exists. This distribution law is the distribution of a random function ξ(t), t ∈ T, usually referred to as a stochastic process. The time function resulting from a particular experiment is called a sample path or realization of the stochastic process. All sample paths are in ℝ^T; however, the set of sample paths for a particular stochastic process can be significantly smaller than ℝ^T, e.g. consisting only of increasing step functions. In the case of discrete time, the notion of a sequence of random variables ξ_n, n ∈ T, is generally used. The concept of a stochastic process generalizes the concept of a random variable introduced in Appendix A6.5. If the random variables ξ(t) are defined as measurable functions ξ(t) = ξ(t, ω), t ∈ T, on a given probability space [Ω, F, Pr], then
\[ F(x_1, \dots, x_n, t_1, \dots, t_n) = \Pr\{ \omega : \xi(t_1, \omega) \le x_1, \dots, \xi(t_n, \omega) \le x_n \} , \]
and the consistency and symmetry conditions are fulfilled. ω represents the random influence. The function ξ(t, ω), t ∈ T, is for a given ω a realization of the stochastic process.
The Kolmogorov theorem assures the existence of a stochastic process. However, the determination of all n-dimensional distribution functions is practically impossible, in general. For many applications, some specific parameters of the stochastic process involved, such as state probabilities or stay (sojourn) times, are often sufficient. The problem considered, and the model assumed, generally allow determination of
• the time domain T (continuous, discrete, finite, infinite),
• the structure of the state space (continuous, discrete),
• the dependency structure of the process under consideration (e.g. memoryless),
• invariance properties with respect to time shifts (time-homogeneous, stationary).
The simplest process in discrete time is a sequence of independent random variables ξ1, ξ2, .... Also easy to describe are processes with independent increments, for instance Poisson processes (Appendices A7.2.5 & A7.8.2), for which
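\[ \Pr\{ \xi(t_0) \le x_0 \cap \xi(t_1) - \xi(t_0) \le x_1 \cap \dots \cap \xi(t_n) - \xi(t_{n-1}) \le x_n \} = \Pr\{ \xi(t_0) \le x_0 \} \prod_{i=1}^{n} \Pr\{ \xi(t_i) - \xi(t_{i-1}) \le x_i \} \tag{A7.2} \]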
holds for arbitrary n = 1, 2, ..., x1, ..., xn, and t0 < ... < tn ∈ T. For reliability investigations, processes with continuous time parameter t ≥ 0 and discrete state space {Z0, ..., Zm} are important. Among these, the following processes will be discussed in the following sections (see Table 6.1 for a comparison):
• renewal processes,
• Markov processes,
• semi-Markov processes,
• semi-regenerative processes (processes with an embedded semi-Markov process),
• particular nonregenerative processes (nonhomogeneous Poisson processes for instance).
Markov processes represent a straightforward generalization of sequences of independent random variables. They are processes without aftereffect. With this, the evolution of the process after an arbitrary time point t only depends on t and on the state occupied at t, not on the evolution of the process before t. For time-homogeneous Markov processes, the dependence on t also disappears (memoryless property). Markov processes are very simple regenerative stochastic processes. They are regenerative with respect to each state and, if time-homogeneous, also with respect to any time t. Semi-Markov processes have the Markov property at the time points of any state change; i.e., all states of a semi-Markov process are regeneration states. In a semi-regenerative process, a subset {Z0, ..., Zk} of the states {Z0, ..., Zm} are regeneration states and constitute an embedded semi-Markov process. For an arbitrary regenerative stochastic process, there exists a sequence of random points (regeneration points) at which the process forgets its foregoing evolution and (from a probabilistic point of view) restarts anew. Typically, regeneration points occur when the process returns to some particular states (regeneration states). Between regeneration points, the dependency structure of the process can be very complicated.
In order to describe the time behavior of systems which are in statistical equilibrium (steady-state), stationary and time-homogeneous processes are suitable. The process ξ(t) is stationary (strictly stationary) if for arbitrary n = 1, 2, ..., t1, ..., tn, and time span a (ti, ti + a ∈ T, i = 1, ..., n)
\[ F(x_1, \dots, x_n, t_1 + a, \dots, t_n + a) = F(x_1, \dots, x_n, t_1, \dots, t_n) . \tag{A7.3} \]
For n = 1, Eq. (A7.3) shows that the distribution function of the random variable ξ(t) is independent of t. Hence, E[ξ(t)], Var[ξ(t)], and all other moments are independent of time. For n = 2, the distribution function of the two-dimensional random variable (ξ(t), ξ(t + u)) is only a function of u. From this it follows that the correlation coefficient between ξ(t) and ξ(t + u) is also only a function of u
\[ \rho_{\xi\xi}(t, t+u) = \frac{E[\xi(t)\,\xi(t+u)] - E^2[\xi(t)]}{\mathrm{Var}[\xi(t)]} = \rho_{\xi\xi}(u) . \]
Besides stationarity in the strict sense, stationarity is also defined in the wide sense. The process ξ(t) is stationary in the wide sense if the mean E[ξ(t)], the variance Var[ξ(t)], and the correlation coefficient ρ_ξξ(t, t + u) are finite and independent of t. Stationarity in the strict sense of a process having a finite variance implies stationarity in the wide sense. The contrary is true only in some particular cases, e.g. for the normal process (process for which all n-dimensional distribution functions (Eq. (A7.1)) are n-dimensional normal distribution functions, see Example A6.16). A process ξ(t) is time-homogeneous if it has stationary increments, i.e. if for arbitrary n = 1, 2, ..., values x1, ..., xn, time span a, and disjoint intervals (ti, bi) (ti, ti + a, bi, bi + a ∈ T, i = 1, ..., n)
\[ \Pr\{ \xi(b_1 + a) - \xi(t_1 + a) \le x_1 \cap \dots \cap \xi(b_n + a) - \xi(t_n + a) \le x_n \} = \Pr\{ \xi(b_1) - \xi(t_1) \le x_1 \cap \dots \cap \xi(b_n) - \xi(t_n) \le x_n \} . \]
If ξ(t) is stationary, it is also time-homogeneous. The contrary is not true, in general. However, time-homogeneous Markov processes (for instance) become stationary as t → ∞. The stochastic processes discussed in this appendix evolve in time, and their state space is a subset of the natural numbers. Both restrictions can be omitted, without particular difficulties, with a view to a general theory of stochastic processes.
A7.2 Renewal Processes
In reliability theory, renewal processes describe the model of an item in continuous operation which is replaced at each failure, in a negligible amount of time, by a new, statistically identical item. Results for renewal processes are basic and useful in many practical situations. To define the renewal process, let τ0, τ1, ... be (stochastically) independent and non-negative random variables distributed according to
\[ F_A(x) = \Pr\{ \tau_0 \le x \}, \qquad x \ge 0 , \]
and
\[ F(x) = \Pr\{ \tau_i \le x \}, \qquad i = 1, 2, \dots, \quad x > 0 . \tag{A7.7} \]
The random variables
\[ S_n = \sum_{i=0}^{n-1} \tau_i , \qquad n = 1, 2, \dots , \]
or equivalently the sequence τ0, τ1, ..., constitute a renewal process. The points S1, S2, ... are renewal points (regeneration points). The renewal process is thus a particular point process. The arcs relating the time points 0, S1, S2, ... in Fig. A7.1a help to visualize the underlying point process. A count function
\[ \nu(t) = \begin{cases} 0 & \text{for } t < S_1 \\ n & \text{for } S_n \le t < S_{n+1}, \quad n = 1, 2, \dots \end{cases} \]
can be associated with a renewal process, giving the number of renewal points in the interval (0, t] (Fig. A7.1b). Renewal processes are ordinary for F_A(x) = F(x), otherwise they are modified (stationary for F_A(x) as in Eq. (A7.35)). To simplify the analysis, let us assume in the following that F_A(0) = F(0) = 0, that the densities f_A(x) = dF_A(x)/dx and f(x) = dF(x)/dx exist, and that
\[ MTTF_A = E[\tau_0] < \infty \quad \text{and} \quad MTTF = E[\tau_i] = \int_0^{\infty} (1 - F(x))\, dx < \infty, \quad i \ge 1 . \tag{A7.11} \]
As τ0, τ1, ... are interarrival times, the variable x starting by 0 at t = 0 and at each renewal point S1, S2, ... (arrival times) is used instead of t (Fig. A7.1a).
Figure A7.1 a) Possible time schedule for a renewal process; b) Corresponding count function ν(t) (S1, S2, ... are renewal (regeneration) points, x starts by 0 at t = 0 and at each renewal point)
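A7.2.1 Renewal Function, Renewal Density
Consider first the distribution function of the number of renewal points ν(t) in the time interval (0, t]. From Fig. A7.1,
\[ \Pr\{ \nu(t) \le n-1 \} = \Pr\{ S_n > t \} = 1 - \Pr\{ S_n \le t \} = 1 - \Pr\{ \tau_0 + \dots + \tau_{n-1} \le t \} = 1 - F_n(t), \qquad n = 1, 2, \dots . \tag{A7.12} \]
The functions F_n(t) can be calculated recursively (Eq. (A6.73))
\[ F_1(t) = F_A(t), \qquad F_{n+1}(t) = \int_0^t F_n(t - x)\, f(x)\, dx, \qquad n = 1, 2, \dots . \]
From Eq. (A7.12) it follows that
\[ \Pr\{ \nu(t) = n \} = \Pr\{ \nu(t) \le n \} - \Pr\{ \nu(t) \le n-1 \} = F_n(t) - F_{n+1}(t), \qquad n = 1, 2, \dots , \]
and thus, for the expected value (mean) of ν(t),
\[ E[\nu(t)] = \sum_{n=1}^{\infty} n\, [F_n(t) - F_{n+1}(t)] = \sum_{n=1}^{\infty} F_n(t) = H(t) . \tag{A7.15} \]
The function H(t) defined by Eq. (A7.15) is the renewal function. Due to F(0) = 0, one has H(0) = 0. The distribution functions F_n(t) have densities (Eq. (A6.74))
\[ f_1(t) = f_A(t) \quad \text{and} \quad f_n(t) = \int_0^t f(x)\, f_{n-1}(t - x)\, dx, \qquad n = 2, 3, \dots , \tag{A7.16} \]
and are thus the convolutions of f(x) with f_{n−1}(x). Changing the order of summation and integration one obtains from Eq. (A7.15)
\[ H(t) = \sum_{n=1}^{\infty} \int_0^t f_n(x)\, dx = \int_0^t \sum_{n=1}^{\infty} f_n(x)\, dx . \tag{A7.17} \]
The function
\[ h(t) = \frac{dH(t)}{dt} = \sum_{n=1}^{\infty} f_n(t) \]
is the renewal density. h(t) is the failure intensity z(t) (Eq. (A7.228)) for the case in which failures of a repairable item (system) with negligible repair times can be described by a renewal process (see also Eqs. (A7.24) and (A7.229)).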
H(t), as per Eq. (A7.17), satisfies
\[ H(t) = F_A(t) + \int_0^t H(t - x)\, f(x)\, dx . \tag{A7.19} \]
Equation (A7.19) is the renewal equation. The corresponding equation for the renewal density is
\[ h(t) = f_A(t) + \int_0^t h(t - x)\, f(x)\, dx . \tag{A7.20} \]
It can be shown that Eq. (A7.20) has exactly one solution whose Laplace transform \( \tilde{h}(s) \) exists and is given by (Appendix A9.7)
\[ \tilde{h}(s) = \frac{\tilde{f}_A(s)}{1 - \tilde{f}(s)} . \]
For an ordinary renewal process (F_A(x) = F(x)) it holds that
\[ \tilde{h}(s) = \frac{\tilde{f}(s)}{1 - \tilde{f}(s)} . \tag{A7.22} \]
Thus, an ordinary renewal process is completely characterized by its renewal density h(t) or renewal function H(t). In particular, it can be shown (e.g. [6.4]) that
\[ \mathrm{Var}[\nu(t)] = H(t) + 2 \int_0^t h(x)\, H(t - x)\, dx - (H(t))^2 . \tag{A7.23} \]
It is not difficult to see that H(t) = E[ν(t)] and Var[ν(t)] are finite for all t < ∞. The renewal density h(t) has the following important meaning: Due to the assumption F_A(0) = F(0) = 0, it follows that
\[ \lim_{\delta t \downarrow 0} \frac{1}{\delta t} \Pr\{ \nu(t + \delta t) - \nu(t) > 1 \} = 0 \]
and thus, for δt ↓ 0,
\[ \Pr\{ \text{any one of the renewal points } S_1 \text{ or } S_2 \text{ or } \dots \text{ lies in } (t, t + \delta t] \} = h(t)\, \delta t + o(\delta t) . \tag{A7.24} \]
Equation (A7.24) gives the unconditional probability for one renewal point in (t, t + δt]. h(t) corresponds thus to the failure intensity z(t) (Eq. (A7.228)) and the intensity m(t) of a Poisson process (homogeneous (Eq. (A7.42)) or nonhomogeneous (Eq. (A7.193))), but differs basically from the failure rate λ(t) (Eq. (A6.25)), which gives the conditional probability for a failure in (t, t + δt] given that no failure has occurred in (0, t], and can thus be used for τ0 only (as a function of t). This distinction is important also for the case of a homogeneous Poisson process (F_A(x) = F(x) = 1 − e^{−λx}, Appendix A7.2.5), for which λ(x) = λ holds for all
interarrival times (with x starting by 0 at each renewal point) and h(t) = λ holds for the whole process. Misuses are known, see e.g. [6.3]. Example A7.1 discusses the shape of H(t) for some practical applications.
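Example A7.1
Give the renewal function $H(t)$, analytically for
(i) $f_A(x) = f(x) = \lambda e^{-\lambda x}$ (Exponential)
(ii) $f_A(x) = f(x) = 0.5\,\lambda (\lambda x)^2 e^{-\lambda x}$ (Erlang with $n = 3$)
(iii) $f_A(x) = f(x) = \lambda (\lambda x)^{\beta-1} e^{-\lambda x} / \Gamma(\beta)$ (Gamma),
and numerically for $\lambda(x) = \lambda$ for $0 \le x < \psi$ and $\lambda(x) = \lambda + \beta \lambda_w^{\beta} (x-\psi)^{\beta-1}$ for $x \ge \psi$, i.e. for
(iv) $F_A(x) = F(x) = 1 - e^{-\lambda x}$ for $0 \le x < \psi$ and $F_A(x) = F(x) = 1 - e^{-(\lambda x + \lambda_w^{\beta}(x-\psi)^{\beta})}$ for $x \ge \psi$,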
with $\lambda = 4 \cdot 10^{-6}\,\mathrm{h}^{-1}$, $\lambda_w = 10^{-5}\,\mathrm{h}^{-1}$, $\beta = 5$, $\psi = 2 \cdot 10^{5}\,\mathrm{h}$ (wearout), and for (v) $F_A(x) = F(x)$ as in case (iv) but with $\beta = 0.3$ and $\psi = 0$ (early failures).
Give the solution in a graphical form for cases (iv) and (v).
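Solution
The Laplace transforms of $f_A(t)$ and $f(t)$ for the cases (i) to (iii) are (Table A9.7b)
(i) $\tilde{f}_A(s) = \tilde{f}(s) = \lambda/(s+\lambda)$
(ii) $\tilde{f}_A(s) = \tilde{f}(s) = \lambda^3/(s+\lambda)^3$
(iii) $\tilde{f}_A(s) = \tilde{f}(s) = \lambda^{\beta}/(s+\lambda)^{\beta}$.
$\tilde{h}(s)$ follows then from Eq. (A7.22), yielding $h(t)$ or directly $H(t) = \int_0^t h(x)\, dx$:
(i) $\tilde{h}(s) = \lambda/s$ and $H(t) = \lambda t$
(ii) $\tilde{h}(s) = \dfrac{\lambda^3}{s\,(s^2 + 3\lambda s + 3\lambda^2)} = \dfrac{\lambda^3}{s\,[(s + \tfrac{3}{2}\lambda)^2 + \tfrac{3}{4}\lambda^2]}$ and $H(t) = \dfrac{1}{3}\Big[\lambda t - 1 + \dfrac{2}{\sqrt{3}}\, e^{-3\lambda t/2} \sin\!\Big(\dfrac{\sqrt{3}\,\lambda t}{2} + \dfrac{\pi}{3}\Big)\Big]$
(iii) $\tilde{h}(s) = \dfrac{\lambda^{\beta}/(s+\lambda)^{\beta}}{1 - \lambda^{\beta}/(s+\lambda)^{\beta}} = \sum_{n=1}^{\infty} \big[\lambda^{\beta}/(s+\lambda)^{\beta}\big]^n = \sum_{n=1}^{\infty} \dfrac{\lambda^{n\beta}}{(s+\lambda)^{n\beta}}$ and $H(t) = \sum_{n=1}^{\infty} \int_0^t \dfrac{\lambda^{n\beta} x^{n\beta-1}}{\Gamma(n\beta)}\, e^{-\lambda x}\, dx$.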
Cases (iv) and (v) can only be solved numerically or by simulation. Figure A7.2 gives the results for these two cases in a graphical form (see Eq. (A7.28) for the asymptotic behavior of $H(t)$, dashed line in Fig. A7.2a). Figure A7.2 shows that the convergence of $H(t)$ to its asymptotic value is reasonably fast. The shape of $H(t)$ allows recognition of the presence of wearout (iv) or early failures (v), but cannot deliver precise indications on the failure rate shape (see Section 7.6.3.3 and Problem A7.2 in Appendix A11).
Figure A7.2 a) Renewal function $H(t)$ and b) failure rate $\lambda(x)$ and density function $f(x)$ for cases (iv) and (v) in Example A7.1 ($H(t)$ was obtained empirically, simulating 1000 failure-free times and plotting $H(t)$ as a continuous curve; $\delta = [(\sigma/\mathrm{MTTF})^2 - 1]/2$ according to Eq. (A7.28))
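As an illustration of how a renewal function can be obtained by simulation, the following minimal Python sketch estimates $H(t) = E[\nu(t)]$ for case (ii) of Example A7.1 (Erlang with $n = 3$) by counting renewal points in $(0, t]$, and compares the estimate with the closed-form solution given above. The function names, the number of runs, and the choice $\lambda = 1$ are assumptions made here for illustration only; this is not the procedure used for Fig. A7.2.

```python
import math
import random


def renewal_function_erlang3(lam, t):
    """Closed-form H(t) of Example A7.1, case (ii) (Erlang, n = 3)."""
    damped = (2.0 / math.sqrt(3.0)) * math.exp(-1.5 * lam * t)
    phase = math.sqrt(3.0) * lam * t / 2.0 + math.pi / 3.0
    return (lam * t - 1.0 + damped * math.sin(phase)) / 3.0


def erlang3(lam):
    """Return a sampler for the sum of three independent exponential times."""
    return lambda rng: sum(rng.expovariate(lam) for _ in range(3))


def simulate_renewal_function(draw, t, runs, rng):
    """Estimate H(t) as the average number of renewal points in (0, t]."""
    total = 0
    for _ in range(runs):
        s = draw(rng)            # first renewal point S_1
        while s <= t:
            total += 1
            s += draw(rng)       # next renewal point
    return total / runs
```

A call such as `simulate_renewal_function(erlang3(1.0), 5.0, 20000, random.Random(0))` can then be compared with `renewal_function_erlang3(1.0, 5.0)`; the agreement improves with the number of simulated runs.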
A7.2.2 Recurrence Times
Consider now the distribution functions of the forward recurrence time $\tau_R(t)$ and the backward recurrence time $\tau_S(t)$. As shown in Fig. A7.1a, $\tau_R(t)$ and $\tau_S(t)$ are the time intervals from an arbitrary time point $t$ forward to the next renewal point and backward to the last renewal point (or to the time origin), respectively. It follows from Fig. A7.1a that the event $\tau_R(t) > x$ occurs with one of the following mutually exclusive events
$A_0 = \{S_1 > t + x\}$
$A_n = \{(S_n \le t) \cap (\tau_n > t + x - S_n)\}$, $n = 1, 2, \ldots$ .
Obviously, $\Pr\{A_0\} = 1 - F_A(t + x)$. The event $A_n$ means that exactly $n$ renewal points have occurred before $t$ and the $(n+1)$th renewal point occurs after $t + x$. Because $S_n$ and $\tau_n$ are independent, it follows that
$\Pr\{A_n \mid S_n = y\} = \Pr\{\tau_n > t + x - y\}$, $n = 1, 2, \ldots$,
and thus, from the theorem of total probability (Eq. (A6.17))
$\Pr\{\tau_R(t) > x\} = 1 - F_A(t+x) + \int_0^t h(y)\,(1 - F(t+x-y))\, dy$,
yielding finally
$\Pr\{\tau_R(t) \le x\} = F_A(t+x) - \int_0^t h(y)\,(1 - F(t+x-y))\, dy$.  (A7.25)
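The distribution function of the backward recurrence time $\tau_S(t)$ can be obtained as
$\Pr\{\tau_S(t) \le x\} = \int_{t-x}^{t} h(y)\,(1 - F(t-y))\, dy$ for $0 \le x < t$, and $\Pr\{\tau_S(t) \le x\} = 1$ for $x \ge t$.  (A7.26)
Since $\Pr\{\tau_0 > t\} = 1 - F_A(t)$, the distribution function of $\tau_S(t)$ makes a jump of height $1 - F_A(t)$ at the point $x = t$.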
A7.2.3 Asymptotic Behavior
Asymptotic behavior of a renewal process (generally of a stochastic process) is understood to be the behavior of the process for $t \to \infty$. The following theorems hold with MTTF as in Eq. (A7.11):
1. Elementary Renewal Theorem [A6.6 (vol. II), A7.24]: If the conditions (A7.9) - (A7.11) are fulfilled, then
$\lim_{t \to \infty} \dfrac{H(t)}{t} = \dfrac{1}{\mathrm{MTTF}}$, where $H(t) = E[\nu(t)]$.  (A7.27)
For $\mathrm{Var}[\nu(t)]$ it holds that $\lim_{t \to \infty} \mathrm{Var}[\nu(t)]/t = \sigma^2/\mathrm{MTTF}^3$, with $\sigma^2 = \mathrm{Var}[\tau_i] < \infty$, $i \ge 1$. (It can also be shown [6.16] that $\lim_{t \to \infty} (\nu(t)/t) = 1/\mathrm{MTTF}$ holds with probability 1.)
2. Tightened Elementary Renewal Theorem [A7.24, A7.29 (1957)]: If the conditions (A7.9) - (A7.11) are fulfilled, $E[\tau_0] = \mathrm{MTTF}_A < \infty$ and $\sigma^2 = \mathrm{Var}[\tau_i] < \infty$, $i \ge 1$, then
$\lim_{t \to \infty} \Big(H(t) - \dfrac{t}{\mathrm{MTTF}}\Big) = \dfrac{\sigma^2}{2\,\mathrm{MTTF}^2} - \dfrac{\mathrm{MTTF}_A}{\mathrm{MTTF}} + \dfrac{1}{2}$.  (A7.28)
3. Key Renewal Theorem [A7.9 (vol. II), A7.24]: If the conditions (A7.9) - (A7.11) are fulfilled, $U(z) \ge 0$ is bounded, nonincreasing, and Riemann integrable over the interval $(0, \infty)$, and $h(t)$ is a renewal density, then
$\lim_{t \to \infty} \int_0^t U(t-y)\, h(y)\, dy = \dfrac{1}{\mathrm{MTTF}} \int_0^{\infty} U(z)\, dz$.  (A7.29)
For any $a > 0$, the key renewal theorem leads, with
$U(z) = 1$ for $0 < z < a$ and $U(z) = 0$ otherwise,
to Blackwell's Theorem [A7.9 (vol. II), A7.24]
$\lim_{t \to \infty} \dfrac{H(t+a) - H(t)}{a} = \dfrac{1}{\mathrm{MTTF}}$.  (A7.30)
Conversely, the key renewal theorem can be obtained from Blackwell's theorem.
4. Renewal Density Theorem [A7.9 (1941), A7.24]: If the conditions (A7.9) - (A7.11) are fulfilled, $f_A(x)$ & $f(x)$ go to 0 as $x \to \infty$, and $\mathrm{Var}[\tau_i] < \infty$, $i \ge 1$, then
$\lim_{t \to \infty} h(t) = \dfrac{1}{\mathrm{MTTF}}$.  (A7.31)
5. Recurrence Time Limit Theorems: Assuming $U(z) = 1 - F(x + z)$ in Eq. (A7.29) and considering $F_A(\infty) = 1$ & MTTF per Eq. (A7.11), Eq. (A7.25) yields
$\lim_{t \to \infty} \Pr\{\tau_R(t) \le x\} = 1 - \dfrac{1}{\mathrm{MTTF}} \int_0^{\infty} (1 - F(x+z))\, dz = \dfrac{1}{\mathrm{MTTF}} \int_0^{x} (1 - F(y))\, dy$.  (A7.32)
For $t \to \infty$, the density of the forward recurrence time $\tau_R(t)$ is thus given by $f_{\tau_R}(x) = (1 - F(x))/\mathrm{MTTF}$. Assuming $E[\tau_i] = \mathrm{MTTF} < \infty$, $\sigma^2 = \mathrm{Var}[\tau_i] < \infty$, $i \ge 1$ & $E[\tau_R(t)] < \infty$, it follows that $\lim_{x \to \infty} (x^2 (1 - F(x))) = 0$. Integration by parts yields
$\lim_{t \to \infty} E[\tau_R(t)] = \dfrac{1}{\mathrm{MTTF}} \int_0^{\infty} x\,(1 - F(x))\, dx = \dfrac{\mathrm{MTTF}}{2} + \dfrac{\sigma^2}{2\,\mathrm{MTTF}}$.  (A7.33)
The result of Eq. (A7.33) is important to clarify the waiting time paradox:
(i) $\lim_{t \to \infty} E[\tau_R(t)] = \mathrm{MTTF}/2$ holds for $\sigma^2 = 0$ (i.e. for $\tau_i = \mathrm{MTTF}$, $i \ge 0$), and
(ii) $\lim_{t \to \infty} E[\tau_R(t)] = E[\tau_i] = 1/\lambda$, $i \ge 0$, holds for $F_A(x) = F(x) = 1 - e^{-\lambda x}$.
Similar results hold for the backward recurrence time $\tau_S(t)$. For a simultaneous observation of $\tau_R(t)$ and $\tau_S(t)$, it must be noted that in this case $\tau_R(t)$ and $\tau_S(t)$ belong to the same $\tau_i$ and are independent only for case (ii).
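The waiting time paradox can be made tangible with a short simulation. The sketch below, written under the assumption of an ordinary renewal process observed at a fixed, sufficiently large time $t$, estimates the mean forward recurrence time and compares it with the right-hand side of Eq. (A7.33). The helper names and the two test distributions (exponential, and uniform on $(0, 2)$) are illustrative choices, not part of the original text.

```python
import random


def forward_recurrence_time(draw, t, rng):
    """Time from t to the next renewal point of one simulated renewal process."""
    s = draw(rng)
    while s <= t:
        s += draw(rng)
    return s - t


def mean_forward_recurrence(draw, t, runs, rng):
    """Monte Carlo estimate of E[tau_R(t)]."""
    return sum(forward_recurrence_time(draw, t, rng) for _ in range(runs)) / runs


def limit_mean(mttf, variance):
    """Right-hand side of Eq. (A7.33): MTTF/2 + sigma^2 / (2 MTTF)."""
    return mttf / 2.0 + variance / (2.0 * mttf)
```

For exponentially distributed failure-free times with $\lambda = 1$ the limit equals the full mean $1/\lambda$ (case (ii)), whereas for times uniformly distributed on $(0, 2)$, with the same mean 1 and variance 1/3, the limit is 2/3; only for $\sigma^2 = 0$ does it reduce to MTTF/2.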
6. Central Limit Theorem for Renewal Processes [A7.24 (1954), A7.29 (1957)]: If the conditions (A7.9) and (A7.11) are fulfilled and $\sigma^2 = \mathrm{Var}[\tau_i] < \infty$, $i \ge 1$, then
$\lim_{t \to \infty} \Pr\Big\{\dfrac{\nu(t) - t/\mathrm{MTTF}}{\sigma \sqrt{t/\mathrm{MTTF}^3}} \le x\Big\} = \dfrac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-y^2/2}\, dy$.  (A7.34)
Equation (A7.34) is a consequence of the central limit theorem for the sum of independent and identically distributed random variables (Eq. (A6.148)). Equations (A7.27) - (A7.34) show that the renewal process with an arbitrary initial distribution function $F_A(x)$ converges to a statistical equilibrium (steady-state) as $t \to \infty$; see Appendix A7.2.4 for a discussion of stationary renewal processes.
A7.2.4 Stationary Renewal Processes
The results of Appendix A7.2.3 allow a stationary renewal process to be defined as follows:
A renewal process is stationary (in steady-state) if for all $t \ge 0$ the distribution function of $\tau_R(t)$ in Eq. (A7.25) does not depend on $t$.
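It is intuitively clear that such a situation can only occur if a particular relationship exists between the distribution functions $F_A(x)$ and $F(x)$ given by Eqs. (A7.6) and (A7.7). Assuming
$F_A(x) = \dfrac{1}{\mathrm{MTTF}} \int_0^x (1 - F(y))\, dy$,  (A7.35)
it follows that $f_A(x) = (1 - F(x))/\mathrm{MTTF}$, $\tilde{f}_A(s) = (1 - \tilde{f}(s))/(s\,\mathrm{MTTF})$, and thus from Eq. (A7.21)
$\tilde{h}(s) = \dfrac{1}{s\,\mathrm{MTTF}}$, yielding $h(t) = \dfrac{1}{\mathrm{MTTF}}$.  (A7.36)
With $F_A(x)$ & $h(x)$ from Eqs. (A7.35) & (A7.36), Eq. (A7.25) yields for any $t \ge 0$
$\Pr\{\tau_R(t) \le x\} = \dfrac{1}{\mathrm{MTTF}} \Big[\int_0^{t+x} (1 - F(y))\, dy - \int_0^{t} (1 - F(t+x-y))\, dy\Big] = \dfrac{1}{\mathrm{MTTF}} \int_0^{x} (1 - F(y))\, dy$.  (A7.37)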
Equation (A7.35) is thus a necessary and sufficient condition for stationarity of the renewal process with $\Pr\{\tau_i \le x\} = F(x)$, $i \ge 1$. It is not difficult to show that the count process $\nu(t)$ given in Fig. A7.1b, belonging to a stationary renewal process, is a process with stationary increments. For any $t$, $a > 0$, and $n = 1, 2, \ldots$ it follows that
$\Pr\{\nu(t+a) - \nu(t) = n\} = \Pr\{\nu(a) = n\} = F_n(a) - F_{n+1}(a)$,
with $F_{n+1}(a)$ as in Eq. (A7.13) and $F_A(x)$ as in Eq. (A7.35). Moreover, for a stationary renewal process, $H(t) = t/\mathrm{MTTF}$ and the mean number of renewals within an arbitrary interval $(t, t+a]$ is
$H(t+a) - H(t) = \dfrac{a}{\mathrm{MTTF}}$.
Comparing Eq. (A7.32) with Eq. (A7.37) it follows that under weak conditions, as $t \to \infty$ every renewal process becomes stationary. From this, the following interpretation can be made which is useful for practical applications:
A stationary renewal process can be regarded as a renewal process with arbitrary initial condition $F_A(x)$, which has been started at $t = -\infty$ and will only be considered for $t \ge 0$ ($t = 0$ being an arbitrary time point).
The most important properties of stationary renewal processes are summarized in Table A7.1. Equation (A7.32) also obviously holds for $\tau_R(t)$ and $\tau_S(t)$ in the case of a stationary renewal process.
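A7.2.5 Homogeneous Poisson Processes
The renewal process, defined by Eq. (A7.8), with
$F_A(x) = F(x) = 1 - e^{-\lambda x}$, $x \ge 0$, $\lambda > 0$,  (A7.38)
is a homogeneous Poisson process (HPP). $F_A(x)$ per Eq. (A7.38) fulfills Eq. (A7.35) and thus, the Poisson process is stationary. From Sections A7.2.1 to A7.2.3 it follows that (see also Example A6.20)
$\Pr\{\nu(t) = k\} = \dfrac{(\lambda t)^k}{k!}\, e^{-\lambda t}$, $k = 0, 1, 2, \ldots$,  (A7.41)
$H(t) = E[\nu(t)] = \lambda t$, $h(t) = \lambda$, $\mathrm{Var}[\nu(t)] = \lambda t$.  (A7.42)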
As a result of the memoryless property of the exponential distribution, the count process $\nu(t)$ (as in Fig. A7.1b) has independent increments. Quite generally, a point process is a homogeneous Poisson process (HPP), with intensity $\lambda$, if the associated count function $\nu(t)$ has stationary independent increments and satisfies Eq. (A7.41). Alternatively, a renewal process satisfying Eq. (A7.38) is an HPP. Substituting for $\lambda t$ in Eq. (A7.41) a nondecreasing function $M(t) > 0$, a nonhomogeneous Poisson process (NHPP) is obtained. The NHPP is a point process with independent Poisson distributed increments. Because of independent increments, the NHPP is a process without aftereffect (memoryless if HPP) and the sum of Poisson processes is a Poisson process (Eq. (7.27) for HPP). Moreover, the sum of $n$ independent renewal processes with low occurrence converges for $n \to \infty$ to an NHPP, and to an HPP in the case of stationary independent renewal processes (Appendix A7.8.3). However, despite its intrinsic simplicity, the NHPP is not a regenerative process, and in statistical data analysis the property of independent increments is often difficult to prove. Nonhomogeneous Poisson processes are introduced in Appendix A7.8.2 and used in Sections 7.6 and 7.7 for reliability tests.
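The link between exponentially distributed interarrival times and Poisson distributed counts can be checked numerically. The following sketch assumes that the count distribution of the HPP is the Poisson distribution $\Pr\{\nu(t) = k\} = (\lambda t)^k e^{-\lambda t}/k!$; it builds the process from exponential interarrival times and compares the empirical distribution of $\nu(t)$ with this expression. Function names and parameter values are illustrative assumptions.

```python
import math
import random


def poisson_pmf(k, mean):
    """Poisson probability for k events with the given mean (lambda * t)."""
    return mean ** k * math.exp(-mean) / math.factorial(k)


def count_renewals(lam, t, rng):
    """Number of renewal points in (0, t] for exponential interarrival times."""
    n = 0
    s = rng.expovariate(lam)
    while s <= t:
        n += 1
        s += rng.expovariate(lam)
    return n


def empirical_pmf(lam, t, runs, rng):
    """Relative frequencies of nu(t) over independent simulated processes."""
    counts = {}
    for _ in range(runs):
        k = count_renewals(lam, t, rng)
        counts[k] = counts.get(k, 0) + 1
    return {k: c / runs for k, c in counts.items()}
```

With, for instance, $\lambda = 2$ and $t = 1.5$, the relative frequencies should approach `poisson_pmf(k, 3.0)` and the sample mean of the counts should approach $H(t) = \lambda t = 3$.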
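Table A7.1 Main properties of a stationary renewal process

1. Distribution function of $\tau_0$
   Expression: $F_A(x) = \dfrac{1}{T} \int_0^x (1 - F(y))\, dy$
   Comments, assumptions: $F(0) = 0$, $f_A(x) = dF_A(x)/dx$, $T = \mathrm{MTTF} = E[\tau_i]$, $i \ge 1$
2. Distribution function of $\tau_i$, $i \ge 1$
   Expression: $F(x)$
   Comments, assumptions: $F(0) = 0$, $f(x) = dF(x)/dx$
3. Renewal function
   Expression: $H(t) = t/T$, $t \ge 0$
   Comments, assumptions: $H(t) = E[\nu(t)] = E[\text{number of renewal points in } (0, t]]$
4. Renewal density
   Expression: $h(t) = 1/T$, $t \ge 0$
   Comments, assumptions: $h(t) = dH(t)/dt$, $h(t) = \lim_{\delta t \downarrow 0} \frac{1}{\delta t} \Pr\{S_1 \text{ or } S_2 \text{ or } \ldots \text{ lies in } (t, t+\delta t]\}$
5. Distribution function & mean of the forward recurrence time
   Expression: $\Pr\{\tau_R(t) \le x\} = F_A(x)$, $t \ge 0$; $E[\tau_R(t)] = T/2 + \mathrm{Var}[\tau_i]/(2T)$
   Comments, assumptions: $F_A(x)$ as in point 1, same for $\tau_S(t)$; $T = \mathrm{MTTF} = E[\tau_i]$, $i \ge 1$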
A7.3 Alternating Renewal Processes
Generalization of the renewal process given in Fig. A7.1a by introducing a positive random replacement time, distributed according to $G(x)$, leads to the alternating renewal process. An alternating renewal process is a process with two states, which alternate from one state to the other after a stay (sojourn) time distributed according to $F(x)$ and $G(x)$, respectively. Considering the reliability and availability analysis of a repairable item in Section 6.2 and in order to simplify the notation, these two states will be referred to as the up state and the down state, abbreviated as u and d, respectively.
To define an alternating renewal process, consider two independent renewal processes $\{\tau_i\}$ and $\{\tau'_i\}$, $i = 0, 1, \ldots$. For reliability applications, $\tau_i$ denotes the ith failure-free time and $\tau'_i$ the ith repair (restoration) time. These random variables are distributed according to
$F_A(x)$ for $\tau_0$ and $F(x)$ for $\tau_i$, $i \ge 1$, $x > 0$,  (A7.45)
$G_A(x)$ for $\tau'_0$ and $G(x)$ for $\tau'_i$, $i \ge 1$, $x > 0$,  (A7.46)
with $F_A(0) = F(0) = G_A(0) = G(0) = 0$, densities $f_A(x)$, $f(x)$, $g_A(x)$, $g(x)$, and means ($< \infty$)
$\mathrm{MTTF} = E[\tau_i] = \int_0^{\infty} (1 - F(x))\, dx$, $i \ge 1$,  (A7.47)
and
$\mathrm{MTTR} = E[\tau'_i] = \int_0^{\infty} (1 - G(x))\, dx$, $i \ge 1$,  (A7.48)
where MTTF and MTTR are used for mean time to failure and mean time to repair (restoration). The sequences
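$\tau_0, \tau'_1, \tau_1, \tau'_2, \tau_2, \tau'_3, \ldots$ and $\tau'_0, \tau_1, \tau'_1, \tau_2, \tau'_2, \tau_3, \ldots$
form two modified alternating renewal processes, starting at $t = 0$ with $\tau_0$ and $\tau'_0$, respectively. Figure A7.3 shows a possible time schedule of these two alternating renewal processes (repair times greatly exaggerated). Embedded in every one of these processes are two renewal processes with renewal points $S_{udui}$ or $S_{uddi}$ marked with ▲ and $S_{duui}$ or $S_{dudi}$ marked with ●, where $udu$ denotes a transition from up to down given up at $t = 0$, i.e.
$S_{udu1} = \tau_0$ and $S_{udui} = \tau_0 + (\tau'_1 + \tau_1) + \ldots + (\tau'_{i-1} + \tau_{i-1})$, $i > 1$.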
Figure A7.3 Possible time schedule for two alternating renewal processes starting at $t = 0$ with $\tau_0$ and $\tau'_0$, respectively (shown are also the 4 embedded renewal processes with renewal points ● and ▲)
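These four embedded renewal processes are statistically identical up to the time intervals starting at $t = 0$, i.e. up to
$\tau_0$, $\tau_0 + \tau'_1$, $\tau'_0 + \tau_1$, $\tau'_0$.
The corresponding densities are
$f_A(x)$, $f_A(x) * g(x)$, $g_A(x) * f(x)$, $g_A(x)$
for the time interval starting at $t = 0$, and $f(x) * g(x)$ for all others. The symbol $*$ denotes convolution (Eq. (A6.75)). The results of Section A7.2 can be used to investigate the embedded renewal processes of Fig. A7.3. Equation (A7.22) yields Laplace transforms of the renewal densities $h_{udu}(t)$, $h_{duu}(t)$, $h_{udd}(t)$, and $h_{dud}(t)$
$\tilde{h}_{udu}(s) = \dfrac{\tilde{f}_A(s)}{1 - \tilde{f}(s)\,\tilde{g}(s)}$, $\tilde{h}_{duu}(s) = \dfrac{\tilde{f}_A(s)\,\tilde{g}(s)}{1 - \tilde{f}(s)\,\tilde{g}(s)}$,
$\tilde{h}_{udd}(s) = \dfrac{\tilde{g}_A(s)\,\tilde{f}(s)}{1 - \tilde{f}(s)\,\tilde{g}(s)}$, $\tilde{h}_{dud}(s) = \dfrac{\tilde{g}_A(s)}{1 - \tilde{f}(s)\,\tilde{g}(s)}$.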
To describe the alternating renewal process defined above (Fig. A7.3), let us introduce the two-dimensional stochastic process $(\zeta(t), \tau_{R\zeta(t)}(t))$, where $\zeta(t)$ denotes the state of the process (repairable item in reliability application)
$\zeta(t) = u$ if the item is up at time $t$, and $\zeta(t) = d$ if the item is down at time $t$.
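$\tau_{Ru}(t)$ and $\tau_{Rd}(t)$ are thus the forward recurrence times in the up and down states, respectively, provided that the item is up or down at the time $t$, see Fig. 6.3. To investigate the general case, both alternating renewal processes of Fig. A7.3 must be combined. For this let
$p = \Pr\{\text{item up at } t = 0\}$ and $1 - p = \Pr\{\text{item down at } t = 0\}$.  (A7.51)
In terms of the process $(\zeta(t), \tau_{R\zeta(t)}(t))$,
$p = \Pr\{\zeta(0) = u\}$, $F_A(x) = \Pr\{\tau_{Ru}(0) \le x \mid \zeta(0) = u\}$,
$1 - p = \Pr\{\zeta(0) = d\}$, $G_A(x) = \Pr\{\tau_{Rd}(0) \le x \mid \zeta(0) = d\}$.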
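Consecutive jumps from up to down form a renewal process with renewal density
$h_{ud}(t) = p\, h_{udu}(t) + (1 - p)\, h_{udd}(t)$.  (A7.52)
Similarly, the renewal density for consecutive jumps from down to up is given by
$h_{du}(t) = p\, h_{duu}(t) + (1 - p)\, h_{dud}(t)$.  (A7.53)
Using Eqs. (A7.52) and (A7.53), and considering Eq. (A7.25), it follows that
$\Pr\{\zeta(t) = u \cap \tau_{Ru}(t) > \theta\} = p\,(1 - F_A(t + \theta)) + \int_0^t h_{du}(x)\,(1 - F(t - x + \theta))\, dx$  (A7.54)
and
$\Pr\{\zeta(t) = d \cap \tau_{Rd}(t) > \theta\} = (1 - p)\,(1 - G_A(t + \theta)) + \int_0^t h_{ud}(x)\,(1 - G(t - x + \theta))\, dx$.  (A7.55)
Setting $\theta = 0$ in Eq. (A7.54) yields
$\Pr\{\zeta(t) = u\} = p\,(1 - F_A(t)) + \int_0^t h_{du}(x)\,(1 - F(t - x))\, dx$.  (A7.56)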
The probability $PA(t) = \Pr\{\zeta(t) = u\}$ is called the point availability and $IR(t, t+\theta) = \Pr\{\zeta(t) = u \cap \tau_{Ru}(t) > \theta\}$ the interval reliability of the given item (Section 6.2). An alternating renewal process, characterized by the parameters $p$, $F_A(x)$, $F(x)$, $G_A(x)$, and $G(x)$ is stationary if the two-dimensional process $(\zeta(t), \tau_{R\zeta(t)}(t))$ is stationary. As with the renewal process it can be shown that an alternating renewal process is stationary if and only if
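$p = \dfrac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}$, $F_A(x) = \dfrac{1}{\mathrm{MTTF}} \int_0^x (1 - F(y))\, dy$, $G_A(x) = \dfrac{1}{\mathrm{MTTR}} \int_0^x (1 - G(y))\, dy$,  (A7.57)
with MTTF and MTTR as in Eqs. (A7.47) and (A7.48). In particular, for $t \ge 0$ the following relationships apply for the stationary alternating renewal process (Examples 6.3 and 6.4)
$PA(t) = \Pr\{\text{item up at } t\} = \dfrac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} = PA$,  (A7.58)
$IR(t, t+\theta) = \Pr\{\text{item up at } t \text{ and remains up until } t + \theta\} = \dfrac{1}{\mathrm{MTTF} + \mathrm{MTTR}} \int_{\theta}^{\infty} (1 - F(y))\, dy$.  (A7.59)
Condition (A7.57) is equivalent to
$h_{ud}(t) = h_{du}(t) = \dfrac{1}{\mathrm{MTTF} + \mathrm{MTTR}}$, $t \ge 0$.  (A7.60)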
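Moreover, application of the key renewal theorem (Eq. (A7.29)) to Eqs. (A7.54) - (A7.56) yields (Example 6.4)
$\lim_{t \to \infty} \Pr\{\zeta(t) = u \cap \tau_{Ru}(t) > \theta\} = \dfrac{1}{\mathrm{MTTF} + \mathrm{MTTR}} \int_{\theta}^{\infty} (1 - F(y))\, dy$,  (A7.61)
$\lim_{t \to \infty} \Pr\{\zeta(t) = d \cap \tau_{Rd}(t) > \theta\} = \dfrac{1}{\mathrm{MTTF} + \mathrm{MTTR}} \int_{\theta}^{\infty} (1 - G(y))\, dy$,  (A7.62)
$\lim_{t \to \infty} \Pr\{\zeta(t) = u\} = \lim_{t \to \infty} PA(t) = PA = \dfrac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}$.  (A7.63)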
Thus, irrespective of its initial conditions $p$, $F_A(x)$, and $G_A(x)$, an alternating renewal process has for $t \to \infty$ an asymptotic behavior which is identical to the stationary state (steady-state). In other words:
A stationary alternating renewal process can be regarded as an alternating renewal process with arbitrary initial conditions $p$, $F_A(x)$, and $G_A(x)$, which has been started at $t = -\infty$ and will only be considered for $t \ge 0$ ($t = 0$ being an arbitrary time point).
It should be noted that the results of this section remain valid even if independence between $\tau_i$ and $\tau'_i$ within a cycle (e.g. $\tau_0 + \tau'_1$, $\tau_1 + \tau'_2$, ...) is dropped; only independence between cycles is necessary. For exponentially distributed $\tau_i$ and $\tau'_i$, i.e. for constant failure rate $\lambda$ and repair rate $\mu$ in reliability applications, the convergence of $PA(t)$ towards $PA$ stated by Eq. (A7.63) is of the form $PA(t) - PA = (\lambda/(\lambda + \mu))\, e^{-(\lambda + \mu)t} \approx (\lambda/\mu)\, e^{-\mu t}$. See Eq. (6.20) and Section 6.2.4 for further considerations.
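The convergence of the point availability can be illustrated for the exponential case. The sketch below assumes an item that is new (up) at $t = 0$, with constant failure rate $\lambda$ and repair rate $\mu$, so that $PA(t) = \mu/(\lambda+\mu) + (\lambda/(\lambda+\mu))\, e^{-(\lambda+\mu)t}$, and compares this expression with a direct simulation of the alternating renewal process. All names and numerical values are illustrative assumptions.

```python
import math
import random


def point_availability_exact(lam, mu, t):
    """PA(t) for exponential failure-free and repair times, item up at t = 0."""
    pa = mu / (lam + mu)                      # equals MTTF / (MTTF + MTTR)
    return pa + lam / (lam + mu) * math.exp(-(lam + mu) * t)


def is_up_at(lam, mu, t, rng):
    """Simulate one alternating up/down path and report the state at time t."""
    clock, up = 0.0, True
    while True:
        clock += rng.expovariate(lam if up else mu)
        if clock > t:
            return up
        up = not up


def point_availability_mc(lam, mu, t, runs, rng):
    """Fraction of simulated paths that are in the up state at time t."""
    return sum(is_up_at(lam, mu, t, rng) for _ in range(runs)) / runs
```

For $\lambda = 1$ and $\mu = 4$ (arbitrary time units), `point_availability_exact` starts at 1 for $t = 0$ and approaches $PA = 0.8$ with the time constant $1/(\lambda + \mu)$; the simulated fraction follows the same curve within statistical scatter.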
A7.4 Regenerative Processes
A regenerative process is characterized by the property that there is a sequence of random points on the time axis, regeneration points, at which the process forgets its foregoing evolution and, from a probabilistic point of view, restarts anew. The times at which a regenerative process restarts occur when the process returns to some states, defined as regeneration states. The sequence of these time points for a specific regeneration state is a renewal process embedded in the original stochastic process. For example, both the states up and down of an alternating renewal process are regeneration states. All states of time-homogeneous Markov processes and of semi-Markov processes, defined by Eqs. (A7.95) and (A7.158), are regenerative. However, there are processes in discrete state space with only few (two in Fig. A7.11, one in Fig. 6.10) or even with no regeneration states (see e.g. Appendix A7.8 for some considerations). A regenerative process must have at least one regeneration state. A regenerative process thus consists of independent cycles which describe the time behavior of the process between two consecutive regeneration points of the same type (same regeneration state). The ith cycle is characterized by a positive random variable $\tau_{ci}$ (duration of cycle i) and a stochastic process $\zeta_i(t)$ defined for $0 \le t < \tau_{ci}$ (content of the cycle). Let $\zeta_n(t)$, $0 \le t < \tau_{cn}$, $n = 0, 1, \ldots$ be (stochastically) independent and for $n \ge 1$ identically distributed cycles. For simplicity, let us assume that the time points $S_1 = \tau_{c0}$, $S_2 = \tau_{c0} + \tau_{c1}$, ... form a renewal process. The random variables $\tau_{c0}$ and $\tau_{ci}$, $i \ge 1$, have distribution functions $F_A(x)$ for $\tau_{c0}$ and $F(x)$ for $\tau_{ci}$, densities $f_A(x)$ and $f(x)$, and finite means $T_A$ and $T_c$, respectively. The regenerative process $\zeta(t)$ is then given by
$\zeta(t) = \zeta_0(t)$ for $0 \le t < S_1$, and $\zeta(t) = \zeta_n(t - S_n)$ for $S_n \le t < S_{n+1}$, $n = 1, 2, \ldots$ .
The regenerative structure is sufficient for the existence of an asymptotic behavior (limiting distribution) for the process as $t \to \infty$ (provided that the mean time between regeneration points is finite). This limiting distribution is determined by the behavior of the process between two consecutive regeneration points of the same regeneration state.
Defining $h(t)$ as the renewal density of the renewal process given by $S_1, S_2, \ldots$ and setting
$U_0(t, B) = \Pr\{\zeta_0(t) \in B \cap \tau_{c0} > t\}$, $U(t, B) = \Pr\{\zeta_i(t) \in B \cap \tau_{ci} > t\}$, $i = 1, 2, \ldots$,
it follows, similarly to Eq. (A7.25), that
$\Pr\{\zeta(t) \in B\} = U_0(t, B) + \int_0^t h(x)\, U(t - x, B)\, dx$.  (A7.64)
For any given distribution of the cycle $\zeta_i(t)$, $0 \le t < \tau_{ci}$, $i \ge 1$, with $T_c = E[\tau_{ci}] < \infty$, there exists a stationary regenerative process $\zeta_e(t)$ with regeneration points $S_{ei}$, $i \ge 1$. The cycles $\zeta_{en}(t)$, $0 \le t < \tau_{en}$, have for $n \ge 1$ the same distribution law as $\zeta_i(t)$, $0 \le t < \tau_{ci}$. The distribution law of the starting cycle $\zeta_{e0}(t)$, $0 \le t < \tau_{e0}$, can be calculated from the distribution law of $\zeta_i(t)$, $0 \le t < \tau_{ci}$, see Eq. (A7.57) for alternating renewal processes. In particular,
$\Pr\{\zeta_e(0) \in B\} = \dfrac{1}{T_c} \int_0^{\infty} U(t, B)\, dt$,  (A7.65)
with $T_c = E[\tau_{ci}] < \infty$, $i \ge 1$. Furthermore, for every non-negative function $g(t)$ and $S_1 = 0$,
$E[g(\zeta_e(0))] = \dfrac{1}{T_c}\, E\Big[\int_0^{\tau_{c1}} g(\zeta(t))\, dt\Big]$.  (A7.66)
Equation (A7.66) is known as the stochastic mean value theorem. Since $U(t, B)$ is nonincreasing and $\le 1 - F(t)$ for all $t \ge 0$, it follows from Eq. (A7.64) and the key renewal theorem (Eq. (A7.29)) that
$\lim_{t \to \infty} \Pr\{\zeta(t) \in B\} = \dfrac{1}{T_c} \int_0^{\infty} U(t, B)\, dt$.  (A7.67)
Equations (A7.65) and (A7.67) show that under general conditions, as $t \to \infty$ a regenerative process becomes stationary. As in the case of renewal and alternating renewal processes, the following interpretation is true:
A stationary regenerative process can be considered as a regenerative process with arbitrary distribution of the starting cycle, which has been started at $t = -\infty$ and will only be considered for $t \ge 0$ ($t = 0$ being an arbitrary time point).
A7.5 Markov Processes with Finitely Many States
Markov processes are processes without aftereffect. They are characterized by the property that for any (arbitrarily chosen) time point $t$ their evolution after $t$ depends on $t$ and the state occupied at $t$, but not on the process evolution up to the time $t$. In the case of time-homogeneous Markov processes, dependence on $t$ also disappears. In reliability theory, these processes describe the behavior of repairable systems with constant failure and repair rates for all elements. Constant rates are required during the stay (sojourn) time in any state, not necessarily at state changes (e.g. for load sharing). After an introduction to Markov chains, time-homogeneous Markov processes with finitely many states are considered in depth, as basis for Chapter 6.
A7.5.1 Markov Chains with Finitely Many States
Let $\xi_0, \xi_1, \ldots$ be the sequence of consecutively occurring states. A stochastic process in discrete time $\xi_n$ with state space $\{Z_0, \ldots, Z_m\}$ is a Markov chain if for $n = 0, 1, 2, \ldots$ and arbitrary $i, j, i_0, \ldots, i_{n-1} \in \{0, \ldots, m\}$,
$\Pr\{\xi_{n+1} = Z_j \mid (\xi_n = Z_i \cap \xi_{n-1} = Z_{i_{n-1}} \cap \ldots \cap \xi_0 = Z_{i_0})\} = \Pr\{\xi_{n+1} = Z_j \mid \xi_n = Z_i\} = p_{ij}(n)$.  (A7.68)
The quantities $p_{ij}(n)$ are the (one step) transition probabilities of the Markov chain. Investigation will be limited here to time-homogeneous Markov chains, for which the transition probabilities $p_{ij}(n)$ are independent of $n$
$p_{ij}(n) = \Pr\{\xi_{n+1} = Z_j \mid \xi_n = Z_i\} = p_{ij}$, $n = 0, 1, 2, \ldots$ .  (A7.69)
For simplicity, Markov chain will be used in the following as equivalent to time-homogeneous Markov chain. The probabilities $p_{ij}$ satisfy the relationships
$p_{ij} \ge 0$ and $\sum_{j=0}^{m} p_{ij} = 1$, $i, j \in \{0, \ldots, m\}$.  (A7.70)
A matrix with elements $p_{ij}$ as in Eq. (A7.70) is a stochastic matrix. The k-step transition probabilities are the elements of the kth power of the stochastic matrix with elements $p_{ij}$. For example, $k = 2$ leads to (Example A7.2)
$\Pr\{\xi_{n+2} = Z_j \mid \xi_n = Z_i\} = \sum_{k=0}^{m} \Pr\{(\xi_{n+2} = Z_j \cap \xi_{n+1} = Z_k) \mid \xi_n = Z_i\}$
$= \sum_{k=0}^{m} \Pr\{\xi_{n+1} = Z_k \mid \xi_n = Z_i\}\, \Pr\{\xi_{n+2} = Z_j \mid (\xi_n = Z_i \cap \xi_{n+1} = Z_k)\}$,
from which, considering the Markov property (A7.68),
$p_{ij}^{(2)} = \Pr\{\xi_{n+2} = Z_j \mid \xi_n = Z_i\} = \sum_{k=0}^{m} p_{ik}\, p_{kj}$.  (A7.71)
Results for $k > 2$ follow by induction.
+) $\xi_0, \xi_1, \ldots$ identify successive transitions (also in the same state if $p_{ii}(n) > 0$) without any relation to the time axis; this is important when considering embedded Markov chains in a stochastic process.
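Example A7.2
Assuming $\Pr\{C\} > 0$, prove that $\Pr\{(A \cap B) \mid C\} = \Pr\{B \mid C\}\, \Pr\{A \mid (B \cap C)\}$.
Solution
For $\Pr\{C\} > 0$ it follows that
$\Pr\{(A \cap B) \mid C\} = \dfrac{\Pr\{A \cap B \cap C\}}{\Pr\{C\}} = \dfrac{\Pr\{B \cap C\}\, \Pr\{A \mid (B \cap C)\}}{\Pr\{C\}} = \Pr\{B \mid C\}\, \Pr\{A \mid (B \cap C)\}$.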
The distribution law of a Markov chain is completely given by the initial distribution
$A_i = \Pr\{\xi_0 = Z_i\}$, $i = 0, \ldots, m$,  (A7.72)
with $\sum A_i = 1$, and the transition probabilities $p_{ij}$, since for every arbitrary $i_0, \ldots, i_n \in \{0, \ldots, m\}$,
$\Pr\{\xi_0 = Z_{i_0} \cap \xi_1 = Z_{i_1} \cap \ldots \cap \xi_n = Z_{i_n}\} = A_{i_0}\, p_{i_0 i_1} \cdots p_{i_{n-1} i_n}$
and thus, using the theorem of total probability (A6.17),
$\Pr\{\xi_n = Z_j\} = \sum_{i=0}^{m} A_i\, p_{ij}^{(n)}$, $n = 1, 2, \ldots$ .  (A7.73)
A Markov chain with transition probabilities $p_{ij}$ is stationary if and only if the state probabilities $\Pr\{\xi_n = Z_j\}$, $j = 0, \ldots, m$, are independent of $n$, i.e. (Eq. (A7.73) with $n = 1$) if the initial distribution $A_i$ (Eq. (A7.72)) is a solution ($p_j$) of the system
$p_j = \sum_{i=0}^{m} p_i\, p_{ij}$, with $p_j \ge 0$ and $\sum_{j=0}^{m} p_j = 1$, $j = 0, \ldots, m$.  (A7.74)
The system given by Eq. (A7.74) must be solved by canceling one (arbitrarily chosen) equation and replacing this by $\sum p_j = 1$. $p_0, \ldots, p_m$ from Eq. (A7.74) define the stationary distribution of the Markov chain with transition probabilities $p_{ij}$. A Markov chain with transition probabilities $p_{ij}$ is irreducible if every state can be reached from every other state, i.e. if for each $(i, j)$ there is an $n = n(i, j)$ such that
$p_{ij}^{(n)} > 0$, $i, j \in \{0, \ldots, m\}$, $n \ge 1$.  (A7.75)
It can be shown that the system (A7.74) possesses a unique solution with
$p_j > 0$ and $p_0 + p_1 + \ldots + p_m = 1$, $j = 0, \ldots, m$,  (A7.76)
only if the Markov chain is irreducible, see e.g. [A7.3, A7.13, A7.27, A7.29 (1968)].
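The solution procedure described for Eq. (A7.74), i.e. dropping one equation and replacing it by the normalization $\sum p_j = 1$, can be sketched as follows. The code is a minimal illustration in plain Python with a small Gauss-Jordan solver; the function names and the 3-state example matrix are assumptions made here, and an irreducible chain is assumed so that the linear system is regular.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def stationary_distribution(p):
    """Solve p_j = sum_i p_i p_ij with one equation replaced by sum_j p_j = 1."""
    n = len(p)
    # Row j encodes sum_i p_i (p_ij - delta_ij) = 0.
    a = [[p[i][j] - (1.0 if i == j else 0.0) for i in range(n)] for j in range(n)]
    b = [0.0] * n
    a[-1] = [1.0] * n     # the last equation is replaced by the normalization
    b[-1] = 1.0
    return solve_linear(a, b)
```

For the stochastic matrix with rows (0.9, 0.1, 0), (0.5, 0, 0.5), (0, 1, 0), the procedure gives $p_0 = 10/13$, $p_1 = 2/13$, $p_2 = 1/13$, which can be verified by inserting the values into Eq. (A7.74).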
A7.5.2 Markov Processes with Finitely Many States +)
A stochastic process $\xi(t)$ in continuous time with state space $\{Z_0, \ldots, Z_m\}$ is a Markov process if for $n = 0, 1, 2, \ldots$, arbitrary time points $t + a > t > t_n > \ldots > t_0 \ge 0$, and arbitrary $i, j, i_0, \ldots, i_n \in \{0, \ldots, m\}$,
$\Pr\{\xi(t+a) = Z_j \mid (\xi(t) = Z_i \cap \xi(t_n) = Z_{i_n} \cap \ldots \cap \xi(t_0) = Z_{i_0})\} = \Pr\{\xi(t+a) = Z_j \mid \xi(t) = Z_i\}$.  (A7.77)
$\xi(t)$ ($t \ge 0$) is a jump function, as visualized in Fig. A7.10. The conditional state probabilities in Eq. (A7.77) are the transition probabilities of the Markov process and they will be designated by $P_{ij}(t, t+a)$
$P_{ij}(t, t+a) = \Pr\{\xi(t+a) = Z_j \mid \xi(t) = Z_i\}$, $t \ge 0$, $a > 0$.  (A7.78)
Equations (A7.77) and (A7.78) give the probability that $\xi(t+a)$ will be $Z_j$ given that $\xi(t)$ was $Z_i$. Between $t$ and $t + a$ the Markov process can visit any other state (this is not the case in Eq. (A7.95), in which $Z_j$ is the next state visited after $Z_i$). The Markov process is time-homogeneous if
$P_{ij}(t, t+a) = P_{ij}(a)$, $t \ge 0$, $a > 0$.  (A7.79)
In the following only time-homogeneous Markov processes will be considered. For simplicity, Markov process will often be used as equivalent to time-homogeneous Markov process. For arbitrary $t > 0$ and $a > 0$, $P_{ij}(t + a)$ satisfy the Chapman-Kolmogorov equations
$P_{ij}(t + a) = \sum_{k=0}^{m} P_{ik}(t)\, P_{kj}(a)$, $i, j \in \{0, \ldots, m\}$,  (A7.80)
whose demonstration, for given fixed $i$ and $j$, is similar to that for $p_{ij}^{(2)}$ in Eq. (A7.71). Furthermore $P_{ij}(a)$ satisfy the conditions
$P_{ij}(a) \ge 0$ and $\sum_{j=0}^{m} P_{ij}(a) = 1$, $i = 0, \ldots, m$,  (A7.81)
and thus form a stochastic matrix. Together with the initial distribution
$P_i(0) = \Pr\{\xi(0) = Z_i\}$, $i = 0, \ldots, m$,  (A7.82)
the transition probabilities $P_{ij}(a)$ completely determine the distribution law of the Markov process. In particular, the state probabilities for $t > 0$
$P_j(t) = \Pr\{\xi(t) = Z_j\}$, $j = 0, \ldots, m$,  (A7.83)
can be obtained from
$P_j(t) = \sum_{i=0}^{m} P_i(0)\, P_{ij}(t)$.  (A7.84)
+) Continuous (parameter) Markov chain is often used in the literature. Using Markov process should help to avoid confusion with Markov chains embedded in stochastic processes (footnote on p. 458).
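Setting
$P_{ij}(0) = \delta_{ij}$, with $\delta_{ij} = 0$ for $i \ne j$ and $\delta_{ij} = 1$ for $i = j$,  (A7.85)
and assuming that the transition probabilities $P_{ij}(t)$ are continuous at $t = 0$, it can be shown that $P_{ij}(t)$ are also differentiable at $t = 0$. The limiting values
$\lim_{\delta t \downarrow 0} \dfrac{P_{ij}(\delta t)}{\delta t} = \rho_{ij}$ for $i \ne j$, and $\lim_{\delta t \downarrow 0} \dfrac{1 - P_{ii}(\delta t)}{\delta t} = \rho_i$,  (A7.86)
exist and satisfy
$\rho_i = \sum_{j=0,\, j \ne i}^{m} \rho_{ij}$, $i = 0, \ldots, m$.  (A7.87)
Equation (A7.86) can be written in the form
$P_{ij}(\delta t) = \rho_{ij}\, \delta t + o(\delta t)$ and $1 - P_{ii}(\delta t) = \rho_i\, \delta t + o(\delta t)$,  (A7.88)
where $o(\delta t)$ denotes a quantity having an order higher than that of $\delta t$, i.e.
$\lim_{\delta t \downarrow 0} \dfrac{o(\delta t)}{\delta t} = 0$.  (A7.89)
Considering for any $t \ge 0$
$P_{ij}(\delta t) = \Pr\{\xi(t + \delta t) = Z_j \mid \xi(t) = Z_i\}$,
the following useful interpretation for $\rho_{ij}$ and $\rho_i$ can be obtained for $\delta t \downarrow 0$ and arbitrary $t$
$\rho_{ij}\, \delta t \approx \Pr\{\text{jump from } Z_i \text{ to } Z_j \text{ in } (t, t+\delta t] \mid \xi(t) = Z_i\}$,
$\rho_i\, \delta t \approx \Pr\{\text{leave } Z_i \text{ in } (t, t+\delta t] \mid \xi(t) = Z_i\}$.  (A7.90)
It is thus reasonable to define $\rho_{ij}$ and $\rho_i$ as transition rates (for a Markov process, $\rho_{ij}$ plays a similar role to that of the transition probability $p_{ij}$ for a Markov chain).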
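Setting $a = \delta t$ in Eq. (A7.80) and considering Eqs. (A7.78) and (A7.79) yields
$P_{ij}(t + \delta t) = \sum_{k=0,\, k \ne j}^{m} P_{ik}(t)\, P_{kj}(\delta t) + P_{ij}(t)\, P_{jj}(\delta t)$
and then, taking into account Eq. (A7.86), it follows that
$\dot{P}_{ij}(t) = -P_{ij}(t)\, \rho_j + \sum_{k=0,\, k \ne j}^{m} P_{ik}(t)\, \rho_{kj}$, $i, j \in \{0, \ldots, m\}$.  (A7.91)
Equations (A7.91) are Kolmogorov's forward equations. With initial conditions $P_{ij}(0) = \delta_{ij}$ as in Eq. (A7.85), they have a unique solution which satisfies Eq. (A7.81). In other words, the transition rates according to Eq. (A7.86) or Eq. (A7.90) uniquely determine the transition probabilities $P_{ij}(t)$. Similarly as for Eq. (A7.91), it can be shown that $P_{ij}(t)$ also satisfy Kolmogorov's backward equations
$\dot{P}_{ij}(t) = -\rho_i\, P_{ij}(t) + \sum_{k=0,\, k \ne i}^{m} \rho_{ik}\, P_{kj}(t)$, $i, j \in \{0, \ldots, m\}$.  (A7.92)
Equations (A7.91) & (A7.92) are also known as Chapman-Kolmogorov equations. They can be written in matrix form $\dot{P} = P \Lambda$ & $\dot{P} = \Lambda P$ and have the formal solution $P(t) = e^{\Lambda t} P(0)$.
The following description of the time-homogeneous Markov process with initial distribution $P_i(0)$ and transition rates $\rho_{ij}$, $i, j \in \{0, \ldots, m\}$, provides a better insight into the structure of a Markov process as a pure jump process (Fig. A7.10). It is the basis for investigations of Markov processes by means of integral equations (Section A7.5.3.2), and is the motivation for the introduction of semi-Markov processes (Section A7.6). Let $\xi_0, \xi_1, \ldots$ be a sequence of random variables taking values in $\{Z_0, \ldots, Z_m\}$ denoting the states successively occupied and $\eta_0, \eta_1, \ldots$ a sequence of positive random variables denoting the stay (sojourn) times between two consecutive state transitions. Define
$\mathcal{P}_{ij} = \dfrac{\rho_{ij}}{\rho_i}$, $i \ne j$, and $\mathcal{P}_{ii} = 0$, $i, j \in \{0, \ldots, m\}$,  (A7.93)
and assume furthermore that
$\Pr\{\xi_0 = Z_i\} = P_i(0)$, $i = 0, \ldots, m$,  (A7.94)
and, for $n = 0, 1, 2, \ldots$, arbitrary $i, j, i_0, \ldots, i_{n-1} \in \{0, \ldots, m\}$, and arbitrary $x, x_0, \ldots, x_{n-1} > 0$,
$\Pr\{(\xi_{n+1} = Z_j \cap \eta_n \le x) \mid (\xi_n = Z_i \cap \eta_{n-1} = x_{n-1} \cap \ldots \cap \xi_1 = Z_{i_1} \cap \eta_0 = x_0 \cap \xi_0 = Z_{i_0})\} = \Pr\{(\xi_{n+1} = Z_j \cap \eta_n \le x) \mid \xi_n = Z_i\} = Q_{ij}(x) = \mathcal{P}_{ij}\,(1 - e^{-\rho_i x})$.  (A7.95)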
In Eq. (A7.95), as well as in Eq. (A7.158), $Z_j$ is the next state visited after $Z_i$ (this is not the case in Eq. (A7.77), see also the remark with Eq. (A7.106)). $Q_{ij}(x)$ is thus defined only for $j \ne i$. $\xi_0, \xi_1, \ldots$ is a Markov chain, with an initial distribution
$P_i(0) = \Pr\{\xi_0 = Z_i\}$, $i = 0, \ldots, m$,
and transition probabilities
$\mathcal{P}_{ij} = \Pr\{\xi_{n+1} = Z_j \mid \xi_n = Z_i\}$, with $\mathcal{P}_{ii} = 0$,
embedded in the original process. From Eq. (A7.95), it follows that (Example A7.2)
$F_{ij}(x) = \Pr\{\eta_n \le x \mid (\xi_n = Z_i \cap \xi_{n+1} = Z_j)\} = 1 - e^{-\rho_i x}$.
$Q_{ij}(x)$ is a semi-Markov transition probability and will as such be introduced and discussed in Section A7.6. Now, define
$S_0 = 0$, $S_n = \eta_0 + \ldots + \eta_{n-1}$, $n = 1, 2, \ldots$,
and
$\xi(t) = \xi_n$ for $S_n \le t < S_{n+1}$.
From Eq. (A7.98) and the memoryless property of the exponential distribution (Eq. (A6.87)) it follows that $\xi(t)$, $t \ge 0$ is a Markov process with initial distribution
$P_i(0) = \Pr\{\xi(0) = Z_i\}$
and transition rates
$\rho_{ij} = \lim_{\delta t \downarrow 0} \dfrac{1}{\delta t} \Pr\{\text{jump from } Z_i \text{ to } Z_j \text{ in } (t, t+\delta t] \mid \xi(t) = Z_i\}$, $j \ne i$,
and
$\rho_i = \lim_{\delta t \downarrow 0} \dfrac{1}{\delta t} \Pr\{\text{leave } Z_i \text{ in } (t, t+\delta t] \mid \xi(t) = Z_i\} = \sum_{j=0,\, j \ne i}^{m} \rho_{ij}$.
The evolution of a time-homogeneous Markov process with transition rates $\rho_{ij}$ and $\rho_i$ can thus be described in the following way [A7.2 (1974 ETH)]:
If at $t = 0$ the process enters the state $Z_i$, i.e. $\xi_0 = Z_i$, then the next state to be entered, say $Z_j$ ($j \ne i$), is selected according to the probability $\mathcal{P}_{ij} \ge 0$ ($\mathcal{P}_{ii} = 0$), and the stay (sojourn) time in $Z_i$ is a random variable with distribution function $\Pr\{\eta_0 \le x \mid (\xi_0 = Z_i \cap \xi_1 = Z_j)\} = 1 - e^{-\rho_i x}$;
as the process enters $Z_j$, the next state to be entered, say $Z_k$ ($k \ne j$), will be selected with probability $\mathcal{P}_{jk} \ge 0$ ($\mathcal{P}_{jj} = 0$) and the stay (sojourn) time $\eta_1$ in $Z_j$ will be distributed according to
$\Pr\{\eta_1 \le x \mid (\xi_1 = Z_j \cap \xi_2 = Z_k)\} = 1 - e^{-\rho_j x}$,
etc.
The sequence $\xi_n$, $n = 0, 1, \ldots$ of the states successively occupied by the process is that of the Markov chain embedded in $\xi(t)$, the so-called embedded Markov chain. The random variable $\eta_n$ is the stay (sojourn) time of the process in the state defined by $\xi_n$. From the above description it becomes clear that each state $Z_i$, $i = 0, \ldots, m$, is a regeneration state. In practical applications, the following technique can be used to determine the quantities $Q_{ij}(x)$, $\mathcal{P}_{ij}$, and $F_{ij}(x)$ in Eq. (A7.95) [A7.2 (1985)]:
If the process enters the state $Z_i$ at an arbitrary time, say at $t = 0$, then a set of independent random times $\tau_{ij} > 0$, $j \ne i$, begin ($\tau_{ij}$ is the stay (sojourn) time in $Z_i$ with the next jump to $Z_j$); the process will then jump to $Z_j$ at the time $x$ if $\tau_{ij} = x$ and $\tau_{ik} > \tau_{ij}$ for (all) $k \ne j$.
In this interpretation, the quantities $Q_{ij}(x)$, $\mathcal{P}_{ij}$, and $F_{ij}(x)$ are given by
$Q_{ij}(x) = \Pr\{\tau_{ij} \le x \cap \tau_{ik} > \tau_{ij},\ k \ne j\}$, with $Q_{ij}(0) = 0$,  (A7.99)
$\mathcal{P}_{ij} = \Pr\{\tau_{ik} > \tau_{ij},\ k \ne j\}$, with $\mathcal{P}_{ii} = 0$,  (A7.100)
$F_{ij}(x) = \Pr\{\tau_{ij} \le x \mid \tau_{ik} > \tau_{ij},\ k \ne j\}$, with $F_{ij}(0) = 0$.  (A7.101)
Assuming for the time-homogeneous Markov process (memoryless property)
$\Pr\{\tau_{ij} \le x\} = 1 - e^{-\rho_{ij} x}$, $j \ne i$,  (A7.102)
one obtains, as in Eq. (A7.95),
$Q_{ij}(x) = \dfrac{\rho_{ij}}{\rho_i}\,(1 - e^{-\rho_i x})$, $j \ne i$,
$\mathcal{P}_{ij} = \dfrac{\rho_{ij}}{\rho_i} = Q_{ij}(\infty)$ for $j \ne i$, $\rho_i = \sum_{j=0,\, j \ne i}^{m} \rho_{ij}$, $\mathcal{P}_{ii} = 0$.  (A7.103)
It should be emphasized that due to the memoryless property of the time-homogeneous Markov process, there is no difference whether the process enters $Z_i$ at $t = 0$ or it is already there. However, this is not true for semi-Markov processes (Eq. (A7.158)).
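The interpretation with competing random times can be checked by simulation. The following sketch assumes independent, exponentially distributed times $\tau_{ij}$ with rates $\rho_{ij}$; the process jumps to the state whose time expires first. The estimated jump frequencies should approach $\rho_{ij}/\rho_i$ and the mean stay time should approach $1/\rho_i$. The dictionary-based representation of the rates and all names are illustrative assumptions.

```python
import random


def next_jump(rates, rng):
    """rates maps each reachable state j to rho_ij > 0.

    Returns (next state, stay time in Z_i) by letting independent
    exponential times compete; the smallest one determines the jump.
    """
    times = {j: rng.expovariate(r) for j, r in rates.items()}
    j = min(times, key=times.get)
    return j, times[j]


def estimate(rates, runs, rng):
    """Estimate the jump probabilities and the mean stay time in Z_i."""
    hits = {j: 0 for j in rates}
    total_time = 0.0
    for _ in range(runs):
        j, x = next_jump(rates, rng)
        hits[j] += 1
        total_time += x
    return {j: h / runs for j, h in hits.items()}, total_time / runs
```

With `rates = {1: 0.5, 2: 1.5}`, i.e. $\rho_i = 2$, the estimates should be close to 0.25 and 0.75 for the jump probabilities and close to 0.5 for the mean stay time.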
Quite generally, a repairable system can be described by a time-homogeneous Markov process if and only if all random variables occurring (failure-free times and repair times) are independent and exponentially distributed. If some failure-free times or repair times of elements are Erlang distributed (Appendix A6.10.3), the time evolution of the system can be described by a time-homogeneous Markov process with an appropriate extension of the state space (Fig. 6.6). A powerful tool when investigating time-homogeneous Markov processes is the diagram of transition probabilities in $(t, t+\delta t]$, where $\delta t \to 0$ ($\delta t > 0$, i.e. $\delta t \downarrow 0$) and $t$ is an arbitrary time point (e.g. $t = 0$). This diagram is a directed graph with nodes labeled by states $Z_i$, $i = 0, \ldots, m$, and arcs labeled by transition probabilities $P_{ij}(\delta t)$, where terms of order $o(\delta t)$ are omitted. It is related to the state transition diagram of the system involved, takes care of particular assumptions (such as repair priority, change of failure or repair rates at a state change, etc.), and has in general more than $2^n$ states, if $n$ elements in the reliability block diagram are involved (see for instance Fig. A7.6 and Section 6.7). Taking into account the properties of the random variables $\tau_{ij}$, introduced with Eq. (A7.99), it follows that for $\delta t \downarrow 0$

$$\Pr\{(\xi(\delta t) = Z_j \,\cap\, \text{only one jump in } (0, \delta t]) \mid \xi(0) = Z_i\} = \rho_{ij}\,\delta t + o(\delta t)$$

and

$$\Pr\{(\xi(\delta t) = Z_j \,\cap\, \text{more than one jump in } (0, \delta t]) \mid \xi(0) = Z_i\} = o(\delta t). \tag{A7.106}$$
From this,

$$P_{ij}(\delta t) = \rho_{ij}\,\delta t + o(\delta t), \quad j \ne i, \qquad \text{and} \qquad P_{ii}(\delta t) = 1 - \rho_i\,\delta t + o(\delta t),$$
as with Eq. (A7.88). Although for $\delta t \downarrow 0$ it holds that $P_{ij}(\delta t) = Q_{ij}(\delta t)$, the meanings of $P_{ij}(\delta t)$ as in Eq. (A7.79) or Eq. (A7.78) and $Q_{ij}(\delta t)$ as in Eq. (A7.95) or Eq. (A7.158) are basically different. With $Q_{ij}(x)$, $Z_j$ is the next state visited after $Z_i$; this is not the case for $P_{ij}(x)$. Examples A7.3 to A7.5 give the diagram of transition probabilities in $(t, t+\delta t]$ for some typical structures for reliability applications. The states in which the system is down are hatched. In state $Z_0$ all elements are up (operating or in reserve state). +)

Example A7.3 Figure A7.4 shows several possibilities for a 1-out-of-2 redundancy. The difference with respect to the number of repair crews appears when leaving the states $Z_2$ and $Z_3$. Cases b) and c) are identical when two repair crews are available.

+) The memoryless property, characterizing the (time-homogeneous) Markov processes, is satisfied in all diagrams of Fig. A7.4 and in all similar diagrams given in this book. Assuming e.g. that at a given time $t$ the system of Fig. A7.4b left is in state $Z_4$, the development after $t$ is independent of how many times before $t$ the system has oscillated between $Z_2$ and $Z_0$ or $Z_2$, $Z_0$, $Z_1$, $Z_3$.
[Figure A7.4: diagrams for a 1-out-of-2 redundancy, left column with one repair crew, right column with two repair crews. Distribution of failure-free times: operating state $F(t) = 1 - e^{-\lambda t}$, reserve state $F(t) = 1 - e^{-\lambda_r t}$; distribution of repair time: $G(t) = 1 - e^{-\mu t}$.]

Figure A7.4 Diagram of transition probabilities in $(t, t+\delta t]$ for a repairable 1-out-of-2 redundancy (constant failure rates $\lambda$, $\lambda_r$ and repair rate $\mu$): a) Warm redundancy with $E_1 = E_2$ ($\lambda_r = \lambda \Rightarrow$ active redundancy, $\lambda_r = 0 \Rightarrow$ standby redundancy); b) Active redundancy with $E_1 \ne E_2$; c) Active redundancy with $E_1 \ne E_2$ and repair priority on $E_1$ ($t$ arbitrary, $\delta t \downarrow 0$, Markov process)
Example A7.4 Figure A7.5 shows two cases of a k-out-of-n active redundancy with two repair crews. In the first case, the system operates up to the failure of all elements (with reduced performance from state $Z_{n-k+1}$). In the second case no further failures can occur when the system is down.

Example A7.5 Figure A7.6 shows a series/parallel structure consisting of the series connection (in the reliability sense) of a 1-out-of-2 active redundancy, with elements $E_2$ and $E_3$, and a switching element $E_1$. The system has only one repair crew. Since one of the redundant elements $E_2$ or $E_3$ can be down without having a system failure, in cases a) and b) the repair of element $E_1$ is given first priority. This means that if a failure of $E_1$ occurs during a repair of $E_2$ or $E_3$, the repair is stopped and $E_1$ will be repaired. In cases c) and d) the repair priority on $E_1$ has been dropped.
[Figure A7.5: k-out-of-n (active) redundancy with $E_1 = E_2 = \ldots = E_n = E$. Distribution of failure-free operating times: $F(t) = 1 - e^{-\lambda t}$; of repair times: $G(t) = 1 - e^{-\mu t}$.
a) $\nu_i = (n-i)\lambda$ and $\rho_{i(i+1)} = \nu_i$ for $i = 0, 1, \ldots, n-1$; $\rho_{10} = \mu$; $\rho_{i(i-1)} = 2\mu$ for $i = 2, 3, \ldots, n$
b) $\nu_i = (n-i)\lambda$ and $\rho_{i(i+1)} = \nu_i$ for $i = 0, 1, \ldots, n-k$; $\rho_{10} = \mu$; $\rho_{i(i-1)} = 2\mu$ for $i = 2, 3, \ldots, n-k+1$]

Figure A7.5 Diagram of transition probabilities in $(t, t+\delta t]$ for a repairable k-out-of-n active redundancy with two repair crews (constant failure rate $\lambda$ and repair rate $\mu$): a) The system operates up to the failure of the last element; b) No further failures at system down ($t$ arbitrary, $\delta t \downarrow 0$, Markov process; in a k-out-of-n redundancy the system is up if at least $k$ elements are operating)
[Figure A7.6: series/parallel structure with a 1-out-of-2 (active) redundancy, $E_2 = E_3 = E$. Distribution of failure-free times: $F(t) = 1 - e^{-\lambda t}$ for $E$, $F(t) = 1 - e^{-\lambda_1 t}$ for $E_1$; of repair times: $G(t) = 1 - e^{-\mu t}$ for $E$, $G(t) = 1 - e^{-\mu_1 t}$ for $E_1$.
a) Repair priority on $E_1$
b) As a), but no further failures at system down
c) No repair priority (i.e. repair as per first-in first-out)
d) As c), but no further failures at system down]

Figure A7.6 Diagram of transition probabilities in $(t, t+\delta t]$ for a repairable series/parallel structure with $E_2 = E_3 = E$ and one repair crew: a) Repair priority on $E_1$ and the system operates up to the failure of the last element; b) Repair priority on $E_1$ and at system failure no further failures can occur; c) and d) as a) and b), respectively, but without repair priority on $E_1$ (constant failure rates $\lambda$, $\lambda_1$ and repair rates $\mu$, $\mu_1$; $t$ arbitrary; $\delta t \downarrow 0$, Markov process)
A7.5.3 State Probabilities and Stay Times (Sojourn Times) in a Given Class of States

In reliability theory, two important quantities are the state probabilities and the distribution function of the stay (sojourn) times in the set of system up states. The state probabilities allow calculation of the point availability. The reliability function can be obtained from the distribution function of the stay time in the set of system up states. Furthermore, a combination of these quantities allows for time-homogeneous Markov processes a simple calculation of the interval reliability. It is useful in such an analysis to subdivide the system state space into two complementary sets $U$ and $\bar{U}$

$$U = \text{set of the system up states (up states at system level)}$$
$$\bar{U} = \text{set of the system down states (down states at system level)}. \tag{A7.107}$$
Partition of the state space in more than two classes is possible, see e.g. [A7.28]. Calculation of state probabilities and stay (sojourn) times can be carried out for Markov processes using the method of differential equations or of integral equations.
A7.5.3.1 Method of Differential Equations
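The method of differential equations is the classical one used in investigating Markov processes. It is based on the diagram of transition probabilities in $(t, t+\delta t]$. Consider a time-homogeneous Markov process $\xi(t)$ with arbitrary initial distribution $P_i(0) = \Pr\{\xi(0) = Z_i\}$ and transition rates $\rho_{ij}$ and $\rho_i$. The state probabilities defined by Eq. (A7.83)

$$P_j(t) = \Pr\{\xi(t) = Z_j\}, \quad j = 0, \ldots, m,$$

satisfy the system of differential equations

$$\dot{P}_j(t) = -\rho_j P_j(t) + \sum_{i=0,\, i \ne j}^{m} P_i(t)\,\rho_{ij}, \qquad \rho_j = \sum_{i=0,\, i \ne j}^{m} \rho_{ji}, \quad j = 0, \ldots, m. \tag{A7.108}$$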
The proof of Eq. (A7.108) is similar to that of Eq. (A7.91), see also Example A7.6. The point availability $PA_S(t)$, for arbitrary initial conditions at $t = 0$, follows then from

$$PA_S(t) = \Pr\{\xi(t) \in U\} = \sum_{Z_j \in U} P_j(t). \tag{A7.109}$$
In reliability analysis, particular initial conditions are often of interest. Assuming

$$P_i(0) = 1 \quad \text{and} \quad P_j(0) = 0 \ \text{ for } j \ne i, \tag{A7.110}$$
i.e. that the system is in $Z_i$ at $t = 0$ (usually in state $Z_0$ denoting "all elements are up"), the state probabilities $P_j(t)$ are the transition probabilities $P_{ij}(t)$ defined by Eqs. (A7.78) & (A7.79) and can be obtained as

$$P_{ij}(t) \equiv P_j(t), \tag{A7.111}$$

with $P_j(t)$ as the solution of Eq. (A7.108) with initial conditions as in Eq. (A7.110), or of Eq. (A7.92). The point availability, now designated with $PA_{Si}(t)$, is then given by

$$PA_{Si}(t) = \Pr\{\xi(t) \in U \mid \xi(0) = Z_i\} = \sum_{Z_j \in U} P_{ij}(t), \quad i = 0, \ldots, m. \tag{A7.112}$$

$PA_{Si}(t)$ is the probability that the system is in one of the up states at $t$, given it was in $Z_i$ at $t = 0$. Example A7.6 illustrates the calculation of the point availability for a 1-out-of-2 active redundancy.
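Example A7.6 Assume a 1-out-of-2 active redundancy, consisting of 2 identical elements $E_1 = E_2 = E$ with constant failure rate $\lambda$ and repair rate $\mu$, and only one repair crew. Give the state probabilities of the involved Markov process ($E_1$ and $E_2$ are new at $t = 0$).

Solution
Figure A7.7 shows the diagram of transition probabilities in $(t, t+\delta t]$ for the investigation of the point availability. Because of the memoryless property of the involved Markov process, Fig. A7.7 and Eqs. (A7.83) & (A7.90) lead to (by omitting the terms in $o(\delta t)$, as per Eq. (A7.89))

$$P_0(t+\delta t) = P_0(t)\,(1 - 2\lambda\,\delta t) + P_1(t)\,\mu\,\delta t$$
$$P_1(t+\delta t) = P_1(t)\,(1 - (\lambda+\mu)\,\delta t) + P_0(t)\,2\lambda\,\delta t + P_2(t)\,\mu\,\delta t$$
$$P_2(t+\delta t) = P_2(t)\,(1 - \mu\,\delta t) + P_1(t)\,\lambda\,\delta t,$$

and then, as $\delta t \downarrow 0$,

$$\dot{P}_0(t) = -2\lambda\,P_0(t) + \mu\,P_1(t)$$
$$\dot{P}_1(t) = -(\lambda+\mu)\,P_1(t) + 2\lambda\,P_0(t) + \mu\,P_2(t)$$
$$\dot{P}_2(t) = -\mu\,P_2(t) + \lambda\,P_1(t). \tag{A7.113}$$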
Equation (A7.113) also follows from Eq. (A7.108) with the $\rho_{ij}$ from Fig. A7.7. The solution of Eq. (A7.113) with given initial conditions at $t = 0$, e.g. $P_0(0) = 1$, $P_1(0) = P_2(0) = 0$, leads to the state probabilities $P_0(t)$, $P_1(t)$, and $P_2(t)$, and then to the point availability according to Eqs. (A7.111) and (A7.112) with $i = 0$ (see also Example A7.9 and Table 6.2 for the solution).
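The system (A7.113) can also be integrated numerically. The sketch below is only an illustration, not the closed-form solution referred to above: it assumes a classical fourth-order Runge-Kutta scheme, hypothetical rates $\lambda = 0.01$ and $\mu = 1$ (arbitrary time unit), and compares the point availability at a late time with the limiting value given in Example A7.9.

```python
def derivatives(p, lam, mu):
    """Right-hand side of Eq. (A7.113) for the states Z0, Z1, Z2."""
    p0, p1, p2 = p
    return (-2 * lam * p0 + mu * p1,
            -(lam + mu) * p1 + 2 * lam * p0 + mu * p2,
            -mu * p2 + lam * p1)

def rk4_step(p, h, lam, mu):
    """One classical Runge-Kutta step of size h."""
    def shifted(k, c):
        return tuple(x + c * h * dx for x, dx in zip(p, k))
    k1 = derivatives(p, lam, mu)
    k2 = derivatives(shifted(k1, 0.5), lam, mu)
    k3 = derivatives(shifted(k2, 0.5), lam, mu)
    k4 = derivatives(shifted(k3, 1.0), lam, mu)
    return tuple(x + h / 6 * (a + 2 * b + 2 * c + d)
                 for x, a, b, c, d in zip(p, k1, k2, k3, k4))

def state_probabilities(t_end, lam, mu, steps=10_000):
    """P0(t_end), P1(t_end), P2(t_end) for P0(0) = 1 (both elements new at t = 0)."""
    p = (1.0, 0.0, 0.0)
    h = t_end / steps
    for _ in range(steps):
        p = rk4_step(p, h, lam, mu)
    return p

lam, mu = 0.01, 1.0          # hypothetical rates, illustrative only
p0, p1, p2 = state_probabilities(20.0, lam, mu)
pa = p0 + p1                 # point availability PA_S0(t), Eq. (A7.112) with U = {Z0, Z1}
pa_limit = (mu**2 + 2 * lam * mu) / (2 * lam * (lam + mu) + mu**2)
print(f"PA_S0(20) = {pa:.8f}, limiting value = {pa_limit:.8f}")
```

For these rates the transient decays at a rate of about $\mu$, so that at $t = 20/\mu$ the point availability is already very close to its limiting value.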
Figure A7.7 Diagram of the transition probabilities in $(t, t+\delta t]$ for availability calculation of a 1-out-of-2 active redundancy with $E_1 = E_2 = E$, constant failure rate $\lambda$ and constant repair rate $\mu$, one repair crew ($t$ arbitrary, $\delta t \downarrow 0$, Markov process with $\rho_{01} = 2\lambda$, $\rho_{10} = \mu$, $\rho_{12} = \lambda$, $\rho_{21} = \mu$, $\rho_0 = 2\lambda$, $\rho_1 = \lambda + \mu$, $\rho_2 = \mu$)
A further important quantity for reliability analyses is the reliability function $R_S(t)$, i.e. the probability of no system failure in $(0, t]$. $R_S(t)$ can be calculated using the method of differential equations if all states in $\bar{U}$ are declared to be absorbing states. This means that the process will never leave $Z_k$ if it jumps into a state $Z_k \in \bar{U}$. It is not difficult to see that in this case the events {first system failure occurs before $t$} and {system is in one of the states $\bar{U}$ at $t$} are equivalent, so that the sum of the probabilities to be in one of the states in $U$ is the required reliability function, i.e. the probability that up to the time $t$ the process has never left the set of up states $U$. To make this analysis rigorous, consider the modified Markov process $\xi'(t)$ with transition probabilities $P'_{ij}(t)$ and transition rates

$$\rho'_{ij} = \rho_{ij} \ \text{ if } Z_i \in U, \qquad \rho'_{ij} = 0 \ \text{ if } Z_i \in \bar{U}, \qquad \rho'_i = \sum_{j=0,\, j \ne i}^{m} \rho'_{ij}. \tag{A7.114}$$
The state probabilities $P'_j(t)$ of $\xi'(t)$ satisfy the following system of differential equations (see Example A7.7 for an application)

$$\dot{P}'_j(t) = -\rho'_j P'_j(t) + \sum_{i=0,\, i \ne j}^{m} P'_i(t)\,\rho'_{ij}, \qquad \rho'_j = \sum_{i=0,\, i \ne j}^{m} \rho'_{ji}, \quad j = 0, \ldots, m. \tag{A7.115}$$
Assuming as initial conditions $P'_i(0) = 1$ and $P'_j(0) = 0$ for $j \ne i$ (with $Z_i \in U$), the solution of Eq. (A7.115) leads to the state probabilities $P'_j(t)$ and from these to the transition probabilities

$$P'_{ij}(t) \equiv P'_j(t). \tag{A7.116}$$

The reliability function $R_{Si}(t)$ is then given by

$$R_{Si}(t) = \Pr\{\xi(x) \in U \text{ for } 0 < x \le t \mid \xi(0) = Z_i\} = \sum_{Z_j \in U} P'_{ij}(t), \quad Z_i \in U. \tag{A7.117}$$
The probabilities marked with a prime (e.g. $P'_i(t)$) are reserved for reliability calculation, when using the method of differential equations. This should avoid confusion with the corresponding quantities for the point availability. Example A7.7 illustrates the calculation of the reliability function for a 1-out-of-2 active redundancy.

Example A7.7 Give the reliability function for the same case as in Example A7.6, i.e. the probability that the system has not left the states $Z_0$ and $Z_1$ up to time $t$.

Solution
The diagram of transition probabilities in $(t, t+\delta t]$ of Fig. A7.7 is modified as in Fig. A7.8 by making the down state $Z_2$ absorbing. For the state probabilities it follows that (see Example A7.6)

$$\dot{P}'_0(t) = -2\lambda\,P'_0(t) + \mu\,P'_1(t)$$
$$\dot{P}'_1(t) = -(\lambda+\mu)\,P'_1(t) + 2\lambda\,P'_0(t)$$
$$\dot{P}'_2(t) = \lambda\,P'_1(t). \tag{A7.118}$$

The solution of Eq. (A7.118) with the given initial conditions at $t = 0$ ($P'_0(0) = 1$, $P'_1(0) = P'_2(0) = 0$) leads to the state probabilities $P'_0(t)$, $P'_1(t)$ and $P'_2(t)$, and then to the transition probabilities and to the reliability function according to Eqs. (A7.116) and (A7.117), respectively (the primed state probabilities should avoid confusion with the solution given by Eq. (A7.113)).
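As a numerical illustration of the absorbing-state technique, the following sketch integrates the first two equations of the system for the 1-out-of-2 active redundancy with the down state $Z_2$ made absorbing, forms $R_{S0}(t) = P'_0(t) + P'_1(t)$, and integrates $R_{S0}(t)$ by the trapezoidal rule. The rates $\lambda = 0.5$ and $\mu = 2$ are hypothetical and chosen only so that the integral converges quickly; the result is compared with the closed form $MTTF_{S0} = (3\lambda + \mu)/(2\lambda^2)$ given in Example A7.9.

```python
def reliability_curve(lam, mu, t_end, steps):
    """Integrate the absorbing-state equations with RK4; return (h, [R_S0(t_k)])."""
    def f(p):
        p0, p1 = p  # P'_2(t) = 1 - P'_0(t) - P'_1(t) need not be tracked
        return (-2 * lam * p0 + mu * p1, -(lam + mu) * p1 + 2 * lam * p0)

    h = t_end / steps
    p = (1.0, 0.0)              # system starts in Z0
    curve = [1.0]
    for _ in range(steps):
        k1 = f(p)
        k2 = f((p[0] + 0.5 * h * k1[0], p[1] + 0.5 * h * k1[1]))
        k3 = f((p[0] + 0.5 * h * k2[0], p[1] + 0.5 * h * k2[1]))
        k4 = f((p[0] + h * k3[0], p[1] + h * k3[1]))
        p = tuple(p[i] + h / 6 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i])
                  for i in range(2))
        curve.append(p[0] + p[1])   # R_S0(t) = sum over the up states
    return h, curve

def trapezoid(h, values):
    """Trapezoidal rule on an equidistant grid."""
    return h * (sum(values) - 0.5 * (values[0] + values[-1]))

lam, mu = 0.5, 2.0              # hypothetical rates, illustrative only
h, curve = reliability_curve(lam, mu, t_end=200.0, steps=20_000)
mttf_numeric = trapezoid(h, curve)
mttf_closed = (3 * lam + mu) / (2 * lam**2)
print(f"MTTF_S0: numeric {mttf_numeric:.6f}, closed form {mttf_closed:.6f}")
```

The upper integration limit is a truncation of the infinite integral; it is adequate here because $R_{S0}(t)$ is negligible at $t = 200$ for the assumed rates.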
Equations (A7.112) and (A7.117) can be combined to determine the probability that the process is in an up state (set $U$) at $t$ and does not leave the set $U$ in the time interval $[t, t+\theta]$, given $\xi(0) = Z_i$. This quantity is the interval reliability $IR_{Si}(t, t+\theta)$. Due to the memoryless property of the involved Markov process,

$$IR_{Si}(t, t+\theta) = \Pr\{\xi(x) \in U \text{ for } t \le x \le t+\theta \mid \xi(0) = Z_i\} = \sum_{Z_j \in U} P_{ij}(t)\, R_{Sj}(\theta), \tag{A7.119}$$

with $i = 0, 1, \ldots, m$ and $P_{ij}(t)$ as given in Eq. (A7.111).
Figure A7.8 Diagram of the transition probabilities in $(t, t+\delta t]$ for the reliability function of a 1-out-of-2 active redundancy with $E_1 = E_2 = E$, constant failure rate $\lambda$ and constant repair rate $\mu$, one repair crew ($t$ arbitrary, $\delta t \downarrow 0$, Markov process with $\rho_{01} = 2\lambda$, $\rho_{10} = \mu$, $\rho_{12} = \lambda$; $\rho_0 = 2\lambda$, $\rho_1 = \lambda + \mu$, $\rho_2 = 0$)
A7.5.3.2 Method of Integral Equations
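The method of integral equations is based on the representation of the (time-homogeneous) Markov process $\xi(t)$ as a pure jump process by means of $\xi_n$ and $\eta_n$ as introduced in Appendix A7.5.2 (Eq. (A7.95), Fig. A7.10). From the memoryless property it uses only the fact that jump points (in a new state) are regeneration points of $\xi(t)$. The transition probabilities $P_{ij}(t) = \Pr\{\xi(t) = Z_j \mid \xi(0) = Z_i\}$ can be obtained by solving the following system of integral equations

$$P_{ij}(t) = \delta_{ij}\, e^{-\rho_i t} + \sum_{k=0,\, k \ne i}^{m} \int_0^t \rho_{ik}\, e^{-\rho_i x}\, P_{kj}(t-x)\, dx, \quad i, j \in \{0, \ldots, m\}, \tag{A7.120}$$

with $\rho_i = \sum_{j \ne i} \rho_{ij}$, $\delta_{ij} = 0$ for $j \ne i$, $\delta_{ii} = 1$. To prove Eq. (A7.120), consider that

$$P_{ij}(t) = \Pr\{(\xi(t) = Z_j \,\cap\, \text{no jumps in } (0, t]) \mid \xi(0) = Z_i\} + \sum_{k=0,\, k \ne i}^{m} \Pr\{(\xi(t) = Z_j \,\cap\, \text{first jump in } (0, t] \text{ in } Z_k) \mid \xi(0) = Z_i\}. \tag{A7.121}$$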
The first term of Eq. (A7.121) only holds for $j = i$ and gives the probability that the process will not leave the state $Z_i$ ($e^{-\rho_i t} = \Pr\{\tau_{ij} > t \text{ for all } j \ne i\}$ according to the interpretation given by Eqs. (A7.99) - (A7.104)). The second term holds for any $j \ne i$; it gives the probability that the process will move first from $Z_i$ to $Z_k$ and takes into account that the occurrence of $Z_k$ is a regeneration point. According to Eq. (A7.95), $\Pr\{\xi_1 = Z_k \,\cap\, \eta_0 \le x \mid \xi(0) = Z_i\} = Q_{ik}(x) = p_{ik}\,(1 - e^{-\rho_i x})$ and $\Pr\{\xi(t) = Z_j \mid (\xi_0 = Z_i \,\cap\, \eta_0 = x \,\cap\, \xi_1 = Z_k)\} = P_{kj}(t - x)$. Equation (A7.120) then follows from the theorem of total probability (Eq. (A6.17)). In the same way as for Eq. (A7.121), it can be shown that the reliability function $R_{Si}(t)$, as defined in Eq. (A7.117), satisfies the following system of integral equations

$$R_{Si}(t) = e^{-\rho_i t} + \sum_{Z_j \in U,\, j \ne i} \int_0^t \rho_{ij}\, e^{-\rho_i x}\, R_{Sj}(t-x)\, dx, \qquad \rho_i = \sum_{j=0,\, j \ne i}^{m} \rho_{ij}, \quad Z_i \in U. \tag{A7.122}$$

Point availability $PA_{Si}(t)$ and interval reliability $IR_{Si}(t, t+\theta)$ are given by Eqs. (A7.112) & (A7.119), with $P_{ij}(t)$ per Eq. (A7.120). The use of integral equations for $PA_{Si}(t)$ can lead to mistakes, since $R_{Si}(t)$ and $PA_{Si}(t)$ describe two different situations (summing for $PA_{Si}(t)$ over all states $j \in \{0, \ldots, m\}$ leads to $PA_{Si}(t) = 1$).
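The systems of integral equations (A7.120) and (A7.122) can be solved using Laplace transforms. Referring to Appendix A9.7,

$$\tilde{P}_{ij}(s) = \frac{\delta_{ij}}{s + \rho_i} + \sum_{k=0,\, k \ne i}^{m} \frac{\rho_{ik}}{s + \rho_i}\, \tilde{P}_{kj}(s), \quad i, j \in \{0, \ldots, m\}, \tag{A7.123}$$

and

$$\tilde{R}_{Si}(s) = \frac{1}{s + \rho_i} + \sum_{Z_j \in U,\, j \ne i} \frac{\rho_{ij}}{s + \rho_i}\, \tilde{R}_{Sj}(s), \quad Z_i \in U. \tag{A7.124}$$

A direct advantage of the method based on integral equations appears in the calculation of the mean stay (sojourn) time in the up states. Denoting by $MTTF_{Si}$ the system mean time to failure, provided the system is in state $Z_i \in U$ at $t = 0$, it follows that (Eq. (A6.38), Appendix A9.7)

$$MTTF_{Si} = \int_0^\infty R_{Si}(t)\, dt = \tilde{R}_{Si}(0). \tag{A7.125}$$

Thus, according to Eq. (A7.124), $MTTF_{Si}$ satisfies the following system of algebraic equations (see Example A7.9 for an application)

$$MTTF_{Si} = \frac{1}{\rho_i} + \sum_{Z_j \in U,\, j \ne i} \frac{\rho_{ij}}{\rho_i}\, MTTF_{Sj}, \qquad \rho_i = \sum_{j=0,\, j \ne i}^{m} \rho_{ij}, \quad Z_i \in U. \tag{A7.126}$$

A7.5.3.3 Stationary State and Asymptotic Behavior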
The determination of time-dependent state probabilities or of the point availability of a system whose elements have constant failure and repair rates is still possible using differential or integral equations. However, it can become time-consuming. The situation is easier where the state probabilities are independent of time, i.e. when the process involved is stationary (the system of differential or integral equations reduces to a system of algebraic equations):
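A time-homogeneous Markov process $\xi(t)$ with states $Z_0, \ldots, Z_m$ is stationary, if its state probabilities $P_i(t) = \Pr\{\xi(t) = Z_i\}$, $i = 0, \ldots, m$, do not depend on $t$. This can be seen from the following relationship

$$\Pr\{\xi(t_1) = Z_{i_1} \cap \ldots \cap \xi(t_n) = Z_{i_n}\} = \Pr\{\xi(t_1) = Z_{i_1}\}\, P_{i_1 i_2}(t_2 - t_1) \cdots P_{i_{n-1} i_n}(t_n - t_{n-1}),$$

which, according to the Markov property (Eq. (A7.77)), must be valid for arbitrary $t_1 < \ldots < t_n$ and $i_1, \ldots, i_n \in \{0, \ldots, m\}$. For any $a > 0$ this leads to

$$\Pr\{\xi(t_1) = Z_{i_1} \cap \ldots \cap \xi(t_n) = Z_{i_n}\} = \Pr\{\xi(t_1 + a) = Z_{i_1} \cap \ldots \cap \xi(t_n + a) = Z_{i_n}\}.$$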
From $P_i(t + a) = P_i(t)$ it follows $P_i(t) = P_i(0) = P_i$, and in particular $\dot{P}_i(t) = 0$. Consequently, the process $\xi(t)$ is stationary (in steady-state) if and only if its initial distribution $P_i = P_i(0) = \Pr\{\xi(0) = Z_i\}$, $i = 0, \ldots, m$, satisfies for $t > 0$ the system (Eq. (A7.108))

$$\rho_j P_j = \sum_{i=0,\, i \ne j}^{m} P_i\, \rho_{ij}, \quad \text{with } P_j \ge 0, \quad \sum_{j=0}^{m} P_j = 1, \quad \rho_j = \sum_{i=0,\, i \ne j}^{m} \rho_{ji}, \quad j = 0, \ldots, m. \tag{A7.127}$$
The system of Eq. (A7.127) must be solved by replacing one (arbitrarily chosen) equation by $\sum P_j = 1$. Every solution of Eq. (A7.127) with $P_j \ge 0$, $j = 0, \ldots, m$, is a stationary initial distribution of the Markov process. Equation (A7.127) expresses that Pr{to come out from state $Z_j$} = Pr{to come in state $Z_j$}, also known as generalized cut sets theorem. A Markov process is irreducible if for every pair $i, j \in \{0, \ldots, m\}$ there exists a $t$ such that $P_{ij}(t) > 0$, i.e. if every state can be reached from every other state. It can be shown that if $P_{ij}(t_0) > 0$ for some $t_0 > 0$, then $P_{ij}(t) > 0$ for any $t > 0$. A Markov process is irreducible if and only if its embedded Markov chain is irreducible. For an irreducible Markov process, there exist quantities $P_j > 0$, $j = 0, \ldots, m$, with $P_0 + \ldots + P_m = 1$, such that independently of the initial condition $P_i(0)$ the following holds (Markov theorem, see e.g. [A6.6 (Vol. I)])

$$\lim_{t \to \infty} P_j(t) = P_j > 0, \quad j = 0, \ldots, m. \tag{A7.128}$$

For any $i = 0, \ldots, m$ it follows then that

$$\lim_{t \to \infty} P_{ij}(t) = P_j > 0, \quad j = 0, \ldots, m. \tag{A7.129}$$
The set of values $P_0, \ldots, P_m$ from Eq. (A7.128) is the limiting distribution of the Markov process. From Eqs. (A7.74) and (A7.129) it follows that for an irreducible Markov process the limiting distribution is the only stationary distribution, i.e. the only solution of Eq. (A7.127) with $P_j > 0$, $j = 0, \ldots, m$. Further important results follow from Eqs. (A7.174) - (A7.180). In particular the initial distribution in stationary state (Eq. (A7.181)), the frequency of consecutive occurrences of a given state (Eq. (A7.182)), and the relation between stationary values $P_j$ from Eq. (A7.127) and $\mathcal{P}_j$ for the embedded Markov chain (Eq. (A7.74)) given by

$$P_j = \frac{\mathcal{P}_j / \rho_j}{\sum_{k=0}^{m} \mathcal{P}_k / \rho_k}. \tag{A7.130}$$
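The rule of replacing one balance equation by the normalization condition can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the linear system is solved by a small Gauss-Jordan routine in exact rational arithmetic, and the rates are those of the 1-out-of-2 active redundancy of Fig. A7.7 with the illustrative values $\lambda = 1/100$, $\mu = 1$. The second function inverts the relation between the stationary probabilities of the process and those of the embedded Markov chain (weights proportional to $P_j \rho_j$).

```python
from fractions import Fraction

def solve_linear(a, b):
    """Gauss-Jordan elimination in exact rational arithmetic (assumes a unique solution)."""
    n = len(b)
    m = [[Fraction(x) for x in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]

def stationary_distribution(rho):
    """Solve the balance equations; rho[i][j] is the transition rate from Z_i to Z_j."""
    n = len(rho)
    rho_out = [sum(rho[j][i] for i in range(n) if i != j) for j in range(n)]
    # Equation j: -rho_j P_j + sum_{i != j} P_i rho_ij = 0
    a = [[-rho_out[j] if i == j else rho[i][j] for i in range(n)] for j in range(n)]
    b = [0] * n
    a[-1], b[-1] = [1] * n, 1   # replace one equation by P_0 + ... + P_m = 1
    return solve_linear(a, b)

def embedded_chain_distribution(rho, P):
    """Stationary distribution of the embedded chain, proportional to P_j * rho_j."""
    n = len(rho)
    weights = [P[j] * sum(rho[j][i] for i in range(n) if i != j) for j in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

lam, mu = Fraction(1, 100), Fraction(1)   # illustrative values only
rho = [[0, 2 * lam, 0],
       [mu, 0, lam],
       [0, mu, 0]]
P = stationary_distribution(rho)
chain = embedded_chain_distribution(rho, P)
print("P     =", [str(x) for x in P])
print("chain =", [str(x) for x in chain])
```

Which equation is replaced does not matter for an irreducible process; the last one is used here only for convenience.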
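From the results given by Eqs. (A7.127) - (A7.129), the asymptotic & steady-state value of the point availability $PA_S$ is given by

$$\lim_{t \to \infty} PA_{Si}(t) = PA_S = \sum_{Z_j \in U} P_j, \quad i = 0, \ldots, m. \tag{A7.131}$$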
If $K$ is a subset of $\{Z_0, \ldots, Z_m\}$, the Markov process is irreducible, and $P_0, \ldots, P_m$ are the limiting probabilities obtained from Eq. (A7.127), then

$$\Pr\Big\{ \lim_{t \to \infty} \frac{\text{total sojourn time in states } Z_j \in K \text{ in } (0, t]}{t} = \sum_{Z_j \in K} P_j \Big\} = 1 \tag{A7.132}$$

irrespective of the initial distribution $P_0(0), \ldots, P_m(0)$. From Eq. (A7.132) it follows

$$\Pr\Big\{ \lim_{t \to \infty} \frac{\text{total operating time in } (0, t]}{t} = \sum_{Z_j \in U} P_j = PA_S \Big\} = 1.$$
The average availability of the system can be expressed as (see Eq. (6.24))

$$AA_{Si}(t) = \frac{1}{t}\, \mathrm{E}[\text{total operating time in } (0, t] \mid \xi(0) = Z_i] = \frac{1}{t} \int_0^t PA_{Si}(x)\, dx. \tag{A7.133}$$

The above considerations lead to (for any $Z_i \in U$)

$$\lim_{t \to \infty} AA_{Si}(t) = AA_S = PA_S = \sum_{Z_j \in U} P_j. \tag{A7.134}$$

Expressions of the form $\sum k P_k$ are useful in practical applications, e.g. for cost optimizations. In reliability applications, irreducible Markov processes can be assumed for availability calculations. According to Eqs. (A7.127) and (A7.128), asymptotic & steady-state is used, for such cases, as a synonym for stationary.
A7.5.4 Frequency / Duration and Reward Aspects

In some applications, it is important to consider the frequency with which failures at system level occur and the mean duration (expected value) of the system down time (or of the system operating time) in the stationary state. Also of interest is the investigation of fault tolerant systems for which a reconfiguration can take place after a failure, allowing continuation of operation with defined loss of performance (reward). Basic considerations on these aspects are given in this section.
A7.5.4.1 Frequency / Duration
To introduce the concept of frequency / duration let us consider the one-item structure discussed in Appendix A7.3 as application of the alternating renewal process. As in Appendix A7.3 assume an item (system) which alternates between operating state, with mean time to failure (mean up time) MTTF, and repair state, with complete renewal and mean repair time (mean down time) MTTR. In the stationary state, the frequency at which item failures $f_{ud}$ or item repairs (restorations) $f_{du}$ occur is given as (Eq. (A7.60))

$$f_{ud} = f_{du} = \frac{1}{MTTF + MTTR}. \tag{A7.135}$$
Furthermore, for the one-item structure, the mean up time MUT is

$$MUT = MTTF. \tag{A7.136}$$

Consequently, considering Eq. (A7.58) the basic relation

$$PA = \frac{MTTF}{MTTF + MTTR} = f_{ud} \cdot MUT \tag{A7.137}$$

can be established, where PA is the point availability (probability to be up) in the stationary state. Similarly, for the mean failure duration MDT one has

$$MDT = MTTR \tag{A7.138}$$

and thus

$$1 - PA = \frac{MTTR}{MTTF + MTTR} = f_{du} \cdot MDT. \tag{A7.139}$$

Constant failure rate $\lambda = 1/MTTF$ and repair (restoration) rate $\mu = 1/MTTR$ lead to

$$PA \cdot \lambda = (1 - PA) \cdot \mu = f_{ud} = f_{du}, \tag{A7.140}$$
which expresses the stationary property of time-homogeneous Markov processes, as particular case of Eq. (A7.127) with two states ($Z_0$, $Z_1$). For systems of arbitrary complexity with constant failure and repair (restoration) rates, described by time-homogeneous Markov processes (Appendix A7.5.2), it can be shown that the asymptotic & steady-state system failure frequency $f_{udS}$ and system mean up time $MUT_S$ are given as

$$f_{udS} = \sum_{Z_j \in U,\ Z_i \in \bar{U}} P_j\, \rho_{ji} = \sum_{Z_j \in U} P_j \Big( \sum_{Z_i \in \bar{U}} \rho_{ji} \Big) \tag{A7.141}$$

and

$$MUT_S = \Big( \sum_{Z_j \in U} P_j \Big) \Big/ f_{udS} = PA_S / f_{udS}, \tag{A7.142}$$

respectively. $U$ is the set of states considered as up states for the $f_{udS}$ and $MUT_S$ calculation, $\bar{U}$ the complement to the totality of states considered. $MUT_S$ is the mean of the time in which the system is moving in the set of up states $Z_j \in U$ before a transition in the set of down states $Z_i \in \bar{U}$ occurs in the stationary case or for $t \to \infty$. In Eq. (A7.141), all transition rates $\rho_{ji}$ leaving state $Z_j \in U$ toward $Z_i \in \bar{U}$ are considered (cumulated states). Similar results hold for semi-Markov processes. Equations (A7.141) and (A7.142) have a great intuitive appeal: (i) Because of the memoryless property of the (time-homogeneous) Markov processes, the asymptotic & steady-state probability to have a failure in $(t, t+\delta t]$ is $\sum_{Z_j \in U,\, Z_i \in \bar{U}} P_j \rho_{ji}\, \delta t = f_{udS}\, \delta t$. (ii) Defining $UT$ as the total up time in $(0, t)$ and $\nu(t)$ as number of failures in $(0, t)$, and considering for $t \to \infty$ the limits $UT / t \to PA_S$ and $\nu(t) / t \to f_{udS}$, it follows $UT / \nu(t) \to MUT_S = PA_S / f_{udS}$ for $t \to \infty$. The same results hold for the system repair (restoration) frequency $f_{duS}$ and system mean down time $MDT_S$ (mean repair (restoration) duration at system level), given as

$$f_{duS} = \sum_{Z_i \in \bar{U},\ Z_j \in U} P_i\, \rho_{ij} = \sum_{Z_i \in \bar{U}} P_i \Big( \sum_{Z_j \in U} \rho_{ij} \Big) \tag{A7.143}$$

and

$$MDT_S = \Big( \sum_{Z_i \in \bar{U}} P_i \Big) \Big/ f_{duS} = (1 - PA_S) / f_{duS}, \tag{A7.144}$$

respectively. $f_{duS}$ is the system failure intensity $z_S(t) = z_S$ as defined by Eq. (A7.230) in steady-state or for $t \to \infty$. Considering that each failure at system level is followed by a repair (restoration) at system level, one has $f_{udS} = f_{duS}$ and thus

$$f_{udS} = f_{duS} = \sum_{Z_j \in U,\ Z_i \in \bar{U}} P_j\, \rho_{ji} = \sum_{Z_i \in \bar{U},\ Z_j \in U} P_i\, \rho_{ij}. \tag{A7.145}$$

Equations (A7.142), (A7.144), and (A7.145) yield the following important relation between $MDT_S$ and $MUT_S$ (see also Eqs. (A7.137) - (A7.140))

$$MDT_S = MUT_S\, (1 - PA_S) / PA_S. \tag{A7.146}$$
Computation of the frequency of failures ($f_{duS}$) and mean failure duration ($MDT_S$) based on fault tree and corresponding minimal cut-sets (Sections 2.3.4, 2.6) is often used in power systems [6.22], where $f_f$, $d_f$ and $P_f$ appear for $f_{duS}$, $MDT_S$, and $1 - PA_S$. The central part of Eq. (A7.145) is known as theorem of cuts. Although appealing, $\sum_{Z_i \in U} P_i\, MTTF_{Si}$, with $MTTF_{Si}$ from Eq. (A7.126) and $P_i$ from Eq. (A7.127), cannot be used to calculate $MUT_S$ (Eqs. (A7.126) and (A7.127) describe two different situations, see the remark with Eq. (A7.122)).
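A small worked case may help to fix the frequency / duration quantities. The sketch below evaluates system failure frequency, repair frequency, mean up time and mean down time for the 1-out-of-2 active redundancy with one repair crew (rates of Fig. A7.7), using the stationary probabilities of the birth and death process and the illustrative values $\lambda = 1/100$, $\mu = 1$ (assumptions made only for this example).

```python
from fractions import Fraction

lam, mu = Fraction(1, 100), Fraction(1)      # illustrative values only

# Stationary probabilities of Z0, Z1, Z2, proportional to (1, 2*lam/mu, 2*lam^2/mu^2)
pi = [Fraction(1), 2 * lam / mu, 2 * lam**2 / mu**2]
P = [x / sum(pi) for x in pi]

# Transition rates rho[(from, to)] of the 1-out-of-2 active redundancy, one repair crew
rho = {(0, 1): 2 * lam, (1, 0): mu, (1, 2): lam, (2, 1): mu}
up, down = {0, 1}, {2}

PA = sum(P[j] for j in up)
# Failure frequency: probability flow from the up states into the down states
f_ud = sum(P[j] * r for (j, i), r in rho.items() if j in up and i in down)
# Repair frequency: probability flow from the down states into the up states
f_du = sum(P[i] * r for (i, j), r in rho.items() if i in down and j in up)
MUT = PA / f_ud
MDT = (1 - PA) / f_du

print("PA_S  =", PA)
print("f_udS =", f_ud, " f_duS =", f_du)
print("MUT_S =", MUT, " MDT_S =", MDT)
```

For these values $MUT_S = 5100$, whereas $MTTF_{S0} = (3\lambda + \mu)/(2\lambda^2) = 5150$. In this particular model every up period starts in $Z_1$, and $MUT_S$ coincides with $MTTF_{S1} = (2\lambda + \mu)/(2\lambda^2) = 5100$.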
A7.5.4.2 Reward
Complex fault tolerant systems have been conceived to be able to reconfigure themselves at the occurrence of a failure and continue operation, if necessary with reduced performance. Such a feature is important for many systems, e.g. production, information, and power systems, which should assure continuation of operation after a system failure. Besides fail-safe aspects, investigation of such systems is based on the superposition of performance behavior (often assumed deterministic) and stochastic dependability behavior (including reliability, maintainability, availability, and logistic support). A straightforward possibility is to assign to each state $Z_i$ of the dependability model a reward rate $r_i$ which takes care of the performance reduction in the state considered. From this, the expected (mean) instantaneous reward rate $MIR_S$ can be calculated in stationary state as

$$MIR_S = \sum_{i=0}^{m} r_i\, P_i, \tag{A7.147}$$

thereby, $r_i = 0$ for down states, $0 < r_i$
Other metrics, for instance reward impulses at state transition or the expected ratio of busy channels to jobs request, are possible (see e.g. [A7.15, 6.19 (1995), 6.26, 6.34]). The reward rate can be applied directly to differential equations. For the purpose of this book, application in Section 6.8.6.4 will be limited to Eq. (A7.147). $P_i$ in Eq. (A7.147) is the asymptotic & steady-state probability in state $Z_i$ (Eq. (A7.127)), giving also the expected percentage of time the system stays at the performance level specified by $Z_i$ (Eq. (A7.132)).
A7.5.5 Birth and Death Process

A birth and death process is a Markov process characterized by the property that transitions from a state $Z_i$ can only occur to state $Z_{i+1}$ or $Z_{i-1}$. In the time-homogeneous case, it is used to investigate k-out-of-n redundancies with identical elements and constant failure and repair rates during the stay (sojourn) time in any given state (not necessarily at state transitions, e.g. because of load sharing). The diagram of transition probabilities in $(t, t+\delta t]$ is given in Fig. A7.9. $\nu_i$ and $\theta_i$ are the transition rates from state $Z_i$ to $Z_{i+1}$ and $Z_i$ to $Z_{i-1}$, respectively (transitions outside neighboring states can occur in $(t, t+\delta t]$ only with probability $o(\delta t)$).

Figure A7.9 Diagram of transition probabilities in $(t, t+\delta t]$ for a birth and death process with $n+1$ states ($t$ arbitrary, $\delta t \downarrow 0$, Markov process)

The system of differential equations describing the birth and death process given in Fig. A7.9 is

$$\dot{P}_j(t) = -(\nu_j + \theta_j)\, P_j(t) + \nu_{j-1}\, P_{j-1}(t) + \theta_{j+1}\, P_{j+1}(t), \quad j = 0, \ldots, n, \quad \text{with } \theta_0 = \nu_{-1} = \nu_n = \theta_{n+1} = 0. \tag{A7.149}$$
The conditions $\nu_j > 0$ ($j = 0, \ldots, n-1$) and $\theta_j > 0$ ($j = 1, \ldots, n < \infty$) are sufficient for the existence of the limiting probabilities

$$\lim_{t \to \infty} P_j(t) = P_j, \quad \text{with } P_j > 0 \ \text{ and } \ \sum_{j=0}^{n} P_j = 1. \tag{A7.150}$$

It can be shown (Example A7.8) that the probabilities $P_j$, $j = 0, \ldots, n$, are given by

$$P_j = \pi_j\, P_0 = \pi_j \Big/ \sum_{i=0}^{n} \pi_i, \quad \text{with } \pi_i = \frac{\nu_0 \cdots \nu_{i-1}}{\theta_1 \cdots \theta_i} \ \text{ and } \ \pi_0 = 1. \tag{A7.151}$$

From Eq. (A7.151) one recognizes that

$$P_k\, \nu_k = P_{k+1}\, \theta_{k+1}, \quad k = 0, \ldots, n-1;$$

a balance relation of this kind holds quite generally for time-homogeneous Markov processes (Eq. (A7.127)).
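Example A7.8 Assuming Eq. (A7.150), prove Eq. (A7.151).

Solution
Considering Eqs. (A7.149) & (A7.150), the $P_j$ are the solution of the following system of algebraic equations

$$0 = -\nu_0 P_0 + \theta_1 P_1$$
$$0 = -(\nu_j + \theta_j)\, P_j + \nu_{j-1} P_{j-1} + \theta_{j+1} P_{j+1}, \quad j = 1, \ldots, n-1,$$
$$0 = -\theta_n P_n + \nu_{n-1} P_{n-1}.$$

From the first equation it follows $P_1 = P_0\, \nu_0 / \theta_1$. With this $P_1$, the second equation leads to

$$P_2 = \frac{\nu_1 + \theta_1}{\theta_2}\, P_1 - \frac{\nu_0}{\theta_2}\, P_0 = \Big( \frac{(\nu_1 + \theta_1)\, \nu_0}{\theta_2\, \theta_1} - \frac{\nu_0}{\theta_2} \Big) P_0 = \frac{\nu_0\, \nu_1}{\theta_1\, \theta_2}\, P_0.$$

Recursively one obtains

$$P_j = \frac{\nu_0 \cdots \nu_{j-1}}{\theta_1 \cdots \theta_j}\, P_0 = \pi_j\, P_0, \quad j = 1, \ldots, n, \quad \pi_0 = 1.$$

Considering $P_0 + \ldots + P_n = 1$, $P_0$ follows and then Eq. (A7.151).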
The values of $P_j$ given by Eq. (A7.151) can be used in Eq. (A7.134) to calculate the stationary (asymptotic & steady-state) value of the point availability. The system mean time to failure follows from Eq. (A7.126). Examples A7.9 and A7.10 are applications of the birth and death process.
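The product form of the stationary probabilities of a birth and death process translates directly into a few lines of code. The sketch below is an illustration only: the function name is hypothetical, and the rates correspond to a structure with three identical elements and two repair crews (as in Fig. A7.5a with $n = 3$), with the assumed values $\lambda = 1/10$ and $\mu = 1$.

```python
from fractions import Fraction

def birth_death_stationary(nu, theta):
    """Stationary probabilities P_0..P_n of a birth and death process.

    nu[i]    = transition rate from Z_i to Z_{i+1}, i = 0..n-1
    theta[i] = transition rate from Z_{i+1} to Z_i, i = 0..n-1
    """
    pi = [Fraction(1)]                       # pi_0 = 1
    for v, th in zip(nu, theta):
        pi.append(pi[-1] * Fraction(v) / Fraction(th))
    total = sum(pi)
    return [x / total for x in pi]

lam, mu = Fraction(1, 10), Fraction(1)       # illustrative values only
nu = [3 * lam, 2 * lam, lam]                 # nu_0, nu_1, nu_2
theta = [mu, 2 * mu, 2 * mu]                 # theta_1, theta_2, theta_3
P = birth_death_stationary(nu, theta)
print([str(x) for x in P])
```

Exact fractions are used so that the balance relation between neighboring states can be verified without rounding effects.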
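Example A7.9 For the 1-out-of-2 active redundancy with one repair crew of Examples A7.6 and A7.7, i.e. for $\nu_0 = 2\lambda$, $\nu_1 = \lambda$, $\theta_1 = \theta_2 = \mu$, $U = \{Z_0, Z_1\}$ and $\bar{U} = \{Z_2\}$, give the asymptotic & steady-state value $PA_S$ of the point availability and the mean times to failure $MTTF_{S0}$ and $MTTF_{S1}$.

Solution
The asymptotic & steady-state value of the point availability is given by Eqs. (A7.134) and (A7.151)

$$PA_S = P_0 + P_1 = \frac{1 + 2\lambda/\mu}{1 + 2\lambda/\mu + 2\lambda^2/\mu^2} = \frac{\mu^2 + 2\lambda\mu}{2\lambda(\lambda + \mu) + \mu^2}.$$

The system's mean time to failure follows from Eq. (A7.126) with $\rho_{01} = \rho_0 = 2\lambda$, $\rho_{12} = \lambda$, $\rho_{10} = \mu$, $\rho_1 = \lambda + \mu$,

$$MTTF_{S0} = \frac{1}{2\lambda} + MTTF_{S1}, \qquad MTTF_{S1} = \frac{1}{\lambda + \mu} + \frac{\mu}{\lambda + \mu}\, MTTF_{S0},$$

yielding

$$MTTF_{S0} = \frac{3\lambda + \mu}{2\lambda^2} \qquad \text{and} \qquad MTTF_{S1} = \frac{2\lambda + \mu}{2\lambda^2}.$$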
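For larger state spaces the algebraic system for the mean times to failure is usually solved numerically. The following sketch uses Gauss-Seidel sweeps, a technique chosen here only for illustration (any linear solver would do); the function name is hypothetical and the rates $\lambda = 0.5$, $\mu = 2$ are assumed values. For the 1-out-of-2 active redundancy the iteration reproduces the closed-form results of this example.

```python
def mttf_by_iteration(rho, up, tol=1e-12, max_sweeps=100_000):
    """Mean times to failure from each up state, by Gauss-Seidel sweeps.

    rho[i][j] = transition rate from Z_i to Z_j; up = list of up-state indices.
    Each sweep applies MTTF_i = (1 + sum_{j in up, j != i} rho_ij * MTTF_j) / rho_i.
    """
    n = len(rho)
    rho_out = {i: sum(rho[i][j] for j in range(n) if j != i) for i in up}
    mttf = {i: 0.0 for i in up}
    for _ in range(max_sweeps):
        change = 0.0
        for i in up:
            new = (1.0 + sum(rho[i][j] * mttf[j] for j in up if j != i)) / rho_out[i]
            change = max(change, abs(new - mttf[i]))
            mttf[i] = new
        if change < tol:
            return mttf
    raise RuntimeError("no convergence")

lam, mu = 0.5, 2.0                       # illustrative values only
rho = [[0.0, 2 * lam, 0.0],
       [mu, 0.0, lam],
       [0.0, mu, 0.0]]
mttf = mttf_by_iteration(rho, up=[0, 1])
print(f"MTTF_S0 = {mttf[0]:.6f}  (closed form {(3 * lam + mu) / (2 * lam**2):.6f})")
print(f"MTTF_S1 = {mttf[1]:.6f}  (closed form {(2 * lam + mu) / (2 * lam**2):.6f})")
```

The iteration converges whenever a down state is reached with probability one from every up state; with highly reliable systems ($\lambda \ll \mu$) convergence becomes slow and a direct solver is preferable.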
Example A7.10 A computer system consists of 3 identical CPUs. Jobs arrive independently and the arrival times form a Poisson process with intensity $\lambda$. The duration of each individual job is distributed exponentially with parameter $\mu$. All jobs have the same memory requirements $D$. Give for $\lambda = 2\mu$ the minimum size $n$ of the memory required in units of $D$, so that in the stationary case (asymptotic & steady-state) a new job can immediately find storage space with a probability $\gamma$ of at least 95%. When overflow occurs, jobs are queued.
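Solution
The problem can be solved using the following birth and death process

[Diagram: states $Z_0, Z_1, Z_2, Z_3, \ldots$; transitions $Z_i \to Z_{i+1}$ with probability $\lambda\,\delta t$ for all $i \ge 0$; transitions $Z_i \to Z_{i-1}$ with probability $\mu\,\delta t$ for $i = 1$, $2\mu\,\delta t$ for $i = 2$, and $3\mu\,\delta t$ for $i \ge 3$; probabilities of remaining in the state $1 - \lambda\,\delta t$, $1 - (\lambda+\mu)\,\delta t$, $1 - (\lambda+2\mu)\,\delta t$, $1 - (\lambda+3\mu)\,\delta t$, ...]

In state $Z_i$, exactly $i$ memory units are occupied. $n$ is the smallest integer such that in the steady-state, $P_0 + \ldots + P_{n-1} = \gamma \ge 0.95$ (if the assumption were made that jobs are lost if overflow occurs, then the process would stop at state $Z_n$). For steady-state, Eq. (A7.127) yields

$$0 = -\lambda P_0 + \mu P_1$$
$$0 = \lambda P_0 - (\lambda + \mu)\, P_1 + 2\mu P_2$$
$$0 = \lambda P_1 - (\lambda + 2\mu)\, P_2 + 3\mu P_3$$
$$0 = \lambda P_2 - (\lambda + 3\mu)\, P_3 + 3\mu P_4$$
$$\ldots$$
$$0 = \lambda P_i - (\lambda + 3\mu)\, P_{i+1} + 3\mu P_{i+2}, \quad i > 2.$$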
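The solution leads to

$$P_1 = \frac{\lambda}{\mu}\, P_0 \qquad \text{and} \qquad P_i = \frac{\lambda^i}{2 \cdot 3^{i-2}\, \mu^i}\, P_0 = \frac{9}{2} \Big( \frac{\lambda}{3\mu} \Big)^i P_0 \quad \text{for } i \ge 2.$$

Assuming $\lim_{n \to \infty} \sum_{i=0}^{n} P_i = 1$ and considering $\lambda / 3\mu < 1$ it follows that

$$P_0 \Big[ 1 + \frac{\lambda}{\mu} + \sum_{i=2}^{\infty} \frac{9}{2} \Big( \frac{\lambda}{3\mu} \Big)^i \Big] = P_0 \Big[ 1 + \frac{\lambda}{\mu} + \frac{3 (\lambda/\mu)^2}{2\,(3 - \lambda/\mu)} \Big] = 1,$$

from which

$$P_0 = \frac{2\,(3 - \lambda/\mu)}{6 + 4\lambda/\mu + (\lambda/\mu)^2}.$$

The size of the memory $n$ can now be determined from

$$\frac{2\,(3 - \lambda/\mu)}{6 + 4\lambda/\mu + (\lambda/\mu)^2} \Big[ 1 + \frac{\lambda}{\mu} + \sum_{i=2}^{n-1} \frac{9}{2} \Big( \frac{\lambda}{3\mu} \Big)^i \Big] \ge \gamma.$$

For $\lambda / \mu = 2$ and $\gamma = 0.95$, the smallest $n$ satisfying the above equation is $n = 9$ ($P_0 = 1/9$, $P_1 = 2/9$, $P_i = 2^{i-1} / 3^i$ for $i \ge 2$).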
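The search for the smallest $n$ can be reproduced with a short computation. The sketch below (function name hypothetical) evaluates the stationary probabilities of this example in exact rational arithmetic, with $a = \lambda/\mu$ as the only model parameter, and accumulates $P_0 + \ldots + P_{n-1}$ until the required probability $\gamma$ is reached.

```python
from fractions import Fraction

def min_memory_units(a, gamma):
    """Smallest n with P_0 + ... + P_{n-1} >= gamma, for a = lambda/mu and 3 CPUs."""
    if not 0 < a < 3:
        raise ValueError("a steady state requires 0 < lambda/mu < 3")
    p0 = 2 * (3 - a) / (6 + 4 * a + a**2)

    def p(i):
        if i == 0:
            return p0
        if i == 1:
            return a * p0
        return Fraction(9, 2) * (a / 3) ** i * p0

    n, cumulative = 0, Fraction(0)
    while cumulative < gamma:
        cumulative += p(n)      # after this step: cumulative = P_0 + ... + P_n
        n += 1                  # n now counts the number of terms summed
    return n, cumulative

n, reached = min_memory_units(Fraction(2), Fraction(95, 100))
print(f"n = {n}, P_0 + ... + P_{n - 1} = {reached} = {float(reached):.4f}")
```

For $\lambda/\mu = 2$ the cumulative probability is about 0.941 with eight terms and about 0.961 with nine, which confirms $n = 9$.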
As shown by Examples A7.9 and A7.10, reliability applications of birth and death processes identify $\nu_i$ as failure rates and $\theta_i$ as repair rates. In this case,

$$\nu_j \ll \theta_{j+1}, \quad j = 0, \ldots, n-1,$$

as in Fig. A7.9. Assuming

$$\frac{\nu_j}{\theta_{j+1}} \le r, \quad 0 < r < 1, \quad j = 0, \ldots, n-1, \tag{A7.155}$$

the following relationships for the steady-state probability $P_j$ can be obtained (Example A7.11)

$$P_j \ge \frac{1 - r}{r\,(1 - r^{\,n-j})} \sum_{i=j+1}^{n} P_i, \quad 0 < r < 1, \quad j = 0, \ldots, n-1, \quad n > j. \tag{A7.156}$$

For $r \le 1/2$ it follows that

$$P_j \ge \sum_{i=j+1}^{n} P_i, \quad j = 0, \ldots, n-1. \tag{A7.157}$$

Equation (A7.157) states that for $2\nu_j \le \theta_{j+1}$ the steady-state probability in a state $Z_j$ of a birth and death process described by Fig. A7.9 is greater than or equal to the sum of the steady-state probabilities in all states following $Z_j$, $j = 0, \ldots, n-1$ [2.50 (1992)]. This relationship is useful in developing approximate expressions for system availability.
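Example A7.11 Assuming Eq. (A7.155), prove Eqs. (A7.156) and (A7.157).

Solution
Using Eqs. (A7.150) and (A7.151),

$$\frac{\sum_{i=j+1}^{n} P_i}{P_j} = \sum_{i=j+1}^{n} \frac{\pi_i}{\pi_j} = \frac{\nu_j}{\theta_{j+1}} + \frac{\nu_j\, \nu_{j+1}}{\theta_{j+1}\, \theta_{j+2}} + \ldots + \frac{\nu_j \cdots \nu_{n-1}}{\theta_{j+1} \cdots \theta_n}.$$

Setting $\nu_i / \theta_{i+1} \le r$ for $0 < r < 1$ and $i = j, j+1, \ldots, n-1$, it follows that

$$\frac{\sum_{i=j+1}^{n} P_i}{P_j} \le r + r^2 + \ldots + r^{\,n-j} = \frac{r\,(1 - r^{\,n-j})}{1 - r},$$

and thus Eq. (A7.156). Furthermore, for $r \le 1/2$ it holds that

$$\frac{\sum_{i=j+1}^{n} P_i}{P_j} \le 1 - (1/2)^{\,n-j} \le 1,$$

and hence Eq. (A7.157).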
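The two bounds can be checked numerically on a concrete birth and death process. The sketch below is an illustration under assumed rates (three identical elements, two repair crews, $\lambda = 1/10$, $\mu = 1$; function names hypothetical): it takes $r$ as the largest ratio of a failure rate to the following repair rate and compares each $P_j$ with the bound on the sum of the probabilities of all following states.

```python
from fractions import Fraction

def stationary(nu, theta):
    """Stationary probabilities of a birth and death process (product form)."""
    pi = [Fraction(1)]
    for v, th in zip(nu, theta):
        pi.append(pi[-1] * v / th)
    total = sum(pi)
    return [x / total for x in pi]

def check_bounds(nu, theta):
    """Return r and, per state Z_j (j < n), the tuple (P_j, scaled tail bound, tail sum)."""
    n = len(nu)
    P = stationary(nu, theta)
    r = max(v / th for v, th in zip(nu, theta))      # largest ratio nu_j / theta_{j+1}
    rows = []
    for j in range(n):
        tail = sum(P[j + 1:])                        # sum of P_i for i = j+1..n
        factor = (1 - r) / (r * (1 - r ** (n - j)))
        rows.append((P[j], factor * tail, tail))
    return r, rows

lam, mu = Fraction(1, 10), Fraction(1)               # illustrative values only
nu = [3 * lam, 2 * lam, lam]
theta = [mu, 2 * mu, 2 * mu]
r, rows = check_bounds(nu, theta)
print("r =", r)
for j, (p_j, bound, tail) in enumerate(rows):
    print(f"j={j}: P_j={float(p_j):.5f}  bound={float(bound):.5f}  tail={float(tail):.5f}")
```

Since $r = 3/10 \le 1/2$ here, each $P_j$ also dominates the plain tail sum, in line with the second bound.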
A7.6 Semi-Markov Processes with Finitely Many States
The description of Markov processes given in Appendix A7.5.2 allows a straightforward generalization to semi-Markov processes. In a semi-Markov process, the sequence of consecutively occurring states forms an embedded (time-homogeneous) Markov chain, just as with Markov processes. The stay (sojourn) time in a given state $Z_i$ is a positive random variable $\tau_{ij}$ whose distribution depends on $Z_i$ and on the following state $Z_j$, but in contrast to Markov processes it is arbitrarily and not exponentially distributed. Related to semi-Markov processes are Markov renewal processes ($N_i(t)$ = number of transitions in state $Z_i$ during $(0, t]$) [A7.23]. To define semi-Markov processes, let $\xi_0, \xi_1, \ldots$ be the sequence of consecutively occurring states, i.e. a sequence of random variables taking values in $\{Z_0, \ldots, Z_m\}$, and $\eta_0, \eta_1, \ldots$ the stay (sojourn) times between consecutive states, i.e. a sequence of positive random variables. A stochastic process $\xi(t)$ with state space $\{Z_0, \ldots, Z_m\}$ is a semi-Markov process if for $n = 0, 1, 2, \ldots$, arbitrary $i, j, i_0, \ldots, i_{n-1} \in \{0, \ldots, m\}$, and arbitrary $x_0, \ldots, x_{n-1} > 0$,

$$\Pr\{(\xi_{n+1} = Z_j \cap \eta_n \le x) \mid (\xi_n = Z_i \cap \eta_{n-1} = x_{n-1} \cap \ldots \cap \xi_1 = Z_{i_1} \cap \eta_0 = x_0 \cap \xi_0 = Z_{i_0})\} = \Pr\{(\xi_{n+1} = Z_j \cap \eta_n \le x) \mid \xi_n = Z_i\} = Q_{ij}(x) \tag{A7.158}$$

holds. $\xi(t) = \xi_0$ for $0 \le t < \eta_0$ and $\xi(t) = \xi_n$ for $\eta_0 + \ldots + \eta_{n-1} \le t < \eta_0 + \ldots + \eta_n$ for $n \ge 1$ ($t \ge 0$) is a pure jump process, as visualized in Fig. A7.10.
Figure A7.10 Possible realization for a semi-Markov process ($x$ starts by 0 at each state change)
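The functions $Q_{ij}(x)$ in Eq. (A7.158), defined only for $j \ne i$, are the semi-Markov transition probabilities (see remarks with Eqs. (A7.93) - (A7.101)). Setting

$$Q_{ij}(\infty) = p_{ij} \tag{A7.159}$$

and, for $p_{ij} > 0$,

$$F_{ij}(x) = Q_{ij}(x) / p_{ij}, \tag{A7.160}$$

leads to

$$Q_{ij}(x) = p_{ij}\, F_{ij}(x), \quad j \ne i, \quad Q_{ij}(0) = 0, \tag{A7.161}$$

with (Example A7.2)

$$p_{ij} = \Pr\{\xi_{n+1} = Z_j \mid \xi_n = Z_i\}, \quad p_{ii} \equiv 0, \quad \sum_{j=0,\, j \ne i}^{m} p_{ij} = 1, \tag{A7.162}$$

and

$$F_{ij}(x) = \Pr\{\eta_n \le x \mid (\xi_n = Z_i \cap \xi_{n+1} = Z_j)\}, \quad j \ne i, \quad F_{ij}(0) = 0. \tag{A7.163}$$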
Since $p_{ii} = 0$ is mandatory for a semi-Markov process, $Q_{ii}(x)$ and $F_{ii}(x)$ can be arbitrary. From Eq. (A7.158), the consecutive jump points at which the process enters $Z_i$ are regeneration points. This holds for any $i \in \{0, \ldots, m\}$. Thus, all states of a semi-Markov process are regeneration states. The renewal density of the embedded renewal process of consecutive jumps in $Z_i$ (i-renewals) will be denoted as $h_i(t)$ (Eq. (A7.177)). The interpretation of the quantities $Q_{ij}(x)$ given by Eqs. (A7.99) - (A7.101) is useful for practical applications (see for instance Eqs. (A7.183) - (A7.186)).
The initial distribution, i.e. the distribution of the vector $(\xi_0 \equiv \xi(0), \xi_1, \eta_0)$, is given, for the general case, by
$$A^0_{ij}(x) = \Pr\{\xi(0) = Z_i \cap \xi_1 = Z_j \cap \text{residual sojourn time } (\eta_0) \text{ in } Z_i \le x\} = P_i(0)\, p_{ij}\, F^0_{ij}(x), \qquad (A7.164)$$
with $P_i(0) = \Pr\{\xi(0) = Z_i\}$, $p_{ij}$ according to Eq. (A7.162), and $F^0_{ij}(x) = \Pr\{\text{residual sojourn time in } Z_i \le x \mid (\xi(0) = Z_i \cap \xi_1 = Z_j)\}$. $\xi(0)$ is used here for clarity instead of $\xi_0$. The semi-Markov process is memoryless only at the transition points from one state to the other. To have the time $t = 0$ as a regeneration point, the initial condition $\xi(0) = Z_i$, sufficient for time-homogeneous Markov processes, must be reinforced for semi-Markov processes by "$Z_i$ is entered at $t = 0$".
The sequence $\xi_0, \xi_1, \ldots$ forms a Markov chain, embedded in the semi-Markov process, with transition probabilities $p_{ij}$ as per Eq. (A7.162) and initial probabilities $P_i(0)$, $i = 0, \ldots, m$. $F_{ij}(x)$ is the conditional distribution function of the stay (sojourn) time in $Z_i$ with consequent jump in $Z_j$ (next state to be visited). A semi-Markov process is a Markov process if and only if $F_{ij}(x) = 1 - e^{-\rho_i x}$, for $i, j \in \{0, \ldots, m\}$. An example of a two-state semi-Markov process is the alternating renewal process given in Appendix A7.3 ($Z_0$ = up, $Z_1$ = down, $p_{01} = p_{10} = 1$, $F_{01}(x) = F(x)$, $F_{10}(x) = G(x)$, $F^0_{01}(x) = F_A(x)$, $F^0_{10}(x) = G_A(x)$, $P_0(0) = p$, $P_1(0) = 1 - p$). In many applications, the quantities $Q_{ij}(x)$, or $p_{ij}$ and $F_{ij}(x)$, can be calculated using Eqs. (A7.99) - (A7.101), as shown in Appendix A7.7 and Sections 6.3 - 6.6. For the unconditional stay (sojourn) time in $Z_i$, the distribution function is
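$$Q_i(x) = \Pr\{\eta_n \le x \mid \xi_n = Z_i\} = \sum_{j=0,\, j \ne i}^{m} p_{ij} F_{ij}(x) = \sum_{j=0,\, j \ne i}^{m} Q_{ij}(x), \qquad (A7.165)$$
and the mean
$$T_i = \int_0^\infty (1 - Q_i(x))\, dx. \qquad (A7.166)$$
In the following it will be assumed that
$$q_{ij}(x) = dQ_{ij}(x) / dx \qquad (A7.167)$$
exists for all $i, j \in \{0, \ldots, m\}$. Consider first the case in which the process enters the state $Z_i$ at $t = 0$, i.e. that $P_i(0) = 1$ and $F^0_{ij}(x) = F_{ij}(x)$. The transition probabilities
$$P_{ij}(t) = \Pr\{\xi(t) = Z_j \mid Z_i \text{ is entered at } t = 0\} \qquad (A7.168)$$
can be obtained by generalizing Eq. (A7.120),
$$P_{ij}(t) = \delta_{ij}\,(1 - Q_i(t)) + \sum_{k=0,\, k \ne i}^{m} \int_0^t q_{ik}(x)\, P_{kj}(t - x)\, dx, \qquad (A7.169)$$
with $\delta_{ij}$ and $Q_i(t)$ per Eqs. (A7.85) & (A7.165). The state probabilities follow as
$$P_j(t) = \Pr\{\xi(t) = Z_j\} = \sum_{i=0}^{m} \Pr\{Z_i \text{ is entered at } t = 0\}\, P_{ij}(t), \qquad (A7.170)$$
with $P_j(t) \ge 0$ and $P_0(t) + \ldots + P_m(t) = 1$. If the state space is divided into the complementary sets $U$ for the up states and $\bar{U}$ for the down states, as in Eq. (A7.107), the point availability follows from Eq. (A7.112)
$$PA_{Si}(t) = \Pr\{\xi(t) \in U \mid Z_i \text{ is entered at } t = 0\} = \sum_{Z_j \in U} P_{ij}(t), \qquad i = 0, \ldots, m, \qquad (A7.171)$$
with $P_{ij}(t)$ as in Eq. (A7.169). The probability that the first transition from a state in $U$ to a state in $\bar{U}$ occurs after time $t$, i.e. the reliability function, can be obtained by generalizing the system of integral equations (A7.122),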
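$$R_{Si}(t) = 1 - Q_i(t) + \sum_{Z_j \in U,\, j \ne i} \int_0^t q_{ij}(x)\, R_{Sj}(t - x)\, dx, \qquad Z_i \in U, \qquad (A7.172)$$
with $Q_i(t)$ as in Eq. (A7.165). The mean of the stay (sojourn) time in $U$, i.e. the system mean time to failure, follows from Eq. (A7.172) as solution of the following system of algebraic equations (with $T_i$ as per Eq. (A7.166))
$$MTTF_{Si} = T_i + \sum_{Z_j \in U,\, j \ne i} p_{ij}\, MTTF_{Sj}, \qquad Z_i \in U. \qquad (A7.173)$$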
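As a numerical illustration of Eq. (A7.173), the following minimal Python sketch solves the algebraic system for the mean times to failure by fixed-point iteration. The three-state example (up states Z0 and Z1, down state Z2), its numerical values, and the function name `mttf_from_up_states` are assumptions made for illustration only; they do not come from the text.

```python
def mttf_from_up_states(T, p, up, tol=1e-12, max_iter=100_000):
    """Solve MTTF_i = T[i] + sum_{j in up, j != i} p[i][j] * MTTF_j  (Eq. (A7.173)).

    T  : mean sojourn times T_i (Eq. (A7.166)), indexed by state
    p  : transition probabilities p_ij of the embedded Markov chain (p_ii = 0)
    up : indices of the up states (set U)
    Plain fixed-point iteration; it converges when a down state can be
    reached from every up state (assumption of this sketch).
    """
    mttf = {i: T[i] for i in up}
    for _ in range(max_iter):
        new = {i: T[i] + sum(p[i][j] * mttf[j] for j in up if j != i) for i in up}
        if max(abs(new[i] - mttf[i]) for i in up) < tol:
            return new
        mttf = new
    raise RuntimeError("iteration did not converge")


# Hypothetical example: Z0 and Z1 are up, Z2 is down; from Z1 the process
# returns to Z0 with probability 0.9 and fails (Z2) with probability 0.1.
T = [10.0, 1.0, 5.0]
p = [[0.0, 1.0, 0.0],
     [0.9, 0.0, 0.1],
     [0.0, 1.0, 0.0]]
result = mttf_from_up_states(T, p, up=[0, 1])
```

For this small example the system can also be solved by hand: MTTF_S1 = 1 + 0.9 MTTF_S0 and MTTF_S0 = 10 + MTTF_S1, which the iteration reproduces. For larger state spaces a direct linear solver would normally be preferred over iteration.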
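Consider now the case of a stationary semi-Markov process. Under the assumption that the embedded Markov chain is irreducible (each state can be reached from every other state with probability > 0), the semi-Markov process is stationary if and only if the initial distribution (Eq. (A7.164)) is given by [A7.22, A7.23, A7.28]
$$A^0_{ij}(x) = \frac{\mathcal{P}_i\, p_{ij}}{\sum_{k=0}^{m} \mathcal{P}_k T_k} \int_0^x (1 - F_{ij}(y))\, dy. \qquad (A7.174)$$
In Eq. (A7.174), $p_{ij}$ are the transition probabilities (Eq. (A7.162)) and $\mathcal{P}_j$ the stationary distribution of the embedded Markov chain; $\mathcal{P}_j$ are the unique solutions of
$$\mathcal{P}_j = \sum_{i=0}^{m} \mathcal{P}_i\, p_{ij}, \qquad p_{ii} = 0, \quad \mathcal{P}_j > 0, \quad \sum_{j=0}^{m} \mathcal{P}_j = 1, \quad j = 0, \ldots, m. \qquad (A7.175)$$
The system given by Eq. (A7.175) must be solved by dropping one (arbitrarily chosen) equation and replacing this by $\sum \mathcal{P}_j = 1$. For the stationary semi-Markov process, the state probabilities are independent of time and given by
$$P_i(t) = P_i = \frac{T_i}{T_{ii}} = \frac{\mathcal{P}_i T_i}{\sum_{k=0}^{m} \mathcal{P}_k T_k}, \qquad i = 0, \ldots, m, \qquad (A7.176)$$
with $T_i$ per Eq. (A7.166) and $\mathcal{P}_i$ from Eq. (A7.175). $T_{ii}$ is the mean of the time interval between two consecutive occurrences of the state $Z_i$ (in steady-state). These time points form a stationary renewal process with renewal density
$$h_i(t) = h_i = \frac{1}{T_{ii}} = \frac{\mathcal{P}_i}{\sum_{k=0}^{m} \mathcal{P}_k T_k}, \qquad i = 0, \ldots, m. \qquad (A7.177)$$
$h_i$ is the frequency of successive occurrences of state $Z_i$. In Eq. (A7.176), $P_i$ can be (heuristically) interpreted as $P_i = \lim_{t \to \infty} [(t / T_{ii})\, T_i] / t = T_i / T_{ii}$ and as ratio of the mean time in which the embedded Markov chain is in state $Z_i$ to the mean time in all states, $P_i = \mathcal{P}_i T_i / \sum_k \mathcal{P}_k T_k$. A similar interpretation holds for $A^0_{ij}(x)$ in Eqs. (A7.174) & (A7.179). The stationary (asymptotic and steady-state) value of the point availability $PA_S$ and average availability $AA_S$ follows from Eq. (A7.176)
$$PA_S = AA_S = \sum_{Z_i \in U} P_i = \sum_{Z_i \in U} \frac{\mathcal{P}_i T_i}{\sum_{k=0}^{m} \mathcal{P}_k T_k}. \qquad (A7.178)$$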
Under the assumptions made above, i.e. continuous sojourn times with finite means and an irreducible embedded Markov chain, the following applies for $i = 0, \ldots, m$ regardless of the initial distribution at $t = 0$
$$\lim_{t \to \infty} \Pr\{\xi(t) = Z_i \cap \text{next transition in } Z_j \cap \text{residual sojourn time in } Z_i \le x\} = \frac{\mathcal{P}_i\, p_{ij}}{\sum_{k=0}^{m} \mathcal{P}_k T_k} \int_0^x (1 - F_{ij}(y))\, dy = A^0_{ij}(x), \qquad (A7.179)$$
and thus
$$\lim_{t \to \infty} \Pr\{\xi(t) = Z_i\} = P_i = \frac{T_i}{T_{ii}} \qquad \text{and} \qquad \lim_{t \to \infty} PA_{Si}(t) = PA_S = \sum_{Z_i \in U} P_i. \qquad (A7.180)$$
For reliability applications, irreducible semi-Markov processes can be assumed. According to Eqs. (A7.176) and (A7.180), asymptotic & steady-state is used, for such cases, as a synonym for stationary.
For the alternating renewal process (Appendix A7.3 with $Z_0$ = up, $Z_1$ = down, $T_0 = MTTF$, and $T_1 = MTTR$) it holds that $\mathcal{P}_0 = \mathcal{P}_1 = 1/2$ (embedded Markov chain) and $T_{00} = T_{11} = T_0 + T_1$. Eq. (A7.178) (or (A7.180)) leads to $PA_S = P_0 = T_0 / T_{00} = T_0 / (T_0 + T_1) = \mathcal{P}_0 T_0 / (\mathcal{P}_0 T_0 + \mathcal{P}_1 T_1)$. This example shows the basic difference between $\mathcal{P}_i$ as stationary distribution of the embedded Markov chain and the limiting state probability $P_i$ in state $Z_i$ of the original process in continuous time. For time-homogeneous Markov processes (Appendix A7.5), it follows $T_i = 1 / \rho_i$ (Eqs. (A7.166), (A7.165), (A7.102)); for this case, Eqs. (A7.174) & (A7.177) yield
$$A^0_{ij}(x) = P_i\, p_{ij}\, (1 - e^{-\rho_i x}) = P_i\, \frac{\rho_{ij}}{\rho_i}\, (1 - e^{-\rho_i x}), \qquad i, j \in \{0, \ldots, m\}, \quad j \ne i, \qquad (A7.181)$$
and
$$h_i(t) = h_i = P_i\, \rho_i = P_i / T_i = 1 / T_{ii}, \qquad i = 0, \ldots, m, \qquad (A7.182)$$
respectively. Eq. (A7.181) follows also directly from Eq. (A7.164) by considering $F^0_{ij}(x) = F_{ij}(x) = 1 - e^{-\rho_i x}$. Eq. (A7.181) expresses the stationary property of time-homogeneous Markov processes (see also Eqs. (A7.151) and (A7.127)). Furthermore, Eq. (A7.161) holds with $p_{ij} = \rho_{ij} / \rho_i$, and Eq. (A7.176) reduces to Eq. (A7.130).
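The computation of the stationary state probabilities (stationary distribution of the embedded Markov chain, then weighting with the mean sojourn times) can be sketched in Python as follows. This is only an illustrative sketch: the helper names `solve_linear`, `embedded_stationary`, and `state_probabilities` are hypothetical, and the numerical values for the alternating renewal process (MTTF = 100, MTTR = 2) are assumed.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x


def embedded_stationary(p):
    """Stationary distribution of the embedded chain (balance equations of the
    embedded Markov chain): the last balance equation is dropped and
    replaced by the normalization sum(pi) = 1."""
    n = len(p)
    # Row j encodes  sum_i pi_i * p[i][j] - pi_j = 0  for j = 0..n-2
    a = [[p[i][j] - (1.0 if i == j else 0.0) for i in range(n)]
         for j in range(n - 1)]
    a.append([1.0] * n)
    b = [0.0] * (n - 1) + [1.0]
    return solve_linear(a, b)


def state_probabilities(p, T):
    """Stationary state probabilities P_i = pi_i T_i / sum_k pi_k T_k."""
    pi = embedded_stationary(p)
    norm = sum(pk * tk for pk, tk in zip(pi, T))
    return [pk * tk / norm for pk, tk in zip(pi, T)]


# Alternating renewal process: Z0 = up, Z1 = down, p01 = p10 = 1
p_alt = [[0.0, 1.0], [1.0, 0.0]]
T_alt = [100.0, 2.0]          # assumed MTTF and MTTR
pi_alt = embedded_stationary(p_alt)
P_alt = state_probabilities(p_alt, T_alt)
```

Here `pi_alt` is the stationary distribution of the embedded chain (1/2, 1/2), while `P_alt[0]` equals MTTF / (MTTF + MTTR), which makes the difference between the two distributions concrete.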
A7.7 Semi-regenerative Processes
As pointed out in Appendix A7.5.2, the time behavior of a repairable system can be described by a time-homogeneous Markov process only if failure-free times and repair times of all elements are exponentially distributed (constant failure and repair rates during the stay (sojourn) time in every state, with possible stepwise change at state transitions, e.g. because of load sharing). Except for the special case of the Erlang distribution (Section 6.3.3), nonexponentially distributed repair and / or failure-free times lead in some few cases to semi-Markov processes and in general to processes with only few regeneration states or to nonregenerative processes. To make sure that the time behavior of a system can be described by a semi-Markov process, there must be no "running" failure-free time or repair time at any state transition (state change) which is not exponentially distributed; otherwise the sojourn time to the next state transition would depend on how long these nonexponentially distributed times have already run. Example A7.12 shows the case of a process with states $Z_0$, $Z_1$, $Z_2$ in which only states $Z_0$ and $Z_1$ are regeneration states. $Z_0$ and $Z_1$ form a semi-Markov process embedded in the original process, on which the investigation can be based. Processes with an embedded semi-Markov process are called semi-regenerative processes. Their investigation can become time-consuming and has to be performed in general on a case-by-case basis, see for instance Example A7.12 (Fig. A7.11), Fig. A7.12, and Sections 6.4.2, 6.4.3, 6.5.2.
Figure A7.11 a) Possible time schedule for a 1-out-of-2 warm redundancy with constant failure rates ($\lambda$, $\lambda_r$), arbitrary repair rate (density $g(x)$), only one repair crew (repair times greatly exaggerated; legend: operating, reserve, repair, renewal points for $Z_0$ and $Z_1$, respectively); b) State transition diagram for the embedded semi-Markov process with regeneration states $Z_0$ and $Z_1$ ($Q'_{12}$ is not a semi-Markov transition probability; during a transition $Z_1 \to Z_2 \to Z_1$, the embedded Markov chain (on $\{Z_0, Z_1\}$) remains in $Z_1$); this model holds for a k-out-of-n warm redundancy with $n - k = 1$ as well
Example A7.12 Consider a 1-out-of-2 warm redundancy as in Fig. A7.4a with constant failure rates $\lambda$ in operating and $\lambda_r$ in reserve state, one repair crew, arbitrarily distributed repair time with distribution $G(x)$ and density $g(x)$. Give the transition probabilities for the embedded semi-Markov process.
Solution As Fig. A7.11a shows, only states $Z_0$ and $Z_1$ are regeneration states. $Z_2$ is not a regeneration state because at the transition points into $Z_2$ a repair with arbitrary repair rate is running. Thus, the process involved is not a semi-Markov process. However, states $Z_0$ and $Z_1$ form an embedded semi-Markov process on which investigations can be based. The transition probabilities of the embedded semi-Markov process are obtained (using Eq. (A7.99) and Fig. A7.11) as
$$Q_{01}(x) = 1 - e^{-(\lambda + \lambda_r) x}, \qquad Q_{10}(x) = \int_0^x g(y)\, e^{-\lambda y}\, dy, \qquad Q_{121}(x) = \Pr\{\tau_{121} \le x\} = \int_0^x g(y)\, (1 - e^{-\lambda y})\, dy. \qquad (A7.183)$$
$Q_{121}(x)$ is used to calculate the point availability (Section 6.4.2). It accounts for the process returning from state $Z_2$ to state $Z_1$ (Fig. A7.11a) and for the fact that $Z_2$ is not a regeneration state (transition $Z_1 \to Z_2 \to Z_1$; during such a transition, the embedded Markov chain (on $\{Z_0, Z_1\}$) remains in $Z_1$). $Q'_{12}(x)$ as given in Fig. A7.11b is not a semi-Markov transition probability ($Z_2$ is not a regeneration state). However, $Q'_{12}(x)$ expressed as (see Fig. A7.11a)
$$Q'_{12}(x) = \int_0^x \lambda\, e^{-\lambda y}\, (1 - G(y))\, dy = 1 - e^{-\lambda x} - \int_0^x \lambda\, e^{-\lambda y}\, G(y)\, dy, \qquad (A7.184)$$
yields an equivalent $Q_1(x) = Q_{10}(x) + Q'_{12}(x)$ useful for calculation purposes (see Section 6.4.2).
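The relations of Example A7.12 can be checked numerically. The sketch below evaluates, by a simple midpoint rule, the integrals $\int_0^x g(y)\,e^{-\lambda y}\,dy$ (taken here as $Q_{10}(x)$), $\int_0^x g(y)\,(1 - e^{-\lambda y})\,dy$ (i.e. $Q_{121}(x)$), and $\int_0^x \lambda e^{-\lambda y}(1 - G(y))\,dy$ (i.e. $Q'_{12}(x)$). The exponential repair time, the parameter values, and all function names are assumptions chosen only to make the example runnable.

```python
import math


def integrate(f, a, b, n=20_000):
    """Composite midpoint rule for the integral of f over (a, b)."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def transition_probabilities(lam, g, G, x):
    """Return (Q_10(x), Q_121(x), Q'_12(x)) for repair density g and
    repair distribution function G, with operating failure rate lam."""
    q10 = integrate(lambda y: g(y) * math.exp(-lam * y), 0.0, x)
    q121 = integrate(lambda y: g(y) * (1.0 - math.exp(-lam * y)), 0.0, x)
    q12p = integrate(lambda y: lam * math.exp(-lam * y) * (1.0 - G(y)), 0.0, x)
    return q10, q121, q12p


# Assumed values: failure rate 0.01/h, exponential repair with rate 0.5/h
lam, mu, x = 0.01, 0.5, 4.0
g = lambda y: mu * math.exp(-mu * y)
G = lambda y: 1.0 - math.exp(-mu * y)
q10, q121, q12p = transition_probabilities(lam, g, G, x)

# Two consistency checks that hold for any repair distribution:
# Q_10(x) + Q_121(x) = G(x)   and   Q_10(x) + Q'_12(x) = 1 - (1 - G(x)) e^{-lam x}
check_repair = q10 + q121 - G(x)
check_sojourn = q10 + q12p - (1.0 - (1.0 - G(x)) * math.exp(-lam * x))
```

Both check values should be zero up to the integration error, which supports the statement that $Q_{10}(x) + Q'_{12}(x)$ gives the distribution of the sojourn time in $Z_1$.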
Figure A7.12 a) Possible time schedule for a k-out-of-n warm redundancy with $n - k = 2$, constant failure rates ($\lambda$ & $\lambda_r$), arbitrary repair rate (density $g(x)$), only one repair crew, and no further failure at system down (repair times greatly exaggerated, operating and reserve elements not separately shown in the operating phases at system level; legend: operating, repair, renewal points for $Z_0$, $Z_1$, and $Z_{2'}$, respectively); b) State transition diagram for the embedded semi-Markov process with regeneration states $Z_0$, $Z_1$, and $Z_{2'}$
Replacing in Eqs. (A7.183) and (A7.184) $\lambda$ with $k\lambda$ leads to a k-out-of-n warm redundancy with $n - k = 1$, constant failure rates ($\lambda$, $\lambda_r$), arbitrary repair times with density $g(x)$, only one repair crew, and no further failure at system down. As a second example, Fig. A7.12 gives a possible time schedule for a k-out-of-n warm redundancy with $n - k = 2$, constant failure rates ($\lambda$, $\lambda_r$), arbitrary repair rate (density $g(x)$), only one repair crew, and no further failure at system down. Given is also the state transition diagram of the involved semi-regenerative process. States $Z_0$, $Z_1$, and $Z_{2'}$ are regeneration states; $Z_2$ and $Z_3$ are not regeneration states. The corresponding transition probabilities of the embedded semi-Markov process are
$Q_{121}(x)$, $Q_{1232'}(x)$, and $Q_{2'32'}(x)$ are used to calculate the point availability. They account for the transitions throughout the nonregenerative states $Z_2$ and $Z_3$.
Similarly as for $Q'_{12}(x)$ in Example A7.12, the quantities
are not semi-Markov transition probabilities; however, they are useful for calculation purposes (to simplify, they are not shown in Fig. A7.12b). Results for $g(x) = \mu e^{-\mu x}$, i.e. for constant repair rate $\mu$, are given in Table 6.8 ($n - k = 2$).
In the following, some general considerations on semi-regenerative processes are given. A pure jump process $\xi(t)$, $t \ge 0$, with state space $Z_0, \ldots, Z_m$ is semi-regenerative, with regeneration states $Z_0, \ldots, Z_k$, $k < m$, if the following holds: Let $\zeta_0, \zeta_1, \ldots$ be the sequence of successively occurring regeneration states and $\varphi_0, \varphi_1, \ldots$ the random time intervals between consecutive occurrences of regeneration states (continuous and > 0); then Eq. (A7.158) must be fulfilled for $n = 0, 1, 2, \ldots$, arbitrary $i, j, i_0, \ldots, i_{n-1} \in \{0, \ldots, k\}$, and arbitrary positive values $x_0, \ldots, x_{n-1}$ (where $\xi_n$, $\eta_n$ have been changed in $\zeta_n$, $\varphi_n$). In other words, the process taking the value $\zeta_n$ for $\varphi_0 + \ldots + \varphi_{n-1} \le t < \varphi_0 + \ldots + \varphi_n$ is a semi-Markov process with state space $Z_0, \ldots, Z_k$, embedded in the original process $\xi(t)$. Under suitable conditions, the limiting state probabilities
$$\lim_{t \to \infty} \Pr\{\xi(t) = Z_i\}, \qquad i = 0, \ldots, k,$$
exist and do not depend on the initial distribution at $t = 0$, see e.g. [A6.6 (Vol. II)]. The proof is based on the key renewal theorem (Eq. (A7.29)). Denoting by $T_i$ the mean sojourn time in the state $Z_i$ and by $T_{ii}$ the mean of the time interval between two consecutive occurrences of $Z_i$ (cycle length), it holds for $i = 0, \ldots, k$ that
$$\lim_{t \to \infty} \Pr\{\xi(t) = Z_i\} = P_i = T_i / T_{ii}.$$
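For the 1-out-of-2 warm redundancy of Example A7.12 it holds that $\mathcal{P}_0 = \mathcal{P}_1 = 1/2$ (embedded Markov chain), $T_0 = 1 / (\lambda + \lambda_r)$, $T_1 = (1 - \tilde{g}(\lambda)) / \lambda$, $T_{00} = 1 / (\lambda + \lambda_r) + MTTR + [(1 - \tilde{g}(\lambda)) / \tilde{g}(\lambda)]\, MTTR = 1 / (\lambda + \lambda_r) + MTTR / \tilde{g}(\lambda)$, $T_{11} = \tilde{g}(\lambda)\, [1 / (\lambda + \lambda_r) + MTTR] + (1 - \tilde{g}(\lambda))\, MTTR$, $P_0 = T_0 / T_{00}$, and $P_1 = T_1 / T_{11}$, where $\tilde{g}(\lambda) = \int_0^\infty g(x)\, e^{-\lambda x}\, dx$ is the Laplace transform of the repair time density at $\lambda$. The final result for $PA_S = P_0 + P_1$ is given by Eq. (6.109). For constant repair rate $\mu$, $\tilde{g}(\lambda) = \mu / (\lambda + \mu)$ and $T_0 = 1 / (\lambda + \lambda_r)$, $T_1 = 1 / (\lambda + \mu)$, $T_{00} = (\mu^2 + (\lambda + \lambda_r)(\lambda + \mu)) / (\mu^2 (\lambda + \lambda_r))$, $T_{11} = (\mu^2 + (\lambda + \lambda_r)(\lambda + \mu)) / (\mu (\lambda + \lambda_r)(\lambda + \mu))$, yielding $PA_S = P_0 + P_1 = T_0 / T_{00} + T_1 / T_{11}$ according to Eq. (6.88), or Eq. (A7.152) for $\lambda_r = \lambda$; this case can also be investigated considering 3 regeneration states as per Fig. 6.8a.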
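The cycle argument above lends itself to a short numerical check. The following Python sketch computes $PA_S = T_0/T_{00} + T_1/T_{11}$ for the 1-out-of-2 warm redundancy from the mean repair time and the Laplace transform of the repair density at $\lambda$, and compares the constant repair rate case with the closed form obtained by summing the two fractions. The function name `pa_warm_redundancy`, the numerical values, and the deterministic-repair variant are assumptions used for illustration only.

```python
import math


def pa_warm_redundancy(lam, lam_r, mttr, g_tilde):
    """Steady-state point availability PA_S = P_0 + P_1 with P_i = T_i / T_ii.

    lam, lam_r : failure rates in operating and in reserve state
    mttr       : mean repair time
    g_tilde    : Laplace transform of the repair-time density evaluated at lam
    """
    t0 = 1.0 / (lam + lam_r)          # mean sojourn time in Z0
    t1 = (1.0 - g_tilde) / lam        # mean sojourn time in Z1
    t00 = t0 + mttr / g_tilde         # mean cycle length between entries in Z0
    t11 = g_tilde * t0 + mttr         # mean cycle length between entries in Z1
    return t0 / t00 + t1 / t11


lam, lam_r, mu = 1e-3, 1e-4, 0.5      # assumed rates (per hour)

# Constant repair rate mu: g_tilde = mu / (lam + mu), MTTR = 1 / mu
pa_exp = pa_warm_redundancy(lam, lam_r, 1.0 / mu, mu / (lam + mu))
pa_closed = (mu**2 + mu * (lam + lam_r)) / (mu**2 + (lam + lam_r) * (lam + mu))

# Deterministic repair time of the same mean (assumed variant):
# g_tilde = exp(-lam * MTTR)
pa_det = pa_warm_redundancy(lam, lam_r, 1.0 / mu, math.exp(-lam / mu))
```

`pa_exp` and `pa_closed` agree to machine precision. Comparing `pa_det` with `pa_exp` gives a feel for how the shape of the repair time distribution, at equal mean, influences the availability.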
A7.8 Nonregenerative Stochastic Processes
The assumption of arbitrarily (i.e. not exponentially) distributed failure-free and repair (restoration) times for the elements of a system, already leads to nonregenerative stochastic processes for simple series or parallel structures. After some general considerations, nonregenerative processes used in reliability analysis are introduced.
A7.8.1 General Considerations
Solutions for nonregenerative stochastic processes are often problem-oriented. However, as a possible general method, transformation of the given stochastic process into a Markov or a semi-Markov process by a suitable state space extension can be used in some cases in one of the following ways:
1. Approximation of distribution functions: Approximating the involved distribution functions (for repair and / or failure-free times) by an Erlang distribution (Eq. (A6.102)) allows a transformation of the original process into a time-homogeneous Markov process through introduction of additional states.
2. Introduction of supplementary variables: Introducing for every element of a system as supplementary variables the failure-free time since the last repair and the repair time since the last failure, the original process can be transformed into a Markov process with state space consisting of discrete and continuous parameters. Investigations usually lead to partial differential equations which have to be solved with the corresponding boundary conditions.
The first method is best used when repair and / or failure rates are monotonically increasing from zero to a final value; its application is easy to understand (Fig. 6.6). The second method [A7.4 (1955)] is very general, but often time-consuming.
A further method is based on the general concept of point process. Considering the sequence of jump times $\tau^*_n$ and states $\zeta_n$ entered at these points, an equivalent description of the process $\xi(t)$ is obtained by a marked point process $(\tau^*_n, \zeta_n)$, $n = 0, 1, \ldots$ . Analysis of the system's steady-state behavior follows using Korolyuk's theorem ($\Pr\{\text{jump into } Z_i \text{ during } (t, t + \delta t]\} = \lambda_i\, \delta t + o(\delta t)$, with $\lambda_i$ = E[number of jumps in $Z_i$ during the unit time interval]), see e.g. [A7.11, A7.12]. As an example, consider a repairable coherent system with n totally independent elements (p. 61). Let $\zeta_1(t), \ldots, \zeta_n(t)$ and $\zeta(t)$ be the binary processes with states 0 (down) & 1 (up) describing elements and system, respectively. If the steady-state point availability of each element
$$\lim_{t \to \infty} PA_i(t) = \lim_{t \to \infty} \Pr\{\zeta_i(t) = 1\} = PA_i = \frac{MTTF_i}{MTTF_i + MTTR_i}, \qquad i = 1, \ldots, n, \qquad (A7.189)$$
exists, then the steady-state point availability of the system is given by Eq. (2.48) and can be expressed as $PA_S = MTTF_S / (MTTF_S + MTTR_S)$, see e.g. [6.4, A7.10]. Investigation of the time behavior of systems with arbitrary failure and / or repair rates can become time-consuming. In these cases, approximate expressions can help to get results (see Section 6.7 for some examples).
A7.8.2 Nonhomogeneous Poisson Processes (NHPP)
A nonhomogeneous Poisson process (NHPP) is a point process with independent Poisson distributed increments, i.e. a sequence of points (events) on the time axis whose count function $\nu(t)$ has independent increments (in nonoverlapping intervals) and satisfies
$$\Pr\{\nu(t) = k\} = \frac{(M(t))^k}{k!}\, e^{-M(t)}, \qquad k = 0, 1, \ldots, \quad t > 0, \quad \nu(0) = 0. \qquad (A7.190)$$
$\nu(t)$ gives the number of events in $(0, t]$. In the following, $\nu(t)$ is assumed right continuous with unit jumps. $M(t)$ is the mean of $\nu(t)$, called mean value function,
$$M(t) = E[\nu(t)], \qquad t > 0, \quad M(0) = 0, \qquad (A7.191)$$
and it holds that (Example A6.20)
$$\mathrm{Var}[\nu(t)] = E[\nu(t)] = M(t), \qquad t > 0, \quad M(0) = 0. \qquad (A7.192)$$
$M(t)$ is a nondecreasing, continuous function with $M(0) = 0$, often assumed increasing, unbounded, and absolutely continuous. If
$$m(t) = dM(t) / dt \ge 0, \qquad t > 0, \qquad (A7.193)$$
exists, $m(t)$ is the intensity of the NHPP. Eqs. (A7.193) and (A7.191) yield
$$\Pr\{\nu(t + \delta t) - \nu(t) = 1\} = m(t)\, \delta t + o(\delta t), \qquad t > 0, \quad \delta t \downarrow 0, \qquad (A7.194)$$
and no distinction is made between arrival rate and intensity. Equation (A7.194) gives the unconditional probability for one event (e.g. failure) in $(t, t + \delta t]$. $m(t)$ corresponds to the renewal density $h(t)$ (Eq. (A7.24)) but differs basically from the failure rate $\lambda(t)$, see remark on p. 356. Equation (A7.194) also shows that a NHPP is locally without aftereffect. This holds globally (Eq. (A7.195)) and characterizes the NHPP. However, memoryless (i.e. with independent and stationary increments) is only the homogeneous Poisson process (HPP), for which $M(t) = \lambda t$ holds. Nonhomogeneous Poisson processes have been greatly investigated in the literature, see e.g. [6.3, A7.3, A7.12, A7.21, A7.25, A7.30, A8.1]. This appendix gives some important results useful for reliability analysis. These results hold for HPP ($M(t) = \lambda t$) as well, and most of them are a direct consequence of the independent increments property. In particular, the number of events in a time interval $(a, b]$
$$\Pr\{k \text{ events in } (a, b] \mid H_a\} = \Pr\{k \text{ events in } (a, b]\} = \frac{(M(b) - M(a))^k}{k!}\, e^{-(M(b) - M(a))}, \qquad k = 0, 1, 2, \ldots, \quad 0 \le a < b, \qquad (A7.195)$$
and the rest waiting time $\tau_R(t)$ from an arbitrary time point $t \ge 0$ to the next event
$$\Pr\{\tau_R(t) > x \mid H_t\} = \Pr\{\text{no event in } (t, t + x] \mid H_t\} = \Pr\{\text{no event in } (t, t + x]\} = e^{-(M(t + x) - M(t))}, \qquad x > 0, \qquad (A7.196)$$
are independent of the process development up to time $t$ (history $H_a$ or $H_t$). Thus, also the mean $E[\tau_R(t)]$ is independent of the history and given by
$$E[\tau_R(t)] = \int_0^\infty e^{-(M(t + x) - M(t))}\, dx. \qquad (A7.197)$$
Let $0 < \tau^*_1 < \tau^*_2 < \ldots$ be the occurrence times (arrival times) of the event considered (e.g. failures of a repairable system), measured from the origin $t = \tau^*_0 = 0$ and taking values $0 < t^*_1 < t^*_2 < \ldots$ . *) Furthermore, let $\eta_n = \tau^*_n - \tau^*_{n-1}$ be the n-th interarrival time ($n = 1, 2, \ldots$). The following results hold:
*) The superscript * is used to explicitly show that $\tau^*_1, \tau^*_2, \ldots$, or $t^*_1, t^*_2, \ldots$, are points on the time axis and not independent observations of a random variable $\tau$, e.g. as in Figs. 1.1, 7.12, 7.14.
1. The occurrence times (arrival times) $\tau^*_1, \tau^*_2, \ldots$ have joint density
$$f(t^*_1, t^*_2, \ldots, t^*_n) = \prod_{i=1}^{n} m(t^*_i)\, e^{-(M(t^*_i) - M(t^*_{i-1}))} = e^{-M(t^*_n)} \prod_{i=1}^{n} m(t^*_i), \qquad t^*_0 = 0 < t^*_1 < \ldots < t^*_n, \qquad (A7.198)$$
(follows from Eqs. (A7.194) & (A7.195)) and marginal distribution function
$$F_i(t) = \Pr\{\tau^*_i \le t\} = 1 - \sum_{k=0}^{i-1} \frac{(M(t))^k}{k!}\, e^{-M(t)}, \qquad (A7.199)$$
with density $f_i(t) = m(t)\, (M(t))^{i-1}\, e^{-M(t)} / (i - 1)!$ & mean $E[\tau^*_i] = \int_0^\infty x\, f_i(x)\, dx$ (events $\{\tau^*_i \le t\}$ and {at least i events have occurred in $(0, t]$} are equivalent).
2. The quantities
$$\psi^*_1 = M(\tau^*_1) < \psi^*_2 = M(\tau^*_2) < \ldots \qquad (A7.200)$$
are the occurrence times in a HPP with intensity one ($M(t) = t$) (follows from $\nu_{\psi^*}(t) = \nu_{\tau^*}(M^{-1}(t))$, so that $E[\nu_{\psi^*}(t)] = E[\nu_{\tau^*}(M^{-1}(t))] = M(M^{-1}(t)) = t$, see Eq. (A6.31)).
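3. The conditional distribution functions of $\eta_{n+1}$ & $\tau^*_{n+1}$ given $\eta_1 = x_1, \ldots, \eta_n = x_n$ are, with $t^*_n = x_1 + \ldots + x_n$,
$$\Pr\{\eta_{n+1} \le x \mid (\eta_1 = x_1 \cap \ldots \cap \eta_n = x_n)\} = 1 - e^{-(M(t^*_n + x) - M(t^*_n))}, \qquad x > 0, \qquad (A7.201)$$
$$\Pr\{\tau^*_{n+1} \le t \mid (\eta_1 = x_1 \cap \ldots \cap \eta_n = x_n)\} = 1 - e^{-(M(t) - M(t^*_n))}, \qquad t > t^*_n, \qquad (A7.202)$$
(follow from Eq. (A7.195) with $k = 0$ or from Eq. (A7.196)).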
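4. For given (fixed) $t = T$ and $\nu(T) = n$ (time censoring), the joint density of the occurrence times $0 < \tau^*_1 < \ldots < \tau^*_n < T$ under the condition $\nu(T) = n$ is given by
$$f(t^*_1, \ldots, t^*_n \mid n) = n! \prod_{i=1}^{n} \frac{m(t^*_i)}{M(T)}, \qquad 0 < t^*_1 < \ldots < t^*_n < T, \qquad (A7.203)$$
(see Example A7.13) and that of $0 < \tau^*_1 < \ldots < \tau^*_n < T$ and $\nu(T) = n$ is
$$f(t^*_1, \ldots, t^*_n, n) = e^{-M(T)} \prod_{i=1}^{n} m(t^*_i), \qquad 0 < t^*_1 < \ldots < t^*_n < T, \qquad (A7.204)$$
(follows from Eqs. (A7.203) and (A7.190)). From Eq. (A7.203) one recognizes that for given (fixed) $t = T$ and $\nu(T) = n$, the occurrence times $0 < \tau^*_1 < \ldots < \tau^*_n < T$ have the same distribution as if they were the order statistics of n independent identically distributed random variables with density
$$m(t) / M(T) \qquad (A7.205)$$
& distribution function $M(t) / M(T)$ on $(0, T)$ (compare Eqs. (A7.210), (A7.211)).
5. Furthermore, for given (fixed) $t = T$ and $\nu(T) = n$, the quantities
$$0 < \frac{M(\tau^*_1)}{M(T)} < \ldots < \frac{M(\tau^*_n)}{M(T)} < 1 \qquad (A7.206)$$
have the same distribution as if they were the order statistics of n independent identically distributed random variables with density one, i.e. uniformly distributed, on $(0, 1)$ (follows from Point 2 above (Eq. (A7.200)) and Eq. (A7.213)). For the case in which one takes $T = t^*_n$ (failure censoring), Eqs. (A7.203) - (A7.206) and (A7.210) - (A7.213) hold with $n - 1$ instead of $n$.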
6. The quantity
$$\frac{\nu(t) - M(t)}{\sqrt{M(t)}} \qquad (A7.207)$$
has for $t \to \infty$ a standard normal distribution (follows basically from Point 2 above and Eqs. (A7.34), (A7.191), (A7.192)).
7. The sum of n independent NHPPs with mean value functions $M_i(t)$ and intensities $m_i(t)$ is a NHPP with mean value function and intensity
$$M(t) = \sum_{i=1}^{n} M_i(t) \qquad \text{and} \qquad m(t) = \sum_{i=1}^{n} m_i(t), \qquad (A7.208)$$
respectively (follows from the independent increments property of NHPPs and Eq. (A7.190), see Eq. (7.27) for HPPs).
From the above properties, the following conclusions can be drawn:
(1) For $i = 1$, Eq. (A7.199) yields
$$\Pr\{\tau^*_1 \le t\} = 1 - e^{-M(t)} = 1 - e^{-\int_0^t m(x)\, dx}; \qquad (A7.209)$$
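Example A7.13 Show that for given (fixed) $T$ and $\nu(T) = n$, the occurrence times $0 < \tau^*_1 < \ldots < \tau^*_n < T$ of a NHPP have the same distribution as if they were the order statistics of n independent identically distributed random variables with density $m(t) / M(T)$ on $(0, T)$.
Solution Dividing the joint density of the occurrence times and of no further event in $(t^*_n, T]$ (Eqs. (A7.198) and (A7.196)) by $\Pr\{\nu(T) = n\}$ (Eq. (A7.190)) yields
$$f(t^*_1, \ldots, t^*_n \mid n) = \frac{\left[ \prod_{i=1}^{n} m(t^*_i)\, e^{-(M(t^*_i) - M(t^*_{i-1}))} \right] e^{-(M(T) - M(t^*_n))}}{(M(T))^n\, e^{-M(T)} / n!} = n! \prod_{i=1}^{n} \frac{m(t^*_i)}{M(T)}. \qquad (A7.210)$$
Considering that for a set of n realizations of a given random variable there are n! permutations giving the same order statistics, the joint density of the order statistics of n independent identically distributed random variables with density $m(t) / M(T)$ on the interval $(0, T)$ is given by
$$f(t^*_1, \ldots, t^*_n) = n! \prod_{i=1}^{n} \frac{m(t^*_i)}{M(T)}, \qquad \text{on } (0, T), \quad 0 < t^*_1 < \ldots < t^*_n < T, \qquad (A7.211)$$
yielding Eq. (A7.203).
Supplementary results for HPPs: For a HPP, Eq. (A7.205) yields $m(t) / M(T) = \lambda / \lambda T = 1 / T$ and thus
$$f(t^*_1, \ldots, t^*_n \mid n) = n! / T^n \qquad \text{on } (0, T). \qquad (A7.212)$$
Furthermore, when considering $\tau^*_i / T$, Eqs. (A7.205) and (A6.31) yield $T\, m(t\, T) / M(T) = T \lambda / \lambda T = 1$ and
$$f(t^*_1, \ldots, t^*_n \mid n) = n! \qquad \text{on } (0, 1). \qquad (A7.213)$$
Thus, for given (fixed) $T$ and $\nu(T) = n$, the arrival times $0 < \tau^*_1 < \ldots < \tau^*_n < T$ of a HPP are distributed as the order statistics of n independent random variables uniformly distributed on $(0, T)$.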
thus, comparing Eqs. (A7.209) and (A6.26) it follows that the intensity of a NHPP is equal to the failure rate of the first occurrence time $\tau^*_1$ or interarrival time $\eta_1 = \tau^*_1$.
(2) Equation (A7.201) shows that the conditional density of the interarrival time $\tau^*_{n+1} - \tau^*_n$ given $\tau^*_n = t^*_n$ is independent of the process development up to the time $t^*_n$ and is equal to the conditional failure rate at time $t^*_n + x$ of the first occurrence time $\tau^*_1$ given $\tau^*_1 > t^*_n$ (Eq. (A6.28)), for any $n \ge 1$; this leads to the concept of bad-as-old used in some considerations on repairable systems, see e.g. [6.3, A7.30].
(3) From Eq. (A7.202), the distribution of the occurrence time $\tau^*_{n+1}$ depends only on $\tau^*_n$; thus, $\tau^*_1, \tau^*_2, \ldots$ is a Markov sequence.
(4) From Eq. (A7.204) one can obtain Eq. (A7.198) by considering $\Pr\{\text{no event in } (t^*_n, T]\} = e^{-(M(T) - M(t^*_n))}$, and vice versa.
(5) Equations (A7.198) and (A7.199) show that for a NHPP, occurrence (arrival) times are not independent; the same holds for interarrival times, which are neither independent nor identically distributed. Thus, the NHPP is not a regenerative process. On the other hand, the homogeneous Poisson process (HPP) is a renewal process, with independent interarrival times distributed according to the same exponential distribution (Eq. (A7.38)) and Gamma distributed occurrence times (Eq. (A7.39)). However, because of independent increments, the NHPP is without aftereffect (memoryless if HPP) and the sum of Poisson processes is a Poisson process, both in the homogeneous and in the nonhomogeneous case (Eq. (7.27)).
Convergence of a point process to a NHPP or to a HPP is discussed in Appendices A7.8.3 and A7.8.5. Although appealing, the assumption of independent increments, mandatory for Poisson processes (HPP and NHPP), can limit the validity of models used in practical applications with arbitrary failure and / or repair rates (see e.g. Sections 7.6 and 7.7). However, the properties in Points 1 - 6 above (in particular Eqs. (A7.200) & (A7.206)) are useful for statistical tests on NHPPs, as well as for Monte Carlo simulations. In particular, results for exponential distributions or for HPPs can be used, and the Kolmogorov-Smirnov test holds with $F_0(t) = M_0(t) / M_0(T)$ and $\hat{F}(t) = \nu(t) / \nu(T)$ (Sections 7.6 - 7.7). Equation (A7.205) is useful to generate realizations of a NHPP (generate k for given $T$ and $M(T)$ (Eq. (A7.190)), then k random variables with density $m(t) / M(T)$; the ordered values are the k occurrence times of the NHPP on $(0, T)$).
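The generation recipe just described (number of events from a Poisson distribution with mean $M(T)$, then independent values with distribution function $M(t)/M(T)$, then ordering) can be sketched in Python as follows. The power-law mean value function $M(t) = (t/\alpha)^\beta$, the parameter values, and the function name `simulate_nhpp` are assumptions for illustration; the Poisson count is sampled by counting unit-rate exponential gaps, which is one of several possible ways.

```python
import random


def simulate_nhpp(mean_value, inverse_mean_value, T, rng):
    """One realization of a NHPP on (0, T).

    mean_value(t)         : M(t), increasing and continuous, M(0) = 0
    inverse_mean_value(y) : inverse function of M
    rng                   : random.Random instance
    """
    big_m = mean_value(T)
    # Step 1: number of events k, Poisson distributed with mean M(T),
    # sampled by counting unit-rate exponential gaps that fit into (0, M(T)].
    k, s = 0, rng.expovariate(1.0)
    while s <= big_m:
        k += 1
        s += rng.expovariate(1.0)
    # Step 2: k independent values with distribution function M(t)/M(T),
    # obtained by inverting u * M(T) for u uniform on (0, 1).
    times = [inverse_mean_value(rng.random() * big_m) for _ in range(k)]
    # Step 3: the ordered values are the occurrence times.
    return sorted(times)


# Assumed power-law mean value function M(t) = (t / alpha) ** beta
alpha, beta, T_end = 100.0, 1.5, 1000.0
M = lambda t: (t / alpha) ** beta
M_inv = lambda y: alpha * y ** (1.0 / beta)

rng = random.Random(1)
times = simulate_nhpp(M, M_inv, T_end, rng)
```

Averaged over many realizations, the number of generated events should be close to $M(T)$, here about 31.6. A closed-form inverse of $M$ is assumed to be available; otherwise a numerical root finder would be needed in step 2.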
A7.8.3 Superimposed Renewal Processes
Consider a repairable series system with n totally independent elements (p. 52) and assume that repair times are negligible and that after each repair (renewal) the repaired element is as-good-as-new. Let $MTTF_i$ be the mean time to failure of element $E_i$ and $MTTF_S$ that of the system. The flow of system failures is given by the superposition of n independent renewal processes, each of them related to an element of the system. If $\nu_S(t)$ is the count function at system level giving the number of system failures in $(0, t]$ and $\nu_i(t)$ that of element $E_i$, it holds that
$$\nu_S(t) = \sum_{i=1}^{n} \nu_i(t), \qquad t > 0, \quad \nu_i(0) = 0, \quad i = 1, 2, \ldots, n. \qquad (A7.214)$$
$\nu_i(t)$ is a random variable, distributed as per Eq. (A7.12). Thus, for the mean value function at system level $Z_S(t)$ it follows that (Eqs. (A6.68) and (A7.15))
$$Z_S(t) = E[\nu_S(t)] = \sum_{i=1}^{n} E[\nu_i(t)] = \sum_{i=1}^{n} H_i(t), \qquad (A7.215)$$
yielding for the failure intensity at system level $z_S(t)$ (Eq. (A7.18))
$$z_S(t) = \frac{dZ_S(t)}{dt} = \sum_{i=1}^{n} h_i(t). \qquad (A7.216)$$
In Eqs. (A7.215) and (A7.216), $H_i(t)$ and $h_i(t)$ are the renewal function and renewal density of the renewal process related to element $E_i$. However, the point process yielding $\nu_S(t)$ is not a renewal process. Simple results hold only for homogeneous Poisson processes (HPP), whose sum is a HPP (Eq. (7.27)). The same holds for nonhomogeneous Poisson processes (NHPP), but a NHPP is not a renewal process. For (stochastically) independent renewal processes, it can be shown that:
1. The sum of n independent stationary renewal processes is a stationary renewal process with renewal density
$$h_S = \sum_{i=1}^{n} \frac{1}{MTTF_i} \qquad (A7.217)$$
(follows basically from Eq. (A7.36)).
2. For $n \to \infty$, the sum of n independent renewal processes with very low occurrence (one occurrence of any type and $\ge 2$ occurrences of all types are unlikely), and for which $\lim_{n \to \infty} \sum_{i=1}^{n} \Pr\{\nu_i(t) - \nu_i(a) = 1\} = M(t) - M(a)$ holds for any fixed $t$ and $a < t$, converges to a NHPP with $E[\nu(t)] = M(t)$ for all $t > 0$ (Grigelionis [A7.14], see also [A7.12, A7.30]); furthermore, if all renewal densities $h_i(t)$ are bounded (at $t = 0$), the sum converges for $n \to \infty$ to a HPP [A7.14].
3. For $t \to \infty$ and $n \to \infty$, the sum of n independent renewal processes with low occurrence (one occurrence of any type is unlikely) converges to a HPP with renewal density as per Eq. (A7.217) [A7.17], see also [A7.8, A7.12, A7.30].
A7.8.4 Cumulative Processes
Cumulative processes [A7.24, A7.4 (1962)], also known as compound processes [A7.3, A7.9 (Vol. 2), A7.21], are obtained when at the occurrence of each event in a point process a random variable is generated and the stochastic process given by the sum of these random variables is considered. The involved point process is often limited to a renewal process (including the homogeneous Poisson process (HPP), yielding a cumulative or compound Poisson process) or to a nonhomogeneous Poisson process (NHPP). The generated random variable can have arbitrary distribution. Cumulative processes can be used to model some practical situations; for instance, the total maintenance cost for a repairable system over a given period of time or the cumulative damage caused by random shocks on a mechanical structure (assuming linear superposition of damage). If a subsidiary series of events is generated instead of a random variable and the two types of events are indistinguishable, the process is a branching process [A7.3, A7.21, A7.30], discussed e.g. in [6.3, A7.5] as a model to describe failure occurrence when secondary failures are triggered by primary failures.
Let $\nu(t)$ be the count function giving the number of events (on the time axis) of the involved point process (Fig. A7.1), $\zeta_i$ the generated random variable at the occurrence of the ith event, and $\xi_t$ the sum of $\zeta_i$ over $(0, t]$,
$$\xi_t = \sum_{i=1}^{\nu(t)} \zeta_i, \qquad \xi_t = 0 \ \text{for} \ \nu(t) = 0. \qquad (A7.218)$$
The stochastic process of value $\xi_t$ ($t > 0$) is a cumulative process. It is not difficult to recognize that for $\zeta_i > 0$, $\xi_t$ is distributed as the total repair time (total down time) for failures occurred in a total operating time (total up time) $t$ of a repairable item, and is thus given by the work-mission availability (Eq. (6.32)). In the following some important results are given for the case in which the involved point process is a homogeneous Poisson process (HPP) with parameter $\lambda$ and the generated random variables are independent of $\nu(t)$ and have the same exponential distribution with parameter $\mu$. From Eq. (6.33), with $T_0 = t$, it follows that
$$\Pr\{\xi_t \le x\} = 1 - e^{-(\lambda t + \mu x)} \sum_{n=1}^{\infty} \frac{(\lambda t)^n}{n!} \sum_{k=0}^{n-1} \frac{(\mu x)^k}{k!}, \qquad t > 0 \text{ given}, \quad x > 0, \quad \Pr\{\xi_t = 0\} = e^{-\lambda t}. \qquad (A7.219)$$
Mean and variance of $\xi_t$ follow as (Eqs. (A7.219), (A6.38), (A6.45), (A6.41))
$$E[\xi_t] = \lambda t / \mu \qquad \text{and} \qquad \mathrm{Var}[\xi_t] = 2 \lambda t / \mu^2. \qquad (A7.220)$$
Furthermore, for $t \to \infty$ the distribution of $\xi_t$ approaches a normal distribution with mean and variance as per Eq. (A7.220), see also Eq. (7.22). Moments of $\xi_t$ can also be obtained using the moment generating function [A7.3, A7.4 (1962)] or directly by considering Eq. (A7.218), yielding (Example A7.14)
$$E[\xi_t] = E[\nu(t)]\, E[\zeta_i] \qquad \text{and} \qquad \mathrm{Var}[\xi_t] = E[\nu(t)]\, \mathrm{Var}[\zeta_i] + \mathrm{Var}[\nu(t)]\, E^2[\zeta_i]. \qquad (A7.221)$$
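The moment relations for a cumulative process can be checked by a small Monte Carlo experiment. The Python sketch below evaluates mean and variance from the general product formulas (mean of the count times mean of the increment, and the two-term variance expression) and compares them with simulated values for a HPP with exponentially distributed increments. The parameter values, sample size, and function names are assumptions for illustration only.

```python
import random


def cumulative_moments(mean_n, var_n, mean_z, var_z):
    """Mean and variance of the cumulative process from the moments of the
    count nu(t) and of the increments zeta_i (independence assumed)."""
    return mean_n * mean_z, mean_n * var_z + var_n * mean_z ** 2


def simulate_cumulative(lam, mu, t, rng):
    """One value of the cumulative process at time t: HPP with rate lam,
    independent exponential increments with parameter mu."""
    total, s = 0.0, rng.expovariate(lam)
    while s <= t:
        total += rng.expovariate(mu)
        s += rng.expovariate(lam)
    return total


lam, mu, t = 0.01, 0.5, 1000.0        # assumed values
# HPP: E[nu] = Var[nu] = lam * t; exponential: E = 1/mu, Var = 1/mu**2
mean_th, var_th = cumulative_moments(lam * t, lam * t, 1.0 / mu, 1.0 / mu ** 2)

rng = random.Random(42)
samples = [simulate_cumulative(lam, mu, t, rng) for _ in range(20_000)]
mean_mc = sum(samples) / len(samples)
var_mc = sum((x - mean_mc) ** 2 for x in samples) / (len(samples) - 1)
```

For these values the formulas give a mean of $\lambda t/\mu = 20$ and a variance of $2\lambda t/\mu^2 = 80$; the simulated moments should agree within the usual Monte Carlo scatter.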
Of interest in some practical applications can also be the distribution of the time $\tau_C$ at which the process $\xi_t$ ($t > 0$) crosses a given (fixed) barrier $C$. For the case given by Eq. (A7.219), i.e. in particular for $\zeta_i > 0$, the events
$$\{\tau_C > t\} \qquad \text{and} \qquad \{\xi_t \le C\} \qquad (A7.222)$$
are equivalent. From Eq. (A7.219) it follows then
$$\Pr\{\tau_C > t\} = 1 - e^{-(\lambda t + \mu C)} \sum_{n=1}^{\infty} \frac{(\lambda t)^n}{n!} \sum_{k=0}^{n-1} \frac{(\mu C)^k}{k!}, \qquad t > 0, \quad C > 0 \text{ given}. \qquad (A7.223)$$
Cumulative processes are regenerative only if the involved point process is regenerative, in particular thus for the HPP investigated above. However, because of possible generalizations (NHPP, arbitrary point processes), they have been considered in this Appendix devoted to nonregenerative stochastic processes.
Example A7.14 Prove Eq. (A7.221).
Solution Considering $\zeta_i > 0$, continuous with finite mean & variance ($i = 1, 2, \ldots$), and independent of $\nu(t)$, for given $\nu(t) = n$ Eq. (A7.218) yields for the mean and variance of $\xi_t$ (Appendix A6.8)
$$E[\xi_t \mid \nu(t) = n] = n\, E[\zeta_i] \qquad \text{and} \qquad \mathrm{Var}[\xi_t \mid \nu(t) = n] = n\, \mathrm{Var}[\zeta_i]. \qquad (A7.224)$$
From Eq. (A7.224) it follows then
$$E[\xi_t] = \sum_{n=1}^{\infty} \Pr\{\nu(t) = n\}\, n\, E[\zeta_i] = E[\nu(t)]\, E[\zeta_i]. \qquad (A7.225)$$
For $\mathrm{Var}[\xi_t]$, it holds that (Eq. (A6.45)) $\mathrm{Var}[\xi_t] = E[\xi_t^2] - E^2[\xi_t]$; from which, considering $\xi_t^2 = (\zeta_1 + \ldots + \zeta_{\nu(t)})^2$ and Eq. (A7.225) (as well as Eq. (A6.69) for row 2 and Eq. (A6.45) for row 3 below)
$$\mathrm{Var}[\xi_t] = \sum_{n=1}^{\infty} \Pr\{\nu(t) = n\}\, E[(\zeta_1 + \ldots + \zeta_n)^2] - E^2[\nu(t)]\, E^2[\zeta_i]$$
$$= \sum_{n=1}^{\infty} \Pr\{\nu(t) = n\}\, [\, n\, \mathrm{Var}[\zeta_i] + n^2\, E^2[\zeta_i]\, ] - E^2[\nu(t)]\, E^2[\zeta_i]$$
$$= E[\nu(t)]\, \mathrm{Var}[\zeta_i] + \mathrm{Var}[\nu(t)]\, E^2[\zeta_i].$$

A7.8.5 General Point Processes
A point process is an ordered sequence of points on the time axis, giving for example the failure occurrence of a repairable system. Poisson and renewal processes are simple examples of point processes. Assuming that simultaneous events cannot occur (with probability one) and assigning to the point process a count function $\nu(t)$ giving the number of events occurred in $(0, t]$, investigation of point processes can be performed on the basis of the involved count function $\nu(t)$. However, arbitrary point processes can lead to analytical difficulties, and results are known only for particular situations (low occurrence rate, stationary, regular, etc.). In reliability applications, general point processes can appear for example when investigating failure occurrence of repairable systems by neglecting repair times. In the following only some basic properties of general point processes will be discussed, see e.g. [A7.10, A7.11, A7.12, A7.30] for greater details.
Let $\nu(t)$ be a count function giving the number of events occurred in $(0, t]$, assume $\nu(0) = 0$ and that simultaneous occurrences are not possible. The underlying point process is stationary if $\nu(t)$ has stationary increments (Eq. (A7.5)) and without aftereffect if $\nu(t)$ has independent increments (Eq. (A7.2)). The sum of independent stationary point processes is a stationary point process. The same holds for processes without aftereffect. However, only the homogeneous Poisson process (HPP) is stationary and without aftereffect (memoryless). For a general point process, a mean value function
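giving the mean (expectation) of the number of points (events) in $(0, t]$ can be defined. $Z(t)$ is a nondecreasing, continuous function with $Z(0) = 0$, often assumed increasing, unbounded and absolutely continuous. If

$$z(t) = \frac{dZ(t)}{dt} = \lim_{\delta t \downarrow 0} \frac{1}{\delta t} \Pr\{\nu(t + \delta t) - \nu(t) = 1\} \ge 0, \qquad t > 0, \tag{A7.228}$$

exists, $z(t)$ is the intensity of the point process. Equations (A7.228) and (A7.227) yield

$$\Pr\{\nu(t + \delta t) - \nu(t) = 1\} = z(t)\, \delta t + o(\delta t), \qquad t > 0, \; \delta t \downarrow 0, \tag{A7.229}$$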
and no distinction is made between arrival rate and intensity. Equation (A7.229) gives the unconditional probability for one event (failure) in $(t, t + \delta t]$. $z(t)$ corresponds thus to $m(t)$ (Eq. (A7.193)) and $h(t)$ (Eq. (A7.24)), but differs basically from the failure rate $\lambda(t)$ (Eq. (A6.25)), which gives the conditional probability of failure in $(t, t + \delta t]$ given that the item was new at $t = 0$ and no failure has occurred in $(0, t]$. This distinction is important also for the case of a homogeneous Poisson process (Appendix A7.2.5), for which $\lambda(x) = \lambda$ holds for all interarrival times (with $x$ starting by 0 at each renewal point) and $h(t) = \lambda$ holds for the whole process. Misuses are known, in particular when dealing with reliability data analysis (see e.g. [6.3, A7.30] and comments on pp. 356 & 358, Appendix A7.8.2, and Sections 1.2.3, 7.6, 7.7). Thus, as a first rule to avoid confusion, for repairable items it is mandatory to use for interarrival times the variable $x$ starting by 0 at each failure (event), instead of $t$.
Some limit theorems on point processes are known, in particular on the convergence to an HPP, see e.g. [A7.10, A7.11, A7.12]. In reliability applications, $z(t)$ is called failure intensity [A1.4], or ROCOF (rate of occurrence of failures) in [6.3]. $z(t)$ applies in particular to repairable systems when repair (restoration) times are neglected. In this case, $\nu_S(t)$ is the count function giving the number of system failures occurred in $(0, t]$, with $\nu_S(0) = 0$, and

$$z_S(t) = \frac{d E[\nu_S(t)]}{dt} = \lim_{\delta t \downarrow 0} \frac{1}{\delta t} \Pr\{\nu_S(t + \delta t) - \nu_S(t) = 1\}, \qquad t > 0, \tag{A7.230}$$

is the system failure intensity.
A8 Basic Mathematical Statistics
Mathematical statistics deals basically with situations which can be described as follows: Given a population of statistically (stochastically) identical and independent elements with unknown statistical properties, measurements regarding these properties are made on a (random) sample of this population and, on the basis of the collected data, conclusions are made for the remaining elements of the population. Examples are the estimation of an unknown probability (e.g. defective probability), the parameter estimation for the distribution function of an item's failure-free time $\tau$, or a decision whether the mean of $\tau$ is greater than a given value. Mathematical statistics thus goes from observations (realizations) of a given (random) event in a series of independent trials to search for a suitable probabilistic model for the event considered (inductive approach). Methods used are based on probability theory and results obtained can only be formulated in a probabilistic language. Minimization of the risk for a false conclusion is an important objective in mathematical statistics. This Appendix introduces the basic concepts of mathematical statistics necessary for the quality and reliability tests given in Chapter 7. It is a compendium of mathematical statistics, consistent from a mathematical point of view but still with reliability engineering applications in mind. Emphasis is on empirical methods, parameter estimation, and testing of hypotheses. To simplify the notation, the terms random and statistical will be omitted (in general) and the term mean is used as a synonym for expected value. Estimated values are marked with ˆ. Selected examples illustrate practical aspects.
A8.1 Empirical Methods
Empirical methods allow a quick and easy evaluation / estimation of the distribution function and of the mean, variance, and other moments characterizing a random variable. These estimates are based on the empirical distribution function and have a great intuitive appeal. An advantage of the empirical distribution function, when plotted on an appropriate probability chart (probability plot paper), is to give a simple, rough visual check as to whether the assumed model seems correct.
A8.1.1 Empirical Distribution Function

A sample of size $n$ of a random variable $\tau$ with the distribution function $F(t)$ is a random vector $\vec{\tau} = (\tau_1, \ldots, \tau_n)$ whose components $\tau_i$ are assumed independent and identically distributed random variables with $F(t) = \Pr\{\tau_i \le t\}$, $i = 1, \ldots, n$. For instance, $\tau_1, \ldots, \tau_n$ are the failure-free times (failure-free operating times) of $n$ items randomly selected from a lot of statistically identical items with a distribution function $F(t)$ for the failure-free time $\tau$. The observed failure-free times, i.e. the realization of the random vector $\vec{\tau} = (\tau_1, \ldots, \tau_n)$, is a set $t_1, \ldots, t_n$ of statistically independent real values ($> 0$ in the case of failure-free times). Distinction between random variables $\tau_1, \ldots, \tau_n$ and their observations $t_1, \ldots, t_n$ is important from a mathematical point of view. +)

When the sample elements (observations) are ordered by increasing magnitude, an order sample $t_{(1)}, \ldots, t_{(n)}$ is obtained. In life tests, observations $t_1, \ldots, t_n$ often constitute themselves an order sample. An advantage of an order sample of $n$ observations on independent, identically distributed random variables with density $f(t)$ is the simple form of the joint density $f(t_{(1)}, \ldots, t_{(n)}) = n! \prod_i f(t_{(i)})$.

With the purpose of saving test duration and cost, life tests can be terminated (stopped) at the occurrence of the $k$th ordered observation ($k$th failure) or at a given (fixed) time $T_{test}$. If the test is stopped at the $k$th failure, a type II censoring occurs (from the left if the time origin of all observations is not known). A type I censoring occurs if the test is stopped at $T_{test}$. A third possibility is to stop the test at a given (fixed) number $k$ of observations (failures) or at $T_{test}$, whichever occurs first. The corresponding test plans are termed $(n, \bar{r}, k)$, $(n, \bar{r}, T_{test})$, and $(n, \bar{r}, (k, T_{test}))$, respectively, where $\bar{r}$ stands for "without replacement".
In many applications, failed items can be replaced (for instance in the case of a repairable item or system); in these cases $\bar{r}$ is changed to $r$ in the test plans. For a set of ordered observations $t_{(1)}, \ldots, t_{(n)}$, the right continuous function

$$\hat{F}_n(t) = \begin{cases} 0 & \text{for } t < t_{(1)} \\ i/n & \text{for } t_{(i)} \le t < t_{(i+1)}, \quad i = 1, \ldots, n-1 \\ 1 & \text{for } t \ge t_{(n)} \end{cases} \tag{A8.1}$$

is the empirical distribution function (EDF) of the random variable $\tau$, see Fig. A8.1 for a graphical representation. $\hat{F}_n(t)$ expresses the relative frequency of the event $\{\tau \le t\}$ in $n$ independent trial repetitions, and provides a well defined estimate of the distribution function $F(t) = \Pr\{\tau \le t\}$. The symbol ˆ is hereafter used to denote an estimate of an unknown quantity. As stated in the footnote on p. 504, when investigating the properties of the empirical distribution function $\hat{F}_n(t)$ it is necessary in Eq. (A8.1) to replace the observations $t_{(1)}, \ldots, t_{(n)}$ by the sample elements $\tau_{(1)}, \ldots, \tau_{(n)}$. For given $F(t)$ and any fixed value of $t$, the number of observations $\le t$, i.e. $n \hat{F}_n(t)$, is binomially distributed (Eq. (A6.120)) with parameter $p = F(t)$, mean

$$E[n \hat{F}_n(t)] = n F(t), \tag{A8.2}$$

and variance

$$\mathrm{Var}[n \hat{F}_n(t)] = n F(t) (1 - F(t)). \tag{A8.3}$$

+) The investigation of statistical methods and the discussion of their properties can only be based on the (random) sample $\tau_1, \ldots, \tau_n$. However, in applying the methods for a numerical evaluation (statistical decision), the observations $t_1, \ldots, t_n$ have to be used. For this reason, the same equation (or procedure) can be applied to $\tau_i$ or $t_i$ according to the situation.

Figure A8.1 Example of an empirical distribution function ($t_1, \ldots, t_n \equiv t_{(1)}, \ldots, t_{(n)}$ is assumed here)
Moreover, application of the strong law of large numbers (Eq. (A6.146)) shows that for any given (fixed) value of $t$, $\hat{F}_n(t)$ converges to $F(t)$ with probability one for $n \to \infty$. This convergence is uniform in $t$ and holds for the whole distribution function $F(t)$. Proof of this important result is given in the Glivenko-Cantelli theorem [A8.4, A8.14, A8.16], which states that the largest absolute deviation between $\hat{F}_n(t)$ and $F(t)$ over all $t$, i.e.

$$D_n = \sup_{-\infty < t < \infty} | \hat{F}_n(t) - F(t) |, \tag{A8.4}$$

converges with probability one toward 0

$$\Pr\{ \lim_{n \to \infty} D_n = 0 \} = 1. \tag{A8.5}$$
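The empirical distribution function of Eq. (A8.1) and the deviation $D_n$ can be computed directly from a list of observations. The sketch below is an illustration only: the function names are assumptions, and `max_deviation` relies on $F(t)$ being continuous, so that the supremum is reached just before or at one of the jump points $t_{(i)}$.

```python
import bisect

def empirical_distribution(observations):
    """Return the right continuous step function F_hat_n(t) of Eq. (A8.1)."""
    ordered = sorted(observations)
    n = len(ordered)

    def F_hat(t):
        # Number of observations <= t, divided by the sample size.
        return bisect.bisect_right(ordered, t) / n

    return F_hat

def max_deviation(observations, F):
    """Largest absolute deviation D_n between F_hat_n(t) and a continuous F(t)."""
    ordered = sorted(observations)
    n = len(ordered)
    d_n = 0.0
    for i, t in enumerate(ordered, start=1):
        # Compare F(t) with the step value just after and just before t_(i).
        d_n = max(d_n, abs(i / n - F(t)), abs(F(t) - (i - 1) / n))
    return d_n
```

For example, `empirical_distribution([3.0, 1.0, 2.0])` gives a function that jumps by 1/3 at each of the three observations.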
In life tests, observations $t_1, \ldots, t_n$ often constitute themselves an order sample. This is useful for statistical evaluation of data. However, if the test is stopped at the occurrence of the $k$th failure or at $T_{test}$, and $k$ or $T_{test}$ are small, the homogeneity of the sample can be questionable and the shape of $F(t)$ could change for $t > t_k$ or $t > T_{test}$ (e.g. because of wearout, see the remark on p. 320).
A8.1.2 Empirical Moments and Quantiles

The moments of a random variable $\tau$ are completely determined by the distribution function $F(t) = \Pr\{\tau \le t\}$. The empirical distribution function $\hat{F}_n(t)$ introduced in Appendix A8.1.1 can be used to estimate the unknown moments of $\tau$. The values $t_{(1)}, \ldots, t_{(n)}$ having been fixed, $\hat{F}_n(t)$ can be regarded as the distribution function of a discrete random variable with probability $p_k = 1/n$ at the points $t_{(k)}$, $k = 1, \ldots, n$. Using Eq. (A6.35), the corresponding mean is the empirical mean (empirical expectation) of $\tau$ and is given by

$$\hat{E}[\tau] = \frac{1}{n} \sum_{i=1}^{n} t_i. \tag{A8.6}$$

Taking into account the footnote on p. 504, $\hat{E}[\tau]$ is a random variable with mean

$$E[\hat{E}[\tau]] = E\Big[\frac{1}{n} \sum_{i=1}^{n} \tau_i\Big] = E[\tau] \tag{A8.7}$$

and variance

$$\mathrm{Var}[\hat{E}[\tau]] = \mathrm{Var}\Big[\frac{1}{n} \sum_{i=1}^{n} \tau_i\Big] = \frac{\mathrm{Var}[\tau]}{n}. \tag{A8.8}$$

Equation (A8.7) shows that $\hat{E}[\tau]$ is an unbiased estimate of $E[\tau]$, see Eq. (A8.18). Furthermore, from the strong law of large numbers (Eq. (A6.147)) it follows that for $n \to \infty$, $\hat{E}[\tau]$ converges with probability one toward $E[\tau]$

$$\Pr\Big\{ \lim_{n \to \infty} \Big( \frac{1}{n} \sum_{i=1}^{n} \tau_i \Big) = E[\tau] \Big\} = 1. \tag{A8.9}$$
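The exact distribution function of $\hat{E}[\tau]$ is known in a closed simple form only for some particular cases (normal, exponential, or Gamma distribution). However, the central limit theorem (Eq. (A6.148)) shows that for large values of $n$ the distribution of $\hat{E}[\tau]$ can always be approximated by a normal distribution with mean $E[\tau]$ and variance $\mathrm{Var}[\tau] / n$. Based on $\hat{F}_n(t)$, Eqs. (A6.43) and (A8.6) provide an estimate of the variance as

$$\frac{1}{n} \sum_{i=1}^{n} (t_i - \hat{E}[\tau])^2 = \frac{1}{n} \Big[ \sum_{i=1}^{n} t_i^2 - \frac{1}{n} \Big( \sum_{i=1}^{n} t_i \Big)^2 \Big]. \tag{A8.10}$$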
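The expectation of this estimate yields $\mathrm{Var}[\tau] (n - 1) / n$. For this reason, the empirical variance of $\tau$ is usually defined as

$$\hat{\mathrm{Var}}[\tau] = \frac{1}{n-1} \sum_{i=1}^{n} (t_i - \hat{E}[\tau])^2 = \frac{1}{n-1} \Big[ \sum_{i=1}^{n} t_i^2 - \frac{1}{n} \Big( \sum_{i=1}^{n} t_i \Big)^2 \Big], \tag{A8.11}$$

for which it follows that $E[\hat{\mathrm{Var}}[\tau]] = \mathrm{Var}[\tau]$. The higher-order moments (Eqs. (A6.41) and (A6.50)) can be estimated with

$$\frac{1}{n} \sum_{i=1}^{n} t_i^k \qquad \text{and} \qquad \frac{1}{n-1} \sum_{i=1}^{n} (t_i - \hat{E}[\tau])^k. \tag{A8.12}$$

The empirical quantile $\hat{t}_q$ is defined as the $q$ quantile (Appendix A6.6.3) of the empirical distribution function $\hat{F}_n(t)$

$$\hat{t}_q = \inf \{ t : \hat{F}_n(t) \ge q \}. \tag{A8.13}$$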
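The empirical mean, the empirical variance (with divisor $n - 1$), and the empirical quantile of Eq. (A8.13) can be sketched as follows. The function names are illustrative assumptions; the quantile uses the fact that $\hat{F}_n(t_{(i)}) = i/n$, so the infimum is the smallest ordered observation whose rank $i$ satisfies $i/n \ge q$.

```python
def empirical_mean(observations):
    """Empirical mean: arithmetic mean of the observations."""
    return sum(observations) / len(observations)

def empirical_variance(observations):
    """Empirical variance with divisor n - 1 (unbiased for Var[tau])."""
    n = len(observations)
    mean = empirical_mean(observations)
    return sum((t - mean) ** 2 for t in observations) / (n - 1)

def empirical_quantile(observations, q):
    """Smallest t with F_hat_n(t) >= q, for 0 < q <= 1."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    ordered = sorted(observations)
    n = len(ordered)
    for i, t in enumerate(ordered, start=1):
        if i / n >= q:       # F_hat_n jumps to i/n at t_(i)
            return t
    return ordered[-1]
```

With the (made-up) observations 10, 20, 30, 40 the mean is 25, the empirical variance is 500/3, and the 0.5 quantile is 20.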
A8.1.3 Further Applications of the Empirical Distribution Function

Comparison of the empirical distribution function $\hat{F}_n(t)$ with a given distribution function $F(t)$ is the basis for several non-parametric statistical methods. These include goodness-of-fit tests, confidence bands for distribution functions, and graphical methods using probability charts (probability plot papers). A quantity often used in this context is the largest absolute deviation $D_n$ between $\hat{F}_n(t)$ and $F(t)$, defined by Eq. (A8.4). If the distribution function $F(t)$ of the random variable $\tau$ is continuous, then the random variable $F(\tau)$ is uniformly distributed in $(0, 1)$. It follows that $D_n$ has a distribution independent of $F(t)$. A.N. Kolmogorov showed [A8.20] that for $F(t)$ continuous and $x > 0$,

$$\lim_{n \to \infty} \Pr\{ \sqrt{n}\, D_n \le x \mid F(t) \} = 1 + 2 \sum_{k=1}^{\infty} (-1)^k e^{-2 k^2 x^2}.$$

The series converges rapidly, so that for $x > 1/\sqrt{n}$,

$$\lim_{n \to \infty} \Pr\{ \sqrt{n}\, D_n \le x \mid F(t) \} \approx 1 - 2 e^{-2 x^2}. \tag{A8.14}$$
The distribution function of $D_n$ has been tabulated for small values of $n$ [A8.26], see Table A9.5 and Table A8.1. From the above it follows that:

For a given continuous distribution function $F(t)$, the band $F(t) \pm y_{1-\alpha}$ overlaps the empirical distribution function $\hat{F}_n(t)$ with probability $1 - \alpha_n$, where $\alpha_n \to \alpha$ as $n \to \infty$, with $y_{1-\alpha}$ defined by

$$\Pr\{ D_n \le y_{1-\alpha} \mid F(t) \} = 1 - \alpha \tag{A8.15}$$

and given in Table A9.5 or Table A8.1.
From Table A8.1 one recognizes that the convergence $\alpha_n \to \alpha$ is good (for practical purposes) for $n > 50$. If $F(t)$ is not continuous, it can be shown that with $y_{1-\alpha}$ from Eq. (A8.15), the band $F(t) \pm y_{1-\alpha}$ overlaps $\hat{F}_n(t)$ with a probability $1 - \alpha'_n$, where $\alpha'_n \to \alpha' \le \alpha$ as $n \to \infty$. The role of $F(t)$ and $\hat{F}_n(t)$ can be reversed, yielding:

The random band $\hat{F}_n(t) \pm y_{1-\alpha}$ overlaps the true (unknown) distribution function $F(t)$ with probability $1 - \alpha_n$, where $\alpha_n \to \alpha$ as $n \to \infty$.
This last consideration is an aspect of mathematical statistics, while the former one (in relation to Eq. (A8.15)) was a problem of probability theory. One has thus the possibility to estimate an unknown continuous distribution function $F(t)$ on the basis of the empirical distribution function $\hat{F}_n(t)$, see e.g. Figs. 7.12 and 7.14.

Example A8.1 How large is the confidence band around $\hat{F}_n(t)$ for $n = 30$ and for $n = 100$ if $\alpha = 0.2$?

Solution From Table A8.1, $y_{0.8} = 0.19$ for $n = 30$ and $y_{0.8} = 0.107$ for $n = 100$. This leads to the band $\hat{F}_n(t) \pm 0.19$ for $n = 30$ and $\hat{F}_n(t) \pm 0.107$ for $n = 100$.
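The tabulated values used in Example A8.1 can be compared with Kolmogorov's limiting distribution. The sketch below is only an asymptotic approximation under stated assumptions: it solves the one-term expression $1 - 2 e^{-2x^2} = 1 - \alpha$ for $x$ and sets $y_{1-\alpha} \approx x / \sqrt{n}$; exact small-sample values still have to be taken from the table. Function names are illustrative.

```python
import math

def kolmogorov_limit_cdf(x, terms=100):
    """Limiting value of Pr{sqrt(n) D_n <= x}: 1 + 2 sum (-1)^k exp(-2 k^2 x^2).

    Assumption: 100 terms are used, which is sufficient unless x is very small.
    """
    if x <= 0.0:
        return 0.0
    series = sum((-1) ** k * math.exp(-2.0 * k * k * x * x)
                 for k in range(1, terms + 1))
    return 1.0 + 2.0 * series

def asymptotic_band_halfwidth(n, alpha):
    """Approximate y_(1-alpha) from 1 - 2 exp(-2 x^2) = 1 - alpha, x = sqrt(n) y."""
    x = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return x / math.sqrt(n)
```

For $\alpha = 0.2$ this approximation gives about 0.107 for $n = 100$ and about 0.196 for $n = 30$, close to the tabulated 0.107 and 0.19.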
To simplify investigations, it is often useful to draw $\hat{F}_n(t)$ on a probability chart (probability plot paper). The method is as follows:

The empirical distribution function $\hat{F}_n(t)$ is drawn in a system of coordinates in which a postulated type of continuous distribution function is represented by a straight line; if the underlying distribution $F(t)$ belongs to this type of distribution function, then for a sufficiently large value of $n$ the points $(t_{(i)}, \hat{F}_n(t_{(i)}))$ will approximate to a straight line (a systematic deviation from a straight line, particularly in the domain $0.1 < \hat{F}_n(t) < 0.9$, leads to rejection of the type of distribution function assumed).

This can also be used as a simple, rough visual check as to whether an assumed model ($F(t)$) seems correct. In many cases, estimates for unknown parameters of the underlying distribution function $F(t)$ can be obtained from the estimated straight line for $\hat{F}_n(t)$. Probability charts for the Weibull (including exponential), lognormal and normal distribution functions are given in Appendix A9.8, some applications are in Section 7.5. The following is a derivation of the Weibull probability chart. The function $F(t) = 1 - e^{-(\lambda t)^\beta}$ can be transformed to $\log_{10}(1 / (1 - F(t))) = (\lambda t)^\beta \log_{10}(e)$ and finally to

$$\log_{10} \log_{10} \frac{1}{1 - F(t)} = \beta \log_{10}(t) + \beta \log_{10}(\lambda) + \log_{10} \log_{10}(e). \tag{A8.16}$$

In the system of coordinates $\log_{10}(t)$ and $\log_{10} \log_{10}(1 / (1 - F(t)))$, the Weibull distribution function given by $F(t) = 1 - e^{-(\lambda t)^\beta}$ appears as a straight line. Fig. A8.2 shows this for $\beta = 1.5$ and $\lambda = 1/800\ \mathrm{h}^{-1}$. As illustrated by Fig. A8.2, the parameters $\beta$ and $\lambda$ can be obtained graphically:

$\beta$ is the slope of the straight line, it appears on the scale $\log_{10} \log_{10}(1 / (1 - F(t)))$ if $t$ is changed by one decade (e.g. from $10^2$ to $10^3$ in Fig. A8.2),

for $\log_{10} \log_{10}(1 / (1 - F(t))) = \log_{10} \log_{10}(e)$, i.e. on the dashed line in Fig. A8.2, one has $\log_{10}(\lambda t) = 0$ and thus $\lambda = 1/t$.
The Weibull probability chart also applies to the exponential distribution ($\beta = 1$). For a three parameter Weibull distribution ($F(t) = 1 - e^{-(\lambda (t - \psi))^\beta}$, $t \ge \psi$) one can operate with the time axis $t' = t - \psi$, giving a straight line as before, or consider the concave curve obtained when using $t$ (see Fig. A8.2 for an example). Conversely, from a concave curve describing a Weibull distribution (e.g. in the case of an empirical data analysis) it is possible to find $\psi$ using the relationship $\psi = (t_1 t_2 - t_m^2) / (t_1 + t_2 - 2 t_m)$ existing between two arbitrary points $t_1$, $t_2$ and $t_m$ obtained from the mean of $F(t_1)$ and $F(t_2)$ on the scale $\log_{10} \log_{10}(1 / (1 - F(t)))$, see Example A6.14 for a derivation and Fig. A8.2 for an application with $t_1 = 400\,$h and $t_2 = 1000\,$h, yielding $t_m = 600\,$h and $\psi = 200\,$h.
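The relationship for $\psi$ can be checked numerically with the values quoted for Fig. A8.2. The sketch below stands in for the graphical procedure under stated assumptions: reading $t_m$ from the curve is replaced by a bisection on the chart ordinate, and all function names are illustrative.

```python
import math

def chart_ordinate(t, lam, beta, psi=0.0):
    """Ordinate log10 log10(1/(1 - F(t))) of a Weibull F(t), valid for t > psi."""
    exponent = (lam * (t - psi)) ** beta          # equals ln(1/(1 - F(t)))
    return math.log10(exponent * math.log10(math.e))

def abscissa_at(y, ordinate, t_low, t_high, iterations=200):
    """Bisection for the t in [t_low, t_high] at which the increasing ordinate equals y."""
    for _ in range(iterations):
        t_mid = 0.5 * (t_low + t_high)
        if ordinate(t_mid) < y:
            t_low = t_mid
        else:
            t_high = t_mid
    return 0.5 * (t_low + t_high)

def location_parameter(t1, t2, tm):
    """psi = (t1 t2 - tm^2) / (t1 + t2 - 2 tm)."""
    return (t1 * t2 - tm ** 2) / (t1 + t2 - 2.0 * tm)

# Concave curve for lambda = 1/800 per hour, beta = 1.5, psi = 200 h.
curve = lambda t: chart_ordinate(t, 1.0 / 800.0, 1.5, 200.0)
y_mean = 0.5 * (curve(400.0) + curve(1000.0))     # mean of the two ordinates
t_m = abscissa_at(y_mean, curve, 400.0, 1000.0)
psi_hat = location_parameter(400.0, 1000.0, t_m)
```

With these inputs `t_m` comes out at 600 h and `psi_hat` at 200 h, up to rounding, which matches the values given in the text.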
Figure A8.2 Weibull probability chart: The distribution function $F(t) = 1 - e^{-(\lambda t)^\beta}$ appears as a straight line (in the example $\lambda = 1/800\ \mathrm{h}^{-1}$ and $\beta = 1.5$); for a three parameter distribution $F(t) = 1 - e^{-(\lambda (t - \psi))^\beta}$, $t \ge \psi$, one can use $t' = t - \psi$ or operate with a concave curve and determine (as necessary) $\psi$, $\lambda$, and $\beta$ graphically (dashed curve for $\lambda = 1/800\ \mathrm{h}^{-1}$, $\beta = 1.5$, and $\psi = 200\,$h as an example)
A8.2 Parameter Estimation
In many applications it can be assumed that the type of distribution function $F(t)$ of the underlying random variable $\tau$ is known. This means that $F(t) = F(t, \theta_1, \ldots, \theta_r)$ is known in its functional form, the real-valued parameters $\theta_1, \ldots, \theta_r$ having to be estimated. The unknown parameters of $F(t)$ must be estimated on the basis of the observations $t_1, \ldots, t_n$. A distinction is made between point and interval estimation.
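A8.2.1 Point Estimation

Consider first the case where the given distribution function $F(t)$ only depends on a parameter $\theta$, assumed hereafter as an unknown constant +). A point estimate for $\theta$ is a function (statistic)

$$\hat{\theta}_n = u(t_1, \ldots, t_n) \tag{A8.17}$$

of the observations $t_1, \ldots, t_n$ of the random variable $\tau$ (not of the unknown parameter $\theta$ itself). The estimate $\hat{\theta}_n$ is

unbiased, if

$$E[\hat{\theta}_n] = \theta, \tag{A8.18}$$

consistent, if $\hat{\theta}_n$ converges to $\theta$ in probability, i.e. if for any $\varepsilon > 0$

$$\lim_{n \to \infty} \Pr\{ | \hat{\theta}_n - \theta | > \varepsilon \} = 0, \tag{A8.19}$$

strongly consistent, if $\hat{\theta}_n$ converges to $\theta$ with probability one, i.e.

$$\Pr\{ \lim_{n \to \infty} \hat{\theta}_n = \theta \} = 1, \tag{A8.20}$$

efficient, if

$$E[(\hat{\theta}_n - \theta)^2] \tag{A8.21}$$

is minimum over all possible point estimates for $\theta$,

sufficient (sufficient statistic for $\theta$), if $\hat{\theta}_n$ delivers the complete information about $\theta$ (available in the observations $t_1, \ldots, t_n$), i.e. if the conditional distribution of $\vec{\tau}$ for given $\hat{\theta}_n$ does not depend on $\theta$.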
+) Bayesian estimation theory, based on the Bayes theorem (Eqs. (A6.18), (A6.58-A6.59)) and which considers $\theta$ as a random variable and assigns to it an a priori distribution function, will not be considered in this book (as a function of the random sample, $\hat{\theta}_n$ is a random variable, while $\theta$ is an unknown constant). However, Bayesian statistics can be useful if knowledge on the a priori distribution function is well founded; for these cases one may refer e.g. to [A8.23, A8.24].
For an unbiased estimate, Eq. (A8.21) becomes

$$E[(\hat{\theta}_n - \theta)^2] = E[(\hat{\theta}_n - E[\hat{\theta}_n])^2] = \mathrm{Var}[\hat{\theta}_n]. \tag{A8.22}$$

An unbiased estimate is thus efficient if $\mathrm{Var}[\hat{\theta}_n]$ is minimum over all possible point estimates for $\theta$ and consistent if $\mathrm{Var}[\hat{\theta}_n] \to 0$ for $n \to \infty$. This last statement is a consequence of Chebyshev's inequality (Eq. (A6.49)). Efficiency can be checked using the Cramér-Rao inequality and sufficiency using the factorization criterion of the likelihood function, see e.g. [A8.1, A8.23]. Other useful properties of estimates are asymptotic unbiasedness and asymptotic efficiency.

Several methods are known for estimating $\theta$. To these belong the methods of moments, quantiles, least squares, and maximum likelihood. The maximum likelihood method [A8.1, A8.15, A8.23] is commonly used in engineering applications. It provides point estimates which, under relatively general conditions, are consistent, asymptotically unbiased, asymptotically efficient, and asymptotically normal-distributed. It can be shown that if an efficient estimate exists, then the likelihood equation (Eqs. (A8.25) or (A8.26)) has this estimate as a unique solution. Furthermore, an estimate $\hat{\theta}_n$ is sufficient if and only if the likelihood function (Eqs. (A8.23) or (A8.24)) can be written in two factors, one depending on $t_1, \ldots, t_n$ only, the other on $\theta$ and $\hat{\theta}_n = u(t_1, \ldots, t_n)$, see Examples A8.2 to A8.4.

The maximum likelihood method was developed by R.A. Fisher [A8.15 (1921)] and is based on the following idea:

Maximize, with respect to the unknown parameter $\theta$, the probability (Pr) that in a sample of size $n$, the (statistically independent) values $t_1, \ldots, t_n$ will be observed (i.e. maximize the probability of observing that record); this by maximizing the likelihood function ($L \sim \Pr$), defined as

$$L(t_1, \ldots, t_n, \theta) = \prod_{i=1}^{n} p(t_i, \theta), \qquad \text{with } p(t_i, \theta) = \Pr\{\tau = t_i\}, \tag{A8.23}$$

in the discrete case, and as

$$L(t_1, \ldots, t_n, \theta) = \prod_{i=1}^{n} f(t_i, \theta), \qquad \text{with } f(t_i, \theta) \text{ as density function}, \tag{A8.24}$$

in the continuous case.

Since the logarithmic function is monotonically increasing, the use of $\ln(L)$ instead of $L$ leads to the same result. If $L(t_1, \ldots, t_n, \theta)$ is differentiable and the maximum likelihood estimate $\hat{\theta}_n$ exists, then it will satisfy the equation

$$\frac{\partial L(t_1, \ldots, t_n, \theta)}{\partial \theta} \bigg|_{\theta = \hat{\theta}_n} = 0, \tag{A8.25}$$

or

$$\frac{\partial \ln L(t_1, \ldots, t_n, \theta)}{\partial \theta} \bigg|_{\theta = \hat{\theta}_n} = 0. \tag{A8.26}$$
The maximum likelihood method can be generalized to the case of a distribution function with a finite number of unknown parameters $\theta_1, \ldots, \theta_r$. Instead of Eq. (A8.26), the following system of $r$ algebraic equations must be solved

$$\frac{\partial \ln L(t_1, \ldots, t_n, \theta_1, \ldots, \theta_r)}{\partial \theta_i} \bigg|_{\theta_i = \hat{\theta}_{i n}} = 0, \qquad i = 1, \ldots, r. \tag{A8.27}$$

The existence and uniqueness of a maximum likelihood estimate is satisfied in most practical applications. To simplify the notation, in the following the index $n$ will be omitted for the estimated parameters.
Example A8.2 Let $t_1, \ldots, t_n$ be statistically independent observations of an exponentially distributed failure-free time $\tau$. Give the maximum likelihood estimate for the unknown parameter $\lambda$ of the exponential distribution.

Solution With $f(t, \lambda) = \lambda e^{-\lambda t}$, Eq. (A8.24) yields $L(t_1, \ldots, t_n, \lambda) = \lambda^n e^{-\lambda (t_1 + \ldots + t_n)}$, from which

$$\hat{\lambda} = \frac{n}{t_1 + \ldots + t_n}. \tag{A8.28}$$

This case corresponds to a sampling plan with $n$ elements without replacement, terminated at the occurrence of the $n$th failure. $\hat{\lambda}$ depends only on the sum $t_1 + \ldots + t_n$, not on the individual values of $t_i$; $t_1 + \ldots + t_n$ is a sufficient statistic and $\hat{\lambda}$ is a sufficient estimate ($L = 1 \cdot \lambda^n e^{-n \lambda / \hat{\lambda}}$). However, $\hat{\lambda} = n / (t_1 + \ldots + t_n)$ is a biased estimate; unbiased is $\hat{\lambda} = (n - 1) / (t_1 + \ldots + t_n)$, as well as $\hat{E}[\tau] = (t_1 + \ldots + t_n) / n$ given by Eq. (A8.6).
Example A8.3 Assuming that an event $A$ has occurred exactly $k$ times in $n$ Bernoulli trials, give the maximum likelihood estimate for the unknown probability $p$ for event $A$ to occur (binomial distribution).

Solution Using Eq. (A6.120), the likelihood function (Eq. (A8.23)) becomes

$$L = p_k = \binom{n}{k} p^k (1 - p)^{n-k} \qquad \text{or} \qquad \ln L = \ln \binom{n}{k} + k \ln p + (n - k) \ln(1 - p).$$

This leads to

$$\hat{p} = k / n. \tag{A8.29}$$

$\hat{p}$ is unbiased. It depends only on $k$, i.e. on the number of the event occurrences in $n$ independent trials; $k$ is a sufficient statistic and $\hat{p}$ is a sufficient estimate ($L = \binom{n}{k} [p^{\hat{p}} (1 - p)^{(1 - \hat{p})}]^n$).
Example A8.4 Let $k_1, \ldots, k_n$ be independent observations of a random variable $\zeta$ distributed according to the Poisson distribution defined by Eq. (A6.125). Give the maximum likelihood estimate for the unknown parameter $m$ of the Poisson distribution.

Solution The likelihood function becomes

$$L = \frac{m^{k_1 + \ldots + k_n}}{k_1! \cdots k_n!} e^{-n m} \qquad \text{or} \qquad \ln L = (k_1 + \ldots + k_n) \ln m - m n - \ln(k_1! \cdots k_n!)$$

and thus

$$\hat{m} = \frac{k_1 + \ldots + k_n}{n}. \tag{A8.30}$$

$\hat{m}$ is unbiased. It depends only on the sum $k_1 + \ldots + k_n$, not on the individual $k_i$; $k_1 + \ldots + k_n$ is a sufficient statistic and $\hat{m}$ is a sufficient estimate ($L = (1 / (k_1! \cdots k_n!)) \cdot (m^{n \hat{m}} e^{-n m})$).
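Example A8.5 Let $t_1, \ldots, t_n$ be statistically independent observations of a Weibull distributed failure-free time $\tau$. Give the maximum likelihood estimate for the unknown parameters $\lambda$ and $\beta$.

Solution With $f(t, \lambda, \beta) = \beta \lambda (\lambda t)^{\beta - 1} e^{-(\lambda t)^\beta}$ it follows from Eq. (A8.24) that

$$L(t_1, \ldots, t_n, \lambda, \beta) = (\beta \lambda^\beta)^n\, e^{-\lambda^\beta (t_1^\beta + \ldots + t_n^\beta)} \prod_{i=1}^{n} t_i^{\beta - 1},$$

yielding

$$\hat{\beta} = \Bigg[ \frac{\sum_{i=1}^{n} t_i^{\hat{\beta}} \ln t_i}{\sum_{i=1}^{n} t_i^{\hat{\beta}}} - \frac{1}{n} \sum_{i=1}^{n} \ln t_i \Bigg]^{-1} \qquad \text{and} \qquad \hat{\lambda} = \Bigg[ \frac{n}{\sum_{i=1}^{n} t_i^{\hat{\beta}}} \Bigg]^{1 / \hat{\beta}}. \tag{A8.31}$$

The solution for $\hat{\beta}$ is unique and can be found using Newton's approximation method (the value obtained from the empirical distribution function can give a good initial value, see Fig. 7.12).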
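The implicit equation for $\hat{\beta}$ in Example A8.5 has to be solved numerically. The following sketch assumes the likelihood equations $1/\hat{\beta} = \sum t_i^{\hat{\beta}} \ln t_i / \sum t_i^{\hat{\beta}} - \frac{1}{n} \sum \ln t_i$ and $\hat{\lambda} = (n / \sum t_i^{\hat{\beta}})^{1/\hat{\beta}}$. Note the swap: it uses bisection instead of the Newton method mentioned in the text, only because bisection needs no starting value. The function name, the rescaling by the largest observation (to avoid overflow), and the simulated data are assumptions for illustration.

```python
import math
import random

def weibull_mle(times, tol=1e-10):
    """Maximum likelihood estimates (lambda_hat, beta_hat) for a complete sample.

    Assumptions: all times are > 0 and not all equal.
    """
    n = len(times)
    t_max = max(times)
    s = [t / t_max for t in times]           # rescaled times, all in (0, 1]
    log_s = [math.log(x) for x in s]
    mean_log = sum(log_s) / n

    def g(beta):
        # Zero of g is beta_hat; g is negative near 0 and positive for large beta.
        w = [x ** beta for x in s]
        weighted = sum(wi * li for wi, li in zip(w, log_s)) / sum(w)
        return weighted - 1.0 / beta - mean_log

    lo, hi = 1e-6, 1.0
    while g(hi) <= 0.0:                       # enlarge the bracket until g changes sign
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    beta_hat = 0.5 * (lo + hi)
    # Undo the rescaling: lambda scales with 1 / t_max, beta is scale invariant.
    lam_hat = (n / sum(x ** beta_hat for x in s)) ** (1.0 / beta_hat) / t_max
    return lam_hat, beta_hat

# Illustrative data: 2000 simulated times with scale 1/lambda = 800 h, beta = 1.5.
rng = random.Random(1)
sample = [rng.weibullvariate(800.0, 1.5) for _ in range(2000)]
lam_hat, beta_hat = weibull_mle(sample)
```

For a sample of this size the estimates should lie close to $\lambda = 1/800\ \mathrm{h}^{-1}$ and $\beta = 1.5$; for small samples the scatter is considerably larger.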
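Due to cost and time limitations, the situation often arises in reliability applications in which the items under test are run in parallel and the test is stopped before all items have failed. If there are $n$ items, and at the end of the test $k$ have failed (at the individual failure times (times to failure) $t_1 < t_2 < \ldots < t_k$) and $n - k$ are still working, then the operating times $T_1, \ldots, T_{n-k}$ of the items still working at the end of the test should also be accounted for in the evaluation. Considering a Weibull distribution as in Example A8.5, and assuming that the operating times $T_1, \ldots, T_{n-k}$ have been observed in addition to the failure-free times $t_1, \ldots, t_k$, then

$$L(t_1, \ldots, t_k, \lambda, \beta) \sim (\beta \lambda^\beta)^k\, e^{-\lambda^\beta (t_1^\beta + \ldots + t_k^\beta)} \prod_{i=1}^{k} t_i^{\beta - 1} \prod_{j=1}^{n-k} e^{-(\lambda T_j)^\beta},$$

yielding

$$\hat{\beta} = \Bigg[ \frac{\sum_{i=1}^{k} t_i^{\hat{\beta}} \ln t_i + \sum_{j=1}^{n-k} T_j^{\hat{\beta}} \ln T_j}{\sum_{i=1}^{k} t_i^{\hat{\beta}} + \sum_{j=1}^{n-k} T_j^{\hat{\beta}}} - \frac{1}{k} \sum_{i=1}^{k} \ln t_i \Bigg]^{-1} \qquad \text{and} \qquad \hat{\lambda} = \Bigg[ \frac{k}{\sum_{i=1}^{k} t_i^{\hat{\beta}} + \sum_{j=1}^{n-k} T_j^{\hat{\beta}}} \Bigg]^{1 / \hat{\beta}}. \tag{A8.32}$$

The calculation method used for Eq. (A8.32) applies for any distribution function, yielding

$$L(t_1, \ldots, t_k, \theta) \sim \prod_{i} f(t_i, \theta) \prod_{j} (1 - F(T_j, \theta)), \tag{A8.33}$$

where $i$ runs over all observed times to failure, $j$ runs over all failure-free times (operating times without failure), and $\theta$ can be a vector. However, the following two cases must be distinguished: 1) $T_1 = \ldots = T_{n-k} = t_k$, i.e. the test is stopped at the (random) occurrence of the $k$th failure (Type II censoring), and 2) $T_1 = \ldots = T_{n-k} = T_{test}$ is the fixed (given) test duration (Type I censoring). The two situations are basically different and this has to be considered in data analysis, see e.g. the discussion below with Eqs. (A8.34) and (A8.35). For the exponential distribution ($\beta = 1$), Eq. (A8.31) reduces to Eq. (A8.28) and Eq. (A8.32) to

$$\hat{\lambda} = \frac{k}{\sum_{i=1}^{k} t_i + \sum_{j=1}^{n-k} T_j}. \tag{A8.34}$$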
If the test is stopped at the occurrence of the $k$th failure, $T_1 = \ldots = T_{n-k} = t_k$ (in general) and the quantity $T_r = t_1 + \ldots + t_k + (n - k) t_k$ is the random cumulative operating time over all items during the test. This situation corresponds to a sampling plan with $n$ elements without replacement (renewal), censored at the occurrence of the $k$th failure (Type II censoring). Because of the memoryless property of the Poisson process (Eqs. (7.26) and (7.27)), $T_r$ can be calculated as $T_r = n t_1 + (n - 1)(t_2 - t_1) + \ldots + (n - k + 1)(t_k - t_{k-1})$. It can be shown that the estimate $\hat{\lambda} = k / T_r$ is biased. An unbiased estimate is given by

$$\hat{\lambda} = \frac{k - 1}{T_r}. \tag{A8.35}$$
If the test is stopped at the fixed time $T_{test}$, then $T_r = t_1 + \ldots + t_k + (n - k) T_{test}$. In this case, $T_{test}$ is given (fixed) but $k$ as well as $t_1, \ldots, t_k$ are random. This situation corresponds to a sampling plan with $n$ elements without replacement, censored at a fixed (given) test time $T_{test}$ (Type I censoring). Also for this case, $k / T_r$ is biased. Important for practical applications, also because it yields unbiased results, is the case with replacement, see Appendix A8.2.2.2 and Section 7.2.3.
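For the exponential case, the cumulative operating time $T_r$ and the two estimates discussed above can be sketched as follows. The function names and the small data set are illustrative assumptions; `t_test=None` is used here to mean Type II censoring (test stopped at the last observed failure).

```python
def cumulative_operating_time(failure_times, n, t_test=None):
    """T_r = t_1 + ... + t_k + (n - k) * t_stop for n items without replacement.

    t_stop is the k-th (largest) failure time for Type II censoring
    (t_test is None), or the fixed test duration t_test for Type I censoring.
    """
    k = len(failure_times)
    t_stop = max(failure_times) if t_test is None else t_test
    return sum(failure_times) + (n - k) * t_stop

def cumulative_operating_time_from_spacings(failure_times, n):
    """Type II case computed as n t_1 + (n-1)(t_2 - t_1) + ... + (n-k+1)(t_k - t_(k-1))."""
    total, previous = 0.0, 0.0
    for i, t in enumerate(sorted(failure_times)):
        total += (n - i) * (t - previous)     # n - i items operate in this interval
        previous = t
    return total

def failure_rate_estimates(failure_times, n, t_test=None):
    """Return (k / T_r, (k - 1) / T_r); the second is given only for Type II censoring."""
    k = len(failure_times)
    t_r = cumulative_operating_time(failure_times, n, t_test)
    unbiased_type2 = (k - 1) / t_r if t_test is None else None
    return k / t_r, unbiased_type2
```

As a made-up example, $n = 5$ items with failures at 10 h, 30 h, and 60 h give $T_r = 220$ h when the test stops at the third failure, and $T_r = 300$ h when it runs to a fixed $T_{test} = 100$ h.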
A8.2.2 Interval Estimation

As shown in Appendix A8.2.1, a point estimation has the advantage of providing an estimate quickly. However, it does not give any indication as to the deviation of the estimate from the true parameter. More information can be obtained from an interval estimation. With an interval estimation, a (random) interval $[\hat{\theta}_l, \hat{\theta}_u]$ is sought such that it overlaps (covers) the true value of the unknown parameter $\theta$ with a given probability $\gamma$. $[\hat{\theta}_l, \hat{\theta}_u]$ is the confidence interval, $\hat{\theta}_l$ and $\hat{\theta}_u$ are the lower and upper confidence limits, and $\gamma$ is the confidence level. $\gamma$ has the following interpretation:

In an increasing number of independent samples of size $n$ (used to obtain confidence intervals), the relative frequency of the cases in which the confidence intervals $[\hat{\theta}_l, \hat{\theta}_u]$ overlap (cover) the unknown parameter $\theta$ converges to the confidence level $\gamma = 1 - \beta_1 - \beta_2$ ($0 < \beta_1 < 1 - \beta_2 < 1$).

$\beta_1$ and $\beta_2$ are the error probabilities related to the interval estimation. If $\gamma$ cannot be reached exactly, the true overlap probability should be near to, but not less than, $\gamma$. The confidence interval can also be one-sided, i.e. $(0, \hat{\theta}_u]$ or $[\hat{\theta}_l, \infty)$ for $\theta \ge 0$. Figure A8.3 shows some examples of confidence intervals. The concept of confidence intervals was introduced independently by J. Neyman and R. A. Fisher around 1930. In the following, some important cases for quality control and reliability tests are considered.
A8.2.2.1 Estimation of an Unknown Probability p

Consider a sequence of Bernoulli trials (Appendix A6.10.7) where a given event $A$ can occur with constant probability $p$ at each trial. The binomial distribution

$$p_k = \binom{n}{k} p^k (1 - p)^{n-k}$$
Figure A8.3 Examples of confidence intervals for $\theta \ge 0$ (two-sided; one-sided $\theta \le \hat{\theta}_u$; one-sided $\theta \ge \hat{\theta}_l$)
gives the probability that the event $A$ will occur exactly $k$ times in $n$ independent trials. From the expression for $p_k$, it follows that

$$\Pr\{ k_1 \le \text{observations of } A \text{ in } n \text{ trials} \le k_2 \mid p \} = \sum_{i=k_1}^{k_2} \binom{n}{i} p^i (1 - p)^{n-i}. \tag{A8.36}$$
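However, in mathematical statistics, the parameter $p$ is unknown. A confidence interval for $p$ is sought, based on the observed number of occurrences of the event $A$ in $n$ Bernoulli trials. A solution to this problem has been presented by Clopper and Pearson [A8.6]. For given $\gamma = 1 - \beta_1 - \beta_2$ ($0 < \beta_1 < 1 - \beta_2 < 1$) the following holds:

If in $n$ Bernoulli trials the event $A$ has occurred $k$ times, there is a probability nearly equal to (but not smaller than) $\gamma = 1 - \beta_1 - \beta_2$ that the confidence interval $[\hat{p}_l, \hat{p}_u]$ overlaps the true (unknown) probability $p$, with $\hat{p}_l$ and $\hat{p}_u$ given by

$$\sum_{i=k}^{n} \binom{n}{i} \hat{p}_l^{\,i} (1 - \hat{p}_l)^{n-i} = \beta_2, \qquad \text{for } 0 < k < n, \tag{A8.37}$$

$$\sum_{i=0}^{k} \binom{n}{i} \hat{p}_u^{\,i} (1 - \hat{p}_u)^{n-i} = \beta_1, \qquad \text{for } 0 < k < n; \tag{A8.38}$$

for $k = 0$ take

$$\hat{p}_l = 0 \quad \text{and} \quad \hat{p}_u = 1 - \sqrt[n]{\beta_1}, \qquad \text{with } \gamma = 1 - \beta_1, \tag{A8.39}$$

and for $k = n$ take

$$\hat{p}_l = \sqrt[n]{\beta_2} \quad \text{and} \quad \hat{p}_u = 1, \qquad \text{with } \gamma = 1 - \beta_2. \tag{A8.40}$$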
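The defining equations of the Clopper-Pearson limits can be solved numerically without tables. The sketch below assumes the forms $\sum_{i=k}^{n} \binom{n}{i} \hat{p}_l^{\,i} (1-\hat{p}_l)^{n-i} = \beta_2$ and $\sum_{i=0}^{k} \binom{n}{i} \hat{p}_u^{\,i} (1-\hat{p}_u)^{n-i} = \beta_1$ and uses plain bisection on $p \in [0, 1]$; the function names are illustrative, and a table of the Fisher or Beta distribution, as mentioned in the text, would give the same limits.

```python
import math

def binomial_cdf(k, n, p):
    """Pr{at most k occurrences in n Bernoulli trials with probability p}."""
    return sum(math.comb(n, i) * p ** i * (1.0 - p) ** (n - i)
               for i in range(k + 1))

def _bisect(func, target, increasing, iterations=100):
    """Solve func(p) = target on [0, 1] for a monotonic func."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def clopper_pearson(k, n, beta1, beta2):
    """Lower and upper confidence limits for p after k occurrences in n trials."""
    if k == 0:
        lower = 0.0
    else:
        # Pr{at least k occurrences} increases in p; set it equal to beta2.
        lower = _bisect(lambda p: 1.0 - binomial_cdf(k - 1, n, p), beta2, True)
    if k == n:
        upper = 1.0
    else:
        # Pr{at most k occurrences} decreases in p; set it equal to beta1.
        upper = _bisect(lambda p: binomial_cdf(k, n, p), beta1, False)
    return lower, upper
```

For $k = 0$ the bisection reproduces $\hat{p}_u = 1 - \sqrt[n]{\beta_1}$, and for $k = n$ it reproduces $\hat{p}_l = \sqrt[n]{\beta_2}$, since the sums then reduce to a single term.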
Considering that $k$ is a random variable, $\hat{p}_l$ and $\hat{p}_u$ are random variables. According to the footnote on p. 504, it would be more correct (from a mathematical point of view) to compute from Eqs. (A8.37) and (A8.38) the quantities $p_{kl}$ and $p_{ku}$, and then to set $\hat{p}_l = p_{kl}$ and $\hat{p}_u = p_{ku}$. For simplicity, this has been omitted here. Assuming $p$ as a random variable, $\beta_1$ and $\beta_2$ would be the probabilities for $p$ to be greater than $\hat{p}_u$ and smaller than $\hat{p}_l$, respectively (Fig. A8.3). The proof of Eqs. (A8.37) and (A8.38) is based on the monotonic property of the function

$$B_n(k, p) = \sum_{i=0}^{k} \binom{n}{i} p^i (1 - p)^{n-i}.$$
For given (fixed) $n$, $B_n(k, p)$ decreases in $p$ for fixed $k$ and increases in $k$ for fixed $p$ (Fig. A8.4). Thus, for any $p > \hat{p}_u$, it follows that $B_n(k, p) < B_n(k, \hat{p}_u) = \beta_1$. For $p > \hat{p}_u$, the probability that the (random) number of observations in $n$ trials will take one of the values $0, 1, \ldots, k$ is thus $< \beta_1$ (for $p > p'$ in Fig. A8.4, the statement would also be true for a $K > k$). This holds in particular for a number of observations equal to $k$ and proves Eq. (A8.38). Proof of Eq. (A8.37) is similar.

Figure A8.4 Binomial distribution as a function of $p$ for $n$ fixed and two values of $k$

To determine $\hat{p}_l$ and $\hat{p}_u$ as in Eqs. (A8.37) and (A8.38), a table of the Fisher distribution (Appendix A9.4) or of the Beta function can be used. However, for $\beta_1 = \beta_2 = (1 - \gamma)/2$ and $n$ sufficiently large, one of the following approximate solutions can be used in practical applications:

1. For large values of $n$ ($\min(n p, n(1 - p)) \ge 5$), a good estimate for $\hat{p}_l$ and $\hat{p}_u$ can be found using the integral Laplace theorem. Rearranging Eq. (A6.149) and considering $\sum_{i=1}^{n} \delta_i = k$ and $(k/n - p)^2$ instead of $| k/n - p |$ yields

$$\lim_{n \to \infty} \Pr\Big\{ \Big( \frac{k}{n} - p \Big)^2 \le \frac{b^2 p (1 - p)}{n} \Big\} = \frac{2}{\sqrt{2 \pi}} \int_0^b e^{-x^2/2}\, dx. \tag{A8.41}$$
The right-hand side of Eq. (A8.41) is equal to the confidence level y , i.e.
Thus, for a given y , the value of b can be obtained from a table of the normal distribution (Table A9.1). b is the 1- (1- y ) 12 = ( I + y ) I 2 quantile of the standard normal distribution @ ( t ) , i.e., b = t (l+y),2 giving e.g. b = 1.64 for y = 0.9. On the left-hand side of Eq. (A8.41), the expression
is the equation of the confidence ellipse. For given values of k , n, and b,
A8.2 Parameter Estimation
confidence limits $\hat p_l$ and $\hat p_u$ can be determined as roots of Eq. (A8.42)
$\hat p_{l,u} = \frac{k + b^2/2 \mp b\sqrt{k(1-k/n) + b^2/4}}{n + b^2}$, (A8.43)
see Figs. A8.5 and 7.1 for some examples.
2. For small values of n, confidence limits can be determined graphically from the envelopes of Eqs. (A8.37) and (A8.38) for $\beta_1 = \beta_2 = (1-\gamma)/2$, see Fig. 7.1 for $\gamma = 0.8$ and $\gamma = 0.9$. For n > 50, the curves of Fig. 7.1 practically agree with the confidence ellipses given by Eq. (A8.43).
One-sided confidence intervals can also be obtained from the above values for $\hat p_l$ and $\hat p_u$. Figure A8.3 shows that
$0 \le p \le \hat p_u$, with $\gamma = 1-\beta_1$, and $\hat p_l \le p \le 1$, with $\gamma = 1-\beta_2$. (A8.44)
Example A8.6
Using confidence ellipses, give the confidence interval $[\hat p_l, \hat p_u]$ for an unknown probability p for the case n = 50, k = 5, and $\gamma = 0.9$.
Solution
Setting n = 50, k = 5, and b = 1.64 in Eq. (A8.43) yields the confidence interval [0.05, 0.19], see also Fig. A8.5 or Fig. 7.1 for a graphical solution. Corresponding one-sided confidence intervals would be $p \le 0.19$ or $p \ge 0.05$ with $\gamma = 0.95$.
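The numerical step of Example A8.6 can be retraced with a short sketch. It assumes that the confidence ellipse has the form $(k/n - p)^2 = b^2 p(1-p)/n$, whose two roots in p are $(k + b^2/2 \mp b\sqrt{k(1-k/n) + b^2/4})/(n + b^2)$; the function name `ellipse_limits` is illustrative only.

```python
import math

def ellipse_limits(k: int, n: int, b: float) -> tuple[float, float]:
    """Roots in p of (k/n - p)^2 = b^2 * p * (1 - p) / n (assumed ellipse form)."""
    center = k + b * b / 2.0
    half_width = b * math.sqrt(k * (1.0 - k / n) + b * b / 4.0)
    denom = n + b * b
    return (center - half_width) / denom, (center + half_width) / denom

# Data of Example A8.6: n = 50, k = 5, b = 1.64 (gamma = 0.9)
p_l, p_u = ellipse_limits(k=5, n=50, b=1.64)
print(f"[{p_l:.2f}, {p_u:.2f}]")
```

With the data of the example, the two roots round to the interval [0.05, 0.19] quoted in the solution.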
Figure A8.5 Confidence limits (ellipses) for an unknown probability p with a confidence level $\gamma = 0.9$ and for n = 10, 25, 50, 100 (according to Eq. (A8.43))
The role of k/n and p in Eq. (A8.42) can be reversed, and Eq. (A8.42) can be used to solve a problem of probability theory, i.e. to compute for a given probability $\gamma$, $\gamma = 1-\beta_1-\beta_2$ with $\beta_1 = \beta_2$, the limits $k_1$ and $k_2$ of the number of observations k in n independent trials for given (fixed) values of p and n (e.g. the number k of defective items in a sample of size n)
$k_{1,2} = np \mp b\sqrt{np(1-p)}$. (A8.45)
As in Eq. (A8.43), the quantity b in Eq. (A8.45) is the $(1+\gamma)/2$ quantile of the normal distribution (e.g. b = 1.64 for $\gamma = 0.9$ from Table A9.1). For a graphical solution, Fig. A8.5 can be used, taking the ordinate p as known and reading $k_1/n$ and $k_2/n$ from the abscissa. An exact solution follows from Eq. (A8.36).
A8.2.2.2
Estimation of the Parameter λ for an Exponential Distribution: Fixed Test Duration, with Replacement
Consider an item having a constant failure rate $\lambda$ and assume that at each failure it will be immediately replaced by a new, statistically equivalent item, in a negligible replacement time (Appendix A7.2). Because of the memoryless property (constant failure rate), the number of failures in (0, T] is Poisson distributed and given by $\Pr\{k \text{ failures in } (0,T] \mid \lambda\} = (\lambda T)^k e^{-\lambda T}/k!$ (Eq. (A7.41)). The maximum likelihood point estimate for $\lambda$ follows from Eq. (A8.30), with n = 1 and $m = \lambda T$, as
$\hat\lambda = k/T$. (A8.46)
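Similarly, estimation of the confidence interval for the failure rate $\lambda$ can be reduced to the estimation of the confidence interval for the parameter $m = \lambda T$ of a Poisson distribution. Considering Eqs. (A8.37) and (A8.38) and the similarity between the binomial and the Poisson distribution, the confidence limits $\hat\lambda_l$ and $\hat\lambda_u$ can be determined for given $\beta_1$, $\beta_2$, and $\gamma = 1-\beta_1-\beta_2$ ($0 < \beta_1 < 1-\beta_2 < 1$) from
$\sum_{i=k}^{\infty} \frac{(\hat\lambda_l T)^i}{i!}\, e^{-\hat\lambda_l T} = \beta_2$ (A8.47)
and
$\sum_{i=0}^{k} \frac{(\hat\lambda_u T)^i}{i!}\, e^{-\hat\lambda_u T} = \beta_1$; (A8.48)
for k = 0 one takes
$\hat\lambda_l = 0$ and $\hat\lambda_u = \ln(1/\beta_1)/T$, with $\gamma = 1-\beta_1$. (A8.49)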
On the basis of the known relationship to the chi-square ($\chi^2$) distribution
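(Eqs. (A6.102), (A6.103), Appendix A9.2), the values $\hat\lambda_l$ and $\hat\lambda_u$ from Eqs. (A8.47) and (A8.48) follow from the quantiles of the chi-square distribution, yielding
$\hat\lambda_l = \frac{\chi^2_{2k,\,\beta_2}}{2T}$ for k > 0 ($\hat\lambda_l = 0$ for k = 0), (A8.50)
and
$\hat\lambda_u = \frac{\chi^2_{2(k+1),\,1-\beta_1}}{2T}$ for $k \ge 0$. (A8.51)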
$\beta_1 = \beta_2 = (1-\gamma)/2$ is frequently used in practical applications. Fig. 7.6 gives the results obtained from Eqs. (A8.50) and (A8.51) for $\beta_1 = \beta_2 = (1-\gamma)/2$. One-sided confidence intervals are given as in the previous section by
$0 \le \lambda \le \hat\lambda_u$, with $\gamma = 1-\beta_1$, and $\hat\lambda_l \le \lambda < \infty$, with $\gamma = 1-\beta_2$. (A8.52)
The situation considered by Eqs. (A8.47) to (A8.51) corresponds also to that of a sampling plan with n elements with replacement, each of them with failure rate $\lambda' = \lambda/n$, terminated at a fixed test time $T_{test} = T$. This situation is statistically different from that presented by Eq. (A8.34) and in Section A8.2.2.3.
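As a minimal sketch, the confidence limits for $\lambda$ can also be found numerically without chi-square tables. The sketch assumes that the limits are defined by the Poisson tail conditions analogous to the binomial case: the sum from i = k to infinity equals $\beta_2$ at the lower limit and the sum from i = 0 to k equals $\beta_1$ at the upper limit, with $m = \lambda T$. The function names and the bisection solver are illustrative choices, not part of the text.

```python
import math

def poisson_cdf(k: int, m: float) -> float:
    """Pr{at most k events} for a Poisson distribution with mean m."""
    term = math.exp(-m)
    total = term
    for i in range(1, k + 1):
        term *= m / i
        total += term
    return total

def solve_decreasing(f, target: float) -> float:
    """Bisection for f(m) = target, with f strictly decreasing in m >= 0."""
    lo, hi = 0.0, 1.0
    while f(hi) > target:          # enlarge the bracket until it contains the root
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def failure_rate_limits(k: int, T: float, beta1: float, beta2: float):
    """Lower and upper limits for lambda after k failures in (0, T]."""
    # Upper limit: sum_{i=0}^{k} m^i e^{-m} / i! = beta1
    m_u = solve_decreasing(lambda m: poisson_cdf(k, m), beta1)
    # Lower limit: sum_{i=k}^{inf} ... = beta2, i.e. cdf(k-1, m) = 1 - beta2; zero for k = 0
    m_l = 0.0 if k == 0 else solve_decreasing(lambda m: poisson_cdf(k - 1, m), 1.0 - beta2)
    return m_l / T, m_u / T

print(failure_rate_limits(k=0, T=1000.0, beta1=0.1, beta2=0.1))
```

For k = 0 the upper limit reduces to $\ln(1/\beta_1)/T$, which gives a quick consistency check of the solver.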
A8.2.2.3
Estimation of the Parameter λ for an Exponential Distribution: Fixed Number n of Failures, no Replacement
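Let $\tau_1, ..., \tau_n$ be independent random variables distributed according to a common distribution function $F(t) = \Pr\{\tau_i \le t\} = 1 - e^{-\lambda t}$, i = 1, ..., n. From Eq. (A7.39),
$\Pr\{\tau_1 + ... + \tau_n \le t\} = 1 - \sum_{i=0}^{n-1} \frac{(\lambda t)^i}{i!}\, e^{-\lambda t} = \frac{1}{(n-1)!} \int_0^{\lambda t} x^{n-1} e^{-x}\,dx$. (A8.53)
Setting $a = n(1-\varepsilon_2)/\lambda$ and $b = n(1+\varepsilon_1)/\lambda$ it follows that
$\Pr\{a < \tau_1 + ... + \tau_n \le b\} = \frac{1}{(n-1)!} \int_{n(1-\varepsilon_2)}^{n(1+\varepsilon_1)} x^{n-1} e^{-x}\,dx$. (A8.54)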
Considering now $\tau_1, ..., \tau_n$ as a random sample of $\tau$ with $t_1, ..., t_n$ as observations, Eq. (A8.54) can be used to compute confidence limits $\hat\lambda_l$ and $\hat\lambda_u$ for the parameter $\lambda$. For given $\beta_1$, $\beta_2$, and $\gamma = 1-\beta_1-\beta_2$ ($0 < \beta_1 < 1-\beta_2 < 1$), this leads to
$\hat\lambda_l = (1-\varepsilon_2)\,\hat\lambda$ and $\hat\lambda_u = (1+\varepsilon_1)\,\hat\lambda$, (A8.55)
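with
$\hat\lambda = \frac{n}{t_1 + ... + t_n}$ (A8.56)
and $\varepsilon_1$, $\varepsilon_2$ given by
$\frac{1}{(n-1)!} \int_{n(1+\varepsilon_1)}^{\infty} x^{n-1} e^{-x}\,dx = \beta_1$ and $\frac{1}{(n-1)!} \int_0^{n(1-\varepsilon_2)} x^{n-1} e^{-x}\,dx = \beta_2$. (A8.57)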
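Using the definition of the chi-square distribution (Appendix A9.2), it follows that $1+\varepsilon_1 = \chi^2_{2n,\,1-\beta_1}/2n$ and $1-\varepsilon_2 = \chi^2_{2n,\,\beta_2}/2n$, and thus
$\hat\lambda_l = \frac{\chi^2_{2n,\,\beta_2}}{2(t_1 + ... + t_n)}$ and $\hat\lambda_u = \frac{\chi^2_{2n,\,1-\beta_1}}{2(t_1 + ... + t_n)}$. (A8.58)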
$\varepsilon_2 = 1$ or $\varepsilon_1 = \infty$ lead to one-sided confidence intervals $[0, \hat\lambda_u]$ or $[\hat\lambda_l, \infty)$. Figure A8.6 gives the graphical relationship between n, $\gamma$, and $\varepsilon$ for the case $\varepsilon_1 = \varepsilon_2 = \varepsilon$. The case considered by Eqs. (A8.53) to (A8.58) corresponds to the situation described in Example A8.2 (sampling plan with n elements without replacement, terminated at the nth failure), and differs statistically from that in Section A8.2.2.2.
Figure A8.6 Probability $\gamma$ that the interval $(1 \pm \varepsilon)\hat\lambda$ overlaps the true value of $\lambda$ for the case of a fixed number n of failures ($\hat\lambda = n/(t_1 + ... + t_n)$, $\Pr\{\tau \le t\} = 1 - e^{-\lambda t}$; the marked point refers to Example A8.7); ordinate $\gamma$, abscissa $\varepsilon$
Example A8.7
For the case considered by Eqs. (A8.53) to (A8.58), give for n = 50 and $\gamma = 0.9$ the two-sided confidence interval for the parameter $\lambda$ of an exponential distribution as a function of $\hat\lambda$.
Solution
From Figure A8.6, $\varepsilon = 0.24$, yielding the confidence interval $[0.76\,\hat\lambda, 1.24\,\hat\lambda]$.
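The graphical reading of Example A8.7 can be cross-checked numerically. The sketch below assumes $\varepsilon_1 = \varepsilon_2 = \varepsilon$ and uses the fact that $\lambda(\tau_1 + ... + \tau_n)$ is Erlang distributed of order n with unit rate, so that the event $(1-\varepsilon)\hat\lambda \le \lambda \le (1+\varepsilon)\hat\lambda$ with $\hat\lambda = n/(\tau_1 + ... + \tau_n)$ is the event $n(1-\varepsilon) \le \lambda(\tau_1 + ... + \tau_n) \le n(1+\varepsilon)$. Function names are illustrative.

```python
import math

def erlang_cdf(n: int, x: float) -> float:
    """Pr{sum of n independent unit-rate exponential times <= x}."""
    term = math.exp(-x)
    partial = term
    for i in range(1, n):
        term *= x / i          # builds x^i e^{-x} / i! iteratively
        partial += term
    return 1.0 - partial

def coverage(n: int, eps: float) -> float:
    """Probability that (1 - eps) * lam_hat <= lambda <= (1 + eps) * lam_hat."""
    return erlang_cdf(n, n * (1.0 + eps)) - erlang_cdf(n, n * (1.0 - eps))

gamma = coverage(n=50, eps=0.24)
print(round(gamma, 3))
```

For n = 50 and $\varepsilon = 0.24$ the computed probability is close to 0.9, in line with the value read from Figure A8.6.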
A8.2.2.4 Availability Estimation (Erlangian Failure-Free and Repair Times)
Considerations of Section A8.2.2.3 can be extended to estimate the availability of a repairable item (described by the alternating renewal process of Fig. 6.2) for the case of Erlangian distributed failure-free and/or repair times (Appendix A6.10.3), and in particular for the case of constant failure and repair rates (exponentially distributed failure-free and repair times). Consider a repairable item in continuous operation, new at t = 0 (Fig. 6.2), and assume constant failure and repair rates ($\lambda(x) = \lambda$, $\mu(x) = \mu$). For this case, point and average unavailability converge rapidly ($1 - PA_{S0}(t)$ & $1 - AA_{S0}(t)$ in Table 6.3) to the asymptotic and steady-state value given by
$\lambda/(\lambda+\mu)$ is a probabilistic value and has its statistical counterpart in DT/(UT + DT), where DT is the down (repair) time and UT = t - DT the up (operating) time observed in (0, t]. To simplify considerations, it will be assumed in the following $t \gg MTTR = 1/\mu$ (Table 6.3) and that at the time point t a repair is terminated and k failure-free and repair times have occurred (k = 1, 2, ...). Furthermore, $\lambda \ll \mu$, i.e.
$\overline{PA} = 1 - PA \approx \overline{PA}_a = \lambda/\mu$ (A8.59)
will be assumed here, yielding the counterpart DT/UT (relative error of magnitude $\overline{PA}$). Considering that at the time point t a repair is terminated, it holds that
where $t_i$ & $t_i'$ are the observed values of failure-free and repair times $\tau_i$ & $\tau_i'$, respectively. According to Eqs. (A6.102) - (A6.104), the quantity $2\lambda(\tau_1 + ... + \tau_k)$ has a $\chi^2$ distribution with $\nu = 2k$ degrees of freedom. The same holds for the repair times, i.e. for $2\mu(\tau_1' + ... + \tau_k')$. From this, it follows (Appendix A9.4) that the quantity
$\overline{PA}_a \cdot \frac{UT}{DT} = \frac{2\lambda(\tau_1 + ... + \tau_k)/2k}{2\mu(\tau_1' + ... + \tau_k')/2k}$ (A8.60)
is distributed according to a Fisher distribution (F) with $\nu_1 = \nu_2 = 2k$ degrees of freedom ($\overline{PA}_a$ is an unknown parameter, regarded here as a random variable)
Having observed for a repairable item described by Fig. 6.2 with constant failure rate $\lambda(x) = \lambda$ and repair rate $\mu(x) = \mu \gg \lambda$, an operating time $UT = t_1 + ... + t_k$ and a repair time $DT = t_1' + ... + t_k'$, the maximum likelihood estimate for $\overline{PA}_a = \lambda/\mu$ is
$\hat{\overline{PA}}_a = \widehat{(\lambda/\mu)} = DT/UT = (t_1' + ... + t_k')/(t_1 + ... + t_k)$, (A8.62)
an unbiased point estimate being (1 - 1/k) DT/UT, k > 1 (Example A8.10). With the same considerations as for Eq. (A8.54), Eq. (A8.61) yields (k = 1, 2, ...)
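and thus to the confidence limits $\hat{\overline{PA}}_{a_l} = (1-\varepsilon_2)\,\hat{\overline{PA}}_a$ and $\hat{\overline{PA}}_{a_u} = (1+\varepsilon_1)\,\hat{\overline{PA}}_a$, with $\hat{\overline{PA}}_a$ as in Eq. (A8.62) and $\varepsilon_1$, $\varepsilon_2$ related to the confidence level $\gamma = 1-\beta_1-\beta_2$ by
$\frac{(2k-1)!}{((k-1)!)^2} \int_{1+\varepsilon_1}^{\infty} \frac{x^{k-1}}{(1+x)^{2k}}\,dx = \beta_1$ and $\frac{(2k-1)!}{((k-1)!)^2} \int_0^{1-\varepsilon_2} \frac{x^{k-1}}{(1+x)^{2k}}\,dx = \beta_2$. (A8.64)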
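From the definition of the Fisher distribution (Appendix A9.4), it follows that $\varepsilon_1 = F_{2k,2k,1-\beta_1} - 1$ and $\varepsilon_2 = 1 - F_{2k,2k,\beta_2}$; and thus, using $F_{\nu_1,\nu_2,\beta_2} = 1/F_{\nu_2,\nu_1,1-\beta_2}$,
$\hat{\overline{PA}}_{a_l} = \hat{\overline{PA}}_a / F_{2k,2k,1-\beta_2}$ and $\hat{\overline{PA}}_{a_u} = \hat{\overline{PA}}_a \cdot F_{2k,2k,1-\beta_1}$, (A8.65)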
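where $F_{2k,2k,1-\beta_2}$ & $F_{2k,2k,1-\beta_1}$ are the $1-\beta_2$ & $1-\beta_1$ quantiles of the Fisher (F) distribution (Appendix A9.4, [A9.3 - A9.6]). A graphical visualization of the confidence interval $[\hat{\overline{PA}}_{a_l}, \hat{\overline{PA}}_{a_u}]$ is given in Fig. 7.5. One-sided confidence intervals are
$0 \le \overline{PA}_a \le \hat{\overline{PA}}_{a_u}$, with $\gamma = 1-\beta_1$, and $\hat{\overline{PA}}_{a_l} \le \overline{PA}_a < 1$, with $\gamma = 1-\beta_2$. (A8.66)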
Corresponding values for the availability can be obtained using $PA = 1 - \overline{PA}$. If failure-free and/or repair times are Erlangian distributed (Eq. (A6.102)) with $\beta_\lambda$ & $\beta_\mu$, $F_{2k,2k,1-\beta_2}$ and $F_{2k,2k,1-\beta_1}$ have to be replaced by $F_{2k\beta_\mu,\,2k\beta_\lambda,\,1-\beta_2}$ and $F_{2k\beta_\lambda,\,2k\beta_\mu,\,1-\beta_1}$ (for unchanged MTTF & MTTR, see Example A8.11). $\hat{\overline{PA}}_a = DT/UT$ remains valid. Results based only on the distribution of DT (Eq. (7.22)) are not free of parameters (Section 7.2.2.3).
Example A8.8
For the estimation of an availability PA, UT = 1750 h, DT = 35 h and k = 5 failures and repairs have been observed. Give for constant failure & repair rates the 90% lower limit of PA (Fig. 7.5, $\gamma = 0.8$).
Solution
From Eqs. (A8.65) & (A8.66) and Table A9.4a follows $\hat{\overline{PA}}_{a_u} = 2\% \cdot 2.32$ and thus PA > 95.3%. Supplementary result: Erlangian distributed repair times with $\beta_\mu = 3$ yields $\hat{\overline{PA}}_{a_u} = 2\% \cdot 1.82$.
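The arithmetic of Example A8.8 can be written out as a small sketch. It assumes that the upper limit of the unavailability is the point estimate DT/UT multiplied by the relevant F quantile, which is supplied by the caller from a table (2.32 is the value quoted in the example for k = 5 and a 90% one-sided level); the function name is hypothetical.

```python
def availability_lower_limit(ut_hours: float, dt_hours: float, f_quantile: float) -> float:
    """Lower limit of PA = 1 - (DT/UT) * F, with F read from an F-distribution table."""
    unavailability_estimate = dt_hours / ut_hours         # DT / UT
    unavailability_upper = unavailability_estimate * f_quantile
    return 1.0 - unavailability_upper

# Data of Example A8.8: UT = 1750 h, DT = 35 h, tabulated quantile 2.32
pa_lower = availability_lower_limit(1750.0, 35.0, 2.32)
print(f"{pa_lower:.3f}")
```

The result, slightly above 0.953, agrees with the bound PA > 95.3% stated in the solution. Replacing 2.32 by 1.82 gives the supplementary result for Erlangian repair times.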
A8.3
Testing Statistical Hypotheses
When testing a statistical hypothesis, the objective is to solve the following problem: From one's own experience, the nature of the problem, or simply as a basic hypothesis, a specific null hypothesis $H_0$ is formulated for the statistical properties of the observed random variable; sought is a rule which allows rejection or acceptance of $H_0$ on the basis of the statistically independent observations made from a sample of the random variable under consideration.
If R is the unknown reliability of an item, the following null hypotheses $H_0$ are possible:
1a) $H_0$: $R = R_0$
1b) $H_0$: $R > R_0$
1c) $H_0$: $R < R_0$.
To test whether the failure-free time of an item is distributed according to an exponential distribution $F_0(t) = 1 - e^{-\lambda t}$ with unknown $\lambda$, or $F_0(t) = 1 - e^{-\lambda_0 t}$ with known $\lambda_0$, the following null hypotheses $H_0$ can for instance be formulated:
2a) $H_0$: the distribution function is $F_0(t)$
2b) $H_0$: the distribution function is different from $F_0(t)$
2c) $H_0$: $\lambda = \lambda_0$, provided the distribution is exponential
2d) $H_0$: $\lambda < \lambda_0$, provided the distribution is exponential
2e) $H_0$: the distribution function is $1 - e^{-\lambda t}$, parameter $\lambda$ unknown.
It is usual to subdivide hypotheses into parametric (1a, 1b, 1c, 2c, 2d) and nonparametric ones (2a, 2b, and 2e). For each of these types, a distinction is also made between simple hypotheses (1a, 2a, 2c) and composite hypotheses (1b, 1c, 2b, 2d, 2e). When testing a hypothesis, two kinds of errors can occur (Table A8.2):
• type I error, i.e. the error of rejecting a true hypothesis $H_0$; the probability of this error is denoted by $\alpha$,
• type II error, i.e. the error of accepting a false hypothesis $H_0$; the probability of this error is denoted by $\beta$ (to compute $\beta$, an alternative hypothesis $H_1$ is necessary, $\beta$ is then the probability of accepting $H_0$ assuming $H_1$ is true).
If the sample space is divided into two complementary sets, A for acceptance and $\overline{A}$ for rejection, the type I and type II errors are given by
$\alpha = \Pr\{\text{sample in } \overline{A} \mid H_0 \text{ true}\}$,
$\beta = \Pr\{\text{sample in } A \mid H_0 \text{ false } (H_1 \text{ true})\}$.
Both kinds of error are possible and cannot be minimized simultaneously. Often $\alpha$
Table A8.2 Possible errors when testing a hypothesis

                             H0 is rejected               H0 is accepted
H0 is true                   false -> type I error (α)    correct
H0 is false (H1 is true)     correct                      false -> type II error (β)
is selected and a test is sought so that, for a given $H_1$, $\beta$ will be minimized. It can be shown that such a test always exists if $H_0$ and $H_1$ are simple hypotheses [A8.22]. For given alternative hypothesis $H_1$, $\beta$ can often be calculated and the quantity $1-\beta = \Pr\{\text{sample in } \overline{A} \mid H_1 \text{ true}\}$ is referred to as the power of the test.
The following sections consider some important procedures for quality control and reliability tests, see Chapter 7 for applications. Such procedures are basically obtained by investigating the distribution of a suitable quantity observed in the sample.
A8.3.1 Testing an Unknown Probability p
Let A be an event which can occur at every independent trial with the constant, unknown probability p. A rule (test plan) is sought which allows testing of the hypothesis
$H_0$: $p < p_0$ (A8.69)
against the alternative hypothesis
$H_1$: $p > p_1$ ($p_1 \ge p_0$). (A8.70)
The type I error should be nearly equal to (but not greater than) $\alpha$ for $p = p_0$. The type II error should be nearly equal to (but not greater than) $\beta$ for $p = p_1$. Such a situation often occurs in practical applications, in particular in:
• quality control, where p refers to the defective probability or fraction of defective items,
• reliability tests for a given fixed mission, where it is usual to set p = 1 - R (R = reliability).
In both cases, $\alpha$ is the producer's risk and $\beta$ the consumer's risk. The two most frequently used procedures for testing hypotheses defined by (A8.69) and (A8.70), with $p_1 > p_0$, are the simple two-sided sampling plan and the sequential test (one-sided sampling plans are considered in Appendix A8.3.1.3).
A8.3.1.1
Simple Two-sided Sampling Plan
The rule for the simple two-sided sampling plan (simple two-sided test) is:
1. For given $p_0$, $p_1 > p_0$, $\alpha$, and $\beta$ ($0 < \alpha < 1-\beta < 1$), compute the smallest integers c and n which satisfy
$\sum_{i=0}^{c} \binom{n}{i} p_0^i (1-p_0)^{n-i} \ge 1-\alpha$ (A8.71)
and
$\sum_{i=0}^{c} \binom{n}{i} p_1^i (1-p_1)^{n-i} \le \beta$. (A8.72)
2. Perform n independent trials (Bernoulli trials), determine the number k in which the event A (component defective for example) has occurred, and
• reject $H_0$: $p < p_0$, if k > c,
• accept $H_0$: $p < p_0$, if $k \le c$. (A8.73)
As in the case of Eqs. (A8.37) and (A8.38), the proof of the above rule is based on the monotonic property of $B_n(c,p) = \sum_{i=0}^{c} \binom{n}{i} p^i (1-p)^{n-i}$, see also Fig. A8.4. For known n, c, and p, $B_n(c,p)$ gives the probability of having up to c defectives in a sample of size n. Thus, assuming $H_0$ true ($p < p_0$), it follows that the probability of rejecting $H_0$ (i.e. the probability of having more than c defectives in a sample of size n) is smaller than $\alpha$
$\Pr\{\text{rejection of } H_0 \mid H_0 \text{ true}\} = \sum_{i=c+1}^{n} \binom{n}{i} p^i (1-p)^{n-i} \Big|_{p < p_0} < \alpha$.
Similarly, if $H_1$ is true ($p > p_1$), it follows that the probability of accepting $H_0$ is smaller than $\beta$
$\Pr\{\text{acceptance of } H_0 \mid H_1 \text{ true}\} = \sum_{i=0}^{c} \binom{n}{i} p^i (1-p)^{n-i} \Big|_{p > p_1} < \beta$.
The assumptions made with Eqs. (A8.71) and (A8.72) are thus satisfied. As shown by the above inequalities, the type I error and the type II error are in this case $< \alpha$ for $p < p_0$ and $< \beta$ for $p > p_1$, respectively. Figure A8.7 shows the results for $p_0$ = 1%, $p_1$ = 2%, and $\alpha = \beta \approx 20\%$. The curve of Fig. A8.7 is known as the operating characteristic (OC). If $p_0$ and $p_1$ are small (up to a few %) or close to 1, the Poisson approximation (Eq. (A6.129))
$\binom{n}{i} p^i (1-p)^{n-i} \approx \frac{(np)^i}{i!}\, e^{-np}$
is generally used.
Figure A8.7 Operating characteristic (probability of acceptance $\Pr\{\text{Acceptance} \mid p\} = \sum_{i=0}^{c} \binom{n}{i} p^i (1-p)^{n-i}$) as a function of p for fixed n and c ($p_0$ = 1%, $p_1$ = 2%, $\alpha = \beta = 0.185$, n = 462, c = 6)
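The operating characteristic of a plan (n, c) is simply the binomial probability of observing at most c events, evaluated as a function of p. A minimal sketch follows, using the plan of Fig. A8.7; the function name is illustrative.

```python
import math

def acceptance_probability(n: int, c: int, p: float) -> float:
    """Pr{at most c events in n Bernoulli trials with probability p}."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(c + 1))

# Plan of Fig. A8.7: n = 462, c = 6
for p in (0.005, 0.01, 0.015, 0.02, 0.03):
    print(f"p = {p:.3f}  Pr(acceptance) = {acceptance_probability(462, 6, p):.3f}")
```

At $p = p_0$ = 1% the acceptance probability is about $1-\alpha$ and at $p = p_1$ = 2% it is about $\beta$, with $\alpha$ and $\beta$ both near the value 0.185 quoted in the caption.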
A8.3.1.2
Sequential Test
Assume that in a two-sided sampling plan with n = 50 and c = 2, a 3rd defect, i.e. k = 3, occurs at the 12th trial. Since k > c, the hypothesis $H_0$ will be rejected as per procedure (A8.73), independent of how often the event A will occur during the remaining 38 trials. This example brings up the question of whether a plan can be established for testing $H_0$ in which no unnecessary trials (the remaining 38 in the above example) have to be performed. To solve this problem, A. Wald proposed the sequential test [A8.32]. For this test, one element after another is taken from the lot and tested. Depending upon the actual frequency of the observed event, the decision is made to either
• reject $H_0$,
• accept $H_0$,
• perform a further trial.
The testing procedure can be described as follows (Fig. A8.8):
In a system of Cartesian coordinates, the number n of trials is recorded on the abscissa and the number k of trials in which the event A occurred on the ordinate; the test is stopped with acceptance or rejection as soon as the resulting staircase curve k = f(n) crosses the acceptance or rejection line given in the Cartesian coordinates for specified values of $p_0$, $p_1$, $\alpha$, and $\beta$.
The acceptance and rejection lines can be determined from:
Acceptance line: $k = a\,n - b_1$,
Rejection line: $k = a\,n + b_2$,
with
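$a = \frac{\ln\frac{1-p_0}{1-p_1}}{\ln\frac{p_1}{p_0} + \ln\frac{1-p_0}{1-p_1}}$, $b_1 = \frac{\ln\frac{1-\alpha}{\beta}}{\ln\frac{p_1}{p_0} + \ln\frac{1-p_0}{1-p_1}}$, $b_2 = \frac{\ln\frac{1-\beta}{\alpha}}{\ln\frac{p_1}{p_0} + \ln\frac{1-p_0}{1-p_1}}$. (A8.76)
Figure A8.8 Sequential test for $p_0$ = 1%, $p_1$ = 2%, and $\alpha = \beta$ = 20% (ordinate k, abscissa n)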
Figure A8.8 shows acceptance and rejection lines for po= 1%, pl= 2%, a = ß =20%. Practical remarks related to sequential tests are given in Sections 7.1.2.2 and 7.2.3.2.
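The sequential procedure can be sketched in a few lines. The sketch assumes the standard Wald form of the line parameters, i.e. a, $b_1$, $b_2$ all share the denominator $\ln(p_1/p_0) + \ln((1-p_0)/(1-p_1))$, with numerators $\ln((1-p_0)/(1-p_1))$, $\ln((1-\alpha)/\beta)$ and $\ln((1-\beta)/\alpha)$ respectively, and it treats touching a line as crossing it. Function names are hypothetical.

```python
import math

def wald_lines(p0: float, p1: float, alpha: float, beta: float):
    """Slope a and intercepts b1, b2 of the acceptance and rejection lines."""
    denom = math.log(p1 / p0) + math.log((1.0 - p0) / (1.0 - p1))
    a = math.log((1.0 - p0) / (1.0 - p1)) / denom
    b1 = math.log((1.0 - alpha) / beta) / denom
    b2 = math.log((1.0 - beta) / alpha) / denom
    return a, b1, b2

def sequential_test(outcomes, p0: float, p1: float, alpha: float, beta: float):
    """Follow the staircase k = f(n) and stop at the first line that is reached."""
    a, b1, b2 = wald_lines(p0, p1, alpha, beta)
    k = 0
    for n, event_occurred in enumerate(outcomes, start=1):
        k += 1 if event_occurred else 0
        if k >= a * n + b2:
            return "reject", n
        if k <= a * n - b1:
            return "accept", n
    return "continue", len(outcomes)

# Values of Fig. A8.8: p0 = 1%, p1 = 2%, alpha = beta = 20%
print(wald_lines(0.01, 0.02, 0.2, 0.2))
print(sequential_test([0] * 200, 0.01, 0.02, 0.2, 0.2))
```

With these values the slope is about 0.0144 and both intercepts about 1.97, so a run without any defect leads to acceptance only after roughly 137 trials.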
A8.3.1.3
Simple One-sided Sampling Plan
In many practical applications only $p_0$ and $\alpha$ or $p_1$ and $\beta$ are specified, i.e. one wants to test $H_0$: $p < p_0$ against $H_1$: $p > p_0$ with type I error $\alpha$, or $H_0$: $p < p_1$ against $H_1$: $p > p_1$ with type II error $\beta$. In these cases, only Eq. (A8.71) or Eq. (A8.72) can be used and the test plan is a pair (c, n) for each selected value of c = 0, 1, ... and calculated value of n. Such plans are termed one-sided sampling plans.
Setting $p_1 = p_0$ in the relationship (A8.70) or in other words, testing
$H_0$: $p < p_0$ (A8.77)
against
$H_1$: $p > p_0$ (A8.78)
with type I error $\alpha$, i.e. using one (c, n) pair (for c = 0, 1, ...) from Eq. (A8.71) and the test procedure (A8.73), the type II error can become very large and reach the value $1-\alpha$ for $p = p_0$. Depending upon the value selected for c = 0, 1, ... and that calculated for n (the largest integer n which satisfies Eq. (A8.71)), different plans (pairs of (c, n)) are possible. Each of these plans yields different type II errors. Figure A8.9 shows this for some values of c (the type II error is the ordinate of the
Figure A8.9 Operating characteristics for $p_0$ = 1%, $\alpha$ = 0.1 and c = 0 (n = 10), c = 1 (n = 53), c = 2 (n = 110), c = 3 (n = 174), and c = ∞
operating characteristic for $p > p_0$). In practical applications, it is common usage to define
$p_0$ = AQL, (A8.79)
where AQL stands for Acceptable Quality Level. The above considerations show that with the choice of only $p_0$ and $\alpha$ (instead of $p_0$, $p_1$, $\alpha$, and $\beta$) the producer can realize an advantage, particularly if small values of c are used.
On the other hand, setting $p_0 = p_1$ in the relationship (A8.69), or testing
$H_0$: $p < p_1$ (A8.80)
against
$H_1$: $p > p_1$ (A8.81)
with type II error $\beta$, i.e. using one (c, n) pair (for c = 0, 1, ...) from Eq. (A8.72) and the test procedure (A8.73), the type I error can become very large and reach the value $1-\beta$ for $p = p_1$. Depending upon the value selected for c = 0, 1, ... and that calculated for n (the smallest integer n which satisfies Eq. (A8.72)), different plans (pairs of (c, n)) are possible. Considerations here are similar to those of the previous case, where only $p_0$ and $\alpha$ were selected. For small values of c the consumer can realize an advantage. In practical applications, it is common usage to define
$p_1$ = LTPD, (A8.82)
where LTPD stands for Lot Tolerance Percent Defective. Further remarks on one-sided sampling plans are in Section 7.1.3.
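A one-sided plan for given $p_0$ and $\alpha$ can be computed directly. The sketch below assumes the condition that the probability of at most c events in n trials at $p = p_0$ is at least $1-\alpha$, and searches for each c the largest n that still satisfies it (the binomial sum decreases as n grows). The function names are illustrative.

```python
import math

def binom_cdf(n: int, c: int, p: float) -> float:
    """Pr{at most c events in n Bernoulli trials with probability p}."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(c + 1))

def largest_n(c: int, p0: float, alpha: float) -> int:
    """Largest sample size n with Pr{k <= c | p0} >= 1 - alpha."""
    n = c                        # for n <= c the condition holds trivially
    while binom_cdf(n + 1, c, p0) >= 1.0 - alpha:
        n += 1
    return n

# Values of Fig. A8.9: p0 = 1%, alpha = 0.1
for c in range(4):
    print(c, largest_n(c, 0.01, 0.1))
```

For c = 0, 1, 2 this reproduces the sample sizes 10, 53, and 110 of Fig. A8.9. For larger c the exact binomial evaluation can differ by one trial from values obtained with the Poisson approximation.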
A8.3.1.4 Availability Demonstration (Erlangian Failure-Free and Repair Times)
Considerations of Section A8.2.2.4 on availability estimation can be extended to demonstrate the availability of a repairable item (described by the alternating renewal process of Fig. 6.2) for the case of Erlangian distributed failure-free and/or repair times (Appendix A6.10.3), and in particular for the case of constant failure and repair rates (exponentially distributed failure-free and repair times). Consider a repairable item in continuous operation, new at t = 0 (Fig. 6.2), and assume constant failure and repair rates ($\lambda(x) = \lambda$, $\mu(x) = \mu$). For this case, point and average unavailability converge rapidly ($1 - PA_{S0}(t)$ & $1 - AA_{S0}(t)$ in Table 6.3) to the asymptotic & steady-state value given by
$\overline{PA} = 1 - PA = \lambda/(\lambda+\mu)$. (A8.83)
$\lambda/(\lambda+\mu)$ is a probabilistic value of the asymptotic & steady-state unavailability and has its statistical counterpart in DT/(UT + DT), where DT is the down (repair) time and UT the up (operating) time observed in (0, t]. From Eq. (A8.83) it follows that
$PA/\overline{PA} = \mu/\lambda$.
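As in Appendix A8.2.2.4, it will be assumed that at the time point t a repair is terminated, and exactly n failure-free and n repair times have occurred. However, for a demonstration test $\overline{PA}$ will be specified (Eqs. (A8.88) - (A8.89)) and DT/UT observed. Similar as for Eq. (A8.60), the quantity
$\frac{PA}{\overline{PA}} \cdot \frac{DT}{UT} = \frac{2\mu(\tau_1' + ... + \tau_n')/2n}{2\lambda(\tau_1 + ... + \tau_n)/2n}$ (A8.84)
is distributed according to a Fisher distribution (F-distribution) with $\nu_1 = \nu_2 = 2n$ degrees of freedom (Appendix A9.4). From this (with DT/UT as a random variable),
$\Pr\{\frac{PA}{\overline{PA}} \cdot \frac{DT}{UT} \le x\} = \frac{(2n-1)!}{((n-1)!)^2} \int_0^x \frac{y^{n-1}}{(1+y)^{2n}}\,dy$. (A8.85)
Setting
$\delta = x \cdot \overline{PA}/PA$, (A8.86)
Eq. (A8.85) yields
$\Pr\{\frac{DT}{UT} \le \delta\} = \frac{(2n-1)!}{((n-1)!)^2} \int_0^{\delta \cdot PA/\overline{PA}} \frac{y^{n-1}}{(1+y)^{2n}}\,dy$. (A8.87)
Considering $DT/UT = (\tau_1' + ... + \tau_n')/(\tau_1 + ... + \tau_n)$, i.e. the sum of n repair times divided by the sum of the corresponding n failure-free times, a rule for testing
$H_0$: $\overline{PA} < \overline{PA}_0$ (A8.88)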
against the alternative hypothesis
$H_1$: $\overline{PA} > \overline{PA}_1$ ($\overline{PA}_1 > \overline{PA}_0$) (A8.89)
can be established (as in Appendix A8.3.1) for given type I error (producer risk) nearly equal to (but not greater than) $\alpha$ for $\overline{PA} = \overline{PA}_0$ and type II error (consumer risk) nearly equal to (but not greater than) $\beta$ for $\overline{PA} = \overline{PA}_1$
$\Pr\{\frac{DT}{UT} > \delta \mid \overline{PA} = \overline{PA}_0\} \le \alpha$ and $\Pr\{\frac{DT}{UT} \le \delta \mid \overline{PA} = \overline{PA}_1\} \le \beta$. (A8.90)
From Eqs. (A8.87) & (A8.90) it follows that (considering the definition of the Fisher distribution, Appendix A9.4), $\delta \cdot PA_0/\overline{PA}_0 = F_{2n,2n,1-\alpha}$ and $\delta \cdot PA_1/\overline{PA}_1 = F_{2n,2n,\beta}$. Eliminating $\delta$, using $F_{\nu_1,\nu_2,\beta} = 1/F_{\nu_2,\nu_1,1-\beta}$, and considering the conditions (A8.90), the rule for testing $H_0$: $\overline{PA} = \overline{PA}_0$ against $H_1$: $\overline{PA} = \overline{PA}_1$ follows as (see also [A8.28, A2.6 (IEC 61070)]):
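1. For given $\overline{PA}_0$, $\overline{PA}_1$, $\alpha$, and $\beta$ ($0 < \alpha < 1-\beta < 1$), find the smallest integer n (1, 2, ...) which satisfies
$F_{2n,2n,1-\alpha} \cdot F_{2n,2n,1-\beta} \le \frac{\overline{PA}_1}{\overline{PA}_0} \cdot \frac{PA_0}{PA_1} = \frac{(1-PA_1)\,PA_0}{(1-PA_0)\,PA_1}$, (A8.91)
where $F_{2n,2n,1-\alpha}$ and $F_{2n,2n,1-\beta}$ are the $1-\alpha$ & $1-\beta$ quantiles of the F-distribution (Appendix A9.4, [A9.2 - A9.6]), and compute the limiting value
$\delta = F_{2n,2n,1-\alpha}\,\overline{PA}_0/PA_0 = F_{2n,2n,1-\alpha}\,(1-PA_0)/PA_0$. (A8.92)
2. Observe n failure-free times $t_1, ..., t_n$ and the corresponding repair times $t_1', ..., t_n'$, and
• reject $H_0$: $\overline{PA} < \overline{PA}_0$, if $(t_1' + ... + t_n')/(t_1 + ... + t_n) > \delta$,
• accept $H_0$: $\overline{PA} < \overline{PA}_0$, if $(t_1' + ... + t_n')/(t_1 + ... + t_n) \le \delta$. (A8.93)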
Corresponding values for the availability can be obtained using $PA = 1 - \overline{PA}$. If failure-free and/or repair times are Erlangian distributed (Eq. (A6.102)) with $\beta_\lambda$ & $\beta_\mu$, $F_{2n,2n,1-\alpha}$ and $F_{2n,2n,1-\beta}$ have to be replaced by $F_{2n\beta_\mu,\,2n\beta_\lambda,\,1-\alpha}$ and $F_{2n\beta_\lambda,\,2n\beta_\mu,\,1-\beta}$ (for unchanged MTTF & MTTR, see Example A8.11). Results based only on the distribution of DT (Eq. (7.22)) are not free of parameters (Section 7.2.2.3).
Example A8.9
For the demonstration of an availability PA, customer and producer agree on the following parameters: $\overline{PA}_0$ = 1%, $\overline{PA}_1$ = 6%, $\alpha = \beta$ = 10%. Give for the case of constant failure and repair rates ($\lambda(x) = \lambda$ and $\mu(x) = \mu \gg \lambda$) the number n of failures and repairs that have to be observed and the acceptance limit $\delta$ for $(t_1' + ... + t_n')/(t_1 + ... + t_n)$.
Solution
Eq. (A8.91) & Table A9.4a yield n = 5 ($(F_{10,10,0.9})^2 = 2.32^2 < 6 \cdot 99/(1 \cdot 94) < 2.59^2 = (F_{8,8,0.9})^2$). $\delta = F_{10,10,0.9}\,\overline{PA}_0/PA_0 = 2.32 \cdot 1/99 = 0.0234$ follows from Eq. (A8.92), see also Tab. 7.2. Supplementary result: Erlangian distributed repair times with $\beta_\mu = 3$ yields n = 3, $\delta$ = 0.0288 ($2.85 \cdot 2.13 < 6.32$).
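The search for n in Example A8.9 can be sketched as follows. The sketch assumes the condition $F_{2n,2n,1-\alpha} \cdot F_{2n,2n,1-\beta} \le (\overline{PA}_1\,PA_0)/(\overline{PA}_0\,PA_1)$ and the limit $\delta = F_{2n,2n,1-\alpha}\,\overline{PA}_0/PA_0$. The F quantiles are not computed but passed in as a small table; only the two values quoted in the example (2.59 for n = 4 and 2.32 for n = 5) are used, and the function name is hypothetical.

```python
def demonstration_plan(pa0_bar: float, pa1_bar: float, f_quantiles: dict):
    """Smallest tabulated n meeting the F-quantile condition, with its limit delta.

    f_quantiles maps n to (F_{2n,2n,1-alpha}, F_{2n,2n,1-beta}) taken from a table.
    """
    ratio = (pa1_bar * (1.0 - pa0_bar)) / (pa0_bar * (1.0 - pa1_bar))
    for n in sorted(f_quantiles):
        f_alpha, f_beta = f_quantiles[n]
        if f_alpha * f_beta <= ratio:
            delta = f_alpha * pa0_bar / (1.0 - pa0_bar)
            return n, delta
    return None  # no tabulated n satisfies the condition

# Quantiles quoted in Example A8.9 (alpha = beta = 10%)
table = {4: (2.59, 2.59), 5: (2.32, 2.32)}
print(demonstration_plan(0.01, 0.06, table))
```

With these inputs the sketch returns n = 5 and a limit of about 0.0234, as in the solution of the example.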
Example A8.10
Give an unbiased estimate for $\overline{PA}_a = \lambda/\mu$.
Solution
From Eq. (A8.61) it follows that
$\Pr\{\frac{\lambda}{\mu} \cdot \frac{UT}{DT} \le x\} = \frac{(2k-1)!}{((k-1)!)^2} \int_0^x \frac{y^{k-1}}{(1+y)^{2k}}\,dy$.
The density of UT/DT for x = observed UT/DT is the likelihood function for the estimation of $\lambda/\mu$. From this, $\widehat{(\lambda/\mu)} = DT/UT$ (Eq. (A8.25)). Considering now $\lambda/\mu$ as a random variable with distribution function as per Eq. (A8.61) for given UT/DT, it follows that (Table A9.4)
$E[\lambda/\mu] = \frac{DT}{UT} \cdot \frac{k}{k-1}$, k > 1.
$\widehat{(\lambda/\mu)} = DT/UT$ is thus biased, unbiased is (1 - 1/k) DT/UT.
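Example A8.11
Give the degrees of freedom of the F-distribution for the case of Erlangian distributed failure-free and repair times with parameters $\lambda^*$, $\beta_\lambda$ and $\mu^*$, $\beta_\mu$, respectively (with $\lambda^* = \lambda\beta_\lambda$ and $\mu^* = \mu\beta_\mu$ because of the unchanged MTTF and MTTR).
Solution
Let $\tau_1 + ... + \tau_n$ be the exponentially distributed failure-free times with mean MTTF = $1/\lambda$. If the actual failure-free times are Erlangian distributed with parameters $\lambda^*$, $\beta_\lambda$ and mean MTTF = $\beta_\lambda/\lambda^* = 1/\lambda$ (Appendix A6.10.3, Table A6.1), the quantity
$2\lambda^*(\tau_{11} + \tau_{12} + ... + \tau_{1\beta_\lambda} + ... + \tau_{n1} + \tau_{n2} + ... + \tau_{n\beta_\lambda})$,
corresponding to the sum of n Erlangian distributed failure-free times, has a $\chi^2$ distribution with $\nu = 2n\beta_\lambda$ degrees of freedom (Eq. (A6.104)). The same holds for the repair times $\tau_i'$. Thus, the quantity
$\frac{PA}{\overline{PA}} \cdot \frac{DT}{UT} = \frac{2\mu^*(\tau_{11}' + \tau_{12}' + ... + \tau_{1\beta_\mu}' + ... + \tau_{n1}' + \tau_{n2}' + ... + \tau_{n\beta_\mu}')/2n\beta_\mu}{2\lambda^*(\tau_{11} + \tau_{12} + ... + \tau_{1\beta_\lambda} + ... + \tau_{n1} + \tau_{n2} + ... + \tau_{n\beta_\lambda})/2n\beta_\lambda}$,
obtained from Eq. (A8.84) by considering $\lambda = \lambda^*/\beta_\lambda$ and $\mu = \mu^*/\beta_\mu$, has an F-distribution with $\nu_1 = 2n\beta_\mu$ and $\nu_2 = 2n\beta_\lambda$ degrees of freedom (Appendix A9.4).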
A8.3.2 Goodness-of-fit Tests for Completely Specified $F_0(t)$
Goodness-of-fit tests have the purpose to verify agreement of observed data with a postulated (completely specified or only partially known) model. A typical example is as follows: Given $t_1, ..., t_n$ as n (stochastically) independent observations of a random variable $\tau$, a rule is sought to test the null hypothesis
$H_0$: the distribution function of $\tau$ is $F_0(t)$, (A8.94)
against the alternative hypothesis
$H_1$: the distribution function of $\tau$ is not $F_0(t)$. (A8.95)
$F_0(t)$ can be completely defined (as in this section) or depend on some unknown parameters which must be estimated from the observed data (as in the next section). In general, less can be said about the risk of accepting a false hypothesis $H_0$ (to compute the type II error $\beta$, a specific alternative hypothesis $H_1$ must be assumed). For some distribution functions used in reliability theory, particular procedures have been developed, often with different alternative hypotheses $H_1$ and investigation of the corresponding test power, see e.g. [A8.1, A8.9, A8.23]. Among the distribution-free procedures, the Kolmogorov-Smirnov, Cramér - von Mises, and chi-square ($\chi^2$) tests are frequently used in practical applications to solve the goodness-of-fit problem given by Eqs. (A8.94) & (A8.95). These tests are based upon comparison of the empirical distribution function (EDF) $\hat F_n(t)$, defined by Eq. (A8.1), with a postulated distribution function $F_0(t)$.
1. The Kolmogorov-Smirnov test uses the (supremum) statistic
$D_n = \sup_{-\infty < t < \infty} |\hat F_n(t) - F_0(t)|$ (A8.96)
introduced in Appendix A8.1.1. A. N. Kolmogorov showed [A8.20] that if $F_0(t)$ is continuous, the distribution of $D_n$ under the hypothesis $H_0$ is independent of $F_0(t)$. For a given type I error $\alpha$, the hypothesis $H_0$ must be rejected for
$D_n > y_{1-\alpha}$, (A8.97)
where $y_{1-\alpha}$ is defined by
$\Pr\{D_n > y_{1-\alpha} \mid H_0 \text{ is true}\} = \alpha$. (A8.98)
Values for $y_{1-\alpha}$ are given in Tables A8.1 and A9.5. Figure A8.10 illustrates the Kolmogorov-Smirnov test with hypothesis $H_0$ not rejected. Because of its graphical visualization, in particular when probability charts are used (Appendix A8.1.3, Section 7.5), the Kolmogorov-Smirnov test is often used in reliability data analysis.
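2. The Cramér - von Mises test uses the (quadratic) statistic
$W_n^2 = n \int_{-\infty}^{\infty} [\hat F_n(t) - F_0(t)]^2\, dF_0(t)$. (A8.99)
As in the case of the $D_n$ statistic, for $F_0(t)$ continuous the distribution of $W_n^2$ is independent of $F_0(t)$ and tabulated (see for instance [A9.5]). The Cramér - von Mises statistic belongs to the so-called quadratic statistics defined by
$Q_n = n \int_{-\infty}^{\infty} [\hat F_n(t) - F_0(t)]^2\, \psi(t)\, dF_0(t)$, (A8.100)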
Figure A8.10 Kolmogorov-Smirnov test (n = 20, $\alpha$ = 20%)
where $\psi(t)$ is a suitable weight function. $\psi(t) = 1$ yields the $W_n^2$ statistic and $\psi(t) = [F_0(t)(1 - F_0(t))]^{-1}$ yields the Anderson - Darling statistic $A_n^2$. Using the transformation $z_{(i)} = F_0(t_{(i)})$, calculation of $W_n^2$ and in particular of $A_n^2$ becomes easy, see e.g. [A8.10]. This transformation can also be used for the Kolmogorov-Smirnov test, although here no change occurs in $D_n$.
3. The chi-square ($\chi^2$) goodness-of-fit test starts from a selected partition $(a_1, a_2], (a_2, a_3], ..., (a_k, a_{k+1}]$ of the set of possible values of $\tau$ and uses the statistic
$X_n^2 = \sum_{i=1}^{k} \frac{(k_i - np_i)^2}{np_i} = \sum_{i=1}^{k} \frac{k_i^2}{np_i} - n$, (A8.101)
where $k_i$ is the number of observations (realizations of $\tau$) in $(a_i, a_{i+1}]$ and
$np_i = n\,(F_0(a_{i+1}) - F_0(a_i))$ (A8.103)
is the expected number of observations in $(a_i, a_{i+1}]$ (obviously $k_1 + ... + k_k = n$ and $p_1 + ... + p_k = 1$). Under the hypothesis $H_0$, K. Pearson [A8.27] has shown that the asymptotic distribution of $X_n^2$ for $n \to \infty$ is a $\chi^2$ distribution with k - 1 degrees of freedom. Thus for given type I error $\alpha$,
$\lim_{n \to \infty} \Pr\{X_n^2 > \chi^2_{k-1,\,1-\alpha} \mid H_0 \text{ true}\} = \alpha$ (A8.104)
holds, and the hypothesis $H_0$ must be rejected if
$X_n^2 > \chi^2_{k-1,\,1-\alpha}$.
$\chi^2_{k-1,\,1-\alpha}$ is the $(1-\alpha)$ quantile of the $\chi^2$ distribution with k - 1 degrees of freedom (Table A9.3). The classes $(a_1, a_2], (a_2, a_3], ..., (a_k, a_{k+1}]$ are to be chosen before the test is performed, in such a way that all $p_i$ are approximately equal. Convergence is generally good, even for relatively small values of n ($np_i \ge 5$). Thus, by selecting the classes $(a_1, a_2], (a_2, a_3], ..., (a_k, a_{k+1}]$ one should take care that all $np_i$ (Eq. (A8.103)) are almost equal and $\ge 5$.
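The supremum statistic of the Kolmogorov-Smirnov test described above can be evaluated from the ordered observations, since the largest deviation between the staircase EDF and a continuous $F_0(t)$ occurs at a jump point. A minimal sketch follows; the function name and the small data set are illustrative only, and the critical value $y_{1-\alpha}$ has to be taken from a table.

```python
def ks_statistic(observations, f0) -> float:
    """Supremum distance between the empirical distribution function and f0."""
    ordered = sorted(observations)
    n = len(ordered)
    d_n = 0.0
    for i, t in enumerate(ordered, start=1):
        z = f0(t)
        # The EDF jumps from (i - 1)/n to i/n at t, so both sides are checked.
        d_n = max(d_n, i / n - z, z - (i - 1) / n)
    return d_n

# Hypothetical data tested against a uniform distribution on (0, 1)
d_n = ks_statistic([0.1, 0.4, 0.7], lambda t: t)
print(round(d_n, 3))   # H0 is rejected if d_n exceeds the tabulated y_{1-alpha}
```

The same function applies to any completely specified continuous $F_0(t)$, for instance after the transformation $z_{(i)} = F_0(t_{(i)})$ mentioned above.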
Example A8.12 shows an application of the chi-square test. When in a goodness-of-fit test the deviation between $\hat F_n(t)$ and $F_0(t)$ seems abnormally small, a verification against superconform (superuniform if the transformation $z_{(i)} = F_0(t_{(i)})$ is used) can become necessary. Tabulated values for the lower limit $l_{1-\alpha}$ for $D_n$ are e.g. in [A8.1] (for instance, $\alpha = 0.1 \to l_{1-\alpha} \approx 0.57/\sqrt{n}$).
Example A8.12
Accelerated life testing of a wet Al electrolytic capacitor leads to the following 13 ordered observations of lifetime: 59, 71, 153, 235, 347, 589, 837, 913, 1185, 1273, 1399, 1713, and 2567 h. Using the chi-square test and the 4 classes (0, 200], (200, 600], (600, 1200], (1200, ∞), verify at the level $\alpha = 0.1$ (i.e. with first kind error $\alpha = 0.1$) whether or not the failure-free time $\tau$ of the capacitors is distributed according to the Weibull distribution $F_0(t) = \Pr\{\tau \le t\} = 1 - e^{-(10^{-3} t)^{1.2}}$ (hypothesis $H_0$: $F_0(t) = 1 - e^{-(10^{-3} t)^{1.2}}$).
Solution
The given classes yield numbers of observations $k_1 = 3$, $k_2 = 3$, $k_3 = 3$, and $k_4 = 4$. The numbers of expected observations in each class are, according to Eq. (A8.103), $np_1 = 1.754$, $np_2 = 3.684$, $np_3 = 3.817$, and $np_4 = 3.745$. From Eq. (A8.101) it follows that $X_{13}^2 = 1.204$ and from Table A9.2, $\chi^2_{3,\,0.9} = 6.251$. $H_0$: $F_0(t) = 1 - e^{-(10^{-3} t)^{1.2}}$ can be accepted since $X_n^2 < \chi^2_{k-1,\,1-\alpha}$ (in agreement with Example 7.15).
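The numbers of Example A8.12 can be retraced with a short sketch. It assumes the statistic $\sum_i (k_i - np_i)^2/(np_i)$ with $np_i = n\,(F_0(a_{i+1}) - F_0(a_i))$ and the Weibull hypothesis $F_0(t) = 1 - e^{-(10^{-3} t)^{1.2}}$; the critical value 6.251 is the tabulated quantile quoted in the example, and the function names are illustrative.

```python
import math

def weibull_cdf(t: float, lam: float = 1e-3, beta: float = 1.2) -> float:
    """Postulated distribution function F0(t) = 1 - exp(-(lam * t)^beta)."""
    return 1.0 - math.exp(-((lam * t) ** beta))

def chi_square_statistic(observations, edges, cdf):
    """Class counts, expected counts n*p_i and the chi-square statistic."""
    n = len(observations)
    counts, expected = [], []
    for lower, upper in zip(edges[:-1], edges[1:]):
        counts.append(sum(1 for t in observations if lower < t <= upper))
        p_upper = 1.0 if math.isinf(upper) else cdf(upper)
        expected.append(n * (p_upper - cdf(lower)))
    statistic = sum((k - e) ** 2 / e for k, e in zip(counts, expected))
    return counts, expected, statistic

lifetimes = [59, 71, 153, 235, 347, 589, 837, 913, 1185, 1273, 1399, 1713, 2567]
edges = [0.0, 200.0, 600.0, 1200.0, math.inf]
counts, expected, statistic = chi_square_statistic(lifetimes, edges, weibull_cdf)
print(counts, [round(e, 3) for e in expected], round(statistic, 3))
print("reject H0" if statistic > 6.251 else "H0 not rejected")
```

The counts (3, 3, 3, 4), the expected numbers, and the statistic of about 1.204 agree with the values given in the solution, so $H_0$ is not rejected at $\alpha = 0.1$.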
A8.3.3 Goodness-of-fit Tests for a Distribution $F_0(t)$ with Unknown Parameters
The Kolmogorov-Smirnov test and the tests based on the quadratic statistics can be used with some modification when the underlying distribution function $F_0(t)$ is not completely known (unknown parameters). The distribution of the involved statistic $D_n$, $W_n^2$, $A_n^2$ must be calculated (often using Monte Carlo simulation) for each type of distribution and can depend on the true values of the parameters [A8.1]. For instance, in the case of an exponential distribution $F_0(t, \lambda) = 1 - e^{-\lambda t}$ with parameter $\lambda$ estimated as per Eq. (A8.28), $\hat\lambda = n/(t_1 + ... + t_n)$, the values of $y_{1-\alpha}$ for the Kolmogorov-Smirnov test have to be modified from those given in Table A8.1, e.g. from $y_{1-\alpha} \approx 1.36/\sqrt{n}$ for $\alpha = 0.05$ and $y_{1-\alpha} \approx 1.22/\sqrt{n}$ for $\alpha = 0.1$ to [A8.1]
Also a modification of D, in DA= (D,, - 0.2 / n ) ( l+ 0.26 / & + 0.5 / n ) is recommended rA8.11. A heuristic procedure is to use half of the sample (randomly selected) to estimate the parameters and continue with the whole sample and the basic procedure given in Appendix A8.3.2 [A8.11 (p. 59), A8.311. The chi-square ( X * ) test offers a more general approach. Let Fo(t,B1,...,B,) be the assumed distribution function, known up to the parameters e l , ..., 8,. If the unknown parameters 01, ..., 8, are estimated according to the maximum likelihood method on the basis of the observed frequencies ki using the multinomial distribution (Eq. (A6.124)), i.e. from the following system of r algebraic equations (Example A8.13)
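$$\sum_{i=1}^{k} \frac{k_i}{p_i(\theta_1, \ldots, \theta_r)} \, \frac{\partial p_i(\theta_1, \ldots, \theta_r)}{\partial \theta_j} \bigg|_{\theta_j = \hat\theta_j} = 0, \qquad j = 1, \ldots, r, \qquad \text{(A8.107)}$$
with $p_1 + \ldots + p_k = 1$ and $k_1 + \ldots + k_k = n$, and if $\partial p_i / \partial \theta_j$ and $\partial^2 p_i / \partial \theta_j \, \partial \theta_m$ exist ($i = 1, \ldots, k$; $j, m = 1, \ldots, r < k - 1$) and the matrix with elements $\partial p_i / \partial \theta_j$ is of rank $r$,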
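then the statistic
$$\hat X_n^2 = \sum_{i=1}^{k} \frac{(k_i - n \hat p_i)^2}{n \hat p_i} = \sum_{i=1}^{k} \frac{k_i^2}{n \hat p_i} - n,$$
calculated with $\hat p_i = F_0(a_{i+1}, \hat\theta_1, \ldots, \hat\theta_r) - F_0(a_i, \hat\theta_1, \ldots, \hat\theta_r)$, has under $H_0$ asymptotically for $n \to \infty$ a $\chi^2$ distribution with $k - 1 - r$ degrees of freedom [A8.15 (1924)], see Example 7.18 for a practical application. Thus, for a given type I error $\alpha$,
$$\lim_{n \to \infty} \Pr\{\hat X_n^2 > \chi^2_{k-1-r,\,1-\alpha} \mid H_0 \text{ true}\} = \alpha$$
holds, and the hypothesis $H_0$ must be rejected if
$$\hat X_n^2 > \chi^2_{k-1-r,\,1-\alpha},$$
where $\chi^2_{k-1-r,\,1-\alpha}$ is the $(1 - \alpha)$ quantile of the $\chi^2$ distribution with $k - 1 - r$ degrees of freedom. Calculation of the parameters $\theta_1, \ldots, \theta_r$ directly from the observations $t_1, \ldots, t_n$ can lead to wrong decisions.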
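As a minimal sketch of the procedure with $r = 1$, the following Python code estimates the rate of an exponential distribution from class frequencies by solving the likelihood equation numerically (bisection), and then forms the chi-square statistic with $k - 1 - r$ degrees of freedom. The class frequencies are those of Example A8.12; testing them against an exponential hypothesis is purely an illustrative assumption, as are the function names and the bisection bracket. For 2 degrees of freedom the quantile has the closed form $-2 \ln \alpha$, so no table is needed.

```python
import math

def class_probs_and_derivs(lam, bounds):
    """p_i(lam) and dp_i/dlam for F0(t, lam) = 1 - exp(-lam*t) on classes (a_i, a_{i+1}]."""
    surv = [0.0 if math.isinf(a) else math.exp(-lam * a) for a in bounds]
    dsurv = [0.0 if math.isinf(a) else -a * math.exp(-lam * a) for a in bounds]
    p = [lo - hi for lo, hi in zip(surv, surv[1:])]
    dp = [lo - hi for lo, hi in zip(dsurv, dsurv[1:])]
    return p, dp

def score(lam, counts, bounds):
    """Left-hand side of the likelihood equation: sum_i (k_i / p_i) * dp_i/dlam."""
    p, dp = class_probs_and_derivs(lam, bounds)
    return sum(k / pi * dpi for k, pi, dpi in zip(counts, p, dp))

def grouped_ml_rate(counts, bounds, lo=1e-5, hi=1e-1, iterations=200):
    """Solve score(lam) = 0 by bisection; the bracket [lo, hi] is an assumption."""
    if score(lo, counts, bounds) <= 0 or score(hi, counts, bounds) >= 0:
        raise ValueError("bracket does not contain a sign change")
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if score(mid, counts, bounds) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

counts = [3, 3, 3, 4]                            # class frequencies k_i
bounds = [0.0, 200.0, 600.0, 1200.0, math.inf]   # class boundaries a_i (h)
n = sum(counts)

lam_hat = grouped_ml_rate(counts, bounds)
p_hat, _ = class_probs_and_derivs(lam_hat, bounds)
x2 = sum((k - n * p) ** 2 / (n * p) for k, p in zip(counts, p_hat))

alpha = 0.1
dof = len(counts) - 1 - 1                        # k - 1 - r with r = 1
critical = -2.0 * math.log(alpha)                # exact (1 - alpha) quantile for 2 dof
print(lam_hat, x2, dof, critical, "reject H0" if x2 > critical else "accept H0")
```

Estimating the rate from the class frequencies, rather than from the raw observations, is what keeps the asymptotic distribution at $k - 1 - r$ degrees of freedom.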
Example A8.13
Prove Eq. (A8.107).
Solution
The observed frequencies $k_1, \ldots, k_k$ in the classes $(a_1, a_2]$, $(a_2, a_3]$, ..., $(a_k, a_{k+1}]$ result from $n$ trials, where each observation falls into one of the classes $(a_i, a_{i+1}]$ with probability $p_i = F_0(a_{i+1}, \theta_1, \ldots, \theta_r) - F_0(a_i, \theta_1, \ldots, \theta_r)$, $i = 1, \ldots, k$. The multinomial distribution applies. Taking into account Eq. (A6.124),
$$\Pr\{\text{in } n \text{ trials } A_1 \text{ occurs } k_1 \text{ times}, \ldots, A_k \text{ occurs } k_k \text{ times}\} = \frac{n!}{k_1! \ldots k_k!} \, p_1^{k_1} \ldots p_k^{k_k}$$
with $k_1 + \ldots + k_k = n$ and $p_1 + \ldots + p_k = 1$, the likelihood function (Eq. (A8.23)) becomes
$$L(p_1, \ldots, p_k) \sim p_1^{k_1} \ldots p_k^{k_k}, \qquad \text{or} \qquad \ln L = \text{const} + k_1 \ln p_1 + \ldots + k_k \ln p_k,$$
with $p_i = p_i(\theta_1, \ldots, \theta_r)$, $p_1 + \ldots + p_k = 1$, and $k_1 + \ldots + k_k = n$. Equation (A8.107) follows then from
$$\frac{\partial \ln L}{\partial \theta_j} = 0 \quad \text{for } \theta_j = \hat\theta_j \text{ and } j = 1, \ldots, r,$$
which completes the proof. A practical application with $r = 1$ is given in Example 7.18.
A9 Tables and Charts

A9.1 Standard Normal Distribution
Parameters: $E[\tau] = 0$, $Var[\tau] = 1$, Modal value $= E[\tau]$
Properties: $\Phi(0) = 0.5$, $\Phi(-t) = 1 - \Phi(t)$, $\Phi(t_\alpha) = \alpha \Rightarrow t_{1-\alpha} = -t_\alpha$
Table A9.1 Standard normal distribution $\Phi(t)$ for $t = 0.00 - 2.99$
Examples: $\Pr\{\tau \le 2.33\} = 0.9901$; $\Pr\{\tau \le -1\} = 1 - \Pr\{\tau \le 1\} = 1 - 0.8413 = 0.1587$
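Where the table is not at hand, $\Phi(t)$ can be evaluated through the error function. The short Python sketch below (the function name `phi` is illustrative) reproduces the two examples above using only the standard library.

```python
import math

def phi(t):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

print(round(phi(2.33), 4))   # Pr{tau <= 2.33}
print(round(phi(-1.0), 4))   # Pr{tau <= -1}
print(round(1.0 - phi(1.0), 4))  # same value by the symmetry Phi(-t) = 1 - Phi(t)
```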
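A9.2 $\chi^2$ Distribution (Chi-Square Distribution)
Definition: $F(t) = \Pr\{\chi_\nu^2 \le t\} = \frac{1}{2^{\nu/2} \, \Gamma(\nu/2)} \int_0^t x^{\nu/2 - 1} e^{-x/2} \, dx$, $t > 0$, $F(0) = 0$, $\nu = 1, 2, \ldots$ (degrees of freedom)
Parameters: $E[\chi_\nu^2] = \nu$, $Var[\chi_\nu^2] = 2\nu$, Modal value $= \nu - 2$ ($\nu > 2$)
Relationships:
Normal distribution: $\chi_\nu^2 = \frac{1}{\sigma^2} \sum_{i=1}^{\nu} (\xi_i - m)^2$, with $\xi_1, \ldots, \xi_\nu$ independent normal distributed with $E[\xi_i] = m$ and $Var[\xi_i] = \sigma^2$
Poisson distribution: $\sum_{i=0}^{\nu/2 - 1} \frac{(t/2)^i}{i!} e^{-t/2} = 1 - F(t)$, $\nu = 2, 4, \ldots$
Incomplete Gamma function (Appendix A9.6): $F(t) = \gamma(\frac{\nu}{2}, \frac{t}{2}) / \Gamma(\frac{\nu}{2})$
Table A9.2 0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.975 quantiles of the $\chi^2$ distribution ($t_{\nu, q} = \chi^2_{\nu, q}$ for which $F(\chi^2_{\nu, q}) = q$; $\chi^2_{\nu, q} \approx (x + \sqrt{2\nu - 1})^2 / 2$ for $\nu > 100$, with $x$ the $q$ quantile of the standard normal distribution)
Example: $\sum_{i=0}^{8} \frac{13^i}{i!} e^{-13} = 1 - F(26)$ for $\nu = 18$ ($\approx 0.1$)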
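The Poisson relationship gives a table-free way to evaluate $1 - F(t)$ for even $\nu$. The sketch below (the function name is illustrative) checks the example with $\nu = 18$ and $t = 26$.

```python
import math

def chi2_survival_even(t, nu):
    """1 - F(t) for the chi-square distribution with an even number nu of degrees
    of freedom, using the Poisson sum sum_{i=0}^{nu/2-1} (t/2)^i / i! * exp(-t/2)."""
    if nu < 2 or nu % 2 != 0:
        raise ValueError("the Poisson relationship requires nu = 2, 4, ...")
    half = t / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(nu // 2))

tail = chi2_survival_even(26.0, 18)
print(tail)  # close to 0.1, i.e. 26 is close to the 0.9 quantile for nu = 18
```

For $\nu = 2$ the sum reduces to $e^{-t/2}$, so the $(1 - \alpha)$ quantile is simply $-2 \ln \alpha$.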
A9.3 t Distribution (Student Distribution)
Parameters: $E[t_\nu] = 0$, $Var[t_\nu] = \frac{\nu}{\nu - 2}$ ($\nu > 2$), Modal value $= 0$
Properties: $F(0) = 0.5$, $F(-t) = 1 - F(t)$
Relationships:
Normal distribution and $\chi^2$ distribution: $t_\nu = \xi / \sqrt{\chi_\nu^2 / \nu}$; $\xi$ is normal distributed with $E[\xi] = 0$ and $Var[\xi] = 1$; $\chi_\nu^2$ is $\chi^2$ distributed with $\nu$ degrees of freedom; $\xi$ and $\chi_\nu^2$ independent
Cauchy distribution: $F(t)$ with $\nu = 1$
Table A9.3 0.7, 0.8, 0.9, 0.95, 0.975, 0.99, 0.995, 0.999 quantiles of the t distribution
Examples: $F(t_{16,\,0.9}) = 0.9 \Rightarrow t_{16,\,0.9} = 1.3368$; $F(t_{16,\,0.1}) = 0.1 \Rightarrow t_{16,\,0.1} = -1.3368$
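A9.4 F Distribution (Fisher Distribution)
Definition: $F(t) = \Pr\{F_{\nu_1, \nu_2} \le t\} = \frac{\Gamma(\frac{\nu_1 + \nu_2}{2})}{\Gamma(\frac{\nu_1}{2}) \, \Gamma(\frac{\nu_2}{2})} \, \nu_1^{\nu_1/2} \, \nu_2^{\nu_2/2} \int_0^t \frac{x^{(\nu_1 - 2)/2}}{(\nu_1 x + \nu_2)^{(\nu_1 + \nu_2)/2}} \, dx$, $t > 0$, $F(0) = 0$, $\nu_1, \nu_2 = 1, 2, \ldots$ (degrees of freedom)
Parameters: $E[F_{\nu_1, \nu_2}] = \frac{\nu_2}{\nu_2 - 2}$ ($\nu_2 > 2$), $Var[F_{\nu_1, \nu_2}] = \frac{2 \nu_2^2 (\nu_1 + \nu_2 - 2)}{\nu_1 (\nu_2 - 2)^2 (\nu_2 - 4)}$ ($\nu_2 > 4$)
Relationships:
$\chi^2$ distribution: $F_{\nu_1, \nu_2} = \frac{\chi_{\nu_1}^2 / \nu_1}{\chi_{\nu_2}^2 / \nu_2}$, with $\chi_{\nu_1}^2$ and $\chi_{\nu_2}^2$ as in Appendix A9.2
Binomial distribution: $\sum_{i=0}^{k} \binom{n}{i} p^i (1 - p)^{n - i} = 1 - F\left(\frac{p \, (n - k)}{(1 - p)(k + 1)}\right)$ with $\nu_1 = 2(k + 1)$ and $\nu_2 = 2(n - k)$
Table A9.4a 0.90 quantiles of the F distribution ($t_{\nu_1, \nu_2, 0.9} = F_{\nu_1, \nu_2, 0.9}$ for which $F(F_{\nu_1, \nu_2, 0.9}) = 0.9$)
Table A9.4b 0.95 quantiles of the F distribution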
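The relationship between the binomial and the F distribution can be checked numerically. The following sketch integrates the F density with a composite Simpson rule (a choice made here for self-containedness, not a method taken from the text) and compares $1 - F(\cdot)$ with the binomial sum; the values $n = 10$, $k = 2$, $p = 0.3$ are arbitrary illustrative inputs.

```python
import math

def f_density(x, nu1, nu2):
    """Density of the F distribution with nu1, nu2 degrees of freedom."""
    c = math.gamma((nu1 + nu2) / 2) / (math.gamma(nu1 / 2) * math.gamma(nu2 / 2))
    return (c * nu1 ** (nu1 / 2) * nu2 ** (nu2 / 2)
            * x ** ((nu1 - 2) / 2) / (nu1 * x + nu2) ** ((nu1 + nu2) / 2))

def f_cdf(t, nu1, nu2, steps=2000):
    """F(t) by composite Simpson integration; assumes nu1 >= 2 (finite density at 0)."""
    h = t / steps
    total = f_density(0.0, nu1, nu2) + f_density(t, nu1, nu2)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * f_density(j * h, nu1, nu2)
    return total * h / 3.0

def binomial_cdf(k, n, p):
    """Pr{at most k successes in n Bernoulli trials}."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

n, k, p = 10, 2, 0.3                      # illustrative values
nu1, nu2 = 2 * (k + 1), 2 * (n - k)
argument = p * (n - k) / ((1 - p) * (k + 1))

left = binomial_cdf(k, n, p)
right = 1.0 - f_cdf(argument, nu1, nu2)
print(left, right)                        # the two values agree to many digits
```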
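A9.5 Table for the Kolmogorov-Smirnov Test
$D_n = \sup_{-\infty < t < \infty} |\hat F_n(t) - F_0(t)|$, with $\hat F_n(t)$ = empirical distribution function (Eq. (A8.1)) and $F_0(t)$ = postulated continuous distribution function
Table A9.5 $1 - \alpha$ quantiles of the distribution function of $D_n$ ($\Pr\{D_n \le y_{1-\alpha} \mid H_0 \text{ true}\} = 1 - \alpha$)
[Table body not reproduced; its large-$n$ entries include $1.220/\sqrt{n}$ and $1.630/\sqrt{n}$]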
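The statistic $D_n$ is straightforward to compute, because the supremum is attained at one of the jump points of the empirical distribution function. The sketch below is an illustration only: it reuses the capacitor lifetimes and the Weibull hypothesis of Example A8.12 as sample input, and compares $D_n$ with the large-sample value $1.22/\sqrt{n}$ quoted for $\alpha = 0.1$ in Appendix A8.3.3. For a sample as small as $n = 13$ the exact entry of Table A9.5 should be preferred.

```python
import math

def ks_statistic(sample, cdf):
    """D_n = sup_t |F_n(t) - F0(t)|, evaluated just before and at each jump of F_n."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, t in enumerate(xs, start=1):
        f0 = cdf(t)
        d = max(d, i / n - f0, f0 - (i - 1) / n)
    return d

def weibull_cdf(t, lam=1e-3, beta=1.2):
    """Postulated F0(t) = 1 - exp(-(lam*t)**beta) from Example A8.12."""
    return 1.0 - math.exp(-((lam * t) ** beta))

lifetimes = [59, 71, 153, 235, 347, 589, 837, 913, 1185, 1273, 1399, 1713, 2567]
d_n = ks_statistic(lifetimes, weibull_cdf)
y_large_n = 1.22 / math.sqrt(len(lifetimes))   # large-n approximation, alpha = 0.1
print(d_n, y_large_n, "reject H0" if d_n > y_large_n else "accept H0")
```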
A9.6 Gamma Function
Definition: $\Gamma(z) = \int_0^\infty x^{z-1} e^{-x} \, dx$, $Re(z) > 0$ (Euler's integral), solution of $\Gamma(z + 1) = z \, \Gamma(z)$ with $\Gamma(1) = 1$
Special values: $\Gamma(1/2) = \sqrt{\pi}$, $\Gamma(1) = \Gamma(2) = 1$
Factorial: $n! = 1 \cdot 2 \cdot 3 \cdot \ldots \cdot n = \Gamma(n + 1) = \sqrt{2\pi} \, n^{n + 1/2} \, e^{-n + \theta/12n}$, $0 < \theta < 1$ (Stirling's formula)
Beta function: $B(z, w) = \int_0^1 x^{z-1} (1 - x)^{w-1} \, dx = \frac{\Gamma(z) \, \Gamma(w)}{\Gamma(z + w)}$
Psi function: $\psi(z) = \frac{d}{dz} (\ln \Gamma(z))$
Incomplete Gamma function: $\gamma(z, t) = \int_0^t x^{z-1} e^{-x} \, dx$, $Re(z) > 0$
Relationships: $\chi^2$ distribution ($F(t)$ as in Appendix A9.2): $F(t) = \gamma(\frac{\nu}{2}, \frac{t}{2}) / \Gamma(\frac{\nu}{2})$

Table A9.6 Gamma function for $1.00 \le t \le 1.99$ ($t$ real), for other values use $\Gamma(z + 1) = z \, \Gamma(z)$

t      0       1       2       3       4       5       6       7       8       9
1.00   1.0000  .9943   .9888   .9835   .9784   .9735   .9687   .9641   .9597   .9554
1.10   .9513   .9474   .9436   .9399   .9364   .9330   .9298   .9267   .9237   .9209
1.20   .9182   .9156   .9131   .9107   .9085   .9064   .9044   .9025   .9007   .8990
1.30   .8975   .8960   .8946   .8934   .8922   .8911   .8902   .8893   .8885   .8878
1.40   .8873   .8868   .8863   .8860   .8858   .8857   .8856   .8856   .8857   .8859
1.50   .8862   .8866   .8870   .8876   .8882   .8889   .8896   .8905   .8914   .8924
1.60   .8935   .8947   .8959   .8972   .8986   .9001   .9017   .9033   .9050   .9068
1.70   .9086   .9106   .9126   .9147   .9168   .9191   .9214   .9238   .9262   .9288
1.80   .9314   .9341   .9368   .9397   .9426   .9456   .9487   .9518   .9551   .9584
1.90   .9618   .9652   .9688   .9724   .9761   .9799   .9837   .9877   .9917   .9958

Examples: $\Gamma(1.25) = 0.9064$; $\Gamma(0.25) = \Gamma(1.25) / 0.25 = 3.6256$; $\Gamma(2.25) = 1.25 \cdot \Gamma(1.25) = 1.133$
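The examples, and the bound $0 < \theta < 1$ in Stirling's formula, can be checked with the gamma functions of the Python standard library. This is a small sketch; the helper name `stirling_theta` is illustrative.

```python
import math

# Table examples, using the recursion Gamma(z + 1) = z * Gamma(z)
g_125 = math.gamma(1.25)
g_025 = g_125 / 0.25          # Gamma(0.25) from Gamma(1.25)
g_225 = 1.25 * g_125          # Gamma(2.25) from Gamma(1.25)
print(round(g_125, 4), round(g_025, 4), round(g_225, 3))

def stirling_theta(n):
    """Solve n! = sqrt(2*pi) * n**(n + 1/2) * exp(-n + theta/(12*n)) for theta."""
    log_factorial = math.lgamma(n + 1)
    return 12 * n * (log_factorial - 0.5 * math.log(2 * math.pi)
                     - (n + 0.5) * math.log(n) + n)

print([round(stirling_theta(n), 4) for n in (1, 5, 20)])  # each value lies in (0, 1)
```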
A9.7 Laplace Transform
Definition: $\tilde F(s) = \int_0^\infty e^{-st} F(t) \, dt$, with $F(t)$ defined for $t \ge 0$, piecewise continuous, $|F(t)| < A e^{Bt}$ ($0 < A, B < \infty$)
Inverse transf.: $F(t) = \frac{1}{2\pi i} \int_{C - i\infty}^{C + i\infty} \tilde F(s) \, e^{st} \, ds$; $\tilde F(s)$ exists in the half-plane $Re(s) = C > B$, $i = \sqrt{-1}$
Moment generating function: Considering $f(t)$ as density of $\tau > 0$, it follows (under weak conditions) that $\tilde f(s) = \int_0^\infty e^{-st} f(t) \, dt = E[e^{-s\tau}] = \sum_{k=0}^{\infty} \frac{(-s)^k}{k!} E[\tau^k]$; thus, $E[\tau^k] / k!$ is, except for the sign, the $k$th coefficient of the MacLaurin expansion of $\tilde f(s)$. For arbitrary $\tau$, the characteristic function $E[e^{it\tau}] = \int_{-\infty}^{\infty} e^{itx} f(x) \, dx$ applies.

Table A9.7a Properties of the Laplace Transform (columns: Transform Domain, Time Domain)
Properties listed: Linearity; Scale Change; Shift; Differentiation; Integration; Convolution ($F_1 * F_2$); Initial Val. Theorem* ($\lim_{s \to \infty} s \tilde F(s)$); Final Val. Theorem* ($\lim_{s \downarrow 0} s \tilde F(s)$)
* Existence of the limit is assumed; ** $u(t)$ is the unit step function (Table A9.7b)
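The moment property can be illustrated numerically. The sketch below assumes an exponential density $f(t) = \lambda e^{-\lambda t}$ (an illustrative choice, for which $\tilde f(s) = \lambda / (s + \lambda)$), approximates the transform by a truncated Simpson integral, and recovers $E[\tau]$ and $E[\tau^2]$ from finite differences of $\tilde f(s)$ at $s = 0$. Step sizes and the truncation point are assumptions chosen for this example.

```python
import math

def laplace_transform(f, s, upper, steps=20000):
    """Simpson approximation of int_0^upper exp(-s*t) f(t) dt (truncated integral)."""
    h = upper / steps
    g = lambda t: math.exp(-s * t) * f(t)
    total = g(0.0) + g(upper)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * g(j * h)
    return total * h / 3.0

lam = 2.0                                    # illustrative rate
density = lambda t: lam * math.exp(-lam * t)
upper = 40.0 / lam                           # the neglected tail is of order exp(-40)
ds = 1e-2                                    # finite-difference step in s

f_0 = laplace_transform(density, 0.0, upper)
f_plus = laplace_transform(density, ds, upper)
f_minus = laplace_transform(density, -ds, upper)

mean = -(f_plus - f_minus) / (2 * ds)                 # E[tau]   = -d f~/ds   at s = 0
second = (f_plus - 2 * f_0 + f_minus) / ds ** 2       # E[tau^2] =  d2 f~/ds2 at s = 0
print(f_0, mean, second)   # close to 1, 1/lam and 2/lam**2
```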
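Table A9.7b Important Laplace Transforms (columns: Transform Domain $\tilde F(s)$, Time Domain $F(t)$; $F(t)$ is understood as $u(t) F(t)$, with $u(t)$ as unit step)
Entries recoverable from the table:
Impulse $\delta(t)$ (for $a > 0$, $\delta(t - a) \Rightarrow e^{-sa}$)
Unit step $u(t)$ ($u(t) = 0$ for $t < 0$, $u(t) = 1$ for $t \ge 0$) (e.g. $\lambda e^{-\lambda t} [1 - u(t - a)] \Rightarrow \frac{\lambda}{s + \lambda} (1 - e^{-(s + \lambda) a})$)
$\frac{a}{b} \, t + \frac{b - a}{b^2} (1 - e^{-bt})$
$1 - e^{-\beta t}$
Truncated distribution function (given piecewise, for $0 \le x < A$ and for $x \ge A$)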
A9.8 Probability Charts
A distribution function appears as a straight line when plotted on a probability chart belonging to its family. The use of probability charts (probability plot papers) simplifies the analysis and interpretation of data, in particular of life times or failure-free times (failure-free operating times). In the following, the charts for lognormal, Weibull, and normal distributions are given.

A9.8.1 Lognormal Probability Chart
The distribution function (Eq. (A6.110), Table A6.1)
$$F(t) = \frac{1}{\sigma \sqrt{2\pi}} \int_0^t \frac{1}{y} \, e^{-(\ln y + \ln \lambda)^2 / 2\sigma^2} \, dy = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\ln(\lambda t)/\sigma} e^{-x^2/2} \, dx, \qquad t > 0, \; F(0) = 0; \; \lambda, \sigma > 0,$$
appears as a straight line on the chart of Fig. A9.1 ($\lambda$ in h$^{-1}$). For $F(t) = 0.5$, $\lambda t_{0.5} = 1$ and thus $\lambda = 1 / t_{0.5}$; moreover, for $F(t) = 0.99$, $\ln(t_{0.99} / t_{0.5}) / \sigma = 2.33$ and thus $\sigma = \ln(t_{0.99} / t_{0.5}) / 2.33$ (this can be used for a graphical estimation of $\lambda$ and $\sigma$).
Figure A9.1 Lognormal probability chart
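The graphical estimation rule translates directly into code. In the sketch below the chart readings $t_{0.5} = 1000$ h and $t_{0.99} = 10000$ h are hypothetical values chosen for illustration, and the function names are not from the text.

```python
import math

def phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def lognormal_cdf(t, lam, sigma):
    """F(t) = Phi(ln(lam*t) / sigma) for t > 0."""
    return phi(math.log(lam * t) / sigma)

def estimate_from_chart(t_50, t_99):
    """Graphical estimates: lam = 1/t_0.5 and sigma = ln(t_0.99/t_0.5)/2.33."""
    return 1.0 / t_50, math.log(t_99 / t_50) / 2.33

lam, sigma = estimate_from_chart(t_50=1000.0, t_99=10000.0)   # hypothetical readings
print(lam, sigma)
print(lognormal_cdf(1000.0, lam, sigma), lognormal_cdf(10000.0, lam, sigma))
```

The second check returns a value marginally above 0.99 because 2.33 is the rounded 0.99 quantile of the standard normal distribution.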
A9.8.2 Weibull Probability Chart
The distribution function $F(t) = 1 - e^{-(\lambda t)^\beta}$, $t > 0$, $F(0) = 0$, $\lambda, \beta > 0$ (Eq. (A6.89), Table A6.1) appears as a straight line on the chart of Fig. A9.2 ($\lambda$ in h$^{-1}$), see Appendix A8.1.3. On the dashed line one has $\lambda = 1/t$; moreover, $\beta$ appears on the scale $\log_{10} \log_{10} (\frac{1}{1 - F(t)})$ when $t$ is varied by one decade (Figs. A8.2, 7.12, 7.13).
Figure A9.2 Weibull probability chart
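Taking logarithms twice gives $\log_{10} \log_{10} \frac{1}{1 - F(t)} = \beta \log_{10}(\lambda t) + \log_{10} \log_{10} e$, which is why $\beta$ is the rise of the line per decade of $t$. The following sketch recovers $\lambda$ and $\beta$ from two points on such a line; the two points are generated from $\lambda = 10^{-3}$ h$^{-1}$ and $\beta = 1.2$ purely for illustration, and the function names are assumptions.

```python
import math

LOG10_E = math.log10(math.e)

def chart_ordinate(F):
    """Ordinate of the Weibull chart: log10(log10(1 / (1 - F)))."""
    return math.log10(math.log10(1.0 / (1.0 - F)))

def weibull_cdf(t, lam, beta):
    return 1.0 - math.exp(-((lam * t) ** beta))

def estimate_from_two_points(t1, F1, t2, F2):
    """Slope of the straight line gives beta; the intercept then gives lam."""
    y1, y2 = chart_ordinate(F1), chart_ordinate(F2)
    beta = (y2 - y1) / (math.log10(t2) - math.log10(t1))
    # y = beta*log10(lam*t) + log10(log10(e)), solved for lam at the first point
    lam = 10 ** ((y1 - math.log10(LOG10_E)) / beta) / t1
    return lam, beta

# Two points one decade apart on the line for lam = 1e-3 per hour, beta = 1.2
t1, t2 = 100.0, 1000.0
lam_est, beta_est = estimate_from_two_points(
    t1, weibull_cdf(t1, 1e-3, 1.2), t2, weibull_cdf(t2, 1e-3, 1.2))
print(lam_est, beta_est)
```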
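A9.8.3 Normal Probability Chart
The distribution function (Eq. (A6.105), Table A6.1)
$$F(t) = \frac{1}{\sigma \sqrt{2\pi}} \int_{-\infty}^{t} e^{-(y - m)^2 / 2\sigma^2} \, dy = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{(t - m)/\sigma} e^{-x^2/2} \, dx = \Phi\left(\frac{t - m}{\sigma}\right), \qquad -\infty < t, m < \infty, \; \sigma > 0,$$
appears as a straight line on the chart of Fig. A9.3. For $F(t) = 0.5$, $t_{0.5} - m = 0$ and thus $m = t_{0.5}$; moreover, for $F(t) = 0.99$, $(t_{0.99} - t_{0.5}) / \sigma = 2.33$ and thus $\sigma = (t_{0.99} - t_{0.5}) / 2.33$. For a statistical evaluation of data, it is often useful to estimate $m$ and $\sigma$ as per Eqs. (A8.6), (A8.8), (A6.108) and to operate with the standard normal distribution $\Phi$.
Figure A9.3 Normal probability chart (standard normal distribution)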
A10 Basic Technological Component's Properties
Table A10.1 gives some basic technological properties of electronic components to support reliability evaluations.

Table A10.1 Basic technological properties of electronic components
(for each component: Technology, Characteristics / Sensitive to / Application)

Fixed resistors

Carbon film
Technology, Characteristics: A layer of carbon film deposited at high temperature on ceramic rods; ±5% usual; medium TC; relatively low drift (-1 to +4%); failure modes: opens, drift, rarely shorts; elevated noise; 1 Ω to 22 MΩ; low $\lambda$ (0.2 to 0.4 FIT)
Sensitive to: Load, temperature, overvoltage, freq. (> 50 MHz), moisture
Application: Low power (≤ 1 W), moderate temperature (< 85°C) and frequency (≤ 50 MHz)

Metal film
Technology, Characteristics: Evaporated NiCr film deposited on aluminum oxide ceramic; ±5% usual; low TC; low drift (±1%); failure modes: drift, opens, rarely shorts; low noise; 10 Ω to 2.4 MΩ; low $\lambda$ (0.2 FIT)
Sensitive to: Load, temperature, current peaks, ESD, moisture
Application: Low power (≤ 0.5 W), high accuracy and stability, high freq. (≤ 500 MHz)

Wirewound
Technology, Characteristics: Usually NiCr wire wound on glass fiber substrate (sometimes ceramic); precision (±0.1%) or power (±5%); low TC; failure modes: opens, rarely shorts between adjacent windings; low noise; 0.1 Ω to 250 kΩ; medium $\lambda$ (2 to 4 FIT)
Sensitive to: Load, temperature, overvoltage, mechanical stress (wire < 25 µm), moisture
Application: High power, high stability, low frequency (≤ 20 kHz)

Thermistors (PTC, NTC)
Technology, Characteristics: PTC: Ceramic materials (BaTiO3 or SrTiO3 with metal salts) sintered at high temperatures, showing strong increase of resistance ($10^3$ to $10^4$) within 50°C; medium $\lambda$ (4 to 10 FIT, large values for disk and rod packages). NTC: Rods pressed from metal oxides and sintered at high temperature, large neg. TC (TC ~ $1/T^2$); failure rate as for PTC
Sensitive to: Current and voltage load, moisture
Application: PTC: Temperature sensor, overload protection, etc. NTC: Compensation, control, regulation, stabilization

Variable resistors

Cermet Pot., Cermet Trim
Technology, Characteristics: Metallic glazing (often ruthenium oxide) deposited as a thick film on ceramic rods and fired at about 800°C; usually ±10%; poor linearity (5%); medium TC; failure modes: opens, localized wearout, drift; relatively high noise (increases with age); 20 Ω to 2 MΩ; low to medium $\lambda$ (5 to 20 FIT)

Wirewound Pot.
Technology, Characteristics: CuNi / NiCr wire wound on ceramic rings or cylinders (spindle-operated potentiom.); normally ±10%; good linearity (1%); precision or power; low, nonlinear TC; low drift; failure modes: opens, localized wearout; relatively low noise; 10 Ω to 50 kΩ; medium to large $\lambda$ (10 to 100 FIT)

Sensitive to (variable resistors): Load, current, fritting voltage (< 1.5 V), temperature, vibration, noise, dust, moisture, frequency (wire)
Application (variable resistors): Should only be employed when there is a need for adjustment during operation, fixed resistors have to be preferred for calibration during testing; load capability proportional to the part of the resistor used
Table A10.1 (cont.)

Capacitors

Plastic (KS, KP, KT, KC)
Technology, Characteristics: Wound capacitors with plastic film (K) of polystyrene (S), polypropylene (P), polyethylene-terephthalate (T) or polycarbonate (C) as dielectric and Al foil; very low loss factor (S, P, C); failure modes: opens, shorts, drift; pF to µF; low $\lambda$ (1 to 3 FIT)
Sensitive to: Voltage stress, pulse stress (T, C), temperature (S, P), moisture* (S, P), cleaning agents (S)
Application: Tight capacitance tolerances, high stability (S, P), low loss (S, P), well-defined temperature coefficient

Metallized plastic (MKP, MKT, MKC, MKU)
Technology, Characteristics: Wound capacitors with metallized film (MK) of polypropylene (P), polyethylene-terephthalate (T), polycarbonate (C) or cellulose acetate (U); self-healing; low loss factor; failure modes: opens, shorts; nF to µF; low $\lambda$ (1 to 2 FIT)
Sensitive to: Voltage stress, frequency (T, C, U), temperature (P), moisture* (P, U)
Application: High capacitance values, low loss, relatively low frequencies (< 20 kHz for T, U)

Metallized paper (MP, MKV)
Technology, Characteristics: Wound capacitors with metallized paper (MP) and in addition polypropylene film as dielectric (MKV); self-healing; low loss factor; failure modes: shorts, opens, drift; 0.1 µF to mF; low $\lambda$ (1 to 3 FIT)
Sensitive to: Voltage stress and temperature (MP), moisture*
Application: Coupling, smoothing, blocking (MP), oscillator circuits, commutation, attenuation (MKV)

Ceramic
Technology, Characteristics: Often manufactured as multilayer capacitors with metallized ceramic layers by sintering at high temperature with controlled firing process (class 1: $\varepsilon_r$ < 200, class 2: $\varepsilon_r$ ≥ 200); very low loss factor (class 1); temperature compensation (class 1); high resonance frequency; failure modes: shorts, drift, opens; pF to µF; low $\lambda$ (0.5 to 2 FIT)
Sensitive to: Voltage stress, temperature (even during soldering), moisture*, aging at high temperature (class 2)
Application: Class 1: high stability, low loss, low aging; class 2: coupling, smoothing, buffering, etc.

Tantalum (dry)
Technology, Characteristics: Manufactured from a porous, oxidized cylinder (sintered tantalum powder) as anode, with manganese dioxide as electrolyte and a metal case as cathode; polarized; medium frequency-dependent loss factor; failure modes: shorts, opens, drift; 0.1 µF to mF; low to medium $\lambda$ (2 to 5 FIT, 20 to 40 FIT for bead)
Sensitive to: Incorrect polarity, voltage stress, AC resistance ($Z_0$) of the el. circuit (new types less sensitive), temperature, frequency (> 1 kHz), moisture*
Application: Relatively high capacitance per unit volume, high requirements with respect to reliability, $Z_0$ ≥ 1 Ω/V

Aluminum (wet)
Technology, Characteristics: Wound capacitors with oxidized Al foil (anode and dielectric) and conducting electrolyte (cathode); also available with two formed foils (nonpolarized); large, frequency and temperature dependent loss factor; failure modes: drift, shorts, opens; µF to 200 mF; medium to large $\lambda$ (5 to 10 FIT); limited useful life (function of temperature and ripple)
Sensitive to: Incorrect polarity (if polarized), voltage stress, temperature, cleaning agent (halogen), storage time, frequency (> 1 kHz), moisture*
Application: Very high capacitance per unit volume, uncritical applications with respect to stability, relatively low ambient temperature (0 to 55°C)
Table A10.1 (cont.)

Diodes (Si)

General purpose
Technology, Characteristics: PN junction produced from high purity Si by diffusion; diode function based on the recombination of minority carriers in the depletion regions; failure modes: shorts, opens; low $\lambda$ (1 to 3 FIT, $\theta_J$ = 40°C, 10 FIT for rectifiers with $\theta_J$ = 100°C)
Sensitive to: Forward current, reverse voltage, temperature, transients, moisture*
Application: Signal diodes (analog, switch), rectifier, fast switching diodes (Schottky, ...)

Zener
Technology, Characteristics: Heavily doped PN junction (charge carrier generation in strong electric field and rapid increase of the reverse current at low reverse voltages); failure modes: shorts, opens, drift; low to medium $\lambda$ (2 to 4 FIT for voltage regulators ($\theta_J$ = 40°C), 20 to 50 FIT for voltage ref. ($\theta_J$ = 100°C))
Sensitive to: Load, temperature, moisture*
Application: Level control, voltage reference (allow for ±5% drift)

Transistors

Bipolar
Technology, Characteristics: PNP or NPN junctions manufactured using planar technology (diffusion or ion implantation); failure modes: shorts, opens, thermal fatigue for power transistors; low to medium $\lambda$ (2 to 6 FIT for $\theta_J$ = 40°C, 20 to 60 FIT for power transistors and $\theta_J$ = 100°C)
Sensitive to: Load, temperature, breakdown voltage ($V_{BCEO}$, $V_{BEBO}$), moisture*
Application: Switch, amplifier, power stage (allow for ±20% drift, ±500% for $I_{CBO}$)

FET
Technology, Characteristics: Voltage controlled semiconductor resistance, with control via diode (JFET) or isolated layer (MOSFET); transist. function based on majority carrier transport; N or P channel; depletion or enhancement type (MOSFET); failure modes: shorts, opens, drift; medium $\lambda$ (3 to 10 FIT for $\theta_J$ = 40°C, 30 to 60 FIT for power transistors and $\theta_J$ = 100°C)
Sensitive to: Load, temperature, breakdown voltage, ESD, radiation, moisture*
Application: Switch (MOS) and amplifier (JFET) for high-resistance circuits (allow for ±20% drift)

Controlled rectifiers (Thyristors, triacs, etc.)
Technology, Characteristics: NPNP junctions with lightly doped inner zones (P, N), which can be triggered by a control pulse (thyristor), or a special antiparallel circuit consisting of two thyristors with a single firing circuit (triac); failure modes: drift, shorts, opens; large $\lambda$ (20 to 100 FIT for $\theta_J$ = 100°C)
Sensitive to: Temperature, reverse voltage, rise rate of voltage and current, commutation effects, moisture*
Application: Controlled rectifier, overvoltage and overcurrent protection (allow for ±20% drift)

Optosemiconductors (LED, IRED, photo-sensitive devices, optocouplers, etc.)
Technology, Characteristics: Electrical/optical or optical/electrical converter made with photosensitive semiconductor components; transmitter (LED, IRED, laser diode etc.), receiver (photo-resistor, photo-transistor, solar cells etc.), optocoupler, displays; failure modes: opens, drift, shorts; medium to large $\lambda$ (2 to 100 FIT; for LCD, 20 FIT scaled by the no. of pixels); limited useful life
Sensitive to: Temperature, current, ESD, moisture*, mechanical stress
Application: Displays, sensors, galvanic separation, noise rejection (allow for ±30% drift)
Table A10.1 (cont.)

Digital ICs

Bipolar
Technology, Characteristics: Monolithic ICs with bipolar transistors (TTL, ECL, I²L), important AS TTL (6 mW, 2 ns, 1.3 V) and ALS TTL (1 mW, 3 ns, 1.8 V); $V_{CC}$ = 4.5 to 5.5 V; $Z_{out}$ < 150 Ω for both states; low to medium $\lambda$ (2 to 6 FIT for SSI/MSI, 20 to 100 FIT for LSI/VLSI)
Sensitive to: Supply voltage, noise (> 1 V), temperature (0.5 eV), ESD, rise and fall times, breakdown BE diode, moisture*
Application: Fast logic (LS TTL, ECL) with uncritical power consump., rel. high cap. loading, $\theta_J$ < 175°C (< 200°C for SOI)

MOS
Technology, Characteristics: Monolithic ICs with MOS transistors, mainly N channel depletion type (formerly also P channel); often TTL compatible and therefore $V_{DD}$ = 4.5 to 5.5 V (100 µW, 10 ns); very high $Z_{in}$; medium $Z_{out}$ (1 to 10 kΩ); medium to high $\lambda$ (50 to 200 FIT)
Sensitive to: ESD, noise (> 2 V), temperature (0.4 eV), rise and fall times, radiation, moisture*
Application: Memories and microprocessors, high source impedance, low capacitive loading

CMOS
Technology, Characteristics: Monolithic ICs with complementary enhancement-type MOS transistors; often TTL compatible and therefore $V_{DD}$ = 4.5 to 5.5 V; power consumption ~ f (10 µW at 10 kHz, $V_{DD}$ = 5.5 V, $C_L$ = 15 pF); fast CMOS (HCMOS, HCT) for 2 to 6 V with 6 ns at 5 V and 20 µW at 10 kHz; large static noise immunity (0.4 $V_{DD}$); very high $Z_{in}$; medium $Z_{out}$ (0.5 to 5 kΩ); low to medium $\lambda$ (2 to 6 FIT for SSI/MSI, 10 to 100 FIT for LSI/VLSI)
Sensitive to: ESD, latch-up, temperature (0.4 eV), rise and fall times, noise (> 0.4 $V_{DD}$), moisture*
Application: Low power consumption, high noise immunity, not extremely high frequency, high source impedance, low cap. load, $\theta_J$ < 175°C, for memories: < 125°C

BiCMOS
Technology, Characteristics: Monolithic ICs with bipolar and CMOS devices; trend to less than 2 V supplies; combine the advantages of both bipolar and CMOS technologies
Sensitive to: similar to CMOS
Application: similar to CMOS but also for very high frequencies

Analog ICs

Operational amplifiers, comparators, voltage regulators, etc.
Technology, Characteristics: Monolithic ICs with bipolar and/or FET transistors for processing analog signals (operational amplifiers, special amplifiers, comparators, voltage regulators, etc.); up to about 200 transistors; often in metal packages; medium to high $\lambda$ (10 to 50 FIT)
Sensitive to: Temperature (0.6 eV), input voltage, load current, moisture*
Application: Signal processing, voltage reg., low to medium power consump. (allow for drift), $\theta_J$ < 175°C (< 125°C for low power)

Hybrid ICs

Thick film, thin film
Technology, Characteristics: Combination of chip components (ICs, transistors, diodes, capacitors) on a thick film (5 to 20 µm) or thin film (0.2 to 0.4 µm) substrate with deposited resistors and connections; substrate area up to 10 cm²; medium to high $\lambda$ (usually determined by the chip components)
Sensitive to: Manufacturing quality, temperature, mechanical stress, moisture*
Application: Compact and reliable devices e.g. for avionics or automotive (allow for ±20% drift)

ESD = electrostatic discharge; TC = temperature coefficient; $\lambda$ in $10^{-9}$ h$^{-1}$ for standard ind. envir. ($\theta_A$ = 40°C, GB), indicative values; for failure modes see also Table 3.4; * nonhermetic packages
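Since the table quotes $\lambda$ in FIT ($10^{-9}$ h$^{-1}$), a rough failure-rate budget for an assembly can be sketched as below. The parts list and quantities are hypothetical; the per-part values are picked from within the indicative ranges of Table A10.1, and the model assumed is a series structure with independent elements and constant failure rates, so that the rates simply add.

```python
FIT = 1e-9  # one FIT corresponds to 1e-9 failures per hour

# Hypothetical bill of materials: name -> (failure rate in FIT, quantity)
parts = {
    "metal film resistor": (0.2, 20),
    "ceramic capacitor": (1.0, 10),
    "bipolar transistor": (4.0, 5),
    "CMOS SSI/MSI IC": (4.0, 3),
}

# Series model with constant failure rates: the system failure rate is the sum
total_fit = sum(rate * quantity for rate, quantity in parts.values())
system_rate_per_hour = total_fit * FIT
mttf_hours = 1.0 / system_rate_per_hour

print(f"total: {total_fit:.1f} FIT, MTTF about {mttf_hours:.3e} h")
```

Because the table values are indicative only, a budget of this kind serves as an order-of-magnitude check rather than a prediction.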
A11 Problems for Home-Work
In addition to the 120 solved examples in this book, the following are some selected problems for home-work, ordered for Chapters 2, 4, 6, 7 and Appendices A6, A7, A8 (* denotes time-consuming).

Problem 2.1 Draw the reliability block diagram corresponding to the fault tree given by Fig. 6.39b (p. 271).

Problem 2.2 Compare the mean time to failure $MTTF_S$ and the reliability function $R_S(t)$ of the following two reliability block diagrams for the case nonrepairable and constant failure rate $\lambda$ for elements $E_1, \ldots, E_4$ (Hint: For a graphical comparison of $R_S(t)$, Fig. 2.7 can be modified for the 1-out-of-2 redundancy).
[Reliability block diagrams: 1-out-of-2 active ($E_1 = E_2 = E_3 = E_4 = E$); 2-out-of-3 active ($E_1 = E_2 = E_3 = E$)]

Problem 2.3 Compare the mean time to failure $MTTF_S$ for cases 7 and 8 of Table 2.1 (p. 31) for $E_1 = \ldots = E_5 = E$ and constant failure rates $\lambda_1 = \ldots = \lambda_5 = \lambda$.

Problem 2.4 Compute the reliability function $R_S(t)$ for case 4 of Table 2.1 (p. 31) for $n = 3$, $k = 2$, $E_1 \ne E_2 \ne E_3$.

Problem 2.5* Demonstrate the result given by Eq. (2.62), p. 63, and apply this to the active and standby redundancy.

Problem 2.6* Compute the reliability function $R_S(t)$ for the circuit with bidirectional connections given below (Hint: Use Eq. (2.29) as for Example 2.15).
[Circuit diagram with bidirectional connections]

Problem 2.7* Give a realization for the circuit to detect the occurrence of the second failure in a majority redundancy 2-out-of-3 (Example 2.5, p. 47), allowing an expansion of a 2-out-of-3 to a 1-out-of-3 redundancy (Hint: Isolate the first failure and detect the occurrence of the second failure using e.g. 6 two-input AND, 3 two-input EXOR, 1 three-input OR, and adding a delay $\delta$ for an output pulse of width $\delta$).
Problem 4.1 Compute the $MTTR_S$ for case 5 of Table 2.1 (p. 31) for $\lambda_1 = 10^{-…}$ h$^{-1}$, $\lambda_2 = 10^{-…}$ h$^{-1}$, $\lambda_3 = 10^{-2}$ h$^{-1}$, $\lambda_4 = 10^{-3}$ h$^{-1}$, $\lambda_5 = 10^{-…}$ h$^{-1}$, $\lambda_6 = 10^{-…}$ h$^{-1}$, $\lambda_7 = 10^{-…}$ h$^{-1}$, and $\mu_1 = \ldots = \mu_7 = 0.5$ h$^{-1}$. Compare the obtained $MTTR_S$ with the mean repair (restoration) duration at system level $MDT_S$ (Hint: use results of Table 6.10 to compute $MTTF_{S0}$ and $PA_S$, and assume (as an approximation) $MUT_S = MTTF_{S0}$ in Eq. (6.291)).

Problem 4.2 Give the number of spare parts necessary to cover with a probability $\gamma \ge 0.9$ an operating time of 50,000 h for the system given by case 6 of Table 2.1 (p. 31) for $\lambda_1 = \lambda_2 = \lambda_3 = 10^{-…}$ h$^{-1}$, $\lambda_\nu = 10^{-…}$ h$^{-1}$ (Hint: Assume equal allocation of $\gamma$ between $E_\nu$ and the 2-out-of-3 active redundancy).

Problem 4.3 Same as for Problem 4.2 by assuming that spare parts are repairable with $\mu_1 = \mu_2 = \mu_3 = \mu_\nu = 0.5$ h$^{-1}$ (Hint: consider only the case with $R_S(t)$ and assume equal allocation of $\gamma$ between $E_\nu$ and the 2-out-of-3 active redundancy).

Problem 4.4 Give the number of spare parts necessary to cover with a probability $\gamma \ge 0.9$ an operating time of 50,000 h for a 1-out-of-2 standby redundancy with constant failure rate $\lambda = 10^{-…}$ h$^{-1}$ for the operating element ($\lambda = 0$ for the reserve element). Compare the results with those obtained for an active 1-out-of-2 redundancy with failure rate $\lambda = 10^{-…}$ h$^{-1}$ for the active and the reserve element.

Problem 4.5 Give the number of spare parts necessary to cover with a probability $\gamma \ge 0.9$ an operating time of 50,000 h for an item with Erlangian distributed failure-free times with $\lambda = 10^{-…}$ h$^{-1}$ and $n = 3$ (Hint: Consider Appendix A6.10.3).

Problem 4.6* Develop the expression allowing the computation of the number of spare parts necessary to cover with a probability $\ge \gamma$ an operating time $T$ for an item with failure-free times distributed according to a Gamma distribution (Hint: Consider Appendix A6.10.3 and Table A9.7b).

Problem 4.7* A series-system consists of operationally independent elements $E_1, \ldots, E_n$ with constant failure rates $\lambda_1, \ldots, \lambda_n$. Let $c_i$ be the cost for a repair of element $E_i$. Give the mean (expected value) of the repair cost for the whole system during a total operating time $T$ (Hint: Use results of Section 2.2.6.1 and Appendix A7.2.5).

Problem 4.8* A system has a constant failure rate $\lambda$ and a constant repair rate $\mu$. Compute the mean (expected value) of the repair cost during a total operating time $T$, given the fixed cost $c_0$ for each repair. Assuming that down time for repair has a cost $c_d$ per hour, compute the mean value of the total cost for repair and down time during a total operating time $T$ (Hint: Consider Appendices A7.2.5 and A7.8.4).
Problem 6.1 Compare the mean time to failure $MTTF_{S0}$ and the asymptotic & steady-state point and average availability $PA_S = AA_S$ for the two reliability block diagrams of Problem 2.2, by assuming constant failure rate $\lambda$ and constant repair rate $\mu$ for each element and only one repair crew (Hint: Use the results of Table 6.10).

Problem 6.2 Give the asymptotic & steady-state point and average availability $PA_S = AA_S$ for the bridge given by Fig. 2.10, p. 53, by assuming identical and independent elements with constant failure rate $\lambda$ and constant repair rate $\mu$ (each element has its own repair crew).

Problem 6.3 Give the mean time to failure $MTTF_{S0}$ and the asymptotic & steady-state point and average availability $PA_S = AA_S$ for the reliability block diagram given by case 5 of Table 2.1 (p. 31) by assuming constant failure rates $\lambda_1, \ldots, \lambda_7$ and constant repair rates $\mu_1, \ldots, \mu_7$: (i) For independent elements (Table 6.9, p. 225); (ii) Using results for macro-structures (Table 6.10, p. 227); (iii) Using a Markov model with only one repair crew, repair priority on elements $E_6$ and $E_7$, and no further failure at system down. Compare the results by means of numerical examples (Hint: For (iii), consider Point 2 of Section 6.8.8).

Problem 6.4 Develop the expressions for mean and variance of the down time in $(0, t]$ for a repairable item with constant failure rate $\lambda$ and constant repair rate $\mu$, starting up at $t = 0$, i.e. prove Eq. (A7.220), p. 499.

Problem 6.5* Show that both diagrams of transition rates of Fig. 6.37 (p. 264) are equivalent for the computation of $MTTF_{S0}$. Is this also the case for the availability?

Problem 6.6* Give the asymptotic & steady-state point and average availability $PA_S = AA_S$ for the circuit with bidirectional connections given by Problem 2.6, by assuming identical and independent elements with constant failure rate $\lambda$ and constant repair rate $\mu$ (each element has its own repair crew).

Problem 6.7* For the 1-out-of-2 warm redundancy of Fig. 6.8a (p. 191) show that $\sum_i P_i \, MTTF_{Si} = P_0 \, MTTF_{S0} + P_1 \, MTTF_{S1}$ differs from $MUT_S$ (Hint: Consider Appendix A7.5.4.1 or Point 9 in Section 6.8.8).

Problem 6.8* For the 1-out-of-2 warm redundancy given by Fig. 6.8a (p. 191) compute for states $Z_0, Z_1, Z_2$: (i) The state probabilities $\mathcal{P}_0, \mathcal{P}_1, \mathcal{P}_2$ of the embedded Markov chain; (ii) The steady-state probabilities $P_0, P_1, P_2$; (iii) The mean stay (sojourn) times $T_0, T_1, T_2$; (iv) The mean recurrence times $T_{00}, T_{11}, T_{22}$. Prove that $T_{22} = MUT_S + T_2$ holds (with $MUT_S$ from Eq. (6.287)) (Hint: Consider Appendices A7.5.3.3, A7.5.4.1, and A7.6).

Problem 6.9* Prove the results given by Eqs. (6.206) and (6.209), pp. 238 and 239.
Problem 7.1 For an incoming inspection one has to demonstrate a defective probability $p = 0.01$. Customer and producer agree AQL = 0.01 with producer risk $\alpha = 0.1$. Give the sample size $n$ for a number of acceptable defectives $c = 0, 1, 2, 5, 10, 14$. Compute the consumer risk $\beta$ for the corresponding values of $c$ (Hint: Use the Poisson approximation (Eq. (A6.129)) and Fig. 7.3).

Problem 7.2 For the demonstration of an $MTBF = 1/\lambda = 4'000$ h one agrees with the producer the following rule: $MTBF_0 = 4'000$ h, $MTBF_1 = 2'000$ h, $\alpha = \beta = 0.2$. Give the cumulative test time $T$ and the number $c$ of allowed failures. How large would the acceptance probability be for a true MTBF of 5'000 h and of 1'500 h, respectively? (Hint: Use Table 7.3 and Fig. 7.3).

Problem 7.3 During an accelerated reliability test at an operating temperature $\theta_J = 125$°C, 3 failures have occurred within the cumulative test time of 100'000 h (failed devices have been replaced). Assuming an activation energy $E_a = 0.5$ eV, give for a constant failure rate $\lambda$ the maximum likelihood point estimate and the confidence limits at the confidence levels $\gamma = 0.8$ and $\gamma = 0.2$ for $\theta_J = 35$°C. How large is the upper confidence limit at the confidence levels $\gamma = 0.9$ and $\gamma = 0.6$? (Hint: Use Eq. (7.56), Fig. 7.6, and Table A9.2).

Problem 7.4 For the demonstration of an MTTR one agrees with the producer the following rule: $MTTR_0 = 1$ h, $MTTR_1 = 1.5$ h, $\alpha = \beta = 0.2$. Assuming a lognormal distribution for the repair times with $\sigma^2 = 0.2$, give the number of repairs and the allowed cumulative repair time. Draw the operating characteristic as a function of the true MTTR (Hint: Use results of Section 7.3.2).

Problem 7.5* For the demonstration of an $MTBF = 1/\lambda = 10'000$ h one agrees with the producer the following rule: MTBF = 10'000 h, acceptance risk 20%. Give the cumulative test time $T$ for a number of allowed failures $c = 0, 1, 2, 6$ by assuming that the acceptance risk is: (i) The producer risk $\alpha$ (AQL case); (ii) The consumer risk $\beta$ (LTPD case) (Hint: Use Fig. 7.3).

Problem 7.6* For a reliability test of a nonrepairable item, the following 20 failure-free times have been observed (ordered by increasing magnitude): 300, 580, 700, 900, 1'300, 1'500, 1'800, 2'000, 2'200, 3'000, 3'300, 3'800, 4'200, 4'600, 4'800, 5'000, 6'400, 8'000, 9'100, 9'800 h. Assuming a Weibull distribution, plot the values on a Weibull probability chart (p. 548) and determine graphically the parameters $\lambda$ and $\beta$. Compute the maximum likelihood estimates for $\lambda$ and $\beta$ and draw the corresponding straight line. Draw the random band obtained using the Kolmogorov theorem (p. 508) for $\alpha = 0.2$. Is it possible to affirm, or can one just believe, that the observed distribution function belongs to the Weibull family? (Hint: Use results in Appendix A8.1 and Section 7.5.1).

Problem 7.7* For a repairable electromechanical system, the following arrival times $t$ of successive failures have been observed during $T = 3'000$ h: 450, 800, 1'400, 1'700, 1'950, 2'150, 2'450, 2'600, 2'850, 2'950 h. Test the hypothesis $H_0$: the underlying point process is a HPP, against $H_1$: the underlying process is a NHPP with increasing density. Fit a possible $M(t)$ (Hint: Use results of Sections 7.6.3 & 7.7).
A l 1 Problems for Home-Work Problem A6.1 Devices are delivered from source A with probability p and from source B with probability I - p. Devices from source A have constant failure rate LA, those from source B have early failures and their failure-free time is distributed according to a Gamma distribution (Eq. (A6.97), p. 422) with parameters hB and ß < 1. The devices are mixed. Give the resulting distnbution of the failure-free time and the M7TF for a device randomly selected. Problem A6.2 Show that only the exponential distribution (Eq. (A6.81), p. 419), in the continuous case, and the geometnc distribution (Eq. (A6.131), p. 431), in the discrete case, possess the memoryless property (Hint: Use Eq. (A6.27) and considerations in Appendices A6.5 and A7.2). Problem A6.3 Show that the failure-free time of a series-system with operationally independent elements E, ,..., E, each with Weibull distributed failure-free times with parameters h i and ß is distributed according to a Weibull distnbution with parameters hs and ß, give hs (Hint: Consider Appendix A6.10.2). Problem A6.4 Prove cases (i), (iii), and (v)given in Example A6. 17 (p. 426). Problem A ~ S * Show that the sum of independent random variables having a common exponential distribution are Erlangian distributed. Same for Gamma distnbuted random variables, giving a Gamma distribution. Same for normal distributed random variables, giving a normal distribution (Hint: Use results of Appendix A6.10 and Table A9.7b). Problem ~ 6 . 6 * Show that the mean and the variance of a lognormally distnbuted random variable Eqs. (A6. 112) and (A6.113), p. 426, respectively (Hint: Use the substitutions X = and y = X - o I for the mean and similarly for the variance).
Problem A7.1 Prove that for a homogeneous Poisson process with parameter λ, the probability to have k events (failures) in (0, T] is Poisson distributed with parameter λT, i.e. prove Eq. (A7.41), p. 450.
Problem A7.2 Determine graphically from Fig. A7.2 (p. 446) the mean time to failure of the item considered in Example V (Hint: Use Eq. (A7.30)). Compare this result with that obtained for Case V with λ… = 0, i.e. as if no early failures were present. Same for Case IV, and compare the result with that obtained for Case IV with … → ∞, i.e. as if the wearout period would never occur.
Problem A7.3 Investigate for t → ∞ the mean of the forward recurrence time τ_R(t) for a renewal process, i.e. prove Eq. (A7.33), p. 448. Show that for a homogeneous Poisson process the mean of τ_R(t) is independent of t and equal to the mean of the successive interarrival times (1/λ). Explain the waiting time paradox (p. 448).
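The waiting time paradox addressed in Problem A7.3 can be made visible with a small simulation. The sketch below is only an illustration under assumed values (rate 2 per unit time, observation instant 25, 20'000 runs, fixed seed) and hypothetical function names; it does not replace the proof asked for in the problem. It estimates the mean forward recurrence time at a fixed instant of a homogeneous Poisson process and the mean length of the interarrival interval that covers this instant.

```python
import random


def observe_once(rate, t_obs, rng):
    """Run one HPP path from a renewal point at 0 and inspect it at t_obs.

    Returns the forward recurrence time at t_obs and the length of the
    interarrival interval that covers t_obs.
    """
    previous = 0.0
    while True:
        current = previous + rng.expovariate(rate)
        if current > t_obs:
            return current - t_obs, current - previous
        previous = current


def estimate_means(rate=2.0, t_obs=25.0, runs=20000, seed=1):
    """Average both quantities over independent simulated paths."""
    rng = random.Random(seed)
    forward_sum = 0.0
    cover_sum = 0.0
    for _ in range(runs):
        forward, cover = observe_once(rate, t_obs, rng)
        forward_sum += forward
        cover_sum += cover
    return forward_sum / runs, cover_sum / runs


mean_forward, mean_cover = estimate_means()
print(f"mean forward recurrence time: {mean_forward:.3f} (1/rate = 0.5)")
print(f"mean length of covering interval: {mean_cover:.3f} (2/rate = 1.0)")
```

The estimated mean forward recurrence time should be close to 1/λ rather than 1/(2λ), because the interval covering a fixed instant tends to be longer than a typical interarrival interval.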
Problem A7.4 Prove that for a nonhomogeneous Poisson process with intensity m(t) = dM(t)/dt, the probability to have k events (failures) in the interval (0, T] is Poisson distributed with parameter M(T) - M(0).
Problem A7.5 Investigate the cumulative damage caused by Poisson distributed shocks with intensity λ, each of which causes a damage ξ > 0 exponentially distributed with parameter η > 0, independent of the shock and of the present damage (Hint: Consider Appendix A7.8.4).
Problem A7.6* Investigate the renewal densities h_ud(t) and h_du(t) (Eqs. (A7.52) & (A7.53), p. 454) for the case of constant failure and repair (restoration) rates λ and μ. Show that they converge exponentially for t → ∞ with a time constant 1/(λ + μ) ≈ 1/μ toward their final value λμ/(λ + μ) ≈ λ (Hint: Use Table A9.7b).
Problem A7.7* Let 0 < τ_1* < τ_2* < ... be the occurrence times (arrival times) of a nonhomogeneous Poisson process with intensity m(t) = dM(t)/dt > 0 (measured from the origin t = τ_0* = 0). Show that the quantities M(τ_1*) < M(τ_2*) < ... are the occurrence times in a homogeneous Poisson process with intensity one, i.e. with M(t) = t (Hint: Consider the remarks to Eq. (A7.200)).
Problem A7.8* In the interval (0, T], the failure times (arrival times) τ_1* < ... < τ_n* < T of a repairable system have been observed. Assuming a nonhomogeneous Poisson process with intensity m(t) = dM(t)/dt > 0, show that (for given T and ν(T) = n) the quantities 0 < M(τ_1*)/M(T) < ... < M(τ_n*)/M(T) < 1 have the same distribution as if they were the order statistics of n independent identically distributed random variables uniformly distributed on (0,1) (Hint: Consider the remarks to Eq. (A7.206)).
Problem A8.1 Prove that the empirical variance given by Eq. (A8.10), p. 507, is unbiased (i.e. prove Eq. (A8.11)).
Problem A8.2 Give the maximum likelihood point estimate for the parameters λ and β of a Gamma distribution (Eq. (A6.97), p. 422) and for m and σ of a normal distribution (Eq. (A6.105), p. 424).
Problem A8.3 Give the procedure (Eqs. (A8.91) - (A8.93), p. 532) for the demonstration of an availability PA for the case of constant failure rate and Erlangian distributed repair times with parameter β_μ.
Problem A8.4* Investigate mean and variance of the point estimate λ̂ = k/T given by Eq. (7.28), p. 296.
Problem A8.5* Investigate mean and variance of the point estimate λ̂ = (k - 1)/(t_1 + ... + t_k + (n - k) t_k) given by Eq. (A8.35), p. 515. Apply this result to λ̂ = n/(t_1 + ... + t_n) given by Eq. (A8.28), p. 513.
Acronyms
ACM: Association for Computing Machinery, New York, NY 10036
AFCIQ: Association Française pour le Contrôle Industriel de la Qualité, F-92080 Paris
ANSI: American National Standards Institute, New York, NY 10036
AQAP: Allied Quality Assurance Publications (NATO-Countries)
ASQC: American Society for Quality Control, Milwaukee, WI 53203
BWB: Bundesamt für Wehrtechnik und Beschaffung, D-56000 Koblenz
CECC: Cenelec Electronic Components Committee, B-1050 Bruxelles
CENELEC: European Committee for Electrotechnical Standardization, B-1050 Bruxelles
CNET: Centre National d'Etudes des Telecommunications, F-22301 Lannion
DGQ: Deutsche Gesellschaft für Qualität, D-60549 Frankfurt a. M.
DIN: Deutsches Institut für Normung, D-14129 Berlin 30
DOD: Department of Defense, Washington, D.C. 20301
EOQC: European Organization for Quality Control, B-1000 Brussel
EOS/ESD: Electrical Overstress/Electrostatic Discharge
ESA: European Space Agency, NL-2200 AG Noordwijk
ESREF: European Symp. on Rel. of Electron. Devices, Failure Physics and Analysis
ETH: Swiss Federal Institute of Technology, CH-8092 Zürich
EXACT: Int. Exchange of Authentic. Electronic Comp. Perf. Test Data, London, NW4 4AP
GIDEP: Government-Industry Data Exchange Program, Corona, CA 91720
GPO: Government Printing Office, Washington, D.C. 20402
GRD: Gruppe Rüstung, CH-3000 Bern 25
IEC (CEI): International Electrotechnical Commission, CH-1211 Genève 20, P.O. Box 131
IECEE: IEC System for Conformity Testing and Certif. of Electrical Equip., CH-1211 Genève 20
IECQ: IEC Quality Assessment System for Electronic Components, CH-1211 Genève 20
IEEE: Institute of Electrical and Electronics Engineers, Piscataway, NJ 08855-0459
IES: Institute of Environmental Sciences, Mount Prospect, IL 60056
IPC: Institute for Interconnecting and Packaging El. Circuits, Lincolnwood, IL 60646
IRPS: International Reliability Physics Symposium (IEEE), USA
ISO: International Organisation for Standardization, CH-1211 Genève 20, P.O. Box 56
MIL-STD: Military (USA) Standard, Standardiz. Doc. Order Desk, Philadelphia, PA 19111-5094
NASA: National Aeronautics and Space Administration, Washington, D.C. 20546
NTIS: National Technical Information Service, Springfield, VA 22161-2171
RAMS: Reliability, Availability, Maintainability, Safety; also Rel. & Maint. Symposium, IEEE
RIAC: Reliability Information Analysis Center, Utica, NY 13502-1348 (formerly RAC)
Rel. Lab.: Reliability Laboratory at the ETH (since 1999 at EMPA S173, CH-8600 Dübendorf)
RL: Rome Laboratory, Griffiss AFB, NY 13441-4505
SAQ: Schweizerische Arbeitsgemeinschaft für Qualitätsförderung, CH-4600 Olten
SEV: Schweizerischer Elektrotechnischer Verein, CH-8320 Fehraltorf
SNV: Schweizerische Normen-Vereinigung, CH-8008 Zürich
SOLE: Society of Logistic Engineers, Huntsville, AL 35806
VDI/VDE: Verein Deutscher Ing./Verband Deut. Elektrotechniker, D-60549 Frankfurt a. M.
Index
(less relevant places (not bold) are omitted by some terms)
A priori / a posteriori probability 401,511
Absolutely continuous 403
Absorbing state 471-72
Accelerated test 35,81,86,98,99,102,307-12,352,426,535 (see also 312-34)
Acceleration factor 37,99,308-11
Acceptable Quality Level (AQL) 86,284-86,530
Acceptance line 283-84,301-02,528-29
Acceptance test → Demonstration
Accessibility 8,118,151-52
Accident prevention 9,362
Accumulated → Cumulative
Acquisition cost 11,13,14,357
Activation energy 37,97,99,103,308-09
Active redundancy 43,44,61-64,195,206,210,211,225,227,361
Addition theorem 397,399
Adjustment 118,152
Age replacement 134,234
Aging 6,405 (see also Wearout and Bad-as-old)
Alarm circuit 47
Allocation (reliability) 67
Alternating renewal process 168,452-56
Alternative hypothesis 280,291,298,305,525-26
Alternative investigation methods 267-76
AMSAA model 330
Anderson-Darling statistic 534
Antistatic container 148
AOQ / AOQL 281-82
Aperture in shielded enclosure 144
Approximate expressions 59,61,131-34,179-80,188,192,195,198-200,206,211,227,219-30,236,238,240,243,245,266,430
Approximation of a reliability function 192
Approximation of a repair funct. 114-15,198-200
AQL → Acceptable Quality Level
Arbitrary failure and repair rates 164,168,186,200,231-32
Arbitrary initial conditions (one item) 176-78
Arbitrary repair rate 177,185,188,200,206,215,241,489,490
Arithmetic random variable 402,428-31
Arrhenius model 37,97,102,307-09
Arrival rate 494,501
Arrival time 321,331,442,494,496
As-bad-as-old 40,497
As-good-as-new 5,6,8,9,40,164,171,232,234,236,242,249,251,253,254,365,294,319,353,356,358,359,404,497
Assessed reliability 3
Asymptotic behavior 178-80,447-50,455,474-76,486-88,491-92 (see also Stationary and Steady-state)
Asynchronous logic 149
Automatic test equipment (ATE) 88
Availability
  demonstration 291-92,293,531-32
  estimation 289-90,293,523-24
Average availability (AA) 9,171-72,176,177,476,487 (see also Intrinsic, Operational, Overall, Point, Technical availability)
Average Outgoing Quality (AOQ) 281-82
Axioms of probability theory 394
Backdriving 340
Backward recurrence time 447
Bad-as-old (BAO) 40,405,497
Bathtub curve 6-7,422
Bayes theorem 401,414
Bayesian estimate / statistics 414,511
Bernoulli distribution → Binomial distribution
Bernoulli trials 427-28,431,433,513
Bernoulli variable 427
Bidirectional connection 31,53,554
Binary decision diagram (BDD) 271
Binomial distribution 408-09,427-29,517-20,527-28
Birth and death process 131,207,211,479-83
BIST → Built-in self-test
BIT → Built-in test
BITE → Built-in test equipment
Black model 97
Bonding 95,100
Boolean function method 58-61
Bottom-up 72,156,158,355
Boundary-scan 149
Bounds 59,179-80
Branching process 499
Breakdown 96-97,102,106,140,144,145
Bridge structure 31,53-54
Built-in self-test (BIST) 150
Built-in test 66,116-18,149-51
Built-in test equipment (BITE) 116
Burn-in 6,339,342,352
Capability 13,66,72,154,248,352,376,479
Capacitors 140,141,143,146,521
Captured 181
CASE 157
Cataleptic failure 4,6
Causes for defects 66,155-57,329,341
Causes for failures 3-4
Cause-to-effects-analysis 15,66,72-80,153,157-58,329,356
Cause-to-effects-chart 76,356
CDM → Charged device model
Censoring 295,296,298,323,324,331,504,515
Central limit theorem 126,434-37,449
Centralized logistic support 125-29,130
Ceramic capacitor 140,143,146,147,521
Change 379
Chapman-Kolmogorov equations 462
Characteristic function 539,545
Characterization 90-92,108
Charge spreading 103
Charged device model 94
Chebyshev inequality 411,293,433,434
Check list 79,120,372-75,376-82,383-87
Chi-square (χ²) distribution 408-09,423,540
Chi-square (χ²) test 316-18,535-38
Classical probability 395
Clock 66,144,146,150,151
Clustering of states 222
CMOS terminals 145
Coating 142
Coefficient of variation 128,411
Coffin-Manson 109,311
Coherent system 57,61
Cold redundancy → Standby redundancy
Common-cause 72,260-64
Common-mode currents 144
Common mode failures 42,66,72,260,361
Comparative studies 15,25,26,31,44,48-49,78,103,116,119,130,133,164,194,220-21,234,261,446,550-54
Complement (complementary events) 392
Complex structure 52,231
Complex system 64-66
Composite shmoo-plots 91
Compound failure rate 310
Compound process → Cumulative process
Computer-aided reliability prediction 272-76
Concurrent engineering 1,11,16,17,19,21,353,357,360,376
Conditional density / distribution function 404,412-14,485,491,495,497
Conditional expected value 405,414
Conditional failure rate 405,497
Conditional probability 396-97,444,460,494,501
Confidence ellipse 278-79,518-19
Confidence interval 279,290,297,516-24
Confidence level 516
Confidence limits 516
  availability 289-90,523-24
  failure rate λ 296-97,520-23
  failure rate λ_S at system level 298
  parameters lognormal distribution 305
  unknown probability 278-80,516-520
Configuration accounting 379
Configuration auditing 374,378-79
Configuration control 158,374,379
Configuration management 16,21,152,157,158,335,353,378-81
Conformal coating → Coating
Congruential relation 274
Connector 140,145,146,148,152
Consecutive k-out-of-n system 45
Consistent estimates 511-12
Constant acceleration test 339
Constant failure rate 6-7,35,40,172,165,177,179-80,294-303,405,419-20,450-51,460-83
Constant repair rate 171,181,177,182-84,189-96,207-11,213-30,238-40,243-71,460-83
Consumer risk 86,281,284,291,299,302,526,532
Contamination 85,93,98
Continuity test 88
Continuous random variable 403-04,408,412-14
Controllability 149
Convergence almost sure → Conv. with prob. one
Convergence in probability 433
Convergence quickness 127,179-80,279,290,297,303,313,394,506,507-08,519,522,530
Convergence with probability one 434
Convolution 416-17,545
Cooling 84,140-42,146
Corrective actions 16,21,22,72,73,77,80,104-05,153,336,389-90
Corrective maintenance 8,113,118,120,154,353
Correlation 78,415,425,440-41
Corrosion 83,85,98-99,102,103,142,311
Cost / cost equation 12,14,136-38,235,242-43,342-49,357,364,369-70,372,376,428,476
Cost effectiveness 13,353
Cost optimization 11,13,16,353,357
Count function 442,451,493,499
Covariance matrix 415
Coverage → Incomplete coverage, Test Cover.
Cracks 85,93,102,104,106,108,109,111
Cramer - von Mises test 322,534
Critical design review → Design review
Critical operating states 264
Criticality 72-73,78,153,158,161
Criticality grid / criticality matrix 72-73
Cumulated states 259,477
Cumulative damage 499,559
Cumulative operating time 294-303,309,515
Cumulative process 237-38,498-500
Customer requirements 365-68,369-71
Cut sets → Minimal cut sets
Cut sets theorem 509,512
Cutting of states 222,230,273
Cycle 275,455-57,491
Damage 85,93,94,95,100,104,106,107,109,110,311,312,329,336,337,340
Damp test → Humidity test
Data collection 21,22,23,360-61,388-90
Data retention 89,97-98
DC parameter 88,92
De Moivre-Laplace theorem 434,518
Death process 61-63
Debug test 159
Debugging 153,158
Decentralized logistic support 129-30,134
Decoupling capacitor 66,143,146
Defect 354
  examples 3-4,21,23,66,67,72,78,113,117,152-61,302-04,341,343,344-49,362
  localization 337
  prevention 66,78,155-59
  (see also Dynamic defect)
Defect tolerant 152-53,155
Defective prob. 12,86,277-86,337,341,343
Deferred cost 12,14,342,342,344-47
Definition of probability 394-95
Deformation mechanisms / energy 109,341
Degradation 4,7,66,92,96,101,112,248,264
Degree of freedom 423,540-42
Demonstration
  availability 291-92,293,531-32
  defective (or unknown) probability p 283,280-86,287-88,526-30
  const. failure rate λ or MTBF=1/λ 301,298-303,370-71
  MTTR 305-07,371
Dendrites 95,100
Density 403,408,413
Dependability 9,11,13,19,354,366,367,479
Derating 33,82,84,86,139-40,354
Design FMEA/FMECA 72,78
Design guidelines 25-27,66,77,80,84,374,377
  maintainability 149-52
  reliability 139-48
  software quality 152-61
Design reviews 21,27,77,79,107,120,153,159,354,374,378,381,383-87
Design rules → Rules
Destructive analysis 104
Device under test (DUT) 88
Diagnosis → Fault isolation
Diagram of
  state transition 187,201,215,244,489,490
  transition probabilities 62,183,191,196,208,214,229,465-68,471,472,479,481
  transition rates 231,239,240,245,246,247,250,252,256,261,263,264,269
Difference between → Distinction between
Different elements 194-96,225,227
Differential equations (method of) 190,469-72
Directed connection 31,55
Discrete random variable 402,408-09
Discrimination ratio 281,300
Dislocation climbing 109,341
Distinction between
  arrival times and interarrival times 494-95
  time and failure censoring 295,515,520-22
  λ(t) and f(t) 404
  λ(t) and z_S(t) 7,501,356
  z_S(t), m(t) and h(t) 7,356,444-45,501
  P_i(δt) and Q_ij(δt) 465
  P_i and 𝒫_i 475,487,488
  t_1*, t_2*, ... and t_1, t_2, ... 319,331,494
  τ_1*, τ_2*, ... and τ_1, τ_2, ... 319,331,494
Distributed system / structure 52
Distribution function 401-02,408-09,412,419-32
Documentation 6,15,118,154-56,375,378,379,380,381
Dominant failure mechanism 37-38,310
Dormant state 33,36,140
Double one-sided sampling plan 285-86
Down state 265-66,452-53,469
Down time 123,124,136,173-74,235,476,499 (see also MDT)
Drift 52,67,71,76,83,100,113,142,146,550-54
Drying material 142
Duane model 330-32
Duration → Frequency / duration
Duration (sojourn, stay) → Stay time
Duty cycle 38,67,273,370
Dwell time 98,108,109,339,341
Dynamic burn-in 101,109,339
Dynamic defect 3-4,152,354,362,363,410
Dynamic fault tree 270-71
Dynamic parameter 88,145
Dynamic stress 69,144
Early failures 6-7,35,315-16,323,326,328,329,337,342,352,354,355,406,445-46
Early failure period 6-7,315,323,328
Ecological / Ecologically acceptable 10,369,370
EDF → Empirical distribution function
EDX spectrometry 104
Effect → Failure effect
Effectiveness → Cost effectiveness
Efficient estimates 511,512
Electrical overstress 148
Electrical test
  assemblies 340-41
  components 88-92
Electromagnetic compatibility (EMC) 82,84,108,139,143-44
Electromigration 6,95,97,103,311
Electron beam induced current (EBIC) 104
Electron beam tester 91,104
Electrostatic Discharge (ESD) 89,94,102,104,106-07,108,139,144,148,335
Elementary event 392
Elementary renewal theorem 447
Elements of a quality assurance system 21
Embedded Markov chain 274,464,475,483,486,487,488,491
Embedded renewal process 169,203,452,453,456,484,491
Embedded semi-Markov process 197,215,440,485,488,489-91
Embedded software 153,157
EMC → Electromagnetic compatibility
Emission → EMC
Emission microscopy (EMMI) 104
Empirical distribution function 312-17,504-10
Empirical evaluation of data 314-17,421,503-10,547-49
Empirical failure rate 5
Empirical mean / variance 4,303,304,506-07
Empirical reliability function 4-5
Empty set 392
Environmental
  conditions / stress 10,28,33,36,82,83
  stress screening → ESS
Environmental and special tests
  assemblies 108-09
  components 92-100
Equivalence between asymptotic, steady-state, stationary 180-81,450,476,487
Equivalent event 392
Erlang distribution 186,423
Error / mistake 3,6,9,76,78,95,153,156-57,329,354,355,356,362,386
Error correcting code 153
ESD → Electrostatic discharge
ESS 6,341,349,352,354-55,362
Estimate 511,503-24
Estimation
  availability 289-90,293,523-24
  defective probability p 279,278-80,287-88,513,516-20
  failure rate λ or MTBF = 1/λ (T fixed) 297,295-98,513,515,520-21
  failure rate λ (k fixed) 295,521-22
  MTTR 303-05
  Nonhomog. Poisson process 331-32,497
  point / interval (basic theory) 511-24
Euler integral 544
Event field 391-94
Exchangeability 118,151-52
Expanding 2-out-of-3 to 1-out-of-3 red. 47,544
Expected percentage of performance 513
Expected percentage of time in a state 206,513
Expected value (mean) 4,406,415,416,506
Exponential distribution 408-09
Extreme value distributions 421
Extrinsic 3-4,86,355,389
Eyring model 99,102,311
Fail-safe 1,9,66,72,157
Failure 1,3-4,6-7,22,23,61-62,64-65,78,355
Failure analysis 87,89,95,102-07,111
Failure cause 3-4,72-73,78,102-05,355-56,389
Failure effect 4,72-80,87,101,355-56,389,363
Failure-free operating time → Failure-free time
Failure-free time 3-6,39-40,404,420
Failure frequency → System failure frequency
Failure hypothesis 69-70
Failure intensity 5,7,355,501-02
Failure isolation → Fault isolation
Failure mechanism 4,33-38,92,96-100,102,103,307-12,337,339,406
Failure mode 3,27,42,51,101,356,362,389
  examples 30,51,64-66,550-54
  distribution 100,550-54
  investigations 64-66,72-77,236-47,255-58
Failure mode analysis → FMEA / FMECA
Failure propagation → Secondary failures
Failure rate 4-7,33-38,355,404-05,409,419-20
Failure rate analysis 26,28-67
Failure rate confidence limits
  at components level 296-98
  at system level 298
Failure rate estimation 296-98,513,520-22
Failure rate demonstration 298-303
Failure rate models / HDBKs 35-38,99,310-12
Failure rate of mixed distributions 41,404-06
Failure recognition 101,116-18,149-51,236-46
Failures with constant failure rate λ 6-7,35
False alarm 66,232,241,246
Fatigue 88,98,311,421 (see also Wearout)
Fault 4,72,356
Fault coverage → Incomplete coverage
Fault isolation 116-17
Fault model 90,91,236-64
Fault modes and effects analysis → FMEA
Fault recognition 112,115,116-18,119,149
Fault tolerant system 47,64-65,66,101,153,157,162,165,231,233,248-60,264,476,478
Fault tree / Fault tree analysis (FTA) 66,76,78,270-71,356
Feasibility / feasibility check 10,19,77,121,154,354,378,381,383,384
Field of events 391-94
Fine leak test 339-40
Finite element analysis 69
First delivery 350
First-in / first-out 164,232,273
Fishbone diagram → Ishikawa diagram
Fisher distribution 290,291,523,532,429,542-43
FIT (Failures in time) 36
Fitness for use 11,360
Fixed length test → Simple two-sided test
Flow of system failures 161,294,330,497
FMEA/FMECA 27,42,66,69,72-75,78,117,237,248,264,355,377
Force of mortality 7,356
Forward recurrence time 175,178,180,446-47,448,451,454 (see also Rest waiting time)
Frequency / duration 231,148,255,259-60,266,475,476-78,487
FTA → Fault Tree Analysis
Functional block diagram 29,68,256,271
Function of a random variable 405,410,426
Functional test 88
Gamma distribution 408-09,422-23
Gamma function 544
Gate review 378,381
Gaussian distribution → Normal distribution
General reliability data 319-28
Generation of nonhomog. Poisson processes 497
Generator for stochastic processes 275-76
Geometric distribution 408-09,431
Geometric probability 395,408-09,431
Glassivation → Passivation
Glitches 66,146
Glivenko-Cantelli theorem 505
Gold-plated pins 94,147
Gold wires 100
Good-as-new → As-good-as-new
Goodness-of-fit tests 312-18,322,533-38
Graceful degradation 66,248
Grain boundary sliding 109,341
Grigelionis theorem 498
Gross leak 339-40
Ground 143-45,146,147,152
Guard rings 144,146
Guidelines → Design guidelines
HALT 312
HAST 89,98-99,312
Hazard rate 5
HBM → Human body model
HPP → Poisson process (homogeneous PP)
Hermetic enclosure 142,148
Hermetic package 85,102,104,142,337,339
Hidden defect 14,117
Hidden failures 8,66,79,107,113,116,117,120,149,150,233,241-46,243,359
High temperature storage 89,98,337
Higher-order moments 410,411,507
Highly accelerated tests 312
Historical development 16,17,85
Homogeneous → Time-homogeneous
Homogeneous Poisson process → Poisson proc.
Hot carriers 96,102,103
Hot redundancy → Active redundancy
Human aspects / factors 2,3,9,27,73,76,77,152,153,157-58,352,361,363,373,385
Human body model (HBM) 94
Human errors 10,119,157
Human reliability → Risk management
Humidity tests 89,98-100 (see also HAST)
Hypergeometric distribution 408-09,432
Idempotency 61,392
Imperfect switching → Switching
In-circuit test 340
Inclusion / Exclusion 400
Incoming inspection 90,101,145,336,340,343,344-49
Incomplete coverage 241-46,267
Independent elements 52
Independent events 397,398
Independent increments 439-40
Independent random variable 394,413,415,416,416-18,419,422,423,434,465
Indicator 56,57,58,61
Indices 167
Indirect plug connectors 152
Inductive / capacitive coupling 91,143,146
Industrial applications (environment) 37,38,140
Influence of prev. maintenance 134-36,233-36
Influence of repair time distribution 114-15,133-34,198-200
Information feedback 22,360-61,390
Infrared thermography (IRT) 104
Inherent → Intrinsic
Initial conditions 63,176,178,180,190,191,208,449-50,454-55,462,469,471,485
Initial distribution 449,459,460-61,463,475-76,485,486-87,491-92
Input/output driver 146
Inserted components 84,108,110
Integral equations (method of) 166,185,193-94,211-12,216-17,473-74
Integral Laplace theorem 434,518
Integrated circuits (ICs) 34-37,84-85,89,90-100,142,149,336,337-40
Intensity 7,296,321,451,493,497,498,501,502
Interaction 66,253,156
Interarrival time 5,294,319,323,328,442,494
Interchangeability 8
Interface 78,82,96,97,103,118,139,146,154,157
Intermetallic compound / layer 100,103,109
Internal redundancy → Active redundancy
Internal visual inspection 89,93,104
Intersection of events 392
Interval estimation 278-80,289-90,293,296-98,305,516-24
Interval estimate at system level 298
Interval reliability 166-67,172,177,181,188,193,195,198,211,265,454
Intrinsic 3-4,9,86,139,355,389
Inverse function 405
Ion migration 103
Irreducible Markov chain 459-60,475-76,486-87,491
IRT → Infrared thermography
Ishikawa diagram 76-77,78,356
ISO 9000: 2000 family 11,366-67
Item 2,357
Jelinski-Moranda 160
Joint availability 174-175,177
Joint density / distribution 412-13,494-95
Junction temperature 33,34,35,37,79,84,85,140-42,145,309
k-out-of-n: G → k-out-of-n redundancy
k-out-of-n redundancy 31,44,61-64,130,206-12,211,225,227,271,479,489-90
Kepner-Tregoe 76,78
Key item method 52-55,60,68-69
Key renewal theorem 178,179,448,455,457,491
Khintchine theorem 451,498
Kirkendall voids 100
Kolmogorov backward / forward eqs. 462
Kolmogorov-Smirnov test 312-17,322,332,497,534,536-37,543
Korolyuk theorem 493
kth moment / central moment 410-11
Laplace test 324
Laplace transform 545-46
Last repairable unit → Line replaceable unit
Last replaceable unit → Line replaceable unit
Latch-up 89,96,145,148
Latent damage → Damage
Law of large numbers 433-34
Leak test → Seal test
Liability → Product liability
Life cycle cost (LCC) 11,13,16,112,353,357,364,369,370,377
Life-cycle phases 19 (hardware), 154 (software)
Lifetime 357
Like new → As-good-as-new
Likelihood function → Max. likelihood function
Limit theorems of probability theory 432-37
Line repairable unit → Line replaceable unit
Line replaceable unit (LRU) 115,116,118,120,125,149
Liquid crystals 104
List of preferred parts (LPP) → Qualified part list
Load capability 33
Load sharing 43,45,52,61-64,163,164,190,194,207,458,488
Logarithmic Poisson model 333
Logistic support 8,115,119,125,129,235,357
Lognormal distribution 113-15,303-07,408-09,425-26,547
Long-term stability 86
Lot tolerance percent defective 284-85,530
Lowest replaceable unit → Line replaceable unit
LRU → Line replaceable unit
LTPD → Lot tolerance percent defective
Macro-structures 165,222,227,264
Maintainability 1,2,8,9,12,13,21,112-15,357,366,367,368
Maintainability analysis 72,115-24,149-52,373,375
Maintainability estimation/demonstr. 303-07,371
Maintainability program → Maintenance concept
Maintenance 8,113
Maintenance concept 8,112,115-20,373,375
Maintenance levels 119-20
Maintenance strategy 35,134-36,233-36
Majority 31,47,66,215
Manufacturing processes 106-11,147-48,335-50,378
Manufacturing quality 16,20,86,335-50
Margin voltage 98
Marginal density / distribution function 413
Marking 306
Markov chain 244,268,274,458-60,461,463,464,475,483,485,486,487,488
Markov models 61-64,166-67,170-71,189-93,195,211,220-21,225,227,226-30,238-40,260-63,264-67,440,460,466-68,471,479
Markov process 166-67,440,460-83,487
Markov renewal property 465 (see also Memoryless property)
Markov renewal processes 483
Match / Matching 144,146
Mathematical statistics 503-38
Maximum likelihood function / method 278,89,296,304,305,313,319,322,331,512-15,536
Mean (expected value) 406-07,410,415,416
Mean down time (MDT) 124,259,266,478
Mean (for rel. applications) → MDT, MTBF, MTBUR, MTTF, MTTPM, MTTR, MUT
Mean logistic delay 235
Mean operating time between failures (MTBF) 6,39-40,358 (see also 294-303,369-71 for estimation & demonstration of MTBF=1/λ)
Mean time to failure (MTTF) 6,39,40,63,166-67,195,211,220-21,225,227,358,474,486
Mean time to preventive maintenance (MTTPM) 113,121,125,358
Mean time to repair (MTTR) 8-9,113,121-24,359 (303-07 for estimation & demonstration)
Mean time to restoration → Mean time to repair
Mean up time 6,265,477
Mean value function 321,324,328,333,493,496
Mechanical reliability 67-71
Mechanism → Failure mechanism
Median 412
Meshed structure 52
Memories 90-91,93,97-98,146
Memoryless property 7,40,63,136,172,192,234,295,298,405,420,431,440,451,464,465,478
Metal migration 103 (see also Electromigration)
Metallographic investigation 104-05,108-09
Method of differential eqs. 167,190-91,469-72
Method of integral eqs. 166,193-94,473-74,486
Metrics (software quality) 153
Microcracks → Cracks
Microsection 104,105,108,110
Minimal cut sets 59,60,76
Minimal operating state → Critical oper. states
Minimal path sets 58,60,76
Mission availability 173
Mission profile 3,15,28,38,68,79,231,357,370
Mistake → Error
Mixed distribution function 403
Mixture of distributions 7,41,316,406
Modal value 412
Mode → Failure mode
Models for failure rates 35-38 (see also Mixture)
Models for faults → Fault model
Modification 379
Moisture 98-99,142
Module / Modular 118,120,149,150-59
Moment generating function 545
Monotony 57
Monte Carlo simulation 165,231,233,272,273-76,426,435,436 (see also Generation and Generator)
Motivation and training 24,119,375
MDT → Mean down time
MTBF → Mean operating time between failures
MTBUR 8,358
MTTF → Mean time to failure
MTTPM → Mean time to prev. maintenance
MTTR → Mean time to repair / restoration
MUT → Mean up time
Multidimensional random var. 412-16,438-41
Multifunction system → Phased-mission system
Multilayer 143,148
Multimodal 412
Multinomial distribution 318,429,537,538
Multiple failure mechanism 64-65,310,312,319,341,406
Multiple failure mode 52,64-65,66,246-47,255-58
Multiple faults / consequences 76
Multiple one-sided sampling plans 285-86
Multiplication theorem 398-99
Mutually exclusive events 57,171,174,237,392,393,394,397-98,400,446
MUX 150,151
Nitride passivation → Passivation
Nonconformity 354,359
Nondestructive analysis 102-05
Nonhomogeneous Poisson process 161,321-34,451,493-97
Nonregenerative state 201,210,440,490
Nonregenerative stochastic process 164,186,200,212,488,492-502
Nonrepairable item (up to system failure) 5,7,39-57,61-71,236-37,240,243,245,254,260,270,272
Normal distribution 113,126-127,408-09,424-25,434-35,449,496,539,549
Number of states 56,219
N-version programming (NVP) 47
OBIC 105
Object oriented programming 157
Observability 149
Obsolescence 8,118,138,145,357
Occurrence time → Arrival time
One-item structure 39-41,168-82
One-out-of-2 redundancy (1-out-of-2 redundancy) 42-43,189-206,225,227,236-45,247,260-64,466,470-72,488-92
One-sided confidence interval 280,290,297,316,319,321,322,324,516,519
One-sided sampling plan (for p) 284-86,529-30
One-sided tests to demonstrate λ or MTBF=1/λ 302-03
Only one repair crew → models of Chapter 6 except pp. 210,224-25
Operating characteristic / curve 281-82,284-85,300,306-07,527-28,530
Operating conditions 2,3,7,26,28,33,35,79,84,90,96,99,102,354,365
Operation monitoring 116
Operational availability 235
Operational profile 28
Optical beam induced current (OBIC) 104
Optimal derating 33,140
Optimal preventive maintenance 234-36,242-43
Optimization 12-15,67,120,136,138,342-49,353,364
Optocoupler 146
Order observations / sample / statistics 312,313,321,323,324-25,332,495,496,504,506,535
Organizational structure (company) 20
Overall availability 9,235
Overstress 33,103,148,336
Oxide breakdown 96-97,102,103,106,311
Packaging 84-85,89,100,142
Parallel model 43-45,61-64,195,206,211,225,227,236-43,247,466-39,470-72,489,490
Parallel redundancy → Active redundancy
Parameter estimation 278-80,289-90,293,294-98,303-05,331-32,511-24
Pareto 76,78
Part count method 51
Part stress method 33-38,50-51 (see also 69-71)
Partitioning 115,118,157,158
Partitioning cumulative operating time 294,295,301,371
Passivation / Passivation test 89,93,104,106
Path set → Minimal path sets
Pattern sensitivity 91,93
PCBs → Populated printed circuit boards
Pearson 517,535
Percentage point 412
Performability 259
Performance → Capability
Performance effectiveness → Reward
Performance test 108
Petri nets 267-69
Phased-mission systems 28,30,38,231,248-55
Physics-of-failures 102-07 (see Failure mech.)
Pitch 84,109,147,341
Plastic packages → Packaging
Point availability 9,170,178,181,166-67,190,289-93,352,454
Point estimate 278,289,296,303,332,511-15
Point estimate at system level 298
Point process (general) 500-02
Poisson approximation 430
Poisson distribution 283,294,408-09,429-30
Poisson integral 544
Poisson process
  homogeneous (HPP) 7,294,295-96,320,323-27,356,445,448,450-51,515,493-97 for m(t)=λ
  nonhomogeneous (NHPP) 161,321-34,451,493-97
Populated printed circuit board (PCB) 84,85,90,94,107-11,116,144,146-48,152,336,340-41
Power devices / supply 96,98,99,108,143,145,146,147,150,152
Power Law process 230
ppm 337,424
Precision measurement unit (PMU) 88
Predicted maintainability 121-25
Predicted reliability 3,25-27,28-71,172-276,372
Preferred list → Qualified part list (QPL)
Preheating 147
Preliminary design reviews → Design reviews
Pressure cooker → HAST
Preventive action 16,22,72,77,112,139-52,155-58,341,371-82
Index Preventive maintenance 8,112-13,233-36,24143,359 Printed circuit board -+ Populated printed C. b. Probability 393-96 Probability chart 314-15,317,421,509-10,547-49 Probability density + Density Probability plot paper -t Probability chart Problems for Home-Work 554-59 Procedure for analysis of complex systems 264-66 analysis of mechanical systems 69 electrical test of complex ICs 88-90 demonstration of availability (PA=AA) 291-93 MTTR 305-07 probability p 280-86,287-88,526-30 li. or MTBF= 1Ili. 298-303 estimation of availability (PA=AA) 289-90 MTTR 303-05 probability p 278-80,287-88,513,516-20 h or MTBF= 1Ih 296-98 (see in particular 279,290,297) ESD test 94 FMEAIFMECA 72-75 frequency I duration 265-66,476-79 0 graphical estimation of F(t) 507-10 (see also 312-17,533-34,547-49) Goodness-of-fit tests Anderson-Darling 534 Cramer - von Mises 322,534-35 Kolmogorov-Smimov 3 l2-17,322, 333-34,534,536-37 (see also 504-10) X* test 316-18,535-38 mechanical system's analysis 67-68,69 modeling complex rep. systems 264-66 qualification test assemblies 107-11 complex ICs 89,87-107 first delivery 349-50 reliability allocation 67 reliability prediction 3,25-27,28-7 l,l72-276,372-73 (see 67-71 for mechanical reliability) reliability test accelerated tests 307-12 technical aspects 101,109,337-40 statistical aspects 277-334,503-38 (see in particular 283,297,301) screening of assemblies 333-41 (see also 107-11) components 366-40 (see also 92-100) sequential test 283-84,300-01,528-29
  simple one-sided test plan 284-86,302-03,529-30
  simple two-sided test plan 280-83,298-301,527-28
  software development / test 156 / 158
  test and screening strategy 342-49
  transition probabilities (determination of) 185,187,193,244,464,489-91
Process FMEA/FMECA 72,78
Process reliability 3
Process with independent increments 333-34,439-40,451,493-97
Process with stationary increments 441,451,494
Producer risk 86,281,284,291,299,302,526,532
Product assurance 16,359,367,368
Product liability 9-10,15,354,359,360,379
Production process 6,21,87,98,106-07,108,335-36,342-44,354-55,360,365,368,378
Program / erase cycles 97-98,338
Project management 17-24,152-61,369-82
Prototype 18,19,87,107,312,329,343,374,375,377,380,381,384,386,387
Pseudo redundancy 42,361
Pseudorandom number 274
Pull-up / pull-down resistor 145,147,150
Purple plague 100,103
Quad redundancy 65,66,101
Quadrate statistics 534
Qualification tests 21,343,374,378,380,381
  assemblies 107-11
  components 89,87-107
Qualified part list (QPL) 87,145,372,378,385
Quality 11,360
Quality & reliability assurance progr. 17,371-82
Quality & reliability requirements 365-68,369-71
Quality and reliability standards 365-68
Quality assurance 11,13,16,17-24,152-61,360,372-75,376-82
Quality assurance system 21,366
Quality attributes for software 157
Quality control 13,16,21,158,277-86,336,360
Quality cost optimization → Cost / cost equations
Quality data reporting system 22,360-61,388-90
Quality growth (software) 159-61
Quality handbook 21
Quality management 16,20,21,24,360,361,366 (see also Quality assurance and TQM)
Quality of manufacturing 16,21,86,335-36
Quality metric for software 153
Quality tests 21,361,376,380
Quantile 412,540-43
Quick test 116
Random duration (phased-mission systems) 274
Random sample → Sample
Random variable 401-03
Random vector 412-15,438-41
Rare event 10,272,273,275
Reachability tree 268-69
Reconfiguration 66,118,157,231,248-60
  time censored (phased-mission system) 248-55
  failure censored 255-58
  with reward and frequency / duration 259-60
Recrystallization 109
Recurrence time 174,175,178,446-51,494
Recycling 10,19,357
Redesign 8,329
Reduction (diagram of transition rates) 264(P.2)
Redundancy 42-45,47,51,61-64,65,66,68,189-92,195,211,220-21,225,227,236-46,260-64,361
  in software 47,153,157
Reflow soldering 147
Refuse to start 239
Regeneration / renewal point 201,440,442,453,456,473,484,489,490
Regeneration state 200,215-16,440,456,464,484,489
Regenerative process 456-57
Rejection line 283-84,301-02,528-29
Relation between 4 and Pi 475 (see also Distinction between)
Relative frequency 278-79,393,394-96,513,516
Relaxation 109
Reliability 2,13,27,66,69,72,231,361,367,372
Reliability allocation 67
Reliability analysis 13,25-27,66,67-71,80,139-48,162-67,372-73,377-78
Reliability block diagram (RBD) 28-32,68-69,362 (see 231-76 if the RBD doesn't exist)
Reliability function 2-3,166-67,169,176,361,404,471-72,473-74,486
Reliability growth 329-34,362 (see also 159-61)
Reliability prediction → Procedure for
Reliability tests → Procedure for
Remote control / diagnostic 117-18,120
Renewal density 443-44
Renewal density theorem 179,448
Renewal equation 444
Renewal function 443
Renewal point → Regeneration point
Renewal process 164,441-51
  embedded 203,452-52,456,484,487,491
Repair 8,113,163-64,353,359
Repair frequency → System repair frequency
Repair priority 214,227,229,232,239,240,247,256,264,466,468
Repair rate 115,170-71,177,214,466,468
Repair strategy → Maintenance strategy
Repair time 8,113-14,121-24,303-07,359
Repairable spare parts → Spare parts
Repairable Systems 5,162-276
Repairable versus nonrepairable 40
Repairability → Corrective maintenance
Replaceability 152
Replacement policy 236
Requalification 87
Required function 28,362
Requirements → Quality and rel. requirements
Reserve contacts 152
Reserve/reserve state 43,62,163,190,201
Rest waiting time 221,494
Restart anew 171,440,456
Restoration 8,112,353
Restoration frequency → System repair freq.
Results (tables / graphs) 31,44,48-49,111,127,166-67,177,181,188,195,206,211,220-21,225,227,230,234,258,279,283,290,292,297,301,302,309,315,408-09,451,468,510,522
Reuse 10,116,119,130
Reward 231,255,259-60,266,476,478-79
Rework 108,148,341
Rise time 143,144
Risk 9-11,15,67,72,145,148,273,278,347,363,369,373,384 (see α, β & β1, β2, γ for statistical risk)
ROCOF 542
Rules for
  convergence PA(t) → PA 179-80,195
  data analysis 320
  derating 33,140
  FMEA/FMECA 72
  imperfect switching 238,240,247
  incomplete coverage 245
  junction temperature 37,141,145
  partition of cumulative operating time 294,295,301,371
  power-up / power-down 145,147
  quality and reliability assurance 19
  series / parallel structures 46,219
  (see also Design guidelines)
Run-in 341,352
Safety 9-10,13,15,66,72,78,362-63,366,379
Safety analysis 15,66,72-78,373,377,378
Safety factor 69
Same element in rel. block diagram 30,32,55,60,69
Same stress 45,71
Sample 504
Sample space 391-92
Sampling tests 277,280-88,344-49,527-30
Scan path 150-51
Scanning electron microscope (SEM) 104
Schmitt-trigger 92,143
Scrambling table 91
Screening (see also ESS)
  assemblies 340-41
  components 337-340 (see also 92-100)
Screening strategy → Test and screening strategy
Seal test 339-40
Secondary failure 4,66,73
Selection criteria for electronic comp. 550-53
Semidestructive analysis 104
Semi-Markov process 164,166-67,440,483-88
Semi-Markov proc. embedded → Semi-reg. proc.
Semi-Markov transition probability 166,185,187,197,244,463-64,484-85,489,490
Semi-regenerative process 162,163,164,197,215,233,264,273,274-75,438,440,488-92
Sequential test 283-84,300-02,528-29
Series model 41-42,64,71,182-88,320,406,421
Series - parallel structure 45-49,213-30,468 (see 48-49 and 220-21 for comparisons)
Series - parallel system → Series - paral. structure
Serviceability → Preventive maintenance
Services reliability 3
Set operations 392
Shewhart cycles 76
Shielded enclosure 144
Shmoo plot 91,93
Short-term test 312
Silicon nitride glassivation → Passivation
Simple one-sided test 284-86,302-03,529-30
Simple structure 28,39-51,168-236
Simple two-sided test 280-83,298-301,527-28
Simulation → Monte Carlo
Single-point failure 42,66,79
Single-point ground 143
Six-σ approach 424
Sleeping state → Dormant state
SMD / SMT 84,109-11,146-47,341
Sneak analyses 76,79,377
Soft error 97
Software
  attributes → quality attributes
  defects 67,117,149,152-53,155-61,329
  defect prevention 155-58,160
  design reviews 154,157,158,159
  development procedure 153-56
  documentation 154,155,156
  FMEA/FMECA 72-73
  interaction 156
  life-cycle phases 154
  metrics 153
  quality assurance 21,152-61,153,362
  quality attributes 153,155
  quality metrics 153
  quality growth 159-61,329-334
  specifications 154,156,157,159
  standards 143,152,153,158,159
  testing / validation 158-59
  time / space domain 153,157
Sojourn time → Stay time
Solder joint 84-85,108-11,147,340-41
Solder-stop pads 146,147
Solderability test 94
Soldering temperature profile 147,148
Spare parts provisioning 125-34
Special diodes 145
Special manufacturing processes 378
Specifications 3,154,156,157,159,365,372,376,379,381,386
Standard deviation 411
Standard industrial environment 36
Standard normal distribution 424-25,539
Standardization 117,120,149,152,155,365,386
Standards 365-68
Standby redundancy 43,62,195,206,211,237,361,418 (see also Active & Warm)
State probability 63,190,461,475-76,486-87
State space 438-41
State space extension 492
State space method 56-57
State space reduction 264
State transition diagram → Diagram of
Static fault tree 270
Stationary (or in steady-state)
  alternating renewal process 180-81,454-55
  distribution 459,475,486,488
  increments (time-homogeneous) 441
  initial distribution 459,475,486,488
  Markov chain 459
  Markov process 166-67,474-76,488
  one-item structure 180-81
  process 440-41
  regenerative process 457
  renewal process 449-51
  semi-Markov process 166-67,486-88
Statistical decision 504
Statistical error → Statistical risk
Statistical hypothesis 525-26
Statistical maintainability tests 303-307
Statistical quality control 16,277-86
Statistical reliability tests 277-334,503-38
Statistical risk 503 (see also α, β, β1, β2, γ)
Statistically independent 397,504,512,525 (see also Stochastically independent)
Statistics → Mathematical statistics
Status test 116,119
Stay time (sojourn time) 163,166-67,249,264,274,275,458,463-64,474,479,483,486,488
Steady-state → Stationary
Steady-state property of Markov processes 477,480,488
Step-stress tests 312
Strategy
  maintenance 35,134-36,233-36
  test & screening 342-44,347-49,361,373,380
Stirling's formula 319,544
Stochastic demand 174
Stochastic matrix 458,460
Stochastic process 438-41,441-502
Stochastically independent 397,399,413
Storage temperature 148
Stress factor 33,139-40,145
Stress-strength method 69-71,76
Strict liability 15,360
Strong law of large numbers 434,505
Structure function → System function
Stuck-at-state 238,247 90
Stuck-at-zero / at-one 90
Student distribution 541
Successful path method 55-56
Sufficient statistic 295,324-27,511-12,513,514
Sum of
  Homogen. Poisson proc. 296,451,498
  Nonhomogen. Poisson proc. 451,496,498
  Point processes 501
  Random variables 416-18,443
  Renewal processes 497-98
Superconform 535
Superimposed processes → Sum of
Superposition → Sum of
Supplementary states 186-88,492
Supplementary variables 186,492
Suppressor diodes 143,144
Surface mount devices / techn. → SMD / SMT
Survival function → Reliability function
Susceptibility → EMC
Sustainable development 10,357,385
Switch 47,48-49,213-19,220-21,236-40,255-58
Switching → Switch
System 2,31,166-67,264-66,363
System's confidence limits 298
System design review 381,383-87
System effectiveness → cost effectiveness
System failure frequency 265-66,477-78
System function 58
System mean time to failure (MTTFS) → MTTF
System reconfiguration → Reconfiguration
System repair frequency 266,478
System restoration frequency → System rep. freq.
System specifications → Specifications
Systems engineering 11,16,357,363
Systems with complex structure 31,52-67,69,231-33,236-76
Systems with hardware and software 161
System without redundancy → Series model
Systematic failure 1,3,6,109,115,329,331,342,352,354,355,362,363
Tasks / task assignment 17-20,372-75
Technical Availability 235
Technical safety → Safety
Technical system → System
Technological characterization 96-98
Technological properties / limits 10,38,84-85,92,96-100,107-111,550-54
Test and screening procedures → Screening
Test and screening strategy 342-44,347-49,361,373,380 (see also Screening)
Test coverage 90,91,117,231,233,241-46
Test pattern 90-93
Test plan 281,283,291,292,299,301,306,527,528,529-30,532
Test point 147,150
Test time partitioning → Partitioning
Test vector 88
Testability 117,147,149-51,155,157,158
Testing
  unknown availability 291-92,293,531-32
  unknown distr. function 312-18,533-38
  unknown MTTR 305-07
  unknown probability 280-86,287,291-92,526-30,531-32
  unknown λ or MTBF=1/λ 298-303,
  statistical hypotheses (basic theory) 525-38
Tchebycheff → Chebyshev
Theorem of cut sets → Cut sets theorem
Thermal cycles 83,95,98,100,108,109,110,337-39,341
Thermal design concept / management 141
Thermal resistance 141-42
Thermal stress 145
Three parameter Weibull distrib. 421,509-10
Time censoring → Censoring
Time-dep. dielectric breakdown 96-97,103
Time-homogeneous Markov process 164,166-67,440,460-83
Time-homogeneous process 440
Time schedule (diagram) 169,175,201,202,212,242,442,453,489,490
Time to market 10,19,369
Timing diagram 146
Top-down 76,78,156,157,356
Top event 76,78,270
Tort liability → Product liability
Total additivity 394
Total down time 124,174,235,499
Total expectation 415,418
Total operating time → Total up time
Total probability 170,400,447,459,473
Total up time 173,174,235,478,499
Totally independent elements 52,61,210,219,225
TQM (Total Quality Management) 16,17,18,19,20,21,353,354,363,365,366,369,372
Traceability 379,380
Training 24,119,375
Transformation of random variables 274,405,426
Transition diagram → Diagram of
Transition probability 166-67,458-59,460-65,469-71,473,483-86 (see also Diagram of)
Transition rate 461-65 (see also Diagram of)
Trend test 323-328
True reliability 26
Truncated distribution / random variable 71,250,273,275,406
Truth table 88,92
Two-sided test
  const. failure rate λ or MTBF=1/λ 298-301
  unknown probability p 280-86,526-30
  (see in particular 283,301)
Type I / II error (α / β) 281-84,2891-92,298-303,305-07,312-18,323-27,525-26,527,530-37
Unavailability 61,219,223,230
Unbiased 511
Unconditional
  expected value 415
  density 404
  probability 396
Uniform distribution 427
Uniformly distributed
  random numbers 274
  random variables 324
Union of events 392
Unused logic inputs 145
Up state 265-66,452-53,469
UPS 223
Useful life 8,14,35,39,81,85,118,141,169,364 (comp. with limited useful life 142,145,146)
User documentation 15,117,118-19,375,379
Value Analysis 364
Value Engineering 364
Variable resistor 100,140,146,550
Variance 410-11,415,416,506
Vibrations 82,83,108,109,341
Viscoplastic deformation 109
Voter 47,215
Wafer 97,106,148
Waiting redundancy → Warm redundancy
Waiting time paradox 448
Waiting time → Stay time
Warm redundancy 43,61-64,189-93,195,206,211,361 (see also Active & Standby)
Washing liquid 148
Weaknesses analysis 3,6,26-28,69,72-80,96,139,329,380
Wearout / wearout failures 3,6-7,8,35,98,233,311,315,320,323,328,329,355,406,421,445-46
Wearout period 6,315,323,328
Weibull distribution 126-28,314-15,408-09,420-21,509-10,548
Weibull prob. chart 314-15,421,509-10,548
Weibull process 330
Weighted sum 7,12,14,41,315-16,343-49,403,406 (see also Cost & Mixture)
Without aftereffect 320,334,451,494,497,501
Work-mission availability 173-74,499
Worst case analysis 76,384
X-ray inspection 102
Zener diodes 140,144,145
Zero defects 86
Zero hypothesis 525-27
1-out-of-2 → one-out-of-two
6-σ approach 424
85/85 test → Humidity test
α particles 103
α, β 525-26
β1, β2, γ 516-17
χ² → Chi-square
o(δt) (Landau notation) 461
i= 539,545
circuit 554
t1, t2, ... (realizations of τ) 4-5,503-38
t1*, t2*, ... (arbitrary points on the time axis, e.g. arrival times, realizations of τ1*, τ2*, ...) 494
7,n, T„, 418
-+
T, 2,