Advanced Reliability Modeling
Proceedings of the 2004 Asian International Workshop (AIWARM 2004)
Advanced Reliability Modeling, 26-27 August 2004
Hiroshima, Japan
edited by
Tadashi Dohi, Hiroshima University, Japan
Won Young Yun, Pusan National University, Korea
World Scientific: New Jersey, London, Singapore, Beijing, Shanghai, Hong Kong, Taipei, Chennai
Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data: A catalogue record for this book is available from the British Library.
ADVANCED RELIABILITY MODELING Proceedings of the 2004 Asian International Workshop (AIWARM 2004)
Copyright © 2004 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-238-871-0
Printed in Singapore by World Scientific Printers (S) Pte Ltd
Preface
Computers control many of the artificial systems in use today. Even in a highly information-oriented society, system failures cause significant problems. To prevent accidents due to system failures, which are caused by uncertain events, systems must be evaluated sufficiently from various points of view, such as reliability, performance and safety. Of these, system reliability is considered the most important factor, and the reliability modeling approach plays a central role in keeping our social life safe.

The Asian International Workshop on Advanced Reliability Modeling (AIWARM) is a symposium for the dissemination of state-of-the-art research and practice in reliability engineering and related issues in the Asian area. The aim of the workshop is to bring together researchers, scientists and practitioners from Asian countries to discuss the state of research and practice in dealing with reliability issues at the system design (modeling) level, and to jointly formulate an agenda for future research in this emerging area. The theme of AIWARM 2004, held in Hiroshima, Japan, August 26-27, 2004, is the advancement of reliability modeling in the Asian area.

This book contains 78 rigorously refereed articles presented at AIWARM 2004. These articles cover the key topics in reliability, maintainability and safety engineering and provide an in-depth representation of theory and practice in these areas. The contributions are arranged in alphabetical order based on the surname of the first author. We believe the articles in this book will introduce readers to significant and up-to-date theory and practice in reliability modeling. This book should also be of interest and importance to practitioners such as system designers and engineers, as well as to researchers such as applied mathematicians, statisticians, and graduate students interested in reliability, maintainability and safety engineering.
AIWARM 2004 is sponsored by Hiroshima Shudo University, Japan, the Chugoku-Shikoku Branch of The Operations Research Society of Japan, the Electric Technology Research Foundation of Chugoku, Japan, and the Trouble Analysis and Reliability Research Center (PNU), Korea. This workshop is held in cooperation with the IEEE Hiroshima Chapter, Japan, the IEEE Reliability Society Japan Chapter, Japan, the IEICE Technical Group on Reliability, Japan, The Operations Research Society of Japan, The Korean Reliability Society, Korea, and the Structural Safety and Reliability Project Research Center of Hiroshima University, Japan. We, the editors, would like to express our sincere appreciation to the Program Committee members and the Local Organizing Committee members, as well as to all the contributors to this book. We are especially indebted to the Honorary General Chair, Professor Naoto Kaio, Hiroshima Shudo University, Japan, and the Program Co-Chairs, Professor Mitsuhiro Kimura, Hosei University, Japan, and Professor Min Xie, National University of Singapore, Singapore. Our special thanks are due to Professor Hiroyuki Okamura and Dr. Koichiro Rinsaka, Hiroshima University, Japan, for their continual support. Finally, we would like to thank Chelsea Chin, World Scientific Publishing Co., Singapore, for her warm and patient help.
Tadashi Dohi, Hiroshima University, Japan
Won Young Yun, Pusan National University, Korea
Editors & General Co-Chairs of AIWARM 2004
Contents
Preface / T. Dohi and W. Y. Yun ... v
Genetic Search for Redundancy Optimization in Complex Systems / M. Agarwal and R. Gupta ... 1
Upper and Lower Bounds for 3-Dimensional k-within-Consecutive-(r1, r2, r3)-out-of-(n1, n2, n3):F System / T. Akiba and H. Yamamoto ... 9
How Can We Estimate Software Reliability with a Continuous-state Software Reliability Model? / T. Ando and T. Dohi ... 17
A Study on Reliable Multicast Applying Convolutional Codes over Finite Field / M. Arai, S. Fukumoto and K. Iwasaki ... 25
Reliability Design of Industrial Plants using Petri Nets / M. Bertolini, M. Bevilacqua and G. Mason ... 33
Optimal Burn-in Procedures in a Generalized Environment / J. H. Cha and J. Mi ... 41
Performing the Soft-error Rate (SER) on a TDBI Chamber / V. Chang and W. T. K. Chien ... 49
Enhancement of Reliability and Economy of a Thermal Power Generating System Through Prediction of Plant Efficiency Parameters / A. Chatterjee, S. Chatterjee and I. Mukhopadhyay ... 57
Optimal Burn-in Time for General Repairable Products Sold Under Warranties / Y. H. Chien and S. H. Sheu ... 65
Determining Optimal Warranty Periods from the Seller's Perspective and Optimal Out-of-warranty Replacement Age from the Buyer's Perspective / Y. H. Chien, S. H. Sheu and J. A. Chen ... 73
Warranty and Imperfect Repairs / S. Chukova and Y. Hayakawa ... 81
Acceptance Sampling Plans Based on Failure-censored Step-stress Accelerated Tests for Weibull Distributions / S. W. Chung, Y. S. Seo and W. Y. Yun ... 89
Availability for a Repairable System with Finite Repairs / L. Cui and J. Li ... 97
A New Approach for the Fuzzy Reliability Analysis in Case of Discrete Fuzzy Variable / Y. Dong, Z. Ni and C. Wang ... 101
Fuzzy Reliability Analysis of Complex Mechanical System / Y. Dong, Z. Ni and C. Wang ... 109
Optimal Release Problem Based on the Number of Debuggings with Software Safety Model / T. Fujiyoshi, K. Tokuno and S. Yamada ... 117
Operating Environment Based Maintenance and Spare Parts Planning: A Case Study / B. Ghodrati and U. Kumar ... 125
Discrete-time Spare Ordering Policy with Lead Time and Discounting / B. C. Giri, T. Dohi and N. Kaio ... 133
SNEM: A New Approach to Evaluate Terminal Pair Reliability of Communication Networks / N. K. Goyal, R. B. Misra and S. K. Chaturvedi ... 141
Robust Design for Quality-reliability via Fuzzy Probability / H. Guo ... 149
Interval-valued Fuzzy Set Modelling of System Reliability / R. Guo ... 157
Fuzzy Set-valued Statistical Inferences on a System Operating Data / R. Guo and E. Love ... 165
A Software Reliability Allocation Model Based on Cost-controlling / C. Huang, R. Z. Xu and L. P. Zhang ... 173
Reliability of a Server System with Access Restriction / M. Imaizumi, M. Kimura and K. Yasui ... 181
Continuous-state Software Reliability Growth Modeling with Testing-effort and Its Goodness-of-fit / S. Inoue and S. Yamada ... 189
Analysis of Discrete-time Software Cost Model Based on NPV Approach / K. Iwamoto, T. Dohi and N. Kaio ... 197
Reducing Degradation Testing Time with Tightened Critical Value / J. S. Jang, S. J. Jang, B. H. Park and H. K. Lam ... 205
An Optimal Policy for Partially Observable Markov Decision Processes with Non-independent Monitors / L. Jin, T. Mashita and K. Suzuki ... 213
Mathematical Estimation Models for Hardware and Software Fault Tolerant System / P. Jirakittayakorn, N. Wattanapongsakorn and D. Coit ... 221
Analysis of Warranty Claim Data: A Literature Review / M. R. Karim and K. Suzuki ... 229
Simulated Annealing Algorithm for Redundancy Optimization with Multiple Component Choices / H. G. Kim, C. O. Bae and S. Y. Park ... 237
Estimation of Failure Intensity and Maintenance Effects with Explanatory Variables / J. W. Kim, W. Y. Yun and S. C. Han ... 245
Economic Impacts of Guard Banding on Designing Inspection Procedures / Y. J. Kim and M. S. Cha ... 253
The System Reliability Optimization Problems by using an Improved Surrogate Constraint Method / S. Kimura, R. J. W. James, J. Ohnishi and Y. Nakagawa ... 261
Efficient Computation of Marginal Reliability Importance in a Network System with k-terminal Reliability / T. Koide, S. Shinmori and H. Ishii ... 269
Reliability and Risk Evaluation of Large Systems / K. Kolowrocki ... 277
An Optimal Policy to Minimize Expected Tardiness Cost Due to Waiting Time in the Queue / J. Koyanagi and H. Kawai ... 285
Reliability of a k-out-of-n System with Repair by a Service Station Attending a Queue with Postponed Work / A. Krishnamoorthy, V. C. Narayanan and T. G. Deepak ... 293
Reliability Evaluation of a Flow Network with Multiple-capacity Link-states / S. M. Lee, C. H. Lee and D. H. Park ... 301
A Random Shock Model for a Continuously Deteriorating System / K. E. Lam, J. S. Baek and E. Y. Lee ... 309
Improvement in Bias and MSE of Weibull Parameter Estimators from Right-censored Large Samples by Using Two Kinds of Quantities / C. Liu and S. Abe ... 317
Software System Reliability Design Considering Hybrid Fault Tolerant Software Architectures / D. Methanavyn and N. Wattanapongsakorn ... 325
Software Reliability Prediction using Neural Networks with Linear Activation Function / R. B. Misra and P. V. Sasatte ... 333
Block Burn-in with Minimal Repair / M. H. Na, S. Lee and Y. N. Son ... 341
Five Further Studies for Reliability Models / T. Nakagawa ... 347
Note on an Inspection Density / T. Nakagawa and N. Kaio ... 363
An Improved Intrusion-detection Model by Profiling Correlated Access Data / H. Okamura, T. Fukuda and T. Dohi ... 371
Dependence of Computer Virus Prevalence on Network Structure: Stochastic Modeling Approach / H. Okamura, H. Kobayashi and T. Dohi ... 379
Optimal Inspection Policies with an Equality Constraint Based on the Variational Calculus / T. Ozaki, T. Dohi and N. Kaio ... 387
Optimal Imperfect Preventive Maintenance Policies for a Shock Model / C. H. Qian, K. Ito and T. Nakagawa ... 395
Determination of Optimal Warranty Period in a Software Development Project / K. Rinsaka and T. Dohi ... 403
Optimal Inspection-warranty Policy for Weight-quality Based on Stackelberg Game: Fraction Defective and Warranty Cost / H. Sandoh and T. Koide ... 411
An Automatic Defect Detection for C++ Programs / S. Sarala and S. Valli ... 419
Approximation Method for Probability Distribution Functions by Coxian Distribution / Y. Sasaki, H. Imai, I. Ishii and M. Tsunoyama ... 427
Tumor Treatment Efficacy by Fractionated Irradiation with Genetic Radiotherapy / T. Satow and H. Kawai ... 435
Computation Technology for Safety and Risk Assessment of Gas Pipeline Systems / V. Seleznev and V. Aleshin ... 443
Joint Determination of the Imperfect Maintenance and Imperfect Production to Lot-Sizing Problem / S. H. Sheu, J. A. Chen and Y. H. Chien ... 451
Optimum Policies with Imperfect Maintenance / S. H. Sheu, Y. B. Lin and G. L. Liao ... 459
Optimal Schedule for Periodic Imperfect Preventive Maintenance / S. W. Shin, D. K. Kim and J. H. Lim ... 467
Reliability Analysis of Warm Standby Redundant Structures with Monitoring System / S. W. Shin, J. H. Lim and D. H. Park ... 475
User Reception Analysis in Human Reliability Analysis / K. W. M. Siu ... 483
Evaluation of Partial Safety Factors for Establishing Acceptable Flaws for Brittle Piping / A. Srividya, R. Rastogi and M. J. Sakhardande ... 491
Automatic Pattern Classification Reliability of the Digitized Mammographic Breast Density / T. Sumimoto, S. Goto and Y. Azuma ... 499
X-ray Image Analysis of Defects at BGA for Manufacturing System Reliability / T. Sumimoto, T. Maruyama, Y. Azuma, S. Goto, M. Mondou, N. Furukawa and S. Okada ... 507
Analysis of Marginal Count Failure Data with Discarding Information Based on LFP Model / K. Suzuki and L. Wang ... 515
On a Markovian Deteriorating System with Uncertain Repair and Replacement / N. Tamura ... 523
Software Reliability Modeling for Integration Testing in Distributed Development Environment / Y. Tamura, S. Yamada and M. Kimura ... 531
Performance Evaluation for Multi-task Processing System with Software Availability Model / K. Tokuno and S. Yamada ... 539
Quality Engineering Analysis for Human Factors Affecting Software Reliability in the Design Review Process with Classification of Detected Faults / K. Tomitaka, S. Yamada and R. Matsuda ... 547
Construction of Possibility Distributions for Reliability Analysis Based on Possibility Theory / X. Tong, H. Z. Huang and M. J. Zuo ... 555
A Sequential Design for Binary Lifetime Testing on Weibull Distribution with Unknown Scale Parameter / W. Yamamoto, K. Suzuki and H. Yasuda ... 563
The Generally Weighted Moving Average Control Chart for Detecting Small Shifts in the Process Median / L. Yang and S. H. Sheu ... 569
Safety-integrity Level Model for Safety-related Systems in Dynamic Demand State / I. Yoshimura, Y. Sato and K. Suyama ... 577
Warranty Strategy Accounts for Products with Bathtub Failure Rate / S. L. Yu and S. H. Sheu ... 585
Calculating Exact Top Event Probability of a Fault Tree / T. Yuge, K. Tagami and S. Yanagi ... 593
A Periodic Maintenance of Connected-(r,s)-out-of-(m,n):F System with Failure Dependence / W. Y. Yun, C. H. Jeong, G. R. Kim and H. Yamamoto ... 601
Estimating Parameters of Failure Model for Repairable Systems with Different Maintenance Effects / W. Y. Yun, K. K. Lee, S. H. Cho and K. H. Nam ... 609
Reliability and Modeling of Systems Integrated with Firmware and Hardware / T. Zhang, M. Xie, L. C. Tang and S. H. Ng ... 617
Author Index ... 625
GENETIC SEARCH FOR REDUNDANCY OPTIMIZATION IN COMPLEX SYSTEMS
MANJU AGARWAL AND RASHIKA GUPTA
Department of Operational Research, University of Delhi, Delhi-110007, INDIA
E-mail: manju-agarwal@yahoo.com

Genetic Algorithms (GA's) have recently been used in combinatorial optimization approaches to reliable design, mainly for series-parallel systems. This paper presents a GA for the parallel redundancy optimization problem in complex systems. For highly constrained problems, infeasible solutions may take up a relatively big portion of the population, and in such cases feasible solutions may be difficult to find. For handling constraints, penalty strategies are very effective, as a certain amount of infeasible solutions is kept in each generation, so that the genetic search is enforced towards an optimal solution from the sides of both the feasible and infeasible regions. In this paper an adaptive penalty strategy is proposed, which makes use of feedback obtained during the search along with a dynamic distance metric, and helps the algorithm search efficiently for a final, optimal or nearly optimal solution. Some numerical examples illustrate the effectiveness of the proposed algorithm.
1. Introduction
The redundancy allocation problem is a nonlinear integer-programming problem and has been thoroughly studied and discussed in the literature with both enumerative-based methods and heuristic-based methods. This type of problem is of a combinatorial nature and NP-hard. In recent work the major focus is on the development of heuristic and metaheuristic algorithms for redundancy allocation problems for system reliability improvement. Genetic Algorithms (GA's), one of the metaheuristic techniques, seek to imitate the biological phenomenon of evolutionary reproduction through the parent-children relationship and can be understood as the intelligent exploitation of a random search. This technique was initially developed by Holland [6]. The references [4, 5] provide a good description of GA's. GA's have been applied in combinatorial optimization techniques and designed to solve a variety of reliability optimization problems. While the papers [1, 8, 12, 14, 16] have applied GA mainly to series-parallel and parallel-series systems, [2, 3, 7, 10] have applied GA to find a reliable network design. This paper focuses on solving highly constrained redundancy optimization problems in complex systems with genetic search using an adaptive penalty function approach. The effectiveness of the adaptive penalty approach developed in this research is demonstrated on complex system structures from the literature with linear as well as nonlinear constraints.

2. Notations

n : number of subsystems
m : number of constraints
k : number of constraints violated
x_i : redundancy at subsystem i
x : (x_1, x_2, ..., x_n)
x^f : best feasible solution yet obtained
x^u : best infeasible solution yet obtained
l_i : lower limit of subsystem i
u_i : upper limit of subsystem i
g_j(x) : resource j consumed
b_j : resource j available
Δb_j(x) : g_j(x) − b_j
R_i(x_i) : subsystem i reliability
R_s(x) : system reliability at x
R_f(x^f) : system reliability at x^f
R_u(x^u) : system reliability at x^u
R_p(x) : penalized system reliability at x
3. Statement of the Problem
The problem of finding optimal redundancy levels $(x_1, x_2, \ldots, x_n)$ for maximizing system reliability subject to constraints can be described as follows:

$$\text{Maximize } R_s(x) \quad \text{subject to } g_j(x) \le b_j,\ j = 1, 2, \ldots, m; \qquad x_i \ge 1,\ \text{integer},\ i = 1, 2, \ldots, n. \tag{1}$$

It is assumed that the system and all its subsystems are s-coherent, all component states are mutually s-independent, and the system reliability $R_s(x)$ is known in terms of $R_i(x_i)$.

4. Adaptive Penalty Function
GA's perform a multiple directional search by maintaining a population of potential solutions. The central problem in applying GA's to constrained optimization is how to handle constraints. For handling constraints, the penalty function method has found great application in the field of GA's [4, 15], since it keeps a certain amount of infeasible solutions in each generation so that the genetic search is enforced towards an optimal solution from the sides of both the feasible and infeasible regions. Penalty functions can be classified as static, dynamic and adaptive. The reference [13] gives a good comparison of six penalty function strategies applied to continuous optimization problems in GA's. However, there are no general guidelines on designing penalty functions, since constructing an efficient penalty function is quite problem-dependent.

The adaptive penalty function $R_p(x)$, Eq. (2), used to solve the redundancy allocation problem is built from three ingredients. The function learns to adapt itself based on the severity of the constraints and the system reliability of a particular problem instance. The metric $(k/m)$ calculates the ratio of the number of constraints violated to the total number of constraints. The distance metric, defined as $\big\{\sum_{j=1}^{m} \Delta b_j(x)/b_j\big\}^{\lambda g}$, evaluates the sum of the ratios of the magnitude of constraint violation to the resource available; it incorporates the dynamic aspect and increases the severity of the penalty for a given distance as the search progresses, where $\lambda$ is a positive constant, taken as 0.03, and $g$ is the generation number. The adaptive term $\exp\{(1 - R_s(x))/(R_u(x^u) - R_f(x^f))\}$ takes care of the maximum possible improvement in the current solution with respect to the difference between the system reliabilities of the best infeasible and feasible solutions yet obtained. The sensitivity of the adaptive penalty is studied for different values of $\kappa$: 0.5, 1.0, 2.0, 3.0 and 4.0. When $R_u(x^u) \le R_f(x^f)$, no penalty is imposed on the infeasible solutions. Moreover, the impact of the adaptive penalty is such that infeasible solutions giving less reliability are penalized more, so the search concentrates on the promising infeasible solutions.
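As a rough illustration, a penalty built from these ingredients can be computed as in the following sketch; the placement of the constant κ and the way the three factors are combined are assumptions here, not necessarily the exact form of Eq. (2).

import math

LAMBDA = 0.03  # dynamic constant lambda from the text

def penalized_reliability(r_s, g_x, b, r_u, r_f, gen, kappa=2.0):
    """Adaptive penalty sketch: r_s = R_s(x), g_x[j] = g_j(x), b[j] = b_j,
    r_u = R_u(x^u), r_f = R_f(x^f), gen = generation number g."""
    m = len(b)
    violated = [j for j in range(m) if g_x[j] > b[j]]
    k = len(violated)
    if k == 0 or r_u <= r_f:
        return r_s                      # feasible, or no penalty is imposed
    ratio = k / m                       # share of violated constraints
    # dynamic distance metric: violation magnitudes relative to resources
    dist = sum((g_x[j] - b[j]) / b[j] for j in violated) ** (LAMBDA * gen)
    # adaptive term: lower-reliability infeasible solutions are hit harder
    adapt = math.exp(kappa * (1.0 - r_s) / (r_u - r_f))
    # multiplicative combination of the three factors (assumed form)
    return r_s / (1.0 + ratio * dist * adapt)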
5. Genetic Algorithm

The major steps involved in the GA are: 1) chromosome representation; 2) generation of the initial population; 3) evaluation of the fitness of the chromosomes; 4) selection of parent chromosomes for crossover; 5) generation of offspring by the crossover operation; 6) mutation operation on chromosomes; 7) selection of the best chromosomes from the population for the next generation according to the fitness values. Steps 4) to 7) are repeated until the termination criterion is met.

A. Chromosome Representation: We use integer value encoding, which is more efficient for combinatorial problems, and each possible solution or encoding is a vector representation $x = (x_1, x_2, \ldots, x_n)$.
B. Initial Population: A fixed number of chromosomes are randomly generated to form an initial population.
C. Fitness: The penalized objective function value is taken to be the fitness value of the chromosome.
D. Genetic Operators:
(i) Crossover: The crossover rate $p_c$ is taken as 0.4 for all the system structures computed. The population is arranged from the fittest chromosome to the least fit, and a random number $p$ is generated from the interval [0, 1] for each population member. A chromosome is selected as a parent only if $p \le p_c$. The one-cut-point crossover method is used to mate the parents, and the cut position is generated randomly from the interval $[1, n]$.
(ii) Mutation: The mutation rate $p_m$ is taken as 0.3. A random number $p$ is generated for each gene from the interval [0, 1]. If $p \le p_m$, the gene is randomly flipped to another gene from the corresponding set of alternatives.
E. Selection: All the population members and the children produced after the genetic operations are arranged in descending order of fitness, and the population-size best chromosomes are selected for the next generation. A minimal sketch of this loop is given below.
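The following sketch ties steps A-E together, assuming an externally supplied fitness function (for instance the penalized reliability above) that takes a chromosome and the current generation number; all parameter names are illustrative.

import random

def genetic_search(n, lower, upper, fitness, pop_size=15, generations=1000,
                   p_c=0.4, p_m=0.3):
    """Integer encoding, one-cut-point crossover, gene-flip mutation,
    elitist selection, as described in Section 5."""
    pop = [[random.randint(lower[i], upper[i]) for i in range(n)]
           for _ in range(pop_size)]
    for gen in range(generations):
        pop.sort(key=lambda c: fitness(c, gen), reverse=True)
        # a chromosome becomes a parent if a U(0,1) draw is <= p_c
        parents = [c for c in pop if random.random() <= p_c]
        children = []
        for a, b in zip(parents[::2], parents[1::2]):
            cut = random.randint(1, n - 1)            # one-cut-point crossover
            children += [a[:cut] + b[cut:], b[:cut] + a[cut:]]
        # each gene of each child is re-drawn with probability p_m
        for c in children:
            for i in range(n):
                if random.random() <= p_m:
                    c[i] = random.randint(lower[i], upper[i])
        # keep the pop_size fittest of parents plus children
        pop = sorted(pop + children, key=lambda c: fitness(c, gen),
                     reverse=True)[:pop_size]
    return pop[0]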
6. Test Problems and Results

The test problems studied are 5-unit (bridge structure), 7-, 10- and 15-unit complex structures with linear constraints (Kim and Yum [9]) and a 4-stage series system with nonlinear constraints (Kuo et al. [11], pp. 69-70). The objective is to maximize the reliability of the system subject to various constraints. In total, 9 sets of problems with different combinations of constraints are studied. While for the test problems of size 4, 5, 7 and 10 the population size is taken to be 15 and the number of generations 1000, for the 15-unit structure the population size and number of generations are taken as 50 and 1500, respectively. Ten GA trials for each system are then made for each of the values $\kappa$ = 0.5, 1.0, 2.0, 3.0, 4.0. To carry out the sensitivity analysis of the adaptive penalty function with respect to $\kappa$, we compute for each value the best feasible solution and the average reliability obtained in 10 GA trials. Also, the standard deviation and the percent coefficient of variation of the average reliabilities obtained for each generation in 10 GA trials are computed. The value of $\kappa$ giving the average reliability with the least coefficient of variation is taken to be the best value of $\kappa$ for that particular problem.
6.1. 5-, 7-, 10- and 15-Unit Structures with Linear Constraints for Different Combinations of Problem Parameters

The problem is as defined in (1) with linear constraints $g_j(x) = \sum_{i=1}^{n} c_{ji} x_i \le b_j$, $j = 1, 2, \ldots, m$, and is varied over: n = 5, m = 1 (bridge structure); n = 7, m = 1, 5, with b_j = 'small', 'large'; n = 10, m = 1, 5; n = 15, m = 1.

Figure 1: 7-unit structure. Figure 2: 10-unit structure. Figure 3: 15-unit structure.
This results in 8 sets of test problems. For each set, 10 GA runs are performed for each of the 5 values of $\kappa$ and the variation in the average reliability is studied. Data for the structures are generated randomly as described in [9]: component cost $c_{ji}$ = random uniform deviate from (0, 100); component reliability in subsystem $i$, $r_i$ = random uniform deviate from (0.6, 0.85); $b_j = w_j \sum_{i=1}^{n} c_{ji}$, where $w_j$ = random uniform deviate from (1.5, 2.5) for 'small' $b_j$ and from (2.5, 3.5) for 'large' $b_j$. A sketch of this data-generation scheme follows after Table 1's description. To give an idea of the computations, Table 1 contains the results obtained for the 7x5 'large' $b_j$, 10x5 and 15x1 structures. It tabulates the best feasible solution and the average reliability obtained in 10 GA trials. Also, the standard deviation and the percent coefficient of variation of the average reliabilities for each generation in 10 GA trials are given. The ideal value of $\kappa$ for each test problem is highlighted.
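A direct transcription of the data-generation scheme quoted from [9]; function and variable names are illustrative only.

import random

def generate_test_problem(n, m, size='small'):
    """Random data for an n-subsystem, m-constraint linear test problem."""
    c = [[random.uniform(0, 100) for _ in range(n)] for _ in range(m)]  # costs c_ji
    r = [random.uniform(0.60, 0.85) for _ in range(n)]                  # component reliabilities
    lo, hi = (1.5, 2.5) if size == 'small' else (2.5, 3.5)
    b = [random.uniform(lo, hi) * sum(c[j]) for j in range(m)]          # resources b_j
    return c, r, b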
Table 1: Comparison table for 7-, 10- and 15-unit structures (15x1 block shown)

Structure  kappa  Best feasible  Average reliability  Std. deviation  % coeff. of variation
15x1       0.5    0.912058       0.854986901          0.028445537     3.32701442
15x1       1.0    0.914283       0.828742878          0.047884535     5.77797243
15x1       2.0    0.891940       0.862362945          0.019145562     2.22015387
15x1       3.0    0.909330       0.873000625          0.025273055     2.89496413
15x1       4.0    0.913974       0.877274465          0.009701171     1.10583079
Figure 4: Comparison of average reliabilities for each generation in 10 trials corresponding to different values of κ, for the 10x5 structure.
For the 10x5 structure, Figure 4 graphically reflects the variation in the average reliability obtained in 10 GA trials per generation, and hence the effect of the penalty function on the solution quality and the speed of evolutionary convergence. During the computations it is observed that, in comparison to κ = 4.0, where the convergence is more from the infeasible region side, resulting in less variation and high average reliability, κ = 0.5 and 1.0 impose more penalty, causing larger variation in the average reliability, which improves only gradually from the feasible region side.

6.2. 4-Stage Series System with Non-Linear Constraints

This system is the one presented in [11], and is a special case of the allocation of discrete component reliabilities and redundancy model. Table 2 shows the variation in the average reliability for different values of κ.

Table 2: Comparison table for the 4-stage series system.
6.3. Comparison of GA with Heuristic

Further, to test how good our GA is, all the test problems are solved by the Kim and Yum [9] heuristic algorithm (KYA), which perhaps seems to be the best heuristic proposed in the literature. In KYA the search is made not only in the feasible region but also into the bounded infeasible region, making an excursion which eventually returns to the feasible region with a possibly improved solution. Table 3 shows the optimal/nearly optimal solutions obtained by KYA and the proposed GA. With KYA, each test problem is solved for 10 randomly generated initial solutions and the best is selected as the optimal/nearly optimal solution. It can be noticed that for two systems, 7x1 'small' b_j and 7x5 'small' b_j, GA gives a better solution than KYA.
Table 3: Comparison of optimal solutions obtained (recoverable portion)

System                     Method  Solution                         System reliability
15x1                       GA      (2,3,3,4,3,3,1,1,1,2,1,1,1,1,1)  0.914283
4x3 nonlinear constraints  KYA     (3,3,5,3)                        0.944447
4x3 nonlinear constraints  GA      (3,3,5,3)                        0.944447
7. Conclusion

The results of the GA have been very encouraging. It is a promising optimization method for solving redundancy allocation problems for complex systems, although the computing time is more than that of the other heuristics proposed in the literature. The search towards an optimal solution is enforced from the sides of both the feasible and infeasible regions and is much superior to the strategy of allowing only feasible solutions. The infeasibility of the solutions is handled by a dynamic-adaptive distance-based penalty function, which helps the search proceed efficiently to a final optimal/nearly optimal solution. The effectiveness of the adaptive penalty function on the solution quality as well as on the speed of evolutionary convergence is studied and shown graphically for several highly constrained problems. The investigations show that this approach can be powerful and robust for problems with a large search space and difficult-to-satisfy constraints.
References
1. D. W. Coit, A. E. Smith and D. M. Tate, "Adaptive Penalty Methods for Genetic Optimization of Constrained Combinatorial Problems", INFORMS Journal on Computing, vol. 8, pp. 173-182 (1996).
2. D. L. Deeter and A. E. Smith, "Heuristic Optimization of Network Design Considering All Terminal Reliability", in N. J. McAfee (editor), Proceedings Annual Reliability and Maintainability Symposium, Philadelphia, PA, 13-16 Jan 1997, pp. 194-199 (1997).
3. B. Dengiz, F. Altiparmak and A. E. Smith, "Local Search Genetic Algorithm for Optimal Design of Reliable Networks", IEEE Trans. on Evolutionary Computation, vol. 1, pp. 179-188 (1997).
4. M. Gen and R. Cheng, Genetic Algorithms and Engineering Design, John Wiley and Sons, Inc., New York (1997).
5. D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Reading, MA: Addison-Wesley (1989).
6. J. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press (1975).
7. Y. C. Hsieh, T. C. Chen and D. L. Bricker, "Genetic Algorithms for Reliability Design Problems", Technical Report, Dept. of Industrial Engineering, University of Iowa (1997).
8. K. Ida, M. Gen and T. Yokota, "System Reliability Optimization of Series-Parallel Systems Using a Genetic Algorithm", IEEE Trans. on Reliability, vol. 45, pp. 254-260 (1996).
9. J. H. Kim and B. J. Yum, "A Heuristic Method for Solving Redundancy Optimization Problems in Complex Systems", IEEE Trans. on Reliability, vol. 42, no. 4, pp. 572-578 (1993).
10. A. Kumar, R. M. Pathak and Y. P. Gupta, "Genetic-Algorithm-Based Reliability Optimization for Computer Network Expansion", IEEE Trans. on Reliability, vol. 44, pp. 63-72 (1995).
11. W. Kuo, V. R. Prasad, F. A. Tillman and Ching-Lai Hwang, Optimal Reliability Design: Fundamentals and Applications, Cambridge University Press (2001).
12. S. R. V. Majety and J. Rajgopal, "Dynamic Penalty Function for Evolutionary Algorithms with an Application to Reliability Allocation", Technical Report, Dept. of Industrial Engineering, University of Pittsburgh, Pittsburgh, PA (1997).
13. Z. Michalewicz, "Genetic Algorithms, Numerical Optimization, and Constraints", Proceedings of the Sixth International Conference on Genetic Algorithms, pp. 151-158 (1995).
14. L. Painton and J. Campbell, "Genetic Algorithms in Optimization of System Reliability", IEEE Trans. on Reliability, vol. 44, pp. 172-178 (1995).
15. A. E. Smith and D. W. Coit, Handbook of Evolutionary Computation, Section C5.2, "Penalty Functions", Joint Publication of Oxford University Press and Institute of Physics Publishing (1995).
16. T. Yokota, M. Gen and K. Ida, "System Reliability Optimization Problems with Several Failure Modes by Genetic Algorithm", Japanese Journal of Fuzzy Theory and Systems, vol. 7, no. 1, pp. 117-135 (1995).
UPPER AND LOWER BOUNDS FOR 3-DIMENSIONAL k-WITHIN-CONSECUTIVE-(r1, r2, r3)-OUT-OF-(n1, n2, n3):F SYSTEM

TOMOAKI AKIBA*
Department of Information Management Engineering, Yamagata College of Industry & Technology, 2-2-1 Matsuei, Yamagata, Yamagata, 990-2473, Japan

HISASHI YAMAMOTO
Department of Production Information System, Tokyo Metropolitan Institute of Technology, 6-6 Asahigaoka, Hino, Tokyo, 191-0065, Japan

As 2-dimensional k-within-consecutive-r-out-of-n:F systems there are, for example, the connected-(r, s)-out-of-(m, n):F lattice system and the 2-dimensional k-within-consecutive-(r, s)-out-of-(m, n):F system. For these systems, calculation methods for the reliability, and upper and lower bounds, have been studied by many researchers. Furthermore, several reports have been proposed for the reliability of higher-dimensional systems. In this study, we consider the 3-dimensional k-within-consecutive-r-out-of-n:F system, called the 3-dimensional k-within-consecutive-(r1, r2, r3)-out-of-(n1, n2, n3):F system. This system consists of n1 x n2 x n3 components, which are arranged like an (n1, n2, n3) rectangular solid. The system fails if and only if there is an (r1, r2, r3) rectangular solid in which k or more components fail. Although an enumeration method could be used for evaluating the exact system reliability of very small systems, that method needs much computing time when applied to larger systems. Therefore, developing upper and lower bounds is useful for evaluating the reliability of large systems in a reasonable time.

* Corresponding author and presenter.
1. Introduction

The consecutive-k-out-of-n:F systems have been extensively studied since the early 1980s. This type of system can be regarded as a one-dimensional reliability model and can be extended to two-, three- or d-dimensional versions (d >= 2). The purpose of this paper is to review the studies of multi-dimensional consecutive-k-out-of-n:F systems. There are a few papers about 3-dimensional systems, whose reliability is equal to the probability that a radiologist might not detect the presence of a disease (Salvia and Lasher [1]), as described in Section 2.1. An efficient algorithm for the system reliabilities has not yet been obtained, owing to the complexity of these systems. Psillakis and Makri [7] first analyzed 3-dimensional linear consecutive-k-out-of-r-from-n:F systems by a simulation method. Boushaba and Ghoraf [4] proposed upper and lower bounds and a limit theorem for this system, based on Koutras, Papadopoulos and Papastavridis [5]. Godbole, Potter and Sklar [2] proposed an upper bound for the reliability of a d-dimensional linear consecutive-k-out-of-n:F system. In this study, we consider the 3-dimensional k-within-consecutive-(r1, r2, r3)-out-of-(n1, n2, n3):F system (denoted as the k/(r1, r2, r3)/(n1, n2, n3):F system throughout this paper), which is a three-dimensional version of the 2-dimensional rectangular k-within-consecutive-(r, s)-out-of-(m, n):F system. Although an enumeration method could be used for evaluating the exact system reliability of very small systems, that method needs much computing time when applied to larger systems. Therefore, developing upper and lower bounds is useful for evaluating the reliability of large systems in a reasonable time.

Figure 1: Example of failure of a 3-dimensional k-within-consecutive-(r1, r2, r3)-out-of-(n1, n2, n3):F system and the component axes (shaded cells denote failed components).

2. The k/(r1, r2, r3)/(n1, n2, n3):F System
2.1. Definition of the system
The k/(r1, r2, r3)/(n1, n2, n3):F system consists of n1 x n2 x n3 components, which are arranged like an (n1, n2, n3) rectangular solid. This system fails if and only if there is an (r1, r2, r3) rectangular solid in which k or more components fail, as shown in Figure 1. In this study we denote by component (h, i, j) the component located at the h-th point on the n1 axis, the i-th point on the n2 axis and the j-th point on the n3 axis, with reliability p_hij and failure probability q_hij = 1 - p_hij, for h = 1, 2, ..., n1, i = 1, 2, ..., n2 and j = 1, 2, ..., n3. The sketch below illustrates the failure condition directly.
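Since the failure condition is purely combinatorial, the enumeration mentioned in the Introduction reduces to scanning all (r1, r2, r3) windows of the component grid; a minimal sketch in Python, where state[h][i][j] = 1 means that component (h, i, j) has failed:

from itertools import product

def system_failed(state, r1, r2, r3, k):
    """True if some (r1, r2, r3) rectangular solid of the n1 x n2 x n3 grid
    'state' contains k or more failed components."""
    n1, n2, n3 = len(state), len(state[0]), len(state[0][0])
    for h, i, j in product(range(n1 - r1 + 1), range(n2 - r2 + 1),
                           range(n3 - r3 + 1)):
        failed = sum(state[h + a][i + b][j + c]
                     for a in range(r1) for b in range(r2) for c in range(r3))
        if failed >= k:
            return True
    return False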
Salvia and Lasher [1] and others gave the following examples to illustrate where such multi-dimensional models may be used. The presence of a disease is diagnosed by reading an X-ray. Let p be the probability that an individual cell (or other small portion of the X-ray) is healthy. Unless diseased cells are aggregated into a sufficiently large pattern (say a k x k square), the radiologist might not detect their presence. In medical diagnostics, it may be more appropriate to consider a three-dimensional grid in order to calculate the detection probability of patterns in a three-dimensional space. As another example, the k/(r1, r2, r3)/(n1, n2, n3):F system can be applied as a mathematical model of three-dimensional flash memory cell failure.

2.2. Theorem

In this section, we propose upper and lower bounds for the reliability of a k/(r1, r2, r3)/(n1, n2, n3):F system. For this, we introduce some notation. First, we define some sets of components in the system, for h = 1, 2, ..., n1, i = 1, 2, ..., n2, j = 1, 2, ..., n3:

CA(h, i, j): set of all components in the solid with components (1, 1, 1), (n1, 1, 1), (1, n2, 1), (n1, n2, 1), (1, 1, j), (n1, 1, j), (n1, i-1, j), (h, i, j), (1, i, j), (1, n2, j-1) and (n1, n2, j-1) as its apices;
CS1(h, i, j): set of all components in an (r1, r2, r3) rectangular solid as a part of CA(h, i, j) with component (h, i, j) as its upper right deep apex;
CS2(h, i, j): set of all components in an (r1, r2) matrix as a part of CS1(h, i, j) with components (h-r1+1, i-r2+1, j), (h, i-r2+1, j), (h-r1+1, i, j) and (h, i, j) as its apices;
CS3(h, i, j): set of all components in an (r1, r3) matrix as a part of CS1(h, i, j) with components (h-r1+1, i, j-r3+1), (h, i, j-r3+1), (h-r1+1, i, j) and (h, i, j) as its apices;
CS4(h, i, j): set of all components in an (r2, r3) matrix as a part of CS1(h, i, j) with components (h, i-r2+1, j-r3+1), (h, i, j-r3+1), (h, i-r2+1, j) and (h, i, j) as its apices;
CC(h, i, j): set of all components in a solid as a part of CA(h, i, j) with components (h-2r1+2, i, j), (h-r1, i, j), (h-r1, i-r2, j), (h+1, i-r2, j), (h+1, i-1, j), (h+1, i, j-1), (h+r1-1, i, j-1), (h+r1-1, i, j-r3+1), (h-2r1+2, i, j-r3+1), (h-2r1+2, i-r2+2, j-r3+1), (h+r1-1, i-r2+2, j-r3+1), (h-2r1+2, i-r2+2, j) and (h+r1-1, i-r2+2, j) as its apices;
CG1(h, i, j): set of all components in an (r1-1, r2-1, r3-1) rectangular solid as a part of CS1(h, i, j) with component (h-1, i-1, j-1) as its upper right deep apex;
CG2(h, i, j): set of all components in an (r2-1, r3-1) matrix as a part of CS1(h, i, j) with components (h, i-1, j-1), (h, i-r2+1, j-1), (h, i-r2+1, j-r3+1) and (h, i-1, j-r3+1) as its apices;
CG3(h, i, j): set of all components in an (r1-1, r3-1) matrix as a part of CS1(h, i, j) with components (h-1, i, j-1), (h-r1+1, i, j-1), (h-r1+1, i, j-r3+1) and (h-1, i, j-r3+1) as its apices;
CG4(h, i, j): set of all components in an (r1-1, r2-1) matrix as a part of CS1(h, i, j) with components (h-1, i-1, j), (h-r1+1, i-1, j), (h-r1+1, i-r2+1, j) and (h-1, i-r2+1, j) as its apices;
CG5(h, i, j): set of the r2-1 components (h, i-1, j), (h, i-2, j), ..., (h, i-r2+1, j) of CS1(h, i, j);
CG6(h, i, j): set of the r3-1 components (h, i, j-1), (h, i, j-2), ..., (h, i, j-r3+1) of CS1(h, i, j);
CG7(h, i, j): set of the r1-1 components (h-1, i, j), (h-2, i, j), ..., (h-r1+1, i, j) of CS1(h, i, j).

For the simple expression of the theorems and equations, virtual components should be stated: component (h, i, j) with component reliability 1 for every index (h, i, j) outside the range 1 <= h <= n1, 1 <= i <= n2, 1 <= j <= n3. Furthermore, we denote some events which occur on the above sets. For h = 1, 2, ..., n1, i = 1, 2, ..., n2 and j = 1, 2, ..., n3:

S_hij: the event that "k or more components fail in CS1(h, i, j)", "at least one component fails in CS2(h, i, j)", "at least one component fails in CS3(h, i, j)" and "at least one component fails in CS4(h, i, j)".

For h = r1, r1+1, ..., n1, i = r2, r2+1, ..., n2 and j = r3, r3+1, ..., n3:

G_hij: the event that all components function in CC(h, i, j);
G*_hij: an event defined case by case according to which of h = r1, i = r2 and j = r3 hold: the whole event for h = r1, i = r2, j = r3, and otherwise the event that fewer than k components fail in each of the intersections of CG1(h, i, j) with the appropriate unions of CG2(h, i, j), ..., CG7(h, i, j) corresponding to the coordinates for which h, i or j exceeds r1, r2 or r3, respectively;
E_hij: the event that "k or more components fail in CS1(h, i, j)" and "event G*_hij occurs".

By using the above notation, our proposed upper and lower bounds for the reliability of a k/(r1, r2, r3)/(n1, n2, n3):F system in the non-i.i.d. case are given in Theorem 1.
Theorem 1: The upper bound UB and lower bound LB for the reliability of a k/(r1, r2, r3)/(n1, n2, n3):F system are given by products of the non-occurrence probabilities

$$\mathrm{UB} = \prod_{h=r_1}^{n_1}\prod_{i=r_2}^{n_2}\prod_{j=r_3}^{n_3}\big(1 - \Pr\{E_{hij}\}\big), \qquad \mathrm{LB} = \prod_{h=r_1}^{n_1}\prod_{i=r_2}^{n_2}\prod_{j=r_3}^{n_3}\big(1 - \Pr\{S_{hij}\}\big). \tag{1, 2}$$
Theorem 1 can be proven in a similar manner to Yamamoto and Akiba [3]. In the i.i.d. case, Corollary 1 gives the upper and lower bounds with no description of S_hij, G_hij and G*_hij. For integers l1, l2, we define

$$\binom{l_1}{l_2} = \begin{cases} \dfrac{l_1!}{l_2!\,(l_1 - l_2)!} & 0 \le l_2 \le l_1, \\[4pt] 0 & \text{otherwise.} \end{cases} \tag{3}$$
Corollary 1: Let p be the component reliability.
(1) The lower bound LB_p is the i.i.d. counterpart of (2), in which the relevant numbers of component arrangements are counted through sums of binomial coefficients of the form

$$\binom{u(v-1)(w-1)}{t} + \binom{(u-1)v(w-1)}{t} + \binom{(u-1)(v-1)w}{t}.$$

(2) The upper bound UB_p is the i.i.d. counterpart of (1), where

$$\#G(h, i, j) = \begin{cases} r_1 r_2 (r_3 - 1) & (h = r_1,\ i = r_2,\ j \ne r_3), \\ r_1 (r_2 - 1) r_3 & (h = r_1,\ i \ne r_2,\ j = r_3), \\ (r_1 - 1) r_2 r_3 & (h \ne r_1,\ i = r_2,\ j = r_3), \\ r_1 r_2 r_3 - r_1 & (h = r_1,\ i \ne r_2,\ j \ne r_3), \\ r_1 r_2 r_3 - r_2 & (h \ne r_1,\ i = r_2,\ j \ne r_3), \\ r_1 r_2 r_3 - r_3 & (h \ne r_1,\ i \ne r_2,\ j = r_3), \\ r_1 r_2 r_3 - 1 & \text{otherwise.} \end{cases} \tag{8}$$
The quantity N_S(t; h, i, j) is counted through the binomial coefficients above, and N_E(t; h, i, j) is given as follows:

$$N_E(t; h, i, j) = N_S(t; h, i, j) + N_S(t - 1; h, i, j). \tag{11}$$

Outside the ranges indicated above, these quantities are obtained similarly.
In addition, the upper and lower bounds for the reliability of a system can be calculated by using the reliability of a small 3-dimensional system (for example, a k/(r1, r2, r3)/(r1, n2, n3):F system), following the same idea as Akiba, Yamamoto and Saitou [6].
Table 1: Upper and lower bounds for the reliability of the k/(r1, r2, r3)/(n1, n2, n3):F system

n1   n2   n3   r1  r2  r3  k  p        UB        LB        difference
10   10   10   2   2   2   2  0.990    0.435143  0.359604  0.075539
10   10   10   2   2   2   2  0.995    0.791924  0.772025  0.019899
10   10   10   2   2   2   2  0.999    0.989817  0.989604  0.000213
10   10   10   2   2   2   3  0.950    0.328108  0.030590  0.297518
10   10   10   2   2   2   3  0.970    0.667639  0.446764  0.220875
10   10   10   2   2   2   3  0.990    0.974983  0.968562  0.006421
50   50   50   2   2   2   2  0.999    0.219270  0.211320  0.007950
50   50   50   2   2   2   2  0.9995   0.680828  0.677669  0.003159
50   50   50   2   2   2   2  0.9999   0.984578  0.984541  0.000037
50   50   50   2   2   2   3  0.995    0.571939  0.527560  0.044379
50   50   50   2   2   2   3  0.997    0.879611  0.870148  0.009463
50   50   50   2   2   2   3  0.999    0.994963  0.994826  0.000137
100  100  100  2   2   2   2  0.9998   0.602586  0.601092  0.001494
100  100  100  2   2   2   2  0.9999   0.880757  0.880483  0.000274
100  100  100  2   2   2   2  0.99999  0.998728  0.998728  0.000000
100  100  100  2   2   2   3  0.995    0.010209  0.005199  0.005010
100  100  100  2   2   2   3  0.997    0.348706  0.318570  0.030136
100  100  100  2   2   2   3  0.999    0.959341  0.958235  0.001106
100  100  100  2   2   2   3  0.9991   0.970102  0.969367  0.000735
2.3. Evaluation
Table 1 shows the results of numerical experiments in the i.i.d. case. We calculated upper and lower bounds for the reliability of the k/(r1, r2, r3)/(n1, n2, n3):F system with identical component reliabilities. For the system sizes, each of n1, n2 and n3 takes the values 10, 50 and 100; for the dimensions of the rectangular solid leading to system failure, each of r1, r2 and r3 takes the value 2; and for the number of failed components leading to system failure, k = 2, 3. From Table 1 we found that, within the range of our experiments, the difference between the lower bound LB and the upper bound UB becomes small when the component reliabilities are close to one. A rough Monte Carlo cross-check is sketched below.
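Exact enumeration is infeasible at the sizes of Table 1, but a crude Monte Carlo estimate (reusing system_failed from the sketch in Section 2.1) gives a sanity check that should fall between LB and UB; the trial count is illustrative.

import random

def mc_reliability(n1, n2, n3, r1, r2, r3, k, p, trials=10000):
    """Monte Carlo estimate of the k/(r1,r2,r3)/(n1,n2,n3):F system
    reliability with i.i.d. component reliability p."""
    ok = 0
    for _ in range(trials):
        state = [[[1 if random.random() > p else 0 for _ in range(n3)]
                  for _ in range(n2)] for _ in range(n1)]
        if not system_failed(state, r1, r2, r3, k):
            ok += 1
    return ok / trials

# e.g. mc_reliability(10, 10, 10, 2, 2, 2, 2, 0.990) should lie between the
# bounds 0.359604 and 0.435143 reported in the first row of Table 1.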
3. Conclusion

In this study, we proposed upper and lower bounds for the reliability of a 3-dimensional k-within-consecutive-(r1, r2, r3)-out-of-(n1, n2, n3):F system. Within the range of our experiments, we found that the difference between our proposed lower bound LB and upper bound UB becomes small when the system is large and the component reliabilities are close to one.
References
1. A. A. Salvia and W. C. Lasher, 2-dimensional consecutive-k-out-of-n:F models, IEEE Transactions on Reliability, 39, 382-385 (1990).
2. A. P. Godbole, L. K. Potter and J. K. Sklar, Improved upper bounds for the reliability of d-dimensional consecutive-k-out-of-n:F systems, Naval Research Logistics, 45, 219-230 (1998).
3. H. Yamamoto and T. Akiba, Evaluating methods for the reliability of a large 2-dimensional rectangular k-within-consecutive-(r, s)-out-of-(m, n):F system (submitted).
4. M. Boushaba and N. Ghoraf, A 3-dimensional consecutive-k-out-of-n:F models, International Journal of Reliability, Quality and Safety Engineering, 9, 193-198 (2002).
5. M. V. Koutras, G. K. Papadopoulos and S. G. Papastavridis, Reliability of 2-dimensional consecutive-k-out-of-n:F systems, IEEE Transactions on Reliability, 42, 658-661 (1993).
6. T. Akiba, H. Yamamoto and W. Saitou, Upper and lower bounds for reliability of 2-dimensional k-within-consecutive-(r, s)-out-of-(m, n):F system, Reliability Engineering Association of Japan, 22, 99-106 (2000) (in Japanese).
7. Z. M. Psillakis and F. S. Makri, A simulation approach of a d-dimensional consecutive-k-out-of-r-from-n:F system, Proceedings of the Third IASTED International Conference on Reliability, Quality Control and Risk Assessment, 14-19 (1994).
HOW CAN WE ESTIMATE SOFTWARE RELIABILITY WITH A CONTINUOUS-STATE SOFTWARE RELIABILITY MODEL?*
T. ANDO AND T. DOHI
Department of Information Engineering, Hiroshima University, 1-4-1 Kagamiyama, Higashi-Hiroshima 739-8527, Japan
E-mail: [email protected]
During the last three decades the stochastic counting (discrete-state) process models like non-homogeneous Poisson processes have described the software reliability growth phenomenon observed in the testing phase, and have gained popularity in explaining the software debugging process. On the other hand, the continuous-state process models based on Brownian motion processes have an advantage in terms of the goodness-of-fit test based on information criteria, like AIC and BIC. The most critical point for the continuous-state process models is that the software reliability cannot be well defined in their modeling framework. The purpose of this paper is to answer the titled question; that is, we propose two methods to define quantitatively the software reliability and the MTBSF (mean time between software faults) for a continuous-state software reliability model.
1. Introduction

Since reliable computer systems strongly depend on the reliability of both hardware and software, the reliability assessment of software systems is quite important. To assess the software reliability in the testing phase before release, stochastic models called software reliability models (SRMs) have been developed during the last three decades. Among them, the stochastic counting (discrete-state) process models like non-homogeneous Poisson processes (NHPPs) can describe the software reliability growth phenomenon observed in the testing phase, and have gained popularity in explaining the software debugging process. Goel and Okumoto and Yamada et al. propose the seminal NHPP models with the exponential and the S-shaped mean value curves. The main reason that a huge number of NHPP models have been developed in the literature is the simple structure of NHPPs; that is, the NHPP is one of the simplest marked point processes with time-dependent mean value. As a remarkable property, it is known that the NHPP is a specific stochastic counting process with the same variance as the mean value function.

* This work is supported by grant 15651076 (2003-2005) of the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Exploratory Research.
The equivalence between mean and variance for data generated from a stochastic counting process can be statistically tested. Ando et al. apply some hypothesis testing methods for NHPPs in Cox and Lewis [2] to real software fault-detection time data, and empirically conclude that the NHPP is not always suitable to represent the software debugging phenomenon. Also, they develop Markov diffusion processes with mean reversion and use them instead of the typical NHPP models. The continuous-state SRM was first introduced by Yamada et al. More precisely, they model the dynamic behavior of the number of remaining faults in the software debugging phase by a multiplicative linear stochastic differential equation, say, the geometric Brownian motion. This model is extended by Kimura et al. and Yamada et al. by introducing a time-dependence structure in the infinitesimal mean. As shown by Ando et al. and Yamada et al., the continuous-state SRMs have an advantage in terms of the goodness-of-fit test based on information criteria, like AIC (Akaike information criterion) and BIC (Bayesian information criterion). The most critical point for the continuous-state process models is that the software reliability cannot be well defined in their modeling framework. In fact, all the literature mentioned above fails to discuss the software reliability as a probability that the software will operate as required for a specified time in a specified environment. For continuous-state SRMs, strictly speaking, the software reliability cannot be defined consistently, because the state representing the number of detected faults can take real (non-integer) values and can decrease with positive probability. Nevertheless, the continuous-state SRMs have attractive characteristics in terms of the goodness-of-fit test. This point resembles the applicability of time series analysis techniques to predicting the number of remaining faults in the software. To this end, the continuous-state SRMs poorly explain the software debugging processes, but can provide better prediction performance than the existing NHPP models.

In this paper, we propose two methods to define quantitatively the software reliability and the MTBSF (mean time between software faults) for a continuous-state software reliability model. Applying some results on the first passage problem for the geometric Brownian motion process, we define the software reliability and the MTBSF in a reasonable way.

2. Non-homogeneous Poisson Process Models

Let X(t) be a stochastic process denoting the number of software faults detected up to time t. The stochastic process X(t) is said to be an NHPP if the probability mass function is given by

$$\Pr\{X(t) = n\} = \frac{\{\Lambda(t)\}^{n}}{n!}\exp\{-\Lambda(t)\}, \qquad n = 0, 1, 2, \ldots, \tag{1}$$
where $\Lambda(t) = E[X(t)]$ is the mean value function and $E$ denotes the mathematical expectation operator. Goel and Okumoto assume that the expected number of software faults detected per unit time is proportional to the expected number of remaining faults. That is, letting $\lambda(t)$ be the intensity function of the NHPP, it is assumed that

$$\lambda(t) = \frac{d\Lambda(t)}{dt} = b\{a - \Lambda(t)\}, \tag{2}$$
where $a$ (> 0) and $b$ (> 0) are the expected initial number of software faults and the fault detection rate per unit time, respectively. Solving the above ordinary differential equation with the initial condition $\Lambda(0) = 0$ yields
$$\Lambda(t) = a\big(1 - \exp\{-bt\}\big). \tag{3}$$
Yamada et al. take account of the delay effect between the fault detection phase and the fault isolation phase, and develop the delayed S-shaped SRM with the mean value function:
$$\Lambda(t) = a\big(1 - (1 + bt)\exp\{-bt\}\big). \tag{4}$$
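Both mean value functions are elementary to evaluate; a minimal sketch:

import math

def mean_value_exponential(t, a, b):
    """Goel-Okumoto mean value function, Eq. (3)."""
    return a * (1.0 - math.exp(-b * t))

def mean_value_delayed_s(t, a, b):
    """Delayed S-shaped mean value function, Eq. (4)."""
    return a * (1.0 - (1.0 + b * t) * math.exp(-b * t))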
In this way, by imposing a deterministic debugging scenario on the mean value function, we can develop many kinds of NHPP-based SRMs. As the authors point out in [1], the hypothesis that the software debugging process can be described by an NHPP is rather questionable, because the NHPP is a simple but specific stochastic counting process with the same variance as the mean value function.

3. Continuous-State Software Reliability Model
Yamada et al. propose a continuous SRM from a different point of view. Given the constant (not mean) initial fault content $a_0$ (> 0), they assume that the number of remaining software faults at time $t$, $S(t) = a_0 - X(t)$, is described by

$$\frac{dS(t)}{dt} = -bS(t), \qquad S(0) = a_0. \tag{5}$$
In actual cases, the progress of the testing procedures is influenced by various uncertain factors. If the fault detection rate $b$ is irregularly influenced by random factors such as the testing-effort expenditures, the skill of the test personnel and the testing tools, it can also be regarded as a stochastic process depending on time. Hence, it is assumed that Eq. (5) is extended to the stochastic differential equation

$$\frac{dS(t)}{dt} = -\{b + \xi(t)\}S(t) \tag{6}$$

with a time-dependent noise $\xi(t)$ that exhibits an irregular fluctuation. Yamada et al. make its solution a Markov process and regard the noise factor as $\xi(t) = \sigma\gamma(t)$, where $\sigma$ (> 0) is a scale parameter (constant) representing the magnitude of the irregular fluctuation and $\{\gamma(t), t \ge 0\}$ is a standardized Gaussian white noise. Substituting $\xi(t) = \sigma\gamma(t)$ into Eq. (6), we obtain the following stochastic differential equation of Itô type:

$$dS(t) = -\Big(b - \frac{\sigma^2}{2}\Big)S(t)\,dt + \sigma S(t)\,dB(t), \tag{7}$$
where $\{B(t), t \ge 0\}$ is a one-dimensional Wiener (standard Brownian motion) process. The Wiener process is a Gaussian process with the following properties: (i) $\Pr\{B(0) = 0\} = 1$; (ii) $E[B(t)] = 0$; (iii) $E[B(t)B(t')] = \min(t, t')$. By use of the well-known Itô formula (see [5]), it is straightforward to solve the stochastic differential equation in Eq. (7) with the initial condition $S(0) = a_0$ as follows:
$$S(t) = a_0\exp\{-bt + \sigma B(t)\}. \tag{8}$$
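Sample paths of Eq. (8) can be simulated exactly on a time grid from the independent Gaussian increments of B(t); a minimal sketch:

import math, random

def sample_path(a0, b, sigma, t_max, steps=1000):
    """Exact grid simulation of S(t) = a0 * exp(-b*t + sigma*B(t))."""
    dt = t_max / steps
    path, bm = [a0], 0.0
    for _ in range(steps):
        bm += random.gauss(0.0, math.sqrt(dt))   # B(t+dt) - B(t) ~ N(0, dt)
        t = dt * len(path)
        path.append(a0 * math.exp(-b * t + sigma * bm))
    return path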
This stochastic process is called the geometric Brownian motion process, or the lognormal process, and is often used to describe stock price dynamics in financial engineering. From the Gaussian property of $B(t)$, the transition probability distribution function of the process $S(t)$ is given by the lognormal distribution:

$$\Pr\{S(t) \le s\} = \Phi\left(\frac{\log(s/a_0) + bt}{\sigma\sqrt{t}}\right), \tag{9}$$

where $\Phi(\cdot)$ denotes the standard normal distribution function:

$$\Phi(x) = \int_{-\infty}^{x}\frac{1}{\sqrt{2\pi}}\exp\Big\{-\frac{y^2}{2}\Big\}\,dy. \tag{10}$$
From Eq. (9) and the properties of the Brownian motion process, the number of software faults experienced by time $t$ is given by

$$X(t) = a_0 - S(t) = a_0\big(1 - \exp\{-bt + \sigma B(t)\}\big), \qquad X(0) = 0, \tag{11}$$

with

$$\lim_{t \to \infty} X(t) = a_0 \quad \text{(almost surely)}, \tag{12}$$
provided that $b > \sigma^2/2$. Then the mean and the variance of $X(t)$ are easily derived as

$$E[X(t)] = a_0\Big(1 - \exp\Big\{-\Big(b - \frac{\sigma^2}{2}\Big)t\Big\}\Big) \tag{13}$$

and

$$\mathrm{Var}[X(t)] = a_0^2\exp\{-(2b - \sigma^2)t\}\big(\exp\{\sigma^2 t\} - 1\big), \tag{14}$$

respectively.

4. Software Reliability Measures

The software reliability is defined as the probability that the software will operate as required (i.e., without failure) for a specified time in a specified environment. For non-decreasing stochastic counting processes like NHPPs, the software reliability can be defined by $R(t) = \Pr\{X(t) = 0 \mid X(0) = 0\}$. On the other hand, strictly speaking, it is impossible for continuous-state models to define the software reliability consistently. In this section, we give two approximate definitions of the software reliability for continuous-state SRMs, as follows.
ao), (2) Definition 2: & ( t ) = Pr{supo,,5tX(T) X(0) = 01,
< 1 I X ( 0 ) = 0)
=
Pr{Tl < t
I
where
Tj = inf{t 2 0 : X ( t ) 2 j
1 X ( 0 ) = 0)
(15)
is the first passage time of the stochastic processes X ( t ) to the arbitrary level j (< ao). Since Definition 1 indicates that the reliability function is approximated by the survivor function, it is seen immediately that
For Definition 2; on the other hand, we have to solve the first passage problem on the geometric Brownian motion process. From the strong Markov property of the geometric Brownian motion process, we can derive the following renewal-type equation:
where f(.) and ator, i.e.
'*' are the density function of Ti and the Stieltjes convolution operA * B(t)=
Jo'
A(t - z ) d B ( Z )
for two continuous functions A ( t ) and B ( t ) with positive support, respectively. By taking the Laplace transform in the both sides of Eq.(17), we have
Next, we consider the MTBSF (mean time between software faults). In order to avoid any trouble on the definition of MTBSF, Yamada et al. 6 , apply unusual measures called instantaneous MTBSF and cumulative MTBSF. Let A X @ )be the increment of cumulative number of software faults detected during the time interval [t,t At]. Since the fault-detection time per fault is approximately given by A t / A X ( t ) ,taking the limitation yields At dt 1 lim At+o AX(t) dX(t) dX(t)/dt'
+
~
~
which is called the instantaneous TBSF and means the instantaneous time interval when a software fault is detected at time t . Then the MTBSF is defined as
[
MTBSF, ( t ) = E d X ( i ) / d t ]'
22 Unfortunately, since it is difficult to take the expectation of Eq.(22), the MTBSF is further approximated by MTBSFi(t)
1 M
= {u(b-
g)exp{-(b-
g)t}}-'.
(22)
E[dX( t )/dtI
Similar to the instantaneous MTBSF, if t / X ( t )is regarded as the fault-detection time by time t , then the cumulative MTBSF is defined by
From the same reason of analytical treatment as the instantaneous MTBSF; we obtain the approximate formula: MTBSF,(t)
t
-- t{ao(l -exp{-(bE[X(t)l
;)t})}-'.
(24)
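Both approximations can be computed in a few lines; again the parameter values below are placeholders, not fitted estimates:

```python
from math import exp

def mtbsf_inst(t, a0, b, sigma):
    """Approximate instantaneous MTBSF, Eq.(22):
    1 / E[dX(t)/dt], with E[X(t)] = a0*(1 - exp(-(b - sigma^2/2)*t))."""
    c = b - sigma**2 / 2.0
    return 1.0 / (a0 * c * exp(-c * t))

def mtbsf_cum(t, a0, b, sigma):
    """Approximate cumulative MTBSF, Eq.(24): t / E[X(t)]."""
    c = b - sigma**2 / 2.0
    return t / (a0 * (1.0 - exp(-c * t)))

a0, b, sigma = 100.0, 0.01, 0.05   # illustrative values with b > sigma^2/2
for t in (10.0, 100.0, 500.0):
    print(t, mtbsf_inst(t, a0, b, sigma), mtbsf_cum(t, a0, b, sigma))
```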
If the underlying stochastic process X(t) is a renewal process, which includes the homogeneous Poisson process, the above two definitions are appropriate for representing the mean fault-detection time interval and its cumulative counterpart. The main reason to apply such unusual measures is that the probability distribution of the fault-detection time interval for an NHPP with bounded mean value function is improper, i.e., there is a mass of the distribution at infinity and the corresponding mean value diverges. Fortunately, for the continuous-state SRM in Eq. (11) it is easy to obtain both measures based on Definition 2. More precisely, the mean time between the j-th software fault and the (j + 1)-st one is defined without approximation by

MTBSF_i(j) = E[T_{j+1} − T_j]  (j = 1, 2, ..., a_0 − 1)  (25)
from Eq. (19). Also, the arithmetic mean of the MTBSF up to the j-th detected fault is regarded as a counterpart of the cumulative MTBSF, and is given by

MTBSF_c(j) = (1/j) Σ_{k=1}^{j} MTBSF_i(k).
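The exact measures above rest on the first-passage times T_j. Where the transform of Eq. (19) is not at hand, E[T_j] can be estimated by simulation; this Monte Carlo sketch is ours, with placeholder parameters:

```python
import random
from math import exp, sqrt

def mean_passage_times(a0, b, sigma, levels, dt=0.01, t_max=2000.0,
                       n_paths=500, seed=7):
    """Estimate E[T_j] for X(t) = a0*(1 - exp(-b*t + sigma*B(t)))
    by Euler simulation of B(t); requires b > sigma**2/2 so that
    every level j < a0 is eventually crossed."""
    rng = random.Random(seed)
    sums = {j: 0.0 for j in levels}
    for _ in range(n_paths):
        t, w = 0.0, 0.0
        remaining = sorted(levels)
        while remaining and t < t_max:
            t += dt
            w += rng.gauss(0.0, sqrt(dt))
            x = a0 * (1.0 - exp(-b * t + sigma * w))
            while remaining and x >= remaining[0]:
                sums[remaining.pop(0)] += t
        for j in remaining:          # censor unfinished paths at t_max
            sums[j] += t_max
    return {j: s / n_paths for j, s in sums.items()}

print(mean_passage_times(a0=100.0, b=0.01, sigma=0.05, levels=[1, 5, 10]))
```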
5. Numerical Illustrations
We compare the software reliability measures using real software fault data observed in an actual testing process. Two data sets, DS-#1 (109) and DS-#2 (317), are used, where the number in brackets denotes the number of software fault-detection time data measured as CPU time. For instance, in DS-#1, we estimate the model parameters using 109 fault-detection time data and estimate MTBSF_i(t) and MTBSF_c(t) at CPU time t = 49171 (the 80% point of the total data), and MTBSF_i(j) and MTBSF_c(j) for the j (= 110, 111, 123, 136, 137)-th detected faults. Figure 1 illustrates the behavior of the estimated software reliability as a function of testing time, where NHPP (Exp) and NHPP (S-shape) denote the exponential NHPP and the delayed S-shaped NHPP SRMs, respectively.
Figure 1. Behavior of estimated software reliability functions (DS-#1 and DS-#2).
For the continuous-state SRM, we apply the usual maximum likelihood method and estimate a_0 = 139.27 (667.56), b = 3.10 × (1.62 × 10^{-5}) and σ = 1.84 × (6.08 × 10^{-4}) with DS-#1 (DS-#2). On the other hand, we have a = 118.32 (330.79) and b = 5.17 × 10^{-5} (7.97 ×) for the exponential NHPP, and a = 110.56 (318.89) and b = 1.27 × 10^{-4} (1.82 ×) for the delayed S-shaped NHPP with DS-#1 (DS-#2). From these figures, the software reliability is underestimated for both definitions when the continuous-state SRM is assumed. For DS-#1 (DS-#2), we calculate AIC = 446.77 (1269.12) and BIC = 454.76 (1280.37) for the continuous-state SRM. In the cases of the exponential NHPP and the delayed S-shaped NHPP SRMs, we have AIC = 1503.79 (3489.30), BIC = 1509.18 (3496.82) and AIC = 1590.28 (3551.74), BIC = 1595.67 (3559.26) with DS-#1 (DS-#2), respectively. This result implies that the reliability function based on the continuous-state SRM is more reliable, because both information criteria can be considered approximate forms of a distance between the real and the supposed probability distributions. As is expected intuitively, Definition 2 is more conservative than Definition 1, which is plausible from its physical meaning. Figure 2 plots the MTBSFs, where 'real' means the actual fault-detection time interval data, and the cumulative MTBSF is calculated as the moving average of MTBSF_c(j). From the figures, it is seen that the instantaneous MTBSF (MTBSF_i(t) in Eq. (22)) overestimates in the initial phase but approaches the real MTBSF (MTBSF_i(j) in Eq. (25)) as the testing time goes on. On the other hand, the cumulative MTBSF (MTBSF_c(j)) behaves similarly to the approximate one (MTBSF_c(t) in Eq. (24)) over a wide range. Table 1 presents the comparison between MTBSF_i(t) (MTBSF_c(t)) and MTBSF_i(j) (MTBSF_c(j)). Comparing MTBSF_i(t) with MTBSF_i(j) for arbitrary j, the former underestimates (overestimates) the MTBSF for smaller (larger) j. Also, if we apply the cumulative MTBSF in Eq. (24), it tends to underestimate the actual cumulative MTBSF.
Figure 2. Behavior of estimated MTBSFs for DS-#1 and DS-#2 (legend: MTBSF_i(t), MTBSF_c(t), MTBSF_i(j); horizontal axis: number of faults).
Table 1. Comparison of MTBSFs.

DS-#1 (109)                          DS-#2 (317)
MTBSF_i(t = 49171)    3356.69        MTBSF_i(t = 39862)    534.88
MTBSF_c(t = 49171)     691.04        MTBSF_c(t = 39862)    201.03
MTBSF_i(j = 110)      1119.59        MTBSF_i(j = 318)      229.37
MTBSF_i(j = 111)      1159.89        MTBSF_i(j = 319)      197.95
MTBSF_i(j = 123)      2042.05        MTBSF_i(j = 357)      177.38
MTBSF_i(j = 136)     11692.50        MTBSF_i(j = 396)      177.89
MTBSF_i(j = 137)     18535.60        MTBSF_i(j = 397)      228.52
MTBSF_c(j = 137)      3523.37        MTBSF_c(j = 397)      200.88
References
1. T. Ando, T. Dohi and H. Okamura, Continuous software reliability models: their validation, effectiveness and limitation, under submission.
2. D. R. Cox and P. A. Lewis, Statistical Analysis of Series of Events, Methuen, London, 1966.
3. A. Goel and K. Okumoto, Time-dependent error-detection rate model for software reliability and other performance measures, IEEE Trans. Reliab., R-28, pp. 206-211, 1979.
4. M. Kimura and S. Yamada, Software reliability management techniques and their tools, in Handbook of Reliability Engineering (H. Pham, ed.), Chapter 15, pp. 265-284, Springer-Verlag, London, 2003.
5. B. Oksendal, Stochastic Differential Equations, Springer-Verlag, Berlin, 1992.
6. S. Yamada, M. Ohba and S. Osaki, S-shaped reliability growth modeling for software error detection, IEEE Trans. Reliab., R-32, pp. 475-478, 1983.
7. S. Yamada, M. Kimura, H. Tanaka and S. Osaki, Software reliability measurement and assessment with stochastic differential equations, IEICE Trans. Fundamentals (A), E77-A, pp. 109-116, 1994.
8. S. Yamada, A. Nishigaki and M. Kimura, A stochastic differential equation model for software reliability assessment and its goodness-of-fit, Int'l J. Reliab. Applic., 4, pp. 1-11, 2003.
A STUDY ON RELIABLE MULTICAST APPLYING CONVOLUTIONAL CODES OVER FINITE FIELD
M. ARAI, S. FUKUMOTO, AND K. IWASAKI
Graduate School of Engineering, Tokyo Metropolitan University
1-2 Minami-osawa, Hachioji, Tokyo 192-0397, Japan
E-mail: {arai, fukumoto, iwasaki}@eei.metro-u.ac.jp

We apply (n, k, m) convolutional codes, in which the elements of the generator matrices are taken over a finite field, to Hybrid ARQ combined with retransmission mechanisms, and evaluate the number of transmitted packets and the number of transmissions. We assume a star topology and an independent-loss model for the transmission model, and compare two retransmission strategies: (1) one which has been applied with Reed-Solomon codes, and (2) one which chooses the transmitted packets considering the constraint length m. We use computer simulation for (14, 7, m) and (6, 3, m) convolutional codes to observe the effects of four parameters, that is, the number of receivers, the constraint length m, the packet loss probability, and the redundancy at the initial transmission PF, on the number of transmissions and transmitted packets. Simulation results show that the number of transmitted packets can be reduced by the retransmission mechanism which considers m. Also, a PF which minimizes the number of transmitted packets exists for a given packet loss probability, number of receivers, and constraint length.
1. Introduction
Packet loss recovery is one of the important techniques for improving the reliability of communications over the Internet [1-3]. While the automatic repeat request (ARQ) scheme has been widely applied on the Internet, many studies have also been done on forward error correction (FEC) and Hybrid ARQ, which combines the concepts of ARQ and FEC [4, 5]. In Hybrid ARQ, a receiver first tries to recover lost information packets using the received redundant packets. When some packets are not recovered, the receiver requests the sender to retransmit information or redundant packets. Reliable multicast is expected to be one of the most attractive areas for applying Hybrid ARQ [5-7]. By applying Hybrid ARQ to reliable multicast, it is possible to reduce the number of transmitted packets, suppressing the state explosion caused by an increasing number of receivers [7]. Much research has focused on the application of (n, k) Reed-Solomon (RS) codes as FEC. A scheme in which a logical hierarchy is introduced among the receivers has also been investigated, aiming to reduce the total number of transmitted packets by local recovery and retransmission [8, 9]. We have proposed a Hybrid ARQ scheme that combines (n, n − 1, m) convolutional codes with the retransmission of redundant packets [10]. In ref. [10], we applied
the convolutional codes whose generator matrices consist of the elements 0 and 1, resulting in a large number of retransmitted packets. On the other hand, we have also proposed an FEC scheme that applies (n, k, m) convolutional codes in which the elements of the generator matrices are taken over finite fields [11]. This scheme can easily arrange generator matrices that can recover k lost packets from randomly chosen k redundant packets, the same as the RS codes. It also shows higher packet recoverability than RS codes under the same values of n and k when the packet loss probability is low. Therefore, by applying it to reliable multicast communications combined with appropriate retransmission schemes, the numbers of transmissions and transmitted packets are expected to be further reduced. In this paper we apply (n, k, m) convolutional codes over a finite field to reliable multicast, and evaluate the number of transmissions and transmitted packets. We assume a star topology with an independent packet loss model, and investigate two retransmission strategies. We use computer simulations to estimate the number of transmissions and packets for (14, 7, m) and (6, 3, m) convolutional codes, under the given parameters such as the constraint length m, the packet loss probability, the number of receivers, and the proactivity factor that indicates the redundancy of the initial transmission.

2. Packet loss recovery using (n, k, m) convolutional codes and transmission model
2.1. Packet loss recovery using (n, k, m) convolutional codes

Here we briefly explain packet loss recovery using (n, k, m) convolutional codes. Encoding and decoding strategies are described in detail in refs. [10, 11]. The sequence of information packets that a sender generates is divided into groups, each of which consists of k packets. Let u_i = [u_{i,1} u_{i,2} ... u_{i,k}] denote the i-th group, where u_{i,j} is an information packet. For each group u_i, a code group v_i = [v_{i,1} v_{i,2} ... v_{i,n}] that consists of n packets, that is, k information and (n − k) redundant packets, is generated as in Eq. (1), where g(D) is the k by k matrix whose elements are polynomials in the delay operator D of degree at most m (an element g_{ij}(D) is of the form given in Eq. (2)), m being the parameter called the constraint length. Generation of the code group is performed in parallel for each q bits in the packets. That is, regarding the coefficients and each q bits in the packets as elements of the finite field GF(2^q), every q bits in a redundant packet are generated from Eq. (1). In this paper q is set to 8.
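The byte-parallel generation of a redundant packet can be sketched as follows. This is our illustration, not the authors' implementation: the GF(2^8) reduction polynomial (the common 0x11B) and the coefficient layout coeffs[d][j] are assumptions.

```python
def gf256_mul(a: int, b: int, poly: int = 0x11B) -> int:
    """Carry-less multiply in GF(2^8), reduced by 'poly' (assumed 0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def redundant_packet(groups, coeffs):
    """One redundant packet from the current and m previous info groups.

    groups: list of m+1 info groups (delay taps D^0 .. D^m), each a list
            of k equal-length byte packets; coeffs[d][j] is the assumed
            GF(2^8) coefficient of D^d applied to info packet j.
    """
    length = len(groups[0][0])
    out = bytearray(length)
    for d, group in enumerate(groups):        # delay taps 0..m
        for j, packet in enumerate(group):    # k info packets per group
            c = coeffs[d][j]
            for pos in range(length):         # byte-parallel over GF(2^8)
                out[pos] ^= gf256_mul(c, packet[pos])
    return bytes(out)
```

Because every byte position is processed independently, recovery then reduces to solving byte-wise linear systems over GF(2^8), as described in the text.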
The sender transmits a part of or all of the packets in the generated code groups. We assume that the received packets contain no bit errors and that the receivers can locate the positions of the lost packets. Then Eq. (1) also holds at the receivers, with lost packets regarded as unknown values. Eq. (1) contains (n − k) equations. Thus, when L code groups v_1, ..., v_L are transmitted continuously, L·(n − k) equations hold. Receivers try to recover lost packets by solving these simultaneous equations.

2.2. Transmission model

Figure 1 shows the assumed transmission model. We apply a star topology [5, 6]. There are R receivers, and the communication links between the sender and the receivers are independent of each other. We also assume that each packet transmitted from the sender may be lost independently on a link with fixed packet loss probability p [5, 7-9].
(Steps: 1. generate nL packets; 2. send L·k·PF packets; 3. receive and recover; 4. notify e_{i,r}.)
Figure 1. Transmission model.
The sender is going to transmit kL information packets. While the sender generates nL packets by using the convolutional code, the kL information packets and a part of the L·(n − k) redundant packets are sent at the initial transmission. We introduce a parameter called the proactivity factor, denoted PF, to determine which redundant packets are sent. PF is the redundancy at the initial transmission, that is, the ratio of the number of initially transmitted packets to the number of information packets [6]. Receiver r (1 ≤ r ≤ R) tries to recover lost information packets by using the received information and redundant packets. If one or more packets cannot be recovered, the receiver notifies the number of unrecovered information packets in the i-th code group,
e_{i,r} (1 ≤ i ≤ L), to the sender, requesting retransmission. According to the applied retransmission strategy described in the following section, the sender determines which packets are retransmitted, and it transmits the same set of packets to every receiver. Request and retransmission are repeated until all receivers receive or recover all information packets.

3. Retransmission strategy

We deal with two retransmission strategies. Retransmission strategy 1 is similar to the one that has been applied to RS-code-based Hybrid ARQ [5, 6]. Retransmission strategy 2 chooses the transmitted packets in consideration of the constraint length m, the parameter of the convolutional code.
3.1. Retransmission strategy 1

The algorithm for selecting the transmitted packets under Retransmission strategy 1 is as follows, where c_i is the number of redundant packets already transmitted in the code group v_i and e_{i,r} is the number of information packets that receiver r cannot receive or recover (a sketch in code follows the list).

(1) Set i to 1.
(2) Calculate E_i as the maximum of e_{i,r} over 1 ≤ r ≤ R.
(3) If (c_i + E_i) ≤ (n − k), transmit E_i redundant packets in v_i and increase c_i by E_i.
(4) If (c_i + E_i) > (n − k), transmit k·PF packets v_{i,1}, ..., v_{i,k·PF} and set c_i to k·(PF − 1), similarly to the initial transmission.
(5) Increase i by 1, and repeat steps 2 to 4 until i > L.
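The following sketch restates the selection rule above in Python; the data structures and names are ours, chosen for illustration:

```python
def strategy1_selection(e, c, n, k, pf):
    """Retransmission strategy 1 (sketch): choose packets to resend.

    e[i][r]: unrecovered info packets of group i at receiver r
    c[i]:    redundant packets already sent for group i (updated in place)
    Returns a list of (group index, description) retransmission actions.
    """
    actions = []
    for i in range(len(e)):
        E = max(e[i])                      # worst receiver for group i
        if E == 0:
            continue
        if c[i] + E <= n - k:              # enough unsent redundancy left
            actions.append((i, f"{E} redundant packets of v_i"))
            c[i] += E
        else:                              # fall back to a re-initial send
            actions.append((i, f"{int(k * pf)} packets v_i,1..v_i,k*PF"))
            c[i] = int(k * (pf - 1))
    return actions
```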
3.2. Retransmission strategy 2

The selection algorithm for this strategy is as follows, where c′_j is a counter that memorizes the number of redundant packets which are going to be sent in v_j.

(1) Set r = 1, i = 1, and c′_j = 0 (1 ≤ j ≤ L).
(2) If e_{i,r} > 0, decrease e_{i,r} by c′_i, ..., c′_{i+m}.
(3) If e_{i,r} > 0, check whether a j exists that satisfies 0 ≤ j ≤ m and (c_{i+j} + c′_{i+j}) ≤ (n − k). If such a j exists, one redundant packet that has not been sent yet is marked as going to be sent, c′_{i+j} is increased by one, e_{i,r} is decreased by one, and step 3 is repeated.
(4) If e_{i,r} > 0 and no such j exists, transmit k·PF packets v_{i,1}, ..., v_{i,k·PF}, set c_i to k·(PF − 1), and reset c′_i to c_i.
(5) Repeat steps 2 to 4 for all code groups, that is, 1 ≤ i ≤ L.
(6) Reset i to 1, and repeat steps 2 to 5 for all receivers, that is, 1 ≤ r ≤ R.
(7) Send the packets that are marked to be sent, and update c_i according to the sent packets.
Figure 2 shows an example of the packets retransmitted under Retransmission strategies 1 and 2 with the (6, 3, 1) convolutional code, R = 3, PF = 4/3, and L = 4. A rectangle with a cross at a receiver means a packet which is not recovered. At the sender, packets drawn with a solid line are sent at the initial transmission, and the ones with a dotted line are not sent. Packets with a bold line are chosen to be sent by Retransmission strategy 1, and the ones painted gray are chosen by Retransmission strategy 2.

Figure 2. Example of packets transmitted at a retransmission.
For the three unrecovered packets in the center code group, that is, v_3, of receiver 1, Retransmission strategy 2 sends three redundant packets in v_3 and v_4. Thus the number of retransmitted packets becomes 5, reduced in comparison with Retransmission strategy 1, which needs to send 7 packets.

4. Evaluation of the number of transmissions and packets

We used computer simulation to evaluate the two retransmission strategies. The evaluation measures were the average number of transmissions and the forward bandwidth required for all receivers to receive or recover all information packets, obtained from 1000 trials. The forward bandwidth was calculated as the total number of transmitted information and redundant packets divided by kL, the number of information packets. Figure 3 shows the measurement results for the number of transmissions and the forward bandwidth under the condition that p = 0.05, L = 20, and R = 1000. The applied codes were the (14, 7, 0), (14, 7, 3), and (14, 7, 6) convolutional codes. The x-axis is the proactivity factor PF, which was set to 1, 8/7, 9/7, ..., or 2. The (14, 7, 0) convolutional code is equivalent to the (14, 7) RS code, and the results were the same for Retransmission strategies 1 and 2 with the (14, 7, 0) convolutional code.
Figure 3. Effects of PF on (a) the number of transmissions and (b) the forward bandwidth for (14, 7, 0), (14, 7, 3), and (14, 7, 6) convolutional codes, where R = 1000 and p = 0.05.
For Retransmission strategy 1, the constraint length m has little effect on the number of transmissions when m is greater than 0. On the other hand, for Retransmission strategy 2, the number of transmissions for lower PF increases significantly as m increases. The (14, 7, 0) convolutional code showed almost the same values for the two strategies. In comparison with Retransmission strategy 1 under the same m of the (14, 7, m) convolutional codes, Retransmission strategy 2 shows lower forward bandwidth. The figure also indicates the existence of an optimal PF value that minimizes the forward bandwidth. However, the optimal PF for Retransmission strategy 1 is not always the same as that for Retransmission strategy 2. When the (14, 7, 6) convolutional code is applied with Retransmission strategy 1, the forward bandwidth is minimized to 1.33 at PF = 9/7. A lost information packet can be recovered by redundant packets in the succeeding m code groups. Then, when m and PF are small, a transmitted redundant packet sometimes may not contribute to recovery. This is considered the reason for the existence of the minimum and local minima of the forward bandwidth. Figure 4 shows the measurement results for the number of transmissions and the forward bandwidth with the (14, 7, 0), (14, 7, 3), and (14, 7, 6) convolutional codes under p = 0.05 and PF = 8/7. The x-axis is the number of receivers R. The number of transmissions under Retransmission strategy 2 rapidly increases as m and R increase. Retransmission strategy 2 intends to reduce the number of transmitted packets as much as possible, without considering the packet loss probability. Therefore, redundant packets lost at retransmissions sometimes make lost information packets unrecoverable, resulting in repeated retransmissions. On the other hand, a larger m improves the forward bandwidth under the same value of R. Figure 5 shows the measurement results of the forward bandwidth when Retransmission strategy 2 is applied with the (14, 7, 6) convolutional code under R = 1000 and packet loss probabilities in the range of 0.01 to 0.1.
Figure 4. Effects of R on the number of transmissions and the forward bandwidth for (14, 7, 0), (14, 7, 1), (14, 7, 3), and (14, 7, 6) convolutional codes, where PF = 8/7 and p = 0.05.
For a given PF, the forward bandwidth becomes lower as the packet loss probability decreases. For a given packet loss probability, an optimal PF value that minimizes the forward bandwidth exists, while the optimal value differs depending on the loss probability. Figure 6 shows the measurement results for the forward bandwidth with the (6, 3, 0), (6, 3, 3), and (6, 3, 6) convolutional codes under p = 0.05 and R = 1000. Similarly to the results with the (14, 7, m) convolutional codes, the (6, 3, m) convolutional codes had an optimal PF, and Retransmission strategy 2 showed better forward bandwidth than strategy 1 when m was greater than 0. The (6, 3, m) convolutional codes generally showed higher forward bandwidth than the (14, 7, m) convolutional codes under the same values of m and PF.

5. Conclusions

In this paper we applied (n, k, m) convolutional codes over a finite field to reliable multicast, and evaluated the number of transmissions and transmitted packets. We considered two retransmission strategies. We used computer simulations for (14, 7, m) and (6, 3, m) convolutional codes, under the given parameters of constraint length, packet loss probability, number of receivers, and proactivity factor. Simulation results showed the existence of a PF that minimizes the number of transmitted packets under given parameters. Retransmission strategy 2 showed a smaller number of transmitted packets than Retransmission strategy 1, while the number of transmissions increased.
Figure 5. Effects of PF and packet loss probability on the forward bandwidth for Retransmission strategy 2 with the (14, 7, 6) convolutional code, where R = 1000 (curves for p = 0.04, 0.06, 0.08, 0.10).
Figure 6. Effects of PF on the forward bandwidth for Retransmission strategy 2 with (6, 3, 0), (6, 3, 1), (6, 3, 3), and (6, 3, 6) convolutional codes, where R = 1000 and p = 0.05.
References
1. C. Perkins, O. Hodson, and V. Hardman, "A survey of packet-loss recovery techniques for streaming audio," IEEE Network Magazine, Sep./Oct. 1998.
2. H. Liu, H. Ma, M. El Zarki, and S. Gupta, "Error control schemes for networks: An overview," ACM Mobile Networks & Applications, Vol. 2, No. 2, pp. 167-182, 1997.
3. M. Yajnik, S. Moon, J. Kurose, and D. Towsley, "Measurement and Modeling of the Temporal Dependence in Packet Loss," Proc. of IEEE INFOCOM '99, pp. 94-99.
4. L. Rizzo, "Effective Erasure codes for reliable computer communication protocols," Computer Communication Review, Vol. 27, No. 2, pp. 167-182, Oct. 1997.
5. J. Nonnenmacher, E. Biersack, and D. Towsley, "Parity-Based Loss Recovery for Reliable Multicast Transmission," IEEE/ACM Trans. Networking, Vol. 6, No. 4, pp. 349-361, Aug. 1998.
6. D. Rubenstein, J. Kurose, and D. Towsley, "Real-Time Reliable Multicast Using Proactive Forward Error Correction," Proc. of IEEE NOSSDAV '98, Jul. 1998.
7. C. Metz, "Reliable Multicast: When Many Must Absolutely Positively Receive It," IEEE Internet Computing, Vol. 2, No. 4, pp. 9-13, Jul./Aug. 1998.
8. R. Kermode, "Scoped Hybrid Automatic Repeat Request with Forward Error Correction (SHARQFEC)," Proc. of ACM SIGCOMM '98, pp. 278-289, Oct. 1998.
9. J. Nonnenmacher, M. S. Lacher, M. Jung, E. Biersack, and G. Carle, "How Bad is Reliable Multicast without Local Recovery?," Proc. of IEEE INFOCOM '98, pp. 972-979, Apr. 1998.
10. A. Yamaguchi, M. Arai, S. Fukumoto, and K. Iwasaki, "Fault-Tolerance Design for Multicast Using Convolutional-Code-Based FEC and Its Analytical Evaluation," IEICE Trans. Info. & Sys., Vol. E85-D, No. 5, pp. 864-873, May 2002.
11. M. Arai, S. Fukumoto, and K. Iwasaki, "A Study on Extension of Coefficients for (n, k, m) Convolutional-Code-Based FEC," 2nd Euro-Japanese Workshop on Stochastic Risk Modelling for Finance, Insurance, Production and Reliability, pp. 32-41, Sep. 2000.
RELIABILITY DESIGN OF INDUSTRIAL PLANTS USING PETRI NETS

MASSIMO BERTOLINI
Dipartimento di Ingegneria Industriale, Università degli Studi di Parma, Viale delle Scienze, 181/A - 43100 Parma (ITALY)

MAURIZIO BEVILACQUA
Dipartimento di Ingegneria delle Costruzioni Meccaniche, Nucleari, Aeronautiche e di Metallurgia, Università degli Studi di Bologna, sede di Forlì, Via Fontanelle 40 - 47100 Forlì (ITALY)

GIANLUIGI MASON
Dipartimento di Ingegneria Industriale, Università degli Studi di Parma, Viale delle Scienze, 181/A - 43100 Parma (ITALY)
This paper describes a reliability analysis tool for the design or revamping of industrial plants. The methodology is based on Failure Modes, Effects and Criticality Analysis (FMECA) and stochastic-event simulation analysis through Petri nets. The input data for the analysis are collected by applying the FMECA technique to the industrial plant, both in the design and in the revamping stage, obtaining useful information on event probability, occurrence and effects. The following step, i.e., the simulation of the system operation using Stochastic Petri Nets (SPN), makes it possible to calculate some important reliability parameters of the industrial plant, evaluating their change depending on the maintenance policies applied to plant items. In particular, the effects of preventive maintenance on system reliability have been analysed using Petri Nets, allowing a costs/benefits analysis. The proposed methodology has been applied to the Integrated Gasification and Combined Cycle (IGCC) plant of the API oil refinery in Falconara Marittima (Ancona, Italy), providing results that are consistent with the experimental reliability data, thus proving to be an effective decision-support tool for Reliability Engineering.
1 Introduction

Reliability analysis is an essential step in the design, revamping and management of any industrial plant. Several techniques and tools can be used for this aim, such as Fault Tree Analysis (FTA), Failure Modes, Effects and Criticality Analysis (FMECA) and the HAZard and OPerability study (HAZOP). These techniques are developed in order to collect reliability information on the system, such as system availability, mean time between failures (MTBF) and mean time to repair (MTTR). This kind of information, useful during normal plant operation, is really essential during revamping or redesign, in order to save time and money, since knowledge of the critical elements of the system at the beginning of the development process assures easier and cheaper changes in the plant redesign. The concept of Design For Reliability (DFR) is especially important in the case of complex and expensive products; in such a situation the know-how of maintenance procedures in similar machines or plants provides a great advantage to
the maintenance staff. This practical knowledge is generally fixed in the FMECA, an effective methodology to assess the critical elements of a machine or a plant. The FMECA technique, initially introduced as a tool for failure analysis during the product development process [7], was afterwards used to develop Total Productive Maintenance (TPM) plans, according to the rules of Reliability Centered Maintenance (RCM) [4]. Other interesting contributions to design reliability analysis can be found in [2], [6], [8], where the FMECA is associated with Quality Function Deployment (QFD). The authors propose a sequential utilization of FMECA and QFD in order to improve customer satisfaction through the development of a product that can satisfy customer requirements of quality and robustness, according to the Total Quality Management (TQM) philosophy. This paper presents a reliability analysis methodology for the design or revamping of industrial mechanical plants. The methodology uses Failure Modes, Effects and Criticality Analysis (FMECA) and stochastic-event simulation analysis. In the first step of the procedure, the input data for the analysis are collected through the application of the FMECA technique to the industrial plant, both in the design and in the revamping stage, obtaining useful information on event probability, occurrence and effects. In the second step, the behaviour of the system is simulated using Stochastic Petri Nets. The analysis is focused on the effective critical points of a plant or a machine rather than on customer requirements. Moreover, the effects of preventive maintenance on system reliability can be analysed using Petri Nets, making it possible to perform a costs/benefits analysis by evaluating the effects of a specific maintenance policy on the reliability parameters. The paper is organized in the following way: an overview of the FMECA methodology and of Petri Nets is first presented, followed by a description of the analysis methodology. Finally a case study is described, with the application of the tool to the Integrated Gasification and Combined Cycle (IGCC) plant in the API oil refinery in Falconara Marittima (Ancona, Italy).
2 Overview of FMECA methodology

The Failure Modes Effects and Criticality Analysis (FMECA) method [7] is probably the most famous technique for defining procedures to identify potential failures of products/processes. FMECA is characterized by a bottom-up approach. It breaks down any system (product and/or production process) into its fundamental parts to find all the potential failure modes and their effects. The analysis of the failure modes of a given production process provides important information on:
1. the subsystems and parts of the system in a hierarchical arrangement (functional analysis of the production plant);
2. any "failure" or generic "malfunctioning", with a list and a description of all the potential failure modes for the process/product being analysed;
3. the probability, severity and detectability of each failure mode;
4. the Criticality Analysis (CA), which ranks all the failure modes in order of importance. Where the risks are higher, it becomes necessary to propose corrective actions, checking the effectiveness of any of these actions and making sure that the criticality analysis is accordingly revised.

3 Petri Nets
Petri Nets (PN), developed by Carl Petri in his Ph.D. thesis, are a useful tool for analysing and modelling the dynamic behaviour of complex systems with concurrent discrete events. PN were first used by electronic and informatics engineers, for example to model microprocessor architectures [1]. A PN model is graphically represented by a directed bipartite graph consisting of two kinds of nodes, called places and transitions, drawn as circles (places) and boxes (transitions). Places can contain tokens, drawn as black dots, while transitions are labelled with their temporal delay D (stochastic or deterministic). Places and transitions are linked by weighted arcs. From a modelling point of view, places represent conditions and transitions represent events. A transition is characterised by a certain number of input places, describing the conditions to be verified for the firing of the transition, and a certain number of output places, representing the effects of the firing. Various PN applications in several industrial engineering fields are described in the scientific literature; in particular, Schneeweiss [5] describes several Petri Net applications for Reliability Modelling. For a detailed description of PN building and modelling tools the reader can refer to [1, 5].

4 The FMECA-Petri Nets approach to reliability design
The FMECA-Petri Nets methodology described here is developed to predict the reliability parameters of a complex system and to simulate the effects of the preventive maintenance policy on system availability.

Figure 1. PN representing a machine with five failure modes.

The FMECA technique is first used to collect data on the system failure modes and on failure criticality; from a system availability point of view, a failure is considered critical
if it causes a system breakdown. The following data are collected for each system component: 1. Mean Time Between Failures (MTBF), for corrective and preventive maintenance conditions; 2. Mean Time To Repair (MTTR); 3. preventive maintenance implementation parameters (maintenance time interval and maintenance action); 4. system status during maintenance (on/down). The collected data are used in the Petri nets simulation software to evaluate the system reliability parameters. For the development of the PN, every critical element previously identified is considered as an item subject to failures that can be repaired (as good as new) depending on its MTBF and MTTR (mean values of stochastic timed transitions with negative exponential distribution). Those transitions are linked to two places, representing the down and up states of the whole system. Figure 1 shows the case of a machine with five different failure modes. Each firing of a failure event removes a token from the on-condition place and adds a token to the off-condition place. The PN behaviour is symmetric for the repair event. The on-time and down-time of the system can then be easily evaluated by adding a time counter represented by a test arc, a deterministic timed transition t with unit delay, and a place that collects tokens representing time units, sent by the transition t, as shown in Figure 2.
Another place is finally introduced in the net to count the number of occurred failures. As can be seen from Figure 3, each failure is linked to a place to which a token is sent when a failure occurs.
Figure 3. Failure counter
Once the Petri Net design is completed, it is possible to obtain the desired reliability parameters, identified by: 1. TU = up time of the machine; 2. TD = down time of the machine; 3. TU + TD = T = total time of the simulation; 4. N = total number of occurred failures. The following parameters are then obtained:

1. MTBF_P = TU/N (MTBF of the machine);
2. MTTR_P = TD/N (MTTR of the machine);
3. A_P = TU/(TU + TD) (availability of the machine).
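As a cross-check of what the timed net computes, the same quantities can be estimated by a direct discrete-event simulation; this is a minimal sketch of the logic the net encodes, with illustrative (not plant) parameter values:

```python
import random

def simulate_machine(mtbfs, mttrs, horizon, seed=1):
    """Machine with competing exponential failure modes and exponential
    repair (as good as new). Accumulates up-time TU, down-time TD and
    failure count N, and returns (MTBF_P, MTTR_P, A_P)."""
    rng = random.Random(seed)
    t = tu = td = 0.0
    n = 0
    while t < horizon:
        ttfs = [rng.expovariate(1.0 / m) for m in mtbfs]
        ttf = min(ttfs)                       # first failure mode to fire
        repair = rng.expovariate(1.0 / mttrs[ttfs.index(ttf)])
        tu += ttf
        td += repair
        t += ttf + repair
        n += 1
    return tu / n, td / n, tu / (tu + td)

# five hypothetical failure modes (hours), five-year horizon
print(simulate_machine([4000, 6000, 8000, 10000, 12000],
                       [12, 8, 24, 10, 16], horizon=5 * 8760))
```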
Figure 4. PN simulating preventive maintenance.
When considering the case of preventive maintenance simulation, the principle of net building is the same as before, with the addition of preventive actions. The Petri Nets are modified by adding new transitions to represent those activities, as shown in Figure 4. MTBF values without preventive maintenance are used instead of the previous MTBF values, while the MTTR values do not change. The behaviour of the repaired elements is assumed to be as good as new. The described models have been applied to the feed pumps of the Gasification Unit of the Integrated Gasification and Combined Cycle (IGCC) plant in the API oil refinery in Falconara Marittima (Ancona, Italy).
1.1. The Integrated Gasification and Combined Cycle Plant case study

The API oil refinery of Falconara Marittima has recently introduced an innovative IGCC plant that allows the recovery of the highly sulphurous residual deriving from the refinery cycle, in order to produce syngas to be used as fuel to obtain electric power and steam. The plant is divided into two sections:
1. SMPP (Syngas Manufacturing Process Plant): the section where the residual is gasified and syngas is obtained (Texaco technology).
2. CCPP (Combined Cycle Power Plant): the co-generation section where electric power is produced in a combined cycle power plant.
The introduction of this innovative plant provides several advantages: lower pollution, because the previous thermo-electrical power plant was shut down, and higher profits, because several kinds of crude oil can now be processed, the electrical power produced is sold to the national electrical grid (286 MW), and the generated steam is used to serve refinery utilities. The object of our analysis is one of the most important parts of the plant, both from an economic and a technical point of view: the feed pumps of the Gasification Unit in the IGCC section.
Figure 5. Charge pump scheme.
The group of feed pumps consists of three alternative volumetric pumps in a 2-out-of-3 layout; this layout was chosen because of the economic and technical importance of this system. These pumps work in very severe conditions, because of the high density (982 kg/m3) and viscosity (148 cSt) of the fluid pumped at high pressure (80 bar) and temperature (271°C). Each pump is a complex machine made of several auxiliary systems, as represented in Figure 5.
Table 1. Example of breaking down according to the FMEA methodology.
Every pump was broken down into 90 elements, according to FMEA practice. Among those 90 elements, 30 critical ones were found, and the input data were collected as previously described. Table 1 and Table 2 report some of the collected data.

Table 2. Example of data collecting.
The three nets representing each of the pumps have then been linked together in order to study the availability of the whole 2-out-of-3 system, as shown in Figure 6. The Petri Net model used to evaluate the effect of the preventive maintenance policy was built according to the same rules previously introduced, and the three nets representing the single pumps were linked together as described before.
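For a rough validation of the linked-net output, the steady-state availability of a 2-out-of-3 group of independent, identical pumps has a simple closed form; this check is ours and is not part of the paper's PN model:

```python
def availability_2oo3(a: float) -> float:
    """2-out-of-3 system availability for independent pumps with
    availability a: all three up, or exactly two up."""
    return a**3 + 3 * a**2 * (1 - a)

print(availability_2oo3(0.985))  # ~0.999, close to the reported system value
```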
1.2. Final results

Figure 6. Linking the three nets.
The simulation analysis has been carried out according to the general procedure described in Law and Kelton [3]. The final results presented in Table 3 have been obtained by simulating the system operation over a five-year period. The Petri nets simulation output data are characterized by a confidence level of 95%. It should be noted that the results are very close to the limited and incomplete experimental data collected in the last two years. The MTBF values are high: this is due to the very severe operating conditions of the pumps. On the other hand, the MTTR values are very low because of the extreme importance of the machines from a technical and economic point of view: repair operations are always made on time and with great accuracy. The derived availability values are very high as a logical consequence of the fact that the MTTRs are much lower than the MTBFs.
Table 3. Final results.

                      PN, reliability    PN, preventive           Experimental
                      analysis           maintenance analysis     data
MTBF                  756.87 [hour]      778.82 [hour]            785.57 [hour]
MTTR                  11.04 [hour]       10.48 [hour]             10.08 [hour]
Pump availability     98.50%             98.60%                   98.70%
System availability   99.80%             99.90%                   99.90%
5 Conclusion

The use of Petri Nets for Reliability Modelling proved to be a very useful tool, providing several pieces of interesting information and thus completing the output of a Project FMECA. System reliability can be evaluated during the design stage or after modifying some parts of the system; preventive maintenance operations can be simulated in order to support maintenance planning, making it possible to evaluate how changes in the maintenance scheduling can modify system reliability and availability. The methodology proposed here can be applied to a wide range of industrial plants, helping the designer to gain useful information from existing reliability know-how and to consolidate it in a well-defined design process scheme.

References
1. Ajmone Marsan M., Balbo G., Conte G., Donatelli S., Franceschinis G., Modelling with Generalized Stochastic Petri Nets, John Wiley & Sons, Torino, Italy, (1994).
2. Ginn D.M., Jones D.V., Rahnehat H., Zairi M., The "QFD/FMEA interface", European Journal of Innovation Management, 1 (1), 7-20, (1998).
3. Law Averill M., Kelton W. David, Simulation Modeling & Analysis, McGraw-Hill, New York, USA, (1991).
4. Rausand M., Reliability Centered Maintenance, Reliability Engineering and Systems Safety, 60, 121-132, (1998).
5. Schneeweiss W. G., Petri Nets for Reliability Modeling, LiLoLe-Verlag GmbH (Publ. Co. Ltd.), Hagen, Germany, (1999).
6. Tan C.M., Customer-focused build-in reliability: a case study, International Journal of Quality & Reliability Management, 20 (3), 378-397, (2003).
7. US Military Standard, MIL-STD-1629A, Procedures for performing a failure mode, effects and criticality analysis, Department of Defense, USA, (1980).
8. Yang K., Kapur K.C., Customer Driven Reliability: Integration of QFD and Robust Design, Proceedings Annual Reliability and Maintainability Symposium, (1997).
OPTIMAL BURN-IN PROCEDURES IN A GENERALIZED ENVIRONMENT
JI HWAN CHA
Division of Mathematical Sciences, Pukyong National University, Busan, 608-737, KOREA
E-mail: [email protected]

JIE MI
Department of Statistics, Florida International University, Miami, FL 33199, USA
E-mail: [email protected]

Burn-in is a manufacturing technique that is intended to eliminate early failures. In the literature, the properties of optimal burn-in have been investigated under the assumption that the failure rate function of the products has a bathtub shape. In this paper the burn-in problem is studied under a more general assumption on the failure rate function of the products. An upper bound for the optimal burn-in time is presented under the assumption of an eventually increasing failure rate. Furthermore, it is also shown that a nontrivial lower bound for the optimal burn-in time can be derived if the underlying lifetime distribution has a large initial failure rate.
1. Introduction
ACRONYMS AND ABBREVIATIONS
• Cdf: cumulative distribution function
• DIB: bathtub shape (failure rate)
• FR: failure rate (function)
• IDEI FR: initially decreasing and eventually increasing failure rate (function)
• pdf: probability density function
• r.v.: random variable
• s-: statistical(ly)
• Sf: survivor function

NOTATION
• X: lifetime of a component, X ≥ 0; a r.v.
• b: burn-in time
• X_b: lifetime of a component having survived burn-in time b, X_b ≥ 0; a r.v.
• f(t), F(t), F̄(t): pdf, Cdf, Sf of X
• r(t): FR of X
• t_*, t_**: the first and second infancy points
• t*, t**: the first and second wear-out points
• t_1, t_2: the first and second change points, respectively, when the FR is DIB
• Λ(t) = ∫_0^t r(u) du: cumulative FR
• μ(b): mean residual life function of a component with burn-in time b
• τ: given mission time
Burn-in is a method used to eliminate initial failures in field use. To burn-in a component or system means to subject it to a period of use prior to the time when it is to be actually used. Due to the high FR in the early stages of component life, the burn-in procedure has been widely accepted as a method of screening out failures before systems are actually used in field operation. An introduction to this important area of reliability can be found in Ref. 6 and Ref. 7. It is widely believed that many products, particularly electronic products or devices such as silicon integrated circuits, exhibit DIB FRs. Hence much research on burn-in has been done under the assumption of a DIB FR. See, for example, Refs. 3-5, 9-12 and 14-16. Recently, there has been much research on the shape of the FRs of mixture distributions. For instance, in Refs. 1, 2 and 8, shapes of the FRs of mixture distributions which are not of the typical DIB form are investigated. Ref. 13 considered optimal burn-in under the assumption of an eventually increasing FR. In this paper, we consider optimal burn-in under an initially decreasing and/or eventually increasing FR, which includes the DIB FR as a special case. We derive a sharper upper bound for optimal burn-in than that obtained in Ref. 13 under the assumption of an eventually increasing FR, and a lower bound assuming that the FR is initially decreasing.
Definition 1.1. A FR r(x) is eventually increasing if there exists 0 ≤ x_0 < ∞ such that r(x) strictly increases in x > x_0. For an eventually increasing FR r(x), the first and second wear-out points t* and t** are defined by

t* = inf{t ≥ 0 : r(x) is nondecreasing in x ≥ t},
t** = inf{t ≥ 0 : r(x) strictly increases in x ≥ t}.

Obviously 0 ≤ t* ≤ t** ≤ x_0 < ∞ if r(x) is eventually increasing. In particular, if r(x) has a bathtub shape with change points t_1 ≤ t_2 < ∞, then t_1 = t* and t_2 = t**.
Definition 1.2. A FR r(x) is initially decreasing if there exists 0 < x_0 ≤ ∞ such that r(x) strictly decreases in x ∈ [0, x_0]. For an initially decreasing FR r(x), the first and second infancy points t_* and t_** are defined by

t_* = sup{t ≥ 0 : r(x) strictly decreases in x ≤ t},
t_** = sup{t ≥ 0 : r(x) is nonincreasing in x ≤ t}.
Obviously 0 < x_0 ≤ t_* ≤ t_** ≤ ∞ if r(x) is initially decreasing. If r(x) is both initially decreasing and eventually increasing (IDEI) with t* ≤ t_**, then 0 < t_* ≤ t* ≤ t_** ≤ t** < ∞. In particular, if r(x) has a bathtub shape with change points 0 < t_1 ≤ t_2 < ∞, then t_1 = t_* = t* and t_2 = t_** = t**. Note that there also exist IDEI FRs with t_** ≤ t*.

2. Mean Residual Life
In this section we consider the optimal burn-in time maximizing the mean residual life function. Observe that the mean residual life function of a component with burn-in time b is given by

μ(b) = ∫_0^∞ exp{−∫_b^{b+t} r(u) du} dt = ∫_0^∞ exp{−[Λ(b + t) − Λ(b)]} dt = exp{Λ(b)} ∫_b^∞ exp{−Λ(t)} dt,

where Λ(t) ≡ ∫_0^t r(u) du is the cumulative FR.
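This identity lends itself to direct numerical evaluation; the sketch below is ours (the constant FR is only a sanity check, not an example from the paper), truncating the upper limit at a finite horizon:

```python
from math import exp

def mean_residual_life(r, b, t_max=60.0, n=6000):
    """mu(b) = exp(Lambda(b)) * integral_b^infinity exp(-Lambda(t)) dt,
    truncated at t_max and computed with the trapezoidal rule."""
    h = (t_max - b) / n
    lam = 0.0                  # running Lambda(t) - Lambda(b)
    total = 0.5                # trapezoid weight of exp(0) at t = b
    t = b
    for _ in range(n):
        lam += 0.5 * h * (r(t) + r(t + h))
        t += h
        total += exp(-lam)
    return h * (total - 0.5 * exp(-lam))

# sanity check: for a constant FR c, mu(b) = 1/c for every b
print(mean_residual_life(lambda t: 0.5, b=1.0))   # ~2.0
```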
Theorem 2.1. Suppose that the FR r(t) is eventually increasing. Let

B_1 ≡ {b : g_1(x) ≡ ∫_x^∞ (r(x) − r(t)) exp{−Λ(t)} dt < 0 for all x > b}.

Then the set B_1 is not empty and b_1 = inf B_1 is an upper bound for the optimal burn-in time b*, that is, b* ≤ b_1 < ∞, where b* satisfies μ(b*) = max_{b≥0} μ(b).

Proof. Since the FR is eventually increasing, it is true that, for each x > t*,

r(x) − r(t) ≤ 0 for all t > x,

and, for each x > t**, there exists t′ ∈ [x, ∞) such that

r(x) − r(t) < 0 for all t > t′.

These imply that ∫_x^∞ (r(x) − r(t)) exp{−Λ(t)} dt < 0 for all x > t*, and thus t* ∈ B_1. Hence the set B_1 is not empty. Observe that, for b > b_1,

μ′(b) = r(b) exp{Λ(b)} ∫_b^∞ exp{−Λ(t)} dt − 1 < exp{Λ(b)} ∫_b^∞ r(t) exp{−Λ(t)} dt − 1 = 0.
This means that p(b) is strictly decreasing in b > bl. Therefore we conclude that b' 5 b l . QED Note that if r(0) > l/(J,"exp{-h(t)}dt) = 1/E[X], then p ' ( 0 ) > 0. Hence a sufficient condition for a positive burn-in(i.e., b* > 0) is r(0) > l / E [ X ] . Corollary 2.1. Suppose that the FR r(t) is eventually increasing. Then the optimal burn-in time b* 5 t*. Proof. It is true that t* E B1. Hence b' 5 b l 5 t* holds. QED The above result of Corollary 2.1 has been also given in Theorem 1 of Ref. 13. The following result gives a lower bound for optimal burn-in when the FR is initially decreasing. Theorem 2.2. Suppose that (i) the FR r(t) is both initially decreasing and eventually increasing(IDEI); (ii) r* = supt2t**r(t) < r(0); (iii) r ( t ) is continuous on (O,t,,]. Let
Bz
= {b : g1(z) =
LW
(r(x) - r ( t ) ) exp{-A(t)}dt
> 0, f o r all
z
< b}.
Then the set B2 is not empty and optimal burn-in time satisfies b2 5 b* 5 b l , where bz SUP B2. Proof. Define set A = {t : r(t) = r*,O 5 t 5 t**}. Note that r(0) > r* 2 r**= r(t**) and r(t) is continuous on (O,t,,], so the set A is not empty and we can further define to = sup{t : r ( t ) = r*,0 5 t 5 t,,}. Then, for each z < to,
> 2, [to, m) such that, for each z < to, r ( z ) - r(t) 2 0 , V t
and there exists t" E
r ( x ) - r ( t ) > 0 , V t > t". These imply that L m ( r ( x ) - r(t))exp{-h(t)}dt
> 0,V
z
< to,
and thus t o BZ. Hence the set Bz is not empty. Observe that 'd b < b2, p'(b) = r(b) exp{h(b)}
> exp{h(b)} = 0.
J'
W
exp{-R(t)}dt
-
1
r(t) exp{-h(t)}dt
-
1
b
b
This means that p(b) is strictly increasing in b < bz, and therefore we have b* 2
b2.
45
Corollary 2.2. Suppose that the same conditions in Theorem 2.2 are true. T h e n optimal burn-in time satisfies t o 5 b* 5 t*, where t o = sup{t : r ( t ) = r*,O 5 t 5
t**}. Proof. It follows from Corollary 2.1 that b" 5 t'. From the proof of Theorem 2.2 it holds that t o E Bz and thus t o 5 t z 5 b*. The desired result thus follows. QED 3.
The Probability of Performing Given Mission
In field operation a component is usually used t o accomplish a task or mission. Let T be a given mission time. Then the probability that a component, which has survived burn-in time b, accomplishes the mission is given by :
+ +
p ( b ) = P(X6 > T ) = P ( X > b T ~ X > b) = exp{-[h(b T ) - A(b)]}. Theorem 3.1. Suppose that the F R r ( t ) is eventually increasing. Let B3
= { b : g z ( x ) = r ( z )- r ( z + T ) 5 0 , f o r all x > b a n d , f o r s o m e b' s u c h that b < b'
I m,
r ( x )- r ( x
+ T ) < 0 for all b < x < b'}.
T h e n the set B3 is not empty and b3 = inf B3 i s a n upper bound for optimal burn-in t i m e b', that is, b* 5 b3 < 00, where b* satisfies p ( b * ) = maxb2o p ( b ) . Proof. Let the set
t = max{t*,t** - 7 ) . Then it can be shown that t E B3, which implies that B3
is not empty. Observe that
p'(b) = ( r ( b ) - r ( b
+
7 ) )e x p { - [ h ( b
+
T)-
h ( b ) ] }I 0 ,
for all b > b3 and there exists bk > b3 such that the above inequality strictly holds for b3 < b < b i . These imply that p ( b 3 ) > p ( b ) , for all b > by. Therefore we can conclude that b* 5 b3. QED
If r ( 0 ) > b* > 0.
r(T)
then p'(0)
> 0. Hence r ( 0 ) > r ( T ) is a sufficient condition for
Corollary 3.1. Suppose that the F R r ( t ) is eventually increasing. T h e n optimal burn-in t i m e satisfies b* 5 t, where t = max{t*, t**- T } .
Proof. It is true that t E B3. Hence b* I b3 5 t holds. QED The above result of Corollary 3.1 has been also given in Theorem 2 of Ref. 13.
46
Theorem 3.2. Suppose that (a) the FR r ( t ) is both initially decreasing and eventually increasing(IDEI); (ii) r* = supt2t**r ( t ) < r ( 0 ) ; (iii) r ( t ) is continuous on (O,t,,]. Let B4
= {b : g 2 ( 2 ) = r ( z )- r ( z + r ) 2 0 ,
for all z < b and, for some b" such that 0 5 b" < b, r ( z )- r ( z r ) > 0 f o r all b"
+
< z < b}
Then the set B4 is not empty and optimal burn-in time satisfies b4 5 b" 5 b3, where b4 = supB4. In particular, if t* < t,, and t,, - t* > r , then optimal burn-in tame b* can be any value in [t*,t,, - r ] . Pro0f. By the same arguments stated in the proof of Theorem 2.2, we can define t o = sup{t : r ( t ) = r*,O 5 t 5 t,,}. Then t" = min{to,t,) E B4, hence the set B4 is not empty. Observe that p'(b) = ( r ( b )- r ( b
+
7)) exp{-[A(b
+r )
-
A(b)]j 2 0 ,
for all b < b4 and there exists bk < b4 such that the above inequality strictly holds for bk < b < b4. This means that p(b4) > p(b) for all b < b4. Therefore we can conclude that b* 2 b4. Now suppose that t* < t,, and t,, - t* > r . Then it is true that p'(b) > 0 for all 0 5 b < t*, p'(b) = Ofor all t* 5 b 5 t,, - r , and p'(b) < 0 for all b > t,, - r. These imply the desired result. QED Corollary 3.2. Suppose that the same conditions in Theorem 3.2 are true. Then optimal burn-in time satisfies t" 5 b* 5 t, where t" E min{to,t,} and t o E sup{t :
r ( t ) = r*,O 5 t 5 t,,}. Proof. From the proof of Theorem 3.2, it is true that t" E
B4.
Hence t" 5 b* 5 t holds.
QED 4. Illustrative Example
In this section a numerical example is considered for illustration. Example 4.1. Suppose that the FR of the component is given by :
i
-(t (1/4)(t
r ( t )=
-
+ 2, if o 5 t 5 2, 3)2 + 3/4, if 2 5 t 5 4, +
-(1/4)(t - 5)2 5/4, if 4 5 t 5 6, (1/4)(t - 7)2 3/4, if 6 5 t 5 9,
+
4 - (9/4)exp{-(t
-
g)}, if 9 5 t.
The graph of the FR is presented in Figure 1. The FR is eventually increasing with wear-out points t* = t** = 7.0. We consider the optimal burn-in time b* which maximizes μ(b).
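Applying the same numerical recipe as the earlier mean-residual-life sketch to this r(t), the optimal burn-in time can be located by a direct scan; the grid sizes are our choices, and the expected outcome is the value reported below:

```python
from math import exp

def r(t):
    """Piecewise failure rate of Example 4.1."""
    if t <= 2:  return -(t - 1)**2 + 2
    if t <= 4:  return 0.25 * (t - 3)**2 + 0.75
    if t <= 6:  return -0.25 * (t - 5)**2 + 1.25
    if t <= 9:  return 0.25 * (t - 7)**2 + 0.75
    return 4 - 2.25 * exp(-(t - 9))

T_MAX, N = 40.0, 8000              # truncation and grid for the integrals
H = T_MAX / N
LAM = [0.0]                        # Lambda(t) on the grid (trapezoidal)
for i in range(1, N + 1):
    LAM.append(LAM[-1] + 0.5 * H * (r((i - 1) * H) + r(i * H)))

def mu(b):
    """mu(b) = exp(Lambda(b)) * int_b^inf exp(-Lambda(t)) dt."""
    ib = round(b / H)
    vals = [exp(LAM[ib] - LAM[i]) for i in range(ib, N + 1)]
    return H * (0.5 * vals[0] + sum(vals[1:-1]) + 0.5 * vals[-1])

# scan candidate burn-in times on [0, t*] and pick the maximizer
candidates = [i * 0.01 for i in range(701)]
b_star = max(candidates, key=mu)
print(b_star, mu(0.0), mu(b_star))   # b* near 2.32 per the paper
```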
Figure 1. Failure rate function.
Figure 2. Graph of g_1(x).
To find the upper bound obtained in Theorem 2.1, the graph of g_1(x) for 0 ≤ x ≤ 7.0 is obtained and is presented in Figure 2. From the graph of g_1(x), the upper bound b_1 is given by 2.32, which is much sharper than t* = 7.0. Therefore, to find the optimal burn-in time, it is sufficient to consider only b ∈ [0, 2.32]. Note, however, that g_1(b) > (<) 0 if and only if μ′(b) > (<) 0 and, in our case, the graph of g_1(x) has few fluctuations and is of simple shape. Hence in this case we can see that the optimal burn-in time b* can be 0 or 2.32. The comparison yields μ(0) < μ(2.32), and the optimal burn-in time is b* = 2.32. The graph of μ(b) for 0 ≤ b ≤ 2.40 is given in Figure 3.
Figure 3. Graph of μ(b).
References
1. H. W. Block, Y. Li and T. H. Savits, "Initial and final behaviour of failure rate functions for mixtures and systems", Journal of Applied Probability, vol. 40, pp. 721-740 (2003).
2. H. W. Block, Y. Li and T. H. Savits, "Preservation of properties under mixture", Probability in the Engineering and Informational Sciences, vol. 17, pp. 205-212 (2003).
3. J. H. Cha, "On a better burn-in procedure", Journal of Applied Probability, vol. 37, pp. 1099-1103 (2000).
4. J. H. Cha, "Burn-in procedures for a generalized model", Journal of Applied Probability, vol. 38, pp. 542-553 (2001).
5. C. A. Clarotti and F. Spizzichino, "Bayes burn-in decision procedures", Probability in the Engineering and Informational Sciences, vol. 4, pp. 437-445 (1991).
6. F. Jensen and N. E. Petersen, Burn-in, John Wiley, New York, 1982.
7. W. Kuo and Y. Kuo, "Facing the headaches of early failures: A state-of-the-art review of burn-in decisions", Proc. IEEE, vol. 71, pp. 1257-1266 (1983).
8. G. Klutke, P. C. Kiessler and M. A. Wortman, "A critical look at the bathtub curve", IEEE Transactions on Reliability, vol. 52, pp. 125-129 (2003).
9. J. Mi, Optimal burn-in, Doctoral Thesis, Dept. Statistics, Univ. Pittsburgh (1991).
10. J. Mi, "Burn-in and maintenance policies", Advances in Applied Probability, vol. 26, pp. 207-221 (1994).
11. J. Mi, "Minimizing some cost functions related to both burn-in and field use", Operations Research, vol. 44, pp. 497-500 (1996).
12. J. Mi, "Warranty policies and burn-in", Naval Research Logistics, vol. 44, pp. 199-209 (1997).
13. J. Mi, "Optimal burn-in time and eventually IFR", Journal of the Chinese Institute of Industrial Engineers, vol. 20, pp. 533-542 (2003).
14. D. G. Nguyen and D. N. P. Murthy, "Optimal burn-in time to minimize cost for products sold under warranty", IIE Transactions, vol. 14, pp. 167-174 (1982).
15. K. S. Park, "Effect of burn-in on mean residual life", IEEE Transactions on Reliability, vol. 34, pp. 522-523 (1985).
16. G. H. Weiss and M. Dishon, "Some economic problems related to burn-in programs", IEEE Transactions on Reliability, vol. 20, pp. 190-195 (1971).
PERFORMING THE SOFT-ERROR RATE (SER) ON A TDBI CHAMBER

VENSON CHANG, WEI-TING KARY CHIEN
Reliability Engineering, Semiconductor Manufacturing International Corporation, Shanghai, 18 ZhangJiang Road, PuDong New Area 201203, China
Abstract
As the gate oxide gets thinner and the cell density increases due to continuous scaling and rapid technology advancement, the soft-error rate (SER) attracts many researchers' attention. In the IC industry, engineers perform the accelerated (ASER) and the system soft-error rate (SSER) tests to evaluate SER performance. ASER and SSER tests are performed on memory testers and are time- and cost-consuming, especially the SSER, which is a test with a large sample size and a long test time. In this paper, we successfully use a Test During Burn-In (TDBI) chamber for SER tests with good correlation with the memory tester, contributing a cost-reduction solution. We also successfully verify the relationships of the technology [1] and of Vcc with the Failure-In-Time (FIT) [2] level on real cases. Also, we report some significant findings such as the Burn-In (BI) effect and the test pattern issue for different circuit designs.
1. Introduction
A soft error (SE) is defined as a random error induced by an event corrupting the data stored in a device, where the device itself is not permanently damaged. An SE is caused by particle strikes; this makes the SE a more important issue nowadays because of the increasing applications of integrated circuits (IC) in space, where many more cosmic rays and particle strikes are expected. The soft-error rate (SER) is measured in FIT (Failures In Time, i.e., failures per 10^9 device-hours). The common industrial standard for SER, 1,000 FIT, cannot be maintained in the 0.1 um technology era. This explains why SER has attracted the attention of many researchers in recent years. Nowadays, the SER test becomes more important and is included as a product qualification item. However, it is an expensive test, and the TDBI (Test During Burn-In) chamber provides a cost-effective solution for SER tests.
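For reference, the FIT bookkeeping used throughout is simple (1 FIT = 1 failure per 10^9 device-hours); the sketch below is ours, and the device counts and hours are illustrative, not measurement data:

```python
def fit_rate(failures: int, devices: int, hours: float) -> float:
    """Point estimate of SER in FIT: failures per 1e9 device-hours."""
    device_hours = devices * hours
    return failures / device_hours * 1e9

# e.g., an SSER-style test: 2000 devices for 1000 hours (2e6 device-hours)
print(fit_rate(failures=3, devices=2000, hours=1000.0))  # -> 1500.0 FIT
```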
1.1. The SER Test Standards

Industrial standards on SER tests are defined in MIL-STD-883E Method 1032.1 and JESD-89. There are two types of SER test: system SER (SSER) and accelerated SER (ASER). Both are defined in JESD-89 and evaluate the ability of cells to resist particle strikes and the radioactive particles generated by the molding compound used in assembly. The SSER test requires thousands of samples under normal use conditions for a very long time, so that the overall device-hours reach at least a million. At room temperature, the SSER test samples are placed in a burn-in chamber, which continuously sends read/write test patterns and records the fail bit count (FBC). Unlike the time-consuming
and expensive SSER tests, ASER only requires several samples to be run for less than a week using a radiation source, which substitutes for the molding compound and acts as an acceleration generator. Since no specific equipment is defined in MIL-STD-883E or in JESD-89 for ASER tests, previously a very expensive memory tester (e.g., Mosaid-4205, denoted MS4205 hereafter) was used for ASER tests. This is practically infeasible because the memory tester is a necessary tool for electrical failure analysis (EFA) and such a tester is always heavily loaded. This difficulty is resolved when advanced burn-in chambers can be used, as described in this paper.

1.2. The Test During Burn-In (TDBI) Chambers
Burn-in is an operation to weed out defective parts through stresses (usually voltage and temperature) higher than normal use conditions. The most advanced test-during-burn-in (TDBI) chambers have greatly improved waveforms and can perform speed-insensitive tests during burn-in. In this paper, a TDBI chamber is used for ASER tests. We report the correlation between the TDBI system and MS4205, where MS4205 is treated as the standard to match.
1.3. The Structure of This Paper
This is the first time the following points on ASER tests are clarified.
1. What is the correlation between MS4205 and the TDBI chamber?
2. How can a statistically acceptable correlation between MS4205 and the TDBI chamber be achieved? What are the important factors in correlation?
3. Since there are many sockets on a TDBI burn-in board (BIB), can all sockets be treated as identical?
4. Does Vcc significantly affect the ASER FIT level? Alpha particles induce cell leakage current, and the leakage time differs at different Vcc; this leads to different ASER FITs.
5. Why were all former ASER test cycles so high (e.g., 10,000 read-write cycles)? Can the ASER test be performed in a shorter time? How long is needed for a single ASER test stage?
6. Is there a significant difference in ASER FIT when testing early and later lots? What is the relationship between ASER FIT and technology? What will the ASER FIT be for various test vehicles?
After a series of experiments, we answer all of the above questions with sound evidence, and we identify the most critical factor affecting the MS4205-TDBI correlation to be the fuse resistance of each socket on the BIB. We also obtain many additional findings that facilitate practical SER test operations. Unlike studies focused on neutron SER [3,4], we emphasize semiconductor alpha-particle ASER and SSER and obtain significant findings on SER trends as well as on BI efficiency enhancement. Our findings elucidate how SER is affected by factors such as Vcc, test pattern, source strength, flux time, and the fuse resistance on the BIB. One important conclusion concerns BIB fuse resistance, which directly relates to SER data credibility and BI efficiency; such observations have never been addressed before.
After introducing SE fundamentals in Section 1, experiments and test setups are described in Section 2. Section 3 contains the main results. Discussions and conclusions are summarized in Section 4. Possible future extensions are briefly described in the last section.
2. Experiment and Test Setups
The "given" resources, which are uncontrollable, are:
1. Test standards: all settings comply with what is defined in MIL-STD-883 and JESD-89. Tester requirements are in JESD-89.
2. MS4205 208MHz memory tester: it is used as the standard tester (i.e., the golden unit to match). MS4205 is a powerful tool for EFA and can test SRAM, DRAM, DDR, and FLASH by changing the probe card (for wafer-level tests) or the DUT (device under test) board (for package-level tests).
3. The TDBI chamber (denoted TDBI): it is originally used for burn-in, EFR (early failure rate), and HTOL & LTOL (high & low temperature operating life) tests. To increase flexibility and to reduce the loading of MS4205, we plan to use it for ASER tests; this is the theme of this paper.
4. Test chip: a 4Mb (256K x 16) high-speed (sub-10ns) SRAM in 0.18um technology. The package type is TSOP-II 44. Following JESD-89 guidelines, at least 6 decapped samples are needed for all ASER evaluations. We use 6 samples, which come from a very early lot and show no significant decay (in terms of FBC) after being verified on MS4205 for 4 months. SRAMs made in other technologies are also tested for the SER trend analysis.
5. Radiation source: JESD-89 suggests Am-241 for ASER tests because of its long half-life (432.7±0.5 years), whereas MIL-STD-883E Method 1032.1 requires the radiation source to have a strength between 0.01 and 5 uCi. We use Am-241 in these studies with two strengths: 0.189uCi and 2.381uCi.
6. The test pattern: based on JESD-89 and MIL-STD-883E, we use "March-Row" and "March-Column" with background data logical all "0", logical all "1", and "Checkerboard". From the data sheet of the chosen 4Mb SRAM, we set Vih/Vil = 3.0V/0.2V, Voh = 1.6V, Vol = 1.4V, Vcc = 3.3V, no address scramble, and with data scrambler. The test cycle time is 1.0 usec. The strobe width is 30ns, which is found to be independent of the results. The test programs for MS4205 and TDBI are identical.
3. Experiment Results
We take one unit and perform ASER on both MS4205 and TDBI. The FIT level on MS4205 and on sockets A1/G6/M10 of the TDBI is 68 and 92/103/3278, respectively. The results show surprisingly large deviations. The FIT levels on the 3 TDBI sockets indicate the sockets cannot be treated as identical, and there must exist some key issue contributing to such a large discrepancy. From the BIB design, we suspect the fuse resistance, because the TDBI we use has just been calibrated and a board checker has verified the BIB under test with all parameters in control. After measurements, indeed, the fuse resistances of these three sockets are 0.7 Ω (A1), 4.3 Ω (G6), and 11.2 Ω (M10). Repeating the ASER tests on selected sockets with different fuse resistances, we confirm
the surmise that the fuse resistance plays a critical role in ASER tests, as in Figure 1, which clearly shows that the ASER FIT dramatically increases once the fuse resistance exceeds 5.5 Ω. High fuse resistances lead to false alarms.
Figure 1. The ASER-FIT at different fuse resistances
To completely remove test noise, we replace all fuses with connecting wires (whose resistances are negligible) on 5 sockets (i.e., A1, A10, G6, M1, and M10) for a new ASER test using six samples. The t-test in Figure 2 shows the FIT levels are significantly different at 95% confidence because the samples at socket M1 have a significantly low FIT. That is, the sockets are still not identical, although the conformance is much better than the previous result in Figure 1.
Figure 2. The ASER-FIT at 5 DUTs (fuses are mounted at untested DUTs)
Before a new ASER test, we remove all the fuses at the untested DUTs and short the tested DUTs. That is, all DUTs are open except the following 5 sockets: A1, A10, G6, M1, and M10. Using 6 samples, the results of the ASER tests in Figure 3 show no significant difference in ASER-FIT among these five sockets. Adopting this arrangement of shorting the tested DUTs and opening the untested DUTs on the BIB, we perform ASER tests on both MS4205 and TDBI. The p-value of the t-test in Figure 4(a) is 0.9991, which indicates these two systems are correlated at the 95% confidence level. Good correlation is also achieved by following the same procedures using a 0.21um LPSRAM, as in Figure 4(b). Hence, we can use the much more cost-effective TDBI chamber to replace MS4205 for ASER and SSER tests. We prepare ASER samples from a more recent lot, whose yield is around 90%. ASER tests do not show a significant difference in ASER-FIT between these new samples and the six
samples (which are from a much earlier lot). In other words, it is acceptable that the ASER test is conducted only in the early development phase, and we do not need to include it in the routine.
Figure 3. Good conformance among several BIB sockets
Figure 4. The ASER FIT comparison between MS4205 and TDBI: (a) 0.18um SRAM, (b) 0.21um LPSRAM
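The tester-to-tester comparison above reduces to a two-sample Student's t-test on per-sample FIT levels. A minimal sketch of such a comparison, assuming scipy is available (the FIT readings below are hypothetical, not the study's data):

```python
# Two-sample Student's t-test of per-sample ASER FIT levels, as used to check
# the MS4205-vs-TDBI correlation. The FIT readings below are hypothetical.
from scipy import stats

fit_ms4205 = [95.0, 97.5, 96.2, 94.8, 96.9, 95.6]  # six samples on MS4205
fit_tdbi   = [95.3, 96.8, 96.0, 95.1, 96.5, 95.9]  # same samples on TDBI

t_stat, p_value = stats.ttest_ind(fit_ms4205, fit_tdbi)
# A p-value near 1 means no detectable difference between the two testers,
# i.e. the TDBI chamber can stand in for the memory tester.
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```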
Using the TDBI chamber, we further test the SRAMs made in 0.13um and 0.15um technologies; the ASER FIT per megabit is shown in Figure 5, which indicates the ASER FIT increases as technology advances. This result complies with the trend specified in Refs. [1, 5, 6, 7]. Moreover, from regression models using gate-oxide thickness and operating voltage, we estimate the ASER for 90nm to be around 65 FIT/Mb. Some other practical findings are summarized in Section 4.
Figure 5. The trend of ASER FIT and technologies (x-axis: 0.21um, 0.18um, 0.15um, 0.13um, 0.09um; y-axis: FIT/Mb)
4. Discussions and Conclusions
4.1 The Impact of Vcc
As specified in JESD-89, we perform ASER tests at 85%, 90%, 95%, 100%, 105%, 110%, and 115% Vcc and find a weak relation between Vcc and ASER FIT. As expected, the lower the Vcc, the higher the ASER FIT becomes (Figure 6, with 95% confidence bands). The higher the Vcc level, the longer the leakage time and the lower the failure probability when we strobe the data in the read cycle. A similar result is reported in Ref. [2].
Figure 6. The lower the Vcc, the higher the ASER FIT (x-axis: Vcc (Volt); y-axis: ASER FIT)
4.2 The Largest Allowable Fuse Resistance (Rmax) for BI
The fuse resistance plays an important role not only in ASER tests, as presented in Section 3, but also in BI. The simple calculation below helps determine the maximal allowable fuse resistance (Rmax). From the simplified circuit of a DUT in Figure 7, we set Icc = 70mA from the product data sheet and Vddmin = 3.0V (with Vcc = 3.3V) from many EFA experiences. Hence, Rmax can be obtained as (3.3 - 3.0)/0.07 = 4.3 Ω. This rough estimation helps choose suitable BIBs and DUTs for ASER tests; from Figure 7, 4.3 Ω should not lead to severe false alarms. Due to different designs and applications, Rmax varies for other products. For example, Rmax is 6 Ω for a 0.20um 64Mb SDRAM and can be as large as 11 Ω for a 0.21um LPSRAM. However, from the BI standpoint, we should replace each fuse with a wire so that the voltage drop on the fuse does not reduce the stress voltage on the sample. In our cases, considering the voltage acceleration factor, the BI effect is reduced by at least 20% for large fuse resistances.
Figure 7. Derive the largest allowable fuse resistance from the data sheet
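The Rmax rule in Section 4.2 is a direct Ohm's-law bound, so it is easy to tabulate for any product. A minimal sketch using the data-sheet values quoted in the text:

```python
# Largest allowable BIB fuse resistance (Section 4.2):
#   Rmax = (Vcc - Vddmin) / Icc, with all values from the product data sheet.
def r_max(vcc_volts: float, vdd_min_volts: float, icc_amps: float) -> float:
    return (vcc_volts - vdd_min_volts) / icc_amps

# The 4Mb SRAM case from the text: Vcc = 3.3 V, Vddmin = 3.0 V, Icc = 70 mA.
print(f"Rmax = {r_max(3.3, 3.0, 0.070):.1f} ohm")  # -> ~4.3 ohm
```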
4.3 The Background Data at Read/Write
As defined in JESD-89, the ASER test background data should include at least all "1", "checkerboard", and all "0". For our 0.18um 4Mb SRAM, the ASER FIT of the checkerboard data lies between those of all "1" and all "0" (actually, the all "0" data has the highest ASER FIT). This phenomenon is not general and should depend on technology, layout, and circuit design. For the 0.21um LPSRAM, the lowest ASER FIT comes from the checkerboard background data. Therefore, we suggest using these three background data in SER tests for a more complete and conservative estimation.
4.4 The Required Time for Each ASER Test
Previously, the test duration for each data point was at least 2 hours. Inserting intermediate reading points and considering the steadiness of the data (Figure 8), we set the shortest time to reach a steady ASER FIT level to be 30 min. for the 2.381uCi source, with a correspondingly longer (but still well under 2-hour) time for the 0.189uCi source. This is much shorter than the previous test duration (2 hours).
Figure 8. The time to reach a steady ASER FIT for both sources: left = 2.381uCi, right = 0.189uCi (x-axis: time, 15-120 min)
4.5 The Effects of the Radiation Sources
From Table I, the ASER FITs from both the 2.381uCi and the 0.189uCi radiation sources are comparable across the 6 samples. For each sample, the ASER FITs from the two radiation sources agree to within 10%. This is expected and once again verifies that the TDBI is capable of providing stable ASER FIT estimations as long as the pre-check on fuse resistance is correctly performed.
Table I. The relationship between radiation sources of different emission rates and the FIT level
4.6 The Low-Cost SSER Test Using TDBI
As derived earlier, we control all fuse resistances to be less than 4.3 Ω for the SSER test defined in JESD-89. Most of the SSER test details are the same as those for ASER tests
except that the SSER tests need to specify the confidence interval. As long as we keep all fuse resistances below the threshold derived in Section 4.2, we can use the TDBI for SSER tests. The SSER test requires a large sample size and a long test time to reach millions of device-hours and thereby obtain more accurate data. We use 2,000 samples in the SSER test; the test time is now over 3,000 hours with no failure, which gives a failure rate of 153 FIT at the 60% confidence level. Without the TDBI chamber, cost-effective SSER tests would not be possible; this is one of the major contributions of our studies.
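The 153-FIT figure is the standard chi-square upper bound on a constant failure rate when zero failures are observed over the accumulated device-hours. A minimal sketch reproducing it (scipy assumed):

```python
# Upper-bound failure rate (in FIT) from a zero/low-failure life test, using
# the standard chi-square bound: lambda_UB = chi2(CL, 2r + 2) / (2 * T).
from scipy.stats import chi2

def fit_upper_bound(failures: int, device_hours: float, conf: float) -> float:
    return chi2.ppf(conf, 2 * failures + 2) / (2.0 * device_hours) * 1e9

# SSER case in the text: 2,000 samples x 3,000 hours, no failures, 60% CL.
print(f"{fit_upper_bound(0, 2000 * 3000, 0.60):.0f} FIT")  # -> 153 FIT
```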
5. Future Work
Aside from the many practically valuable findings, we list three promising extensions to further maximize the contributions of this study.
1. Resolving the BIB and DUT concerns for ASER tests paves the way for correlating the TDBI with a mass-production memory tester such as the Advantest T5581, to make the best use of the TDBI chambers. This facilitates EFR curve formulation, burn-in time reduction, and estimation of Ea (the activation energy) and beta (the voltage acceleration multiplier).
2. Aside from ASER tests, HTOL and EFR tests require a large sample size, and it is not affordable to manually replace the fuses with new ones. We thus need to clearly understand the degradation of the fuses to ensure good TDBI-T5581 correlation.
3. Apply the proposed scenario to test the qualification vehicles of more advanced technologies (e.g., 90nm and below) to formulate the ASER FIT trend, to propose a more suitable ASER criterion, and to enhance ASER strength if necessary.
References
1. Robert Baumann, "Soft Error Rate Overview and Technology Trends," IRPS, Dallas, Texas, pp. 4-10 (2002).
2. P. E. Dodd, M. R. Shaneyfelt, J. R. Schwank and G. L. Hash, "Neutron-Induced Soft Errors, Latchup, and Comparison of SER Test Methods for SRAM Technologies," IEEE (2002).
3. Paul E. Dodd and Fred W. Sexton, "Mitigation of Single-Event Effects in Mission-Critical Systems," IRPS, Dallas, Texas (2002).
4. Allan Johnston, Jet Propulsion Laboratory, California Institute of Technology, Pasadena, California, "Mitigation Methods for Soft Error and Related Radiation Effects in Spacecraft," JPL (2002).
5. Neil Cohen, T. S. Sriram, Norm Leland, David Moyer, Steve Butler and Robert Flatley, "Soft Error Considerations for Deep-Submicron CMOS Circuit Applications," IEEE (1999).
6. Premkishore Shivakumar, Stephen W. Keckler, Doug Burger, Michael Kistler and Lorenzo Alvisi, "Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic," in Proceedings of the 2002 International Conference on Dependable Systems and Networks.
7. Robert Baumann, "The Impact of Technology Scaling on Soft Error Rate Performance and Limits to the Efficacy of Error Correction," IEEE (2002).
ENHANCEMENT OF RELIABILITY AND ECONOMY OF A THERMAL POWER GENERATING SYSTEM THROUGH PREDICTION OF PLANT EFFICIENCY PARAMETERS
ADITYA CHATTERJEE
Department of Statistics, The University of Burdwan, Burdwan 713 104, West Bengal, India
SUDIPTA CHATTERJEE
CESC Limited, Taratala, Kolkata 700 034, West Bengal, India
INDRANIL MUKHOPADHYAY
Department of Human Genetics, University of Pittsburgh, Pittsburgh, PA 15261, USA
The plant 'Heat Rate' (HR) is a measure of the overall efficiency of a thermal power generating system. It depends on a large number of factors, some of which are non-measurable, while data relating to others are seldom available and recorded. However, coal quality (expressed in terms of 'effective heat value' (EHV) in kcal/kg) transpires to be one of the important factors influencing HR values, and data on EHV are available in any thermal power generating system. In the present work a prediction interval for the HR values is proposed on the basis of EHV alone, keeping in mind that coal quality is one of the important (but not the only) factors that have a pronounced effect on the combustion process and hence on HR. The underlying theory borrows the idea of providing simultaneous confidence intervals for the coefficients of an AR(p) process. The theory is substantiated with the help of real-life data from a power utility (after suitable base and scale transformation of the data to maintain the confidentiality of the classified document). Scopes for formulating strategies to enhance the economy of a thermal power generating system are also explored.
1. Introduction
Thermal power plant economics is a major issue not only from the service provider's point of view but also from the consumer's, because it has a direct bearing on the cost of electricity. The total cost of a power generating station may be classified into two broad categories (Billinton [3], Sullivan [9]). The first is the fixed cost, which includes interest on capital, taxes, insurance, depreciation, etc., whereas the second is the operating cost, which includes operation and maintenance (O&M) costs comprising fuel costs, repair and maintenance costs, employee overheads, and administrative expenses, among others. The operating cost can be lowered by running the station at a high load factor and by increasing the efficiency of the plant (Endrenyi [5]). It is therefore essential to measure and monitor the plant efficiency. The plant Heat Rate (HR) is a measure of the overall efficiency of the thermal power plant and is expressed in kcal/kwh. In other words, the Heat Rate (HR) is defined as the energy, in kilocalories, required to generate one unit of electricity (kwh). It may be noted in this regard that plant efficiency has an inverse
relationship with the HR value, in the sense that the lower the HR value, the higher the plant efficiency. In a thermal power station, coal is by far the most important raw material. Proximate analysis (Carlson [4], Jain and Balasubrahmanyam [6]) of coal is used to determine the components of coal, viz. total moisture (involving surface moisture and inherent moisture), volatile matter (hydrocarbons and other gases that are driven off on heating), ash, fixed carbon, etc. The gross calorific value (GCV) or higher heating value (HHV) of coal is defined as the amount of energy released (in kcal) when one kilogram of the coal undergoes complete combustion under normal atmospheric conditions. The effective heat value (EHV) of the coal, which is actually the moisture-compensated GCV, can be calculated empirically or otherwise from the values of GCV. In the present problem, we try to model the effect of coal quality on the plant heat rate. It must be clearly understood that the coal quality (expressed in terms of the GCV or EHV) is only one of several factors that affect the plant heat rate. But coal quality is certainly an important factor with a pronounced effect on the combustion process and hence on the boiler efficiency. Thus it transpires that predicting HR for given EHV values is an important problem with paramount impact on planning and managing the power plant in terms of plant efficiency. This is particularly relevant since individual coal sources (mines, collieries) declare the EHVs of their coal, and a great amount of saving may be accomplished by correctly predicting the HR value for a given EHV. However, the problem as it stands cannot be modeled through a standard linear model, more specifically a regression equation, as the HR values depend on several other plant parameters apart from the EHV. This difficulty may be overcome by considering the series of HRs and EHVs for a given plant as a time series (the data are usually collected for each day) and then extending the idea of providing confidence intervals for the autoregressive parameters, as has been done by Basu and Mukhopadhyay [1], [2]. The paper is structured in six sections. The second section details the description of the problem encountered in reality, which specifies the objective of the study. The third section refers to the data structure and the necessity of introducing autoregressive parameters in a seemingly simple regression problem. Section four outlines the background theory as well as the subsequent modifications necessary to give a meaningful solution to the power plant problem. In section five, we discuss the necessary computational details and the relevant analysis of the data. The paper ends with some concluding remarks as well as other challenges in power plant management in a broader perspective in the light of the present problem.
2. Why This Problem
In thermal power generation, it is well known that plant efficiency in terms of heat rate depends upon generator efficiency, turbine efficiency and boiler efficiency (Skrotzki and Vopat [8]). Generator efficiency is usually very high (98% or above); hence the turbine and the boiler are the two major pieces of equipment with a marked effect on HR. There are numerous factors that affect the boiler and turbine efficiencies. Some of the major causes are loss of heat, increase in throttle loss of the turbine,
any leakage in the power cycle, deviation from the design input/output parameters, and improper combustion, to quote a few. A root-cause analysis (cf. op. cit.) of the above factors leads to reasons like system/grid disturbance, variation of load as per grid requirement, constraints like ash evacuation problems, constraints imposed by regulatory bodies, water quality, coal quality, ageing of the boiler, turbine and auxiliary equipment, inefficacy of heaters, leakages through seals and inspection doors, and so on. It is clear that most of these factors are not under the absolute control of the service provider. At best they can be mitigated, but they cannot be altogether eliminated. Since our subject of interest is the relationship between coal quality and HR, we take a more detailed look at how coal quality affects HR. We have tried to determine, under normal operating conditions of the plant (including its load variations due to system requirements or very common situations like coal mill outages, which force the plant to curtail load (Billinton [3])), whether the EHV of coal and the plant Heat Rate (HR) have any relationship or not. Symbolically, the problem can be stated as:
$Y = f(\mathrm{EHV}, X_2, X_3, \ldots, X_n) + \epsilon \qquad (1)$
where Y stands for HR, which is observable at a given point of time; EHV (one of the inputs) is known, but all other input parameters are unknown; f(·) is an appropriate functional relationship; and ε is the innovation associated with Y. The rationale behind this is that the HR is easily calculated and recorded, and so is the EHV, but the other parameters, which are quite large in number, are very difficult to measure and sometimes even to identify. As such, as noted earlier, this is a very typical multiple regression problem where all but one explanatory variable are unknown. Hence, prediction through a standard regression equation of HR on EHV will not give a meaningful solution to the problem, as it ignores the information contained in ε through the other variables, either unknown or difficult to measure. So there is a need for an appropriate relationship, because it may help to achieve the following:
1. For a given EHV, one can get some idea of the value of the predicted HR.
2. For the HR obtained thereby, the coal figure can be worked out (the coal figure is the amount of coal required to generate one unit of electricity, expressed in kg/kwh).
3. For a given coal figure, the total coal requirement (in metric tonnes (mT)) for a yearly or half-yearly target generation (in million units) may be predicted. Thus, the total coal requirement in mT for that target generation may be predicted (for a given EHV) for that plant.
4. This would not only facilitate budgeting the coal cost over the planning horizon but would also help the utility select the coal source for which the total coal cost is the cheapest (coal sources declare their coal GCV).
5. Since the coal cost is the most important cost element, improvement in plant operating cost, as well as its reflection on the consumer, is expected.
Prediction by means of an appropriate regression model is very well known in regression analysis, and many software packages are available to do this. However, in the present context the estimated residuals will contain information regarding the latent input parameters affecting the HR values. Moreover, the heat rate of a plant for a given day is strongly dependent on the HR values of previous days. As a result, the HR values, and hence the estimated residuals, constitute time series data that must be given adequate attention. It is known that for the same EHV, the HR may vary from one power generating system to another, or from one span of time to another for the same system. Thus it is meaningful to predict HR on the basis of the current data available in a particular power generating system for a given time frame. Moreover, providing a single estimate of the HR value for a given EHV is difficult to interpret because of the complex and hitherto unknown effects of some latent variables, some of which are even difficult to identify. The estimated residuals in the standard regression in this case actually contain the effects of these factors, which cannot be treated as mere statistical fluctuations or random variations. As such, we invoke the idea of a simultaneous confidence interval (Basu and Mukhopadhyay [1], [2]) to provide prediction intervals for the HR values for given EHVs in the context of a given time frame. Our subsequent discussion will be concerned with this specific problem regarding the construction of proper prediction intervals for HR values.
3. The Basic Data Structure
The basic data structure comprises two time series, one each for EHV and HR, observed on different dates in a given time frame for a particular power generating system. As the documents are classified in nature, we implement base and scale transformations of both series and treat them as the actual series. The scatter plot of log_e HR against log_e EHV indicates some amount of linear dependence between them and, as revealed from the data, the plot shows that HR increases with increasing EHV. In other words, as coal quality improves the plant efficiency decreases, which may be misleading unless interpreted properly. This is explained by the fact that the particular generating unit had a boiler designed to burn inferior grade coal; when coal of high EHV was used, the boiler failed to utilize the additional heat content of this superior quality coal. In other words, the combustion was not proper and there was poor utilization of the high heat content of the coal, which in turn increased the HR. Given the background of the problem, the estimated residuals will constitute an autoregressive series of proper order. Hence we will try to provide a prediction interval taking into account the effects of all the factors, whether known or unknown. This is achieved in two steps. In step 1, we propose a prediction interval for the part of HR that could have been affected by EHV only, and in step 2, a prediction interval for the estimated residual, which accounts for the effects of the unknown factors as well as the carried-over effects of the previous HR values. Step 1 is a straightforward problem and is available in standard textbooks on regression analysis or quality assurance (Neter et al. [7], Vardeman and Jobe [10]). However, we have tried to be explicit by giving a brief outline of the theory. Step 2 demands special attention and requires some novel techniques, as in Basu and Mukhopadhyay [1], [2], modified accordingly to suit the present problem. All this is described in section 4.
4. Theoretical Developments
It is clear that heat rate (Y) depends on EHV (X). However, there are several other factors, some of which are unknown, that affect Y. So, a natural formulation of the prediction problem of Y on X is as follows:

$y_t = W(x_t) + \epsilon_t, \quad t = 1, 2, \ldots, n \qquad (2)$

The form of the true value $W(x_t)$ for a given value of X may be obtained by exploratory methods, and the form may vary from one situation to another. This $W(x_t)$ can be predicted by the least squares method. However, the estimated residuals $\hat\epsilon_t$ have a major role to play in the prediction problem of Y, as the heat rate is a function of several other unknown factors. It is to be noted that the estimated residuals are not independent; they really form time-series-type data. So, to get some idea of the information contained in the 'error part' of the model, an autoregressive model may be of great help. Denote $\hat\epsilon_t$ by $z_t$ and assume that $z_t$ follows an autoregressive model of appropriate order. The order of the autoregressive (AR) model may be determined empirically using exploratory methods. There are some theoretical justifications for determining the appropriate order of an AR model; however, for the sake of simplicity and to tackle the problem in a down-to-earth manner, we use an exploratory method that includes a careful look at the residual sum of squares while fitting the AR model. Naturally the order of the AR model varies from one situation to another. It is clear that a single predicted value of HR given a value of EHV is difficult to obtain, as it depends heavily on the estimated residual, which itself changes through a dependent process. As such, we try to develop a method of providing a prediction interval with a very high prediction coefficient. It is to be noted that, to fit an appropriate model of the dependence of HR on EHV alone, one can take a transformation of Y as well as of X, depending on the situation. The following theorem gives prediction limits for HR given a value of EHV. The proof of the theorem, as well as of the two lemmas required therein, is omitted, as they can be derived from Neter et al. [7] and Basu and Mukhopadhyay [1], [2] respectively, modified accordingly to suit the present situation. However, we use the notations therein in our subsequent discussions.
Theorem 4.1: For a given set of observations $(x_t, y_t)$, $t = 1, 2, \ldots, n$, let $\hat u_t$ be the predicted value of $W(x_t)$ and $\hat z_t$ be the predicted value of $z_t$. Also let $\hat y_t$ be the predicted value according to the model (1) as obtained by considering both $\hat u_t$ and $\hat z_t$. Then there exist prediction limits $\underline{C}(t)$ and $\overline{C}(t)$ ($\underline{C}(t) < \overline{C}(t)$) such that, for any $\gamma$ $(0 < \gamma < 1)$,

$P[\underline{C}(t) \le y_t \le \overline{C}(t)] > 1 - \gamma. \qquad (3)$
5. Findings
The simple linear regression of log_e HR on log_e EHV based on all 80 observations comes out to be

$\log_e \mathrm{HR} = 1.692711 + 0.751341\,\log_e \mathrm{EHV} \qquad (4)$

From this regression equation the residuals $(e_t)$ have been computed for the entire data set. The prediction intervals for log_e HR have been obtained by using the standard result from Neter et al. [7]. The prediction intervals for the residuals, after assuming an AR(2) model (which comes out to be a reasonable one by the exploratory method), have been obtained by results derived from Basu and Mukhopadhyay [1], [2]. The autoregressive parameters involved in this AR(2) model are estimated as 0.398534 and -0.255797 respectively. Noting that $\lambda_{\max}$ of $M_n^{-1}$ (the notion of $M_n$ is given in Basu and Mukhopadhyay (op. cit.)) is 6.378861 and $\gamma = 0.05$, where $\lambda_{\max}(A)$ is the maximum eigenvalue of a matrix A, the prediction intervals for the log_e HR values, along with the actual values, are depicted in Figure 1.
Figure 1. Prediction intervals as well as actual log_e HR values over time
From the graph, as well as from the actual computation, it is apparent that the intervals are not very wide, have been able to capture the erratic movements of the log_e HR values, and at the same time may be accepted with a high degree of confidence. In the present problem we have taken the value of γ to be 0.05, and in the actual computation for the given data set of 80 observations, the proportion of observed values falling within the prediction limits comes out to be 0.96154. This corroborates our claim of providing a prediction interval for log_e HR capturing the effects of the one observed factor and the other latent factors. We have tried other datasets from various power generating systems and the findings are similar.
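To make the two-step procedure concrete, the sketch below runs ordinary least squares of log HR on log EHV and then fits an AR(2) model to the residuals, whose one-step prediction recentres the interval. The series is simulated, and the band shown is a plug-in normal-quantile interval, not the simultaneous Basu-Mukhopadhyay interval used in the paper:

```python
# Two-step sketch of the paper's procedure: (1) OLS of log(HR) on log(EHV),
# (2) an AR(2) model on the residuals to predict the dependent "error part".
import numpy as np

rng = np.random.default_rng(0)
n = 80
log_ehv = rng.normal(8.4, 0.05, n)           # hypothetical log EHV series
z = np.zeros(n)
for t in range(2, n):                        # AR(2) residual process
    z[t] = 0.40 * z[t-1] - 0.26 * z[t-2] + rng.normal(0, 0.01)
log_hr = 1.69 + 0.75 * log_ehv + z

# Step 1: least-squares fit of log HR on log EHV.
X = np.column_stack([np.ones(n), log_ehv])
beta, *_ = np.linalg.lstsq(X, log_hr, rcond=None)
resid = log_hr - X @ beta

# Step 2: fit AR(2) to the residuals by least squares.
Z = np.column_stack([resid[1:-1], resid[:-2]])
phi, *_ = np.linalg.lstsq(Z, resid[2:], rcond=None)
innov = resid[2:] - Z @ phi
sigma = innov.std(ddof=2)

# One-step-ahead prediction interval for the next log HR value
# (plug-in 95% normal band, for illustration only).
z_next = phi[0] * resid[-1] + phi[1] * resid[-2]
y_next = beta[0] + beta[1] * log_ehv[-1] + z_next
lo, hi = y_next - 1.96 * sigma, y_next + 1.96 * sigma
print(f"predicted log HR in [{lo:.3f}, {hi:.3f}]")
```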
The above discussion authenticates the underlying theory developed in section 4. However, our primary objective is extrapolation beyond the actual data set by providing prediction intervals for log_e HR values, either for actual variable EHV values or for some externally supplied constant EHV value. To this end, we have considered a part of the data set (consisting of the first 70 observations) as the actual data (training set) and then extrapolated, in the form of prediction intervals, beyond the time frame. The linear regression of log_e HR on log_e EHV based on the first 70 observations comes out to be

$\log_e \mathrm{HR} = 1.628320 + 0.759103\,\log_e \mathrm{EHV} \qquad (5)$

By following the above-stated procedure, and noting that the values of the autoregressive parameters in the concerned AR(2) model are 0.345774 and -0.276019, and that $\lambda_{\max}$ of $M_n^{-1}$ is 6.979588, the prediction intervals for the log_e HR values have been worked out for the remaining 10 observations. The intervals, along with the actual values, are represented in Figure 2.
Figure 2. Prediction intervals for the last 10 log_e HR values based on actual (variable) EHV
Here again it is evident that the intervals are quite sharp and have been able to capture the erratic movement of the future log_e HR values for the given variable EHVs (for the remaining 10 observations) quite competently. To answer the second problem, we have taken a fixed EHV of 4800 as the externally supplied EHV for all of the remaining 10 observations. The prediction intervals thus found are plotted in Figure 3.
Figure 3. Prediction intervals for the last 10 log_e HR values based on the same EHV (= 4800)
The width of the intervals being constant in the long run is due to the consideration of the AR(2) model for the given data and the constant input value of EHV, which is quite natural. Superficially incorporating random fluctuation is not logical, as for a given EHV a more or less stable prediction interval for HR is expected in the long run. Some seasonality may be accommodated easily in the model in case that is a major concern. For
all practical purposes, in power generating systems seasonality is not considered to be an important issue.
6. Concluding Remarks
As already discussed in section 2, the above procedure can be used in coal cost budgeting and in the selection of coal supply sources, which would enhance the economy as well as the reliability of the plant. As an extended application, an optimal coal mix problem may be explored with the objective of minimizing the coal cost subject to constraints like coal availability and price volatility, coal transportation and logistics bottlenecks, and maximum ash content, which in turn govern the stack emission parameters, among other things. This optimization problem would hinge on a target HR value (or a band of target HR values), depending on the boiler design. An optimal coal mix of two or three grades (in terms of EHV and ash content) may be attempted. Moreover, base load stations (generating units which operate at or near full load) and variable load stations (generating stations with fluctuating output catering to daily peak and lean demand periods) are expected to behave differently. Adequate data need to be collected for both categories of stations to conduct a sensitivity analysis of the above procedure.
References
1. A. K. Basu and I. Mukhopadhyay, Fixed Width Sequential Simultaneous Confidence Intervals for the Parameters in AR(p) Model. In Frontiers in Probability and Statistics, eds. S. P. Mukherjee, S. K. Basu and B. K. Sinha, Narosa, ND, 25-34 (1998).
2. A. K. Basu and I. Mukhopadhyay, On Darling-Robbins Type Confidence Sequences and Sequential Tests with Power One for Parameters of an Autoregressive Model. Statistics and Probability Letters, 45, 205-214 (1999).
3. R. Billinton, Power System Reliability Evaluation, Gordon and Breach, NY (1970).
4. K. E. Carlson, Fossil Fuel. In Power Plant Engineering (Black & Veatch), eds. L. F. Drbal, P. G. Boston, K. L. Westra and R. B. Erickson, CBS Publishers and Distributors, ND, 71-123 (1998).
5. J. Endrenyi, Reliability Modeling in Electric Power Systems, Wiley, NY (1978).
6. R. K. Jain and J. Balasubrahmanyam, Modern Power Plant Engineering, Khanna Publishers, ND (1987).
7. J. Neter, M. H. Kutner, C. J. Nachtsheim and W. Wasserman, Applied Linear Statistical Models, 4th edition, Irwin, Chicago (1996).
8. B. G. A. Skrotzki and W. A. Vopat, Power Plant Engineering and Economy, Tata McGraw-Hill Publishing Co. Ltd., ND (1986).
9. R. L. Sullivan, Power Systems Planning, McGraw-Hill, NY (1976).
10. S. B. Vardeman and J. M. Jobe, Statistical Quality Assurance Methods for Engineers, John Wiley & Sons, Inc., NY (1999).
OPTIMAL BURN-IN TIME FOR GENERAL REPAIRABLE PRODUCTS SOLD UNDER WARRANTIES*
Y. H. CHIEN
Department of Statistics, National Taichung Institute of Technology, 129 Sanmin Road, Taichung, Taiwan
E-mail: yhchien@ms1.tcol.com.tw
S. H. SHEU
Department of Industrial Management, National Taiwan University of Science and Technology, 43 Keelung Rd., Section 4, Taipei 107, Taiwan
E-mail: shsheu@im.ntust.edu.tw
Burn-in is used to improve product quality before sale. Particularly for products with an initially high failure rate sold under warranty, burn-in can be used to reduce the warranty cost. Since burn-in is usually costly and adds directly to the product manufacturing cost, optimizing the length of the procedure is a major problem. This investigation considers a general repairable product sold under warranty and examines the optimal burn-in time for achieving a trade-off between reducing the warranty cost and increasing the manufacturing cost (since burn-in can be considered part of the manufacturing process). The expected total cost per unit sold is derived for various warranty policies (failure-free policies with and without renewing, and a rebate policy). The conditions required for burn-in to be beneficial are derived. Finally, a numerical example is presented.
*Three warranty policies are considered in this study: failure-free renewing, failure-free non-renewing, and rebate.
†Mr. Chien is the corresponding author; he is an Assistant Professor in the Department of Statistics at the National Taichung Institute of Technology.

1. Introduction
Warranties for durable consumer products are common in the marketplace, particularly for complex and expensive products. Blischke and Murthy [3] defined a warranty as a contractual obligation incurred by a manufacturer in connection with the sale of a product, under which the manufacturer is required to ensure the product functions properly during the warranty period. Failure-free and pro rata rebate are the two common types of warranty policies. A failure-free policy obligates the manufacturer to maintain the product free of charge during the warranty period, while a pro rata rebate policy obligates the manufacturer to refund a fraction of the
purchase price if the product fails within the warranty period. Failure-free policies can be further divided into two categories, namely, renewing and non-renewing.
• Renewing policy: if an item fails within the warranty time, it is replaced by a new item with a new warranty. In effect, the warranty begins anew with each replacement.
• Non-renewing policy: replacements of a failed item do not alter the original warranty.
2. Failure/Repair Characterization 2.1. Preliminaries
The product fails randomly. Let X be the failure time of a new product without burn-in; the failure characteristics of the product are described as follows:
• $F(t) = P(X \le t)$: the failure time cumulative distribution;
• $f(t) = dF(t)/dt$: the failure time density function;
• $\bar F(t) = 1 - F(t)$: the survival function;
• $h(t) = f(t)/\bar F(t)$: the failure or hazard rate function;
• $N(t) = \int_0^t h(u)\,du$: the cumulative hazard function.
The failure rate function is assumed to be uni-modal. Generally, $h(t)$ has a bathtub shape as in Fig. 1, with $h(t)$ strictly decreasing for $0 \le t \le \tau_m$ and increasing for $t > \tau_m$. For an increasing failure rate function, $\tau_m = 0$; for a decreasing failure rate function, $\tau_m = \infty$.
2.2. General repair model
In the general failure model, when the product fails, a type I failure occurs with probability $1-p$ and a type II failure occurs with probability $p$, $0 \le p \le 1$. A type I failure is assumed to be minor, and thus able to be corrected by minimal repair, while a type II failure is catastrophic, and thus can be removed only by replacement (see [4]). Let the random variable Y denote the time to the first type II failure of the product without burn-in; then the survival function of Y is given by

$\bar G(t) = P(Y > t) = e^{-\int_0^t p\,h(u)\,du} = e^{-pN(t)} = (\bar F(t))^p, \quad \forall t \ge 0. \qquad (1)$

Also let $g(t) = -d\bar G(t)/dt$ be the density function of Y, while $r(t) = p\,h(t)$ represents the hazard function of Y. After the burn-in time $\tau$, the general repairable product has age $\tau$. Let the subscript $\tau$ denote the corresponding failure functions after burn-in time $\tau$; then we have the following relationships: $G_\tau(t) = [G(\tau+t) - G(\tau)]/\bar G(\tau)$, $g_\tau(t) = g(\tau+t)/\bar G(\tau)$ and $r_\tau(t) = r(\tau+t)$. The survival function of $Y_\tau$ is given by

$\bar G_\tau(t) = P(Y_\tau > t) = e^{-\int_0^t r_\tau(u)\,du} = e^{-p[N(\tau+t) - N(\tau)]} = (\bar F_\tau(t))^p. \qquad (2)$
For general repairable products that do not undergo burn-in, the sequence of type II failures followed by replacement constitutes a renewal process, and the expected number of replacements in $[0, T]$, $V(T)$, is given by the following renewal equation:

$V(T) = G(T) + \int_0^T V(T-t)\,dG(t). \qquad (3)$

Similarly, for general repairable products with burn-in time $\tau$, the expected number of replacements in $[0, T]$ is given by

$V_\tau(T) = G_\tau(T) + \int_0^T V_\tau(T-t)\,dG_\tau(t). \qquad (4)$
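Equations (3)-(4) rarely admit a closed form, but discretizing the time axis turns the renewal equation into a forward recursion. A minimal numerical sketch (not from the paper), checked against the exponential case, where V(T) = lambda*T exactly:

```python
# Numerical solution of the renewal equation
#   V(T) = G(T) + int_0^T V(T - t) dG(t)
# by discretizing [0, T] and accumulating increments of G.
import numpy as np

def renewal_function(G, T: float, steps: int = 2000) -> float:
    """V(T) for a renewal process whose inter-renewal cdf is G."""
    t = np.linspace(0.0, T, steps + 1)
    dG = np.diff(G(t))              # increments G(t_{j+1}) - G(t_j)
    V = np.zeros(steps + 1)
    for i in range(1, steps + 1):
        # Riemann sum of int_0^{t_i} V(t_i - u) dG(u)
        V[i] = G(t[i]) + np.dot(V[i - 1::-1][:i], dG[:i])
    return V[-1]

# Check: exponential inter-renewal times, rate 0.5 -> V(T) = 0.5 * T exactly.
print(renewal_function(lambda t: 1 - np.exp(-0.5 * t), 4.0))  # -> ~2.0
```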
3. Manufacturing Cost Model
The manufacturing cost model contains four costs:
• $C_0$: the manufacturing cost per unit without burn-in;
• $C_1$: the fixed setup cost of burn-in per unit;
• $C_2$: the cost per unit time of burn-in per unit;
• $C_3$: the minimal repair cost per type I failure during burn-in.
Let $V(\tau)$ be the total manufacturing cost per unit for products with burn-in time $\tau$. Then the expected total manufacturing cost per unit for products with burn-in time $\tau$ is given by

$v(\tau) = E[V(\tau)] = \big[ C_0 + C_1 + C_2 \int_0^\tau \bar G(u)\,du + C_3\,(1/p - 1)\,G(\tau) \big] / \bar G(\tau) \qquad (5)$

for $0 < p \le 1$. Differentiating (5) with respect to $\tau$ yields

$v'(\tau) = C_2 + [v(\tau)\,p + C_3\,(1-p)]\,h(\tau) > 0. \qquad (6)$
Thus, for general repairable products, the manufacturing cost per unit increases with the burn-in time.

4. Warranty Cost Model
This section derives the expected warranty cost per unit sold, $a(T,\tau)$, for products with burn-in time $\tau$ and warranty period $T$ under the three types of warranty: the failure-free policies with and without renewing, and the rebate policy.
4.1. Failure-free renewing policy
Let the random cost $w(\tau; T)$ be the total warranty cost per unit sold for products with burn-in time $\tau$ and warranty period $T$. Then

$a(T,\tau) = E[w(\tau;T)] = \big\{ [C_4 + v(\tau)] + (C_3 + C_4)(1/p - 1) \big\}\, G_\tau(T)/\bar G_\tau(T), \qquad (7)$

where $C_4$ denotes the extra cost incurred when a failure occurs during the warranty, regardless of the failure type. Differentiating (7) with respect to $\tau$ yields

$a'(T,\tau) = (\bar G(\tau+T))^{-1} \big\{ [(C_3+C_4)(1-p) + (C_4 + v(\tau))\,p]\,[h(\tau+T) - h(\tau)]\,\bar G(\tau) + [C_2 + (C_3\,(1-p) + v(\tau)\,p)\,h(\tau)]\,[\bar G(\tau) - \bar G(\tau+T)] \big\}. \qquad (8)$
4.2. Failure-free non-renewing policy
Under this non-renewing policy, the total warranty cost per unit sold for the burned-in product with burn-in time $\tau$ during the warranty period $[0, T]$ is denoted by $K(T,\tau)$. Let $Y_1^*, Y_2^*, Y_3^*, \ldots$ be i.i.d. random variables with survival function $\bar G_\tau$. Meanwhile, set $Z_n = Y_1^* + Y_2^* + \cdots + Y_n^*$, $n \ge 1$ ($Z_0 = 0$ by convention) as the waiting time until the occurrence of the $n$-th replacement for the burned-in products during warranty. The renewal function associated with $G_\tau$ is thus produced as $V_\tau(t) = \sum_{n=1}^\infty P(Z_n \le t) = \sum_{n=1}^\infty G_\tau^{(n)}(t)$, where $G_\tau^{(n)}(t)$ denotes the $n$-fold convolution of the distribution $G_\tau(t)$ with itself.
Let $I_A$ represent the indicator function of the event $A$. Conditioning on whether the first type II failure occurs within the warranty period, we have

$a(T,\tau) = E[K(T,\tau)] = E[K(T,\tau);\, Y_1^* > T] + E[K(T,\tau);\, Y_1^* \le T]$
$\quad = \big\{ [C_4 + v(\tau)] + (C_3 + C_4)(1/p - 1) \big\} G_\tau(T) + \int_0^T a(T-y,\tau)\,dG_\tau(y). \qquad (10)$

Thus $a(T,\tau)$ satisfies a renewal-type equation

$a(T,\tau) = L(T,\tau) + \int_0^T a(T-y,\tau)\,dG_\tau(y), \qquad (11)$

for which $L(T,\tau) = \{ [C_4 + v(\tau)] + (C_3 + C_4)(1/p - 1) \}\,G_\tau(T)$. The solution is

$a(T,\tau) = L(T,\tau) + \int_0^T L(T-y,\tau)\,dV_\tau(y). \qquad (12)$
4.3. Rebate policy
In the rebate policy, for the general repairable product, all type I failures in the warranty period $[0, T]$ are rectified (through minimal repair actions) by the manufacturer free of cost, and the buyer is refunded a proportion of the sales price $C_p$ when a type II failure occurs for the first time. The amount of rebate, $R(t)$, is a function of the type II failure time $t$. This study assumes that $R(t)$ is a linear function of $t$, i.e.,

$R(t) = k\,C_p\,(1 - \alpha t/T)$ for $0 \le t \le T$; $R(t) = 0$ for $t > T$, $\qquad (14)$

where $0 < k \le 1$, $0 \le \alpha \le 1$. Two special forms of (14) are the lump-sum rebate policy ($\alpha = 0$) and the pro rata rebate policy ($\alpha = 1$, $k = 1$).
The expected warranty cost for the general repairable product sold under this policy is given by

$a(T,\tau) = (T\,\bar G(\tau))^{-1} \big\{ (C_3+C_4)(1/p - 1)\,T\,[\bar G(\tau) - \bar G(\tau+T)] + k\,C_p\,[T\,\bar G(\tau) - (1-\alpha)\,T\,\bar G(\tau+T) - \alpha \int_0^T \bar G(\tau+t)\,dt] \big\}. \qquad (15)$

Differentiating (15) with respect to $\tau$ yields

$a'(T,\tau) = (T\,\bar G(\tau))^{-1} \big\{ [(C_3+C_4)(1-p) + k\,C_p\,(1-\alpha)\,p]\,T\,[h(\tau+T) - h(\tau)]\,\bar G(\tau+T) + \alpha\,k\,C_p\,p \int_0^T [h(\tau+t) - h(\tau)]\,\bar G(\tau+t)\,dt \big\}. \qquad (16)$
5. Optimization Model
Let $C(T,\tau)$ denote the expected total cost per unit sold for a general repairable product with burn-in time $\tau$ and warranty period $T$, and let $C(T)$ represent the corresponding cost without burn-in. Then

$C(T,\tau) = v(\tau) + a(T,\tau). \qquad (17)$

Notably, $C(T) < \lim_{\tau \to 0^+} C(T,\tau)$; this is due to the fixed setup cost of burn-in $C_1 > 0$. If $C_1 = 0$, then $C(T) = \lim_{\tau \to 0^+} C(T,\tau)$. Thus, for a specified warranty period $T$, the objectives of the manufacturer are:
• to determine the optimal burn-in time $\tau^*$ that minimizes $C(T,\tau)$ when burn-in is used;
• to compare $C(T,\tau^*)$ with $C(T)$. If $C(T,\tau^*) > C(T)$, then the optimal policy is to have no burn-in, while if $C(T,\tau^*) < C(T)$, then the optimal burn-in time is given by $\tau^*$.
T
C’(T,T ) = .I(.) Sufficient conditions for
T*
and equating it to zero yields a necessary
+ a ’ ( T ,7) = 0.
(18)
to be optimal are (i) it should satisfies (18) and (ii)
C’I(T,T*)> 0.
(19)
Since C ( T , T )+ 00 as T -+ 00, T * is always finite. If (18) has no solution, then = 0 (i.e., no burn-in). The optimal T * can be found by solving (18) or by directly minimizing (17) using numerical methods. Theorem 5.1 and Theorem 5.2 give conditions for T * to be zero or nonzero, and these conditions help in computing T * . In these theorems .isatisfies h(?) = h(.i T ) .
T*
+
Theorem 5.1. For a failure-free policy with warranty period $T$:
• $\tau^* = 0$ if $h(0) \le h(T)$.
• A sufficient condition for $\tau^* > 0$ is:
1) $h(0) > \{C_2 + [(C_0+C_1)p + C_3(1-p) + C_4]\,h(T)\}/C_4$ for the renewing policy;
2) $h(0) > \{[(C_0+C_1+C_4) + (C_3+C_4)(1/p-1)]\,g(T) + C_2\,[1+V(T)]\} / \{[(C_0+C_1+C_4) + (C_3+C_4)(1/p-1)]\,\bar G(T) - [C_3(1-p) + (C_0+C_1)p]\,[1+V(T)]\} > 0$ for the non-renewing policy.
• If $\tau^* > 0$ and $\tau_m < \infty$, then $\tau^* < \hat\tau \le \tau_m$.
Theorem 5.2. For a rebate policy with warranty period $T$:
• $\tau^* = 0$ if either I) or II) below holds.
I) $\alpha = 0$ and $h(0) \le h(T)$; or $0 < \alpha \le 1$, $h(0) \le h(T)$ and $h(0) \le G(T)/\{p \int_0^T \bar G(t)\,dt\}$.
II) $D < 0$, where $D = [(C_3+C_4)(1-p) + k\,C_p\,(1-\alpha)\,p]\,T\,\bar G(T) + \alpha\,k\,C_p\,p \int_0^T \bar G(t)\,dt - [(C_0+C_1)p + C_3(1-p)]\,T$.
• A sufficient condition for $\tau^* > 0$ is $h(0) > \{C_2\,T + [(C_3+C_4)(1-p) + k\,C_p\,(1-\alpha)\,p]\,T\,h(T)\,\bar G(T) + \alpha\,k\,C_p\,G(T)\}/D > 0$.
• If $\tau^* > 0$ and $\tau_m < \infty$, then: i) $\tau^* < \hat\tau \le \tau_m$ if $\alpha = 0$; ii) $\tau^* < \tau_m$ if $0 < \alpha \le 1$.
Remark 5.1. The above results yield the following observations:
(i) The optimal burn-in time $\tau^*$ depends on:
• the product failure characteristics;
• the length of the warranty period;
• the cost parameters;
• the probability $p$ of a type II failure.
(ii) Burn-in is beneficial (i.e., $\tau^* > 0$) if:
• the initial failure rate $h(0)$ is large. This confirms the intuitive result that burn-in is only useful for products with a high infant mortality rate;
• failures during the warranty period are costly. This is the case when $C_4$ is large (failure-free policy) or $C_p$ is large (rebate policy). It can be proved that $\tau^*$ increases as $C_4$ increases (failure-free policy) or as $C_p$ increases (rebate policy).
(iii) $\tau^*$ is always less than $\tau_m$; this is to be expected, since one would never burn in beyond the end of the infant mortality period.
6. Numerical Example
Jensen and Petersen [1] and Nguyen and Murthy [2] presented an example of a bathtub-shaped failure rate, where the product failure time is assumed to have a mixed Weibull distribution, i.e.,

$\bar F(t) = \theta\,e^{-\lambda_1 t^{\beta_1}} + (1-\theta)\,e^{-\lambda_2 t^{\beta_2}} \qquad (20)$

with $\lambda_1, \lambda_2 > 0$, $0 < \beta_1 < 1$, $\beta_2 > 1$ and $0 \le \theta \le 1$. This model can be interpreted intuitively as representing a mixture of two kinds of units, with a proportion $\theta$ of defective units and a proportion $1-\theta$ of normal units. For the present example, $\theta = 0.1$, $\lambda_1 = 4$, $\lambda_2 = 0.08$, $\beta_1 = 0.5$ and $\beta_2 = 3$. Moreover, the cost parameters are $C_0 = 5$, $C_1 = 0.2$, $C_2 = 5$ per unit of time, $C_3 = 2$, $C_4 = 10$ and $C_p = 20$. The type II failure probability is $p = 0.2$. The warranty policies considered are defined as follows:
• Policy I: failure-free renewing policy;
• Policy II: failure-free non-renewing policy;
• Policy III: rebate policy with $\alpha = 0$ and $k = 1$;
• Policy IV: rebate policy with $\alpha = 1$ and $k = 1$.
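The mixed-Weibull hazard of (20) with these parameters is easy to evaluate numerically and exhibits the bathtub shape assumed throughout; it can then be fed to the minimization sketch of Section 5. A minimal sketch:

```python
# Bathtub failure rate of the mixed Weibull (20) with the Section 6 parameters,
# ready to plug into the minimization sketch above via its cumulative hazard N.
import numpy as np

theta, lam1, lam2, b1, b2 = 0.1, 4.0, 0.08, 0.5, 3.0

def Fbar(t):
    return theta * np.exp(-lam1 * t**b1) + (1 - theta) * np.exp(-lam2 * t**b2)

def N(t):                      # cumulative hazard N(t) = -ln Fbar(t)
    return -np.log(Fbar(t))

def h(t, eps=1e-6):            # hazard h(t) = dN/dt by finite difference
    return (N(t + eps) - N(t)) / eps

# The rate first falls (infant mortality) and then rises (wear-out):
for t in (0.01, 0.5, 2.0, 4.0):
    print(f"h({t}) = {h(t):.3f}")

# tau_star, cost = optimal_burn_in(T=2.0, N=N)  # reuse the Section 5 sketch
```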
Figure 2 shows $\tau^*$ versus $T$. As $T \to 0$, intuitively $\tau^* \to 0$, since less saving is obtained by burn-in; also, as $T \to \infty$, $\tau^* \to 0$, since in this case burn-in worsens the product. Thus, as $T$ increases from 0 to infinity, $\tau^*$ increases and then decreases, as illustrated.
To study the variation in the magnitude of the saving in the expected total cost with changing $T$, the following is defined: $S(T) = [C(T) - C(T,\tau^*)]/C(T)$. Figure 3 shows $S(T)$ versus $T$. Clearly, $S(T)$ has a maximum value and is negative for small or large $T$. This is due to the fixed burn-in cost $C_1 > 0$. If $C_1 = 0$, then $S(T)$ is always positive, and $S(T) \to 0^+$ as $T \to 0$ or $T \to \infty$.

Acknowledgments
The authors would like to thank the referees for their valuable comments and suggestions.
References
1. F. Jensen and N. E. Petersen, Burn-In, Wiley, New York (1982).
2. D. G. Nguyen and D. N. P. Murthy, IIE Trans. 14, 167 (1982).
3. W. R. Blischke and D. N. P. Murthy, Marcel Dekker, New York (1994).
4. J. H. Cha, J. Appl. Prob. 37, 1099 (2000); 38, 542 (2001); 40, 264 (2003).
DETERMINING OPTIMAL WARRANTY PERIODS FROM THE SELLER'S PERSPECTIVE AND OPTIMAL OUT-OF-WARRANTY REPLACEMENT AGE FROM THE BUYER'S PERSPECTIVE

Y. H. CHIEN*
Department of Statistics, National Taichung Institute of Technology, 129 Sanmin Road, Sec. 3, Taichung, Taiwan
E-mail: [email protected]

S. H. SHEU
Department of Industrial Management, National Taiwan University of Science and Technology, 43 Keelung Rd., Section 4, Taipei 107, Taiwan
E-mail: shsheu@im.ntust.edu.tw

J. A. CHEN
Department of Business Administration, Kao Yuan Institute of Technology, 1821 Chung-Shan Rd., Lu-Chu Hsiang, Kaohsiung, Taiwan
E-mail: [email protected]

*Mr. Chien is the corresponding author; he is an Assistant Professor in the Department of Statistics at the National Taichung Institute of Technology.
This paper considers a general repairable product sold under a failure-free renewing warranty agreement. In the general repairable model, there can be two types of failure: type I failure (a minor failure), which can be rectified by minimal repair; and type II failure (a catastrophic failure), which can be removed only by replacement. After a minimal repair, the product is operational but its failure rate remains unchanged. The aim of this paper is to determine the optimal warranty period and the optimal out-of-warranty replacement age, from the perspectives of the seller (manufacturer) and the buyer (consumer), respectively, by minimizing the corresponding cost functions. Finally, a numerical example is presented.
1. Introduction
Warranties for durable consumer products are common in the marketplace. The primary role of a warranty is to offer a post-sale remedy for consumers when a product fails to fulfill its intended performance during the warranty period. Blischke and Murthy [9] defined a warranty as a contractual obligation incurred by a manufacturer, in connection with the sale of a product, under which the
manufacturer is required to ensure proper functioning of the product during the warranty period. Failure-free and pro rata rebate are the two common types of warranty policies. A failure-free policy obligates the manufacturer to maintain the product free of charge during the warranty period, while a pro rata rebate policy obligates the manufacturer to refund a fraction of the purchase price if the product fails within the warranty period. Failure-free policies can be further divided into two categories: renewing and non-renewing.
• Renewing policy: if a product fails within the warranty time, the product is replaced and a new warranty issued. In effect, the warranty begins anew with each replacement.
• Non-renewing policy: replacements of a failed product do not alter the original warranty.
From the seller's (manufacturer's) and the buyer's (consumer's) perspectives, our goals are to determine, respectively, the optimal warranty period and the optimal out-of-warranty replacement age which minimize the corresponding cost functions.

2. Optimal warranty period from the seller's perspective

In this section, the problem of determining the optimal warranty period, which minimizes the cost function, is considered from the seller's (manufacturer's) perspective. A failure-free renewing warranty policy is adopted, in which minimal repairs or replacement take place according to the following scheme: if a product failure within the warranty is minor (type I), the manufacturer performs a minimal repair; if a product failure within the warranty is catastrophic (type II), the product is replaced and a new warranty is issued. Both minimal repairs and replacements are free of charge to the consumer but incur costs c1 and c2, respectively, to the manufacturer. The cost function consists of the maintenance expense due to product failure within the warranty period and the gain attributable to the length of the warranty period offered. Let the random variable Y denote the waiting time to the first type II failure of a new product; if r(t) denotes the failure rate of the product, the survival function of Y is then given by

Ḡ(t) = P(Y > t) = exp(−∫₀ᵗ p(u) r(u) du).  (1)
Let h(W) be the total maintenance cost (including minimal repairs and replacements) per unit sold for products with warranty period W. Define η − 1 as the number of replacements until the first product surviving the warranty without a type II failure is obtained. Then the random variable η clearly has a geometric distribution given by

P(η = k) = [G(W)]^{k−1} Ḡ(W), k ≥ 1,  (2)

where G(W) = 1 − Ḡ(W). Furthermore, let {Yᵢ, i ≥ 1} be an i.i.d. sequence of random variables distributed according to G. The random cost h(W) accumulates c1 for every minimal repair and c2 for each of the η − 1 replacements, where by convention the replacement sum is 0 when η = 1. Since η is also a stopping time with respect to the σ-field {σ(Y₁, Y₂, …, Yₙ), n ≥ 1}, by Wald's identity the mean cost Eh(W) is given by

Eh(W) = [c1 ∫₀ᵂ q(u) r(u) Ḡ(u) du + c2 G(W)] / Ḡ(W).  (3)
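As a quick sanity check of (2)-(3), the sketch below simulates the renewing warranty cycle directly. It assumes a constant type II probability p and the failure rate r(t) = 2t (cumulative hazard t²); with constant p the type I and type II failure streams are independent Poisson thinnings, so the simulation is exact. All numerical values are demo assumptions, not taken from the paper.

```python
import math, random

# Monte Carlo check of Eqs. (2)-(3): total maintenance cost per unit sold
# under a failure-free renewing warranty of length W.
p, W, c1, c2, runs = 0.3, 1.0, 1.0, 10.0, 100_000
q = 1.0 - p
Lam = lambda t: t * t                      # cumulative hazard of r(t) = 2t

def poisson(mean):
    """Count of unit-rate Poisson arrivals in [0, mean]."""
    n, acc = 0, random.expovariate(1.0)
    while acc < mean:
        n, acc = n + 1, acc + random.expovariate(1.0)
    return n

def cost_per_unit_sold():
    total = 0.0
    while True:                                      # renewing-warranty loop
        y = math.sqrt(random.expovariate(1.0) / p)   # first type II failure time
        if y > W:                                    # product survives warranty
            return total + c1 * poisson(q * Lam(W))  # type I repairs on [0, W)
        total += c1 * poisson(q * Lam(y)) + c2       # replace; warranty renews

random.seed(1)
sim = sum(cost_per_unit_sold() for _ in range(runs)) / runs
g = math.exp(-p * Lam(W))                            # Gbar(W)
theory = (c1 * (q / p) * (1 - g) + c2 * (1 - g)) / g # Eq. (3) in closed form
print(f"E h(W): simulation {sim:.3f} vs formula {theory:.3f}")
```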
The cost structure which we consider here contains two parts. The first part is the mean maintenance cost Eh(W). The second part, the gain, is proportional to the length of the renewing warranty period: as mentioned in Section 1, a warranty can be regarded as an important competitive strategy used by manufacturers to boost their market share, profitability and image. Therefore, if we denote the gain proportionality constant by K > 0, the gain due to the warranty is given by K·W. Thus the cost function considered in this section has the following form:

Cs(W) = Eh(W) − K·W.  (4)

Differentiating (4) with respect to W yields

Cs′(W) = φ(W) − K,  (5)

where

φ(W) = [c1 q(W) + c2 p(W)] r(W) + p(W) r(W) Ḡ(W)⁻¹ [c1 ∫₀ᵂ q(u) r(u) Ḡ(u) du + c2 G(W)].  (6)

It is easy to check that φ′(W) > 0 if p(t) and r(t) are increasing, and lim_{W→∞} φ(W) = ∞. Therefore, the following result is obtained.

Theorem 2.1. Suppose the functions r and p are strictly increasing. Then, for the cost function Cs(W) given in (4), if [c1 q(0) + c2 p(0)] r(0) < K, the optimal warranty period W* (> 0) is unique and finite; otherwise, W* = 0.

Remark 2.1. [c1 q(t) + c2 p(t)] r(t) can be considered as the s-expected marginal maintenance cost function of the failure-free renewing policy. Therefore, Theorem 2.1 shows that it is not worth providing a product warranty (i.e., W* = 0) when the marginal maintenance cost at the product's initial use is high, specifically when [c1 q(0) + c2 p(0)] r(0) ≥ K.
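For intuition, the quantities in (1)-(6) are easy to evaluate numerically. The sketch below uses a placeholder increasing failure rate and type II probability (not the Section 4 values) and locates W* on a grid; it exploits the identity φ(W) = [c1 q + c2 p] r + p r · Eh(W), which follows from (3) and (6).

```python
import numpy as np

# Sketch for Eqs. (3)-(6): mean warranty cost, marginal cost phi, and W*.
# r(t), p(t), c1, c2, K below are illustrative assumptions.
c1, c2, K = 1.0, 10.0, 5.0
t = np.linspace(0.0, 3.0, 3001)
r = 3.0 * t ** 2                     # increasing failure rate
p = 1.0 - 0.8 * np.exp(-t)           # type II failure probability
q = 1.0 - p

def cumtrap(y, x):
    """Cumulative trapezoidal integral of y over x."""
    out = np.zeros_like(y)
    out[1:] = np.cumsum((y[1:] + y[:-1]) * np.diff(x) / 2.0)
    return out

Gbar = np.exp(-cumtrap(p * r, t))                  # Eq. (1)
repairs = cumtrap(q * r * Gbar, t)                 # int_0^W q r Gbar du
Eh = (c1 * repairs + c2 * (1.0 - Gbar)) / Gbar     # Eq. (3)
phi = (c1 * q + c2 * p) * r + p * r * Eh           # Eq. (6)
Cs = Eh - K * t                                    # Eq. (4)

i = np.argmin(Cs)
print(f"W* ~ {t[i]:.3f}, phi(W*) ~ {phi[i]:.2f} (should be close to K = {K})")
```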
3. Optimal out-of-warranty replacement age from the buyer's perspective

The cost structure considered in this section is described as follows. The consumer has purchased a product sold under a failure-free renewing warranty. Within the warranty period W, the manufacturer must maintain the product free of charge. Although the maintenance is free, the consumer experiences inconvenience or loss from each product failure; that is, any failure of a product within the warranty period results not only in the seller's cost of providing the maintenance but also in a cost to the consumer (e.g., handling cost, shortage cost, system down cost, waiting cost, etc.). Therefore, we assume that c_s1 and c_s2 are the costs incurred by the consumer resulting from type I and type II failures, respectively, with c_s1 < c_s2. The aim of this section is to determine the optimal out-of-warranty replacement age which minimizes the expected total cost per unit time over the life cycle for each product purchased by the consumer.
Using arguments similar to those in Section 2, the expected total cost incurred by the consumer during the warranty period can be expressed as

(Ḡ(W))⁻¹ {c_s1 ∫₀ᵂ q(u) r(u) Ḡ(u) du + c_s2 G(W)}.  (7)

And, for each product purchased, the expected total time for the renewing warranty to last is

(Ḡ(W))⁻¹ ∫₀ᵂ Ḡ(u) du.  (8)

Out of warranty, all repair and replacement costs due to product failure are incurred by the consumer. A preventive out-of-warranty replacement policy is now considered, in which minimal repairs or replacement take place according to the following scheme. Out of warranty, a product is completely replaced whenever it reaches the use time T (i.e., age of use W + T) at a cost c_r1 (planned replacement). If the product fails at time of use y ∈ (0, T), it is either replaced, with probability p(W + y) (type II failure), at a cost c_r2 (unplanned replacement), or it undergoes minimal repair, with probability q(W + y) = 1 − p(W + y) (type I failure), at a cost c_m. We assume that c_r1 < c_r2. After a complete out-of-warranty replacement (planned or unplanned), the procedure is repeated (i.e., the consumer is assumed to purchase a new and identical product after a complete replacement). Sheu⁴ considered such a preventive replacement model. Therefore, per unit purchased, the total cost incurred out of warranty by the consumer can be expressed in terms of

Y_W = (Y − W) I{Y > W},  (9)

the residual waiting time to the first out-of-warranty type II failure. Then the expected total cost incurred by the consumer out of warranty is

(Ḡ(W))⁻¹ {(c_s1 + c_m) ∫_W^{W+T} q(u) r(u) Ḡ(u) du + (c_r2 + c_s2) [Ḡ(W) − Ḡ(W + T)] + c_r1 Ḡ(W + T)}.  (10)

Moreover, the expected total operating time for a product out of warranty is

(Ḡ(W))⁻¹ ∫_W^{W+T} Ḡ(u) du.  (11)
Hence, by Eqs. (7) and (10), the expected total cost incurred by the consumer, from the time a product is purchased to the out-of-warranty replacement, can be expressed as

(Ḡ(W))⁻¹ {c_s1 ∫₀ᵂ q(u) r(u) Ḡ(u) du + c_s2 G(W) + (c_s1 + c_m) ∫_W^{W+T} q(u) r(u) Ḡ(u) du + (c_r2 + c_s2) [Ḡ(W) − Ḡ(W + T)] + c_r1 Ḡ(W + T)},  (12)

and by Eqs. (8) and (11), the corresponding expected total operating time per unit purchased is

(Ḡ(W))⁻¹ ∫₀^{W+T} Ḡ(u) du.  (13)

Therefore, by Eqs. (12) and (13), the expected total cost per unit time over the life cycle for each product purchased is given by

C₂(T; W) = (∫₀^{W+T} Ḡ(u) du)⁻¹ × {c_s1 ∫₀ᵂ q(u) r(u) Ḡ(u) du + c_s2 G(W) + (c_s1 + c_m) ∫_W^{W+T} q(u) r(u) Ḡ(u) du + (c_r2 + c_s2) [Ḡ(W) − Ḡ(W + T)] + c_r1 Ḡ(W + T)}.  (14)
In this case, differentiating C₂(T; W) with respect to T, we see that ∂C₂(T; W)/∂T = 0 if and only if

{(c_s1 + c_m) q(W + T) + [(c_r2 + c_s2) − c_r1] p(W + T)} r(W + T) ∫₀^{W+T} Ḡ(u) du − {c_s1 ∫₀ᵂ q(u) r(u) Ḡ(u) du + c_s2 G(W) + (c_s1 + c_m) ∫_W^{W+T} q(u) r(u) Ḡ(u) du + (c_r2 + c_s2) [Ḡ(W) − Ḡ(W + T)] + c_r1 Ḡ(W + T)} = 0.  (15)

Theorem 3.1. Let the functions r and p be continuous. Then, if

{(c_s1 + c_m) q(W) + [(c_r2 + c_s2) − c_r1] p(W)} r(W) < (∫₀ᵂ Ḡ(u) du)⁻¹ {c_s1 ∫₀ᵂ q(u) r(u) Ḡ(u) du + c_s2 G(W) + c_r1 Ḡ(W)},  (16)

and either (a) r and pr are increasing with r unbounded and (c_r2 + c_s2) > c_r1 + (c_s1 + c_m), or (b) [(c_r2 + c_s2) − c_r1] pr + (c_s1 + c_m) qr increases to +∞, then there exists at least one finite T* which minimizes the cost function in Eq. (14). Furthermore, if any of the functions in (a) or (b) are strictly increasing, then T* is unique.

Proof. If the conditions of the theorem are satisfied, then the left-hand side of Eq. (15), viewed as a function of T, is continuous and increasing, negative at T = 0 (by Eq. (16)), and tends to +∞ as T → +∞. Hence there is at least one value 0 < T* < ∞ which satisfies Eq. (15). Since ∂C₂(T; W)/∂T has the same sign change pattern (−, 0, +), it follows that C₂(T; W) has a minimum at T*. Under the strict-increasing assumption, the left-hand side of Eq. (15) is strictly increasing in T, and therefore T* is unique. □
Remark 3.1. {(c_s1 + c_m) q(W + t) + [(c_r2 + c_s2) − c_r1] p(W + t)} r(W + t) can be considered as the s-expected marginal cost function of the age-replacement out-of-warranty policy, and note that

lim_{t→0} {(c_s1 + c_m) q(W + t) + [(c_r2 + c_s2) − c_r1] p(W + t)} r(W + t) = {(c_s1 + c_m) q(W) + [(c_r2 + c_s2) − c_r1] p(W)} r(W),

which represents the marginal cost of the product at its initial out-of-warranty use. The term

(∫₀ᵂ Ḡ(u) du)⁻¹ {c_s1 ∫₀ᵂ q(u) r(u) Ḡ(u) du + c_s2 G(W) + c_r1 Ḡ(W)}

is the cost per unit time for the product within the warranty. Therefore, Theorem 3.1 indicates that Eq. (16) is the necessary condition for continuing to use the product out of warranty (i.e., T* > 0).

4. A numerical example

In this numerical analysis we consider a product with a Weibull life distribution, one commonly used in reliability studies. The p.d.f. of the Weibull distribution with shape parameter β and scale parameter θ is given by

f(t) = (β/θ)(t/θ)^{β−1} exp[−(t/θ)^β], t ≥ 0,
with the parameters of the distribution chosen as β = 3.3 and θ = 10122, so that the expected life μ and the standard deviation σ are 9080 hours and 3027 hours, respectively, as in Barlow and Proschan¹. The following values are used for the other parameters: c1 = 100, c2 = 1000, c_r1 = 5000, c_r2 = 10000, c_s1 = 5, c_s2 = 20 and c_m = 1000. The type II failure probability function is taken as p(y) = 1 − 0.8 e^{−0.1y}. Using these data we first solve for the optimal warranty period W* considered in Section 2; then, based on W*, we solve for the optimal out-of-warranty replacement age T* considered in Section 3. The results obtained for different levels of the gain proportionality constant K are summarized in Table 1. From the numerical results, we can draw the following remarks:
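The whole workflow can be reproduced on a grid, minimizing (4) over W and then (14) over T with the stated parameter values. The grid sizes and integration scheme below are our own choices, so the outputs may differ slightly from the table values.

```python
import numpy as np

# Sketch of the Section 4 workflow: W* from Eq. (4), then T* from Eq. (14).
beta, theta = 3.3, 10122.0
c1, c2, cr1, cr2, cs1, cs2, cm, K = 100, 1000, 5000, 10000, 5, 20, 1000, 0.05

t = np.linspace(0.0, 20000.0, 20001)               # hours
r = (beta / theta) * (t / theta) ** (beta - 1)     # Weibull failure rate
p = 1.0 - 0.8 * np.exp(-0.1 * t)
q = 1.0 - p

def cumtrap(y):
    out = np.zeros_like(y)
    out[1:] = np.cumsum((y[1:] + y[:-1]) * (t[1] - t[0]) / 2.0)
    return out

Gbar = np.exp(-cumtrap(p * r))
I_qr = cumtrap(q * r * Gbar)       # int_0^x q r Gbar du
I_G = cumtrap(Gbar)                # int_0^x Gbar du

Cs = (c1 * I_qr + c2 * (1 - Gbar)) / Gbar - K * t  # Eq. (4)
iW = np.argmin(Cs)
W = t[iW]

num = (cs1 * I_qr[iW] + cs2 * (1 - Gbar[iW])       # Eq. (12) numerator
       + (cs1 + cm) * (I_qr[iW:] - I_qr[iW])
       + (cr2 + cs2) * (Gbar[iW] - Gbar[iW:])
       + cr1 * Gbar[iW:])
C2 = num / I_G[iW:]                                # Eq. (14) over T >= 0
iT = np.argmin(C2)
print(f"W* ~ {W:.0f} h, T* ~ {t[iW + iT] - W:.0f} h")
```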
Table 1. Optimal warranty period W* and out-of-warranty replacement age T* for different values of K.

K       W*        T*
0.01    2225.01   5715.21
0.025   3313.98   4532.67
0.05    4479.53   3147.05
0.075   5343.11   2002.59
0.1     6055.01   995.76
0.2     8184.61   0.00
(i) From the seller's (manufacturer's) perspective, the optimal warranty period W* intuitively increases as the gain proportionality constant K increases. (ii) From the buyer's (consumer's) perspective, the optimal out-of-warranty replacement age T* decreases to 0 as the warranty period W* provided by the seller increases. This is to be expected since the longer the warranty period, the larger the product's out-of-warranty failure rate; at that point it is not worth continuing to use the out-of-warranty product.
Acknowledgments

The authors would like to thank the referees for their valuable comments and suggestions.
References
1. R. E. Barlow and F. Proschan, Mathematical Theory of Reliability, Wiley, New York (1965).
2. S. M. Ross, Applied Probability Models with Optimization Applications, Holden-Day, San Francisco, California (1970).
3. D. N. P. Murthy, Eng. Optim. 15, 280 (1990).
4. S. H. Sheu, Microelectron. Reliab. 31, 1009 (1991).
5. W. R. Blischke and D. N. P. Murthy, Eur. J. Operational Res. 62, 127 (1992).
6. D. N. P. Murthy and W. R. Blischke, Eur. J. Operational Res. 62, 261 (1992).
7. D. N. P. Murthy and W. R. Blischke, Eur. J. Operational Res. 63, 1 (1992).
8. A. Mitra and J. G. Patankar, Int. J. Production Economics 20, 111 (1993).
9. W. R. Blischke and D. N. P. Murthy, Warranty Cost Analysis, Dekker, New York (1994).
10. S. S. Ja, V. G. Kulkarni, A. Mitra and J. G. Patankar, IEEE Trans. Reliab. R-50, 346 (2001).
11. R. H. Yeh and H. C. Lo, Eur. J. Operational Res. 134, 59 (2001).
WARRANTY AND IMPERFECT REPAIRS
S. CHUKOVA
School of Mathematical and Computing Sciences, Victoria University of Wellington, PO Box 600, Wellington, New Zealand
E-mail: [email protected]

Y. HAYAKAWA
School of International Liberal Studies, Waseda University, 1-7-14-4F Nishi-Waseda, Shinjuku-ku, Tokyo 169-0051, Japan
E-mail: [email protected]

A brief introduction to concepts and problems in warranty analysis is presented. The degree of warranty repair over the warranty period has an impact on the expected warranty costs and influences consumers' expenses over the post-warranty usage period of the product. Some techniques and approaches for modeling imperfect repairs are reviewed. A particular model is used to illustrate the impact of the degree of repair on warranty servicing costs.
1. Introduction

All products are subject to failure. A failure can be due to a manufacturing defect or to wearout of the product. Usually, repairs due to manufacturing defects are covered by the warranty assigned to the product at the time of its sale. Extended warranties, which nowadays are quite popular with consumers, may also cover repairs caused by wearout. A warranty provides indirect information about the quality of products, and it may influence competition in the marketplace. This is why the length of warranty coverage has generally increased over the years; for example, warranties for automobiles are now 3 years/36,000 miles, or more for some models, compared to only 1 year/12,000 miles twenty years ago.

Warranty repairs can affect the overall reliability of the product. The influence of a repair on the lifetime of the product is measured by the degree of the repair. Naturally, a "higher" degree of warranty repair adds to the total warranty costs. At the same time, an improvement of the product during a warranty repair decreases future warranty costs by increasing the product's reliability and reducing the number of failures within the warranty period.

In this paper we focus on imperfect warranty repairs and their impact on expected warranty costs. Section 2 is a short introduction to warranty analysis. Section 3 briefly reviews tools and approaches for modeling imperfect repairs. In Section 4 an example is given to illustrate the impact of imperfect repairs on expected warranty costs under a renewing free replacement warranty policy.

2. Product Warranty

A product warranty is an agreement offered by a producer to a consumer to repair or replace a faulty item, or to partially or fully reimburse the consumer, in the event of a failure. A warranty may depend on one parameter (usually time) or on more than one parameter (e.g., time and usage for automobiles); thus, a warranty can be one-dimensional or multi-dimensional. A multi-dimensional warranty is usually defined by the geometric measure of the region of coverage. The form of reimbursement of the customer on failure of an item, or on dissatisfaction with service, is one of the most important characteristics of a warranty. The most common forms of reimbursement and warranty policies are reviewed in Blischke and Murthy¹.

Despite the fact that warranties are so commonly used, the accurate pricing of warranties can be difficult in many situations. This may seem surprising, since the fulfillment of warranty claims may represent a substantial liability for large companies. For example, according to the 2002 General Motors annual report, the company had net profits of US$1.7 billion and sales of 8.4 million units; the estimated future warranty and related costs on these units was put at US$4.3 billion, substantially more than profits. Underestimating true warranty costs results in losses for a company; on the other hand, overestimating them may lead to uncompetitive product prices and unduly negative reports to stockholders, and as a result the amount of product sales may decrease. The data relevant to the modeling of warranty costs in a particular industry are usually highly confidential; much warranty analysis therefore takes place in internal divisions of large companies.

The common warranty parameters of interest to be analyzed and evaluated are the expected total warranty cost over the warranty period, as well as over the life cycle of the item. These quantities reflect and summarize the financial risk or burden carried by buyers, sellers and decision makers. The evaluation of the parameters of the warranty program (e.g., the warranty period, price, etc.) can be obtained by using appropriate models, from the producer's, seller's, buyer's as well as decision maker's points of view. Their values result from the application of analytical or approximation methods, often combined with an optimization problem. Due to the complexity of the models, it is almost always necessary to resort to numerical methods, since analytical solutions exist only in the simplest situations. A general treatment of warranty analysis is given in Blischke and Murthy¹ and Chukova et al.⁶; Murthy and Djamaludin¹³ provide a recent extensive literature review of the field. A pictorial representation of the classification of mathematical models in warranty analysis is given in Fig. 1; a version of this classification can be found in Chukova⁴.
Figure 1. Mathematical models in warranties (classification tree: models for nonrepairable items, repairable items, and complex items).
3. Imperfect Repairs
The evaluation of the warranty cost, or of any other parameter of interest in modeling warranties, depends on the failure and repair processes and on the preventive warranty maintenance assigned to the items. Repairs can be classified according to the degree to which they restore the ability of the item to function (Pham and Wang¹⁶). Post-failure repairs affect repairable products in one of the following ways:

(a) Improved repair. A repair brings the product to a state better than when it was initially purchased. This is equivalent to the replacement of the faulty item by a new and improved item.
(b) Complete repair. A repair completely resets the performance of the product, so that upon restart the product operates as a new one. This type of repair is equivalent to a replacement of the faulty item by a new one, identical to the original.
(c) Imperfect repair. A repair contributes some noticeable improvement to the product. It effectively sets back the clock for the repaired item: after the repair, the performance and expected lifetime of the item are as they were at an earlier age.
(d) Minimal repair. A repair has no impact on the performance of the item. It brings the product from a 'down' to an 'up' state without affecting its performance.
(e) Worse repair. A repair contributes some noticeable worsening of the product. It effectively sets forward the clock for the repaired item: after the repair, the performance of the item is as it would have been at a later age.
(f) Worst repair. A repair accidentally leads to the product's destruction.
What could be the reasons for imperfect, worse or worst repair? Some possible reasons (see also Brown and Proschan³ and Nakagawa and Yasui¹⁴) are: incorrect assessment of the faulty item; damage caused to adjacent parts or subsystems while repairing the faulty part; partial repair of the faulty part; human errors such as incorrect adjustment and further damage of the item; replacement with faulty or incompatible parts; and so on. The type of repair that takes place depends on the warranty reserves, related costs, assigned warranty maintenance, and the reliability and safety requirements of the product. The existence of an extended warranty, or of any additional agreements in the warranty contract, may influence the degree of the repair performed on a faulty item under warranty.

Mathematically, the degree of repair can be modeled through different characteristics of the lifetime distribution of the item, for example the mean total time to failure, the failure rate function or the cumulative distribution function (Chukova et al.⁵). More sophisticated techniques, involving stochastic processes to model the virtual age of the product and its dependence on the degree of repair, are also of interest (Lam¹¹, Lindqvist¹², Pham and Wang¹⁶, Wang and Pham¹⁷). Moreover, with respect to the length of the repair, two scenarios are possible, namely instantaneous repairs and repairs with deterministic or random duration (Chukova and Hayakawa⁷, Chukova and Hayakawa⁸).

4. Example

4.1. The Age-Correcting Repair Model

Let the initial lifetime X of a new product sold under warranty be a continuous random variable with cumulative distribution function (c.d.f.) F(x) (F(0) = 0 and F(x) < 1 for all x ≥ 0), probability density function (p.d.f.) f(x), failure rate function λ(x), cumulative failure rate function Λ(x), and survival function F̄(x). We model imperfect or worse repairs through the failure rate function of the lifetime of the product. Let δᵢ denote the lack of perfection of the i-th repair. Then

T₀ = 0,  Tᵢ = Tᵢ₋₁ + δᵢ Xᵢ,  i = 1, 2, …,  (1)

are the values of the virtual age of the product immediately after the i-th repair. If δᵢ = 1 the i-th repair is a minimal one, whereas if δᵢ > 0 and δᵢ < (>) 1 the i-th repair is an imperfect (worse) one. The extreme case δᵢ = 0 corresponds to a complete repair. The model described in (1) is Kijima's Model I (see Kijima¹⁰). We consider this model under the assumption that δᵢ = δ ≠ 0, and refer to δ as an age-correcting factor: if δ < 1 it is an age-reducing factor, and if δ > 1 it is an age-accelerating factor. In warranty it is natural to assume that 0 < δ < 1, which corresponds to reliability improvement of the product.

With no failures in [0, u), u > 0, the product has the original failure rate function λ(x) for x ∈ (0, u). Referring to Fig. 2a and Fig. 2b, the first age-reducing repair occurs at an instant u. After the repair the product is improved and its performance is as it was earlier, when the age of the product was δu. At calendar age u, which is the time of the first repair, the virtual age of the product is δu. From time u onwards, until the next repair, the performance of the product is modeled by the modified original failure rate function λ(x − (u − δu)). Assume that the next failure is at calendar age u + v. The instantaneous repair improves the performance of the product, and its virtual age becomes δu + δv. Physically, between two consecutive failures the product experiences an age accumulation v, but due to the age-correcting repair its virtual age accumulation is δv. The failure rate function of the lifetime of a product maintained with age-correcting repairs is thus a modification of λ(x), as shown in Fig. 2. For any particular product in a homogeneous population this function has jumps whenever an age-correcting repair occurs; successive failures reduce (or prolong) the increments in virtual age by the factor δ, and hence the virtual failure rate is compressed (or stretched) compared to the original failure rate. Following the ideas in Nelson¹⁵, for a population of products (with i.i.d. lifetimes) maintained under age-correcting repairs with identical age-correcting factors, the population failure rate is obtained by averaging the possible individual failure rates of all products. The failure rate for this population is the virtual failure rate of one randomly selected product maintained under age-correcting repairs. We denote it by λ*(x), where x is a calendar age; it reflects the overall slow-down or acceleration of the aging process for an "average" product from the population. See Dimitrov et al.⁹ for details on this model.

Consider the sequence 0 = T₀ ≤ T₁ ≤ T₂ ≤ … of times representing the virtual age of the product after the n-th repair, and let {N_t^v, t ≥ 0} be the counting process corresponding to {Tₙ}. From Theorem A.4 (p. 380) in Block et al.², it follows that {N_t^v, t ≥ 0} is a non-homogeneous Poisson process (NHPP) with leading function

Λ^v(t) = (1/δ) Λ(δt),  (2)

where Λ(t) = −log(1 − F(t)) is the leading function of the NHPP associated with the process of instantaneous minimal repairs.
1.5
0.5
1.25
0.4
1 0.75
0.3
0.5
0.2
0.25
0.1
6u$,u+vP6
Fig.2a
8
X
u+v
Original and individual virtual failure rates under age-reducing factor 6 = .6.
Fig.2b
Original and individual virtual failure rates under ageaccelerating factor 6 = 1.2.
For δ = 0, equation (2) is not valid, because it reflects the failure rate immediately after an age-reducing repair only. Equation (2) shows that the transformation between the calendar and the virtual time scales is t ↦ t/δ; in other words, the virtual lifetime T and the calendar lifetime X multiplied by δ are equal in distribution, i.e., T =_d δX. Therefore, when the product is at calendar age x, its virtual age measured on the calendar age scale is δx. Thus (see Dimitrov et al.⁹ for details)

λ*(x) = λ(δx) and Λ*(x) = (1/δ) Λ(δx), for x ≥ 0, δ > 0.
Denote by C_r(u, δ) the cost of an age-reducing repair of factor δ at calendar age u of the product, and by the constant C_m(u) = C_m the cost of a minimal repair of the product. Let

C_r(u, δ) = C₀ (1 − δ) u / M,  (3)

where M is a limiting age after which the product cannot be sold and C₀ is the price of a new item.

4.2. Costs Analysis: Renewing Warranty
Here and onwards the time scale is the calendar age time scale. Under the age-reducing repair model, we focus on the cost analysis of a renewing free replacement warranty of duration T. Then (see Dimitrov et al.⁹ for details) the following is true.

The expected warranty cost C_W(t₀, T, δ), associated with a product sold at age t₀ under a renewing free replacement warranty of duration T and maintained under age-reducing repairs of factor δ, satisfies the integral equation

C_W(t₀, T, δ) = ∫_{t₀}^{t₀+T} [C_m + C_r(u, δ) + C_W(δu, T, δ)] dF(u)/F̄(t₀),  (4)

obtained by conditioning on the first failure age u after the sale, with the boundary condition C_W(t₀, 0, δ) = 0. Consider products with the Weibull lifetime distribution, i.e., X ∈ Weibull(μ, α), with

λ(x) = (α/μ)(x/μ)^{α−1}, Λ(x) = (x/μ)^α, x ≥ 0.

Taking into account (3), equation (4) becomes
C_W(t₀, T, δ) = ∫_{t₀}^{t₀+T} [C_m + C₀(1 − δ)u/M + C_W(δu, T, δ)] (α/μ)(u/μ)^{α−1} exp[(t₀/μ)^α − (u/μ)^α] du.  (5)

Fig. 3a. C_W(0, T, δ) with fixed values of δ.
Fig. 3b. C_W(0, T, δ) with fixed T.
Fig. 3c. δ_max as a function of T.
Here and onwards the product's lifetime is assumed to be Weibull(μ = 2, α = 1.5), and the values of the remaining parameters are M = 4, C₀ = 100 and C_m = 15. Fig. 3a illustrates the dependence of C_W(0, T, δ) on the warranty period T under age-reducing repairs of factor δ, where δ assumes the values 1.0, 0.85, 0.67, 0.4 and 0.2; we observe that the expected warranty cost C_W(0, T, δ) is an increasing function of T. Using numerical optimization and the dependence of C_W(0, T, δ) on δ shown in Fig. 3b, we observe that C_W(0, 3, δ) has a maximum at δ_max = 0.67757. The existence of δ_max was to be expected, due to the renewing warranty scenario and the fact that the lifetime distribution is an IFR distribution: the length of the renewing warranty coverage is a function of the reliability of the product (small values of δ lead to shorter warranty coverage), while, due to (3), the cost per repair at time u, C_r(u, δ), is a decreasing function of δ. Fig. 3c represents the values of δ_max for T ∈ [0, 3]; in other words, it gives the "worst" value of the age-reducing factor as a function of the length of the warranty period. It is interesting to observe that the range of δ_max is very small for quite a large range of T values. The illustrations are for a new product (t₀ = 0); however, equation (5) allows one to study the dependence of the warranty cost on the selling age t₀.
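A renewal-type equation such as (4) can be solved by successive approximation on a grid of selling ages. The sketch below uses the parameter values quoted above; the per-repair cost C_m + C_r(u, δ) and the renewal from virtual age δu follow the reconstruction of (4) given earlier and are assumptions, not the authors' code.

```python
import numpy as np

# Fixed-point iteration for the renewing-warranty cost equation (4)/(5).
ALPHA, MU = 1.5, 2.0                 # Weibull(mu = 2, alpha = 1.5)
M, C0, CM = 4.0, 100.0, 15.0

def solve_cw(T, d, n_grid=241, n_nodes=400, n_iter=200):
    t0s = np.linspace(0.0, M, n_grid)
    cw = np.zeros(n_grid)
    w = np.linspace(0.0, 1.0, n_nodes)
    for _ in range(n_iter):
        new = np.empty(n_grid)
        for i, t0 in enumerate(t0s):
            u = t0 + w * T
            lam = (ALPHA / MU) * (u / MU) ** (ALPHA - 1)
            dens = lam * np.exp(-((u / MU) ** ALPHA - (t0 / MU) ** ALPHA))
            repair = CM + C0 * (1.0 - d) * u / M
            # C_W at the post-repair virtual age d*u (np.interp clamps
            # flat beyond the grid, adequate for a sketch):
            renew = np.interp(d * u, t0s, cw)
            vals = (repair + renew) * dens
            new[i] = np.sum((vals[1:] + vals[:-1]) / 2.0) * (u[1] - u[0])
        cw = new
    return cw[0]                     # C_W(0, T, d) for a new product

for d in (0.2, 0.67757, 1.0):
    print(f"delta = {d:.5f}:  C_W(0, 3, delta) ~ {solve_cw(3.0, d):.1f}")
```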
References
1. Blischke, W. R. and Murthy, D. N. P. Product Warranty Handbook. Marcel Dekker, 1996.
2. Block, H. W., Borges, W., and Savits, T. H. Age-dependent minimal repair. Journal of Applied Probability, 22:370-385, 1985.
3. Brown, M. and Proschan, F. Imperfect repair. Journal of Applied Probability, 20:851-859, 1983.
4. Chukova, S. On taxonomy of mathematical models in warranty analysis. In Vandev, D., editor, Proceedings of Statistical Data Analysis '96, pages 124-133, Sozopol, Bulgaria, 12-17 September 1996.
5. Chukova, S., Arnold, R., and Wang, D. Warranty analysis: An approach to modelling imperfect repairs. International Journal of Production Economics, 2004 (to appear).
6. Chukova, S., Dimitrov, B., and Rykov, V. Warranty analysis: a survey. Journal of Soviet Mathematics, 67(6):3486-3508, 1993.
7. Chukova, S. and Hayakawa, Y. Warranty cost analysis: Non-renewing warranty with non-zero repair time. Applied Stochastic Models in Business and Industry, 20(1):59-71, 2004.
8. Chukova, S. and Hayakawa, Y. Warranty cost analysis: Renewing warranty with non-zero repair time. International Journal of Reliability, Quality and Safety Engineering, 2004 (to appear).
9. Dimitrov, B., Chukova, S., and Khalil, Z. Warranty costs: An age-dependent failure/repair model. Naval Research Logistics, 2004 (under review).
10. Kijima, M. Some results for repairable systems with general repair. Journal of Applied Probability, 26:89-102, 1989.
11. Lam, Y. Geometric process and replacement problem. Acta Mathematicae Applicatae Sinica, 4(4):366-377, 1988.
12. Lindqvist, B. Repairable systems with general repair. In Proceedings of the European Safety and Reliability Conference, pages 1-2, Munich, Germany, 13-17 September 1999.
13. Murthy, D. N. P. and Djamaludin, I. New product warranty: A literature review. International Journal of Production Economics, 79(2):236-260, 2002.
14. Nakagawa, T. and Yasui, K. Optimum policies for a system with imperfect maintenance. IEEE Transactions on Reliability, R-36(5):631-633, 1987.
15. Nelson, W. Graphical analysis of system repair data. Journal of Quality Technology, 20:24-35, 1988.
16. Pham, H. and Wang, H. Imperfect maintenance. European Journal of Operational Research, 94:425-438, 1996.
17. Wang, H. and Pham, H. A quasi renewal process and its applications in imperfect maintenance. International Journal of Systems Science, 27:1055-1062, 1996.
ACCEPTANCE SAMPLING PLANS BASED ON FAILURE-CENSORED STEP-STRESS ACCELERATED TESTS FOR WEIBULL DISTRIBUTIONS*

SANG WOOK CHUNG and YOUNG SUNG SEO
Department of Industrial Engineering, Chonnam National University, 300 Yongbong-dong, Buk-gu, Gwangju 500-757, Korea

WON YOUNG YUN
Department of Industrial Engineering, Pusan National University, 30 Jangjeon-dong, Geumjeong-gu, Busan 609-735, Korea
This paper considers the design of acceptance sampling plans based on failure-censored step-stress accelerated tests for items having Weibull lives. The accelerated life tests assume that i) a linear relationship exists between the log Weibull scale parameter and (transformed) stress, ii) the Weibull shape parameter is constant over stress, and iii) the parameters involved are estimated by the method of maximum likelihood. The sample size and the lot acceptability constant are determined to satisfy the producer's and consumer's risks, and the accelerated life test is optimized to have a minimum sample size by minimizing the asymptotic variance of the maximum likelihood estimator of the test statistic. The proposed sampling plans are compared with sampling plans using constant stress accelerated tests.
1. Introduction

It is essential from the consumer's point of view that a product perform the required function without failure for the desired period of time. The lifetime of a product is therefore one of the most important quality characteristics. Life-test sampling plans are commonly used to determine the acceptability of a product with respect to lifetime. The design of life-test sampling plans has been considered by several authors. Fertig and Mann¹² discussed sampling plans for the Weibull distribution using best linear invariant estimators. Kockerlakota and Balakrishnan¹⁴ considered one- and two-sided sampling plans for the exponential distribution. Schneider²⁰ discussed failure-censored sampling plans for the lognormal and Weibull distributions using maximum likelihood estimators (MLEs). Balasooriya⁶ considered failure-censored sampling plans for the exponential distribution under circumstances where items are to be tested in sets of fixed size. Balasooriya and Saw⁸, Balasooriya and Balakrishnan⁷, and Balasooriya et al.⁹ considered sampling plans for the exponential, lognormal, and Weibull distributions, respectively, based on progressively failure-censored data.

Many modern high-reliability products are designed to operate without failure for a very long time. Life testing for these products under use conditions takes a lot of time to obtain reasonable failure information.

*This study was financially supported by Chonnam National University in the program, 2002.
In this situation, sampling plans based on such life tests are impractical. Introducing accelerated life tests (ALTs) into life-test sampling plans can be a good way to overcome this difficulty. Wallace²¹ stressed the need for introducing ALTs into the future plans of MIL-STD-781. ALTs are used in many contexts to obtain information quickly on the lifetime distribution of products: test items are subjected to higher than usual levels of stress to induce early failures, and the test data obtained at accelerated conditions are extrapolated, by means of an appropriate model, to the use conditions to obtain an estimate of the lifetime distribution under use conditions. In ALTs the stress can be applied to test items in various ways. In a constant stress ALT the stress applied to the items is constant during the test, while in a step-stress ALT the stress changes either at a fixed time or on the occurrence of a fixed number of failures. Nelson (Ref. 18, Chap. 6) and Meeker and Escobar (Ref. 16, Chap. 20) provide references and planning methods for constant stress ALTs. Some work on planning step-stress ALTs can be found in Miller and Nelson¹⁷, Bai et al.⁵, Bai and Kim³, Chung and Bai¹⁰, and Alhadeed and Yang¹.

Sampling plans based on ALTs have been explored in previous work. Yum and Kim²² developed failure-censored constant stress ALT sampling plans for the exponential distribution. Hsieh¹³ extended the work of Yum and Kim²² and obtained sampling plans that minimize the total censoring number. Bai et al.⁴ considered the design of failure-censored constant stress ALT sampling plans for the lognormal and Weibull distributions. Under the constraint that the tests at high and low stress levels have equal expected test times, Bai et al.² considered failure-censored constant stress ALT sampling plans for the Weibull distribution. Chung et al.¹¹ considered the design of failure-censored step-stress ALT sampling plans for the lognormal distribution with known scale parameter.

This paper considers the design of acceptance sampling plans based on failure-censored step-stress ALTs for items having Weibull lives. The sample size and the acceptability constant are obtained which satisfy the producer's and consumer's risks, and the accelerated life test is optimized to have a minimum sample size by minimizing the asymptotic variance of the test statistic. The proposed sampling plans are compared with sampling plans using constant stress ALTs.

2. The Model

2.1. Assumptions

1. At any stress x, the lifetimes of the test items follow a Weibull distribution with shape parameter η and scale parameter θ(x) = exp(w₀ + w₁x); i.e., the log lifetimes follow a smallest extreme value distribution with location parameter μ(x) = log[θ(x)] = w₀ + w₁x and scale parameter σ = 1/η.
2. The Weibull shape parameter η is constant over stress.
3. The cumulative exposure model¹⁸ holds for the effect of changing stress.
4. The lifetimes of test items are statistically independent.
2.2. Standardized Model

For convenience, define the standardized stress as ξ = (x − x_D)/(x_H − x_D), where x_D and x_H are the prespecified design and high stresses, respectively. For the design stress x = x_D, ξ = ξ_D = 0; for the low stress x = x_L, ξ = ξ_L (0 < ξ_L < 1); and for the high stress x = x_H, ξ = ξ_H = 1. The location parameter μ(x) of the log lifetime distribution of test items at stress x can be rewritten in terms of ξ as μ(ξ) = γ₀ + γ₁ξ, where γ₀ = w₀ + w₁x_D = μ(x_D) and γ₁ = w₁(x_H − x_D) = μ(x_H) − μ(x_D). Note that μ_D = μ(x_D) = μ(ξ_D) = γ₀.

2.3. Test Procedure

Low-to-high (LH) mode test procedure (simulated in the sketch that follows):
1. n items are first run simultaneously at ξ_L.
2. When nq_L items have failed at ξ_L, the stress applied to the surviving n(1 − q_L) items is changed to ξ_H.
3. The test is ended when an additional n(q_c − q_L) items have failed.

High-to-low (HL) mode test procedure:
1. n items are first run simultaneously at ξ_H.
2. When nq_H items have failed at ξ_H, the stress applied to the surviving n(1 − q_H) items is changed to ξ_L.
3. The test is ended when an additional n(q_c − q_H) items have failed.
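The LH-mode procedure under the cumulative exposure model (assumption 3) can be simulated directly by inverting the piecewise cumulative hazard. The shape and scale values below are illustrative placeholders, not values from the paper.

```python
import numpy as np

# Simulation sketch of the LH-mode step-stress test (cumulative exposure model).
rng = np.random.default_rng(7)
n, qL, qc = 100, 0.3, 0.7
eta, theta_L, theta_H = 2.0, 10.0, 4.0     # assumed shape and Weibull scales

e = np.sort(rng.exponential(size=n))       # unit-exponential total exposures
r_L, r_c = int(n * qL), int(n * qc)
tau = theta_L * e[r_L - 1] ** (1 / eta)    # stress-change time: qL-th failure

# Invert the cumulative-exposure hazard to get calendar failure times:
# H(t) = (t/theta_L)^eta for t <= tau,
# H(t) = ((t - tau)/theta_H + tau/theta_L)^eta for t > tau.
t = np.where(
    e <= (tau / theta_L) ** eta,
    theta_L * e ** (1 / eta),
    tau + theta_H * (e ** (1 / eta) - tau / theta_L),
)
print(f"stress change at t = {tau:.2f}, test ends at t = {t[r_c - 1]:.2f}")
```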
3. Acceptance Sampling Plans

3.1. Lot Acceptance Procedure

Let L′ denote the actual one-sided lower specification limit on the lifetime of a product with the Weibull distribution; that is, items with lifetime less than L′ are considered nonconforming. Instead of using the actual lifetime Y, we work with the log lifetime log Y, which leads to the smallest extreme value distribution with location parameter μ and scale parameter σ. The lower specification limit on the log lifetime is L = log L′. A sample of n items is selected at random from a lot and is tested according to the above test procedure. MLEs μ̂_D = γ̂₀ and σ̂ of the location parameter μ_D = γ₀ at the use condition and of the scale parameter σ, respectively, are obtained from the test data. In this paper we apply the well-known procedure of Lieberman and Resnikoff¹⁵ to judge whether a lot should be accepted or rejected. The value of the test statistic

T = μ̂_D − kσ̂  (1)

is compared with L, and the lot is either accepted if T ≥ L or rejected if T < L, where k is the acceptability constant.

According to an agreement between the producer and the consumer, lots with fraction nonconforming p ≤ p_α are presumed to be good and should be accepted with probability at least 1 − α, where α is the producer's risk. On the other hand, lots with p ≥ p_β are to be rejected with probability at least 1 − β, where β is the consumer's risk. The sample size n and the acceptability constant k are to be determined to satisfy the producer's and consumer's risks.
3.2. Asymptotic Variance of Test Statistic

Let ξ₁ and ξ₂ be the first and second stress levels, respectively, and let q₁ and q₂ be the failure proportions at ξ₁ and ξ₂, respectively. Then, for LH mode, ξ₁ = ξ_L, ξ₂ = ξ_H, q₁ = q_L and q₂ = q_H = q_c − q_L; for HL mode, ξ₁ = ξ_H, ξ₂ = ξ_L, q₁ = q_H and q₂ = q_L = q_c − q_H. The Fisher information matrix can be obtained by taking expectations of the negatives of the second partial derivatives of the log likelihood. Applying the approximation of Schneider²⁰, the Fisher information matrix is obtained as

F(γ₀, γ₁, σ) = (n/σ²) M(ξ₁, ξ₂, q₁, q₂),  (2)

where M is a symmetric 3 × 3 matrix whose entries depend on the standardized stress levels ξ₁, ξ₂ and the failure proportions q₁, q₂. The asymptotic variance-covariance matrix of the MLE (γ̂₀, γ̂₁, σ̂) is obtained by inverting the Fisher information matrix. The asymptotic variance of the test statistic T = μ̂_D − kσ̂ is

Asvar[T] = Asvar[γ̂₀] − 2k Ascov[γ̂₀, σ̂] + k² Asvar[σ̂],  (3)

and the standardized asymptotic variance is defined by

V = (n/σ²) Asvar[T].  (4)
3.3. Determination of Sample Size and Acceptability Constant

Based on asymptotic distribution theory, the test statistic T = μ̂_D − kσ̂ is approximately normally distributed with mean E[T] = μ_D − kσ and variance Var[T] = (σ²/n)V. Then, following the argument of Schneider²⁰, we have

k = −(w_{1−α} z_{p_β} + w_{1−β} z_{p_α}) / (w_{1−α} + w_{1−β}),  (5)

and

n = [(w_{1−α} + w_{1−β}) / (z_{p_β} − z_{p_α})]² V,  (6)

where w_{1−α} and w_{1−β} are quantiles of the standard normal distribution and z_{p_α} and z_{p_β} are quantiles of the standard smallest extreme value distribution. From (5) and (6) we can see that k is determined by the two points (p_α, 1 − α) and (p_β, β) on the OC curve, and n by the two points and V.
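Equations (5)-(6) are easy to evaluate; the short script below reproduces the acceptability constant k = 3.607 shown in Table 1 for (p_α, p_β) = (0.01090, 0.05350). The value of V must come from the optimized test design, so the V used here is only a placeholder.

```python
from math import log, ceil
from statistics import NormalDist

# Eqs. (5)-(6): acceptability constant k and sample size n from the two
# OC-curve points (p_alpha, 1 - alpha) and (p_beta, beta).
def sev_quantile(p):
    """p-quantile of the standard smallest extreme value distribution."""
    return log(-log(1.0 - p))

def plan(p_a, alpha, p_b, beta, V):
    w_a = NormalDist().inv_cdf(1.0 - alpha)            # w_{1-alpha}
    w_b = NormalDist().inv_cdf(1.0 - beta)             # w_{1-beta}
    z_a, z_b = sev_quantile(p_a), sev_quantile(p_b)
    k = -(w_a * z_b + w_b * z_a) / (w_a + w_b)         # Eq. (5)
    n = ceil(((w_a + w_b) / (z_b - z_a)) ** 2 * V)     # Eq. (6)
    return k, n

k, n = plan(0.01090, 0.05, 0.05350, 0.10, V=20.0)      # V = 20 is a made-up value
print(f"k = {k:.3f}, n = {n}")                         # k matches Table 1's 3.607
```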
4. Optimum Sampling Plans
The sample size n depends heavily on V. Thus the minimum sample size n* can be determined by optimally designing the step-stress ALT so that V is minimized. V is a function of p_α, α, p_β, β, q_c, q_L, ξ_L and σ. The optimum design problem for failure-censored step-stress ALT sampling plans can therefore be stated as follows: given the values of p_α, α, p_β, β, q_c, ξ_L and σ, find the value of q_L minimizing V. The optimum value of q_L can be found by a numerical method such as the Powell method¹⁹.

Table 1 presents optimum sampling plans for various values of p_α, p_β, q_c, ξ_L and σ when 1 − α = 0.95 and β = 0.10. Following Schneider²⁰, the values of p_α and p_β are chosen to match MIL-STD-105D. The values of σ correspond to lifetimes with increasing failure rate for σ < 1.0 and decreasing failure rate for σ > 1.0. For a given value of ξ_L, n* decreases as q_c increases; the decrease is significant for relatively large values of ξ_L. For a given value of q_c, n* increases as ξ_L increases, and does so rapidly when ξ_L is larger than 0.5; the increase is especially pronounced for the smaller value of q_c. n* of LH mode is less than that of HL mode.

5. Comparison with Constant Stress ALT Sampling Plans
We compare the proposed sampling plans with the constant stress ALT sampling plans considered by Bai et al.⁴ in terms of sample size. Let π* be the sample proportion allocated to ξ_L which minimizes the asymptotic variance of the test statistic, and let c_L and c_H be the failure proportions at ξ_L and ξ_H, respectively, in the constant stress ALT. We first obtain π* and the minimum sample size n* of the constant stress ALT sampling plans for selected values of c_L, c_H and ξ_L. Next, for the value of q_c calculated from the equation q_c = π* c_L + (1 − π*) c_H, we obtain n′ of the step-stress ALT sampling plans.
Table 1. Failure-censored step-stress ALT sampling plans for p_α and p_β chosen to match MIL-STD-105D.

p_α (k)          p_β      q_c  ξ_L | σ=0.8 LH: q_L* n* | σ=0.8 HL: q_H* n* | σ=2.0 LH: q_L* n* | σ=2.0 HL: q_H* n*
0.00041 (5.656)  0.01840  0.3  0.3 | 0.045   39 | 0.069  124 | 0.061   39 | 0.287   63
                          0.3  0.5 | 0.135   39 | 0.077  180 | 0.098   39 | 0.045   84
                          0.3  0.7 | 0.164   83 | 0.087  358 | 0.184   43 | 0.056  134
                          0.7  0.3 | 0.086   21 | 0.140   65 | 0.128   21 | 0.671   32
                          0.7  0.5 | 0.198   21 | 0.161   92 | 0.200   21 | 0.088   45
                          0.7  0.7 | 0.370   35 | 0.188  176 | 0.424   22 | 0.114   71
0.00284 (4.509)  0.03110  0.3  0.3 | 0.084   57 | 0.080  230 | 0.089   57 | 0.283  103
                          0.3  0.5 | 0.159   67 | 0.087  354 | 0.181   57 | 0.054  147
                          0.3  0.7 | 0.177  217 | 0.095  765 | 0.198   82 | 0.063  258
                          0.7  0.3 | 0.132   34 | 0.161  126 | 0.170   34 | 0.665   56
                          0.7  0.5 | 0.344   35 | 0.180  187 | 0.286   34 | 0.104   85
                          0.7  0.7 | 0.400   89 | 0.203  383 | 0.449   41 | 0.129  144
0.00654 (3.963)  0.04260  0.3  0.3 | 0.155   67 | 0.088  317 | 0.115   67 | 0.281  131
                          0.3  0.5 | 0.171   97 | 0.093  505 | 0.192   69 | 0.060  195
                          0.3  0.7 | 0.183  369 | 0.098 1143 | 0.205  122 | 0.068  364
                          0.7  0.3 | 0.173   43 | 0.174  178 | 0.201   43 | 0.661   74
                          0.7  0.5 | 0.368   48 | 0.191  271 | 0.398   43 | 0.114  118
                          0.7  0.7 | 0.413  148 | 0.210  578 | 0.462   60 | 0.137  208
0.01090 (3.607)  0.05350  0.3  0.3 | 0.168   75 | 0.093  389 | 0.145   73 | 0.278  151
                          0.3  0.5 | 0.179  128 | 0.097  637 | 0.200   80 | 0.065  234
                          0.3  0.7 | 0.187  527 | 0.101 1491 | 0.210  162 | 0.071  457
                          0.7  0.3 | 0.216   49 | 0.183  224 | 0.229   49 | 0.658   89
                          0.7  0.5 | 0.385   60 | 0.199  347 | 0.441   50 | 0.122  145
                          0.7  0.7 | 0.422  210 | 0.216  758 | 0.472   79 | 0.143  265
0.02090 (3.130)  0.07420  0.3  0.3 | 0.186   91 | 0.102  512 | 0.206   81 | 0.275  182
                          0.3  0.5 | 0.189  195 | 0.103  875 | 0.211  100 | 0.073  298
                          0.3  0.7 | 0.191  866 | 0.105 2150 | 0.216  248 | 0.076  623
                          0.7  0.3 | 0.373   59 | 0.198  303 | 0.283   59 | 0.653  113
                          0.7  0.5 | 0.408   86 | 0.210  485 | 0.460   61 | 0.134  193
                          0.7  0.7 | 0.433  342 | 0.223 1101 | 0.485  116 | 0.152  369
0.03190 (2.802)  0.09420  0.3  0.3 | 0.197  107 | 0.109  610 | 0.219   86 | 0.271  202
                          0.3  0.5 | 0.196  268 | 0.108 1079 | 0.219  119 | 0.079  346
                          0.3  0.7 | 0.195 1217 | 0.107 2750 | 0.220  338 | 0.079  765
                          0.7  0.3 | 0.398   66 | 0.209  371 | 0.342   65 | 0.648  132
                          0.7  0.5 | 0.423  112 | 0.219  607 | 0.475   71 | 0.144  232
                          0.7  0.7 | 0.441  478 | 0.228 1415 | 0.494  154 | 0.159  461
Conclusions
The design of the acceptance sampling plans based on failure-censored step-stress ALTs was considered for the Weibull distribution. Asymptotic variance is a dominating factor in determining the sample size required for a sampling plan to determine the acceptability
95 Table 2.Comparison of proposed sampling plans with constant stress ALT sampling plans CT
=0.8
LH mode cL
c,
0.3 0 5
CL
03 0.5 0.7 0.7 0.3 0.5 07 0.9 0.3 05 0.7 0.5 0.7 0.3 0.5 0.7 0.9 0.3 0.5 0.7 0.7 0 9 0 3
z*
0.724 0.697 0.646 0.715 0.730 0.686 0.727 0.759 0.711 0.696 0.671 0.623 0.656 0.686 0.654 0.679 0.5 0.657 0.7 0.611
4,
0.355 0.361 0.371 0.414 0.408 0.426 0.464 0.445 0.474 0.561 0.566 0.575 0.638 0 626 0.639 0.764 0.769 0.778
ti
77 114 300 63 93 259 53 81 243 63 87 206 52 71 176 52 70 159
70(0.909) 108 (0.947) 418 (1.393) 65 (1.032) 96 (1.032) 359(1.386) 62(1.170) 89 (1.099) 320 (1 317) 56(0.889) 72 (0.828) 259 (1.257) 52 (1,000) 66 (0.930) 232 (1.318) 47 (0.904) 55 (0.786) 188(1.182)
CT
=2.0
HL mode LH mode ti (n’In ’)
349(4.532) 558 (4.895) 1254(4.180) 316(5.016) 510 (5.484) 1121 (4.328) 293(5.528) 479 (5.914) 1029 (4235) 258(4.095) 403 (4462) 881 (4.277) 237 (4.558) 375 (5.282) 813 (4.619) 21 1 (4.058) 326 (4.657) 702(4.415)
69 (0.896) 72 (0.632) 134 (0.447) 65 (1.032) 68 (0.731) 119(0.459) 62(1.170) 65 (0.802) 109 (0.449) 56(0.889) 57 (0.655) 92 (0447) 52 (1,000) 53 (0.746) 85 (0483) 47 (0.904) 47 (0.671) 72 (0.453)
HL mode
138 (1.792) 211 (1.851) 396(1.320) 127(2.016) 197 (2 118) 361 (1.394) llS(2.226) 188 (2.321) 336 (1.383) 105 (1.667) 163 (1.874) 297 (1.442) 96 (1.846) 154(2.169) 279(1.585) 83 (1.596) 139 (1.986) 251 (1.579)
of a lot. The proposed sampling plans provide the minimum sample size by optimally designing a step-stress ALT so that the asymptotic variance is minimized. The effects of low stress and total failure proportion on sampling plans are also investigated. References
1. 2.
3.
4.
5. 6.
7.
A. A. Alhadeed and S-S. Yang, “Optimal simple step-stress plan for Khamis-Higgins model,” IEEE Trans. Reliab. 51 (2002), pp. 212-215. D. S. Bai, Y. R. Chun and J. G. Kim, “Failure-censored accelerated life-test sampling plans for Weibull distribution under expected test time constraint,” Reliability Engineering andsystem Safety SO (1995), pp. 61-68. D. S. Bai and M. S. Kim, “Optimum simple step-stress accelerated life tests for Weibull distribution and type I censoring,” Naval Research Logistics 40 (1 993), pp. 193-210. D. S. Bai, J. G. Kim and Y. R. Chun, “Design of failure-censored accelerated life-test sampling plans for lognormal and Weibull distributions,” Engineering Optimization 21 (1993), pp. 197-212. D. S. Bai, M. S. Kim and S. H. Lee, “Optimum simple step-stress accelerated life tests with censoring,” IEEE Trans. Reliab. 38 (1989), pp. 528-532. U. Balasooriya, “Failure-censored reliability sampling plans for the exponential distribution,” Journal of Statistical Communication and Simulation 52 (1 9 9 9 , pp. 337-349. U. Balasooriya and N. Balakrishnan, “Reliability sampling plans for lognormal distribution, based on progressively-censored samples,” IEEE Trans. Reliab. 49 (2000), pp. 199-203.
96 8.
9. 10.
11.
12. 13. 14.
15. 16. 17. 18. 19.
20. 21. 22.
U. Balasooriya and L. C. Saw, “Reliability sampling plans for the two-parameter exponential distribution under progressive censoring,” Journal of Applied Statistics 25 (1998), pp. 707-714. U. Balasooriya, L. C. Saw and V. Gadag, “Progressively censored reliability sampling plans for the Weibull distribution,” Technometrics 42 (2000), pp. 160-167. S. W. Chung and D. S. Bai, “Optimal designs of simple step-stress accelerated life tests for lognormal lifetime distributions,” International Journal of Reliability, Quality and Safety Engineering S (1998), pp. 315-336. S. W. Chung, W. S. Yang and W. Y. Yun, “Failure-censored step-stress acceptance sampling plans for lognormal distributions with known scale parameters,” Engineering Valuation and Cost Analysis 3 (2000), pp. 257-267. K. W. Fertig and N. R. Mann, “Life-test sampling plans for two-parameter Weibull populations,” Technometrics 22 (1980), pp. 165-177. H. K. Hsieh, “Accelerated life test sampling plans for exponential distributions,” Communications in Statistics-Simulation 23 (1994), pp. 27-41. S. Kockerlakota and N. Balakrishnan, “One- and two-sided sampling plans based on the exponential distribution,” Naval Research Logistics Quartely 33 (1986), pp. 5 13522. G. J. Lieberman and G. J. Resnikoff, “Sampling plans for inspection by variables,” Journal of the American Statistical Association SO (1959, pp. 457-516. W. Q. Meeker and L. A. Escobar, Statistical Methods for Reliabilily Data (John Wiley & Sons, New York, 1998). R. Miller and W. Nelson, “Optimum simple step-stress plans for accelerated life testing,” IEEE Trans. Reliab. R-32 (1983), pp. 59-65. W. Nelson, Accelerated Testing: Statistical Models, Test Plans, and Data Analyses (John Wiley & Sons, New York, 1990). M. J. D. Powell, “An efficient method for finding the minimum of a function of several variables without calculating derivatives,” The Computer Journal 7 (1964), pp. 155-162. H. Schneider, “Failure-censored variables-sampling plans for lognormal and Weibull distributions,” Technometrics 31 (1 989), pp. 199-206. W. E. Wallace, Jr., “Present practice and future plans for MIL-STD-781,” Naval Research Logistics Quarterly 32 (1985), pp. 2 1-26. B. J. Yum and S. H. Kim, “Development of life-test sampling plans for exponential distributions based on accelerated life testing,” Communications in Statistics-Theory andMethods 19 (1990), pp. 2735-2743.
AVAILABILITY FOR A REPAIRABLE SYSTEM WITH FINITE REPAIRS

LIRONG CUI AND JINLIN LI
School of Management & Economics, Beijing Institute of Technology, Beijing, P. R. China (100081)

In this paper we discuss a maintenance problem for a single-unit repairable system with finite repairs. The instantaneous availability is derived and compared with that for infinite repairs. Some numerical examples are presented to illustrate the results.
1. Introduction

A great deal of research effort has been devoted in the reliability and maintenance literature to the analysis of repairable systems (see references [1]-[6]). However, it seems that all of this work is based on the assumption of an infinite number of repairs, i.e., that for a repairable system infinitely many repairs can be performed, whatever repairs are taken. In real life, people sometimes have only finite resources, or the performance of a repairable system allows only a finite number of repairs. What is the instantaneous availability in this situation? In this paper we discuss this problem for a single-unit repairable system. First we make the following assumptions for the single-unit repairable system:
1. at most n repair actions can be taken, and each repair action restores the failed component to a new one;
2. the lifetime X of the system has distribution F(x) and the repair time Y has distribution G(y);
3. X and Y are independent of each other;
4. the system is new at time t = 0;
5. at any time, the system has two possible states: up and down.

The paper is organized as follows. In Section 2, the recursive equation for the instantaneous availability is established by probability arguments and solved using the Laplace transform technique. Some numerical examples are presented in Section 3. Finally, discussion and conclusions are given in Section 4.
2. Availability Formula

We introduce a stochastic process {X(t), t ≥ 0}, where

X(t) = 1 if at time t the system is in the up state, and X(t) = 0 if at time t the system is in the down state.
Let A_n(t) = P(X(t) = 1 | at time t = 0 the system is new, and repairs can be done at most n times); in the following we omit "repairs can be done at most n times" in order to compact the formulas without confusing the reader. We have

A_n(t) = P{X(t) = 1 | at time t = 0 the system is new}
= P{X₁ > t} + P{X₁ ≤ t < X₁ + Y₁, X(t) = 1 | the system is new at t = 0} + P{X₁ + Y₁ ≤ t, X(t) = 1 | the system is new at t = 0}
= P{X₁ > t} + 0 + P{X₁ + Y₁ ≤ t, X(t) = 1 | the system is new at t = 0},

where

P{X₁ + Y₁ ≤ t, X(t) = 1 | the system is new at t = 0}
= ∫₀ᵗ P{X(t) = 1 | the system is new at t = 0, X₁ + Y₁ = u} dP{X₁ + Y₁ ≤ u}
= ∫₀ᵗ P{X(t − u) = 1 | the system is new at t = 0, at most n − 1 repairs} dP{X₁ + Y₁ ≤ u}
= ∫₀ᵗ A_{n−1}(t − u) dP{X₁ + Y₁ ≤ u}
= A_{n−1}(t) * Q(t),
where Q(u) = P{X₁ + Y₁ ≤ u}. Thus we have finally A_n(t) = 1 − F(t) + A_{n−1}(t) * Q(t). Taking the Laplace transform of both sides of this equation, we get

A_n*(s) = (1 − F̂(s))/s + F̂(s) Ĝ(s) A*_{n−1}(s),

where

A_n*(s) = ∫₀^∞ e^{−st} A_n(t) dt,  F̂(s) = ∫₀^∞ e^{−st} dF(t),  Ĝ(s) = ∫₀^∞ e^{−st} dG(t).
We know that A₀*(s) = ∫₀^∞ e^{−st} A₀(t) dt = ∫₀^∞ e^{−st} [1 − F(t)] dt = (1 − F̂(s))/s; then we can get, in general,

A_n*(s) = [(1 − F̂(s))/s] {1 + Σ_{j=1}^{n} [F̂(s) Ĝ(s)]^j}.
By the Tauberian theorem, we have

lim_{t→∞} A_n(t) = lim_{s→0} s A_n*(s) = 0.

When n → ∞, we can get

A_∞*(s) = (1 − F̂(s)) / {s [1 − F̂(s) Ĝ(s)]},

which is the known result ([2], p. 267).
3. A Numerical Example

To illustrate our results, two examples are given in the following. First, if F(t) = 1 − exp(−λt), G(y) = 1 − exp(−μy) and n = 7, then using the Markov chain approach we would need 2 × 7 + 1 = 15 states. However, using the method above, we get

A₇*(s) = [1/(s + λ)] {1 + Σ_{j=1}^{7} [λμ/((s + λ)(s + μ))]^j}.
After taking the inverse Laplace transform, we get, if λ = μ,

A₇(t) = [1 + (λt)²/2! + (λt)⁴/4! + (λt)⁶/6! + (λt)⁸/8! + (λt)¹⁰/10! + (λt)¹²/12! + (λt)¹⁴/14!] e^{−λt}.

At the same time we have

A_∞*(s) = (1/2) [1/s + 1/(s + 2λ)], so that A_∞(t) = 1/2 + (1/2) e^{−2λt}.
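The closed form for A₇(t) is easy to verify by direct simulation of the up/down process with at most n = 7 repairs; λ = 0.6 follows the figures below, while t and the number of runs are arbitrary demo choices.

```python
import math
import random

# Check A_7(t) = e^{-lambda t} * sum_{k=0}^{7} (lambda t)^{2k}/(2k)!
# for F = G = Exp(lambda), against a simulation with at most n repairs.
lam, n, t, runs = 0.6, 7, 5.0, 200_000

def closed_form(t):
    return math.exp(-lam * t) * sum((lam * t) ** (2 * k) / math.factorial(2 * k)
                                    for k in range(n + 1))

def simulate_up(t):
    clock, repairs = 0.0, 0
    while True:
        clock += random.expovariate(lam)      # lifetime
        if clock > t:
            return 1                          # still up at time t
        if repairs == n:
            return 0                          # no repairs left: down forever
        clock += random.expovariate(lam)      # repair time
        if clock > t:
            return 0                          # under repair at time t
        repairs += 1

random.seed(5)
est = sum(simulate_up(t) for _ in range(runs)) / runs
print(f"A_7({t}): closed form = {closed_form(t):.4f}, simulation = {est:.4f}")
```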
Second, if G(y) = 1 − (1 + μy) exp(−μy), then Ĝ(s) = μ²/(s + μ)², and for n = 2, proceeding similarly to the above, we can obtain the following results when λ = μ:

A₂*(s) = [1/(s + λ)] {1 + λμ²/[(s + λ)(s + μ)²] + λ²μ⁴/[(s + λ)²(s + μ)⁴]},

and, letting n → ∞,

A_∞(t) = 1/3 + (2/3) e^{−3λt/2} cos(√3 λt/2).
. - -_ ---_ 6
Fig.l A7(t)and A , ( t ) ( / Z = p = 0 . 6 )
4.
8
10
12
14
16
18
20
Fig.2 A2(t)and A , ( t ) ( A = p = O . 6 )
Discussion and Conclusions
There is no doubt that the method used in this paper can be applied to other repairable systems, such as series and parallel systems. Although the Markov chain approach can sometimes be used under finite-repair assumptions, it needs more states to describe the model, and the repairable system must be Markovian; in general the number of states of the Markov chain is too large, and the calculation is therefore not easy. In fact, we can show that the instantaneous availability for finite repairs is nearly the same as that for infinite repairs over an initial time interval, and that this interval becomes longer as the allowed number of repairs increases. This characteristic can be observed in the two examples shown above and accords with intuition. As mentioned in Section 2, the steady-state availability for finite repairs is zero. For some complex repairable systems, a set of recursive equations may need to be established to obtain the instantaneous availability of the systems.
Acknowledgments

This research is supported by the National Science Foundation Committee under project 70371048. The authors also thank a referee for his/her useful comments and suggestions.
References
1. R. E. Barlow and F. Proschan, Mathematical Theory of Reliability, New York: Wiley, 1965.
2. Jinhua Cao and Kan Chen, Introduction to Reliability Mathematics, Beijing: Academic Publisher, 1986 (in Chinese).
3. L. R. Cui and M. Xie, Availability analysis of periodically inspected systems with random walk model, Journal of Applied Probability, 38(4), 860-871, 2001.
4. G. H. Sandler, System Reliability Engineering, Prentice Hall, 1963.
5. J. Sarkar and S. Sarkar, Availability of a periodically inspected system under perfect repair, Journal of Statistical Planning and Inference, 91, 77-90, 2000.
6. D. J. Sherwin, Steady-state series availability, IEEE Transactions on Reliability, 49(2), 131-132, 2000.
A NEW APPROACH FOR THE FUZZY RELIABILITY ANALYSIS IN CASE OF DISCRETE FUZZY VARIABLE*

YU GE DONG, ZHENG NI, CHUNXIAN WANG
School of Mechanical & Automotive Engineering, Hefei University of Technology, 193 Tunxi Rd., Hefei City, Anhui Province, 230009, P. R. China

For the first time, this paper puts forward an approach for fuzzy reliability analysis by means of the transformation from a discrete fuzzy strength to a discrete random strength, for the case in which strength is a discrete fuzzy variable and stress is a discrete random variable, and compares the result obtained with this approach to the result obtained with the earlier approach. The paper thus gives a new way to analyze fuzzy reliability; the approach also holds in the reliability analysis when stress is a discrete fuzzy variable and strength is a discrete random variable, or when both stress and strength are discrete fuzzy variables.

1. Introduction
There is a lot of uncertain information in the process of machine design. Random variables and fuzzy variables can be used to describe the uncertain information ['I. If the uncertain information is based on a mass of data, the random variables subjected to some given probability distributions can be used to express it, On the other hand, if the uncertain information is obtained from the experts' experience, the fuzzy variables can be used to express it. The different kinds of uncertain information will induce the random variables and the fuzzy variables appearing at the same time in the process of machine design. When we consider fuzzy information in design, it is difficult to analyze the reliability of machine parts because we usually must deal with the random information and the fuzzy information simultaneously. We can use the probability theory to deal with the random information and the fuzzy theory to deal with the fuzzy information, but so far there is no compatibility between these two methods in the fuzzy reliability analysis. According to the concept of the cut-set of fuzzy mathematics, the problem of fuzzy reliability could be solved by means of the transformation from the fuzzy set to the general set, namely, from the problem of fuzzy reliability to the problem of general reliability, and it was proved that general reliability was a special example of fuzzy reliability[2-i21.Especially, the above approach is applicable to the failure that happens gradually or the fuzziness that is difficult to judge the state of failure'*' l o ' ''I. When one of generalized stress (It is simply called stress later in the paper) and generalized strength
* This work is supported by the National Natural Science Foundation of China (50375042), and also supported by the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry.
(simply called strength later in the paper) is a random variable and the other is a fuzzy variable, the calculating process is cumbersome and the calculating quantity is large, although an ideal result can still be obtained [3-7, 10, 12]. To solve the problem that the calculating quantity is too large when analyzing the fuzzy reliability of machine parts, this paper discusses a new approach by means of the transformation from a fuzzy variable to a random variable. By using the new idea of analyzing fuzzy reliability put forward in the paper, we can make full use of the known and mature general reliability theory to calculate the indexes of fuzzy reliability, and reduce the difficulty of the fuzzy reliability analysis.
2. The Fuzzy Reliability Analysis in Case of Discrete Variable

Assume stress $s$ of a machine part is a discrete random variable whose possible values are $s_i$ $(i = 1, 2, \ldots, m)$, with $p(s_i)$ the corresponding probability, and fuzzy strength $\tilde{r}$ of the machine part is a discrete fuzzy variable whose possible values are $r_j$ $(j = 1, 2, \ldots, n)$, with $\mu_{\tilde{r}}(r_j)$ the corresponding membership. For any given threshold value $\lambda$, when the fuzzy strength takes the value $r_j$, whether its membership is greater than or equal to the threshold value $\lambda$ can be expressed by $\theta(\mu_{\tilde{r}}(r_j) - \lambda)$. The meaning of $\theta(\cdot)$ is

$$\theta(u) = \begin{cases} 1 & u \ge 0 \\ 0 & u < 0 \end{cases} \qquad (1)$$

Therefore, the number of values of the fuzzy strength $\tilde{r}$ whose membership is greater than or equal to $\lambda$ can be written down as

$$n_\lambda = \sum_{j=1}^{n} \theta(\mu_{\tilde{r}}(r_j) - \lambda) \qquad (2)$$

According to the concept of the cut-set of fuzzy mathematics, we can consider the probabilities at every possible value of the strength whose membership is greater than or equal to $\lambda$ to be the same. According to this idea, when the threshold value is $\lambda$, the probability at value $r_j$ of the strength is given by

$$P_\lambda(r_j) = \frac{\theta(\mu_{\tilde{r}}(r_j) - \lambda)}{\sum_{j=1}^{n} \theta(\mu_{\tilde{r}}(r_j) - \lambda)} \qquad (3)$$

Obviously, when $\mu_{\tilde{r}}(r_j) < \lambda$, $P_\lambda(r_j) = 0$; and when $\mu_{\tilde{r}}(r_j) \ge \lambda$, $P_\lambda(r_j)$ is as follows:

$$P_\lambda(r_j) = \frac{1}{n_\lambda} \qquad (4)$$
When the stress takes $s_i$ and the threshold value is $\lambda$, the reliability of the part is

$$R_\lambda(s_i) = P(s_i \le \tilde{r}_\lambda) = \sum_{j=1}^{n} P_\lambda(r_j)\,\theta(r_j - s_i) = \frac{\sum_{j=1}^{n} \theta(\mu_{\tilde{r}}(r_j) - \lambda)\,\theta(r_j - s_i)}{\sum_{j=1}^{n} \theta(\mu_{\tilde{r}}(r_j) - \lambda)} \qquad (5)$$
where $R_\lambda(s_i)$ means the reliability of the machine part when the threshold value is $\lambda$ and the stress takes $s_i$. For any possible value $s_i$ $(i = 1, 2, \ldots, m)$ of the stress, when the threshold value is $\lambda$, the reliability $R_\lambda(s)$ of the part is

$$R_\lambda(s) = P(s \le \tilde{r}_\lambda) = \sum_{i=1}^{m} p(s_i)\, \frac{\sum_{j=1}^{n} \theta(\mu_{\tilde{r}}(r_j) - \lambda)\,\theta(r_j - s_i)}{\sum_{j=1}^{n} \theta(\mu_{\tilde{r}}(r_j) - \lambda)} \qquad (6)$$
The reliability $R(s)$ of the part is the integral of the reliability $R_\lambda(s)$ on $[0, 1]$:

$$R(s) = P(s \le \tilde{r}) = \int_0^1 R_\lambda(s)\, d\lambda = \int_0^1 P(s \le \tilde{r}_\lambda)\, d\lambda \qquad (7)$$

Correspondingly, the failure probability $F_\lambda(s)$ of the part can be expressed by

$$F_\lambda(s) = P(s > \tilde{r}_\lambda) = \sum_{i=1}^{m} p(s_i)\, \frac{\sum_{j=1}^{n} \theta(\mu_{\tilde{r}}(r_j) - \lambda)\,\theta'(s_i - r_j)}{\sum_{j=1}^{n} \theta(\mu_{\tilde{r}}(r_j) - \lambda)} \qquad (8)$$

where $\theta'(\cdot)$ means

$$\theta'(u) = \begin{cases} 1 & u > 0 \\ 0 & u \le 0 \end{cases} \qquad (9)$$
And the failure probability $F(s)$ of the part can be obtained as

$$F(s) = P(s > \tilde{r}) = \int_0^1 P(s > \tilde{r}_\lambda)\, d\lambda = \sum_{i=1}^{m} p(s_i) \int_0^1 \frac{\sum_{j=1}^{n} \theta(\mu_{\tilde{r}}(r_j) - \lambda)\,\theta'(s_i - r_j)}{\sum_{j=1}^{n} \theta(\mu_{\tilde{r}}(r_j) - \lambda)}\, d\lambda \qquad (10)$$

According to Eqs. (7) and (10), it is not difficult to prove $F + R = 1$, because $\theta(r_j - s_i) + \theta'(s_i - r_j) = 1$.
3. The Formula of the Transformation from a Discrete Fuzzy Variable to a Discrete Random Variable
Assume the stress $s$ of a part is a discrete random variable, the probability of its possible value $s_i$ $(i = 1, 2, \ldots, m)$ being denoted by $p(s_i) = p(s = s_i)$, and the strength $r$ of the part is a discrete random variable, the probability of its possible value $r_j$ $(j = 1, 2, \ldots, n)$ being denoted by $p(r_j) = p(r = r_j)$. By means of the general reliability theory, the reliability $R$ of the part can be expressed by

$$R = P(s \le r) = \sum_{i=1}^{m} p(s_i) \sum_{j=1}^{n} p(r_j)\,\theta(r_j - s_i) \qquad (11)$$
As for the discrete fuzzy variable $\tilde{r}$ in Eq. (7), the membership at value $r_j$ of the fuzzy strength $\tilde{r}$ is $\mu_{\tilde{r}}(r_j)$ $(j = 1, 2, \ldots, n)$. To obtain the formula of the transformation from the discrete fuzzy strength to a discrete random strength, let the possible values be numbered so that the memberships are non-increasing, and set

$$\mu_{\tilde{r}}(r_1) \ge \mu_{\tilde{r}}(r_2) \ge \cdots \ge \mu_{\tilde{r}}(r_n), \qquad \mu_{\tilde{r}}(r_{n+1}) = 0 \qquad (12)$$

Therefore, Eq. (7) can be expressed by

$$R(s) = \sum_{k=1}^{n} \int_{\mu_{\tilde{r}}(r_{k+1})}^{\mu_{\tilde{r}}(r_k)} R_\lambda(s)\, d\lambda \qquad (13)$$

On the interval $[\mu_{\tilde{r}}(r_{k+1}), \mu_{\tilde{r}}(r_k)]$ $(k = n, \ldots, 2, 1)$, we have

$$n_\lambda = k, \qquad R_\lambda(s) = \sum_{i=1}^{m} p(s_i)\, \frac{1}{k} \sum_{j=1}^{k} \theta(r_j - s_i) \qquad (14)$$

So, Eq. (13) can be given by

$$R(s) = \sum_{i=1}^{m} p(s_i) \sum_{j=1}^{n} \theta(r_j - s_i) \sum_{k=j}^{n} \frac{\mu_{\tilde{r}}(r_k) - \mu_{\tilde{r}}(r_{k+1})}{k} \qquad (15)$$

After comparing Eq. (15) with Eq. (11), we know that if we transform the discrete fuzzy variable $\tilde{r}$ to a discrete random variable $\tilde{r}^T$, we have the following equation:

$$p(r_j) = \sum_{k=j}^{n} \frac{\mu_{\tilde{r}}(r_k) - \mu_{\tilde{r}}(r_{k+1})}{k} \qquad (16)$$
Equation (16) can be used to calculate the probability at possible value $r_j$ $(j = 1, 2, \ldots, n)$ of the discrete random variable $\tilde{r}^T$ when it is necessary to transform a discrete fuzzy variable to a discrete random variable in the fuzzy reliability analysis. Certainly, it is not difficult to prove that Eq. (16) can be derived from Eq. (10) too. The above discussion shows that we can use Eq. (16) to transform the membership of the discrete fuzzy variable $\tilde{r}$ to the probability of the discrete random variable $\tilde{r}^T$, and analyze the fuzzy reliability by using Eq. (11) in the above case. Equation (16) accords with the result of reference [14]. Therefore, we know that the transformation equation in reference [14] is rational, and the fuzzy reliability analysis using Eqs. (7) and (10) put forward in reference [2] is rational too. At the same time, we come to the conclusion that it is feasible, in theory and methodology, to analyze fuzzy reliability by using the general reliability theory based on the transformation from a fuzzy variable to a random variable.

4. An Example
Assume that stress is a discrete random variable, and
$$p(s = 9) = 1/8, \quad p(s = 10) = 1/2, \quad p(s = 11) = 3/8$$

Strength is a discrete fuzzy variable, and

$$\mu(r = 10) = 1, \quad \mu(r = 8) = \mu(r = 12) = 0.5, \quad \mu(r = 6) = \mu(r = 14) = 0.1$$

The above data were given in reference [2]. First, we calculate the reliability $R$ under the given data by using Eq. (7). For different threshold values $\lambda$ there are different reliabilities $R_\lambda$. From the above discussion, we obtain

$$R_1 = \frac{1}{8} \times \frac{3}{5} + \frac{1}{2} \times \frac{3}{5} + \frac{3}{8} \times \frac{2}{5} = \frac{21}{40} \quad (0 < \lambda \le 0.1)$$

$$R_2 = \frac{1}{8} \times \frac{2}{3} + \frac{1}{2} \times \frac{2}{3} + \frac{3}{8} \times \frac{1}{3} = \frac{13}{24} \quad (0.1 < \lambda \le 0.5)$$

$$R_3 = \frac{1}{8} \times 1 + \frac{1}{2} \times 1 + \frac{3}{8} \times 0 = \frac{5}{8} \quad (0.5 < \lambda \le 1)$$

The reliability $R$ of the part is obtained as

$$R = 0.1 \times R_1 + (0.5 - 0.1) \times R_2 + (1 - 0.5) \times R_3 = \frac{349}{600}$$

Now we calculate the reliability $R$ of the part based on the transformation from the
fuzzy variable $\tilde{r}$ to the random variable $\tilde{r}^T$.
The transformation is shown in Table 1.

Table 1. Transformation from fuzzy strength to discrete random strength

  r_j          10             12             8              14           6
  mu(r_j)      1              0.5            0.5            0.1          0.1
  P_j          P1 = 49/75     P2 = 23/150    P3 = 23/150    P4 = 1/50    P5 = 1/50

The probability values in Table 1 are calculated by using Eq. (16), namely
$$P_1 = (1 - 0.5) + \frac{1}{2}(0.5 - 0.5) + \frac{1}{3}(0.5 - 0.1) + \frac{1}{4}(0.1 - 0.1) + \frac{1}{5}(0.1 - 0) = \frac{49}{75}$$

$$P_2 = \frac{1}{3} \times 0.4 + \frac{1}{5} \times 0.1 = \frac{23}{150} = P_3$$

$$P_4 = P_5 = \frac{1}{5} \times 0.1 = \frac{1}{50}$$
From Eq. (11), the reliability $R$ of the part is given by

$$R = P(s \le \tilde{r}) = P(s \le \tilde{r}^T) = \frac{1}{8}(P_1 + P_2 + P_4) + \frac{1}{2}(P_1 + P_2 + P_4) + \frac{3}{8}(P_2 + P_4) = \frac{349}{600}$$
The results of the two calculating approaches are the same. The example shows that fuzzy reliability can be calculated entirely by using the general reliability theory after a discrete fuzzy variable is transformed to a discrete random variable, and an ideal result can be obtained. This gives a new idea for analyzing fuzzy reliability: using the general reliability theory to solve the problem of fuzzy reliability.
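As a cross-check of the two routes above, the following minimal sketch implements the transformation of Eq. (16) and the reliability of Eq. (11) for the example's data; the function names and data layout are our own choices, not the paper's.

```python
def fuzzy_to_random(values, memberships):
    """Transform a discrete fuzzy variable into a discrete random variable
    via Eq. (16); pairs are first sorted by non-increasing membership."""
    pairs = sorted(zip(values, memberships), key=lambda vm: -vm[1])
    n = len(pairs)
    pmf = {}
    for j in range(n):
        # p(r_j) = sum_{k=j}^{n} (mu_k - mu_{k+1}) / k, with mu_{n+1} = 0
        p = 0.0
        for k in range(j, n):
            mu_k = pairs[k][1]
            mu_next = pairs[k + 1][1] if k + 1 < n else 0.0
            p += (mu_k - mu_next) / (k + 1)
        pmf[pairs[j][0]] = p
    return pmf

def reliability(stress_pmf, strength_pmf):
    """Eq. (11): R = P(s <= r) for two discrete random variables."""
    return sum(ps * pr
               for s, ps in stress_pmf.items()
               for r, pr in strength_pmf.items() if s <= r)

stress = {9: 1/8, 10: 1/2, 11: 3/8}
strength_pmf = fuzzy_to_random([10, 8, 12, 6, 14], [1.0, 0.5, 0.5, 0.1, 0.1])

print(strength_pmf)                       # P(10) = 49/75, P(8) = P(12) = 23/150, ...
print(reliability(stress, strength_pmf))  # 0.581666... = 349/600
```

Running this reproduces the probabilities of Table 1 and the reliability 349/600 obtained above.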
5. Conclusions

When we analyze fuzzy reliability, if we can make good use of the general reliability theory to solve the problem of fuzzy reliability, a complex fuzzy reliability problem becomes a simple problem of general reliability. On the one hand, this simplifies the problem of fuzzy reliability; on the other hand, it makes the problem easy for people to understand and master. In this paper, the formula of the transformation from a discrete fuzzy strength to a discrete random strength is derived by means of the known approach of the fuzzy reliability analysis, and its feasibility is proven by the example in the paper. The approach discussed in the paper can also be applied to other cases, such as when strength is a discrete random variable and stress is a discrete fuzzy variable, or when both strength and stress are discrete fuzzy variables. We all know that it is difficult to deal with fuzzy information and random
information simultaneously in the process of fuzzy reliability analysis. This paper first puts forward a new approach for the fuzzy reliability analysis to reduce the complexity of the fuzzy reliability analysis, and makes an effective attempt toward the development of fuzzy reliability analysis. The success of the transformation from a discrete fuzzy variable to a discrete random variable in this paper provides a theoretical possibility of building a general model to analyze fuzzy reliability based on the transformation from a continuous fuzzy variable to a continuous random variable. We will discuss it in another paper.

References
1. Zhu, W.Y., "Mechanical Reliability Design," Shanghai: Shanghai Jiaotong University Press (1997).
2. Dong, Y.G., "Study on Method of Fuzzy Reliability Design as Discrete Fuzzy and Random Variables," Mechanical Design and Research, No. 1, pp. 17-18 (1999).
3. Dong, Y.G., "Study on Fuzzy Reliability Design Approach while Fuzzy Strength is in the Form of Constant Stress," Mechanical Science and Technology, Vol. 18, No. 3, pp. 380-382 (1999).
4. Dong, Y.G., "Study on Fuzzy Reliability Design Approach while Fuzzy Strength is in the Form of Random Stress," Machine Design, Vol. 16, No. 3, pp. 11-13 (1999).
5. Dong, Y.G. and Chen, X.Z., "A Study of Mechanical Reliability Design with Fuzzy Information and Its Application," Journal of Applied Sciences, Vol. 18, No. 2, pp. 148-152 (2000).
6. Dong, Y.G., Zhu, W.Y., Chen, X.Z., et al., "Study on a Calculating Method of Machine Fuzzy Reliability," Journal of Systems Engineering, Vol. 15, No. 1, pp. 7-12 (2000).
7. Dong, Y.G., "Fuzzy Reliability Design with Fuzzy Variable and Random Variable," Chinese Journal of Mechanical Engineering, Vol. 36, No. 6, pp. 25-29 (2000).
8. Dong, Y.G., Chen, X.Z., et al., "Reliability Simulation with Fuzzy Information," Mechanical Science and Technology, Vol. 19, No. 3, pp. 381-382 (2000).
9. Dong, Y.G., Chen, X.Z., et al., "Mechanical Reliability Design with Fuzzy and Random Stress," Mechanical Science and Technology, Vol. 19, No. 6, pp. 891-892 (2000).
10. Dong, Y.G., "Mechanical Design with Fuzzy Reliability," Beijing: Mechanical Industry Press (2001).
11. Dong, Y.G., Chen, X.Z., et al., "Fuzzy Reliability Theory Application in Reliability Analysis of Mechanism Movement," Journal of Applied Sciences, Vol. 20, No. 3, pp. 316-320 (2003).
12. Dong, Y.G., Chen, X.Z., et al., "Simulation of Fuzzy Reliability Indexes," KSME International Journal, Vol. 17, No. 4, pp. 492-500 (2003).
13. Yang, L.B. and Gao, Y.Y., "Fuzzy Mathematics Theory and Application," Guangzhou: South China University of Technology Press (1992).
14. Wonneberger, S., "Generalization of an Invertible Mapping between Probability and Possibility," Fuzzy Sets and Systems, Vol. 64, pp. 229-240 (1994).
15. Delgado, M. and Moral, S., "On the Concept of Possibility-Probability Consistency," Fuzzy Sets and Systems, Vol. 21, pp. 311-318 (1987).
FUZZY RELIABILITY ANALYSIS OF COMPLEX MECHANICAL SYSTEM*

YUGE DONG, ZHENG NI, CHUNXIAN WANG
School of Mechanical & Automotive Engineering, Hefei University of Technology, 193 Tunxi Rd., Hefei City, Anhui Province, 230009, P. R. China

In this paper, a model for simulating the fuzzy reliability of a mechanical system based on the intersection operator of fuzzy sets in fuzzy mathematics is given. Approaches for establishing membership functions of fuzzy safe events are discussed, and in particular the approach for obtaining the membership function of the fuzzy safe event from the known membership function of fuzzy strength, based on the fuzzy probability theory, is studied. The value of the membership function of the fuzzy safe event of the mechanical system can then be calculated from the membership functions obtained above, and the fuzzy reliability of a complex mechanical system can be analyzed. An example is given to show that the digital simulation approach is feasible.
1. Introduction

In the general reliability analysis theory, random uncertain information is often described by random variables. In the fuzzy reliability theory, fuzzy uncertain information is described by membership functions. There are two familiar cases. One is that a membership function is used to describe the gradual process of failure, or a fuzzy state of failure that is difficult to describe. The other is that fuzzy variables are used to describe the fuzzy and uncertain values of variables in engineering design. In the first case, the fuzzy reliability of a mechanical system can be analyzed by a digital simulation approach [1]. In the other case, how to analyze the fuzzy reliability of a mechanical system had not been solved. The reason is that the membership functions of fuzzy events cannot be obtained when the fuzzy reliability of machine parts is analyzed based on the former method. To analyze the fuzzy reliability of a mechanical system, an important precondition is that the membership function of each fuzzy safe event (or fuzzy failure event; in this paper only the fuzzy safe event is discussed) of machine parts should be known. Therefore, in reliability design it is important to know how to obtain the membership functions of fuzzy safe events or to derive them from the membership function of the known fuzzy information. Unfortunately, in most cases the membership function of the fuzzy information is not that of the fuzzy safe event. That is to say, the membership function of the fuzzy safe events usually cannot be obtained from direct judgment or experience. Thus, a method should be used to derive the membership function of the fuzzy safe event from the known fuzzy information.
* This work is supported by the National Natural Science Foundation of China (50375042), and also supported by the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry.
On the basis of the fuzzy reliability analysis of machine parts, a commonly used method to analyze the fuzzy reliability of a mechanical system is as follows. Firstly, the fuzzy reliability of each part is calculated, and then the general reliability theory, such as the general reliability theory of series systems and parallel systems, is used to analyze the fuzzy reliability of the mechanical system. However, this method cannot consider the relationship among the failure modes of each part, so the result of the fuzzy reliability analysis has a big error. If the fuzzy reliability of the system is analyzed by digital simulation, the relationship among the failure modes need not be treated explicitly, since it is captured by the sampling. Furthermore, although it is difficult to obtain the membership function of the fuzzy event that describes the safety of the mechanical system, the digital simulation approach can evade this problem.

2. Digital Simulation Modeling

A mechanical system consists of a lot of parts, and there is usually more than one failure mode in each part. Therefore, we can use each failure mode as a unit for analyzing the fuzzy reliability of the mechanical system. Assume that $n$ is the number of failure modes of the mechanical system; the event that a given failure mode does not happen, namely the fuzzy safe event, is denoted $\tilde{A}_i$, and its membership function is written down as $\mu_{\tilde{A}_i}(\tilde{x}_i)$, where $\tilde{x}_i$ is a random variable vector relevant to that failure mode. Usually, a mechanical system is a series system consisting of its failure modes. Therefore, in this paper only the digital simulation modeling of a mechanical series system is discussed. In a mechanical series system the fuzzy event $\tilde{A}$ of system safety is the intersection of the fuzzy safe events, because when any failure happens, it will cause the failure of the system. Therefore, we have
$$\tilde{A} = \tilde{A}_1 \cap \tilde{A}_2 \cap \cdots \cap \tilde{A}_n \qquad (1)$$
The membership function $\mu_{\tilde{A}}(X)$ of the fuzzy safe event $\tilde{A}$ of the mechanical system can be calculated according to the Zadeh fuzzy intersection operator:

$$\mu_{\tilde{A}}(X) = \mu_{\tilde{A}_1}(\tilde{x}_1) \wedge \mu_{\tilde{A}_2}(\tilde{x}_2) \wedge \cdots \wedge \mu_{\tilde{A}_n}(\tilde{x}_n) \qquad (2)$$

where $\wedge$ is the Zadeh fuzzy intersection operator, which means "minimum", and $X$ is the random variable vector of the mechanical system. Equation (2) shows that the least membership among the safe events is the membership of the series system. The steps of the digital simulation approach are then as follows.

(1) Obtain sample values $\hat{x}_i$ of the random variables based on their probability density functions.
(2) Calculate $\mu_{\tilde{A}_i}(\hat{x}_i)$ based on the membership function of each fuzzy safe event.
(3) Calculate the membership of the fuzzy safe event of the mechanical system according to Eq. (2), namely

$$\mu_{\tilde{A}}^{(k)} = \min_i \mu_{\tilde{A}_i}(\hat{x}_i^{(k)}) \qquad (3)$$
(4) Repeat the above steps; when $m$ is big enough, the reliability $R$ of the mechanical system can be estimated by the following equation:

$$R \approx \frac{1}{m} \sum_{k=1}^{m} \mu_{\tilde{A}}^{(k)} \qquad (4)$$
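A minimal Monte Carlo sketch of steps (1)-(4) follows; the two failure modes, their distributions, and the membership parameters are illustrative assumptions of ours, not data from the paper.

```python
import random

def mu_linear(x, a1, a2):
    """Linear membership of a fuzzy safe event (cf. Fig. 1 below)."""
    if x <= a1:
        return 1.0
    if x >= a2:
        return 0.0
    return (a2 - x) / (a2 - a1)

def simulate_reliability(sample_modes, m=100_000):
    """Estimate R by Eq. (4): the sample mean of the Zadeh minimum
    (Eq. (3)) of the memberships of all fuzzy safe events."""
    total = 0.0
    for _ in range(m):
        total += min(mode() for mode in sample_modes)   # Eq. (3)
    return total / m                                    # Eq. (4)

# Illustrative failure modes: each draws its own random variable(s) and
# returns the membership of the corresponding fuzzy safe event.
mode1 = lambda: mu_linear(random.gauss(2.3, 0.25), 2.5, 3.0)
mode2 = lambda: mu_linear(random.gauss(32.0, 3.0), 45.0, 50.0)

print(simulate_reliability([mode1, mode2]))
```

Correlated failure modes are handled automatically whenever they share the same sampled random variables, which is exactly what the closed-form series formulas cannot do.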
3. To Confirm the Membership Function of a Fuzzy Safe Event

When the fuzzy reliability of a mechanical system is analyzed, sometimes the membership function of a fuzzy failure event can be obtained from experience directly, and sometimes it needs to be derived from the membership function of the known fuzzy information.

3.1. The Membership Function of the Fuzzy Failure Event Is Known

In the general reliability design, the part is only in the state of safety or failure, so the transition of a part from safety to failure is a sudden change. But in the fuzzy reliability theory it is considered that there is a transitional process between the states of safety and failure. That is to say, there is no single partition point to differentiate the state between safety and failure. A commonly used membership function to describe the fuzzy characteristics of the transitional process is shown in Fig. 1. The fuzzy safe event in Fig. 1 can be written down as $\tilde{A} = \{s \le \tilde{a}\}$; the linear membership function in Fig. 1 has been proven effective and reliable [14] by a great deal of practice. Here $a_1$ is the numerical value of strength from the handbook, and $a_2 = (1.05\text{-}1.3)a_1$ can be determined by the expanding coefficient method.
Figure 1. Linear membership function
If the membership function in Fig. 1 is written down as $\mu_{\tilde{A}}(s)$, we have

$$\mu_{\tilde{A}}(s) = \begin{cases} 1 & s \le a_1 \\ \dfrac{a_2 - s}{a_2 - a_1} & a_1 < s \le a_2 \\ 0 & s > a_2 \end{cases} \qquad (5)$$

Based on the fuzzy probability theory, the equation for calculating the probability of the above fuzzy safe event is given by

$$P(\tilde{A}) = \int_{-\infty}^{+\infty} \mu_{\tilde{A}}(s)\, f(s)\, ds \qquad (6)$$
where $f(s)$ is the probability density function of the random stress $s$. Equation (6) can be used to analyze fuzzy reliability after the membership function of the fuzzy safe event is obtained. This is why it is important to obtain the membership function of the fuzzy safe event. Obviously, the probability of the fuzzy event is the mathematical expectation of the membership function of the fuzzy safe event, namely

$$P(\tilde{A}) = E[\mu_{\tilde{A}}(s)] \qquad (7)$$
This is the reason why we can adopt fuzzy theory to establish the digital simulation model for analyzing the fuzzy reliability of a mechanical system.

3.2. The Membership Function of the Fuzzy Safe Event Is Unknown

This section discusses the case when stress is a random variable and strength is a fuzzy variable. The fuzzy safe event can be written down as $\tilde{A} = \{s \le \tilde{r}\}$. For a given threshold level $\lambda$, the interval number $\tilde{r}_\lambda = [a_\lambda, b_\lambda]$ can be obtained based on the concept of the cut-set of fuzzy mathematics. Assuming that the strength is considered as a random variable of even (uniform) distribution on $[a_\lambda, b_\lambda]$, the probability density function is

$$g_\lambda(r) = \begin{cases} \dfrac{1}{b_\lambda - a_\lambda} & a_\lambda \le r \le b_\lambda \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$
Based on the general reliability theory, when the threshold level is $\lambda$, the probability $R_\lambda$ of the general event $\tilde{A}_\lambda = \{s \le \tilde{r}_\lambda\}$ is

$$R_\lambda = \int_{a_\lambda}^{b_\lambda} \frac{1}{b_\lambda - a_\lambda} \left[ \int_{-\infty}^{r} f(s)\, ds \right] dr = \int_{-\infty}^{+\infty} f(s)\, \max\!\left[\min\!\left(\frac{b_\lambda - s}{b_\lambda - a_\lambda},\, 1\right),\, 0\right] ds \qquad (9)$$
where $f(s)$ is again the probability density function of the random stress $s$, and

$$\max\!\left[\min\!\left(\frac{b_\lambda - s}{b_\lambda - a_\lambda},\, 1\right),\, 0\right] = \begin{cases} 1 & s \le a_\lambda \\ \dfrac{b_\lambda - s}{b_\lambda - a_\lambda} & a_\lambda < s \le b_\lambda \\ 0 & s > b_\lambda \end{cases} \qquad (10)$$
After the probability of the general event $\tilde{A}_\lambda = \{s \le \tilde{r}_\lambda\}$ is obtained, the reliability is

$$R = P(s \le \tilde{r}) = \int_0^1 R_\lambda\, d\lambda = \int_0^1 P(s \le \tilde{r}_\lambda)\, d\lambda \qquad (11)$$
Therefore, the probability of the fuzzy safe event, namely the reliability $R$, can be given by

$$R = \int_{-\infty}^{+\infty} f(s) \left[ \int_0^1 \max\!\left[\min\!\left(\frac{b_\lambda - s}{b_\lambda - a_\lambda},\, 1\right),\, 0\right] d\lambda \right] ds \qquad (12)$$
Comparing Eq. (12) with Eq. (6), the membership function of the fuzzy safe event $\tilde{A} = \{s \le \tilde{r}\}$ is obtained as

$$\mu_{\tilde{A}}(s) = \int_0^1 \max\!\left[\min\!\left(\frac{b_\lambda - s}{b_\lambda - a_\lambda},\, 1\right),\, 0\right] d\lambda \qquad (13)$$
After the membership function of the fuzzy safe event is obtained, the probability of the fuzzy safe event can be expressed uniformly by Eq. (6). Based on Eq. (7), the probability of the fuzzy safe event is the mathematical expectation of the membership function of the fuzzy safe event, namely

$$R = P(\tilde{A}) = E[\mu_{\tilde{A}}(s)] \qquad (14)$$
When the above simulation approach is used to estimate the reliability of a mechanical system, it is difficult to calculate the membership of each fuzzy safe event directly according to Eq. (13). Thus, to make it convenient to simulate the reliability of a mechanical system, it is necessary to derive concrete expressions for the fuzzy safe event when commonly used membership functions are taken to describe the fuzzy strength. Assume the fuzzy strength can be expressed as a fuzzy number, written down as $(m, \alpha, \beta)$, whose left and right reference functions are $L(\cdot)$ and $R(\cdot)$ respectively, so that the $\lambda$-cut is $[a_\lambda, b_\lambda]$ with $a_\lambda = m - \alpha L^{-1}(\lambda)$ and $b_\lambda = m + \beta R^{-1}(\lambda)$. For a given stress, if $s \le m$, based on Eq. (13), the expression of the membership function of the fuzzy safe event can be derived as

$$\mu_{\tilde{A}}(s) = 1 - L\!\left(\frac{m - s}{\alpha}\right) + \int_0^{L\left(\frac{m - s}{\alpha}\right)} \frac{b_\lambda - s}{b_\lambda - a_\lambda}\, d\lambda \qquad (15a)$$

If $s > m$, the expression is derived as

$$\mu_{\tilde{A}}(s) = \int_0^{R\left(\frac{s - m}{\beta}\right)} \frac{b_\lambda - s}{b_\lambda - a_\lambda}\, d\lambda \qquad (15b)$$
Equations (15a) and (15b) are the basic expressions of the membership function of the fuzzy safe event for a given $s$. More concrete expressions of the membership function of the fuzzy safe event are given in this paper for the cases where the membership function of fuzzy strength follows the commonly used linear distribution and normal distribution; for the membership functions of the linear distribution and the normal distribution, please refer to reference [14]. When the membership function of fuzzy strength is a commonly used linear distribution, the concrete expression of the membership function of the fuzzy safe event, Eq. (16), is obtained from Eqs. (15a) and (15b); when the fuzzy strength has a normal membership function, the corresponding expression, Eq. (17), is obtained in the same way.
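Where the closed forms are inconvenient, Eq. (13) can also be evaluated numerically. Below is a hedged sketch for a normal-type fuzzy strength $\mu(r) = \exp[-((r - m)/\sigma)^2]$; the midpoint-rule discretization and all names are our choices.

```python
import math

def mu_safe(s, m, sigma, steps=2000):
    """Numerically integrate Eq. (13) over the threshold level lambda for a
    normal-type fuzzy strength mu(r) = exp(-((r - m)/sigma)**2), whose
    lambda-cut is [m - d, m + d] with d = sigma*sqrt(-ln(lambda))."""
    total = 0.0
    for k in range(steps):
        lam = (k + 0.5) / steps                 # midpoint rule on (0, 1)
        d = sigma * math.sqrt(-math.log(lam))   # half-width of the lambda-cut
        a, b = m - d, m + d
        total += max(min((b - s) / (b - a), 1.0), 0.0)
    return total / steps

print(mu_safe(40.0, 45.0, 5.0))   # ~0.95: stress below the modal strength
print(mu_safe(50.0, 45.0, 5.0))   # ~0.05: the two sum to 1 by symmetry
```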
4. An Example
The diameter of a transmission axis is $d = 20$ mm. Assume the shear fatigue strength $\tau$ of the material is a fuzzy variable whose membership function is of the normal type, given by

$$\mu(\tau) = \exp\!\left[-\left(\frac{\tau - 45}{5}\right)^2\right] \quad (\tau \text{ in MPa})$$

The shear elasticity modulus is $G = 8 \times 10^4$ MPa, and the allowed torsion angle of the axis is $[\varphi] = 2.5\,^\circ/\mathrm{m}$. The torsion $T$ of the axis follows a normal distribution whose mean and standard deviation are $5 \times 10^4$ N·mm and $5 \times 10^3$ N·mm respectively. Let us estimate the reliability of the axis. The torsion angle ($^\circ$/m) of the axis is calculated as

$$\varphi = \frac{T}{G I_p} \times 10^3 \times \frac{180}{\pi}, \qquad I_p = \frac{\pi d^4}{32}$$

The torsion shear stress (MPa) of the axis is given by

$$s = \frac{16T}{\pi d^3}$$
The torsion of the axis is a random variable of normal distribution, so the torsion angle and the torsion shear stress of the axis are also random variables subject to a normal distribution. Considering the rigidity condition of the axis, the generalized stress is the calculated torsion angle and the generalized strength is the allowed torsion angle. When the calculated torsion angle is only a bit bigger than the allowed torsion angle, it is difficult to judge whether the rigidity condition is satisfied or not. Based on this fact, taking $a_1 = 2.5\,^\circ/\mathrm{m}$ and $a_2 = 1.2 a_1 = 3\,^\circ/\mathrm{m}$, we can obtain the membership function of the fuzzy safe event describing whether the rigidity is reliable, just as in Eq. (5) and Fig. 1. We use the simulation approach to estimate the reliability of rigidity based on Eqs. (5) and (7). A computer program was written to estimate it. When the number of simulations is 10000, we obtain the rigidity reliability $R = 0.9586$.

Considering the intensity condition of the axis, the generalized stress is the calculated torsion shear stress and the generalized strength is the shear fatigue strength of the material. From the membership function of the shear fatigue strength, we can obtain the membership function of the fuzzy safe event describing intensity safety according to Eq. (17). Based on Eqs. (17) and (7), a computer program was written to simulate the intensity reliability, and we obtain $R = 0.9986$ when the number of simulations is 10000.

The fuzzy reliability of the axis can be regarded as that of a series system consisting of the rigidity fuzzy reliability and the intensity fuzzy reliability. Because the rigidity and intensity fuzzy reliabilities in this example are correlated, and it is nearly impossible to obtain the membership function of the fuzzy safe event of the series system from the membership functions of the fuzzy safe events of rigidity and intensity, the above digital simulation must be used to estimate the synthesized fuzzy reliability of the axis. A computer program was written to estimate it. When the number of simulations is 10000, we obtain the reliability $R = 0.9585$. The simulation shows that the reliability of the axis mainly depends on the reliability of rigidity.
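A minimal sketch of the rigidity part of this example follows, under our reading of the source data (torsion mean $5 \times 10^4$ N·mm, standard deviation $5 \times 10^3$ N·mm); the torsion-angle formula is the standard one stated above, and the result should land near the printed value rather than match it exactly, since the estimate is Monte Carlo.

```python
import math, random

G, d = 8.0e4, 20.0                  # shear modulus (MPa), diameter (mm)
Ip = math.pi * d**4 / 32.0          # polar moment of inertia (mm^4)

def torsion_angle(T):
    """Torsion angle in deg/m for torque T in N*mm."""
    return T / (G * Ip) * 1000.0 * 180.0 / math.pi

def mu_rigidity(phi, a1=2.5, a2=3.0):
    """Linear membership of the rigidity fuzzy safe event (Eq. (5), Fig. 1)."""
    return max(min((a2 - phi) / (a2 - a1), 1.0), 0.0)

n = 100_000
R = sum(mu_rigidity(torsion_angle(random.gauss(5.0e4, 5.0e3)))
        for _ in range(n)) / n
print(R)    # about 0.96 -- close to the paper's 0.9586
```

The synthesized reliability is obtained the same way by taking, for each sampled torque, the minimum of this membership and the intensity membership computed from the same draw, per Eq. (3).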
5. Conclusions

In engineering design, random variables and fuzzy variables, which describe uncertain information from different points of view, commonly coexist. If the uncertain information is based on a mass of data, random variables should be used to describe it. If the uncertain information is based on experience and judgment, fuzzy variables should be used to describe it. In this paper, after building the membership function of the fuzzy safe event of each failure mode based on the fuzzy probability theory, and by applying the Zadeh intersection operator of fuzzy sets, we put forward a digital simulation method to estimate the reliability of a mechanical system that contains two different kinds of membership functions: the membership function describing a failure process that happens gradually, or a fuzziness that makes the state of failure difficult to judge, and the membership function of a fuzzy safe event derived from the known fuzzy information, such as fuzzy variables describing the fuzzy uncertainty of variable values in engineering design. This approach is an extension of the traditional digital simulation for system reliability. That is to say, when we simulate the fuzzy reliability of a mechanical system, we replace the characteristic function of each general event by the membership function of each fuzzy safe event. Of course, in the practical use of the above simulation approach, any safe event can be a general event or a fuzzy event. This paper mainly discusses the method of obtaining the membership function of the fuzzy safe event when stress is a random variable and strength is a fuzzy variable. It is easy to
know that membership functions of fuzzy failure events can also be obtained by similar analysis, and then the failure probability of the mechanical system can be simulated in this case. When stress is a fuzzy variable and strength is a random variable, the same approach given in the paper can be used to obtain the membership functions of fuzzy safe events or fuzzy failure events, and to estimate the reliability or the failure probability of the mechanical system.

References
1. Dong, Y.G. and Tao, G.S., "Random Simulation of Fuzzy Reliability for Mechanical System," Journal of Hefei University of Technology, Vol. 22, No. 5, pp. 40-43 (1999).
2. Dong, Y.G., "Study on Method of Fuzzy Reliability Design as Discrete Fuzzy and Random Variables," Mechanical Design and Research, No. 1, pp. 17-18 (1999).
3. Dong, Y.G., "Study on Fuzzy Reliability Design Approach while Fuzzy Strength is in the Form of Constant Stress," Mechanical Science and Technology, Vol. 18, No. 3, pp. 380-382 (1999).
4. Dong, Y.G., "Study on Fuzzy Reliability Design Approach while Fuzzy Strength is in the Form of Random Stress," Machine Design, Vol. 16, No. 3, pp. 11-13 (1999).
5. Dong, Y.G., Chen, X.Z., Zhu, W.Y., et al., "A Study of Mechanical Reliability Design with Fuzzy Information and Application," Journal of Applied Sciences, Vol. 18, No. 2, pp. 148-152 (2000).
6. Dong, Y.G., Chen, X.Z., et al., "Reliability Simulation with Fuzzy Information," Mechanical Science and Technology, Vol. 19, No. 3, pp. 381-382 (2000).
7. Dong, Y.G., Zhu, W.Y., et al., "Studying on a Calculating Method of Machine Fuzzy Reliability," Journal of Systems Engineering, Vol. 15, No. 1, pp. 7-12 (2000).
8. Dong, Y.G., "Fuzzy Reliability Design with Fuzzy Variable and Random Variable," Chinese Journal of Mechanical Engineering, Vol. 36, No. 6, pp. 25-29 (2000).
9. Dong, Y.G., Chen, X.Z., et al., "Mechanical Reliability Design with Fuzzy and Random Stress," Mechanical Science and Technology, Vol. 19, No. 6, pp. 891-892 (2000).
10. Dong, Y.G., "Mechanical Design with Fuzzy Reliability," Beijing: Mechanical Industry Press (2001).
11. Dong, Y.G., Chen, X.Z., et al., "Fuzzy Reliability Theory Application in Reliability Analysis of Mechanism Movement," Journal of Applied Sciences, Vol. 20, No. 3, pp. 316-320 (2003).
12. Dong, Y.G., Chen, X.Z., et al., "Simulation of Fuzzy Reliability Indexes," KSME International Journal, Vol. 17, No. 4, pp. 492-500 (2003).
13. Huang, K.Z. and Mao, S.P., "Random Method and Fuzzy Mathematics Application," Shanghai: Tongji University Press (1987).
14. Wang, C.H. and Song, L.T., "Methodology of Fuzzy Mathematics," Beijing: China Architecture & Building Press (1988).
15. Yang, L.B. and Gao, Y.Y., "Fuzzy Mathematics Theory and Application," Guangzhou: South China University of Technology Press (1992).
OPTIMAL RELEASE PROBLEM BASED ON THE NUMBER OF DEBUGGINGS WITH SOFTWARE SAFETY MODEL
TOMOAKI FUJIYOSHI
Course of Social Systems Engineering, Graduate School of Engineering, Tottori University, Minami 4-101, Koyama-cho, Tottori 680-8552, Japan

KOICHI TOKUNO AND SHIGERU YAMADA
Department of Social Systems Engineering, Faculty of Engineering, Tottori University, Minami 4-101, Koyama-cho, Tottori 680-8552, Japan
E-mail: {toku, yamada}@sse.tottori-u.ac.jp
In this paper, we discuss the optimal software release problem based on the number of debuggings, using the Markovian software safety model. First, we formulate the total expected software cost occurring in the testing phase and the warranty period after release and discuss the optimal software release problem based on the cost criterion. Furthermore, we treat the problem considering both cost and safety requirement simultaneously. We investigate the optimal software release policies for the problems. Finally, we present several numerical examples of the policies.
1. Introduction

One of the applications of software reliability growth modeling is the optimal software release problem.3,4,5 Most of the past studies have considered only developer-oriented measures such as software reliability and the mean time between software failures. However, software availability and safety6,7 are beginning to be noticed; these are quality characteristics from the user's viewpoint. In this paper, we discuss the optimal software release problem considering a safety requirement, using the Markovian software safety model.8 Section 2 states the software safety model used here. Section 3 formulates two optimal software release problems. One is the problem based on the cost criterion. Analyzing the cost factors occurring in the testing and the operation phases, we give the total expected software cost as a function of the number of debuggings performed in the testing phase. The other is the problem evaluating both software cost and software safety criteria simultaneously. Section 4 derives the optimal software release policies for finding the optimum number of debuggings and the mean of the optimum release time. Section 5 presents several numerical examples of the optimal software release policies and examines the relationship between the optimum number of debuggings to release and the evaluation criteria.
2. Software Safety Model
Figure 1. A sample state transition diagram of X(t)
The following assumptions are made for software safety modeling:

A1. When the software system is operating, it falls into an unsafe state at random, and the time interval spent in an unsafe state is also random.

A2. The debugging activity is performed when a software failure occurs. It is performed perfectly with the perfect debugging rate $a$ ($0 \le a \le 1$) and imperfectly with probability $b (= 1 - a)$.

A3. One fault is removed when the debugging activity is perfect, and the software reliability improves. The next time interval of software failure-occurrence when $n$ faults have already been corrected follows the exponential distribution with mean $1/\lambda_n$; $\lambda_n$ is a non-increasing function of $n$.

A4. If the debugging activity is perfect, the system also improves in safety with probability $p$ ($0 \le p \le 1$) and does not improve with probability $q (= 1 - p)$. When $n$ faults are corrected, the time to fall into an unsafe state follows the exponential distribution with mean $1/\theta_n$. We call $\theta_n$ the software unsafety rate; $\theta_n$ is a non-increasing function of $n$. On the other hand, imperfect debugging affects neither software safety improvement nor degradation. The time interval during which the system is in an unsafe state follows the exponential distribution with mean $1/\eta$.

The state space of the process $\{X(t), t \ge 0\}$ representing the state of the software system at time point $t$ is defined as follows:
$W_n$: the system is operating safely,
$U_n$: the system falls into an unsafe state,

where $n$ denotes the cumulative number of corrected faults. Figure 1 illustrates a sample state transition diagram of $X(t)$. The metric of software safety, $S(t; l)$, is obtained from this model as Eq. (1); it represents the probability that the system is operating safely at time point $t$, given that the $l$-th debugging was completed at time point $t = 0$. The reliability function and the expectation of the random variable $X_l$ $(l = 1, 2, \ldots)$, representing the time interval between the $(l-1)$-st and the $l$-th software failure-occurrences, are given by

$$R_l(t) = \Pr\{X_l > t\} = \sum_{i=0}^{l-1} \binom{l-1}{i} a^i b^{l-1-i} e^{-\lambda_i t} \qquad (2)$$

$$E[X_l] = \sum_{i=0}^{l-1} \binom{l-1}{i} a^i b^{l-1-i} \frac{1}{\lambda_i} \qquad (3)$$

respectively.
The expected number of debugging activities performed in the time interval $(0, t]$, given that the $l$-th debugging activity was completed at time point $t = 0$, is given by

$$M(t; l) = \sum_{i=0}^{l} \binom{l}{i} a^i b^{l-i} \sum_{n=i+1}^{\infty} G_{i,n}(t) \qquad (4)$$

where $G_{i,n}(t)$ is the distribution function of the random variable $S_{i,n}$ $(i \le n)$ representing the transition time of $\{X(t), t \ge 0\}$ from state $W_i$ to state $W_n$, and is given by

$$G_{i,n}(t) = \sum_{k=i}^{n-1} \Biggl[\, \prod_{\substack{j=i \\ j \ne k}}^{n-1} \frac{\lambda_j}{\lambda_j - \lambda_k} \Biggr] \left(1 - e^{-a\lambda_k t}\right) \qquad (5)$$
3. Optimal Software Release Problem
3.1. Cost-Optimal Software Release Problem

In order to formulate the optimal software release problem based on the cost criterion, we introduce the following cost parameters:

$c_1$: debugging cost per fault in the testing phase ($c_1 > 0$),
$c_2$: testing cost per unit time ($c_2 > 0$),
$c_3$: debugging cost per fault in the warranty period $T_w (> 0)$ of the operation phase ($c_3 > c_1 > 0$).
Suppose that we release a software system after $l$ debuggings are performed in the testing. Then the total expected software cost arising in the testing phase and the warranty period in the operation phase is given by

$$WC(l) = c_1 l + c_2 \sum_{j=1}^{l} E[X_j] + c_3 M(T_w; l) \qquad (6)$$
Therefore, the positive integer $l = l^*$ minimizing $WC(l)$ in Eq. (6) is the optimum number of debuggings. Then the mean of the optimum software release time $T^*$ can be calculated by

$$E[T^*] = \sum_{j=1}^{l^*} E[X_j] \qquad (7)$$
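As a hedged sketch of how Eq. (6) can be evaluated in practice: $E[X_j]$ follows Eq. (3), while $M(T_w; l)$ is estimated here by simulating the imperfect-debugging chain rather than evaluating Eqs. (4)-(5) in closed form. Parameter values mirror Table 1's caption with $a = 0.9$; the run count and all names are our choices, and the Monte Carlo estimate should land near, not exactly on, the table's figures.

```python
import math, random

c1, c2, c3 = 1.0, 0.1, 20.0
D, v, a, Tw = 0.1, 0.8, 0.9, 1500.0

lam = lambda n: D * v**n          # hazard rate after n corrected faults

def mean_interval(j):
    """E[X_j] per Eq. (3): binomial mixture over faults already corrected."""
    b = 1.0 - a
    return sum(math.comb(j - 1, i) * a**i * b**(j - 1 - i) / lam(i)
               for i in range(j))

def warranty_debuggings(l, runs=20_000):
    """Monte Carlo estimate of M(Tw; l): expected debuggings during the
    warranty period, starting from the state reached after l debuggings."""
    total = 0
    for _ in range(runs):
        i = sum(random.random() < a for _ in range(l))  # faults corrected so far
        t = random.expovariate(lam(i))
        while t <= Tw:
            total += 1
            if random.random() < a:                     # perfect debugging
                i += 1
            t += random.expovariate(lam(i))
    return total / runs

def total_cost(l):
    return (c1 * l + c2 * sum(mean_interval(j) for j in range(1, l + 1))
            + c3 * warranty_debuggings(l))

costs = {l: total_cost(l) for l in range(31)}
l_star = min(costs, key=costs.get)
print(l_star, costs[l_star])   # should land near Table 1's l* = 14, WC ~ 219
```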
3.2. Cost-Safety-Optimal Software Release Problem

We discuss the optimal software release problem evaluating both software cost and safety criteria simultaneously. Consider the decision policy on the optimum number of debuggings to be performed before release, minimizing $WC(l)$ in Eq. (6) subject to the condition that $S(t; l)$ in Eq. (1) satisfies the safety objective $S_0$. Then we can formulate the following optimal software release problem:

$$\begin{cases} \text{minimize } WC(l) \\ \text{subject to } \min_t S(t; l) \ge S_0 \quad (0 < S_0 < 1) \end{cases} \qquad (8)$$

We call this problem a cost-safety-optimal software release problem.
4. Derivation of Optimal Software Release Policy

4.1. Cost-Optimal Software Release Policy

The first and second difference equations of Eq. (6) are given by

$$Z(l) \equiv WC(l+1) - WC(l) = c_1 + c_2 E[X_{l+1}] + c_3 \left[ M(T_w; l+1) - M(T_w; l) \right] \qquad (9)$$

$$D(l) \equiv Z(l) - Z(l-1) \qquad (10)$$

respectively. $D(l) > 0$ holds for any positive integer $l$, since $E[X_l]$ and the first difference $M(T_w; l+1) - M(T_w; l)$ are monotonically increasing functions of the number of debuggings $l$. Thus, $Z(l)$ is a monotonically increasing function of $l$. Therefore, the behavior of $WC(l)$ depends on the sign of $Z(l)$. That is, if $Z(0) < 0$, then there exists a minimum integer $l = l_Z$ holding $Z(l) \ge 0$, and $l = l_Z$ satisfies both inequalities $WC(l+1) \ge WC(l)$ and $WC(l) < WC(l-1)$ simultaneously. Accordingly, $l^* = l_Z$ is the optimum number of debuggings. On the other hand, if $Z(0) \ge 0$, then $WC(l)$ is a monotonically increasing function of $l$ since $Z(l) \ge 0$ for any positive integer $l$, and the optimum number of debuggings is $l^* = 0$. We have the following policy for the cost-optimal software release problem:

[Optimal software release policy I]
Let $l_Z$ ($1 \le l_Z < \infty$) be the minimum integer $l$ holding $Z(l) \ge 0$.
(I.1) If $Z(0) < 0$, then the optimum number of debuggings to release is $l^* = l_Z$.
(I.2) If $Z(0) \ge 0$, then the optimum number of debuggings to release is $l^* = 0$.
Then the mean of the optimum release time $T^*$ is given by Eq. (7).
4.2. Cost-Safety-Optimal Software Release Policy

As to the behavior of $S(t; l)$ with respect to $t$, $S(t; l)$ takes its minimum value in the neighborhood of the time origin $t = 0$, say at $t_0$, and then increases monotonically on $[t_0, \infty)$. Furthermore, $S(\infty; l) = 1$ for any positive integer $l$ if $\theta_n$ is a decreasing function of $n$. On the other hand, as to the behavior of $S(t; l)$ with respect to $l$, $S(t; l)$ is an increasing function of $l$. Therefore, if $\min_t S(t; 0) < S_0$, then there exists a minimum integer $l = l_S$ satisfying $\min_t S(t; l) \ge S_0$, and the safety requirement is satisfied for $l \ge l_S$. Accordingly, the problem can be interpreted as one of finding the integer $l$ minimizing $WC(l)$ in the range $[l_S, \infty)$. Using the discussion in Sec. 4.1 as well, we have the following policy for the cost-safety-optimal software release problem:
[Optimal software release policy II]
Let $l_Z$ ($1 \le l_Z < \infty$) and $l_S$ ($1 \le l_S < \infty$) be the minimum positive integers $l$ holding $Z(l) \ge 0$ and $\min_t S(t; l) \ge S_0$, respectively, and suppose that $0 < S_0 < 1$.
(II.1) If $Z(0) < 0$ and $\min_t S(t; 0) \ge S_0$, the optimum number of debuggings to release is $l^* = l_Z$.
(II.2) If $Z(0) < 0$ and $\min_t S(t; 0) < S_0$, the optimum number of debuggings to release is $l^* = \max\{l_Z, l_S\}$.
(II.3) If $Z(0) \ge 0$ and $\min_t S(t; 0) \ge S_0$, the optimum number of debuggings to release is $l^* = 0$.
(II.4) If $Z(0) \ge 0$ and $\min_t S(t; 0) < S_0$, the optimum number of debuggings to release is $l^* = l_S$.
Then the mean of the optimal release time $T^*$ is again given by Eq. (7).
+
minimize WC(1) subject to minS(t; 1) t
2 0.99
(12)
123 These tell us that improving the debugging ability reduces the software cost rapidly and speeds up the optimum release time efficiently when the perfect debugging rate is low. Figure 2 displays the behaviors of WC(1)and min S ( t ;1) in the case of a = 0.9. t
From Fig. 2, the number of debuggings minimizing WC(1)is 12 hand, the minimum number of debuggings satisfying minS(t; I )
14, on the other 2 0.99 is 1s = 19.
=
t
Therefore, the optimum number of debuggings to release is 1* = max{lz,Is} = max{ 14,19} = 19 from (11.2). That is, we need to perform the additional 5 debuggings (or fault-detections) from the number of debuggings minimizing WC(I). Then we can estimate that the release time is extended by 1.9 times of the time minimizing WC(1)and that the software cost increases by 30%.
Table 1. Optimum number of debuggings l', E[T*], and WC(Z*) for [Optimal software release policy I]. (c1 = l.0,cz = 0.1,cs = 20,D = 0.1,~ = 0.8,Tw = 1500)
Table 2. Optimum number of debuggings 1 * , E[T*], and WC(1*) for [Optimal software release policy 111. ( ~ = 1 1 . 0 ,= ~O ~ . l , ~ 3= 20,D = 0 . 1 ,= ~ 0.8,= ~ 0.8,f = 0.8,E = 0.02, 11 = O.l,Tw = 1500,so = 0.99)
~~
a
1*
ElT*l
WC(1*)
0.9
14
717.1
218.53
0.8
16
874.4
254.83
0.7
18
984.3
304.09
0.6
21
1188.1
367.59
0.5
26
1630.2
461.62
0.4
33
2222.5
606.94
0.3
44
3079.7
855.89
0.2
67
5056.7
1369.51
0.1
137
11382.3
2955.09
a
1*
E[T*]
WC(1*)
0.9
19
2056.4
0.8
22
2710.3
283.10 356.14
0.7
25
3163.3
418.41
0.6
30
4347.5
551.51
0.5
36
5473.7
693.51
0.4
45
7189.1
910.26
0.3
60
10086.6
1277.39
0.2
91
16753.4
2091.66
0.1
182
35393.4
4430.61
6. Concluding Remarks

In this paper, we have discussed the optimal software release problems based on the number of debuggings, using a Markovian software safety model. We have formulated a software cost model considering the maintenance expenses in the testing phase and the warranty period after release. Furthermore, we have discussed the problem incorporating the safety requirement and the cost criterion simultaneously. We have investigated the optimal release policies to find the optimum number of debuggings and the mean of the optimum release time, and have presented several numerical examples of the policies. Establishing a method for deciding the cost parameters remains a topic for future study.
Figure 2. Behaviors of WC(l) and min_t S(t; l). (c1 = 1.0, c2 = 0.1, c3 = 20, D = 0.1, v = 0.8, p = 0.8, a = 0.9, f = 0.8, eta = 0.1, E = 0.02, Tw = 1500, S0 = 0.99)
Acknowledgements
This work was supported in part by the Saneyoshi Scholarship Foundation, Japan, and by Grants-in-Aid for Scientific Research (C)(2) and for Young Scientists (B) of the Ministry of Education, Culture, Sports, Science and Technology of Japan under Grant Nos. 15510129 and 16710114, respectively.
References
1. M. R. Lyu (ed.), Handbook of Software Reliability Engineering, IEEE Computer Society Press, Los Alamitos, CA (1996).
2. S. Yamada, Software reliability models, in Stochastic Models in Reliability and Maintenance, Springer-Verlag, Berlin, 253 (2002).
3. S. Yamada and S. Osaki, Optimal release policies with simultaneous cost and reliability requirements, European J. Operational Research 31, 46 (1987).
4. H. Pham and X. Zhang, A software cost model with warranty and risk cost, IEEE Trans. Comp. 48, 71 (1999).
5. M. Xie and B. Yang, A study of the effect of imperfect debugging on software development cost, IEEE Trans. Software Eng. 29, 471 (2003).
6. N. G. Leveson, Safeware: System Safety and Computers, Addison-Wesley Publishing, Massachusetts (1995).
7. K. Tokuno and S. Yamada, Stochastic software safety/reliability measurement and its application, Ann. Software Eng. 8, 123 (1999).
8. K. Tokuno and S. Yamada, Markovian software safety measurement based on the number of debuggings, Proc. 2nd Euro-Japanese Workshop on Stochastic Risk Modeling for Finance, Insurance, Production and Reliability, 494 (2002).
OPERATING ENVIRONMENT BASED MAINTENANCE AND SPARE PARTS PLANNING: A CASE STUDY

BEHZAD GHODRATI, UDAY KUMAR
Division of Operation and Maintenance Engineering, Luleå University of Technology, SE-971 85 Luleå, Sweden
Maintenance strategies and spare parts consumption are greatly influenced by the reliability characteristics of the system or components under consideration. Maintenance policies or spare parts plans made without considering the reliability characteristics are not optimal. Therefore, it is important to study and analyze the reliability characteristics before making decisions concerning spare parts and maintenance planning. It is known that the operating environmental conditions in which a system is to be operated, such as temperature, humidity, dust, load, voltage stress, etc., often have considerable influence on its reliability characteristics. These factors in fact affect the failure rate of a repairable system and of non-repairable components, but are usually ignored in reliability analysis. Thus the operating environment should be considered an important factor when making decisions about maintenance, spare parts planning, product support, and service delivery strategies. In general, new products are often used under conditions that were not anticipated. It is common to modify the predicted life length and reliability characteristics of a product by considering the environmental and other factors. The purpose here is to incorporate the effect of environmental factors such as temperature, humidity, dust, voltage stress, etc. in the reliability analysis.
1 Introduction

A product, generally, can be classified according to its characteristics into two groups: consumer products and industrial products. Industrial products are those that are used for producing goods or other products, either for consumers or for industry. An industrial product's customer may be a more professional customer and may set up special product criteria, specifications, requirements, etc. Mining equipment, car assembly machines, and crushers are examples of this type of product. The most critical support aspects for a system/machine at work are service delivery (logistics), maintenance/provision service, and finally repair and spare parts availability. Maintenance and spare parts support are two basic and critical issues that are linked to and involved in the availability of a system/product. This is because, often due to lack of technology and other compelling factors (such as economy, environmental situations, etc.), it is impossible in the design phase to design a product that will fulfill its function without the need for maintenance and spare parts. So the need for support becomes vital to enhance system effectiveness and prevent unexpected stoppages. A product's technical characteristics, such as product reliability and maintainability, are the most useful factors in defining the maintenance needs and especially the spare parts requirements of existing systems.
2 Required Spare Parts Calculation

Environmental conditions in which the equipment is to be operated, such as temperature, humidity, dust, etc., often have considerable influence on product reliability characteristics [10,12]. Thus, the operating environment should be seriously considered when dimensioning product support and service delivery performance strategies, as it will likely have a significant impact upon operational/maintenance cost and service quality. Some important examples of operating environment factors are:

1. Working environment: includes (a) climatic conditions such as temperature and humidity in which a system will be working, and (b) physical environment factors such as dust, smoke, fumes, corrosive agents, and the like.
2. User characteristics: such as operator skill, education, culture, and language.

3. Operating place or location: this factor refers to workplace settings, i.e. open or closed space, the industry that will use the product, and/or other area characteristics (such as mines) where the product will be used.

4. Level of application: the system may be intended to have a major/main purpose, a minor or auxiliary purpose, or even a standby purpose in an operational setup.

5. Working time and period of operation: planning may call for a product to be in continuous or part-time operation.
Often, the operating environment influences the degree of support needed to achieve an expected performance level [15]. Our literature survey shows that most research work on reliability considers the operation time as the only variable for estimating the reliability of a system [1,3,4,14,17]. Reliability models that use only time as an influencing factor may not be suitable for the reliability analysis, because there are other factors that may influence the reliability characteristics of a system during its working lifetime [9]. For example, if a system is used under different climatic conditions, then its reliability characteristics will not be equal in the different conditions (see [6] for details). The product hazard rate is, for the purpose of this article, defined as the rate at which a product will experience some form of failure during usual operation (for a more detailed definition see [10,12]). The hazard rate of a system, in general, is influenced not only by time, but also by the covariates under which it operates. It is the product of a baseline hazard rate $\lambda_0(t)$, dependent on time only, and another positive functional term, independent of time, that incorporates the effects of a number of covariates. The baseline hazard rate is assumed to be identical and equal to the total hazard rate when the covariates have no influence on the failure pattern. The covariates may influence the hazard rate so that the observed hazard rate is either greater (e.g. in the case of poor maintenance or incorrect spare parts) or smaller (e.g. a new improved or more reliable component) compared to the baseline hazard rate. The basic concept of this model is shown in Fig. 1 [11]. So, the actual hazard rate (failure rate), with the exponential form of the time-independent function that incorporates the effects of covariates, can be defined as [10]:

$$\lambda(t, z) = \lambda_0(t)\, \exp\!\left(\sum_{j=1}^{n} \alpha_j z_j\right) \qquad (1)$$
Figure 1. Effect of covariates on the hazard rate of the system
where $z_j$, $j = 1, 2, \ldots, n$, are the covariates associated with the system and $\alpha_j$, $j = 1, 2, \ldots, n$, are the unknown parameters of the model, defining the effects of each of the $n$ covariates. The above model is commonly known as the proportional hazards model (PHM) (see [10,11] for details). For several types of industrial parts and subassemblies, replacement of the entire unit upon failure of a part is more economical than repair. Bearings, gears, gaskets, seals, filters, hoses, and valves are some of the kinds of parts that are normally replaced rather than repaired. The Weibull reliability model is a fitting model for analyzing the life of mechanical systems (parts). On the other hand, in practice the exponential distribution is assumed for simplicity when analyzing times between/to successive failures of equipment or their components, even though the true distribution is a Weibull process. Meanwhile, the percentage error in the calculation of the mean time to failure when assuming an exponential model instead of the Weibull model is small (about 5%, which is negligible compared with the error in data collection, usually about 10-15%) [13]. So, we can ignore this error to gain the advantage of simplicity of analysis, and claim that the exponential model is applicable and probably the best model when the effects of covariates enter the calculation, especially when the parts being studied are non-repairable. Considering this point, the number of required spare parts can be obtained from:

$$1 - P = e^{-\lambda t} \sum_{x=0}^{N} \frac{(\lambda t)^x}{x!} \qquad (2)$$
where:

P = the probability of a shortage of spare parts [1 - P = the confidence level of spare parts availability],
λ = the failure rate of the part concerned (considering the effect of covariates),
t = the operation time of the system (life cycle),
N = the total number of required spare parts in period t.
If q is the number of units of a specific part that are in use at a given moment, then q enters the equation through the product λtq. The calculated N then represents the total required number of spare parts for the whole system during time t.
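A minimal sketch of Eq. (2): the smallest N reaching the required confidence is found by accumulating the Poisson sum term by term. The function and variable names are ours; the two calls reproduce the case-study figures derived in the next section.

```python
import math

def required_spares(rate, t, confidence=0.90, q=1):
    """Smallest N with exp(-m) * sum_{x=0}^{N} m**x / x! >= confidence,
    where m = rate * t * q (q identical parts in service), per Eq. (2)."""
    m = rate * t * q
    n, term = 0, math.exp(-m)
    cdf = term
    while cdf < confidence:
        n += 1
        term *= m / n        # next Poisson term, m**n * exp(-m) / n!
        cdf += term
    return n

# Case-study figures: one year of operation, two 5400-hour shifts.
print(required_spares(1.495e-4, 5400 * 2))   # 3 with the covariate-adjusted rate
print(required_spares(1.0e-4, 5400 * 2))     # 2 with the baseline rate
```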
3 Case Study

Wheel loaders are known for high usage flexibility and durability. A loader is designed to pick up, transport, and dump or spread material, and is commonly used in open pit mines. Most wheel loaders are articulated in the center to enable maneuverability. Power is delivered to all four wheels by a diesel engine that is normally mounted in the back of the unit. The mine plan and other factors, including cycle time, determine the appropriate loader. Since the loader works in a severe operating environment, working condition factors such as water, heat, dust, etc. influence the reliability of its components and consequently their life length. The roll bearings of the loader front wheels (Fig. 2) are non-repairable items that were studied for analyzing the effect of the operating environment on the mean time to failure and for forecasting the spare parts needs.
Figure 2. Loader and its wheel roll bearings
For studying the operating environment's influence on the failure rate of the loader wheel roll bearings, failure time data were obtained from an iron ore mine in Iran where there are 14 loaders in the haul fleet. The bucket capacity of the loaders is about 9 m³. The obtained data were analyzed to identify important factors that influence roll-bearing performance. In the next step, we codified them by numeric values: −1 for a bad situation, like the presence of water (moisture) or dust, and +1 for a good and desirable condition, like no overload or a skilled operator. These factors (covariates) are:

- Dust (and fine solid particles): these particles come from several sources, for example in the lubricant due to lack of cleanliness, wear of the metal seal cage, and deterioration of the bearing side seals. This covariate is denoted by DUST.
- Temperature: due to the climatic conditions of the mine, the variation range of temperature is remarkable. In addition, other factors, i.e. over-lubrication of the bearing and friction with the seal cage, increase the temperature, which is harmful for the bearing's seal and consequently its life length. TEMP denotes this covariate.
- Water (moisture): this factor causes several problems, i.e. pitting and corrosion of the bearing races and rolling elements, which increase the fatigue of the metal components; moreover, a water-and-oil emulsion does not provide a good lubricating film. This covariate is denoted by WATER.
- Over-load: in this case study, the front wheels are considered, where an overload in the loader bucket (25 ton) creates extra force (more than the allowed limit) on the wheel axle and consequently on the roll bearings. This causes a critical wear-out process. It is denoted by OVLOAD.
- Operator skill: in driving, loading and hauling, denoted by OPSK.
- Maintenance crew skill: this factor affects the quality of service, repair and maintenance and the condition of bearings after service, denoted by MCSK.

Table 1 shows the collected data for the times to failure (TTFs) of the wheel roll bearings, which were used for the failure rate estimation.

Table 1. Time to failure of wheel roll bearing (in hours)
The IID (independent and identically distributed) assumption for the data in the reliability model can be checked by plotting the cumulative times to failure (TTFs) vs. the cumulative failure number, as shown in Fig. 3. As seen in Fig. 3, the plotted points lie on a straight line. This implies that there is no trend in the failure data, so the assumption of an identical distribution for the TTFs under consideration is not contradicted. On the other hand, given the non-repairability of the part, we may regard the data as independent.
Figure 3. Graph for checking the trend in the TTF data
The SYSTAT software was used for estimating the corresponding values of α, which were tested for significance on the basis of p-values. In other words, 1 − p for a covariate indicates its importance for consideration in the model. Table 2 shows the estimates of α for the six covariates (S.E. indicates the standard error of the estimates).
The results show that the effects of two covariates (DUST, OVLOAD) are significant at the 10% p-value level. The best model for the hazard rate of the bearing according to the result of the PHM analysis is therefore:

$$h(t, z) = \lambda_0(t) \times \exp(-0.748\, \mathrm{OVLOAD} - 1.15\, \mathrm{DUST})$$

Table 2. Estimation of covariates
The effect of two covariates (OVLOAD and DUST) is significant at the 10% p-value level.
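A one-line numeric check of the fitted PHM, assuming the exponential baseline rate of 1×10⁻⁴ failures/h introduced in the sequel, with dust present (DUST = −1) and no overload (OVLOAD = +1) per the ±1 coding above:

```python
import math
print(1e-4 * math.exp(-1.15 * (-1) - 0.748 * (+1)))   # ~1.495e-4 failures/h
```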
To satisfy the proportionality assumption of the hazard rates, the plots of the logarithm of the estimated cumulative hazard rates against time should simply be shifted by an additive constant α, the estimate of the regression parameter of the covariate that is taken as a stratum [8]. Therefore, the plots should be approximately parallel and separated appropriately, corresponding to the different values of the regression parameter α, if the proportionality assumption is correct, as is seen in Fig. 4 [10].

Figure 4. Graphical test for the proportionality assumption of the hazard rates (survival plots stratified by OVLOAD and DUST)
While using the assumptions of the exponential reliability model for this item the mean time to failure (MTTF) is 10000 (manufacturer recommendation) hour and:
131
This failure rate is constant with this approach. In this case study the operators are not permitted to pickup overload, but dust is excessive in the place of operation (mine). Under this condition the actual hazard (failure) rate is estimated as: h(t,z)= le-4 x exp(-1.15x (-1) - 0.748x 1 )=1.495e-4 The expected number of failures (required replacement) in one year (two working shifts per day) when Ilh(f,z) = 6690 hours is considered to be the MTTF of bearings with a 90% confidence of availability is equal to:
0.90 <= exp(-1.495e-4 x 5400 x 2) x sum_{x=0}^{N} (1.495e-4 x 5400 x 2)^x / x!   =>   N = 3 (units/loader/year)
In the ideal circumstance, where no adverse covariates exist, the required number of replacements (spare bearings) is equal to:
0.90 <= exp(-1e-4 x 5400 x 2) x sum_{x=0}^{N} (1e-4 x 5400 x 2)^x / x!   =>   N = 2 (units/loader/year)
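The two spare-parts calculations above can be reproduced with a few lines of code. The following Python sketch (our own illustration, not part of the original paper) adjusts the baseline failure rate by the PHM covariate term and then finds the smallest N whose Poisson cumulative probability reaches the 90% level.

    import math

    def required_spares(rate_per_hour, hours, confidence=0.90):
        """Smallest N such that P(X <= N) >= confidence, X ~ Poisson(rate*hours)."""
        mean = rate_per_hour * hours
        term = math.exp(-mean)   # P(X = 0)
        cdf, n = term, 0
        while cdf < confidence:
            n += 1
            term *= mean / n     # P(X = n) from P(X = n - 1)
            cdf += term
        return n

    # Baseline hazard 1e-4 /h; covariates coded +1 (good) / -1 (bad).
    base = 1e-4
    h = base * math.exp(-1.15 * (-1) - 0.748 * (+1))  # dusty mine, no overloading
    print(round(h, 7))                    # ~1.495e-4 failures/hour
    print(required_spares(h, 5400 * 2))   # 3 units/loader/year
    print(required_spares(base, 5400 * 2))  # 2 in the ideal environment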
This difference between the number of replacements with and without considering the effect of the operating environment might not seem important for one loader in one year. However, it is important in spare parts inventory management when a loader fleet (of 14 loaders) in the mine is considered. Further, the company sometimes faces downtime of loaders due to a shortage of required spare parts, because the manufacturer/supplier recommended that too few spare parts be kept in stock. In most cases the manufacturer is not aware of the environmental factors and, as in this case, has not considered these issues in estimating the number of required spare parts. So, to avoid downtime due to the unavailability of spare parts, it is suggested that the mine company take the operating environment factors into consideration when estimating spare parts needs.

4. Conclusion
A reliability investigation of the system can be helpful in arriving at the optimum number of spare parts (for scheduled and preventive maintenance) needed for fulfillment of the tasks/goals. The operating environment of the system/machine has a key role in system output, and technical characteristics such as reliability are part of that criticality. Forecasting the required support/spare parts based on the technical characteristics and the system operating environment is an optimal way to prevent unplanned disruptions or stoppages. The operating environment should therefore be highlighted in spare parts forecasting, calculation, and inventory management.
References
1. Al-Bahli, A.M. (1993), "Spares Provisioning Based on Maximization of Availability per Cost Ratio", Computers in Engineering, Vol. 24 No. 1, pp. 81-90
2. Bendell, A., Wightman, D.W. and Walker, E.V. (1991), "Applying Proportional Hazards Modeling in Reliability", Reliability Engineering and System Safety, Vol. 34, pp. 35-53
3. Billinton, R. and Allan, R.N. (1983), "Reliability Evaluation of Engineering Systems: Concepts and Techniques", Boston, Pitman Books Limited
4. Blanks, H.S. (1998), "Reliability in Procurement and Use", England, John Wiley and Sons Ltd.
5. Cox, D.R. (1972a), "Regression Models and Life-Tables", Journal of the Royal Statistical Society, Vol. B34, pp. 187-220
6. Ghodrati, B. (2003), "Product Support and Spare Parts Planning Considering System Reliability and Operation Environment", Licentiate Thesis, Luleå University of Technology, Sweden
7. Ireson, W.G. and Coombs, C.F. (1988), "Handbook of Reliability Engineering and Management", New York, McGraw-Hill Book Company
8. Kalbfleisch, J.D. and Prentice, R.L. (1980), "The Statistical Analysis of Failure Time Data", New York, John Wiley and Sons Inc.
9. Kumar, D. (1993), "Reliability Analysis Considering Operating Conditions in a Mine", Licentiate Thesis, Luleå University of Technology, Luleå, Sweden
10. Kumar, D. and Klefsjö, B. (1994a), "Proportional Hazards Model - an Application to Power Supply Cables of Electric Mine Loaders", International Journal of Reliability, Quality and Safety Engineering, Vol. 1 No. 3, pp. 337-352
11. Kumar, D. and Klefsjö, B. (1994b), "Proportional Hazards Model: a Review", Reliability Engineering and System Safety, Vol. 44 No. 2, pp. 177-188
12. Kumar, D., Klefsjö, B. and Kumar, U. (1992), "Reliability Analysis of Power Transmission Cables of Electric Mine Loaders Using the Proportional Hazard Model", Reliability Engineering and System Safety, Vol. 37, pp. 217-222
13. Kumar, U. (1989), "Reliability Investigation for a Fleet of Load-Haul-Dump Machines in a Swedish Mine", Reliability Engineering and System Safety, Vol. 26, pp. 341-361
14. Kumar, U.D., Crocker, J., Knezevic, J. and El-Haram, M. (2000), "Reliability, Maintenance and Logistic Support: A Life Cycle Approach", Kluwer Academic Publishers, USA
15. Markeset, T. and Kumar, U. (2003), "Design and Development of Product Support & Maintenance Concepts for Industrial Systems", Journal of Quality in Maintenance Engineering, Vol. 9 No. 4, pp. 376-392
16. Rigdon, S.E. and Basu, A. (2000), "Statistical Methods for the Reliability of Repairable Systems", New York, John Wiley & Sons Inc.
17. Sheikh, A.K., Younas, M. and Raouf, A. (2000), "Reliability Based Spare Parts Forecasting and Procurement Strategies", in Ben-Daya, M., Duffuaa, S.O. and Raouf, A. (eds.), Maintenance, Modeling and Optimization, pp. 81-108, Kluwer Academic Publishers, Boston
18. SYSTAT 10.2 (2002), Statistics Software, Richmond, CA, USA
DISCRETE-TIME SPARE ORDERING POLICY WITH LEAD TIME AND DISCOUNTING
B.C. GIRI AND T. DOHI
Department of Information Engineering, Hiroshima University
1-4-1 Kagamiyama, Higashi-Hiroshima 739-8527, Japan
E-mail: dohi@rel.hiroshima-u.ac.jp
N. KAIO
Department of Economic Informatics, Hiroshima Shudo University
1-1-1 Ozukahigashi, Asaminami-ku, Hiroshima 731-3195, Japan
E-mail: kaio@shudo-u.ac.jp
The paper considers an order-replacement model for a single-unit system in a discrete time circumstance. The expected total discounted cost over an infinite time horizon is taken as the criterion of optimality, and the optimal ordering policy is obtained by minimizing it. The existence and uniqueness of the optimal ordering policy are verified under certain conditions. The model is further extended to include a negative ordering time.
1. Introduction

The impact of a catastrophic failure of any complex system may be large in terms of both safety and finance. Planned maintenance can reduce the risk of potential failure, promote a longer life for the system and decrease aggregate costs for repair and maintenance. In many practical situations, it may be difficult to perform repair of a failed unit, or the cost of repairing a failed unit may be exceptionally high. In such cases, disposing of the old unit after a critical age or at failure and replacing it by a new one may be a viable option. For example, consider electronic systems, where maintenance is usually performed through disposal and replacement of a subassembly or component, because electronic components are virtually impossible to repair in a cost-effective manner. During the past three decades, various maintenance policies have been proposed and studied in the literature. Most of these policies are based on the assumption that a spare unit is immediately available whenever a replacement is needed. However, in practice, this assumption may not be true in all circumstances. A significant time lag between the placement of an order for a spare and the instant of its supply/delivery can be observed for various reasons. So, the determination of the optimal ordering policy under such a time lag, or lead time, is quite appropriate. Osaki [1] was the first to consider a single-item order-replacement model with lead time.
After his seminal work, several researchers investigated order-replacement models with a positive lead time from various viewpoints. We refer the readers to the articles by Dohi and his co-authors [2, 3] for a comprehensive review and bibliography of the relevant literature. However, most of the order-replacement models developed in the literature are based on a continuous time framework. But there are many practical situations where system lifetime cannot be measured in calendar time. For example, consider the failure of a digital weapon system, where the number of rounds before failure is more important than the age at failure. In such a case, system lifetime should be regarded as a discrete random variable. Unfortunately, not enough research has been carried out on discrete-time order-replacement models. Kaio and Osaki [4] first considered an order-replacement model in a discrete time setting and obtained the optimal ordering policy minimizing the average cost rate in the steady state. Later, the same authors [5] analyzed the model taking account of minimal repair. In this article, we develop a discrete-time order-replacement model with discounting. Two deterministic ordering lead times are considered: one is for a regular (preventive) order and the other is for an expedited (emergency) order. We take the regular ordering time and the inventory time limit for the spare as two decision variables and characterize the optimal ordering policy under certain conditions. We also extend the model to include a negative (regular) ordering time.

2. Model Development

2.1. Nomenclature
N : discrete random variable denoting the failure time
p(n), 1/λ (> 0) : probability mass function and mean of N
P(n) : cumulative distribution function of N
L1 (> 0) : constant lead time for an expedited (emergency) order
L2 (> 0) : constant lead time for a regular (preventive) order
n0 (≥ 0) : regular ordering time for a spare (decision variable)
n1 (≥ 0) : inventory time limit for a spare (decision variable)
cs (> 0) : shortage cost per unit time
ch (> 0) : inventory holding cost per unit time
c1 (> 0) : fixed expedited ordering cost
c2 (> 0) : fixed regular ordering cost
P̄(·) : survivor function of P(·), i.e., P̄(·) = 1 − P(·)
b (0 < b < 1) : discount factor

2.2. Model Description

Consider a single-unit system which is subject to random failure; each failed unit is scrapped without repair and each spare is supplied by order with a deterministic lead time. For a discrete time index n = 0, 1, 2, ..., suppose that the original unit begins operating at time n = 0. If it does not fail before a predetermined time n0 ∈ [0, ∞), then a regular (preventive) order for a spare is made at time n0 and, after a lead time L2, the spare is delivered. If the original unit fails in the time interval [n0, n0+L2], then the delivered spare takes over its operation at time n0+L2. If the unit does not fail until time n = n0+L2, then the spare is put into inventory and the original unit is replaced by the spare when it fails or passes an inventory time limit n1 ∈ [0, ∞) after the spare's arrival, whichever occurs first. On the other hand, if the original unit fails before the time n0, an expedited (emergency) order is made immediately at the failure time and the spare takes over its operation when it is delivered after a lead time L1. In this case, the regular order is not made.
2.3. Assumptions

(i) The system failure is detected immediately.
(ii) The spare in inventory does not fail or deteriorate.
(iii) The time interval between two successive replacements or dispositions of the unit is one cycle.
(iv) The planning horizon is infinite.

2.4. Model Formulation
By a discrete probability argument, the expected discounted cost for holding inventory in one cycle is given by

H(n0, n1) = ch b^{n0+L2} [ Σ_{n=n0+L2}^{n0+L2+n1−1} { Σ_{i=0}^{n−n0−L2−1} b^i } p(n) + { Σ_{j=0}^{n1−1} b^j } Σ_{n=n0+L2+n1}^{∞} p(n) ].   (1)
Similarly, the expected discounted costs for shortage, S(n0), and ordering, O(n0), per cycle can be derived by the same argument. Hence, the expected total discounted cost per cycle is

Vb(n0, n1) = O(n0) + S(n0) + H(n0, n1).   (2)
Just after one cycle, the expected unit cost is discounted by the expected one-cycle discount factor, Φb(n0, n1). Thus, when a unit starts working at time n = 0, the expected total discounted cost over the time horizon [0, ∞) is

TCb(n0, n1) = Vb(n0, n1) Σ_{k=0}^{∞} {Φb(n0, n1)}^k = Vb(n0, n1) / (1 − Φb(n0, n1)).   (3)

Our objective is to find the optimal pair (n0*, n1*) minimizing TCb(n0, n1).
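To make the cost structure concrete, the Python sketch below evaluates TCb(n0, n1) by enumerating the failure time N directly from the model description in Section 2.2. This is our own illustration: the exact periods in which each cost is charged and discounted are assumptions, and the discrete Weibull lifetime anticipates the form used in Section 5.

    def tc(n0, n1, pmf, L1, L2, c1, c2, cs, ch, b, horizon=1500):
        """TC_b(n0, n1) = V / (1 - Phi): expected one-cycle discounted cost
        over the expected one-cycle discount factor, by enumeration of N."""
        V, Phi = 0.0, 0.0
        for n in range(1, horizon):
            p = pmf(n)
            if n < n0:                      # fail early: expedited order at n
                cost = c1 * b**n + cs * sum(b**t for t in range(n, n + L1))
                end = n + L1
            elif n < n0 + L2:               # fail while regular order pending
                cost = c2 * b**n0 + cs * sum(b**t for t in range(n, n0 + L2))
                end = n0 + L2
            else:                           # spare arrives first, may be stored
                end = min(n, n0 + L2 + n1)  # replace at failure or time limit
                cost = c2 * b**n0 + ch * sum(b**t for t in range(n0 + L2, end))
            V += p * cost
            Phi += p * b**end
        return V / (1.0 - Phi)

    def discrete_weibull(q, m):             # survivor function q**(n**m)
        return lambda n: q**((n - 1)**m) - q**(n**m)

    pmf = discrete_weibull(q=0.95, m=2)
    best = min((tc(n0, n1, pmf, 1, 5, 60, 20, 12, 5, 0.85), n0, n1)
               for n0 in range(0, 25) for n1 in (0, 10**9))  # n1 huge ~ infinity
    print(best)   # (cost, n0*, n1*) under these assumptions

Sweeping q reproduces the qualitative pattern of Table 1 below: the higher the failure rate, the earlier the regular order should be placed.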
3. Analysis
We first obtain the following result, which will be useful to reduce the above two-dimensional optimization problem to a simple one-dimensional one.
Theorem 1: For an arbitrary regular ordering time n0, if (1 − b) TCb(n0, n1) ≥ ch, then the optimal inventory time limit is infinite, i.e., n1* → ∞; otherwise n1* = 0.

Therefore, following Theorem 1, we need to obtain the optimal regular ordering time n0* under two extreme situations: (i) n1* → ∞ and (ii) n1* = 0.

3.1. The case of n1* → ∞
When n1 → ∞, the expected total discounted cost over an infinite time span is TCb(n0, ∞) = Vb(n0, ∞) / (1 − Φb(n0, ∞)), where Vb(n0, ∞) and Φb(n0, ∞) follow from equations (1) and (2). Define the numerator of the difference of TCb(n0, ∞) with respect to n0, divided by the factor (1 − b) b^{n0} P̄(n0), as Wb∞(n0). Then

Wb∞(n0) = r(n0) { c1 − c2 − cs (b^{L1} − b^{L2}) / (1 − b) } + (cs + ch) b^{L2} R(n0) − (1 − b) c2 − ch b^{L2},

where r(n) = p(n)/P̄(n) and R(n) = {P(n + L2) − P(n)} / P̄(n). In fact, r(n) is not the failure (hazard) rate of the discrete failure time distribution; in a discrete time setting, the failure rate can be defined as p(n)/P̄(n − 1), see Barlow et al. [6].
Lemma 1: The function R(n) is increasing (decreasing) provided that the function r(n) is increasing (decreasing).
To characterize the optimal regular ordering policy, we make the following plausible assumptions:

(A-1) c1 + cs Σ_{i=0}^{L1−1} b^i > c2 + cs Σ_{j=0}^{L2−1} b^j,
(A-2) cs + ch > (1 − b) TCb(n0, n1) for all n0, n1 ∈ [0, ∞).
Theorem 2: (1) Suppose that the function r(n) is strictly increasing under assumptions (A-1) and (A-2).
(i) If Wb∞(0) < 0 and Wb∞(∞) > 0, there exists at least one (at most two) optimal ordering time n0* (0 < n0* < ∞) satisfying Wb∞(n0* − 1) < 0 and Wb∞(n0*) ≥ 0. The upper and lower bounds of the corresponding minimum expected total discounted cost are as follows:

U∞(n0* − 1) < TCb(n0*, ∞) ≤ U∞(n0*),   (7)

where

U∞(n) = [ r(n) { c1 − c2 − cs (b^{L1} − b^{L2}) / (1 − b) } + (cs + ch) b^{L2} R(n) − (1 − b) c2 − ch b^{L2} ] / [ (1 − b) { b^{L2} R(n) − r(n) (b^{L1} − b^{L2}) } ].
(ii) If Wb∞(0) ≥ 0, the optimal ordering time is n0* = 0, which means that ordering the spare at the instant the original unit starts operation is optimal. (iii) If Wb∞(∞) ≤ 0, then n0* → ∞, which means that ordering the spare at the instant of failure of the original unit is optimal.
(2) Suppose that r(n) is a decreasing function of n under assumptions (A-1) and (A-2). Then the optimal regular ordering time is either n0* = 0 or n0* → ∞.
Proof: Using Lemma 1 and assumptions (A-1) and (A-2), it can be shown that ΔWb∞(n0) > 0 for all n0 ∈ [0, ∞). This implies that the function TCb(n0, ∞) is strictly convex in n0. Therefore, if Wb∞(0) < 0 and Wb∞(∞) > 0, then Wb∞(n0) changes sign from negative to positive only once. Hence, there exists at least one (at most two) optimal ordering time n0* (0 < n0* < ∞) satisfying Wb∞(n0* − 1) < 0 and Wb∞(n0*) ≥ 0. If Wb∞(0) ≥ 0 under assumptions (A-1) and (A-2), then TCb(n0, ∞) is an increasing function of n0, and the optimal ordering time is clearly n0* = 0. Conversely, if Wb∞(∞) ≤ 0, then TCb(n0, ∞) is a decreasing function of n0; therefore n0* → ∞. The proof of the second part of the theorem (when r(n) is decreasing) is trivial. Thus the proof of the theorem is completed.
3.2. The case of n1* = 0

When the original unit is replaced/exchanged by the spare as soon as the spare is received by a regular order, irrespective of the state of the original unit, the expected total discounted cost over an infinite time span is TCb(n0, 0) = Vb(n0, 0) / (1 − Φb(n0, 0)), where Vb(n0, 0) and Φb(n0, 0) follow from equations (1)-(3) with n1 = 0.
Define the function Wb0(n0) analogously to Wb∞(n0), as the numerator of the difference of TCb(n0, 0) with respect to n0, divided by the factor (1 − b) b^{n0} P̄(n0). Then the following theorem, corresponding to Theorem 2, can be obtained.
Theorem 3: (1) Suppose that r(n) is a strictly increasing function of n under assumption (A-1).

(i) If Wb0(0) < 0 and Wb0(∞) > 0, then there exists at least one (at most two) optimal ordering time n0* (0 < n0* < ∞) satisfying Wb0(n0* − 1) < 0 and Wb0(n0*) ≥ 0. The lower and upper bounds of the corresponding minimum expected total discounted cost are given below:

U0(n0* − 1) < TCb(n0*, 0) ≤ U0(n0*),   (11)

where

U0(n) = [ r(n) { c1 − c2 − cs (b^{L1} − b^{L2}) / (1 − b) } − (1 − b) c2 + cs b^{L2} R(n) ] / [ (1 − b) { b^{L2} R(n) − r(n) (b^{L1} − b^{L2}) } ].
(ii) If Wb0(0) ≥ 0, then the optimal ordering time is n0* = 0; otherwise, n0* → ∞.
(2) Suppose that r(n) is a decreasing function of n. Then, under assumption (A-1), the optimal regular ordering time is either n0* = 0 or n0* → ∞.
Proof: The proof is similar to that of Theorem 2, so it is omitted for brevity.

Remark 1: Based on the results given in Theorem 2 and Theorem 3, the optimal pair (n0*, n1*) can be determined by comparing TCb(n0*, 0) with TCb(n0*, ∞).

Remark 2: It can be shown that the long-run average cost in the steady state, C(n0, n1), is the limiting value of the annualized total discounted cost as the discount factor b → 1, i.e.,

C(n0, n1) = lim_{b→1} (1 − b) · TCb(n0, n1).   (13)
4. Model with Negative Ordering Time
A negative ordering time can be included in our model by choosing the initial time point before 0. Of course, the idea of a negative ordering time is applicable to the regular order only in the case of n1* = 0; there is no meaning in considering a negative ordering time in the case of n1* → ∞. See Kaio and Osaki [7] for the negative ordering policy in a continuous time setting.
Suppose that the first regular order for the spare is made at time n0, where −L2 ≤ n0 ≤ 0, and the original unit begins operation after a time interval −n0; the spare is delivered at time n0 + L2. In this case, suppose that p(n) = 0 for n = −L2, −L2 + 1, ..., −1, 0. Then the expected total discounted cost TCb(n0, 0) is still valid for n0 (−L2 ≤ n0 ≤ 0). We have, from equation (11), Wb0(−L2) = −c2 < 0. Hence, the optimal regular ordering time can be characterized as in the following theorem:
Theorem 4: (1) Suppose that r(n) is strictly increasing for non-negative n, under assumption (A-1).

(i) If Wb0(∞) > 0, then there exists at least one (at most two) optimal ordering time n0* (−L2 < n0* < ∞) satisfying Wb0(n0* − 1) < 0 and Wb0(n0*) ≥ 0. The upper and lower bounds of the corresponding minimum expected total discounted cost are identical to those obtained in equation (11).
(ii) If Wb0(∞) ≤ 0, then n0* → ∞.
(2) Suppose that r(n) is decreasing for non-negative n, under assumption (A-1).

(i) If Wb0(0) > 0, then there exists at least one (at most two) optimal negative ordering time M (−L2 < M < 0) satisfying Wb0(M − 1) < 0 and Wb0(M) ≥ 0. The bounds of the corresponding minimum expected total discounted cost are obtained as:
U0′(M − 1) < TCb(M, 0) ≤ U0′(M),   (14)

where

U0′(n) = [ cs b^{L2} P(n + L2) − (1 − b) c2 ] / [ (1 − b) b^{L2} ].
Furthermore, if Wb0(∞) ≥ 0, then n0* = M, and if Wb0(∞) < 0, then n0* = M or n0* → ∞.
(ii) If Wb0(0) ≤ 0, then n0* → ∞.
Proof: The theorem can be proved along the same lines as Theorem 2 or Theorem 3. Hence it is omitted for brevity.

5. Numerical Example

Suppose that the lifetime of the unit obeys the discrete Weibull distribution with survivor function P̄(n) = q^{n^m}, n = 0, 1, 2, ...,
and the parameter values of the model are: m = 2, c1 = 60, c2 = 20, L1 = 1, L2 = 5, cs = 12, ch = 5, b = 0.85. For the model with non-negative ordering time, Table 1 shows that when the failure rate is high it is desirable to place the regular order at an early stage and keep the spare in inventory; when the failure rate is low, a delay in regular ordering is desirable. Table 1 further shows that a negative regular ordering time can substantially reduce the expected total discounted cost in the steady state when the failure rate is high.
Table 1. Dependence of the optimal ordering policy on the failure parameter q.

q      n0*   TCb(n0*, 0)   TCb(n0*, ∞)   TCb (negative ordering time)
0.90   0     56.5725       56.8392       31.6833
0.91   1     55.2667       55.4834       31.3242
0.92   1     53.5651       53.6906       30.6942
0.93   1     51.6851       51.5089       29.9621
0.94   1     49.1873       49.4293       29.1891
0.95   2     46.4079       46.5312       28.3725
0.96   2     42.8000       42.9647       27.5094
0.97   2     38.6601       38.8361       26.5969
0.98   3     33.2735       33.2757       25.6317
0.99   6     24.2105       24.4587       24.2105
Note: in the negative-ordering case, the regular ordering time is n0 ∈ [−5, ∞).

6. Concluding Remarks

In this paper, we have developed a discrete-time order-replacement model for a single-unit system with ordering lead time and discounting. The expected total discounted cost has been taken as the criterion of optimality because it allows us to put emphasis on the present-term behavior of the system. Further, the behavior of the system in the distant future can be realized by the limiting value of the annualized total discounted cost as b tends to 1. As a future research effort, it would be interesting to develop a generalized discrete-time order-replacement model with discounting, randomized lead times and/or minimal repair.
References
1. S. Osaki, An ordering policy with lead time, Int. J. Sys. Sci. 8, 1091 (1977).
2. T. Dohi, N. Kaio and S. Osaki, On the optimal ordering policies in maintenance theory - survey and applications, Appl. Stoch. Models & Data Analysis 14, 309 (1998).
3. N. Kaio, T. Dohi and S. Osaki, Preventive maintenance models: replacement, repair, ordering and inspection, Handbook of Reliability, edited by H. Pham, Springer-Verlag, 349 (2003).
4. N. Kaio and S. Osaki, Discrete-time ordering policies, IEEE Trans. Rel. R-29 (5), 405 (1979).
5. N. Kaio and S. Osaki, Discrete time ordering policies with minimal repair, RAIRO Opns. Res. 14, 257 (1980).
6. R. E. Barlow, A. W. Marshall and F. Proschan, Properties of probability distributions with monotone hazard rate, Ann. Math. Stat. 34, 375 (1963).
7. N. Kaio and S. Osaki, Optimal ordering policies with two types of randomized lead times, Comp. Math. Appl. 19, 43 (1990).
SNEM: A NEW APPROACH TO EVALUATE TERMINAL PAIR RELIABILITY OF COMMUNICATION NETWORKS
N. K. GOYAL
Reliability Engineering Center, IIT Kharagpur, West Bengal, INDIA - 721302
R. B. MISRA
Reliability Engineering Center, IIT Kharagpur, West Bengal, INDIA - 721302
S. K. CHATURVEDI
Reliability Engineering Center, IIT Kharagpur, West Bengal, INDIA - 721302
This paper presents a method to evaluate the terminal pair reliability of complex communication networks using a new approach, SNEM. The algorithm SNEM (Source Node Exclusion Method) first reduces a given network to its non-series-parallel form and then breaks the network by excluding the source node from the rest of the network to obtain its sub-networks. The reliabilities of these sub-networks are computed thereafter by the recursive application of SNEM. The proposed approach is quite simple in application and applicable to general networks, i.e., directed and undirected. The method does not require any prior information such as path (or cut) sets of the network and their preprocessing, nor does it perform complex tests on networks to match a predefined criterion. The proposed method has been applied to a variety of networks and found to be quite simple, robust, and fast for terminal pair reliability evaluation of large and complex networks.
1. Introduction
In the design of communication networks, reliability has emerged as an important parameter because failures of these networks affect their users adversely. The interest in the area of reliability evaluation is quite evident from the numerous formulations of the network reliability problem and the articles that have been appearing in the literature for the past couple of decades, evolving various methodologies, techniques and algorithms to tackle these problems in an efficient and effective manner. The reason for this proliferation of interest appears to be a better understanding of the theoretical nature of network reliability problems on a variety of networks. Among the various formulations, the most familiar network reliability problem involves the computation of the probability that two specified communication centres in a network can communicate with each other. These formulations model the network by a probabilistic graph comprising n nodes (communication centres) and b branches (connecting links) and assume statistical independence of the failures of the connecting links. This problem is known as (s-t) reliability or two-terminal reliability in the reliability parlance. The survey of the literature indicates that the approaches used to compute two-terminal reliability include series reduction and parallel combination, event space enumeration, path (cut) set unionization, pivotal decomposition using
keystone components, and transformation techniques, etc. The whole spectrum of methodologies can therefore broadly be classified into two paradigms, viz.:
1. The paradigm in which one prerequisite is the enumeration of all possibilities through which the two specified nodes can communicate (or not communicate) with each other. Some of the recent developments in this area can be seen in [1].
2. The paradigm that does not require knowledge of path (or cut) sets in advance [2-11].
However, the common feature of both paradigms is that, whatever solution technique is used, it turns out to be highly recursive in nature. The approach presented in this paper is no exception. Misra [12] presented an efficient algorithm to compute the reliability of series-parallel (SP) networks and suggested that it could be used for a general network after shorting and opening of pivotal branches. However, the responsibility of selecting a pivotal branch lies solely with the analyst. Moreover, the method applies to networks that contain bi-directional elements. Park and Cho [7] suggested an algorithm based on the recursive application of pivotal decomposition using keystone components combined with series reduction and parallel combination. Nakazawa [13] recommended a technique for choosing the keystone element. Hansler [14] studied reliability in the case where all links are bi-directional. The application of the factoring theorem to reliability evaluation can be seen in [6-9]. The present paper deals with the computation of (s-t) reliability and presents an approach that belongs to the second paradigm, i.e., non-path (cut) set based techniques. Some of the salient features that make it different from the existing approaches are:
- There is no prerequisite to determine the path (or cut) sets and preprocess them into a certain order, as is required in the sum-of-disjoint-products (SDP) based approaches.
- Compared to SDP based approaches, it solves the problem with a smaller number of multiplications, thereby providing reduced round-off error.
- It does not require any overhead such as a search criterion for a key or pivotal element. This is important because any search or testing of conditions takes additional computational time.
- It does not burden computer memory, as the data to be processed is only the weighted connection matrix of the network under consideration, while the connection matrices of the sub-networks, extracted from this main matrix, are used in subsequent calculations to compute the overall two-terminal reliability. The connection matrix representation of a network is the simplest approach compared to other representations because of the computational ease and flexibility it provides. Moreover, the sub-matrices exist only until they have served their purpose. This enables the method to run even on a desktop PC for quite large networks.
Assumptions
1. A communication network is modeled by a probabilistic simple graph.
2. The nodes of the network are perfectly reliable.
3. The network and its branches have only two states: (i) working or (ii) failed.
4. The branch failures are statistically independent.

Notation
n - number of nodes in the network
b - number of branches in the network
Ni - ith node of the network, where 1 ≤ i ≤ n
Li - reliability of the ith link, where 1 ≤ i ≤ b
L̄i - unreliability of the ith link, where 1 ≤ i ≤ b
Ri,j - reliability of the network with Ni as the source node and Nj as the terminal node
∪ - union
∩ - intersection
[C] - weighted connection matrix; each entry C(i, j) in [C] denotes the reliability of the link connected from node Ni to Nj. If there is no link from node Ni to Nj, its value is 0.

Acronyms
NSP - Non-Series-Parallel
SDP - Sum of Disjoint Products
SNEM - the proposed method (Source Node Exclusion Method)
SP - Series-Parallel
SPR - Series-Parallel Reduction

2. The Proposed Approach
We consider a general network as shown in Fig. 1. Let the source node s be connected to the rest of the network via r links, viz., (L1, L2, ..., Lr), which terminate at various nodes, viz., N1, N2, ..., Nr, of the network. Then we can express the (s, t)-reliability of the network as:

Rs,t = (L1 ∩ R1,t) ∪ (L2 ∩ R2,t) ∪ ... ∪ (Lr ∩ Rr,t)   (1)

where Ri,t, i = 1, 2, ..., r, is the reliability between node Ni (as new source node) and t of the sub-network that results from omitting the source node s from the rest of the network.

Figure 1: A general communication network
Equation (1) contains two types of terms, viz., (i) sub-network reliability terms (R1,t, ..., Rr,t), which are dependent on each other, and (ii) link reliability terms (L1, ..., Lr), which are independent of each other since they are not part of the rest of the network. The links are also independent of the sub-network reliability terms. To explain these points, let us consider the first two terms of equation (1). These can be expanded into disjoint form as:

(L1 ∩ R1,t) ∪ (L2 ∩ R2,t) = L̄2 * (L1 * R1,t) + L2 * (R2,t ∪ (L1 * R1,t))   (2)
where Ri,t, for i = 1, 2, ..., r, is the reliability between the ith node Ni (as the new source node) and node t of the sub-network, which is the result of deleting the source node s from the rest of the network. Therefore, expanding equation (1) in its entirety, we obtain the expression given in equation (3):

Rs,t = (L1 ∩ R1,t) ∪ (L2 ∩ R2,t) ∪ ... ∪ (Lr ∩ Rr,t)
     = L̄r * { ... L̄2 * (L1 * R1,t) + L2 * (R2,t ∪ (L1 * R1,t)) ... } + Lr * (Rr,t ∪ (L1 * R1,t) ∪ ... ∪ (Lr−1 * Rr−1,t))   (3)
In other words, the proposed method SNEM (Source Node Exclusion Method) solves any network by recursively breaking it up and solving the resulting SP sub-networks to compute the overall (s, t)-reliability. We illustrate the application of the above formulation, and the steps to follow for computing the (s, t)-reliability, on a network of 5 nodes and 8 links as shown in Fig. 2.
Illustration: Consider the network of Fig. 2(a) with the source node and destination node as 1 and 5, respectively. Firstly, we obtain the number of sub-networks, which equals the number of links emerging from the source node, i.e., three in this case. The sub-network of Fig. 2(b) turns out to be a simple series-parallel network with the new source node N2. Its reliability is evaluated as R2,5' = 0.9587. The second sub-network, in Fig. 2(c), is obtained from 2(b) by connecting the link L1 to node 3 instead of node 1. However, this sub-network turns out to be a non-series-parallel network with source node 3. Therefore, we break it down further, as shown in Figs. 2(e) and 2(f), respectively, and its reliability is evaluated from the reliabilities of its sub-networks. These two sub-networks turn out to be simple SP networks with source nodes 2 and 4, respectively. The reliabilities of these series-parallel networks are evaluated as R2,5'' = 0.951 and R4,5'' = 0.9676, respectively.
Figure 2. (a) Network of 5 nodes and 8 links with its 5 associated sub-networks, Figs. 2(b) to 2(f)
Now we can express and compute the reliability of the sub-network of Fig. 2(c) as: R3,5' = (R2,5'' * (L1 + L̄1 * L4) * L̄5) + L5 * R4,5'' = 0.9598. The third and last sub-network is obtained from Fig. 2(b) by connecting node 4 with nodes 2 and 3 via links L1 and L2. The source node for this network is node 4. The reliability of the network of Fig. 2(d) is evaluated as R4,5' = 0.9675. Now we can express and compute the overall (s, t)-reliability from the reliabilities of its sub-networks (R2,5', R3,5', R4,5') as: R1,5 = ((R2,5' * L1) * L̄2 + L2 * R3,5') * (1 − L3) + L3 * R4,5' = 0.9657.
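As a cross-check on such hand computations, the exhaustive event-space method mentioned in the introduction is easy to script. The Python sketch below is our own illustration; the 5-node, 8-link edge list is hypothetical, since Fig. 2 itself is not reproduced here. It enumerates all 2^b link states and sums the probabilities of the states in which s and t communicate. It is exponential in b, which is precisely why methods such as SNEM matter, but for b = 8 it is instant.

    from itertools import product

    def connected(up_edges, nodes, s, t):
        """Depth-first search over the working undirected links."""
        adj = {v: [] for v in nodes}
        for u, v in up_edges:
            adj[u].append(v)
            adj[v].append(u)
        seen, stack = {s}, [s]
        while stack:
            u = stack.pop()
            if u == t:
                return True
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return False

    def two_terminal(nodes, links, s, t):
        """(s,t)-reliability by full state enumeration (event-space method)."""
        total = 0.0
        for state in product((1, 0), repeat=len(links)):
            prob, up = 1.0, []
            for (u, v, p), ok in zip(links, state):
                prob *= p if ok else 1.0 - p
                if ok:
                    up.append((u, v))
            if connected(up, nodes, s, t):
                total += prob
        return total

    # Hypothetical 5-node, 8-link network; every link reliability 0.9.
    nodes = [1, 2, 3, 4, 5]
    links = [(1, 2, .9), (1, 3, .9), (1, 4, .9), (2, 3, .9),
             (2, 5, .9), (3, 4, .9), (3, 5, .9), (4, 5, .9)]
    print(two_terminal(nodes, links, 1, 5))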
3. Implementation
We have implemented the proposed method by using the weighted connection matrix of the probabilistic graph. The flow chart of Fig. 3 implements the proposed method SNEM whereas Fig. 4 is meant for the procedure SPR.
Figure 3. Flow chart - SNEM (Source Node Exclusion Method)
4. Results and discussions
We have implemented the proposed method in Matlab and applied it to a variety of test networks taken by earlier researchers [1]. The results obtained by applying the proposed method are shown in Table 1. The computed values of the reliability match exactly the results obtained in [1]. Some points worth noting are:
Figure 4. Procedure - SPR (Series-Parallel Reductions) on the weighted connection matrix [C]
1. For the network of Fig. 2(a), the number of multiplications performed is 24, compared with the SDP based approach, which uses path sets and involves 58 multiplications.
2. Compared to algorithms based on the factoring theorem, this method generates fewer sub-graphs. For the network of Fig. 2(a), factoring would generate eight series-parallel sub-graphs corresponding to the 3 bi-directional links, whereas the proposed algorithm generates only four series-parallel sub-graphs. Besides, the present algorithm has the added advantage of being applicable to both directed and undirected networks, and it does not require finding pivotal elements in the graph, which is itself a non-trivial task.
It can be observed from Table 1 that the time taken for small to medium sized networks is less than a second. For the network of 21 nodes and 33 links (21n33l), the execution time is only about one minute, whereas it was a few hours as reported in [1]. Admittedly, comparing algorithms on the basis of execution time is not considered the correct approach unless the algorithms are implemented by the same unbiased programmer in the same environment, which accounts for factors such as the skill of the programmer, the hardware/software platform, processor speed, etc. Nevertheless, the difference in execution time is quite significant.
Table 1: Results of application of the method to various networks taken from [1]. (The method is implemented using MATLAB 6.5 on a PC with a Pentium-III processor, 128 MB RAM and the Windows 2000 operating system.)

No.  Network Name  Network Type  Reliability      CPU Time (in sec)
1    5n8l          Undirected    0.99763164000    0.01000
2    6n9l          Undirected    0.97718440500    0.02000
3    7n15l         Directed      0.99776822031    0.02000
4    11n21l        Directed      0.99710478197    0.02200
5    8n12l         Undirected    0.98406815298    0.02000
6    8n12l         Undirected    0.97511589739    0.02000
7    7n12l         Undirected    0.99749367367    0.02000
8    8n13l         Undirected    0.99621749334    0.03000
9    16n30l        Directed      0.99718629078    0.10000
10   9n14l         Undirected    0.97414541477    0.03000
11   13n22l        Undirected    0.98738996728    1.27200
12   20n30l        Undirected    0.99712039875    55.5500
13   21n33l        Undirected    0.97379846832    67.9000
5. Conclusion

SNEM is found to be very versatile and easily amenable to computer implementation. The proposed method applies well to both directed and undirected topologies of various networks. It computes the (s, t)-reliability of communication networks with low memory requirements and reduced round-off errors.
References
1. S. K. Chaturvedi and K. B. Misra, Int. Jour. of Quality and Reliability Management, 9, 237 (2002)
2. N. Deo and M. Medidi, IEEE Trans. on Rel., R-41, 201 (1992)
3. F. Beichelt and P. Tittman, Microelectronics and Reliability, 31, 869 (1991)
4. O. R. Theologou and J. G. Carlier, IEEE Trans. on Rel., 40, 210 (1991)
5. F. Beichelt and P. Tittman, Optimization, 20, 409 (1990)
6. L. B. Page and J. E. Perry, IEEE Trans. on Rel., R-38, 556 (1989)
7. K. S. Park and B. C. Cho, IEEE Trans. on Rel., 37, 50 (1988)
8. R. K. Wood, IEEE Trans. on Rel., R-35, 269 (1986)
9. R. K. Wood, Networks, 15, 173 (1985)
10. A. Satyanarayana and M. K. Chang, Networks, 13, 107 (1983)
11. J. P. Gadani and K. B. Misra, IEEE Trans. on Rel., R-31, 49 (1982)
12. K. B. Misra, IEEE Trans. on Rel., R-19, 146 (1970)
13. H. Nakazawa, IEEE Trans. on Rel., R-25, 77 (1976)
14. E. Hansler, IEEE Trans. on Comm., COM-20, 637 (1972)
ROBUST DESIGN FOR QUALITY-RELIABILITY VIA FUZZY PROBABILITY
HUIXIN GUO
Department of Mechanical Engineering, Hunan University of Arts and Science, Changde, Hunan, China
Email: [email protected]
The purpose of this paper is to describe a robust design method for quality reliability with a fuzzy design target. A fuzzy number is used to express the design target of a product's quality with fuzzy character, and its membership functions in common use are given. The robustness of a design solution is studied using the theory of fuzzy probability. Maximizing the fuzzy probability of the quality design is proposed as a robust design criterion, and its validity and practicability are validated. The controlling effect of this criterion on the mean and variance of the quality index is similar to the effect of Taguchi's signal-to-noise ratio. This criterion adapts well to different choices of the membership function, and can effectively control the direction from which the mean of the quality index approaches the nominal target. A robust optimal design example of a circuit is presented, and it shows that the solution is better than those of traditional robust design methods.
1 Introduction
Robust design of product quality aims at high quality and low cost based on consideration of product performance, quality and cost. The Taguchi method [1] is representative of traditional robust design methods. Taguchi's quality loss function developed a school of its own; the method is ingenious and easily popularized, and in practical application its effect is notable, but it requires monotony of the mathematical model [2]. Because the operations of square and logarithm are adopted in the calculation of the signal-to-noise ratio (S/N), this requirement cannot always be satisfied. The orthogonal table is adopted in the calculation of S/N, and the departure levels of controllable factors and noise factors are discrete. But the quality index of a real product is usually continuous, so only an approximate optimal solution is obtained; therefore the calculation formula of S/N defined by Taguchi needs to be improved. In the engineering model of robust design based on random optimum methods [3-4], the sensitivity index (SI) of the quality index was introduced, but Taguchi's quality loss function was used more commonly. As to the "nominal is better" design target, the robust design criterion is as follows:

(ȳ − y0)² + σy² → min   (1)

In this formula, y is the quality characteristic index of a product, ȳ is its mean, and σy² is its variance. Through Formula (1), robust design makes ȳ
* This work is supported by Scientific Research Grant (03A031) of Hunan Provincial Education Department of China.
approach the nominal target value y0 and makes the variance as small as possible. But after the operations of square and sum in Formula (1), it is difficult to predict its mathematical character: at the convergence point of the robust optimal design, the combinations of ȳ and σy² might be completely different. Moreover, through Formula (1) the direction from which ȳ approaches y0 cannot be effectively controlled, and the designer's wish to make ȳ approach y0 from the left or the right side cannot be realized. In this paper, a robust design criterion via fuzzy probability is put forward, and its validity, adaptability and practicality are studied. Compared with the traditional robust design criterion of Formula (1), it has particular advantages.
2 The fuzzy mathematical description of product quality index

2.1 The expression of the quality index as a fuzzy number

Figure 1. The membership functions of the fuzzy design target: (a) triangle distribution; (b) normal distribution; (c) echelon distribution; (d) the geometry meaning of the fuzzy robust design criterion
In Figure 1, y0 is the nominal target value of the quality index y and its tolerance is ±Δy. Thus "the design quality y is a high quality" is a fuzzy subset of the value region of y, expressed as a real fuzzy number denoted Ã, and the subjection degree of y to the ideal design is μ_Ã(y). Obviously, μ_Ã(y0) = 1. When |y − y0| ≥ Δy, μ_Ã(y) = 0. When |y − y0| < Δy, 0 < μ_Ã(y) ≤ 1, and μ_Ã(y) is a monotone function of |y − y0|. So the design quality index can be expressed as a real fuzzy number Ã, named the fuzzy design target.
2.2 The membership functions of the fuzzy design target in common use

2.2.1 The triangle fuzzy distribution

The triangle distribution can be linear or nonlinear, as shown in Figure 1(a). Its membership function is as follows:

μ_Ã(y) = [1 − (y0 − y)/Δy]^{k1}   (y0 − Δy ≤ y < y0)
μ_Ã(y) = [1 − (y − y0)/Δy]^{k2}   (y0 ≤ y ≤ y0 + Δy)
μ_Ã(y) = 0                        (otherwise)   (2)
When k1 = k2 = 1, it is a linear distribution; when k1 (or k2) > 1, the fuzzy distribution curve protrudes downwards; when k1 (or k2) < 1, it protrudes upwards.
2.2.2 The normal membership function

The membership function of the normal distribution is as follows:

μ_Ã(y) = exp{ −(y − y0)² / k }   (3)
Its distribution curve is shown in Figure 1(b). According to the 6σ criterion of the normal distribution, we can choose k = 2(Δy/3)² = (2/9)Δy². When |y − y0| ≥ Δy, μ_Ã(y) ≠ 0, so its application is limited to some extent. In practical application, this characteristic must be noted (see the later analysis).
2.2.3 The echelon fuzzy distribution function

As shown in Figure 1(c), when |y − y0| > Δy, μ_Ã(y) = 0; when y ∈ [a, b], the design quality is satisfying and the product can be considered a high quality product, so μ_Ã(y) = 1. The distribution in the fuzzy transition region can adopt a linear function, and then:

μ_Ã(y) = 1                          (a ≤ y ≤ b)
μ_Ã(y) = (y − y0 + Δy)/Δy           (y0 − Δy ≤ y < a)
μ_Ã(y) = (y0 + Δy − y)/Δy           (b < y ≤ y0 + Δy)
μ_Ã(y) = 0                          (otherwise)   (4)
3 Robust design criterion based on fuzzy probability

3.1 The basic concepts and definitions
In robust design, suppose that the random design variables are x^T = (x1, x2, ..., xn) and the random noise factors are z^T = (z1, z2, ..., zm); thus the product quality characteristic index y = y(x, z) is also a random function. Suppose that the combined distribution density function of y, defined by the distributions of x and z, is f(y). Because the design target is described by a real fuzzy number Ã, "the design scheme y is a high quality design" is a fuzzy event in the value region of y, and its fuzzy distribution has the same membership function as Ã. P(Ã) is the fuzzy probability of this fuzzy event. The bigger the fuzzy probability, the better the degree of y approaching y0 under the influence of disturbing factors, the higher the high-quality rate of the product designed, the stronger its anti-jamming capacity, and the better the robustness of the design scheme. Thus the robust design criterion can be defined as:

P(Ã) = ∫_{−∞}^{+∞} μ_Ã(y) f(y) dy → max   (5)
3.2
me validity anaIysis ofthe fuzzy robust d e s a cn.temon
Based on the Formula (5) and with every restriction conditions considered, a robust optimum design model - could be constructed. In the iterative computation of searching optimum, the mean y and variance 0; of ychanged with the iterative points, so AJ$ was also changed in iteration. In programming, there were such methods for generating the distributing density &d as approximate Taylor expansion, Monte Carlo simulation, maximum entropy method and best square approaching method and so on[d. In order to testify the validity of the fuzzy robust design criterion, suppose that - and &I! were normal, and let t = y - y . If y = y o ,there was:
~LAY)
aP(* -=-
ay
1 &&t.e
[(Ycp-y)’ +m
k
’
I*
‘3’ dt=O
Then P ( 9 had a maximum value, so the Formula (5) could effectively make - the mean on y approach to JD. In the same reason, it could be testified that when y = y o , there was:
Namely, - P ( 9 was a monotony reducing function ofa,, so while the Formula (5) made y -+ y o ,it could effectively make 0; optimally reduced. It was not difficult to testify that, as to the common membership functions recommended in the Figure l(or other analogous membership function), if we make the design target JTI be the core o f p J y ) , the Formula (5) had the same effects. When we use the Formula (5) as a robust design criterion, its effect was similar to the effect of Taguchi’s quality loss function or S/N, and its validity accorded completely with the robust design requirements. Moreover, it must be pointed out that in related testifying, no special requirements on the distributing parameter kof the normal membership function were put forward, so it, made the choice of the membership functions more flexible.
3.3 The geometry explanation of the fuzzy robust design criterion
In order to visually explain the geometric meaning of the fuzzy robust design criterion of Formula (5), let μ_Ã(y) be the echelon distribution, as shown in Figure 1(d). According to the principle that the membership function μ_Ã(y) can be understood as a weight on f(y), P(Ã) is understood as the area enclosed by the y axis and f(y), but cut down by using μ_Ã(y) as the weight coefficient. The more the "bell shape" of f(y) is embedded in the high quality region [a, b] (where μ_Ã(y) = 1), the smaller the cut-down of the area, the bigger the area "inset" in the high quality region, the better the robustness of the product design quality, and thus the stronger the anti-jamming capacity. The process by which the "bell shape" of f(y) shrinks into the region where μ_Ã(y) = 1 is then the process of ȳ → y0 and σy² → min.
4 Practicability analysis of the fuzzy robust design criterion
In design practice, to define μ_Ã(y) accurately is not easy. When Formula (5) is used as a robust design criterion, the choice of μ_Ã(y) directly affects its validity and its practical value. If, while Formula (5) makes ȳ → y0, the effect of σy² → min is insensitive to the definition of μ_Ã(y), then the robustness of the robust design criterion itself is good, and its adaptability and practicability are good. We now illustrate this with a robust design example.
Figure 2. The circuit sketch
Design example 1: Figure 2 shows a circuit [4] containing a resistance R and an inductance L. It is known that the nominal value of the current is y0 = 10 A. The voltage V and frequency f both obey the normal distribution, V ~ N(120, 0.6²) V and f ~ N(60, 0.3²) Hz; R and L also obey the normal distribution, and the chosen regions of their means are respectively 1 Ω ≤ R̄ ≤ 20 Ω and 0.003 H ≤ L̄ ≤ 0.045 H. The current in the circuit is:

y = V / sqrt(R² + (2πfL)²)

If the tolerance of the current is Δy = ±0.15 A, y has the "nominal is better" character. We want to determine the mean R̄ of R and its tolerance ±ΔR, and the mean L̄ of L and its tolerance ±ΔL. According to the fuzzy robust design criterion, the robust optimal design model is as follows:
max /-:PAY)f(Y) s.t.
(X,Z)E
CD
Among the related model, @ was the restriction feasible region defined by the restriction conditions, one of them was that the defect rate of the current y w a s chosen a s 0.0001. Suppose that yobeyed the normal distribution, and its mean and variance were approximately calculated in the Taylor expanding method. The robust optimal its design scheme 1-5 were listed design was carried out by choosing different in Table 1.
PAY),
Scheme serial numbers Former scheme[4]
Table 1. The robust optimum design parameters of example 1 -
R
L
AR
AL
0.1050
0.630OXlO+
Y
0:
9.9798
0.3745X10.1
11.9700
0.3000X10'2
Scheme 1
11.9262
0.3941~10.2 0.1058
0.4572~10'4 9.9847
Scheme 2
11.9253
0.3096~10'2 0.1055
0.3923~10.3 10.0148
0.1122~10.2
Scheme 3 Scheme 4
11.9262
0.3941~102 0.1058
0.4572~10.4
9.9847
0.1095~10'2
11.9266
0.3959~102 0.1043
O.lO2OxlO.3
9.9837
0.1073~10.2
0.1095~10.2
Scheme 5 11.9253 0.3096~10.2 0.1050 0.3923~10.3 10.0148 0.1122~10.2 Noted: The former scheme adopted the traditional robust design criterion[41, namely, SO(? - yo)' + C ; + m in . In the scheme 1, pJy) adopted the Formula (2) and b = b = l ,~a = 10, Ay=0.15. In the scheme 2, pJy) adopted the Formula (2) and kl=kz=l, JO= 10, but choose e l . In the scheme 3, pJy)adopted the Formula (3), m=10, W 0 . 1 5 and k= fAy2 = 0.005 . In the scheme 4, pJy) adopted the Formula (3), m = 10, W 0 . 1 5 but = &=0.15, choose choose k 0 . 0 5 . In the scheme 5, pJy) adopted the Formula (4), ~ a 10, [a,bl=[9.95,10.051.
From the comparison of the robust design parameters in Table 1, when the different forms of the membership function μ_Ã(y) recommended in this paper (or other similar forms) are adopted, good robustness is obtained in every case, which shows that the effect of Formula (5) is insensitive to the choice of μ_Ã(y). From the comparison of schemes 1, 2, 3 and 4, the effect of Formula (5) is also insensitive to the fuzzy borders of μ_Ã(y) (determined by ±Δy). So as long as the center of μ_Ã(y) is made equal to the nominal target value y0, choosing different forms of μ_Ã(y) does not affect the effect of the fuzzy robust design criterion. Thus, in practical application, even if the fuzzy distribution of μ_Ã(y) cannot be accurately determined when the quality index y deviates from the target value y0, and even if the tolerance ±Δy cannot be accurately judged, the effect of the fuzzy robust design criterion of Formula (5) is not affected. This character makes the definition of μ_Ã(y) more flexible, and thus the practicability of the robust design criterion described by Formula (5) is good. It must be pointed out that the definition of μ_Ã(y) can be flexibly processed under the premise that its center is chosen as y0; the effect of Formula (5) in making ȳ → y0 and σy² → min is not affected, but the value of P(Ã) is affected. If μ_Ã(y) cannot be accurately defined, P(Ã) > θ cannot be used as a restriction condition. For a given θ, the membership function shown in Figure 1(c) has the potential to yield a better design with a larger tolerance. Moreover, determining the tolerance ±Δy is related to the defect rate in the product quality design. In schemes 1-5 of Table 1, μ_Ã(y) was always chosen as a symmetric distribution, so the direction of ȳ → y0 could not be controlled.
5 The application example of fuzzy robust design
The robust design aims at high quality reliability and low cost. In example 1, the making cost of the product was not considered. Now suppose that the making cost of the resistance or inductance is a function of its standard deviation. Then, for the circuit of Figure 2, the robust optimum design model based on cost and quality is as follows:
x^T = (x1, x2, x3, x4) = (R̄, L̄, ΔR, ΔL)

min (0.09 + 0.0003/x3 + 0.09 + 0.0003/x4) / ∫_{−∞}^{+∞} μ_Ã(y) f(y) dy

s.t. | ∫_{9.85}^{10.15} f(y) dy − 1 | ≤ 0.0001;  1 ≤ x1 ≤ 20;  0.003 ≤ x2 ≤ 0.045;  x3 ≥ 0;  x4 ≥ 0
As in design example 1, different μ_Ã(y) were adopted in determining the robust optimum design. The resulting robust design solutions are shown in Table 2.

Table 2. The robust optimum design parameters and their comparison

Scheme              R̄         L̄        ΔR       ΔL         ȳ         σy²
Former scheme [4]   11.110     0.012    0.299    0.0015     10.0031   0.04065
Scheme 1            11.9487    0.0030   0.1050   0.002141   9.9998    0.1540×10⁻²
Scheme 2            11.9474    0.0030   0.1050   0.002139   9.9999    0.1539×10⁻²
Scheme 3            11.9487    0.0030   0.1050   0.002136   9.9982    0.1537×10⁻²
Scheme 4            11.9398    0.0030   0.1050   0.002053   10.0056   0.1507×10⁻²
Scheme 5            11.9397    0.0030   0.1050   0.002052   10.0056   0.1503×10⁻²
Scheme 6            11.9397    0.0030   0.1050   0.002051   10.0056   0.1507×10⁻²
Scheme 7            11.9511    0.0030   0.1051   0.002108   9.9962    0.1526×10⁻²
Scheme 8            11.7550    0.0030   0.1056   0.002019   9.9940    0.1502×10⁻²
Scheme 9            11.9552    0.0030   0.1050   0.002025   9.9929    0.1488×10⁻²
In Table 2, schemes 1-3 adopted the membership functions of Formulae (2), (3) and (4) respectively; for Formula (2), k1 = k2 = 1 was chosen, and for Formula (3), k = (2/9)Δy² = 0.005. The robust design solutions of schemes 1-3 are basically in accord, which shows that the fuzzy robust design criterion is well adapted to different choices of μ_Ã(y); the design schemes are all superior to the former scheme [4], which embodies the validity of the fuzzy robust design criterion. Comparing the schemes in Table 2 with the schemes in Table 1, the resistance and inductance have more reasonable tolerances because the making cost is considered. When μ_Ã(y) is adopted as shown in Figure 1(a) with k1 = 1 and k2 < k1, schemes 4, 5 and 6 correspond to k2 = 1/2, 1/3 and 1/4, respectively. Then the left side of μ_Ã(y) is linear and the curve on the right side protrudes upwards, which indicates that the designer considers the current to be good in [10, 10.15] (or the practical product requires it); in this case, the mean of the current approaches the nominal target value 10 A from the right side. If k2 = 1 and k1 < k2 are chosen, schemes 7, 8 and 9 correspond to k1 = 1/2, 1/3 and 1/4, respectively; then the right side of μ_Ã(y) is linear and the curve on the left side protrudes upwards, which indicates that the designer considers the current to be good in [9.85, 10]; in this case, the mean of the current approaches the nominal target value 10 A from the smaller side, namely from the left. So the designer can control the direction from which the mean of the design quality index approaches the nominal target value y0 by suitably choosing a non-symmetric membership function μ_Ã(y). This is an advantage that the traditional robust design criterion does not have. Using this advantage, the designer can make the distribution of the design quality index y accord even better with practical needs. This character can be clearly explained by the construction of Formula (5) and its geometric meaning.
6 Conclusion

A new robust design criterion via fuzzy probability is put forward in this paper. The controlling effect of this criterion on the mean and variance of the quality index is similar to the effect of Taguchi's signal-to-noise ratio. Moreover, it can effectively control the fashion by which the design quality characteristic index approaches the design target y0, which is an advantage that existing robust design criteria do not have. The principle of this paper can also be generalized and applied to study the robust design of products having the "smaller is better" or "larger is better" character.
References
1. Taguchi G. Introduction to Quality Engineering: Designing Quality into Products and Processes. Asian Productivity Organization, Tokyo, 1986
2. Wilde D J. Monotony Analysis of Taguchi's Robust Circuit Design Problem. Trans. of the ASME, J. of Mech. Design, 1992, 114: 616-619
3. Parkinson A. Mechanical Design Using Engineering Models. Trans. of the ASME, J. of Mech. Design, 1995, 117: 48-54
4. CHENG Lizhou. Robust Design. Beijing: Mechanical Industry Publishing House, 2000
5. Zadeh L A. Outline of a New Approach to the Analysis of Complex Systems and Decision Processes. IEEE Transactions on Systems, Man and Cybernetics, 1973, SMC-3(1): 28-44
INTERVAL-VALUED FUZZY SET MODELLING OF SYSTEM RELIABILITY
RENKUAN GUO
Department of Statistical Sciences, University of Cape Town, Private Bag, Rhodes Gift, Rondebosch 7707, Cape Town, South Africa
Email: [email protected]
System reliability modeling in terms of fuzzy set theory has basically utilized Type I fuzzy sets, where the fuzzy membership is assumed to be a point-wise positive function ranging on [0,1]. Such a practice might not be practical, because an interval-valued membership may better reflect the vagueness of a system according to human thinking patterns. In this paper, we explore the basics of interval-valued fuzzy set theory and illustrate its application in terms of an industrial example.
1 Introduction

System operating and maintenance data are often imprecise and vague. Fuzzy set theory (Zadeh, [7]) therefore opened the way for modeling the fuzziness aspect of system reliability. A fundamental issue is the treatment of the membership function, because a fuzzy set extends a classical set by extending the {0,1}-valued indicator function characterizing a crisp set into a membership function ranging on the interval [0,1]. Most fuzzy reliability modeling efforts assume a membership function, which can be regarded as a point estimate of the degree of belief in the belongingness relation, to reflect the vague nature of system operating and maintenance data. However, it may be more logical and practical to assume an interval-valued membership grade, which can be regarded as an interval-valued estimate of the degree of belief in the subordination relation, because, as a general and natural human thinking pattern, the degree of fuzziness appears as an interval-valued number on [0,1]. In other words, it is natural to use a special class of Type II fuzzy sets, the interval-valued fuzzy set (IVFS) (Zadeh, [8]), to describe the fuzzy aspect of system reliability. Section 2 contains the elementary concepts and operations on IVFSs. Furthermore, the relation between IVFSs and rough sets (Pawlak, [5]), and thus the IVFS decomposition theorem, is established in Section 3. In Section 4, the probability of an IVFS is defined and a stress-tension style reliability model is proposed for analyzing the state of a repairable system. Section 5 illustrates the reliability analysis in terms of an industrial example, the cement roller data (Love and Guo, [4]). Section 6 gives a few comments on the IVFS system reliability analysis.
2. Interval-Valued Fuzzy Set
2.1. Concept of Interval-Valued Fuzzy Sets
Definition 1. A closed interval $\bar{a} = [a^l, a^u]$, $a^l, a^u \in R$, with $a^l \le a^u$, is called a real-valued interval number. If $a^l, a^u \in [0,1]$, then $[a^l, a^u]$ is called an interval number on the unit interval, or simply an interval number. Let $I[0,1] = \{\bar{a} = [a^l, a^u] \mid 0 \le a^l \le a^u \le 1\}$; it is the collection of all interval numbers (on the unit interval [0,1]).

Definition 2. Let the set U denote a discourse. An interval-valued fuzzy set (IVFS) $\tilde{A}$ is a mapping from U to I[0,1]:

$$\mu_{\tilde{A}} : U \to I[0,1] \qquad (1)$$

For $\forall u \in U$,

$$\mu_{\tilde{A}}(u) = [\mu_{\tilde{A}}^l(u), \mu_{\tilde{A}}^u(u)] \qquad (2)$$

where

$$\mu_{\tilde{A}}^l : U \to [0,1] \quad \text{and} \quad \mu_{\tilde{A}}^u : U \to [0,1] \qquad (3)$$

such that

$$0 \le \mu_{\tilde{A}}^l(u) \le \mu_{\tilde{A}}^u(u) \le 1 \quad \text{for } \forall u \in U \qquad (4)$$

Therefore, an IVFS is characterized by an interval-valued membership function, denoted as $\mu_{\tilde{A}}$.
Figure 1. Membership of an IVFS.
Atanassov [2] proposed the concept of the intuitionistic fuzzy set (IFS), which is equivalent to an IVFS. The mapping

$$\pi_{\tilde{A}}(u) = 1 - \mu_{\tilde{A}}(u) - \nu_{\tilde{A}}(u) \qquad (6)$$

defines the depth of the degree of vague uncertainty of an IVFS, and therefore

$$\pi_{\tilde{A}} = \mu_{\tilde{A}}^u - \mu_{\tilde{A}}^l \qquad (7)$$
2.2. A Geometric Interpretation

Agustench, Bustince and Mohedano [1] gave a geometric interpretation of an IVFS, which identifies the triangle OAB within the unit cube under the coordinate system $(\mu_{\tilde{A}}^l, \mu_{\tilde{A}}^u, z)$ (i.e., the lower membership grade $\mu_{\tilde{A}}^l$ on the horizontal axis, the upper membership grade $\mu_{\tilde{A}}^u$ on the second axis, and the depth of vagueness $z = \mu_{\tilde{A}}^u - \mu_{\tilde{A}}^l$ on the vertical axis) as the projection space. In other words, an IVFS is a mapping from U to the triangle OAB.

If we only look at $(\mu_{\tilde{A}}^l(u), \mu_{\tilde{A}}^u(u))$, then a curve defined inside the triangle OAB [(0,0),(1,1),(0,1)] on the bottom will characterize an IVFS. Needless to say, the geometric interpretation should help us to better understand, and thus to specify, an IVFS in practice.
2.3. Basic Operations on IVFSs

Let $\tilde{A}, \tilde{B}$ be two interval-valued fuzzy sets on the discourse U. The three basic operations (union, intersection and complement) of interval-valued fuzzy sets are defined as:

(i) Union of $\tilde{A}$ and $\tilde{B}$:

$$\tilde{A} \cup \tilde{B} = \{(u, [\mu_{\tilde{A}}^l(u) \vee \mu_{\tilde{B}}^l(u), \ \mu_{\tilde{A}}^u(u) \vee \mu_{\tilde{B}}^u(u)]) \mid u \in U\}$$

(ii) Intersection of $\tilde{A}$ and $\tilde{B}$:

$$\tilde{A} \cap \tilde{B} = \{(u, [\mu_{\tilde{A}}^l(u) \wedge \mu_{\tilde{B}}^l(u), \ \mu_{\tilde{A}}^u(u) \wedge \mu_{\tilde{B}}^u(u)]) \mid u \in U\}$$

(iii) Complement of $\tilde{A}$:

$$\tilde{A}^c = \{(u, \mu_{\tilde{A}^c}(u)) \mid u \in U\}, \quad \mu_{\tilde{A}^c}(u) = [1 - \mu_{\tilde{A}}^u(u), \ 1 - \mu_{\tilde{A}}^l(u)]$$
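As a quick illustration, the sketch below transcribes the three operations for point-wise interval grades [l, u]; the numerical grades are hypothetical.

```python
# A minimal sketch (hypothetical grades) of the three IVFS operations
# applied to point-wise interval membership grades [l, u].
def iv_union(a, b):
    return (max(a[0], b[0]), max(a[1], b[1]))

def iv_intersect(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]))

def iv_complement(a):
    return (1 - a[1], 1 - a[0])

mu_A, mu_B = (0.30, 0.50), (0.60, 0.80)
print(iv_union(mu_A, mu_B))      # (0.60, 0.80)
print(iv_intersect(mu_A, mu_B))  # (0.30, 0.50)
print(iv_complement(mu_A))       # (0.50, 0.70)
```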
Other operations, such as t-norms and t-conorms, will not be mentioned here for brevity, but they are critical in IVFS inference.

3. Decomposition of an IVFS
The critical role of fuzzy set decomposition in fuzzy mathematical theory is that it links a fuzzy set to a common (crisp) set. For the Type I fuzzy set case, the decomposition takes the form

$$A = \bigcup_{\lambda \in [0,1]} \lambda A_\lambda$$

where

$$A_\lambda = \{u \in U \mid \mu_A(u) \ge \lambda\}, \quad \forall \lambda \in [0,1] \qquad (12)$$

The key issue here is that in the case of an IVFS the membership is an interval $\bar{a} \in I[0,1]$. Therefore, the decomposition should not be performed by a line-cut (as for a Type I fuzzy set) but by an interval-cut; in other words, it is necessary to investigate the set

$$\tilde{A}_{\bar{\lambda}} = \{u \in U \mid \mu_{\tilde{A}}(u) \ge \bar{\lambda}\}, \quad \forall \bar{\lambda} = [\lambda^l, \lambda^u] \in I[0,1]$$
that is,

$$A_{\bar{\lambda},l} = \{u \in U \mid \mu_{\tilde{A}}^l(u) \ge \lambda^u\}$$

and

$$A_{\bar{\lambda},u} = \{u \in U \mid \mu_{\tilde{A}}^u(u) \ge \lambda^l\} \qquad (15)$$

These two level-cut sets can be used to characterize the interval-valued cut set, denoted as $\tilde{A}_{\bar{\lambda}}$.

Theorem: An interval-valued fuzzy set $\tilde{A}$ can be represented as

$$\tilde{A} = \bigcup_{\bar{\lambda} \in I[0,1]} \bar{\lambda} \tilde{A}_{\bar{\lambda}}$$

where $\bar{\lambda} \tilde{A}_{\bar{\lambda}} \triangleq [\lambda^l A_{\bar{\lambda},l}, \ \lambda^u A_{\bar{\lambda},u}]$.

Proof: In terms of the construction definitions, the representation follows directly.
It is obvious that the set $A_{\bar{\lambda},l}$ can be treated as a lower approximation to the set $A_\lambda$ and the set $A_{\bar{\lambda},u}$ as an upper approximation to the set $A_\lambda$. Notice that $A_{\bar{\lambda},l} \subseteq A_\lambda \subseteq A_{\bar{\lambda},u}$. It is therefore reasonable to argue that the $\bar{\lambda}$-cut sets induce a rough set in the sense of Pawlak [5]. This linkage may also promote a better understanding of the concept of an IVFS and even help to specify the interval-valued membership more intuitively.
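To make the interval-cut concrete, here is a small sketch on a finite discourse; the membership grades are hypothetical, and the two cut-set definitions follow the reconstruction above.

```python
# Interval-cut of a discrete IVFS (hypothetical grades): the lower
# approximation keeps points whose *lower* grade reaches lambda_u; the
# upper approximation keeps points whose *upper* grade reaches lambda_l,
# so that A_low is a subset of A_up, as in the rough-set reading.
mu = {"u1": (0.45, 0.55), "u2": (0.78, 0.82),
      "u3": (0.80, 0.84), "u4": (0.56, 0.64)}

def interval_cut(lambda_l, lambda_u):
    low = [u for u, (lo, _) in mu.items() if lo >= lambda_u]
    up = [u for u, (_, hi) in mu.items() if hi >= lambda_l]
    return low, up

print(interval_cut(0.5, 0.8))  # (['u3'], ['u1', 'u2', 'u3', 'u4'])
```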
4. The Probability of an IVFS
The probability of a (Type I) fuzzy event $A$ on $U$ with membership $\mu_A(u)$ is

$$\Pr[A] = E_P[\mu_A(X)] = \int_U \mu_A(u)\, dP(u) \qquad (20)$$

In the context of an IVFS, the relation between the interval-valued membership and the probability of the interval-valued fuzzy set maintains a similar form:

$$\Pr[\tilde{A}] = \left[ \int_U \mu_{\tilde{A}}^l(u)\, dP(u), \ \int_U \mu_{\tilde{A}}^u(u)\, dP(u) \right] \qquad (21)$$
This expression gives a probability interval for the IVFS $\tilde{A}$.
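For a finite discourse, (21) reduces to a pair of weighted sums; the sketch below (with hypothetical grades and probabilities) computes the probability interval.

```python
# Probability interval of an IVFS over a finite discourse, per Eq. (21):
# the bounds are the expectations of the lower and upper memberships.
def ivfs_probability(mu_lower, mu_upper, prob):
    pr_l = sum(m * p for m, p in zip(mu_lower, prob))
    pr_u = sum(m * p for m, p in zip(mu_upper, prob))
    return [pr_l, pr_u]

# Hypothetical discourse of four system states, equally probable.
mu_l = [0.45, 0.78, 0.80, 0.56]
mu_u = [0.55, 0.82, 0.84, 0.64]
print(ivfs_probability(mu_l, mu_u, [0.25] * 4))  # [0.6475, 0.7125]
```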
5. An IVFS Reliability Model of Repairable Systems
5.1. A Virtual Allowable Capacity Model for a Repairable System

The basic idea of the reliability model proposed here is essentially taken from the traditional stress-tension modelling of an engineering structure. If we treat a repairable system as a virtual engineering structure, then the system parameters, the maintenance parameters and the operational environment parameters together form a virtual allowable capacity, denoted $C_a$, which restrains or controls the system functioning state. The virtual allowable capacity plays a role similar to the stress level in the stress-tension model, and determines a virtual allowable operating time, denoted $T_a$. On the other hand, system functioning or operation causes system wear-out and increases its failure hazard. Therefore, the actual system functioning plays a role similar to the tension level, denoted $T$. The limiting state equation of the reliability of the functioning system is

$$Z = T_a - T \qquad (22)$$

Furthermore, it is assumed that the limiting state $Z$ is a normally distributed random variable. It is intuitive to say that both $T_a$ and $T$ are random and fuzzy in nature. The failure of the system is assumed to be an interval-valued fuzzy event with an interval-valued membership function.
5.2. The Cement Roller Example

A set of industrial data, operating data extracted from a Canadian cement plant (Love and Guo [4]), is used. Guo and Love [3] performed a fuzzy analysis on the same data in terms of the fuzzy logical function method for obtaining the point-wise relative membership grades $\mu_{C_a}(u)$. For illustration purposes, we convert $\mu_{C_a}(u)$ into interval-valued membership grades $\mu_{\tilde{C}_a}(u)$ by assigning the depth of vagueness $\pi = 0.1$ at $\mu_{C_a}(u) = 0.5$ and $\pi = 0$ at $\mu_{C_a}(u) = 0$ or 1.0.

For a recorded failure (or PM) time, the corresponding allowable time satisfies $\mu_{\tilde{C}_a}(t_a) = 1 - t_a/t_{\max}$; that is, the allowable time is $t_a = t_{\max}(1 - \mu_{\tilde{C}_a})$. Therefore the virtual system state is
$$z = t_a - t$$

For failure times, $t_{\max} = \max\{t_1 x_1, \ldots, t_{31} x_{31}\} = 147$, while for the censoring (PM) times, $t_{\max} = \max\{t_1(1-x_1), \ldots, t_{31}(1-x_{31})\} = 217$. Then $\mu_{\tilde{C}_a}(u)$ and the interval-valued $t_a$ and $z$ are calculated and listed in Table 1.
Table 1. "Observed" $t_a$ and $z$ interval values for each failure/PM record (columns: operating time $t$, failure indicator $x_i$, point-wise membership grade, interval-valued membership $\mu_{\tilde{C}_a}(u)$, interval-valued allowable time $t_a$, and interval-valued state $z$).
From the table, it is easy to notice that for most of the failure cases ($x_i = 1$) the observed $z$-values are negative, which indicates the system falls into a failed and damaged state, while for quite a few of the censoring cases the observed $z$-values are positive, which indicates the system is still in a "reliable" and "safe" state. The signs of these "observed" $z$-values confirm that the membership degrees of the allowable capacity, $\mu_{\tilde{C}_a}(u)$, make sense. The mean and standard deviation of the interval-valued normal random variable $z$ can accordingly be estimated as $\bar{m} = [-18.744, -8.853]$ and $\bar{\sigma} = [65.886, 66.458]$, respectively. The fact that $\bar{m} < 0$ clearly indicates the system requires PM. The system data $t_a$ can be used to fit Weibull distributions for further conventional reliability analysis.

6. Concluding Remarks
In this paper, we briefly discussed the concept of the IVFS and argued the necessity of using the IVFS idea for modeling system reliability. We could simply use the method by Wu [6] to conduct fuzzy inference on the system reliability directly. However, the virtual operational state of an operating system gives another insight into the operating system state, and using an IVFS to analyze the system state seems more meaningful. For reasons of simplification, we skip quite a few computation details by referring to our previous work, Guo and Love [3]. As a matter of fact, it is more realistic to calculate the interval-valued membership grades and then use the logical function idea to obtain the IVFS membership grades for the system state.

References
1. E. Agustench, H. Bustince and V. Mohedano, Mathware & Soft Computing 6, 267 (1999).
2. K. Atanassov, FSS 20, 87 (1986).
3. R. Guo and C.E. Love, Int. J. R. Q. S. Eng., Vol. 10, No. 2, 131 (2003).
4. C.E. Love and R. Guo, Q. R. Eng. Int., Vol. 7, 7 (1991).
5. Z. Pawlak, Int. J. Comput. Inf. Sci., 341 (1982).
6. Wu, Wangming, Principles and Methods of Fuzzy Reasoning (1994).
7. Zadeh, L. A., Fuzzy sets, Information and Control 8, 338 (1965).
8. Zadeh, L. A., IEEE Trans. System. Man Cybernet. 3, 28 (1973).
FUZZY SET-VALUED STATISTICAL INFERENCES ON A SYSTEM OPERATING DATA

RENKUAN GUO
Department of Statistical Sciences, University of Cape Town, Private Bag, Rhodes' Gift, Rondebosch, Cape Town, South Africa
E-mail: rguo@iais.uci.ac.za
ERNIE LOVE
Faculty of Business Administration, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada
E-mail: [email protected]
In this paper the characteristics and mathematical foundation of set-valued statistics, namely random sets and the falling shadow function, are briefly reviewed. We further discuss the procedure of fuzzy set-valued inference based on sampled data information proposed by Chen [2]. An industrial data set is used to illustrate the prediction of system failure time based on covariate information.
1. Introduction
Using covariate information for predicting system behavior is a well-known research topic in reliability engineering, e.g. Love and Guo [4] and Guo and Love [3]. In classical probability and statistics, what is obtained from each statistical experimental trial (or run) is a deterministic point on the phase space (the set of possible observations, the sample or elementary space). Such a statistical methodology is therefore called a pointwise statistics. However, in a great many management and production practices, particularly those involving human psychological measurements, what we face are no longer isolated points. The information obtained from each experiment is usually a common subset or a fuzzy subset of the phase space. Classical statistics often ignores the interconnections between points, and therefore ignores the fact that the viewpoint of the whole is a fundamental criterion for human beings when performing partitions, selections, orderings and decision-making. In set-valued statistics the outcome of each experiment is a subset of a certain discourse, and therefore the classical statistical methodologies no longer apply. As an extension of classical statistics, set-valued statistics greatly expands the scope of applications. In general, the theory of random sets and the falling shadow of random sets is the foundation of set-valued statistics. Section 2 investigates the basic features of set-valued statistics. In Section 3 the fuzzy set-valued inference is reviewed, and in Section 4 an industrial data set is used to illustrate the idea. A few remarks are given in Section 5.
2. Fuzzy Statistical Model
2.1. Fuzzy Statistical Experiment

The best way to understand the characteristics of set-valued statistics is to compare it with classical statistics. Table 1 gives a systematic comparison between them.
Table 1. Comparisons between the classical and fuzzy statistical models

Space: (classical) the elementary space $\Omega$, containing all the relevant factors and thus being an extremely high-dimensional Cartesian product space; (fuzzy) a discourse U.

Fixed: (classical) an event $A \subseteq \Omega$; (fuzzy) a fixed element $u_0 \in U$.

Varying: (classical) a variate $\omega$ on $\Omega$; once $\omega$ is fixed, all the factors are fixed at their specific state levels; (fuzzy) a varying set $A^*$ on U, formed by constructing an uncertain partition for a fuzzy concept $\tilde{a}$; each fixing of $A^*$ implies a definite partition for $\tilde{a}$, which represents an approximation to the extension of $\tilde{a}$.

Condition: (classical) a certain condition S; (fuzzy) a certain condition S.
2.2. Duality Between the Two Models

Assume that $\mathcal{P}(U)$ is the collection of all the subsets of the domain U, called the power set of U. It is obvious that any set $A \subseteq U$ is an element of the power set, i.e., $A \in \mathcal{P}(U)$. For $\forall u \in U$, define the set $\Gamma_u \triangleq \{B : B \in \mathcal{P}(U), B \ni u\}$, which is the super filter of the set algebra $\mathcal{P}(U)$. For any given u, $\Gamma_u$ can be regarded as a subset of $\mathcal{P}(U)$, i.e., $\Gamma_u \subseteq \mathcal{P}(U)$. Thus a fuzzy statistical model on a discourse U can be converted into a classical statistical model on the discourse $\mathcal{P}(U)$. For a given discourse U and its power set $\mathcal{P}(U)$, define the set class $\{\Gamma_u : u \in U\}$ and let $\mathfrak{B}$ be a $\sigma$-algebra containing it, i.e., $\{\Gamma_u\} \subseteq \mathfrak{B}$. Therefore $(\mathcal{P}(U), \mathfrak{B})$ is a measurable space. A measurable mapping $\xi$ from a probability space $(\Omega, \mathcal{A}, P)$ into the measurable space $(\mathcal{P}(U), \mathfrak{B})$ is called a random set on U, i.e.,

$$\xi^{-1}(B) = \{\omega : \xi(\omega) \in B\} \in \mathcal{A}, \quad \forall B \in \mathfrak{B} \qquad (1)$$

Intuitively, a random set $\xi$ can be defined as a mapping from the sample space $\Omega$ to subsets of U, such that each pre-image of $\xi(\omega)$, $\omega \in \Omega$, is a possible experimental outcome whose occurrence is associated with a probability measure.

2.3. Falling Shadow Function
Term "falling shadow" is proposed by Wang [ 5 ] , basing on an image of a cluster of cloud on the sky throwing shadow on the ground. Mathematically, assume that [ is a random
167
u
u,
set on discourse . For \y'u E E< U ) = P [ w :[ ( w ) 3 U ] is called the falling shadow function. For a fixed u E , E< uo) is called the falling shadow value of [ at u,, . Notice that &< ( u ) = PPw : = ( u ) and therefore E< is a real . E< is not only used to describe random set to a certain degree function defined on but also can expressed as the membership function of the corresponding fuzzy subset on
u t(4) r,] 5
u
-
U.
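A toy Monte Carlo sketch of this definition (with hypothetical random sets): the empirical falling shadow at u is simply the coverage frequency of u over the sampled sets.

```python
import random

# Empirical falling shadow (hypothetical experiment): sample n random
# subsets of a finite discourse U, then estimate mu(u) as the fraction
# of sampled sets that contain u.
U = list(range(10))
n = 1000
samples = [set(random.sample(U, random.randint(3, 7))) for _ in range(n)]

falling_shadow = {u: sum(u in s for s in samples) / n for u in U}
print(falling_shadow[0])  # estimated membership grade of u = 0
```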
Assume that $\tilde{A}$ is a fuzzy concept which can be represented by a fuzzy subset of U. $\Gamma_u \triangleq \{B : B \in \mathcal{P}(U), B \ni u\}$ is the collection of all random sets which contain $u \in U$ and also represent the fuzzy concept $\tilde{A}$. For $\forall B_\omega = \xi(\omega) \in \Gamma_u$, the associated probability of $\xi(\omega)$ represents the degree of confidence for the set $B_\omega$ to describe the fuzzy concept $\tilde{A}$. Then the probability of $\Gamma_u$, $\mu_\xi(u)$, can be regarded as the membership of the fuzzy set $\tilde{A}$ at u, denoted $\mu_{\tilde{A}}(u)$.

3. Fuzzy Set-Valued Statistical Inference
Chen and Chen [2] developed a concrete model, summarized in Chen and Guo [1]. Assuming that X and Y are two related discourses, the implication relation $\tilde{R}$ from X to Y can be represented by the falling shadow

$$\mu_{\tilde{R}}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \mu_{\tilde{R}_i}(x, y) \qquad (2)$$

in terms of n independent experimental samples of random sets

$$\{\tilde{R}_1, \tilde{R}_2, \ldots, \tilde{R}_n\}, \quad \tilde{R}_i \in \mathcal{F}(X \times Y), \ i = 1, 2, \ldots, n \qquad (3)$$

For a given $A^* \in \mathcal{F}(X)$, in terms of the reasoning compositional rule,

$$B^* = A^* \circ \tilde{R} \qquad (4)$$

is obtained as the fuzzy inference conclusion. In order to continuously modify the original implication relation (whether of a prior nature or a sampled one) $\tilde{R}$ by utilizing the data-based inference conclusions progressively, define

$$\tilde{R}_{n+1} = A^* \times B^* \qquad (5)$$

which has the membership function

$$\mu_{\tilde{R}_{n+1}}(x, y) = \mu_{A^*}(x) \wedge \mu_{B^*}(y), \quad \forall (x, y) \in X \times Y \qquad (6)$$

Then a fuzzy set-valued inference model with self-learning is obtained, with a correspondingly modified implication relation and membership function.
However, sometimes the samples obtained from statistical experimentation are not subsets on $X \times Y$ but points $(x, y)$ on $X \times Y$. In particular, when a problem involves many factors, the random sample takes the form of multi-dimensional data $(x_1, x_2, \ldots, x_n, y)$. Therefore, it is necessary to extend the above model. Let us consider an n-fold 1-order, (n,1)-implication fuzzy inference. Assume that X and Y are two related real sets, and that $\mathbb{A} \triangleq \{\tilde{A}_1, \tilde{A}_2, \ldots, \tilde{A}_m\}$ and $\mathbb{B} \triangleq \{\tilde{B}_1, \tilde{B}_2, \ldots, \tilde{B}_k\}$ are two fuzzy normal convex partitions on X and Y respectively. In general $\mathbb{A}$ and $\mathbb{B}$ are prior fuzzy partitions based on the features of the real problem. Assume that n independent random samples, i.e., observed data $\{S_l = (x_l, y_l)\}_{l=1}^{n}$ on the fuzzy sets of the partitions $\mathbb{A}$ and $\mathbb{B}$ respectively, are obtained; thereby two groups of fuzzy sets of X and Y are obtained. A fuzzy implication is defined as a fuzzy relation $\tilde{R} = (r_{ij})$ from $\mathbb{A}$ to $\mathbb{B}$, where $r_{ij} \in [0,1]$ is interpreted as the degree of truth that if "$x$ is $\tilde{A}_i$" then "$y$ is $\tilde{B}_j$". Let the i-th row vector of $\tilde{R}$ be denoted $\tilde{R}_i = (r_{i1}, r_{i2}, \ldots, r_{ik})$, regarded as a fuzzy subset on $\mathbb{B}$. Thus $\tilde{R}_i$ is understood as the random-set falling-shadow estimate, from the sample of random sets, of the fuzzy subset group

$$\{\tilde{B}_j \mid x_l \in \tilde{A}_i, \ (x_l, y_l) \in S_l, \ l \in \{1, \ldots, n\}\} \qquad (12)$$
If $\mathbb{A}$ is a common partition of X, i.e., all the $\tilde{A}_i \in \mathbb{A}$ are crisp subsets, the above-stated meaning of constructing $\tilde{R}_i$ in terms of the falling shadow is obvious. However, when $\tilde{A}_i$ is a fuzzy subset, "$x_l \in \tilde{A}_i$" cannot simply be described by "yes" or "no"; it is usually described by the degree of $x_l \in \tilde{A}_i$. The membership $\mu_{\tilde{A}_i}(x_l)$ is just the quantity reflecting the degree of $x_l \in \tilde{A}_i$. Therefore, a linear estimator is constructed for $\tilde{R}_i$:
$$\tilde{R}_i = \frac{1}{a_i} \left( \sum_{l=1}^{n} \mu_{\tilde{A}_i}(x_l)\mu_{\tilde{B}_1}(y_l), \ \ldots, \ \sum_{l=1}^{n} \mu_{\tilde{A}_i}(x_l)\mu_{\tilde{B}_k}(y_l) \right)$$

As to the computation details, for a given sample S, two matrices $A = (\mu_{\tilde{A}_i}(x_l))$ and $B = (\mu_{\tilde{B}_j}(y_l))$ can be constructed. Let

$$a_i = \sum_{l=1}^{n} \mu_{\tilde{A}_i}(x_l)$$

Then the fuzzy implication relation $\tilde{R}$ is obtained row by row.
For a given $x \in X$, one obtains

$$\tilde{a}^* = (a_1^*, a_2^*, \ldots, a_m^*) \in \mathcal{F}(\mathbb{A}), \quad a_i^* = \mu_{\tilde{A}_i}(x), \ i = 1, \ldots, m \qquad (18)$$

Therefore, in terms of the reasoning compositional rule,

$$\tilde{b}^* = \tilde{a}^* \circ \tilde{R} = (b_1^*, b_2^*, \ldots, b_k^*) \in \mathcal{F}(\mathbb{B})$$

Let $y_1, y_2, \ldots, y_k$ be the kernels of $\tilde{B}_1, \tilde{B}_2, \ldots, \tilde{B}_k$ respectively; then the $b^*$-weighted combination of the kernels

$$y^* = \sum_{j=1}^{k} b_j^* y_j \qquad (20)$$

is the inferential conclusion.
4. Failure Time Prediction
It is generally accepted that if the failure time of a repairable system can be reasonably predicted, then a timely repair action can be taken so that the loss due to non-operation is minimized. In order to forecast the system failure time t, it is reasonable to take covariate D as the forecasting factor, based on the analysis results of Love and Guo [3].

Table 2. Covariate D and failure times.
The fuzzy partition $\mathbb{A} = \{\tilde{A}_1, \tilde{A}_2, \tilde{A}_3, \tilde{A}_4\}$ is defined on the discourse X of covariate D, and the fuzzy partition $\mathbb{B} = \{\tilde{B}_1, \tilde{B}_2, \tilde{B}_3, \tilde{B}_4\}$ on the discourse Y of the failure time T. The membership functions of the fuzzy subsets $\tilde{A}_i$ and $\tilde{B}_j$ are defined as follows:

$$\mu_{\tilde{A}_1}(x) = \begin{cases} 1 & x \le 8.5 \\ \frac{11-x}{2.5} & 8.5 < x < 11 \\ 0 & \text{otherwise} \end{cases} \qquad \mu_{\tilde{A}_2}(x) = \begin{cases} \frac{x-8.5}{2.5} & 8.5 \le x \le 11 \\ \frac{13.5-x}{2.5} & 11 < x < 13.5 \\ 0 & \text{otherwise} \end{cases}$$

$$\mu_{\tilde{A}_3}(x) = \begin{cases} \frac{x-11}{2.5} & 11 \le x \le 13.5 \\ \frac{16-x}{2.5} & 13.5 < x < 16 \\ 0 & \text{otherwise} \end{cases} \qquad \mu_{\tilde{A}_4}(x) = \begin{cases} \frac{x-13.5}{2.5} & 13.5 \le x \le 16 \\ 1 & x > 16 \\ 0 & \text{otherwise} \end{cases}$$

and
$$\mu_{\tilde{B}_1}(y) = \begin{cases} 1 & y \le 40 \\ \frac{80-y}{40} & 40 \le y \le 80 \\ 0 & \text{otherwise} \end{cases} \qquad \mu_{\tilde{B}_2}(y) = \begin{cases} \frac{y-40}{40} & 40 \le y \le 80 \\ \frac{120-y}{40} & 80 < y < 120 \\ 0 & \text{otherwise} \end{cases}$$

$$\mu_{\tilde{B}_3}(y) = \begin{cases} \frac{y-80}{40} & 80 \le y \le 120 \\ \frac{160-y}{40} & 120 < y < 160 \\ 0 & \text{otherwise} \end{cases} \qquad \mu_{\tilde{B}_4}(y) = \begin{cases} 0 & y < 120 \\ \frac{y-120}{40} & 120 < y < 160 \\ 1 & 160 \le y < 200 \\ 0 & \text{otherwise} \end{cases}$$
respectively. Calculating the matrices $A = (\mu_{\tilde{A}_i}(x_l))$ and $B = (\mu_{\tilde{B}_j}(y_l))$ from the sample, the numerator matrix $\left(\sum_l \mu_{\tilde{A}_i}(x_l)\mu_{\tilde{B}_j}(y_l)\right)_{ij}$ is obtained:

$$A^T B = \begin{pmatrix} 0.875 & 1.40 & 0.125 & 0 \\ 1.97 & 1.54 & 2.03 & 0.26 \\ 2.04 & 2.30 & 3.02 & 0.84 \\ 0.415 & 0.185 & 0 & 0 \end{pmatrix}$$

Calculating the column sums of matrix A:

$$A^T \mathbf{1} = (a_1 \ \ a_2 \ \ a_3 \ \ a_4)^T = (2.4 \ \ 5.8 \ \ 8.2 \ \ 0.6)^T$$

The fuzzy implication relation is then:
$$\tilde{R} = \begin{pmatrix} 0.365 & 0.583 & 0.052 & 0.000 \\ 0.400 & 0.266 & 0.350 & 0.050 \\ 0.249 & 0.280 & 0.368 & 0.102 \\ 0.692 & 0.308 & 0.000 & 0.000 \end{pmatrix}$$
Now it is ready to perform inference. Let the covariate D value be x = 11; then $\tilde{a}^* = (0, 1, 0, 0)$. In terms of the fuzzy reasoning compositional rule (either the $\wedge$-$\vee$ rule or the matrix multiplication rule; empirical evidence shows that the common matrix multiplication rule gives results consistent with the real world),

$$\tilde{b}^* = \tilde{a}^* \circ \tilde{R} = (0.400 \ \ 0.266 \ \ 0.338 \ \ 0.06) \qquad (27)$$
Based on rule (20),

$$y = 0.400 \times 20 + 0.266 \times 80 + 0.338 \times 120 + 0.06 \times 160$$

The next failure time is predicted as 77.7 hours when the covariate D value is x = 11 (the actual failure time is 55 hours).
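The whole pipeline condenses to a few lines. The sketch below is an illustration under stated assumptions: the numerator matrix and column sums are taken from the text, the first row of $\tilde{R}$ is filled in by the same row-normalization arithmetic, and the kernels 20, 80, 120, 160 are read off the printed rule-(20) calculation. Small discrepancies with the printed figures reflect rounding in the original.

```python
import numpy as np

# Fuzzy set-valued inference for the failure-time example (a sketch;
# inputs reproduce the values printed in the text).
num = np.array([[0.875, 1.40, 0.125, 0.00],   # sum_l mu_Ai(x_l)*mu_Bj(y_l)
                [1.97,  1.54, 2.03,  0.26],
                [2.04,  2.30, 3.02,  0.84],
                [0.415, 0.185, 0.00, 0.00]])
a = np.array([2.4, 5.8, 8.2, 0.6])            # column sums of matrix A

R = num / a[:, None]                          # implication relation, row i / a_i

a_star = np.array([0.0, 1.0, 0.0, 0.0])       # x = 11 lies fully in A2
b_star = a_star @ R                           # compositional rule, matrix form

kernels = np.array([20.0, 80.0, 120.0, 160.0])  # kernels of B1..B4 (assumed)
y_pred = float(b_star @ kernels)              # rule (20): kernel-weighted sum
print(np.round(b_star, 3), round(y_pred, 1))  # ~(0.34 0.266 0.35 0.045), ~77.2
```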
5. Concluding Remarks
In today's industrial practice it is often the case that, on the one hand, complete system operating data are increasingly scarce, while on the other hand huge amounts of system operating records generated automatically by the latest industrial machinery go unused, because classical reliability engineering methodologies are powerless to extract useful information from them for management decision-making. The fundamental root cause is that most reliability theory and methodologies are developed from classical probability and statistics, which are point-wise in nature and cannot effectively handle (fuzzy) set-valued data situations. It follows that set-valued statistical methodology should step in to effectively extract meaningful information from system operation and maintenance records for the best management. In this paper, we briefly reviewed the concept of the fuzzy set-valued statistical model, which is based on the theory of random sets and the falling shadow function initiated by Wang [5]. Furthermore, the fuzzy set-valued inference (reasoning) developed by Chen and Chen [2] was reviewed and applied to the prediction of an operating system's failure time from covariate information, for illustration purposes. The demonstration example is obviously a simplified exercise, using only one covariate (D). However, this paper intends to show that, with set-valued and imprecisely recorded data, the fuzzy set-valued statistical inference method is solid and natural and therefore deserves more research and application in industry.
References
1. S.Q. Chen and S.Z. Guo, Fuzzy Prediction, Guizhou Science Publishing House, China (1992).
2. Y.Y. Chen and T.Y. Chen, Characteristic Expansion Approximation Reasoning Method, The Journal of Liaoning Normal University (Science Edition) 3 (1984).
3. R. Guo and C.E. Love, Reliability Modelling With Fuzzy Covariates, International Journal of Reliability, Quality and Safety Engineering, Vol. 10, No. 2, 131 (2003).
4. C.E. Love and R. Guo, Using Proportional Hazard Modelling In Plant Maintenance, Quality and Reliability Engineering International, 7, 7 (1991).
5. P.Z. Wang, Fuzzy Sets and Falling Shadow Function of Random Sets, Beijing Normal University Publishing House (1985).
A SOFTWARE RELIABILITY ALLOCATION MODEL BASED ON COST-CONTROLLING

CAN HUANG, REN-ZUO XU, LIANG-PING ZHANG
State Key Laboratory of Software Engineering of Wuhan University, Hubei, Wuhan, 430072, People's Republic of China

Abstract
Software reliability allocation has developed gradually in recent years [1]; however, many reliability allocation methods allocate reliability to software without considering its cost [1, 2]. In this paper, we introduce a model for reliability allocation. The model minimizes the cost of the software while guaranteeing that the software system reliability achieves its goal. We also introduce the genetic algorithm to obtain the optimal result of the model.

Keywords: reliability allocation; cost of software development; genetic algorithm; optimum; constrained programming problem
Introduction

When a software product is still in design, the reliability of the complete system should be set. But if we want to minimize the development cost of the system, what reliability should each system module reach to meet the demand on system reliability? This is the problem that software reliability allocation intends to solve [1]. At present, there are many mature techniques that can be used, such as the partition law of the rapid allocation algorithm [2], the allocation method based on the fault rate, AHP [3], and so on. Most of these traditional reliability allocation techniques do not consider the development cost of a software project. Although AHP considers the cost, its algorithm is too complicated. Wanda F. Rice and C. Richard Cassady [4] applied the idea of hardware reliability to the software system: according to the system-resource constraints, they set up an optimization model and calculate the redundancy degree of each module, then utilize redundancy techniques such as storage and parallel connection to extend each function module to a subsystem; the redundancy degree of each module determines the number of storage modules of each subsystem. The method discussed in this paper tries to assess the reliability of each module by modeling, as well as to minimize the software development cost [11]. After the reliability of each module is ensured, software engineers allocate the resources according to the reliability allocation situation, in order to know which modules fulfill the requirement and which modules do not and still need more resources. The reliability allocation model presented in this paper guarantees the optimal use of system resources while allocating the reliability. During reliability allocation, the module development cost is an important factor to consider. Due to differences in module size and design, the reliability improvement complexity of each module is also different; the more difficult the improvement, the more development cost is needed. Thus, when allocating reliability, the reliability improvement complexity of the module is also an important factor to consider.
A model for reliability allocation

Example 1: Suppose a software system is composed of n modules and the minimum reliability that the system can accept is $R_G$. Calculate the reliability of each module, satisfying: 1) the system reliability is no less than $R_G$; 2) the system development cost is minimized. This is a common optimization problem. Transform it to a nonlinear programming problem as follows:

Goal function:

$$\min C = \sum_{i=1}^{n} C_i(R_i)$$

Subject to:

$$R_s \ge R_G, \quad R_{i,\min} \le R_i \le R_{i,\max}, \quad i = 1, 2, \ldots, n$$
where C denotes the system development cost, which is assumed to be the sum of each module's cost. $C_i$ denotes the cost of module i and $R_i$ denotes its reliability; $R_{i,\min}$ is its initial reliability, which has been estimated before reliability allocation by a reliability model, and $R_{i,\max}$ is the maximum reliability it can reach. The system reliability is denoted by $R_s$. In this model, there are two relations needing confirmation: one is the relation between $R_s$ and $R_i$, and the other is the relation between $C_i$ and $R_i$. So the formulation of the system reliability in terms of these module reliabilities is very important. In simple systems, modules connect together serially and independently, and thus the system reliability is the product of the reliabilities of the modules. In a complex system, as the relations between modules are complicated, the system reliability is no longer simply the product of the module reliabilities. How to confirm the relation between $R_s$ and $R_i$ is closely related to the concrete system; there is a detailed discussion in this respect in Ref. [12]. To simplify the problem, suppose the formulation of system reliability is easy to construct; in fact, many researchers have put forward such methods [11]. The relation between $C_i$ and $R_i$ reflects the relation between module development cost and module reliability, which is called the cost estimate function $C_i(R_i)$. If $C_i(R_i)$ is not confirmed, then $R_i$ cannot be confirmed so as to minimize the system development cost C. There are many methods to confirm the relation between $C_i$ and $R_i$; one direct method relies on experience data. However, one has to take into consideration the modules' degree of similarity in all respects: if two modules differ a lot from each other, their relations between $C_i$ and $R_i$ cannot be close. For example, it has been mentioned above that the reliability improvement complexity of each module influences the development cost of that module; if the complexity is different, the relations are also different. The relation between $C_i$ and $R_i$ can be confirmed by analyzing all the factors influencing $C_i$ and formulating $C_i$ with these factors [5-7]. After analyzing the experience data we collected before, we find that the relation between $C_i$ and $R_i$ satisfies the Cost-Reliability model below (we will give a detailed discussion of this formulation in another paper):
$$C_i = C_i(R_i; \ f_i, \ R_{i,\min}, \ R_{i,\max})$$
where $f_i$ denotes the reliability improvement complexity of module i, ranging from 0 to 1, and A is a positive constant whose value can be adjusted according to the concrete situation. The factors influencing $C_i$ consist of the module reliability, the reliability improvement complexity of the module, the initial reliability of module i, the maximum reliability of module i, and so on.
This formulation shows that $C_i$ is a monotonically increasing function of $R_i$: the higher the module reliability is, the higher its development cost is. $C_i$ decreases while $f_i$ increases, because a module with bigger $f_i$ improves its reliability more easily, and the relevant expense is lower. The concrete relation between $C_i$ and $R_i$ when $f_i = 0.1$, $R_{i,\min} = 0.7$, $R_{i,\max} = 0.99$ is shown in Fig. 1:
Figure 1. Relation between $C_i$ and $R_i$ ($f_i = 0.1$). (Note: cost was rescaled into unit cost for convenience, the same as in Fig. 2 below.)
From Fig. 1, it can be found that the module development cost increases faster and faster with the module reliability. When the module reliability approaches its maximum, the module development cost increases rapidly toward infinity. This indicates that improving the module reliability from 95% to 96% is more difficult than improving it from 70% to 80%. In our cost estimate function there are three parameters needing confirmation, denoted by $R_{i,\min}$, $f_i$ and $R_{i,\max}$. In the design phase, software engineers assess the reliability of each module and decide whether or not to invest resources for the system's reliability improvement according to the assessment results. The assessment result of each module in this phase is regarded as the initial reliability of that module, $R_{i,\min}$. It can be imagined that the quality of the assessments has an important influence on the final reliability allocation quality, so reliability modeling and prediction of the software influence the reliability allocation directly. The other two parameters are confirmed as follows:
(1) The reliability improvement complexity of module i, denoted by $f_i$: the reliability improvement complexity of module i is also called the feasibility $f_i$. Owing to technical restrictions and the quality of the design, the reliability improvement complexity of each module is different. It is not easy to confirm the module's reliability improvement complexity in actual projects. Experience data benefit a lot when estimating this parameter. Besides, allocating distinct weights to relevant factors to quantify the "feasibility" parameter is also considered an important estimation method. Besides the reliability improvement complexity of module i, the factors related to $f_i$ include the operational profile of the module, the module's key degree in the system, and so on.
(2) The maximum reliability of module i, denoted by $R_{i,\max}$: When there is no other basis, we can think that the maximum reliability of a module is 100%, but this is impossible in actual engineering. Restrictions on techniques and funds cause the maximum reliability of the module to be less than 100%. Because the initial reliability of the module has been confirmed, once the maximum reliability is confirmed, the span block of the module reliability is confirmed. From the relationship between the development cost of module i and its reliability shown in Fig. 1, it can be found that in this span block the development cost of the first half increases very slowly, while the development cost of the latter half, especially next to the maximum reliability, increases rather rapidly, reaching infinity at the maximum reliability. Under the condition that the other factors are all the same, to get the same reliability improvement, the module with the larger $R_{i,\max}$ needs less development cost than the module with the smaller $R_{i,\max}$. Hence, $R_{i,\max}$ is one of the factors that influence the software development cost. Fig. 2 is presented to illustrate how the maximum reliability influences the software development cost.
Figure 2. Relation between $C_i$ and $R_i$ ($f_i = 0.1$).
It shows that, when the initial reliability of the module and the value of $f_i$ are the same, the module with the larger maximum reliability has more space to improve its reliability, while in the same reliability block its development cost is less than that of the other modules. Fig. 2 shows the general relationship between $C_i$ and $R_i$. When estimating the relation of $C_i$ and $R_i$ in a certain system, we should adopt a unified cost estimate function: since different functions have different emphases, the results would lack comparability because of different estimation standards.

Model solution

In the model above, the complexity of our cost estimate function makes it difficult to obtain the optimal solutions with common techniques [8]. However, the genetic algorithm (GA) transforms the problem-solving course into a series of operations on "chromosomes"; through the evolution of the population, it finally converges on the optimal solutions of the problem. GA searches more than one point of the solution space at the same time, so that it converges over the whole space very fast [9]. As a result, here we adopt the genetic algorithm to solve the model. The detailed steps of GA are as follows:
Encode the decision variables with a binary mode of a certain number of bits to form a gene chain; every gene chain represents an individual, and the encoding operation uses bit-string encoding to map the solution space onto a bit-string space. At t = 0, randomly generate n gene chains to form an initial population P(0), which represents a set of possible solutions of this optimization problem. Beginning with this initial population, GA simulates evolution, selecting the superior and eliminating the inferior, searching for excellent individuals. Following the code rules, put the independent variables of population P(t) corresponding to each individual gene chain into the goal function and calculate the function value $F_i$; the smaller $F_i$ is, the higher the fitness of the individual. According to a certain probability, select M pairs of individuals from the population P(t) to propagate their later generation; as a member of the population, the probability of being chosen is proportional to its fitness. Carry out hybridization of the parents chosen at random: the simplest method is to choose a cut point at random, cut the gene chains of the parents at this point, and then exchange the tails. The number of individuals newly generated by hybridization is still n. According to a certain probability $P_m$, select several individuals at random from the n newly generated chromosomes; for the individuals that have been chosen, choose a certain bit to negate at random.

GA solves the optimal allocation problem of the reliability in the following steps:

1) GA is generally used to solve unconstrained optimization problems, but the reliability allocation model described in this paper is a constrained programming model. Therefore, we adopt the penalty function method to bring the constraints into the original goal function and generate a new function as follows:

$$\min C' = C + \sum_{i=1}^{n} M_i$$

where, if constraint i of the model is satisfied, then $M_i = 0$; otherwise $M_i = 100000$. The purpose is that, by raising the goal function values corresponding to the infeasible points, these infeasible points cannot become optimal solutions.

2) The constraints show that $0 < R_i < 1$; thus a 10-bit binary mode can be used to express the decision variable $R_i$. As a result, there are 1024 discrete points in the block [0,1], and the value of each point is:

$$x_j = \frac{j - 1}{1024 - 1} \quad (j = 1, 2, 3, 4, \ldots, 1024)$$

If there are m decision variables, then a binary mode of 10m bits is needed to express a solution. The solution space consists of $1024^m$ solutions; GA has to find an optimal solution in this extensive solution space. GA can find the optimal solution theoretically, but the amount of calculation is tightly related to the precision of the result: the higher the precision of the result, the larger the amount of calculation.

3) At t = 0, produce an initial population of 200 chromosomes.

4) Decode these 200 chromosomes and evaluate their individual fitness with the goal function C'. If the optimal solutions have not changed for 20 generations in succession, go to 9).

5) According to the fitness of the chromosomes, choose 200 pairs of chromosomes for seeds.
6) Carry out hybridization of these pairs and produce 200 new chromosomes.

7) According to a certain probability (set $P_m = 0.04$), select three individuals from the 200 new chromosomes for mutation; the bit is chosen at random.

8) Set t = t + 1, then go to 4).

9) Obtain the optimal solutions, and then stop the computation.

Example Analysis

Suppose a simple system is composed of three modules; the modules connect together serially and independently, and the minimum reliability that the system can accept is 90% (assuming the time interval is 100 hours). The reliability of each module is $R_1, R_2, R_3$. Above all, the first step is to construct the formulation of system reliability, which amounts to confirming the parameter $R_s$ of the reliability allocation model. Supposing the system is simple and serially connected, the system reliability is the product of the module reliabilities; thus
$$R_s = R_1 \cdot R_2 \cdot R_3$$

Next, substitute all the parameters into the reliability allocation model. These parameters include the system reliability $R_s$ and the cost estimate function $C_i(R_i)$. After that, the reliability allocation model takes the following form (set A = 1):

$$\min C = \sum_{i=1}^{3} C_i(R_i)$$

Subject to:

$$R_1 R_2 R_3 \ge R_G, \quad R_{i,\min} \le R_i \le R_{i,\max}, \quad i = 1, 2, 3 \qquad (3)$$
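The search itself is easy to prototype. The sketch below follows the GA steps above for this three-module example (using the situation 1 parameters given below). Since the paper's exact Cost-Reliability formula is detailed elsewhere, the cost function here is an illustrative stand-in with the stated properties (increasing in $R_i$, cheaper for larger $f_i$, diverging as $R_i \to R_{i,\max}$); the population size, penalty value and bit encoding follow the text.

```python
import random

# Three-module example (situation 1): R_G = 0.9, R_i in [0.8, 0.999].
R_G = 0.9
R_min, R_max, f = [0.8] * 3, [0.999] * 3, [0.9] * 3
BITS, POP, PENALTY = 10, 200, 100000

def cost(i, r):
    # Illustrative stand-in for the Cost-Reliability model (assumption):
    # increasing in r, cheaper for larger f_i, infinite at R_i,max.
    if r >= R_max[i]:
        return float("inf")
    return (1 - f[i]) * (r - R_min[i]) / (R_max[i] - r)

def decode(chrom):
    # 10 bits per variable -> 1024 grid points mapped onto [R_min, R_max].
    rs = []
    for i in range(3):
        j = int("".join(map(str, chrom[i * BITS:(i + 1) * BITS])), 2)
        rs.append(R_min[i] + (R_max[i] - R_min[i]) * j / (2 ** BITS - 1))
    return rs

def objective(chrom):
    # Penalized goal function C' = C + M (M = PENALTY if infeasible).
    rs = decode(chrom)
    c = sum(cost(i, r) for i, r in enumerate(rs))
    return c + (PENALTY if rs[0] * rs[1] * rs[2] < R_G else 0)

pop = [[random.randint(0, 1) for _ in range(3 * BITS)] for _ in range(POP)]
best = min(pop, key=objective)
for gen in range(200):
    weights = [1.0 / (1e-9 + objective(ch)) for ch in pop]  # proportional
    parents = random.choices(pop, weights=weights, k=POP)
    children = []
    for a, b in zip(parents[0::2], parents[1::2]):
        cut = random.randrange(1, 3 * BITS)                 # single-point crossover
        children += [a[:cut] + b[cut:], b[:cut] + a[cut:]]
    for ch in random.sample(children, 3):                   # mutate 3 individuals
        k = random.randrange(3 * BITS)
        ch[k] ^= 1
    pop = children
    best = min(pop + [best], key=objective)

print([round(r, 4) for r in decode(best)], round(objective(best), 4))
```

With symmetric modules and any monotonically increasing cost, the product constraint binds at the optimum, so the search should land near the equal allocation $R_1 = R_2 = R_3 = 0.9655$ reported below for situation 1.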
Finally, make the computational analysis, considering the following situations (set t = 100 hrs in the following discussion):

1) Suppose the time-to-failure probability of the three modules follows the Weibull distribution, with distribution function $F(t) = 1 - \exp(-(t/\eta)^\beta)$, where $\beta = 1.318$ and $\eta = 312$ hrs (parameter $\eta$ denotes the MTBF); set the reliability improvement complexity of the three modules the same, $f_i = 0.9$.

2) The time-to-failure probability of the modules follows the same Weibull distribution, but the reliability improvement complexity of each module is 0.9, 0.5 and 0.1 respectively.

3) Suppose the initial reliability of each module is 0.7, 0.8 and 0.9, and the reliability improvement complexity of each module is the same, 0.9.

4) The initial reliability of each module is 0.7, 0.8 and 0.9; the reliability improvement complexity of each module is set as 0.9, 0.5 and 0.1.

5) The initial reliability of each module is 0.7, 0.8 and 0.9; the reliability improvement complexity of each module is set as 0.1, 0.9 and 0.5.

In these five situations, the maximum reliability that all modules can reach is 0.999. Each situation is emulated below, calculating its reliability allocation solutions and analyzing the results. Under situation 1), all modules follow the Weibull distribution with parameters $\beta = 1.318$, $\eta = 312$ hrs. So within 100 hours, the successful run probability of module i is $R_i(100) = \exp(-(100/312)^{1.318}) = 0.8$; this is the initial reliability of module i, $R_{i,\min}$. Substitute the parameters into Eq. (3) and calculate with the method described above. After calculation, the optimal reliability allocation solution under situation 1) is obtained as $R_1 = R_2 = R_3 = 0.9655$. It means that to make the system reliability reach 0.9, the reliability of each module should be no less than 0.9655. In this case, as a result of the modules' similarity, the same amount of reliability is allocated to each of them. The calculation results of the other situations are listed in the following table:
From the comparison of the reliability allocation results in situations 1) and 2), we can find that module 1 obtains more reliability, due to its lower reliability improvement complexity than the other two modules. It indicates that, when the other conditions are the same, the module whose reliability is easier to improve tends to obtain more reliability. Under situation 3), the reliability improvement complexity of the three modules is the same, but module 1, whose initial reliability is the least, obtains the most reliability: when $R_{i,\max}$ is equal, the module whose initial reliability $R_{i,\min}$ is smaller can have its reliability improved more and, as mentioned above, the development cost of this module is less than that of the other modules. From another point of view, a module's key degree in the system is another factor influencing the module's reliability allocation. A module's key degree can be calculated as

$$h(i) = \frac{dR_s}{dR_i}$$

The higher a module's key degree is, the more influence its reliability has on the system. We can calculate the key degrees of the three modules in situation 3) as follows: $h(1) = 0.72$, $h(2) = 0.63$, $h(3) = 0.58$. Obviously module 1 has the highest key degree, which indicates that its reliability has the most influence on the system reliability; so module 1 obtains the most reliability. Compared to situation 3), module 1 gets more reliability in situation 4), because in situation 4) module 1 has the biggest feasibility of reliability improvement, so it has the opportunity to get more reliability than the other two modules, while in situation 3) the reliability improvement complexities are the same. In the last situation, with the smallest feasibility of reliability improvement, module 1 becomes the hardest one in which to improve reliability. As a result, module 1 is allocated less reliability than in situations 3) and 4), but still the most of the three modules. This is mainly because, with less initial reliability, module 1 has more potential to improve its reliability, while the other two modules' initial reliabilities are both close to their maximum reliabilities, so the cost of improving their reliabilities increases rapidly, reaching infinity when their reliabilities reach the maximums. So module 1 still obtains the most reliability. It also indicates that the feasibility of reliability improvement of a module, $f_i$, is not the only factor influencing the reliability allocation.

Conclusion
Software reliability allocation is an important procedure of software product design and software reliability design. It is tightly related to software reliability modeling and prediction: the prediction precision of the software reliability model [10, 11] influences the quality of the final reliability allocation solution directly. The reliability allocation model presented in this paper minimizes the system development cost, as well as guaranteeing that the software system reliability can achieve the goal. This model can be applied not only to simple systems but also to complex systems, depending first on the confirmation of the model's parameters. As to how to confirm the parameters, further exploration is still needed.

References
1. Zahedi F. and Ashrafi N., Software Reliability Allocation Based on Structure, Utility, Price and Cost, IEEE Trans. Software Engineering, 17(4), 345-356 (1991).
2. Xi-Zi Huang, Software Reliability Allocation, FMEA and FTA, Quality and Reliability, 12-18 (1994.5).
3. Zahedi F., The Analytic Hierarchy Process: A Survey of the Method and Its Applications, Interfaces, 16(4), 96-108 (1986).
4. Wanda F. Rice and C. Richard Cassady, Simplifying the Solution of Redundancy Allocation Problems, Proceedings of the Annual Reliability and Maintainability Symposium (1999).
5. Kececioglu Dimitri, Reliability Engineering Handbook, Volume 2, PTR Prentice Hall (1991).
6. Aggarwal K.K. and Gupta J.S., On Minimizing the Cost of Reliability Systems, IEEE Trans. on Reliability, R-24(3), 205-218 (1975).
7. Guo-Wei He, Software Reliability, Beijing, China: National Defense Industry Press (1998).
8. Bao-Guang Liu, Nonlinear Programming, Beijing, China: Beijing Institute of Technology Press (1998.12).
9. Zheng-Jun Pan, Li-Shan Kang and Shu-Ping Chen, Beijing, China: Beijing University Press (1998).
10. Ohtera H. and Yamada S., Optimal Allocation and Control Problem for Software Testing-Resources, IEEE Trans. on Reliability, 39(2) (1990).
11. Ren-Zuo Xu, M. Xie, Ren-Jie Zheng, Software Reliability Models and Applications, Beijing: Tsinghua University Press (1994.1).
12. Xiao-Qing Yang, Technology of Software Reliability Allocation, Thesis for Master Degree, Wuhan University (1994.4).
RELIABILITY OF A SERVER SYSTEM WITH ACCESS RESTRICTION
M. IMAIZUMI
College of Business Administration, Aichi Gakusen University, 1 Shiotori, Oike-cho, Toyota 471-8532, Japan
E-mail: imaizumi@gakusen.ac.jp

M. KIMURA
Department of International Culture Studies, Gifu City Women's College, 7-1 Hitoichibakita-machi, Gifu 501-0192, Japan
E-mail: kimura@gifu-cwc.ac.jp

K. YASUI
Department of Management and Information Systems, Aichi Institute of Technology, 1247 Yachigusa, Yakusa-cho, Toyota 470-0392, Japan
E-mail: yasuilab@aitech.ac.jp

As the Internet has greatly developed, the demand for improving the reliability of Internet-based systems has increased. An Internet-based system is a system where a Web server receives various requests from clients through the Internet, executes their request processing and returns results to clients. This paper considers a stochastic model where a server system has some access restriction. To prevent errors, the server restores the state of the resource after it completes the request processing m times. The mean time to complete the request processing m times is analytically derived. Further, an optimal policy which maximizes the expected profit is discussed. Finally, a numerical example is given.
1. Introduction
As the Internet has greatly developed and rapidly spread all over the world, the demand for improving the reliability of Internet-based systems has increased. An Internet-based system is a system where a Web server receives various requests from clients through the Internet, executes their request processing and returns results to clients. Recently, auctions, commercial transactions, database processing, and so on, are performed on the Internet, and the Internet-based system has become indispensable as a social infrastructure. The authors have recently considered the reliability on the client side of the Internet-based system, and have proposed a policy to prevent the system from failure by illegal code [1]. We have also shown how to perform security inspection for system calls. In the Internet-based system, the reliability on the server side is as important as the reliability on the client side. There exists a serious problem with illegal access, such as cracking, which attacks a server intentionally. The buffer overflow attack, which sends dangerous code and hijacks control of a server, is well known as a typical one [2]. In order to cope with this problem, several schemes have been considered [3-5]. As one of the schemes to minimize damage by crack attacks, imposing access restriction on servers has been proposed [2].

This paper considers a stochastic model of a server system with access restriction: To protect against crack attacks performed by a part of the clients, the server stores the state of the resource before it executes the request processing. When the server executes the request processing, a part of the control function of the server becomes unusable with some probability, due to illegal code contained in the request processing. In this case, some parts of memory and files are destroyed and errors of a resource occur according to a certain probability distribution. To prevent such errors, the server restores the state of the resource after it completes the request processing m times. The mean time to complete the request processing m times is analytically derived. Further, an optimal policy which maximizes the expected profit is discussed. Finally, a numerical example is given.

2. Model
A web server executes various requests from clients. The requested files are HTML files, binary files, files generated by CGI programs, and so on. To protect against crack attacks performed by a part of the clients, a server stores the state of the resource, restricts the access, and executes the request processing. Further, there are buffer overflow attacks which hijack the control of a server, and attacks on the execution environment, such as an environment variable of a server. By these attacks, when the argument of a program is changed, attackers may become able to execute arbitrary commands. Most attacks are caused by programming errors of servers [2]. We consider the Internet-based system under the following assumptions:

(1) After a server begins to operate, stores the initial states of memory and registers, and restricts access to itself, the server waits for requests from clients. The time for restriction of the access has a general distribution A(t) with finite mean a, requests occur according to an exponential distribution B(t) with finite mean 1/b, and the time for request processing has an exponential distribution D(t) with finite mean 1/d.

(2) When a server executes the request processing, a part of the control function of the server becomes unusable with probability p (0 < p < 1), due to illegal code contained in the request processing. This state is detected by the operating system after the server executes the request processing. In this case, the server keeps accepting requests and executes processing with a part of the control function unusable. If the server cannot use a part of the control function, some parts of memory and files are destroyed by illegal access from outside, and errors of a resource occur according to an exponential distribution F(t) with finite mean $1/\lambda$. These errors are immediately and certainly detected. In this case, the server is maintained and restarts again from the beginning of system operation. The time from occurrence of errors to restart has a general distribution V(t) with finite mean v. If errors do not occur, the server continues to execute the request processing.

(3) To prevent errors, the server executes restoration processing to an initial state of the resource after it completes the request processing m times.

Under the above assumptions, we define the following states of the system:

State 0: The system begins to operate.
State 1: Request processing starts.
State $2_k$: Request processing starts with a part of the control function of the server unusable due to illegal code, where k (k = 0, 1, ..., m-1) denotes the number of times the server has completed the request processing in State 1.
State 3: The server completes the request processing m times.
State 4: Errors of a resource occur.

The system states defined above form a Markov renewal process where State 3 is an absorbing state. The transition diagram between system states is shown in Figure 1. Let $Q_{i,j}(t)$ ($i = 0, 1, 2_k, 4$; $j = 0, 1, 2_k, 3, 4$) ($k = 0, 1, \ldots, m-1$) be the one-step transition probabilities of the Markov renewal process, and let $\phi(s)$ denote the Laplace-Stieltjes (LS) transform of any function $\Phi(t)$, i.e., $\phi(s) = \int_0^\infty e^{-st} d\Phi(t)$ for $\mathrm{Re}(s) > 0$. Then, by a method similar to that of Osaki [6], the LS transforms of the one-step transition probabilities follow from the mass functions given in Appendix A.
Figure 1. Transition diagram between system states.

First, we derive the mean time $\ell(m)$ from the beginning of system operation until the server completes the request processing m times. Let $H_{0,3}(t)$ be the first-passage time distribution from State 0 to State 3. Then, we have

$$H_{0,3}(t) = Q_{0,1}(t) * Q_{1,3}(t) + Q_{0,1}(t) * \sum_{k=0}^{m-1} Q_{1,2_k}(t) * Q_{2_k,3}(t) + Q_{0,1}(t) * \sum_{k=0}^{m-1} Q_{1,2_k}(t) * Q_{2_k,4}(t) * Q_{4,0}(t) * H_{0,3}(t) \qquad (7)$$

Taking the LS transforms on both sides of (7) and rearranging them, we obtain the LS transform $h_{0,3}(s)$.
Thus, the mean time $\ell(m)$ until the server completes the request processing m times is

$$\ell(m) = \lim_{s \to 0} \frac{-dh_{0,3}(s)}{ds} = \frac{a + m\left(\frac{1}{b} + \frac{1}{d}\right) + \left(\frac{1}{\lambda} + v\right)\sum_{k=0}^{m-1}(1-p)^k p\,(1 - z^{m-k})}{1 - \sum_{k=0}^{m-1}(1-p)^k p\,(1 - z^{m-k})} \qquad (9)$$

where $z = \hat{b}(\lambda)\hat{d}(\lambda)$ and $0 < z < 1$. Next, we derive the expected number of occurrences of errors. The expected number $M(t)$ of occurrences of errors until the server completes the request processing
m times during $(0, t]$ is given by the following equation:

$$M(t) = Q_{0,1}(t) * \sum_{k=0}^{m-1} Q_{1,2_k}(t) * Q_{2_k,4}(t) * [1 + Q_{4,0}(t) * M(t)] \qquad (10)$$

The LS transform $m(s)$ of $M(t)$ follows by taking transforms on both sides. Thus, the expected number $M(m)$ of occurrences of errors until the server completes the request processing m times is given by

$$M(m) = \lim_{s \to 0} m(s) = \frac{\sum_{k=0}^{m-1}(1-p)^k p\,(1 - z^{m-k})}{1 - \sum_{k=0}^{m-1}(1-p)^k p\,(1 - z^{m-k})} \quad (m = 1, 2, \ldots) \qquad (12)$$
3. Optimal Policy

We obtain the expected profit and discuss an optimal policy which maximizes it. Let $c_1$ be the cost for an occurrence of errors and $c_2$ the profit for execution of request processing. Then, we define the expected profit $C(m)$ ($C(m) \ge 0$) until the server completes the request processing m times as

$$C(m) = c_2 m - c_1 M(m) \qquad (13)$$

We seek an optimal $m^*$ which maximizes $C(m)$. We put formally $A(m) = 1/C(m)$ and seek $m^*$ which minimizes $A(m)$. From the inequality $A(m+1) - A(m) \ge 0$, we have

$$\frac{p(1-z)\sum_{k=0}^{m-1}(1-p)^k z^{m-k}}{\left[1 - \sum_{k=0}^{m}(1-p)^k p\,(1 - z^{m+1-k})\right]\left[1 - \sum_{k=0}^{m-1}(1-p)^k p\,(1 - z^{m-k})\right]} \ge \frac{c_2}{c_1} \quad (m = 1, 2, \ldots) \qquad (14)$$
Denoting the left-hand side of (14) by $L(m)$: if $L(m) > L(m-1)$, then $L(m)$ is strictly increasing in m from $L(1)$. Thus, we have the following optimal policy:

(1) If $L(1) < c_2/c_1$ and $L(m) > L(m-1)$, then there exists a finite and unique minimum $m^*$ (> 1) which satisfies (14).
(2) If $L(1) \ge c_2/c_1$ and $L(m) > L(m-1)$, then $m^* = 1$.
Table 1. Optimal number $m^*$ to maximize $C(m)$.
4. Numerical Example
We compute numerically the optimal number $m^*$ from (14). Suppose that the mean time 1/d for request processing is a unit of time, in order to investigate the relative tendency of the performance measures. It is assumed that the mean time for restriction of the access is $a/(1/d) = 1$, the mean time to request occurrence is $(1/b)/(1/d) = 10$, the mean time to error occurrence is $(1/\lambda)/(1/d) = 100$ to 1500, the mean time from occurrence of errors to restart is $v/(1/d) = 1$, the probability that a part of control cannot be used is p = 0.05 to 0.20, the profit for execution of request processing is a unit of profit, and the cost rate of occurrence of errors is $c_1/c_2 = 10$ to 100. Table 1 gives the optimal execution number $m^*$ which maximizes the expected profit. For example, when p = 0.05, $(1/\lambda)/(1/d) = 1000$ and $c_1/c_2 = 20$, the optimal number is $m^* = 160$. This indicates that $m^*$ increases with $(1/\lambda)/(1/d)$, but decreases with p and $c_1/c_2$. This can be interpreted as follows: when the cost for occurrence of errors is large, $m^*$ decreases with $c_1/c_2$, so that errors should not occur. Table 1 also shows that $m^*$ depends little on p when p is large.
We have investigated the stochastic model where a server system with access restriction and have discussed the optimal policy which maximizes the expected profit until a server completes the request processing at m times. LFrorn the numerical example, we have shown that the optimal execution number increases with the mean time to occurrence of errors, however, decreases with the probability that a part of control can not be used and the cost for occurrence of errors. It would be very important to evaluate and improve the reliability of a server
187
system. The results derived in this paper would be applied in practical fields by making some suitable modification and extensions. Further studies for such subject would be expected.
appendix A.Derivation of mass function q the mass functionq t from state i at time o to state j at time t are given by the follinge2quaqtions;ow
&2,,4(t)
=
J
c
t m-1-k
0
[ ~ ( z*)D ( % ) I (*~[)I - ~ ( z* )~ ( z ) l d ~ ( z ) ,
j=o
Q4,0(t) = V ( t ) , (A.5) where the asterisk mark denotes the Stieltjes convolution, ~ ( ~ )denotes ( t ) the ifold Stieltjes convolution of a distribution a ( t ) with itself, i.e., a ( i ) ( t )= c ~ ( ~ - ' ) ( t ) * a ( t ) , a ( t )* b ( t ) = J; b(t - u)da(u).
References 1. M. Imaizumi, M. Kimura and K. Yasui, Reliability Analysis for an Applet Execution Process, The Transactions of the Institute of Electronics, Information and Communication Engineers of Japan, J87-A, 375-381 (2004). 2. K. Kenichi and S. Chiba, A Secure Mechanism for Changing Access Restrictions of Servers, The Transactions of Information Processing Society of Japan, 42, 1492-1502 (2001). 3. K. KouheiCG. Mansfield, Illegal Access Detection on the Internet, The Transactions of the Institute of Electronics, Information and Communication Engineers of japan, J83-B, 1209-1216 (2000). 4. M. Asaka, T . Onabuta, T. Inoue, S. Okazawa and S. Goto, Remote Attack Detection Method in IDA: MLSI-Based Intrusion Detection with Discriminant Analysis, The Transactions of the Institute of Electronics, Information and Communication Engineers of Japan, J85-B, 60-74 (2002). 5. Y. Takei, K. Ohta, N. Kato and Y. Nemoto, Detecting and Tracing illegal Access by using Traffic Pattern Matching Technique, The Transactions of the Institute of Electronics, Information and Communication Engineers of Japan, J84-B, 1464-1473 (2001). 6. S. Osaki, Applied Stochastic System Modeling, Springer-Verlag Berlin (1992). 7. K. Yasui, T. Nakagawa and H.Sando, Reliability Models in Data Communication Systems, Stochastic Models in Reliability and Maintenance(edited by S. Osaki), SpringerVerlag, Berlin, 281-301 (2002).
This page intentionally left blank
CONTINUOUS-STATE SOFTWARE RELIABILITY GROWTH MODELING WITH TESTING-EFFORT AND ITS GOODNESS-OF-FIT*
s. INOUE+ AND s. YAMADA Department of Social Systems Engineering, Faculty of Engineering, Tottori University, 4-101 Minami, Koyama-cho, Tottori-shi, Tottori 680-8552, JAPAN E-mail: { ino, yamada} @sse.tottori-u.ac.jp
We propose a continuous-state software reliability growth model with testing-effort and conduct its goodness-of-fit evaluation. A testing-effort is well-known as a key factor being related to the software reliability growth process. We also discuss a parameter estimation method for our model. Then, several software reliability assessment measures are derived from the probability distribution of its solution process, and we compare our model with existing continuous-state software reliability growth models in terms of goodness-of-fit by using actual fault count data.
1. Introduction
A software reliability growth model (SRGM)4,9,10has been utilized to assess software reliability of the products quantitatively since 1970’s. Continuous-state space SRGM’s are proposed to assess software reliability for large scale software systems. Tanaka et a1.’ have discussed a framework of the continuous-state space software reliability growth modeling based on stochastic differential equations of ItG type, Yamada et aL8 have compared the continuous-state space SRGM with the nonhomogeneous Poisson process models. However, these continuous-state space SRGM’s have not taken testing-effort into consideration. The testing-effort5 such as number of executed test-cases, attained testing-coverage, and CPU hours expended in the testing phase is well-known as one of the most important factors being related t o the software reliability growth process. Under the above background, there is necessity t o discuss a testing-effort *This work is partially supported by the Japan Society for the Promotion of Science, Grant-in-Aid for Scientific Research (C)(2). Grant No. 15510129. ?The first author is financially supported by the Sasakawa Scientific Research Grant from the Japan Science Society, Grant No. 16-064.
189
190
dependent SRGM on a continuous-state space for t,he purpose of developing a plausible continuous-state space SRGM. This paper discusses continuous-state space modeling with the testing-effort factor by applying methematical technique of stochastic differential equations of ItG type. Concretely, we extend a basic differential equation describing the behavior of the cumulative number of detected faults to stochastic differential equations of It'd type considering with the testing-effort, and derive its solution process which represents the fault-detection process. Then, we discuss parameters estimation methods for our models. Finally, several software reliability assessment measures are derived by utilizing a probability distribution of the solution process, and we compare our model with existing continuous-state software reliability growth models in terms of goodness-of-fit by using actual fault count data. 2. Continuous-state space SRGM In this section we discuss a framework of continuous-state space software reliability growth modeling. Letting N ( t ) be a random variable which represents the number of faults detected up t o time t , we can derive the following linear differential equation from the common assumption for software reliability growth modeling6:
where b ( t ) indicates the fault-detection rate at testing time t and is assumed to be a non-negative function, and a the initial fault content in the software system. Eq.(l) describes the behavior of the decrement of the fault content in the software system. Especially, in the large-scale software development, a fault-detection process in an actual testing phase is influenced by several uncertain testing factors such as testing-skill, debugging environment, and so forth. Accordingly, we should take these factors into consideration in software reliability growth modeling. Thus, we extend Eq.(l) to the following equation: d N-( t ) - { b ( t )
dt
+ E(t)}{a
-
N(t)},
where [ ( t ) is a noise that exhibits an irregular fluctuation. For the purpose of making its solution a Markov process, [ ( t )in Eq.(2) is given as E(t) = a y ( t )
(a > 01,
(3)
where (T indicates a positive constant representing magnitude of the irregular fluctuation and y a standardized Gaussian white noise. We transform Eq.(2) into the following stochastic differential equation of It'd type3: 1 d N ( t ) = { b ( t ) - - a z } { a - N ( t ) } d t+ a { a - N ( t ) } d W ( t ) , 2
(4)
191
where W ( t ) is a one-dimensional Wiener process which is formally defined as an integration of the white noise y(t) with respect to time t. The U’iener process W ( t ) is a Gaussian process, and has the following properties: (a) Pr[W(O) = 0] = 1, (b) E[W(t)l = 0, (c) E[W(t)W(t’)]= min[t, t’], where Pr[ ‘1 and E[ .] represent the probability and expectation, respectively. Next, we derive a solution process N ( t ) by using the ItB’s formula. The solution process N ( t ) can be derived as
Eq.(5) implies that the solution process N ( t ) obeys a geometric Brownian motion3. And the transition probability distribution of the solution process N ( t ) is derived as
consequently, by the properties (a)-(c) and the assumption that W ( t )is a Gaussian process. a(.) in Eq.(6) indicates a standardized normal distribution function defined as exp(--)dy. Y2 2
@(z)= -
(7)
By giving an appropriate function by which the software reliability growth process is characterized t o b ( t ) in Eq.(5), we can derirve several SRGM’s. 3. Software Reliability Growth Modeling with Testing-Effort 3.1. Modeling
For the purpose of developing an SRGhil with the testing-effort, we characterize b ( t ) in Eq.(5) as follows:
b(t) = bT(t) = r . s(t)
(0
< T < l),
(8)
where T represents the fault-detection rate per expended testing-effort at testing time t and s ( t ) 3 d S ( t ) / d t in which S ( t ) is the amount of testing-effort expended by arbitrary testing time t . Thus, based on the framework of continuous-state space modeling discussed in the previous section, we can obtain the following solution process:
[
N ( t ) = N T ( ~=) a 1 - exp
{
--T
Jot
= a [1- exp { - r S ( t )
I1
s(7)d.r - oW(t) -
o W ( t ) } .]
(9)
192
The transition probability distribution function of the solution process in Eq.(9) can be derived as
We should specify the testing-effort function s ( t ) in Eq.(8) to utilize the solution process N T ( ~in) Eq.(9) as an SRGM.
3.2. Testing-effortfunction We need t o specify a suitable function for the s ( t ) in Eq.(8). In this paper we describe a time-dependent behavior of testing-effort expenditures in the testing by using a U’eibull curve function5, i.e., s ( t ) = apmtm-’ exp{ -pt”}
(a > 0, p > 0, m > 0),
(11)
then,
S ( t )=
l
S ( T ) ~ T=
a [I - exp{-Ptrn}] ,
(12)
where a is the total amount of expended testing-effort expenditures, p the scale parameter, and m the shape parameter characterizing the shape of the testingeffort function. The Weibull curve function has a useful property t o describe the time-dependent behavior of the expended testing-effort expenditures during the testing in the followings. When m = 1 in Eqs.(ll) and (12), we can obtain the exponential curves. And when m = 2, we can derive Rayleigh curves. Thus, we can see that the Weibull curve function is a useful one as a testing-effort function which can grasp the time-dependent behavior of the expended testing-effort expenditures flexibly. 4. Estimation Methods of Unknown Parameters
We discuss methods of parameter estimation for the testing-effort function in Eq.(ll) and the solution process in Eq.(9), respectively. We suppose that K data pairs ( t j , g j , n j ) ( j = 0 , 1 , 2 , . . . , K ) with respect to the total number of faults, n j , detected during the time-interval (0, t j ] , and the amount of testing-effort expenditures, g j , expended at t j are observed.
4.1. Testing-effortfvnction For a parameter estimation method for the testing-effort function in E q . ( l l ) , we apply a method of least squaresg. First, we can obtain a natural logarithm for E q . ( l l ) as follows: logs(t) = l o g a + l o g p + l o g m + ( m - l ) l o g t - p t ” .
(13)
193
Then, the sum of the squares of vertical distances from the data points to the presumed values is derived as K
S(% P, m) = c { l o g Y j
(14)
-
j=1
p,
by using Eq.(13). The parameter estimates 6, and 6 which minimize S ( a , 0,rn) in Eq.(14) can be obtain by solving the following simultaneous equations:
as as - a s aa ap -dm = 0.
- ---
(15)
4.2. Solution process
Next, we discuss a parameter estimation method for the solution process in Eq.(9) by using a method of maximum-likelihood. Let us denote the joint probability distribution function of the process NT(t) as
P(tl7 nl;t2, n 2 ; .
'
'
; t K ,n K )
Pr[NT(tl)5 nl~NT(t2)5 n 2 , ' " N d t K ) 5 nKINT(0) = 01,
(l6)
and also denote its density as
Since NT(t) in Eq.(9) has a Markov property, we can constract the following logarithmic likelihood function L consequently for the observed data pairs ( t j , n j ) ( j= 0,1,2,..* , K ) : L = logp(tl7nl;t 2 , n2; ' . . ; t K , n K ) K
K
= -)log(a-nj)-KlogO--log27r2 j=l
1 -log(tj-tj-l) 2
We can obtain the maximum likelihood esitmates C, ?, and a^ by solving the following simultaneous likelihood equations numerically:
5. Software Reliability Assessment Measures
We discuss instantaneous MTBF (mean time between software failures or faultdetections) and cumulative MTBF which have been used as the substitutions for the MTBF. First, an instantaneous MTBF is approximately given by
194
0
4
2
6
8 10 12 14 16 18 20 22 24 26 28 30 32 34 35 35 40 TeDIlW Time (number d mamhr)
Figure 1. The estimated testing-effort function.
1400 1300
1
5 3
P
1200 1100 1000 900
800 700
6
600 500
2: 0
_
_
_
_
_
^
^
200 100 0 0
Figure 2.
2
4 6
8 10 12 14 16 18 20 22 24 26 26 30 32 34 35 35 40 TestngTms (number d -ha)
The estimated expected number of detected faults.
We need t o derive E[iV~(t)lwhich represents the expected number of faults detected up to arbitrary testing time t to obtain E[diV~(t)] in Eq.(20). By noting that the Wiener process W ( t ) N ( 0 , t ) , the expected number of faults detected up t o arbitrary testing time t is obtained as
-
- ( r S ( t ) - -a%) 2
I1
.
Then, since the Wiener process has the independent increment property, W ( t )and dW(t)are statistically independent with each other, and E[dW(t)] = 0, E [ d N ~ ( t ) l in Eq.(20) is finally derived as 1 1 E[dN~(t)l= a{rs(t)- -02}exp{-(rS(t) - -a2t)}dt. 2 2 Thus, the instantaneous MTBF in Eq.(20) can be obtained as
(22)
195
TeSfinp lime (number d months)
Figure 3.
The estimated instantaneous and cumulative MTBF’s, respectively.
The cumulative MTBF is approximately derived as
6. Model Comparisons
In this section we show the results of goodness-of-fit comparisons between our model and other continuous-state space SRGM’s’l such as exponential, delayed S-shaped, and inflection S-shaped stochastic differential equations in terms of the mean square errors (MSE)’ and Akaike’s Information Criterion’. As t o the goodness-of-fit comparisons, we use two actual data sets2 named as DS1 and DS2, respectively. DS1 and DS2 indicate an S-shaped and exponential reliability growth curves, respectively. Table 1 shows the results of model comparisons. We can see that our model improves a performance of the MSE and the AIC compared with other continuousstate space SRGM’s discussed in this paper, especially for DS1. 7. Numerical Examples
We show numerical examples by using testing-effort data recorded along with detected fault count data collected from the actual testing. In this testing, 1301 faults are totally detected and 1846.92 (testing hours) are totally expended as the testing-effort within 35 months2 Figure 1 shows the estimated testing-effort function Z ( t ) in Eq.(ll) in which the parameter estimates Ei = 2253.2, p^ = 4.5343 x lop4, and f i = 2.2580. Figure 2 shows the estimated expected number of detected faults in Eq.(21) where the parameter estimates in g [ N ~ ( tare ) ] obtained as ii = 1435.3, F = 1.4122 x and 6 = 3.4524 x 10W2. Furthermore, Figure 3 shows the time-depedent behavior of the estimated instantaneous and cumulative MTBF’s in Eqs.(20) and (24), respectively. From Figure 3, we can see that the software reliability decreases in the early testing period, and then, the software reliability grows as the testing procedures go on.
196 Table 1. T h e results of model comparisons. Proposed model MSE
AIC
Exponential SDE model
Delayed S-shaped SDE model
Inflection S-shaped SDE model
DS1 DS2
1367.63
22528
1370.8
1332.34
6018.65 36549
6550.37 1986.8
DSl
306.15
DS2
125.51
325.32 125.18
315.98 131.65
318.57 126.47
(SDE : stochastic differential equation)
8. Concluding Remarks
In this paper we have dicussed a continuous-state space SRGM with testingeffort by using mathematical technique of stochastic differential equations and its parameters estimation methods. Then, we have presented numerical illustrations for the software reliability assessment measures and also conducted goodness-of-fit comparisons by using actual data sets. Further studies are needed t o evaluate for our model by using more observed data.
References 1. H. Akaike, “A new look at the statistical model identification,” IEEE Trans. Auto. Cont., AC-19, pp. 716-723 (1974). 2. W.D. Brooks and R.W. Motley, “Analysis of Discrete Software Reliability Models,” Technical Report RADC-TR-80-84, Rome Air Development Center, New York (1980). 3. B. Bksendal, Stochastic Differential Equations An Introduction with Applications. Springer-Verlag, Berlin (1985). 4. S. Yamada and S. Osaki, “Software reliability growth modeling Models and applications,” IEEE Trans. Soft. Eng., SE-11, pp. 1431-1437 (1985). 5. S. Yamada, H. Ohtera, and H. Narihisa, “Software reliability growth models with testing-effort,” IEEE Trans. Reliab., R-35,pp. 19-23 (1986). 6. J.D. Musa, D. Iannio, and K. Okumoto, Software Relaability :Measurement, Prediction, Application. McGraw-Hill, New York (1987). 7. H. Tanaka, S. Yamada, S. Kawakami, and S. Osaki, “On a software reliability growth model with continuous error domain - Application of a linear stochastic differential equation -,”(in Japanese), Trans. IEICE, J74-A, pp. 1059-1066 (1991). 8. S. Yamada, M. Kimura, H. Tanaka, and S. Osaki, “Software reliability measurement and assessment with stochastic differential equations,” IEICE Trans. Fundamentals., E77-A, pp. 109-116 (1994). 9. K. Pham,Software Reliability. Springer-Verlag, Singapore (2000). 10. S. Yamada, “Software reliability models,” in Stochastic Models in Reliability and Maintenance ($3. Osaki, Ed.), Springer-Verlag, Berlin, pp. 253-280 (2002). 11. S. Yamada, A. Nishigaki, and M. Kimura, “A stochastic differential equation model for software reliability assessment and its goodness-of-fit,’’ Intern. J. Reliab. and Applic., 4, pp. 1-11 (2003).
ANALYSIS OF DISCRETE-TIME SOFTWARE COST MODEL BASED ON NPV APPROACH*
K . IWAMOTO~,T. DOHI+ AND N. K A I O ~ t Department of Information Engineering, Hiroshima University 1-4-1 Kagamiyama, Higashi-Hiroshima 739-8527, JAPAN Department of Economic Informatics, Hiroshima Shudo University 1-1-1 Ozukahigashi, Asaminamiku, Hiroshima 731-3195, JAPAN E-mail: [email protected]/[email protected]
This article concerns the determination problem of the optimal rejuvenation schedule for an operating software system subject to the software aging phenomenon. Especially, we focus on the discrete-time operating environment and derive the optimal software rejuvenation time which minimizes the expected total discounted cost over an infinite time horizon, based on the familiar net present value (NPV) approach. Further, we develop a statistical algorithm to estimate the optimal software rejuvenation time from the complete sample of failure time data.
1. Introduction When software application executes continuously for long periods of time, some of the faults cause software to age due to the error conditions that accrue with time and for load. Software aging will affect the performance of the application and eventually cause it to fail. Software aging has also been observed in widely-used software like Internet Explorer, Netscape and xrn as well as commercial operating systems and middleware. A complementary approach to handle software aging and its related transient software failures, called software rejuvenation, are becoming quite popular [6, 71. Software rejuvenation is a preventive and proactive solution that is particularly useful for counteracting the phenomenon of software aging. It involves stopping the running software occasionally, cleaning its internal state and restarting it. Cleaning the internal state of a software might involve garbage collection, flushing operating system kernel tables, reinitializing internal data structures, and hardware reboot [6]. Huang et al. report the software aging phenomenon in real telecommunications billing application, where over time the application experiences a crash or a hang failure, and propose to perform rejuvenation occasionally. More specifically, they
’
*This work is supported by the grant 15651076 (2003-2005) of Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Exploratory Research, and the Research Program 2004 under the Institute for Advanced Studies of the Hiroshima Shudo University, Japan.
197
198
model the degradation process as a two-step process. From the clean state the software system jumps into a degraded state from which two actions are possible: rejuvenation with return to the clean state or transition to the complete failure state. They propose a continuous-time Markov chain with four states and derive the steady-state system availability and the expected cost per unit time in the steady state. Dohi e t al. extend the original Huang et al. model to semi-Markov models and develop a non-parametric algorithm to estimate the optimal software rejuvenation schedule from the complete sample of failure time. Recently, Aung and her co-author reconsider the ogtimal software rejuvenation based on the concept of survivability. This article treats a software rejuvenation model similar to the one considered by Dohi et al. under a different operation circumstance. Here, we model the dynamic behavior of telecommunication billing applications by using a discrete-time Markov process (DTMP), and determine the optimal software rejuvenation schedule in discrete-time setting. The discrete-time rejuvenation models are analyzed by Dohi et al. 5 , where the expected cost per unit time in the steady state is used as a criterion of optimality. In this article, we discuss the optimal software rejuvenation time which minimizes the expected total discounted cost over an infinite time horizon. The idea on discounted cost is introduced in the literature [4] based on the familiar net present value (NPV) approach. We take account of the both aspects: discrete-time operation environment and discounted cost, in modeling the software cost model with rejuvenation for the telecommunications billing application. Further, we develop a statistical algorithm to estimate the optimal software rejuvenation time from the complete sample of failure time data. Then the statistical device called the modified discrete total time on test (MDTTT) transform and its numerical counterpart will be used. 2. Discrete Software Cost Model Consider the discrete time software model with rejuvenation similar to the one discussed in Dohi e t al. in discrete time. Define the following four states. State 0; highly robust state (normal operation state), State 1; failure probable state, State 2; software rejuvenation state from failure probable state, State 3; failure state. Figure 1 illustrates the transition diagram for the DTMP under consideration. Suppose that the software system is started for operation at time n = 0 and is in the highly robust state (State 0). Let Z be the random time to reach the failure-probable state (State 1) from State 0. Let Pr{Z 5 n } = Fo(n), ( n = 0 , 1 , 2 , . . . ) , having p.m.f. fo(n)and mean po (> 0). Just after the current state transits to State 1, a system failure may occur with a positive probability. Let X be the time to failure from State 1 having c.d.f. Pr{X 5 n } = F f ( n ) , p.m.f. f f ( n )and mean pf (> 0). If the failure occurs before triggering a software rejuvenation, then the recovery operation is started immediately. The time to complete the recovery operation Y is also a positive random variable having c.d.f.
199
Figure 1. Transition diagram of DTMP.
Pr{Y 5 n } = F a ( n ) and mean pa (> 0). Without any loss of generality, it is assumed that after completing recovery operation the system becomes as good as new. On the other hand, the rejuvenation is performed at a constant time interval measured just after entering State 1. The time to invoke the software rejuvenation is given by the integer value no which is the decision variable, and the c.d.f. of the time to complete software rejuvenation are given by F,(.n) with p.m.f. fc(n)and mean pc (> 0). In this article we call no ( 2 0) the sofiware rejuvenation schedule. After completing the rejuvenation, the software system becomes as good as new, and the software age is initiated at the beginning of the next highly robust state. Let cs and cp be the recovery cost from system failure per unit time and the rejuvenation cost per unit time, respectively. Also we define the discrete time discount factor cy E ( 0 , l ) to formulate the NPV of the expected cost function. We make the following two assumptions:
The assumption (1)implies that the probability generating function of recovery time is strictly larger than that of preventive rejuvenation time. Also the assumption ( 2 ) says that the expected discounted cost of recovery operation is strictly larger than that of preventive rejuvenation. These assumptions can be intuitively validated from the economical point of view.
3. NPV Approach We formulate the expected total discounted cost over an infinite time horizon. Focusing on the probabilistic behavior of one cycle (from State 0 to State 0 next), we
200 obtain the present value of mean unit cost after one cycle and the expected total discounted cost for one cycle by
z=O s=O z=no+l
k=l
respectively. Then the expected total discounted cost over an infinite time horizon is given by
Then the problem is to seek the optimal software rejuvenation schedule ng which minimizes TC(n0). It is evident that lima+l(l - a)TC(no)is reduced to the expected cost per unit time in the steady state in [5]. Taking the difference of TC(n0)with respect to no, define the following function: -
4no) =
S(no)va(no + 1) - $(no+ l)Va(no) ano+lFj(nO)
7
(4)
where the discrete failure rate r ( n ) = f f ( n ) / F f ( n -1) is assumed to be a monotone function of n and in general F ( . ) = 1 - F ( . ) . Two special cases in the expected total discounted cost, no = 0 and no 00,are the following: --f
and
The following result gives the optimal software rejuvenation schedule minimizing the expected total discounted cost.
Theorem 3.1. (1) Suppose that the failure time distribution is strictly I F R (increasing failure rate) under the assumptions (1) and (2).
(i) I f q ( 0 ) < 0 and q(m) > 0 , then there ezist (at least one, at most two) optimal software rejuvenation schedules n(; (0 < n;t < co) satisfying q(n6 - 1) > 0
201
and q(nE) 5 0 . Then, the corresponding expected total discounted cost over an infinite time horizon TC(n(;)is given by TCk(nG)5 TC(n6)< TCk(nG f I),
(7)
where
(ii) If q ( 0 ) 2 0, then the optimal software rejuvenation schedule is nc = 0 , i.e. it is optimal to trigger the software rejuvenation just after entering the failure probable state. Then, the minimum expected total discounted cost is given in Eq.(5). (iii) If q ( m ) 5 0, then the optimal software rejuvenation schedule is n; + m, i.e. it is optimal not to carry out the software rejuvenation. Then, the minimum expected total discounted cost is given in Eq. (6). (2) Suppose that the failure time distribution is DFR (decreasing failure rate) under the assumptions ( 1 ) and (2). Then, the expected total discounted cost TC(no)is a concave function of no, and the optimal software rejuvenation schedule is n(;= 0 or n; + m. 4. Statistical Estimation Algorithm
Dohi et al. define the discrete total time on test (DTTT) transform of the discrete probability distribution FJ ( n )by
where F;'(p) = min{n : F f ( n ) > p } - 1 and p j = E,"==,q(n). On the other hand, in this article we define the modified discrete total time on test (MDTTT) transform by
where G - l ( p ) = min{no : 1 - cP°Ff(no)> p } - 1 if the inverse function exists. cu"Ff(z) and lima-l &(p) = + ( p ) . In a fashion Then it is evident that ra = similar to the case of DTTT transform in Eq. (9), it can be shown that F f ( n ) is IFR (DFR) if the function & ( p ) is concave (convex) in p E [0,1]. After some algebraic manipulations, we obtain the following useful result to interpret the underlying optimization problem minosno<mTC(n0)geometrically.
xr=o
202 Theorem 4.1. Obtaining the optimal software rejuvenation schedule no* minimizing the expected total discounted cost TC(n0) is equivalent t o obtaining p* ( 0 5 p* 5 1) so as to
where cs
< czo EL,
{ a ( l - c ) - b ( l - c ) } / { ( l - a ) ( a d - bc)7,}, = a / ( a d - bc) - 1, a = Q k + + " f a ( Y ) f o ( zb) ,= .PC,"=, ~ k + " f c ( s ) f o ( z c) l= CEO ~ y + + " f a ( Y ) f O (and z)>d = C O : CzO ~ Y + + " f C ( S ) f O ( ~ ) . =
c:, ego c:=, ,:c
Theorem 4.1 is the dual of Theorem 3.1. From this result, it is seen that the optimal software rejuvenation schedule no* = G-l ( p * ) is determined by calculating the optimal point p*(O 5 p* 5 1) maximizing the tangent slope from the point (-0, -<) E (-m, 0) x (-m,O) to the curve ( p , 4 ( p ) ) E [O, 11 x [O, 11. Next, suppose that the optimal software rejuvenation schedule has to be estimated from k ordered complete observations: 0 = xo L 5 1 5 5 2 5 . . . 5 x k of the failure times from a discrete c.d.f. Ff( n ) ,which is unknown. Then the analogue of empirical distribution for this sample is given by
Then the numerical counterpart of the MDTTT transform, called the M D T T T statistics, based on this sample, is defined by Oik
Ti/Tk,
1
i = 0 , 1 , 2 , ... , k ,
(13)
#JOk=o.
(15)
where
i=1,2,".,k;
The resulting step function by plotting the points (1 - a z t ( l- i / k ) , & ) (i = 0 , 1 , 2 , . . . , k ) is called the M D T T T plot. The following result gives a statistically non-parametric estimation algorithm for the optimal software rejuvenation schedule under the expected total discounted cost criterion.
Theorem 4.2. Suppose that the optimal software rejuvenation schedule has to be estimated f r o m k ordered complete sample 0 = xo 5 x1 5 x2 5 ' . . 5 %k of the failure times f r o m a discrete c.d.f. F f ( n ) ,which is unknown. Then, a non-parametric estimator of the optimal software rejuvenation schedule fii which minimizes TC(n0) is given by xj., where
203 1
Figure 2.
Estimation of the optimal software rejuvenation schedule.
where
The graphical procedure proposed here has an educational value for better understanding of the maximization problem, and is convenient for performing sensitivity analysis of the optimal software rejuvenation schedule when different values are assigned to the model parameters. The special interest is, of course, t o estimate the optimal software rejuvenation schedule without specifying the failure time distribution. Although some typical theoretical distribution functions such as the negative binomial distribution are often assumed in the discrete reliability analysis, our nonparametric estimation algorithm can generate the optimal software rejuvenation schedule using the complete knowledge about the observed failure times.
5. A Numerical Example We present an illustrative example to estimate the optimal software rejuvenation schedule which minimizes the expected total discounted cost. Suppose that the failure time X obeys the negative binomial distribution with p.m.f.:
where q t ( 0 , l ) and T = 1,2,. . . is a natural number. For the other model parameters, we assume that ( T , q ) = (12, 0.5), c, = 5.0 x 10 [$/day], cp = 3.0 x 10 [$/day], m a ” f a ( y ) = 0.9, Egoa Y f , ( y ) = 0.3, Esz0 a s f c ( s )= 0.8 and a = 0.97.
CEO
204 Figure 2 illustrates a n estimation result of the optimal software rejuvenation schedule, where 200 failure time data are generated from the negative binomial distribution in Eq.(19). For this simulation data (negative binomial distributed random number), the estimates of the optimal rejuvenation schedule and the minimum expected total discounted cost are fi: = 29 = 19 and TC(fiL;)) = 234.009, respectively. On the other hand, under the same model assumption, we can calculate numerically the optimal software rejuvenation schedule if the system failure time distribution is known. Since p* = 0.37788 has the maximum slope from (-El -p) = (-0.484309, -0.447754), the real optimal software rejuvenation schedule is n; = G-l(0.37788) = 16. That is, the estimation error in this case is 19-16=3. However, since the estimate proposed in Theorem 4.2 asymptotically converges to the real optimal, our method may function well in the case with a number of failure time data.
References 1. K. M. M. Aung, The optimum time to perform software rejuvenation for survivability, Proc. IASTED Int’l Conf. Software Eng., pp. 292-296, 2004. 2. K. M. M. Aung and J. S. Park, A framework of software rejuvenation for survivability, Proc. 18th IEEE Int ’1 Conf. on Advanced Information Networking and Applications, (in press). 3. T. Dohi, K. GoBeva-Popstojanova and K. S. Trivedi, Analysis of software cost models with rejuvenation, Proc. 5th IEEE Int’l Sympo. on High Assurance Systems Eng., pp. 25-34, 2000. 4. T. Dohi, T. Danjou and H. Okamura, Optimal software rejuvenation policy with discounting, Proc. 2001 Pacific Rim Int’l Sympo. on Dependable Computing, pp. 87-94, 2001. 5. T. Dohi, K. Iwamoto, H. Okamura and N. Kaio, Discrete-time cost analysis for a telecommunication billing application with rejuvenation, Proc. 2nd Euro-Japanese Workshop on Stochastic Risk Modelling for Fianance, Insurance, Production and Reliabilzty, pp. 181-190, 2002. 6. T. Dohi, K. GoSeva-Popstojanova, K. Vaidyanathan, K. S. Trivedi and S. Osaki, Preventive software rejuvenation - theory and applications, Springer Handbook of Reliability (H. Pham, ed.), pp. 245-263, Springer-Verlag, London, 2002. 7. Y. Huang, C. Kintala, N. Kolettis and N. D. Funton, Software rejuvenation: analysis, module and applications, Proc. 25th IEEE Int’l Symp. on Fault Tolerant Computing, pp. 381-390, 1995.
REDUCING DEGRADATION TESTING TIME WITH TIGHTENED CRITICAL VALUE JOONG SOON JANG' Div. ojlndustrial & Information System Eng., Ajou Univ. Suwon,Kyungkido. 442- 749, Korea
SUNG JIN JANG Div. of Industrial & Information Sy.stem Eng., Ajou Univ. Suwon.Kyungkido, 442-749, Korea BOO HEE PARK Div. ojlndustrial & Information System Eng., Ajou Univ. Suwon,Kyungkido, 442-749, Korea HO KYUNG LIM Div. of Industrial & Information System Eng., Ajou Univ Suwon,Kyungkido, 442-749, Korea
Determination of critical value for a degradation test is considered when the testing time is reduced to some degree. A rule that assures the same failure probability is proposed under the assumption of Weibull distribution for a performance characteristic. Photo-diode balance of an optical pickup is analyzed as a case study.
1
Introduction
In the era of speed, time to market becomes the most important factor of competitiveness. Therefore, shortening development lead time is crucial for every manufacturing company to achieve successful launching of new products. Developing a product consists of two major activities; design and evaluation. For efficient design, many innovation procedures and tools are provided such as concurrent engineering, product data management and CAD/CAM. However, relatively little emphasis was placed to reducing evaluation time. To reduce the evaluation time of new products, accelerated tests are usually considered. Accelerated test is to get life information of products quickly by shortening the life or hastening the degradation of their performance. ALT (accelerated life test) is considered as the typical accelerated test. However, in ALT, there are some cases difficult to get failure time information, especially for the products with high reliability. For those cases, ADT (accelerated degradation test) is considered as a good substitutable alternative. In ADT, the values of performance characteristics are measured at
205
206 predetermined times and the failure times are predicted based on those measured values. See Nelson [ 11 or Meeker and E~cobar[2]. For accelerated tests, two ways of shortening the life are usually used: one is to run the products at a higher usage rate and the other is to elevate the stresses to higher levels than normal levels. For ADT, another way of shortening life may be possible by tightening the critical value. Here the critical value is the level of performance at which the product is defined to be failed as soon as the performance level touches or exceeds over. Yang and Yang [3] studied the problem of tightening the critical value for ADT. The purpose of tightening the critical value is to get more failure data for statistical analysis. Yang [4]also studied screening problem based on degradation data. This study also deals with the problem of tightening the critical value for development tests. The objective of tightening is to reduce the test time. The test time is usually specified by customers or well known standards. However, it is beneficial to curtail the test for the trial versions at the development stage if possible. In these days, the speed of down scaling of electronic components is very fast while the required functions are kept unchanged, which makes the complexity of design and evaluation greater than before. Thus it is seldom the case where the first trial design is accepted as good in the life tests. Improvement of trial versions and re-evaluation are needed several times for final acceptance. During the life tests, there are many cases where the amounts of degradation of some performance characteristics in earlier times of the life tests are so large that it is not necessary to continue the test until the prescribed test time. In these cases, we may reduce the test time to some degree. To do so, it is necessary to have a suitable criterion of failure, which means the critical value at the reduced test time should be adjusted accordingly. The objective of this study is to find the critical value of degradation when the test time is reduced to some degree. Section 2 explains a motivating example and Section 3 propose a procedure to determine the critical value. Section 4 applies the proposed method to the actual example in Section 2.
2
Motivating Example
Optical pick up is a device to read the stored data in CD and DVD. It is an optical head unit which generate revival signal by detecting the changes of diffracted beam that occurred by shooting laser from optical source to disc pit. It converts optical signal to electrical signal and detects the external deflection of disc that is an external influence like deflection or eccentricity which occurs upon the rotation of the disc. Figure 1 depicts the structure of an optical pick up. An optical pick up consists of many components such as laser diode, photo diode, etc. that should be carefully arranged and controlled to read the signal correctly. Actually, to develop acceptable pick up devices, it usually needs several times of re-design and reevaluation.
207
Figure 1. Structure of Optical Pickup
There are more than 10 performance characteristics for an optical pick-up to be evaluated. One of them is the photo diode balance (PDB). Photo diode receives the reflected signals from a disc and transforms them into electrical signals. Figure 2 shows the cross section of a photo diode. PDB indicates whether the amounts of light signals transmitting through the parts A, B, C and D are balanced. When balanced, the light
Figure 2. Cross Section of Photo Diode signals come into focus. Two measures are used to evaluate PDB:
PDBX
=
(A+D)-(B+C) A+B+C+D
>
and PDBY
=
( A + B ) - (C + D ) A + B t C t D
,
208
where A, B, C and D denote the areas of light transmitting parts of cross section A, B, C and D, respectively. At first, PDBX and PDBY are set to be zero. However, they may go far from zero as the pickup operates, which means that the light swerves from the center. To evaluate time dependence of PDB (reliability), 6OoC-90%RH temperature and humidity combined test is undertaken for 192 hours. The values of PDBX and PDBY are measured after 24 hours storage under the room temperature. If the absolute values of PDBX and PDBY after the test become larger than a critical value, the pickup is defined to be failed. And if there are no failed one out of 20 units tested, the trial lot is accepted as a good one. Actually, however, there are many cases where the values of PDBX and PDBY go far above the critical in the earlier times of the test. Thus, for the trial versions, we may half the test time to reduce the development lead time. The problem is then how large the critical value should be.
3
Critical Value
Notation Y, : performance level at time t t, : original test time t, : reduced test time C, : original critical value at to C, : reduced critical value at t, Weib(y:P,q) : Weibull d.f with shape parameter p and scale parameter q = 1 - exp( -(-)Y P ) 7 qt : scale parameter at time t pt : failure probability of a pickup at time t AF : acceleration factor = 3 %,
Assumptions 1. Yt follows Weibull distribution with shape parameter p and scale parameter qt. 2. p is a constant with respect to time. qt is increasing in t. 3. AF is a constant, which means that there is no item-to-item or lot-to-lot variation in AF. The value of C, is technically determined based on customer’s requirements. However, there is no such requirement for C,. The value of C, should be determined based on the properties of the performance characteristics. If we take C, to be equal to C,, there will be many cases where the accepted items at the reduced test will turn out to be defective items. To determine the value of C,, we should have some information about the degradation behaviors of the performance characteristics. If we know the exact formula for the degradation paths, C, may be exactly determined. To get such information, degradation
209 test in which each item is measured and then put back into the test at some pre-specified times repeatedly may be considered. However, since the pickups should be taken out from the chamber to be stored in normal condition for a day to measure PDBX and PDBY, such a degradation test is not adequate to apply, which makes it impossible to get exact formula for the degradation path. Thus it is necessary to have another criterion to determine the value of C,. In this study, Y, is assumed to follow Weibull distribution for any t > 0. Note that the larger the scale parameter, the larger the failure probability is in Weibull distribution. Since it is assumed that the shape parameter p is a constant and the scale parameter qt is increasing in t, the failure probability increases with time. Hence, it seems to be appropriate to determine the value of C, to make the failure probability at t, to be the same as that at :,t
(3)
PI, = Pt” .
Since pt= Weib(y:P,q,), we have
Hence the value of may be obtained if we know the acceleration factor AF. In this study, AF is assumed to be constant, which makes C, to be uniquely determined. If it is not, it is impossible to assure the same failure probability at the reduced test time for the items or lots to be tested.
4
4.1
Analysis of Motivating Example Degradation Test
This section analyzes the pickup case. First, to get the value of AF, a degradation test is performed as follows; 40 items are put to test and every 10 items are taken out to measure the values of PDBX and PDBY at 48, 96, 144, and 192 hours, respectively. However, since PDBX and PDBY are two dimensional, it is difficult to see particular patterns of degradation. We thus transform them into the radial axis type variables as follows:
r = JPDBX
i PDBY
and 0 = tan
The values of r and 8 are listed in Table 1.
-I-.
PDBY PDBX
,
(5)
21 0 table1.Valuesofrate NO 1
48
96
144
192
48
96
144
192
9.29
10.24
13.36
13.47
122.29
126.62
140.86
74.41
2
1.48
5.27
3.89
4.84
133.89
148.07
144.49
145.40
3
11.66
20.80
23.08
26.23
139.65
141.00
145.50
146.90
4
4.86
7.69
8.92
9.98
141.85
153.56
149.09
147.46
5
10.21
18.59
22.30
23.52
142.70
145.30
151.38
154.41
6
3.92
6.96
10.74
10.63
150.55
152.39
152.64
159.11
7
10.24
11.76
9.75
10.73
152.72
158.51
156.85
161.30
8
16.91
26.59
29.85
33.00
167.77
167.18
160.67
163.15
9
14.02
19.22
23.18
25.45
175.04
171.16
172.51
179.41
10
4.2
0
r L
I 10.02 I 12.22 I 15.18 I 18.45 I 199.65 I 185.93 I 190.32 I 190.97
Analysis
We first perform ANOVA on the values of 8 and find that the time has no effect on them (p-value is 0.97). However, the values of r looks increasing with time, which means r may be used as the degrading characteristic. Figure 3 shows the Weibull probability paper for the values of r.
Weibull Probability QQ
48
95
i 96
+
90
80 70
A
60 50
40
E 30
;20 a,
a
I0 5
1
1
10
Figure 3. Weibull Probability Paper
144 192
21 1
From the Figure 3 , it is seen that the slopes of the lines are not different, which means that the slope parameter p may be assumed to be a constant. Minitab gives the estimate of p as 2.2027 and estimates of qt as in Table 2. Table 2. Estimates of qt t
48
96
144
192
Using Origin software, Qt ’s are fitted to the following function o f t with R2 = 0.99;
4.3
Tightened Critical Value
Actually, the critical value of r was not specified in the test specification. However, we take 30 as C, for r conservatively since the critical values for PDBX and PDBY were 30. To determine the value of C,, we need AF. When t,=96, we have from (7)
Thus from (4) and (8), the value of C, is obtained as follows;
c = C. = L = 23 AF
4.4
1,259
(9)
.828
Validity Check
To check the validity of the analysis, the data of 4 lots already tested with testing time 192 hours are analyzed. And the reduced tests were performed on the two lots randomly selected. Probability paper plot for those data indicates that Weibull distribution assumption also holds for each of the failure times. Table 3 contains the estimates of the Weibull parameters and the corresponding AF’s. Table 3. Weibull Parameter Estimates T
I92
96 1.355 1.301
2.238
25.685
The hypothesis of equivalence in p’s including the value of the original lot is not rejected by X2-test with significance level 0.05. It is also seen that the values of AF’s are
212
not greatly different with respect to that in (8), which implies that the assumptions made in the analysis hold.
5
Conclusion
This study presents an approach of determining critical value for a degradation test when the testing time is reduced to some degree. A performance characteristic of optical pickup is analyzed as an application. Scale parameter of Weibull distribution is taken as a measure of degradation. It will be interesting to apply the approach on the data of degradation paths of performance characteristics.
References
1. Nelson, W., Accelerated Testing, John Wiley and Sons (1990). 2. 3. 4.
Meeker, W.Q. and Escobar, L.A., Statistical Methods for Reliability, John Wiley and Sons (1998). Yang, G. and Yang, K., IEEE Trans. Reliability, 5 1,463(2002). Yang, G., IEEE Trans. Reliability, 5 1,288(2002).
A N OPTIMAL POLICY FOR PARTIALLY OBSERVABLE MARKOV DECISION PROCESSES WITH NON-INDEPENDENT MONITORS
LU J I N , TOMOAKI MASHITA, AND KAZUYUKI SUZUKI University of Electro-Communications Chofugaoka 1-5-1, Chofu-City, Tokyo 182-8585,JAPAN E-mail: [email protected] We developed a discrete-time Markovian deterioration system monitored by multiple non-independent monitors, and derive a sufficient condition where an optimum policy is given by a “control limit policy.” This condition is given by the conditional probability of monitors given the state of a system having a weak multivariate monotone likelihood ratio.
1. Introduction Breakdowns in large, complex computer systems can sometimes cause very seirous problems. Preventive maintenance plays an important role in avoiding these malfunctions. In the past, many kinds of maintenance problems of various systems have been studied using theories of reliability and maintainability, e.g. K. Vaidyanathan, D. Selvamuthu and K. S. Trivedi investigated a continuous time queuing model for an inspection-based preventive maintenance problem. In the case of discrete-time Markovian deterioration systems, Derman’ studied an optimal replacement problem where the state of the system is identified completely a t any given time. Rosenfield‘ and Whiteg discussed optimal inspection and replacement problems under the assumption that the system’s state can be observed only through costly inspection. Ohinishi, Kawai and Mine’s5 research explored a system which is monitored incompletely by one designated monitoring mechanism. However, the reliability of a system using only one monitor is not high due to the two types of contradictory failures, “false a1arms”and “failure t o alarm”. Taking this into account, this research deals with a system which is monitored by multiple monitors which are not independent, and shows that there exists an optimal control limilt policy under certain reasonable conditions.
*
2. Model 2.1. Model Description
This paper deals with a system of which the internal true state cannot be observed directly. Let X denote the true state, and it takes on a value in a finite
213
214 set {1,2;.. , n } . The state numbers are ordered to reflect the degree of the system’s deterioration. That is, state 1 denotes the best state which means that the system is like new, and state n denotes the most deteriorated state. The state of the system undergoes deterioration according to a stationary discrete-time Markov chain having a known transition law. Let P be the transition probability matrix, in which the element p,, denotes t,he 1-step transition probability from state i to state j . At each time period, the state of the system is monitored incompletely by some monitors which give the decision maker some information about the ture state of the system. We assume the number of monitors to be L ( 2 1). The outcome of the L monitors is given as M = ..., ..., where M ( k )denotes the outcome of the Ic-th monitor and takes on a value in a finite set { 1 , 2 , . . . , m k } . Let
be a conditional probability matrix which describes the relationship between system true states and monitor outcomes where
Two actions, “keep ” and “replace ,” are considered in this research. “Keep” means an action that continues the system operation with incomplete monitoring, and the operating cost per one period in state i is given by Ci. Similary, “replace” means an action t o replace the system by a new one of which cost is given by a constant R(> 0). At any given time period, the decision maker select,s only one of the above two actions.
2.2. Assumptions The following assumptions are made in this research.
(A-1) Transition probability matrix P has a property of Totally Positive of order 2 (TPz), that is:
(A-2) Observational probability matrix I’ has a property of Weak Multivariate Monotone Likelihood Ratio (Weak MLR) lo:
215
where outcome vector of the monitors has a partial order: 8 5 8’ for 8 = (01,. . . ,0,) and 8’ = (q,. . . ,O;) if .9k 5 el, for each k E {I, . . . , L } . (A-3) Ci is a nondecreasing function of state i. (A-4) R is given by a constant.
(A-1) states that, as the system deteriorates, it is more likely to make a transition t o a higher state. (A-2) implies that higher states of the system gives rise to higher outcome levels of the monitoring probabilistical1y.- (A-3) states that, as the system deteriorates, it becomes more costly to operate. (A-4) states that the replacement cost of the system is a constant.
3. Optimal Keep and Replacement Problem At the beginning of any time period, “keep” or “replace” will be selected as an optimal action. An optimal policy is the sequence of actions which minimizes the total cost incurred now and in the future. Since the state information obtained from the outcome of the monitoring is incomplete, the decision maker needs to take the most suitable action out of the two actions by inferring the exact state of the system from the current output and the past data. This problem is formulated as partially observable Markov decision processes (POMDP). The problem is how to minimize the expected total discount cost over an infinite horizon. Let II = ( T I , rr2,. . . , 7rn) be a prior state probability vector of X, where n
7rZ =
Pr (X = i ) , c 7 r Z = I, and 0 5
7rz
5 1 for any i. The transition between the
i=l
states is specified below. 0
When “keep” is selected, II + T ( I I ,8) with probability P ( e l n ) , where
T(n,e)=(wn,61, ~
0
m el,.,. . ,T n ( n , e ) ) .
It is noted that the updated probability distribution T ( I I ,0) is calculated using Bayes’ formula. When “replace” is selected, II + e0 with probability I, where
e0
=
( 1 , O , . . . ,O).
216 4. Properties of Optimal Cost Function
4.1. Expected Cost Function Let V“)(ll) denote the optimal expected total discounted cost over an infinite horizon with an initial state II.
This is a recursion function which can be calculated based on the initial state ll, and p(0 5 p < 1) is a discount factor. The first and the second terms on the right hand side of (3) respectively, correspond to the optimal expected cost of the N periods when the action “keep” and “replace” are selected at the beginning. n
c7riCi is the expected cost if we choose “keep” at the the first time period with i=O
the initial state II,and the following term presents the future optimal cost incurred at the next ( N - 1) periods with the updated state T(ll,O). R pV(N-l)(eo) is the expected cost if “replace” is selected as the optimal action at the beginning. Furthermore, the action which minimizes the right hand side of (3) is the optimal action which must be selected at the state n. And hence, an optimal keep and replacement policy is obtained by selecting the action which minimizes the right hand side of (3) for each II.
+
4.2. Control Limit Policy ,T:) and 112 Let two prior state probability vectors II’ = (~:,7ri,... . . . ,7rK) under a partial order 111+T 112 where
=
(7r;,7r;,
7rt7-r; - 7r,’7r,z
2 0 (I 5 i < j 5 n ) ,
if V(N)(lI1) 5 V“)(l12) holds, then we say the optimal expected cost function V ( N ) ( l Iexhibits ) a control limit policy.
“‘’T
Figure 1: Control limit policy
Figure 2: Non-control limit policy
This policy implies that the optimal expected total cost over an infinite horizon with an initial state II is a monotonically nondecreasing function of II with respect
217 to the partial order “TP2”. That is, the more deteriorated the initial state of the system is in a probabilistic sense of TP2 order, the larger is the cost incurred in the future. 4.3. Lemmas
In this section, we derive several lemmas. Let 8 k denote the outcome 8 = (01, . . . ,@k,. . . , @ L ) in which all the elements except @ k are fixed, and r0,denote the corresponding conditional probability matrix. Lemma 4.1. For 111 +T 112, n
n
i=l
i=l
holds i f gi is a nondecreasing function of i.
If A is a ( k A x k ) TP2 matrix and B is a ( k x k ~ TP2 ) matrix, then the product A B is a ( k x~k ~ TP2 ) matrix.
Lemma 4.2.
For the proof, see Karlin ’. 0 Lemma 4.3.
If a transition probability matrix P E TP2, IT’P +T II’P for II’+T n2.
Lemma 4.4. For arbitrary Ic E (1,. . . ,L ) ,
I’ E Weak M L R + r e , E TP2. Proof: Easy to prove this from the definitions of TP2 and Weak MLR. 0 Lemma 4.5. If P E TP2 and
re, E
TP2,
P(81, I n’) +T P(8k
I n2) for n14~ n2.
Proof: This is obvious from Lemma 4.2, Lemma 4.4 and the fact that
P(& I II) = IIPre,. ‘0 Lemma 4.6. For a n y II fixed,
T ( n , 8 k )
holds from Lemma 4.4 for every j < j ’ , 81,
E Weak MLR, thus
< 8’k
.0
Lemma 4.7. For any 8 = (01,‘ . . ,OL) fixed, we have
T(II1, 8 ) 4~ T(n’, 8) for 111+T
IT.
(7)
218
Proof: According to assumption (A-1), P E TP2, t,hus
holds from Lemma 4.3 for every i
5 j. 0
4.4. Suficient Condition
In this section, we derive a sufficient condition for a control limit policy.
Theorem 4.1. V(N)(lI1) 5 V(N)(1312) is held for 1114~ IIz under the following conditions, 0
P E TPL,
r€
WeakMLR Ci is a nondecreasing function of state i. Proof: We use an inductive method t o prove Theorem 4.1. The proof proceeds in three steps.
S-1) Assuming that V(N-l)(rI1) 5 V(N-1)(J12) is held for 111 4~ XI2.
S-2) One-period ( N
= 1)
The optimal expected cost function when N = 1 is given as
Let 111 +T
n2,from assumption (A-3) and Lemma 4.1, we obtain n
As R is a constant,
n
obviously has a control limit policy.
s-3) N-period The optimal expected cost function in the case of N-period is given as
Case 1) In the case “keep” is optimal at the beginning of the first period: According to the assumption (A-3), we get the same result with one time period,
219
From S-I), V(N-l)(II) is assumed to have a control limit policy, and T(II', 0) +T T(I12,0) holds by Lemma 4.7, then we obtain
v ( N - l ) ( ~ ( ~el),)5 v ( N - l ) ( q n z , e ) ) for II'
+T
n2.
(15)
Furthermore, we have that
V(~-')(T(II,
e,)) 5 V(~-')(T(II, eg))
(16)
from Lemma 4.6 and
p ( e , I II')
4T
p ( e k I II')
for
II~ +T II'
(17)
from Lemma 4.5. Thus we get mr
mh
ek=l
ek=1
C p ( e k I I I ~ ) v ( ~ - ~ ) ( T ( ~ I ~ ,5B ,C ) ) p ( e k I n2)v(N-1)(qn1,ek)) (18)
where V(N-l)(T(II', 0,)) is a nondecreasing function of O k for II'. Since Eq.18 is held for any 5 E (1,... , L ) , we obtain
ml
m I.
According to Eqs.15 and 19, T rn,
mr
Case 2) In the case "replace" is optimal at the beginning of the first period: In this case, the optimal cost, R pV(N-l)(eo) is a constant.
+
Therefore, for a general N , V")(IT)
has a property of control limit policy.
0 5 . Conclusion
This research deals with a deterioration system which is observed by multiple, nonindependent monitors, and proves the following:
220 [conditions A]
P E TPZ. I? E weark MLR. 0 0
Ci is a nondecreasing function of state i . R is a constant.
Conditions A is a sufficient condition under which t h e expected optimal cost function over an infinite horizon has a control limit policy. To conclude, I? having Weak M LR property plays a n important role for a control limit policy with P having TP2 property. References 1. C. Derman, “Optimal Replacement rules when changes of states are Markov,” Mathematical Optimization Technique, Univ. of California Press, Berkeley, CA, (1963). 2. S. Karlin, “Total Positivity,” Stanford Univ. Press, Stanford, California, Vol. I, (1968). 3. S. Karlin and H. Rubin, “The Theory of Decision Procedure for Distribution with Monotone Likelihood Ratio,” Annals of Mathematical Statistics, Vol. 27, 272-299 (1956). 4. K. Vaidyanathan, D. Selvamuthu and K. S. Trivedi, “Analysis of inspection-Based Preventive Maintenance in operational Software System,” The Proceedings of International Symposium o n Reliable Distributed Systems, Japan, pp.286-295 (2002). 5. M. Ohnishi, H. Kawai and H. Mine, “An Optimal Inspection and Replacement Policy under Incomplete State Information,” European Journal of Operational Research, Vol. 27, 117-128 (1986). 6. D. Rosenfield, “Markovian Deterioration with uncertain information,” Operation Research, Vol. 24, 141-155 (1976). 7. S. M. Ross, “Quality Control under Markovian Deterioration,” Management Science, Vol. 17, No. 9, 589-596 (1971). 8. K. Suzuki, “Theory for Reliability and Quality Assurance,” Bulletin of INCOCSAT, Vol. 3, 23-28 (1995). 9. C. White, “Optimal Inspection and Repair of a Production Process Subject to Deterioration,” Journal of the Operational Research Society, Vol. 29, 235-243 (1978). 10. W. Whitt, “Multivariate Monotone Likelihood Ratio and Uniform Conditional Stochastic Order,” Journal of Applied Probability, Vol. 19, 695-701 (1982).
MATHEMATICAL ESTIMATION MODELS FOR HARDWARE AND SOFTWARE FAULT TOLERANT SYSTEM
PICHATE JIRAKITTAYAKORN
Department of Computer Engineering, Faculty of Engineering, King Mongkut's University of Technology Thonburi, Bangkok, 10140 Thailand
NARUEMON WATTANAPONGSAKORN
Department of Computer Engineering, Faculty of Engineering, King Mongkut's University of Technology Thonburi, Bangkok, 10140 Thailand
DAVID COIT
Department of Industrial Engineering, Faculty of Engineering, Rutgers University, Frelinghuysen Rd., Piscataway, NJ 08854 USA
This paper presents a simple mathematical model for system reliability calculation. N-version programming and recovery block are the fault-tolerant architectures taken into account for hardware and software fault tolerant system design. Each of the architectures has different characteristics for use in real-world systems. Our work extends a work of Laprie et al., where definition and analysis of multi-version hardware and software fault-tolerant system reliability were given, and our previous work, where the system reliability analysis is too complex to extend to a higher degree of fault tolerance. Possible failures in the system are considered, including decider failure, related faults between/among software versions, and related faults between the decider and the software versions. This model is easy to calculate and to extend to a higher degree of fault tolerance with consideration of all possible causes of system failure.
1. Introduction
Every organization needs reliable systems for running its applications. There are many causes of system failure, such as errors in the system design specification, software bugs, machine usage time, and wear-out effects. Hence, increasing the reliability of the system is considered crucial for business operation. Fault tolerance is thus a common approach to reduce the probability of system failure. System reliability must be considered in terms of both hardware reliability and software reliability. In this paper, we focus on hardware and software fault tolerance with two common architectures: N-version programming (NVP) and recovery block (RB). As regards existing system reliability models, Laprie et al. [7] analyzed system failures caused by software faults only, and Wattanapongsakorn [8] proposed complex models for system reliability calculation. In view of this, our work extends the said two previous works by considering all possible causes of system failure and simplifying the reliability models. Essentially, we extend these reliability models to a higher degree of fault tolerance. In the next section, we give an overview of related work in fault tolerance techniques. Section 3 explains the concepts of the fault-tolerant architectures, while Section 4 explains the research methodology. Then, Section 5 presents the results of our reliability models. Finally, Section 6 concludes our work.
2. Related Works
Previous researchers have tried to find solutions to tolerate such failures, and many software fault tolerance techniques have been proposed. The multi-version software fault-tolerant (SW-FT) techniques are based on the use of two or more versions of software (called 'variants') executed in sequence or in parallel. Brian Randell [1] proposed the recovery block technique in 1975, which collects a checkpoint and restarts from the last checkpoint with the next variant when a fault occurs. A. Avizienis et al. [2] implemented N-version programming in 1977, comparing all outputs and deciding on output correctness. Later on, other multi-version SW-FT techniques were explored, including N Self-Checking Programming (proposed by Laprie in 1987), Consensus Recovery Blocks (proposed by Scott in 1987) and t/(n-1)-Variant Programming (proposed by Xu in 1997) [3, 4, 5, 6, 7]. These result in increased architecture complexity. The corresponding system reliability analysis is rather complicated, even though the assumption of independent software failures is normally used to ease the analysis. There is no existing work on a generalized reliability analysis model for multi-version SW-FT architectures that can tolerate n software faults and m hardware faults while considering failure dependency in the system. An initiative work in 1990 by J.-C. Laprie et al. [7] defined and analyzed the reliability of hardware and software fault tolerant architectures in which only one or two software or hardware faults can be tolerated. A previous work of N. Wattanapongsakorn et al. [8] provided a similar reliability analysis, which is quite complex. In view of this, our work extends the two previous works [7] and [8] by considering all possible causes of system failure and simplifying the reliability models. Essentially, we extend these reliability models to a higher degree of fault tolerance. The notations and parameters used throughout this paper are borrowed from [7] and [8] as shown below:
- P_hi: probability of failure of hardware component i; Q_hi = 1 - P_hi is the reliability of hardware component i
- P_vi: probability of failure of software version i; Q_vi = 1 - P_vi
- P_rv: probability of failure from a related fault between two software versions; Q_rv = 1 - P_rv
- P_all: probability of failure from a related fault among all software versions, due to a fault in the software specification; Q_all = 1 - P_all
- P_d: probability of failure of the decider or voter; Q_d = 1 - P_d
- P_ID.X: probability of activating an independent fault in the decider of X
- P_nV.X: probability of activating a related fault among n variants of X
- P_RVD.X: probability of activating a related fault among the variants and the decider of X
- HECA: hardware error-confinement area
- SECA: software error-confinement area
- X/H/S: software fault-tolerant method X, which can tolerate H hardware faults and S software faults; H and S are non-negative integers
3. Hardware and Software Fault Tolerant Architecture
In this research, two models of hardware and software fault tolerant architectures are considered: N-version Programming (NVP) and Recovery Block (RB).
3.1 N-Version Programming (NVP) Architecture
N-version programming, which uses parallel processing, is a multi-version technique in which all the versions are designed to satisfy the same basic requirement and the decision on output correctness is based on a comparison of all outputs. The decision algorithm (usually a voter) selects the correct output. The voter is the fundamental difference between this approach and the recovery block approach, which requires an application-dependent acceptance test. Since all the versions are built to satisfy the same requirement, N-version programming requires considerable development effort. Design of the voter can be complicated by the need to perform inexact voting. Much research has gone into the development of methods that increase effective diversity in the final result.
Figure 1. NVP/1/1 and RB/1/1 architectures (legend: HECA, SECA, functioning version, idle version)
3.1.1 NVP/H/S Architecture
The NVP/i/j architecture can tolerate i hardware faults and j software faults. This architecture consists of j+2 independent software variants and i+2 hardware components. If i = j, each software variant runs in parallel on a separate hardware component. For example, NVP/1/1 has three hardware components and three software variants, as shown in Figure 1. Any hardware failure causes the software running on that hardware to fail to give a result. We discuss the reliability details in the reliability analysis section.
3.2 Recovery Block (RB) Architecture
The recovery block technique combines the checkpoint-and-restart approach with multiple versions of software, such that a different version is tried after an error is detected. Checkpoints are created before a version executes. Checkpoints are needed to recover the system state after a version fails, in order to provide a valid operational starting point for the next
version if an error is detected. The acceptance test need not be an output-only test and can be implemented by various embedded checks to increase the effectiveness of the error detection. In addition, because the primary version will be executed successfully most of the time, the alternatives could be designed to provide degraded performance in some sense. With data diversity, the output of the alternatives could be designed to be equivalent to that of the primary, with the definition of equivalence being application dependent. If all the alternatives are tried unsuccessfully, the component gives up executing the rest of the system, and thus it fails to complete its function.
3.2.1 RB/H/S Architecture
This architecture has i+1 hardware components and j+1 software variants, and can tolerate i hardware faults and j software faults. All software variants run redundantly in sequence. The system first starts with the primary version (I); when a failure occurs, it rolls back to the last checkpoint state and the secondary version (II) takes over. For example, RB/1/1 has two hardware components and two software variants. The RB/1/1 architecture is shown in Figure 1. The reliability details are discussed in the reliability analysis section.
4. Methodology
4.1 Assumption
Throughout this paper, we use the following assumptions:
- Each software component, hardware component, or the system has two states: functional or failed.
- The reliability of each software or hardware component is known.
- There is no failure repair for any component or the system.
- Hardware redundancy is in active mode (i.e., hot spares).
- Failures of individual hardware components are s-independent.
4.2 Reliability analysis
We classify the failure types as separate failures or common-mode failures, which can be detected or undetected. Separate failures are caused by faults of independent software variants or hardware components. Common-mode failures can result from related faults or from independent faults in the decider. If we can specify where the fault is, it is a detected failure; otherwise it is an undetected failure.
4.2.1 Separate mode
A separate failure is a failure caused by the unreliability of a software variant or a hardware component, and it can be detected in the corresponding error confinement area. The NVP/1/1 architecture has three variants. System failure occurs when at least two variants fail. There are two cases that can occur: all three variants fail, or two out of three variants fail. We can describe this by the following derived equations:
P(separate failure of 3 variants) = P(v)^3 + 3 P(v)^2 Q(v) = P(v)^3 + 3 P(v)^2 [1 - P(v)] = 3 P(v)^2 [1 - (2/3) P(v)]
This can be expanded to a higher degree of fault tolerance. Thus, we have the general term of NVP separate failure:
P(separate failure of n variants) = n P(v)^(n-1) [1 - ((n-1)/n) P(v)]   (1)
The RB/1/1 architecture has two variants. The failure occurs when all variants fail. Therefore, the probability of system failure is the product of the probabilities that the two software versions fail:
P(separate failure of 2 variants) = P(v)^2
This can be expanded to a higher degree of fault tolerance. The general term of RB separate failure can be derived:
P(separate failure of n variants) = P(v)^n   (2)
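As a quick numerical check on Eqs. (1) and (2), the short Python sketch below evaluates both general terms and confirms that, for n = 3 (NVP) and n = 2 (RB), they match the direct expansions derived above. The failure probability used is an arbitrary illustrative value, not one taken from the paper.

```python
# Sketch of the separate-failure terms in Eqs. (1) and (2).
# The variant failure probability p is an arbitrary illustrative value.

def nvp_separate_failure(p: float, n: int) -> float:
    """Eq. (1): n*p^(n-1) * (1 - ((n-1)/n)*p) for an n-variant NVP system."""
    return n * p ** (n - 1) * (1 - ((n - 1) / n) * p)

def rb_separate_failure(p: float, n: int) -> float:
    """Eq. (2): p^n for an n-variant recovery-block system."""
    return p ** n

p = 0.05
# NVP/1/1 has three variants; check Eq. (1) against the direct expansion
# P(v)^3 + 3*P(v)^2*(1 - P(v)).
direct_nvp = p ** 3 + 3 * p ** 2 * (1 - p)
assert abs(nvp_separate_failure(p, 3) - direct_nvp) < 1e-12

# RB/1/1 has two variants; Eq. (2) is simply P(v)^2.
assert abs(rb_separate_failure(p, 2) - p ** 2) < 1e-12

print(nvp_separate_failure(p, 3), rb_separate_failure(p, 2))
```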
4.2.2 Common mode undetected
The common-mode undetected failure is a failure hidden in the system that we cannot localize. The related fault between variants and decider is another failure that can occur in the system. Therefore, this part covers the related faults in all possible cases that can cause system failure. NVP/1/1 has the related fault between the software variants and the decider, the related faults between any two variants, and one related fault of all three variants:
P_RVD + 3 P_RV + P_3V
In the RB/1/1 architecture, the related fault between the software variants and the decider is the only failure that can occur in this part:
P_RVD
4.2.3 Common mode detected
The common-mode detected failure is a failure that can be localized in the system. The decider failure is considered in this part. NVP/1/1 has the exceptional cases of failure: when a hardware component and a software variant running on a different hardware component both fail (6 different combinational cases), and when the decider fails:
6 P_v P_h + P_ID
In the RB/1/1 architecture, the related fault between all software variants and the decider fault can occur in this part:
P_2V + P_ID
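Putting the three failure classes of Sections 4.2.1-4.2.3 together gives a first-order software-side unreliability for the two baseline architectures. The sketch below simply sums the listed terms, which implicitly treats the failure classes as disjoint; the hardware terms of Section 4.3 are omitted, and all numerical inputs are illustrative placeholders.

```python
# Software-side system failure probability for NVP/1/1 and RB/1/1,
# summing the separate, common-mode-undetected and common-mode-detected
# terms of Sections 4.2.1-4.2.3. Hardware terms (Section 4.3) are omitted,
# and all parameter values below are illustrative placeholders.

def nvp_1_1_sw_failure(pv, prvd, prv, p3v, pid, ph):
    separate = 3 * pv ** 2 * (1 - (2 / 3) * pv)      # Eq. (1), n = 3
    undetected = prvd + 3 * prv + p3v                # Section 4.2.2
    detected = 6 * pv * ph + pid                     # Section 4.2.3
    return separate + undetected + detected

def rb_1_1_sw_failure(pv, prvd, p2v, pid):
    separate = pv ** 2                               # Eq. (2), n = 2
    undetected = prvd                                # Section 4.2.2
    detected = p2v + pid                             # Section 4.2.3
    return separate + undetected + detected

print(nvp_1_1_sw_failure(pv=0.05, prvd=2e-5, prv=0.028,
                         p3v=9.3e-3, pid=6.2e-3, ph=0.025))
print(rb_1_1_sw_failure(pv=0.05, prvd=2e-5, p2v=0.028, pid=6.2e-3))
```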
4.3 Considering Terms of Hardware Failure
After considering the hardware fault terms, we can specify the probability of system failure for the X/1/1 architectures as summarized in Table 1. We use a set of values from [7] and substitute these values into the equations of [8] and our equations in Table 1. The results from the equations of [8] are very close to our results. Therefore, in the next step, we extend our models to a higher degree of fault tolerance.
Table 1. Probability of failures with hardware faults in consideration.
4.4 Extension to Higher Degree of Fault Tolerance
Figure 2. X/2/1 and X/2/2 architectures
4.4.1 NVP/2/1 and NVP/2/2 Architectures
NVP/2/1 consists of three independent software variants running on four hardware components, as shown in Figure 2. By having two of the three software versions run on three hardware components and the other one run on two hardware components, this model can tolerate two hardware faults and one software fault. NVP/2/2 consists of four independent software variants, each running in parallel on a separate hardware component. This model can tolerate two hardware faults and two software faults.
Table 2. Probability of failures of X/2/1 and X/2/2
4.4.2 RB/2/1 and RB/2/2 Architectures
RB/2/1 has three hardware components and two software variants, as shown in Figure 2. By having both software variants run on every hardware component, it can tolerate two hardware faults and one software fault. RB/2/2 has three hardware components and three software variants; by having all three software versions run on each hardware component, it can tolerate two hardware faults and two software faults. We can specify the probability of system failure for architectures X/2/1 and X/2/2 as summarized in Table 2.
5. Result
Having analyzed all the failure terms, we derive general reliability models for the RB and NVP systems with higher degrees of fault tolerance, i.e., m and n. The unreliability of the RB/m/n system is given in Eq. (3). The unreliabilities of the NVP/n/n-1 and NVP/n/n systems are:
NVP/n/n-1 = (n+2) P_H^(n+1) [1 - ((n+1)/(n+2)) P_H] + (n+1) P_I^n [1 - (n/(n+1)) P_I]
  - (n+2) P_H^(n+1) [1 - ((n+1)/(n+2)) P_H] (n+1) P_I^n [1 - (n/(n+1)) P_I]
  + P_ID + n P_H^n P_I^(n-1) + (n+2) P_(n+1)V + P_(n+2)V + P_RVD   (4)
We use our reliability models and extend them to fault tolerance of degree 10. We also assume the following input data set and compare the corresponding system reliabilities:
P_I = 5.000E-02, P_2V = 2.803E-02, P_3V = 9.343E-03, P_4V = 2.336E-03, P_5V = 4.672E-04, P_6V = 7.786E-05, P_7V = 1.112E-05, P_8V = 1.390E-06, P_9V = 1.544E-07, P_10V = 1.544E-08, P_11V = 1.404E-09, P_12V = 1.170E-10, P_H = 2.500E-02, P_ID = 6.200E-03, P_RVD = 2.000E-05.
Based on our experimental results, when we extend the model to a higher degree of fault tolerance, the system reliability increases up to a certain point, which can be called the "saturation point". Beyond this point, the system reliability rises at a negligible rate.
Figure 4. Reliability of the X/n/n architectures (RB/n/n and NVP/n/n, X/1/1 through X/10/10)
Figure 5. Reliability of the X/n/n-1 architectures (X/2/1 through X/10/9)
6. Conclusion
We have simplified the reliability model under the concept of Laprie [7] and taken into account hardware failure from Wattanapongsakorn [8]. After that, we use the mathematical induction principle. We demonstrate that the general reliability model can be extended to order n, where n is the degree of fault tolerance. A system that requires high system reliability can increase the degree of fault tolerance to meet the reliability requirement. Each system has unique characteristics and requirements: a critical system needs high system reliability, while a real-time system prefers minimum processing time. When applying the models to a real-world system, we have to consider other factors such as cost and processing time in addition to the system reliability. We found that the RB architecture has higher reliability than the NVP architecture; however, the NVP architecture requires less processing time than the RB architecture. Based on a comparison of our reliability model with the existing models, we conclude that our equations are more practical and essentially scalable, considering both hardware and software causes of failure. Essentially, we consider decider failure, related faults between/among software versions, and related faults between the decider and software versions.
References
1. B. Randell, "System structure for software fault tolerance", IEEE Trans. on Software Engineering, June, pp. 220-232 (1975).
2. A. Avizienis, "Toward Systematic Design of Fault-Tolerant Systems", Computer, April, pp. 51-58 (1977).
3. J. B. Dugan, S. A. Doyle and F. A. Patterson-Hine, "Simple Models of Hardware and Software Fault Tolerance", Proc. of Reliability and Maintainability Symp., pp. 124-129 (1994).
4. A. Avizienis, Software Fault Tolerance, John Wiley & Sons Ltd., pp. 23-46 (1995).
5. W. Torres-Pomales, Software Fault Tolerance: A Tutorial, Langley Research Center, October (2000).
6. L. L. Pullum, Software Fault Tolerance Techniques and Implementation, Artech House, Inc. (2001).
7. J.-C. Laprie, J. Arlat, C. Beounes and K. Kanoun, "Definition and Analysis of Hardware- and Software-Fault-Tolerant Architectures", IEEE Computer, July, pp. 35-51 (1990).
8. N. Wattanapongsakorn and S. P. Levitan, "Reliability Optimization Model for Fault-Tolerant Distributed Systems", Proc. of Reliability and Maintainability Symp., January, pp. 193-199 (2001).
ANALYSIS OF WARRANTY CLAIM DATA: A LITERATURE REVIEW
MD. REZAUL KARIM
Department of Statistics, University of Rajshahi, Rajshahi - 6205, Bangladesh. E-mail: [email protected]
KAZUYUKI SUZUKI
Department of Systems Engineering, The University of Electro-Communications, Tokyo 182-8585, Japan. E-mail: [email protected]
Warranty databases are used by manufacturers for many purposes; for example, a manufacturer can predict future claims and warranty costs; determine whether a recall, a halt in production, or a modification is necessary; ascertain whether product reliability is affected by the manufacturing process or usage environment; and compare failure rates among similar or competing products. This paper presents a brief survey of the literature directed toward the analysis of warranty claim data. It emphasizes discussion of the different kinds of warranty claim data selected from the reviewed papers and comparison of the statistical models and methods used to analyze such data.
1. Introduction
Manufacturers analyze field reliability data to enhance the quality and reliability of their products and to improve customer satisfaction. There are many sources for collecting reliability-related data. In many cases, it would be too costly or infeasible to continue an experiment until a reasonable number of items have failed. Warranty claim data is a prime source of field reliability data, collected economically and efficiently through service networks. Suzuki et al. [1] mentioned that the main purposes and uses of warranty claim data are (i) early warning/detection of bad designs, poor production processes, defective parts, poor materials, etc.; (ii) observing the targets of new product development, that is, whether the targets are achieved or not; (iii) grasping the relationships among the test data at the development stage, the inspection results of the production stage, and the field performance; (iv) determining whether a recall, halt in production, or modification is necessary; (v) comparing the reliability of similar or competing products; (vi) constructing a database about the failure modes/mechanisms and their relation to both the environmental conditions and how the product is used; and (vii) predicting future warranty claims and costs. There are many aspects to warranty; therefore, a number of procedures have been
developed for analyzing product warranty data, and the literature on this topic is very large. Reviewing the literature on methods for the analysis of warranty data, Lawless [2] pointed out that "Starting with Suzuki [3, 4], there has been a good deal of development over the past 10-15 years (e.g., Kalbfleisch, et al. [5]; Robinson & McDonald [6]; Kalbfleisch & Lawless [7])." Recently, Murthy and Djamaludin [8] reviewed the literature on warranties across many different disciplines, omitting mathematical details and highlighting issues of interest to manufacturers in the context of managing new products. In this paper, we present a brief survey of the literature on statistical models and methods directed toward the analysis of product warranty claim data. For convenience, this survey is somewhat arbitrarily classified by topic into the following nine sections.
2. Estimation of lifetime distribution using supplementary data
For analyzing incomplete warranty lifetime data for which information is only available for failed products, Suzuki [3, 4] proposes a pseudo-likelihood approach using supplementary data. In his notation, (X_i, Y_i), i = 1, 2, ..., N, represent independent, identically distributed pairs of random variables, where X_i is the variable of interest with pdf f(x) and survival function F̄(x), and Y_i is a censoring variable with pdf g(x) and survival function Ḡ(x). θ represents a vector of unknown parameters, taking values in the parameter space Θ. The observed quantities are (Z_i, δ_i), i = 1, 2, ..., N, where Z_i = min(X_i, Y_i), δ_i = I[X_i ≤ Y_i], and I[·] denotes the indicator function. Suzuki [3] derived a nonparametric approach for the generalized MLE of the survival function F̄(t) and discussed the properties of the estimator. If the lifetime distribution is assumed to have a known parametric form, for example exponential or Weibull, Suzuki [4], under some assumptions, presents the following likelihood:
L* = ∏_{i=1}^{n_u} f(z̃_i) ∏_{j=1}^{n_c} [F̄(z̃_j)]^{1 + n_l/n_c},   (1)
where z̃_i (i = 1, ..., n_u) is Z_i conditioned on X_i ≤ Y_i (δ_i = 1), and z̃_j (j = 1, ..., n_c) is Z_j conditioned on X_j > Y_j (δ_j = 0). Here n_u = Σ_{i=1}^{N} δ_i is the number of automobiles that fail in the warranty period, n_c = Σ_{i=1}^{N} (1 - δ_i) D_i is the number of automobiles without failure in the warranty period but for which mileage was determined through follow-up, and n_l = Σ_{i=1}^{N} (1 - δ_i)(1 - D_i) is the number of automobiles without failure that have not been followed up in the warranty period. The estimator θ̂* is the value in Θ at which L* in (1) is maximized. Kalbfleisch and Lawless [9] proposed likelihood-based methods for the analysis of field-performance studies with particular attention centered on the estimation of regression coefficients in parametric models. They developed the idea of Suzuki [4] for the analysis of warranty data with missing covariate information and proposed
the pseudo-likelihood
L# = ∏_{i=1}^{n_u} f(z̃_i) ∏_{j=1}^{n_c} [F̄(z̃_j)]^{1/p̂*}.   (2)
This results from the fact that, in equation (1), 1 + n_l/n_c converges to 1/p*, where p* = (1/N) Σ_{i=1}^{N} D_i is the proportion of products, e.g., automobiles, followed up; here D_i = 1 if the i-th product is followed up and D_i = 0 otherwise (i = 1, 2, ..., N). Hu and Lawless [10] also apply this approach to covariate analysis using estimating functions. More general types of pseudo-likelihood methods and their asymptotic properties when covariate information is missing are investigated by Hu and Lawless [11].
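To make the structure of the pseudo-likelihood (1) concrete, the following sketch fits an exponential lifetime distribution to simulated warranty-type data with partial follow-up, for which maximizing L* has a closed form. The exponential model and all simulation settings are assumptions chosen for illustration; they are not part of Suzuki's papers.

```python
# Sketch of Suzuki's pseudo-likelihood (1) for an exponential lifetime,
# fitted to simulated data. The exponential model and all settings here
# are illustrative assumptions.
import random

random.seed(1)
lam_true, censor_rate, follow_up_prob, N = 0.002, 0.004, 0.3, 20000

fail_times, cens_followed, n_l = [], [], 0
for _ in range(N):
    x = random.expovariate(lam_true)      # lifetime of interest
    y = random.expovariate(censor_rate)   # censoring (e.g., mileage limit)
    if x <= y:
        fail_times.append(x)              # failure observed under warranty
    elif random.random() < follow_up_prob:
        cens_followed.append(y)           # censored, mileage known via follow-up
    else:
        n_l += 1                          # censored, not followed up

n_u, n_c = len(fail_times), len(cens_followed)
w = 1 + n_l / n_c                         # exponent in pseudo-likelihood (1)

# For the exponential model, maximizing L* has a closed form:
# lam_hat = n_u / (sum of failure times + w * sum of followed-up times).
lam_hat = n_u / (sum(fail_times) + w * sum(cens_followed))
print(f"true lambda = {lam_true}, pseudo-MLE = {lam_hat:.6f}")
```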
3. Age-based claims analysis
The age-based (or age-specific) analysis of product failure data has engendered considerable interest in the literature (Kalbfleisch et al. [5]; Kalbfleisch and Lawless [7]; Lawless [2]; Karim et al. [12, 13]). Kalbfleisch et al. [5] assumed that if N_x cars are put into service on day x, and n_xtl is the number of claims at age t with reporting lag l for cars put into service on day x, then n_xtl ~ Poisson(μ_xtl), where the Poisson mean is μ_xtl = N_x λ_t f_l; here λ_t is the expected number of claims for a car at age t, and f_l is the probability that a repair claim enters the database used for analysis l days after it takes place. The data comprise the claim frequencies n_xtl, where x + t + l ≤ T (T is the current date), and give rise to the likelihood
L = ∏_{x+t+l ≤ T} exp(-N_x λ_t f_l) (N_x λ_t f_l)^{n_xtl} / n_xtl!   (3)
Lawless and Kalbfleisch [14] reviewed some issues in the collection and analysis of warranty data and showed that if the N_x's and the numbers n_xt of age-t claims on units which entered service on day x are known, the estimate of λ_t is given by
λ̂_t = Σ_{x: x+t ≤ T} n_xt / Σ_{x: x+t ≤ T} N_x,   (4)
since E(n_xt) = N_x λ_t (0 ≤ x + t ≤ T). The estimate in (4) can also be obtained from the likelihood (3) if the reporting-lag probability f_l is ignored or known. Kalbfleisch and Lawless [7] and Lawless [2] give a comprehensive review of some methods for age-based analysis of warranty claims and costs, and for the estimation of failure distributions or rates. They defined the moment estimate of the expected number of claims for a unit at age a, λ̂(a), as
λ̂(a) = n_T(a) / N̂(a),
where n_T(a) = Σ_{d ≤ T-a} n_T(d, a) is the total number of age-a claims reported up to day T, and n_T(d, a) is the total reported number of claims at age a for the units sold on day d. Here N̂(a) = Σ_{d ≤ T-a} N(d) F̂(T - d - a), where N(d) ≥ 0 denotes the number of units sold on day d, F̂(r) = f(0) + f(1) + ... + f(r), and f(r) = Pr(a claim is reported r days after it occurs).
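The estimate in (4) is simply the ratio of observed age-t claims to the number of units that have been in service for at least t days by the current date T. A minimal sketch, assuming the N_x and n_xt counts are fully known and reporting lags are ignored, with made-up data:

```python
# Sketch of the age-based estimate (4): lambda_hat_t is the total number
# of age-t claims divided by the number of units that have been in service
# at least t days by the current date T. Data below are made up.

def age_based_rates(N, n, T):
    """N[x]: units entering service on day x; n[(x, t)]: claims at age t
    for the day-x cohort; only cohorts with x + t <= T contribute."""
    days = sorted(N)
    rates = {}
    for t in range(T + 1):
        exposed = sum(N[x] for x in days if x + t <= T)
        claims = sum(n.get((x, t), 0) for x in days if x + t <= T)
        if exposed > 0:
            rates[t] = claims / exposed
    return rates

N = {0: 500, 1: 400, 2: 600}                      # daily sales (hypothetical)
n = {(0, 0): 3, (0, 1): 5, (1, 0): 2, (1, 1): 4, (2, 0): 6}
print(age_based_rates(N, n, T=2))
```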
4. Aggregated warranty claims analysis
Sometimes manufacturers have warranty claims data only in aggregate form, and they analyze claim rates for their products by using these aggregated data. Trindade and Haugh [15] discussed the complexities involved in statistically estimating the reliability of computer components from field data on systems having different total operating times at any specific reference time of analysis. In a related paper, Baxter [16] describes a method of estimation from quasi life tables where no observations of the lifetimes of individual components have been recorded; rather, the numbers of components which fail between successive equally-spaced time points are recorded. In relation to the method of Baxter [16], Tortorella [17] studied a problem arising in the analysis of field reliability data generated by a repair process. The author constructed a pooled discrete renewal process model to estimate the reliability of a component and used a maximum-likelihood-like method to estimate the parameters.
5. Marginal counts of claims analysis
Due to the diffuse organization of service departments or repair service networks, and to reduce the data collecting and maintenance costs, Karim et al. [12, 13] and also Suzuki et al. [18] suggested a minimal database for product warranty data combining information from different sources for particular time periods. For example, they suggested using the monthly sales amounts N_y, y = 1, 2, ..., Y, provided by the sales department, and the number of claims registered for a given month, r_j, j = 1, 2, ..., T, provided by the service department. Let {r_yt} be the number of products sold in the y-th month which failed after t months (at age t), for t = 0, 1, ..., min(W - 1, T - y), where T (T ≥ Y) is the number of observed months and W is the length of the warranty period, and let
r_j = Σ_{y=max(1, j-W+1)}^{min(j, Y)} r_{y, j-y}
be the count of failures occurring in the j-th month; {r_j} is called the marginal count failure data. Karim et al. [12, 13] used a nonhomogeneous Poisson process to model the failure counts for the repairable products and assumed that for each sales month y, y = 1, 2, ..., Y, the r_yt, t = 0, 1, ..., min(T - y, W - 1), are independently distributed as Poisson with mean N_y λ_t, that is,
r_yt ~ Poisson(N_y λ_t),   (6)
where λ_t is the mean number of failures at age t per product. Under model (6), {r_j}, j = 1, 2, ..., T, are independently distributed as Poisson with mean m_j = Σ_{y=max(1, j-W+1)}^{min(j, Y)} N_y λ_{j-y}. Therefore, the observed-data log likelihood is
log L(λ_t; r_j) = Σ_{j=1}^{T} {-m_j + r_j log(m_j) - log(r_j!)}.   (7)
The unconstrained MLE of λ_t, derived by directly maximizing the log likelihood (7), is
λ̂_t = r_1 / N_1,  if t = 0;
λ̂_t = (r_{t+1} - Σ_{y=1}^{min(Y-1, t)} N_{y+1} λ̂_{t-y}) / N_1,  if t = 1, 2, ..., T - 1.   (8)
Karim et al. [12] and Karim and Suzuki [19] also derived the constrained MLE of λ_t via the Expectation-Maximization (EM) algorithm and discussed the properties of the estimators.
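The unconstrained MLE (8) can be computed as a simple forward recursion: each λ̂_t is obtained from the marginal count r_{t+1} after subtracting the expected contributions of earlier failure ages. The sketch below assumes all observed ages lie within the warranty period W; the sales and (noiseless) claim figures are made up, and the recursion recovers the generating rates exactly.

```python
# Sketch of the unconstrained MLE (8): a forward recursion recovering the
# age-specific failure rates lambda_t from marginal monthly claim counts
# r_j and monthly sales N_y. Assumes all observed ages lie within the
# warranty period W; sales and claims below are made up.

def marginal_count_mle(r, N):
    """r[j-1]: claims in month j (j = 1..T); N[y-1]: sales in month y."""
    T, Y = len(r), len(N)
    lam = []
    for t in range(T):                    # t = 0, 1, ..., T-1
        subtract = sum(N[y] * lam[t - y]  # N_{y+1} * lambda_hat_{t-y}
                       for y in range(1, min(Y - 1, t) + 1))
        lam.append((r[t] - subtract) / N[0])
    return lam

N = [100, 120, 90]                        # monthly sales N_1..N_3
lam_true = [0.02, 0.015, 0.01, 0.008]     # lambda_0..lambda_3
# Build noiseless marginal counts m_j from model (6), then invert them:
T = 4
r = [sum(N[y] * lam_true[j - 1 - y] for y in range(min(j, len(N))))
     for j in range(1, T + 1)]
print(marginal_count_mle(r, N))           # recovers lam_true exactly
```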
6. Warranty claims analysis by using covariates
Sometimes expected claims may depend on factors such as manufacturing conditions or the environment in which the product is used (Lindley and Singpurwalla [20]; Li [21]). The Poisson model (6) can be extended in the usual way to allow covariate analysis. Suppose that there is a vector of covariates z associated with different groups of products that were produced in different time periods or operated in different seasons. The expected number of claims at age t for products entering service in month y, {r_yt}, can be conveniently modeled in the log-linear form
E(r_yt) = N_y λ_t exp(z′β),   (9)
where β is a vector of regression parameters. From model (9), Karim and Suzuki [22] considered two models: Model M1 for the effects of operating seasons and Model M2 for the effects of manufacturing characteristics, where z specifies, respectively, different operating seasons and different production periods. Model M1 includes the model presented in Karim et al. [12] as a special case when β_s = 0 for all s, s = 1, 2, ..., S. Also, if we put S = 2, exp(β_1) = η and β_2 = 0, Model M1 becomes the model discussed in Karim and Suzuki [23], where only two different seasons in a year are assumed and the effect of the environment is to either increase or decrease the parameter λ_t by a common positive factor η. Model M2 becomes the model presented in Karim et al. [13] to detect a change-point.
7. Two-dimensional warranty
There are situations where several characteristics are used together as criteria for judging the warranty eligibility of a failed product. For example, for automobiles,
sometimes warranty coverage has both age and mileage limits, whichever occurs first; more specifically, the 5-year/50,000-mile protection plan. Moskowitz and Chun [24] suggest a Poisson regression model for the two-attribute warranty plan. They assumed that the number of events n_i under the two-attribute warranty policies is distributed as a Poisson,
n_i ~ Poisson(μ_i),   (10)
where the parameter μ_i = f(X_i, β), with i = 1, 2, ..., m and n_i = 0, 1, ..., ∞, is a regression function of the age and usage amounts, and β is the coefficient vector of the regression model. Lawless et al. [25] discussed methods to model the dependence of failures on age and mileage, and to estimate survival distributions and rates from warranty claims data using supplemental information about mileage accumulation. Singpurwalla and Wilson [26] propose an approach for developing probabilistic models in a reliability setting indexed by two variables, time and a time-dependent quantity such as amount of use. They used these variables in an additive hazard model. Suzuki [27] considers lifetime estimation measured in mileage, with age as a concomitant variable. Given the concomitant variable, the random variable of interest is assumed to have a normal distribution. Independently of Suzuki [27], Phillips and Sweeting [28] deal with the analysis of exponentially distributed warranty data with an associated variable having a gamma distribution as a concomitant variable.
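The flavor of the two-attribute model (10) can be illustrated with a log-linear mean in age and usage. The log link and the coefficient values below are assumptions chosen for illustration; Moskowitz and Chun's actual regression function f(X_i, β) may differ.

```python
# Illustrative two-attribute warranty model in the spirit of Eq. (10):
# claim counts n_i ~ Poisson(mu_i) with a log-linear mean in age and
# usage. The log link and coefficient values are assumptions.
import math

def poisson_negloglik(beta, data):
    """data: list of (age, usage, count); beta = (b0, b1, b2)."""
    b0, b1, b2 = beta
    nll = 0.0
    for age, usage, count in data:
        mu = math.exp(b0 + b1 * age + b2 * usage)
        # Poisson log-pmf: count*log(mu) - mu - log(count!)
        nll -= count * math.log(mu) - mu - math.lgamma(count + 1)
    return nll

data = [(1.0, 10.0, 2), (2.0, 25.0, 4), (3.0, 28.0, 3)]  # made-up records
print(poisson_negloglik((-1.0, 0.3, 0.01), data))
```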
8. Warranty costs analysis
There is an extensive volume of literature on the analysis of warranty costs. Robinson and McDonald [6] review the statistical literature on warranties relating to the cost of warranty, the advertising value of a warranty, the warranty as a product attribute, dealer relations, customer satisfaction, and reliability. Blischke and Scheuer [29] analyzed pro-rata and free replacement warranty policies from both the buyer's and the seller's points of view. Blischke and Scheuer [30] provide a further application of renewal theory to the analysis of the free replacement warranty from the seller's point of view. Nguyen and Murthy [31, 32] present a general model for repairable products for estimating both the warranty cost for a fixed lot size of sales and the number of units returned for repair in any time interval during their life cycle. Nguyen and Murthy [33] later reviewed free warranty policies for nonrepairable products and derived the expected total cost to the consumer and the expected total profit to the manufacturer over the product life cycle. More information on the analysis of warranty costs is given in Mamer [34, 35], Matthews and Moore [36], Balcer and Sahin [37], Frees [38], Blischke and Murthy [39], Sahin and Polatogu [40], Vintr [41] and Murthy and Djamaludin [8].
9. Sales lag and reporting lag analysis
There is little literature on the analysis of sales lag. Majeske et al. [42] and Lu [43] discussed the importance of the estimation of sales lag. Karim and Suzuki [44] modeled the warranty claims to estimate the parameters, the age-based expected number of claims, and the probability of sales lag, where the dates of sale of the products are unknown. They proposed a model based on follow-up information on the dates of sale to provide unique solutions for the parameters. A series of papers by Kalbfleisch and Lawless and their collaborators discussed methods for the analysis of reporting lag; references include Kalbfleisch and Lawless [45, 7], Kalbfleisch et al. [5], and Lawless [46, 2].
10. Forecasts of warranty claims
Like expected warranty costs, forecasts of warranty claims are also important to manufacturers. Articles by Robinson and McDonald [6], Kalbfleisch et al. [5], Chen et al. [47] and Lawless [2] deal with methods for forecasting warranty claims. Meeker and Escobar [48] (Chap. 12) and Escobar and Meeker [49] explain methods for computing predictions and prediction bounds for the number of failures in a future time interval.
11. Concluding remarks
This review has pointed out why field performance data, especially warranty claim data, are important, and has given a survey of the literature pertaining to the analysis of such data. The emphasis is on the analysis of minimal databases, constructed by combining information from different sources. The research should be applicable for those who are responsible for product reliability and product design decisions in manufacturing industries. Since the literature on product warranty data is vast, more work on this problem is needed and is expected to be performed in the future by the authors.
References
1. K. Suzuki, M. R. Karim and L. Wang, Handbook of Statistics: Advances in Reliability, eds. N. Balakrishnan and C. R. Rao, Elsevier Science, Vol. 20, 585 (2001).
2. J. F. Lawless, International Statistical Review, 66, No. 1, 41 (1998).
3. K. Suzuki, Journal of the American Statistical Association, 80, 68 (1985a).
4. K. Suzuki, Technometrics, 27, 263 (1985b).
5. J. D. Kalbfleisch, J. F. Lawless and J. A. Robinson, Technometrics, 33, 273 (1991).
6. J. A. Robinson and G. C. McDonald, in Data Quality Control: Theory and Pragmatics, eds. G. E. Liepins and V. R. R. Uppuluri, New York: Marcel Dekker (1991).
7. J. D. Kalbfleisch and J. F. Lawless, in Product Warranty Handbook, eds. W. R. Blischke and D. N. P. Murthy, New York: Marcel Dekker (1996).
8. D. N. P. Murthy and I. Djamaludin, Int. J. Production Economics, 79, 231 (2002).
9. J. D. Kalbfleisch and J. F. Lawless, Technometrics, 30, 365 (1988).
10. X. J. Hu and J. F. Lawless, Biometrika, 83, 747 (1996).
11. X. J. Hu and J. F. Lawless, Canadian Journal of Statistics, 25, 125 (1997).
12. M. R. Karim, W. Yamamoto and K. Suzuki, Lifetime Data Analysis, 7, 173 (2001a).
13. M. R. Karim, W. Yamamoto and K. Suzuki, J. of the Japanese Society for Quality Control, 31, 318 (2001b).
14. J. F. Lawless and J. D. Kalbfleisch, in Survival Analysis: State of the Art, eds. J. P. Klein and P. K. Goel, Kluwer Academic Publishers, 141 (1992).
15. D. C. Trindade and L. D. Haugh, Microelectronics Reliability, 20, 205 (1980).
16. L. A. Baxter, Biometrika, 81, No. 3, 567 (1994).
17. M. Tortorella, in Lifetime Data: Models in Reliability and Survival Analysis, eds. N. P. Jewell et al., Kluwer Academic Publishers, pp. 331 (1996).
18. K. Suzuki, W. Yamamoto, M. R. Karim and L. Wang, in Recent Advances in Reliability Theory: Methodology, Practice and Inference, eds. N. Limnios and M. Nikulin, Birkhauser: Boston, pp. 213 (2000).
19. M. R. Karim and K. Suzuki, Int. Journal of Statistical Sciences, 2, 1 (2003c).
20. D. V. Lindley and N. D. Singpurwalla, J. Appl. Prob., 23, 418 (1986).
21. L. Li, Lifetime Data Analysis, Vol. 6, 171 (2000).
22. M. R. Karim and K. Suzuki, Int. J. of Reliability and Application, 4, 79 (2003a).
23. M. R. Karim and K. Suzuki, J. of the Indian Statistical Association, 40, 143 (2002).
24. H. Moskowitz and Y. H. Chun, Naval Research Logistics, 41, 355 (1994).
25. J. F. Lawless, X. J. Hu and J. Cao, Lifetime Data Analysis, 1, 227 (1995).
26. N. D. Singpurwalla and S. P. Wilson, Adv. Appl. Prob., 30, 1058 (1998).
27. K. Suzuki, Rep. Stat. Appl. Res., 40, 10 (1993).
28. M. J. Phillips and T. J. Sweeting, J. of the Royal Statistical Society B, 58, 775 (1996).
29. W. R. Blischke and E. M. Scheuer, Naval Research Logistics Quarterly, 22, 681 (1975).
30. W. R. Blischke and E. M. Scheuer, Naval Research Logistics Quarterly, 28, 193 (1981).
31. D. G. Nguyen and D. N. P. Murthy, IIE Transactions, 16, 379 (1984a).
32. D. G. Nguyen and D. N. P. Murthy, Naval Research Logistics Quarterly, 31, 525 (1984b).
33. D. G. Nguyen and D. N. P. Murthy, Operations Research, 22, No. 2, 205 (1988).
34. J. W. Mamer, Naval Research Logistics, 29, 345 (1982).
35. J. W. Mamer, Management Science, 33, No. 7, 916 (1987).
36. S. Matthews and J. Moore, Econometrica, 55, 441 (1987).
37. Y. Balcer and I. Sahin, Operations Research, 34, 554 (1986).
38. E. W. Frees, Naval Research Logistics Quarterly, 33, 361 (1986).
39. W. R. Blischke and D. N. P. Murthy (eds.), Product Warranty Handbook, New York: Marcel Dekker (1996).
40. I. Sahin and H. Polatogu, Quality, Warranty and Preventive Maintenance, Boston: Kluwer Academic Publishers (1998).
41. Z. Vintr, Proceedings Annual Reliability and Maintainability Symposium, 183 (1999).
42. K. D. Majeske, T. L. Caris and G. Herrin, Int. J. of Production Economics, 50, 79 (1997).
43. M. W. Lu, Quality and Reliability Engineering International, 14, 103 (1998).
44. M. R. Karim and K. Suzuki, Pakistan Journal of Statistics, 20(1), 93 (2004).
45. J. D. Kalbfleisch and J. F. Lawless, Statistica Sinica, 1, 19 (1991).
46. J. F. Lawless, Canadian Journal of Statistics, 22, No. 1, 15 (1994).
47. J. Chen, N. J. Lynn and N. D. Singpurwalla, in Product Warranty Handbook, eds. W. R. Blischke and D. N. P. Murthy, New York: Marcel Dekker (1996).
48. W. Q. Meeker and L. A. Escobar, Statistical Methods for Reliability Data, John Wiley & Sons, Inc., New York (1998).
49. L. A. Escobar and W. Q. Meeker, Technometrics, 41, 113 (1999).
SIMULATED ANNEALING ALGORITHM FOR REDUNDANCY OPTIMIZATION WITH MULTIPLE COMPONENT CHOICES
HO-GYUN KIM
Dept. of Information & Industrial Engineering, Dong-Eui University, 995 Eomgwangno, Busanjin-gu, Busan, 614-714, Korea
CHANG-OK BAE and SUNG-YOUNG PARK
Dept. of Information & Industrial Engineering, Dong-Eui University, 995 Eomgwangno, Busanjin-gu, Busan, 614-714, Korea
This paper considers the series-parallel redundant reliability problem where each subsystem has multiple component choices. The subsystems are characterized by their reliability and resources such as cost and volume. If the resource constraints comprise nonlinear functions, the problem becomes an NP-hard combinatorial optimization problem. In this paper, a simulated annealing (SA) algorithm which determines the maximal-reliability configuration of a series-parallel system subject to the resources is proposed. To show its effectiveness, several test problems are solved and the results are compared with those of previous studies.
1. Introduction
In general, two methods are used to improve system reliability: (1) increasing the component reliabilities, and (2) providing component redundancy. Using these methods also causes an increase in additional resources such as cost, weight, volume, etc. Therefore, the design engineer has to decide on suitable component reliabilities and redundancy levels. Redundancy optimization determines the optimal redundancy levels of components in a system subject to several resource constraints. Tillman et al. [19] considered only one component choice in each subsystem for the redundancy optimization. However, for a more realistic system design, a variety of different component types should be considered. In this paper, we consider the series-parallel redundant reliability problem where each subsystem has multiple component choices. The subsystems are characterized by their reliability and resources. Some heuristics for the redundancy optimization problem have been developed. Chern & Jan [3] dealt with redundancy optimization for a series system where more than one component type is allowed for each subsystem. Fyffe et al. [8] proposed a dynamic programming approach with a Lagrangian multiplier to search for the optimal solution of the problem with two resource constraints. Nakagawa & Miyazaki [14] used a surrogate constraints algorithm for the problem with two resource constraints. Sung & Cho [18] used a branch-and-bound method based on a reduced solution space for the problem with only budget constraints and tested randomly generated numerical examples to show its efficiency. Hsieh [9] used a linear approximation for the
problem with two resource constraints and compared its performance with former studies. Ramirez-Marquez & Coit [16] proposed a heuristic approach to minimize system cost for the multi-state series-parallel system problem. The heuristic methods have the disadvantages that there is no way to improve a solution trapped at a local optimum and that they must be developed specifically for each problem characteristic. Therefore, metaheuristics such as GA (genetic algorithms), SA (simulated annealing) and TS (tabu search) are used to search for the optimal solutions of combinatorial optimization problems. Painton & Campbell [15], Coit & Smith [6] and Yokota et al. [20] used GA to search for optimal solutions of the problem and showed that GA provides better solutions. Ida et al. [10] also used GA for the problem with several failure modes. More related papers are covered in the excellent survey paper by Kuo & Prasad [12]. While several studies have used GA for optimal reliability design, there are few studies using the SA algorithm. Angus & Ames [1] used an SA algorithm to find the optimal redundancy levels minimizing system cost subject to reliability constraints. Ravi et al. [17] considered the optimal redundancy levels maximizing system reliability subject to cost, weight, and volume constraints. Kuo et al. [13] mentioned that SA has advantages for application to complex discrete optimization problems, but not many studies on optimal reliability design exist. For redundancy optimization with multiple component choices, only GA has been used. In this paper, an SA algorithm is presented to search for the optimal solution of the problem. To show its effectiveness, several test problems chosen from previous studies are evaluated. This paper is organized as follows. In Section 2, the redundancy optimization problem and some notation are briefly explained; in Section 3, the concept of the SA algorithm and its parameters are described; in Section 4, numerical examples chosen from previous studies are solved and discussed. Finally, conclusions and further studies are provided in Section 5.
2. Redundancy Optimization Problem
The general formulation of the series-parallel redundant reliability problem with multiple component choices, and some notation for the problem, are as follows:
Maximize R_s = f(x_i | k_i)
subject to g(x_i | k_i) ≤ b
- m: number of subsystems
- R_s: system reliability
- R_i: reliability of subsystem i (i = 1, 2, ..., m)
- x_i = (x_i1, x_i2, ..., x_ik_i)
- x_ik: number of the k-th component type used in subsystem i (k = 1, 2, ..., k_i)
- k_i: number of component choices for subsystem i
- q_ik: failure probability of the type-k component in subsystem i
- g_j: j-th constraint function
- W: upper limit on the weight of the system
- C: upper limit on the cost of the system
- W_i(x_i): total weight of subsystem i
- C_i(x_i): total cost of subsystem i
- b: upper limit on the resources of the system
Since the solution is represented by integer variables, the problem is an integer programming problem. If the resource constraints are composed of nonlinear functions, the problem becomes a nonlinear integer programming problem. Chern [4] determined the computational complexity of problems for series-parallel redundant systems and proved that these types are NP-hard combinatorial optimization problems. In this paper, an SA algorithm is used to search for optimal solutions of the problems, and some numerical problems chosen from previous studies are solved.
3. SA Algorithm
Metaheuristic methods have been developed to make up for the weak points of heuristic methods in searching for near-optimal solutions. Although these methods have different characteristics, many optimization and decision-making fields have used them because they have simple concepts and excellent searching performance over the solution space; the field of optimal reliability design is no exception. SA, one of the metaheuristics, was presented by Kirkpatrick et al. [11] and Cerny [2] as an alternative to local search, and it has been successfully applied to many combinatorial optimization problems. SA is an approach to searching for the global optimal solution that attempts to avoid entrapment in poor local optima by allowing an occasional uphill move to inferior solutions. In this paper, an SA algorithm which determines the maximal reliability of a series-parallel system with multiple component choices subject to the resources is proposed. To apply the SA algorithm to the problem, the representation of the solution and the energy function are to be determined, and the initial solution, initial temperature, cooling rate and stopping criterion are to be initialized.
3.1. Initialization Step
A solution of the problem should represent the redundancy levels and the component choices of each subsystem. Figure 1 shows the representation of a solution that has m subsystems in series. Each subsystem consists of several digits, equal in number to its component types, and each digit represents the parallel redundancy level of that component. For example, subsystem 1 has four types of components (k_1 = 4), and two components of the third type are connected in parallel.
Figure 1. Solution representation of the problem
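The digit-string encoding of Figure 1 maps naturally onto a list of integer vectors, one per subsystem, and, under the active-parallel assumption, subsystem reliability follows as R_i = 1 - prod_k q_ik^x_ik. A minimal sketch with hypothetical failure probabilities:

```python
# Sketch of the solution encoding of Figure 1: one integer vector per
# subsystem, x[i][k] = redundancy level of component type k in subsystem i.
# Failure probabilities q are hypothetical.

def subsystem_reliability(x_i, q_i):
    """Active parallel redundancy: R_i = 1 - prod_k q_ik ** x_ik."""
    prob_all_fail = 1.0
    for x_ik, q_ik in zip(x_i, q_i):
        prob_all_fail *= q_ik ** x_ik
    return 1.0 - prob_all_fail

def system_reliability(x, q):
    """Series system: product of subsystem reliabilities."""
    r = 1.0
    for x_i, q_i in zip(x, q):
        r *= subsystem_reliability(x_i, q_i)
    return r

# Subsystem 1 of the example in the text: four component types (k_1 = 4),
# two units of the third type in parallel -> encoded as [0, 0, 2, 0].
x = [[0, 0, 2, 0], [1, 0, 1]]            # two subsystems, for illustration
q = [[0.10, 0.06, 0.09, 0.11], [0.05, 0.04, 0.08]]
print(system_reliability(x, q))
```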
The energy function E, which evaluates the performance of the SA, uses the objective function of the problem; its value is zero if the solution violates the constraint functions. The initial solution is randomly generated. In the initialization step, a single component choice in each subsystem is allowed, to obtain a feasible solution more easily. The initial solution is evaluated by the energy function, and its energy value is assigned to both the current solution (X_C) and the best solution (X_B). Initial and final values of the control-parameter temperature, referred to as T_0 and T_F respectively, are specified. The number of iterations at each level of the current temperature T_C, referred to as L, is set to γ times the neighborhood size. This paper uses T_0 = 50, T_F = 1 and γ = 0.01.
3.2. Generation Step of a Feasible Neighborhood Solution
An efficient scheme to generate a neighborhood solution should be developed for the enormous solution space. A two-step generating scheme is proposed to generate a feasible neighborhood solution from the current solution. (1) Step 1 (pairwise-swapping scheme): Two positions are randomly chosen and their elements are exchanged if both are not zeros. In the case that this process sequentially generates infeasible solutions five times, go to Step 2. Figure 2 shows an example for a system consisting of 14 subsystems with three or four choices at each subsystem.
Figure 2. A case of the pairwise-swapping scheme.
(2) Step 2 (resource-based scheme): For two positions randomly chosen as in Step 1, one component is added to the current solution if resources are available; otherwise, one is subtracted. When resources are not available and the value of the position is zero, the value of the position is kept at zero. This scheme is applied to the system consisting of 14 subsystems in Figure 3.
Figure 3. A case of the resource-based scheme.
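A compact rendering of the two-step generation scheme is sketched below. The stop-after-five-infeasible-attempts rule of Step 1 and the add-else-subtract rule of Step 2 are kept; the position sampling and the feasibility test are simplified assumptions rather than the authors' exact procedure.

```python
# Sketch of the two-step neighborhood generation scheme. The feasibility
# test and position sampling are simplified assumptions.
import copy
import random

def random_positions(x):
    """Pick two distinct (subsystem, component-type) positions."""
    pos = [(i, k) for i in range(len(x)) for k in range(len(x[i]))]
    return random.sample(pos, 2)

def neighbor(x, feasible):
    # Step 1 (pairwise swapping): swap two nonzero elements; after five
    # consecutive unsuccessful attempts, fall through to Step 2.
    for _ in range(5):
        y = copy.deepcopy(x)
        (i1, k1), (i2, k2) = random_positions(y)
        if y[i1][k1] and y[i2][k2]:
            y[i1][k1], y[i2][k2] = y[i2][k2], y[i1][k1]
            if feasible(y):
                return y
    # Step 2 (resource-based): add a component where resources allow,
    # otherwise subtract one; zero positions are kept at zero.
    y = copy.deepcopy(x)
    for (i, k) in random_positions(y):
        y[i][k] += 1
        if not feasible(y):
            y[i][k] = max(y[i][k] - 2, 0)
    return y if feasible(y) else x
```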
3.3. Evaluation Step for Acceptance of Neighborhood Solution
If the energy function value of the neighborhood solution is greater than or equal to that of the current solution (E_N ≥ E_C), the neighborhood solution replaces the current solution. Then this neighborhood solution's energy function value is compared with that of the best solution found thus far (E_B); if E_N > E_B, the best solution is replaced by the neighborhood solution. Otherwise, if E_N < E_C, whether or not to accept the neighborhood solution is determined by the acceptance probability P(A) = exp(-ΔE/T), where ΔE = E_C - E_N is the difference between the energy function values of the current solution and the neighborhood solution.
3.4. Increment Step of Iteration Counter
Increase the iteration counter N by one and return to the generation step of a neighborhood solution. If the value of N is greater than or equal to the maximum number of iterations for each temperature level (L), proceed to the cooling schedule step.
3.5. Cooling Schedule and Stopping Step
The temperature is adjusted by its cooling rate α, calculated as T_C = αT_{C-1} (C = 1, 2, ...). If the new value of T_C is greater than or equal to the stopping value T_F (T_C ≥ T_F), then reset N to one and return to the generation step of a feasible neighborhood solution; otherwise, stop. This paper uses α = 0.98.
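Assembled, Sections 3.1-3.5 describe a standard SA loop. The skeleton below wires the steps together with the stated parameters (T_0 = 50, T_F = 1, α = 0.98, L = γ times the neighborhood size with γ = 0.01); the energy and neighbor functions are the ones sketched above, and the listing is an illustrative reconstruction, not the authors' code.

```python
# Skeleton of the SA loop of Sections 3.1-3.5; an illustrative
# reconstruction, not the authors' code. `energy` returns the system
# reliability (0 for infeasible solutions); `neighbor` takes the current
# solution and returns a feasible neighbor (e.g., a wrapper around the
# two-step scheme sketched above).
import math
import random

def simulated_annealing(x0, energy, neighbor, neighborhood_size,
                        t0=50.0, tf=1.0, alpha=0.98, gamma=0.01):
    x_c = x0                      # current solution
    x_b, e_b = x0, energy(x0)     # best solution found so far
    e_c = e_b
    L = max(1, int(gamma * neighborhood_size))
    t = t0
    while t >= tf:
        for _ in range(L):
            x_n = neighbor(x_c)
            e_n = energy(x_n)
            if e_n >= e_c:
                x_c, e_c = x_n, e_n              # accept improvement
                if e_n > e_b:
                    x_b, e_b = x_n, e_n          # update best solution
            elif random.random() < math.exp(-(e_c - e_n) / t):
                x_c, e_c = x_n, e_n              # occasional uphill move
        t *= alpha                               # geometric cooling
    return x_b, e_b
```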
Table 1. Component data for the example.
4. Numerical Experiments
To evaluate the performance of the SA algorithm, some numerical experiments are conducted.
4.1. Examples
The example used in this paper is taken from Fyffe et al. [8]. The problem has 14 subsystems in series, each with three or four component choices. The objective of the problem is to maximize the system reliability subject to two resource constraints: cost (≤ 130) and weight (≤ 190). Table 1 shows the component data of the experiment for the problem. The problem (P) is as follows:
Maximize R_s
s.t. Σ_{i=1}^{14} C_i(x_i) ≤ 130
Σ_{i=1}^{14} W_i(x_i) ≤ 190
x_ik ∈ nonnegative integers
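Problem (P) maps onto the SA components as an energy function that returns the series-system reliability when both budgets hold and zero otherwise, as described in Section 3.1. In the sketch below, the per-unit cost and weight are assumed to be additive in the component counts, standing in for the Table 1 data:

```python
# Energy function for problem (P): system reliability if the cost and
# weight budgets hold, zero otherwise. Additive per-unit cost/weight is
# an assumption standing in for the Table 1 data.

def make_energy(q, cost, weight, c_max=130, w_max=190):
    """q[i][k], cost[i][k], weight[i][k]: per-unit data for component
    type k in subsystem i; returns energy(x) for encoded solutions x."""
    def total(rates, x):
        return sum(r * n
                   for row, xs in zip(rates, x)
                   for r, n in zip(row, xs))
    def energy(x):
        if total(cost, x) > c_max or total(weight, x) > w_max:
            return 0.0                       # infeasible solutions get E = 0
        r = 1.0
        for q_i, x_i in zip(q, x):
            fail = 1.0
            for q_ik, x_ik in zip(q_i, x_i):
                fail *= q_ik ** x_ik
            r *= 1.0 - fail                  # series product of subsystems
        return r
    return energy

# Usage: energy = make_energy(q, cost, weight); then pass `energy` and a
# neighbor function to the simulated_annealing skeleton above.
```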
Many researchers have studied the two types of problems, with single and with multiple component choices, since the problem was presented by Fyffe et al. [8]. Fyffe et al. [8], Yokota et al. [20] and Coit [7] considered the problem with a single component choice at each subsystem, for which it is possible to obtain the global optimal solution. Nakagawa & Miyazaki [14], Coit & Smith [5, 6] and Hsieh [9] considered the problem with multiple component choices. In this paper, both types of problems are solved to evaluate the SA algorithm. The SA algorithm is coded in C++, and the numerical experiments are executed on an IBM-PC compatible with a Pentium IV 2.4 GHz.
4.2. Result of Experiments
To evaluate the SA algorithm, we first explored the optimal solution of problem (P) with a single component choice. Table 2 shows the optimal solution, which is identical to the result of Yokota et al. [20]. [0030, 200, 0003, 003, 030, 0200, 200, 400, 0020, 030, 200, 4000, 020, 0020] is the solution representation of the SA algorithm. The system reliability, i.e., the value of the objective function, is 0.970015, and the remaining resources are C = 11 and W = 0.
Table 2. Optimal solution of problem (P) with single component choice.
Subsystem        1  2  3  4  5  6  7  8  9  10 11 12 13 14
Choice           3  1  4  3  2  2  1  1  3  2  1  1  2  3
# of redundancy  3  2  3  3  3  2  2  4  2  3  2  4  2  2
Table 3. Comparison of the best solutions of problem (P) with multiple component choices.
                          R_s       C    W
Nakagawa & Miyazaki [14]  0.9584    132  189
Coit & Smith [5]          0.98603   129  190
Hsieh [9]                 0.986316  130  190
The SA algorithm          0.986316  130  190
Secondly, we explored the optimal solution of problem (P) with multiple component choices. Table 3 compares the best solution of the SA algorithm with those of former studies. The solution presented by Nakagawa & Miyazaki [14] is an infeasible solution violating the cost constraint. [0030, 200, 0003, 004, 030, 0200, 300, 400, 1100, 012, 101, 4000, 020, 0011] is the solution representation of the SA algorithm. The system reliability is 0.986316, and the remaining resources are C = 0 and W = 0 for the best solution. Over 30 repetitions of the experiment, the average value of the objective function is 0.984656. In both cases, the CPU time for the experiment is within 1 second. The SA algorithm is very efficient in searching for the optimal solution of the problems.
5. Conclusions
This paper considered the series-parallel redundant reliability problem where each subsystem has multiple component choices. The objective of the problem is to maximize the reliability of the series-parallel redundant system subject to nonlinear resource constraints. An SA algorithm to search for the optimal solution of the problems has been proposed, and a generating scheme for neighborhood solutions has been presented. To evaluate the performance of the SA algorithm, numerical experiments have been conducted and compared with the best solutions of former studies. The best solution of the SA algorithm is better than the previous best solutions. As reported in this paper, the SA algorithm is a very efficient method for redundancy optimization problems with multiple component choices. We assumed a system with one failure mode; to reflect a more practical model, we should consider various failure modes of the components in each subsystem. Moreover, to evaluate the SA algorithm further, more examples should be considered. We leave these for further study.
References
1. J. E. Angus and K. A. Ames, "Simulated annealing algorithm for system cost minimization subject to reliability constraints", Communications in Statistics: Simulation and Computation 26(2), (1997), pp. 783-790.
2. V. Cerny, "Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm", Journal of Optimization Theory and Applications 45, (1985), pp. 41-51.
3. C. S. Chern and R. H. Jan, "Reliability optimization problems with multiple constraints", IEEE Transactions on Reliability 35(4), (1986), pp. 431-436.
4. M. S. Chern, "On the computational complexity of reliability redundancy allocation in a series system", Operations Research Letters 11, (1992), pp. 309-315.
5. D. W. Coit and A. E. Smith, "Reliability optimization of series-parallel systems using a genetic algorithm", IEEE Transactions on Reliability 45(2), (1996a), pp. 254-266.
6. D. W. Coit and A. E. Smith, "Penalty guided genetic search for reliability design optimization", Computers and Industrial Engineering 30(4), (1996b), pp. 895-904.
7. D. W. Coit, "Cold-standby redundancy optimization for nonrepairable systems", IIE Transactions 33(6), (2001), pp. 471-478.
8. D. E. Fyffe, W. W. Hines and N. K. Lee, "System reliability allocation and a computational algorithm", IEEE Transactions on Reliability R-17, (1968), pp. 64-69.
9. Y. C. Hsieh, "A linear approximation for redundant reliability problems with multiple component choices", Computers and Industrial Engineering 44, (2002), pp. 91-103.
10. K. Ida, M. Gen and T. Yokota, "System reliability optimization with several failure modes by genetic algorithm", Proceedings of the 16th International Conference on Computers and Industrial Engineering, (1994), pp. 349-352.
11. S. Kirkpatrick, C. D. Gelatt and M. P. Vecchi, "Optimization by simulated annealing", Science 220, (1983), pp. 671-680.
12. W. Kuo and V. R. Prasad, "An annotated overview of system-reliability optimization", IEEE Transactions on Reliability 49(2), (2000), pp. 176-187.
13. W. Kuo, V. R. Prasad, F. A. Tillman and C. L. Hwang, Optimal Reliability Design: Fundamentals and Applications (Cambridge University Press, Cambridge, 2001).
14. Y. Nakagawa and S. Miyazaki, "Surrogate constraints algorithm for reliability optimization problems with two constraints", IEEE Transactions on Reliability 30(2), (1981), pp. 175-180.
15. L. Painton and J. Campbell, "Genetic algorithms in optimization of system reliability", IEEE Transactions on Reliability 44(2), (1995), pp. 172-178.
16. J. E. Ramirez-Marquez and D. W. Coit, "A heuristic for solving the redundancy allocation problem for multi-state series-parallel systems", Reliability Engineering and System Safety 83, (2004), pp. 341-349.
17. V. Ravi, B. Murty and P. Reddy, "Nonequilibrium simulated-annealing algorithm applied to reliability optimization of complex systems", IEEE Transactions on Reliability 46(2), (1997), pp. 233-239.
18. C. S. Sung and Y. K. Cho, "Reliability optimization of a series system with multiple-choice and budget constraints", European Journal of Operational Research 127, (2000), pp. 159-171.
19. F. A. Tillman, C. L. Hwang and W. Kuo, "Determining component reliability and redundancy for optimum system reliability", IEEE Transactions on Reliability 26(3), (1977b), pp. 162-165.
20. T. Yokota, M. Gen and Y. X. Li, "Genetic algorithm for non-linear mixed integer programming problems and its application", Computers and Industrial Engineering 30(4), (1996), pp. 905-917.
ESTIMATION OF FAILURE INTENSITY AND MAINTENANCE EFFECTS WITH EXPLANATORY VARIABLES
JONG WOON KIM
Technical Research Institute, Rotem Company, 80-10, Mabuk-Ri, Guseong-Eup, Yongin-Shi, Gyunggi-Do, 449-910, Korea
WON YOUNG YUN
Department of Industrial Engineering, Pusan National University, 30 Changjeon-Dong, Kumjeong-Ku, Pusan 609-735, Korea
SANG CHEL HAN
Technical Research Institute, Rotem Company, 80-10, Mabuk-Ri, Guseong-Eup, Yongin-Shi, Gyunggi-Do, 449-910, Korea
Maintenance effect is a peculiar factor applied to repairable systems. Malik [10] and Brown, Mahoney and Sivazlian (BMS) [2] proposed general approaches to model the maintenance effect, where each maintenance reduces the age of a unit with respect to the rate of occurrence of failures. The main difference between Malik's and BMS's approaches is that each maintenance is assumed to reduce the elapsed time from the previous maintenance in Malik's model (Model I), whereas maintenance is assumed to reduce the system age in BMS's model (Model II). In this article, the maintenance effect is represented by an age reduction function of explanatory variables such as maintenance time and cost. It describes situations in which the more we invest in maintenance, the higher its effect is. The method of maximum likelihood is used to estimate the age reduction function and the Weibull intensity function. Both Model I and Model II and two cases are considered: Case A - effective CM without PM, Case B - minimal CM and effective PM. Simulation results are presented to illustrate the accuracy and properties of the estimation.
1 Introduction
Conventional statistical analysis of failure times relies on one of the two following extreme assumptions: the state of the system after maintenance is either as "good as new" (GAN, perfect maintenance model) or as "bad as old" (BAO, minimal maintenance model). Under the GAN assumption the failure process follows a renewal process, and under BAO the failure process follows a non-homogeneous Poisson process. It is well known in practice that maintenance may not yield a functioning item which is as good as new. On the other hand, the minimal maintenance assumption seems too pessimistic for realistic maintenance strategies. From this it is seen that imperfect maintenance is of great significance in practice. In the imperfect maintenance model of Pham and Wang [12], a maintenance action does not make a system GAN, but younger. It is usually assumed that an imperfect maintenance restores the operating state of the system to somewhere between GAN and BAO. In the imperfect maintenance model, the failure intensity (rate) can thus be reduced by maintenance.
Malik proposed an approach to model the improvement effect of maintenance, where maintenance is assumed to reduce the operating time elapsed from the previous maintenance in proportion to its age (Model I) [10]. On the other hand, in BMS's approach it is assumed that maintenance reduces the system age (Model II) [2]. The parameter p, which is called the age reduction factor or the improvement factor, describes the maintenance effect in Models I and II. If p goes to 0, the state of the maintained unit is almost the same as that before maintenance, and if p goes to 1, a maintenance renews the unit. The age reduction factor is assumed to be fixed in the existing studies. In this article, however, the age reduction factor is assumed to be a function of explanatory variables such as maintenance time and cost. We propose a procedure to estimate the maintenance effect with explanatory variables and the parameters of the lifetime distribution. The method of maximum likelihood is used to estimate the age reduction function and the intensity function. Both Model I and Model II and two cases are considered: Case A - effective CM (Corrective Maintenance) without PM (Preventive Maintenance), Case B - minimal CM and effective PM. Simulation results are presented to illustrate the accuracy of the estimation and the properties of the models.
2 Literature Review
Higgins and Tsokos studied an estimation problem of the failure rate under a minimal repair model using the quasi-Bayes method [6]. Tsokos and Rao considered an estimation problem for the failure intensity under the power-law process [15]. Coetzee proposed a method for parameter estimation and cost models of the non-homogeneous Poisson process under minimal repair [5]. Park and Pickering [11] studied the problem of estimating parameters of the failure process with failure data from multiple systems. Whitaker and Samaniego [16] estimated the lifetime distribution under the Brown-Proschan imperfect repair model [3]. It is assumed that the data pairs (T_i, Z_i) are given, where T_i is a failure time and Z_i is a Bernoulli variable that records the mode of repair (perfect or imperfect). Lim studied an estimation problem using the EM algorithm when masked data (Z_i is unknown) are given under the Brown-Proschan imperfect repair model [8]. Lim and Lie extended Lim's work and considered first-order dependency between two consecutive repair modes [9]. Shin, Lim and Lie proposed a method for estimating the maintenance effect and the intensity function in Malik's model [14]. Jack estimated lifetime parameters and the degree of age rejuvenation when a machine is minimally repaired on failures and imperfect preventive maintenance is also carried out [7]. Pulcini used the Bayesian approach to estimate the overhaul effect and the intensity function under minimal corrective maintenance and effective preventive maintenance [13]. Baxter, Kijima and Tortorella and Calabria and Pulcini dealt with some properties of the stochastic point process for the analysis of repairable units [1, 4].
3 Parameter Estimation
3.1. Effective CM without PM

In this section we consider the case in which only the effect of CM is assumed to exist, and PM is not performed. Then the likelihood function is given by:

L = \frac{R(V^-(\tau^*))}{R(V^+(t_d))} \prod_{k=1}^{d} \frac{f(V^-(t_k))}{R(V^+(t_{k-1}))}    (1)

where d is the number of failures, t_k is the kth failure time, \tau^* is the censoring time, V^-(t) is the virtual age just before maintenance, and V^+(t) is the virtual age right after maintenance. Applying the Weibull distribution with scale parameter \alpha and shape parameter \beta, R(t) = \exp\{-(t/\alpha)^{\beta}\}, for the lifetime distribution, the likelihood function becomes

L = \exp\left\{ \left(\frac{V^+(t_d)}{\alpha}\right)^{\beta} - \left(\frac{V^+(t_d) + \tau^* - t_d}{\alpha}\right)^{\beta} \right\} \prod_{k=1}^{d} \frac{\beta}{\alpha} \left(\frac{V^-(t_k)}{\alpha}\right)^{\beta-1} \exp\left\{ \left(\frac{V^+(t_{k-1})}{\alpha}\right)^{\beta} - \left(\frac{V^-(t_k)}{\alpha}\right)^{\beta} \right\}    (2)

and the log-likelihood follows by taking logarithms. The virtual ages in Eq. (2) are calculated for Model I and Model II as follows:

Model I:

V^-(t_k) = t_k - t_{k-1} + \sum_{r=1}^{k-1} (t_r - t_{r-1})(1 - p(x_r)), \quad V^+(t_k) = \sum_{r=1}^{k} (t_r - t_{r-1})(1 - p(x_r))    (3)

Model II:

V^-(t_k) = t_k - t_{k-1} + V^+(t_{k-1}), \quad V^+(t_k) = (1 - p(x_k)) V^-(t_k) = \sum_{r=1}^{k} (t_r - t_{r-1}) \prod_{j=r}^{k} (1 - p(x_j))    (4)

where x_r denotes the explanatory variables of the rth maintenance.
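To make the recursions in Eqs. (3) and (4) concrete, the following minimal Python sketch computes the virtual ages just before and right after each maintenance for both models; the function name and argument layout are our own illustration, not taken from the paper.

```python
def virtual_ages(times, p, model="I"):
    """Virtual ages under Malik's Model I and BMS's Model II (Eqs. (3)-(4)).

    times: maintenance (failure) times t_1 < t_2 < ... < t_d
    p:     age reduction factors p(x_k), one per maintenance
    Returns a list of pairs (V^-(t_k), V^+(t_k)).
    """
    v_plus, prev_t, ages = 0.0, 0.0, []
    for t_k, p_k in zip(times, p):
        v_minus = v_plus + (t_k - prev_t)        # age just before maintenance
        if model == "I":
            # Model I: only the time elapsed since the last maintenance is reduced
            v_plus = v_plus + (1.0 - p_k) * (t_k - prev_t)
        else:
            # Model II: the whole virtual age is reduced
            v_plus = (1.0 - p_k) * v_minus
        ages.append((v_minus, v_plus))
        prev_t = t_k
    return ages
```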
3.2. Minimal CM and Effective PM

[Figure 1. Minimal CM and effective PM: a timeline of PM intervals 1, ..., n+1, showing the PM times, the failure times t_{i,j} within each interval, and the censoring time.]
It is assumed in this section that the effect of CM is minimal while that of PM is effective. Figure 1 describes the maintenance process and notation. Let n be the number of preventive maintenances performed, \tau_i the ith preventive maintenance time, r_i the number of failures in the ith interval, and t_{i,j} the jth failure time in the ith interval. The likelihood function is given by:
L = \prod_{i=1}^{n+1} \left[ \prod_{j=1}^{r_i} \lambda(V^+(\tau_{i-1}) + t_{i,j} - \tau_{i-1}) \right] \exp\left\{ -\sum_{i=1}^{n+1} \left[ \Lambda(V^-(\tau_i)) - \Lambda(V^+(\tau_{i-1})) \right] \right\}    (5)

where \lambda(\cdot) and \Lambda(\cdot) are the failure intensity and the cumulative intensity, \tau_0 = 0, and the virtual age just before the censoring time is V^-(\tau_{n+1}) = V^+(\tau_n) + \tau^* - \tau_n. Applying the Weibull intensity \lambda(t) = (\beta/\alpha)(t/\alpha)^{\beta-1} for the lifetime distribution gives the likelihood and log-likelihood functions of Eq. (6). The virtual ages in Eq. (6) are calculated as follows:

Model I:

V^+(\tau_i) = V^+(\tau_{i-1}) + (1 - p(x_i))(\tau_i - \tau_{i-1})    (7)

Model II:

V^+(\tau_i) = (1 - p(x_i))(V^+(\tau_{i-1}) + \tau_i - \tau_{i-1})    (8)
4 Experimental Results
Simulations are carried out to investigate the accuracy of the parameter estimation in Model I and Case A for the maintenance effect model with explanatory variables. The following age reduction function is used for the simulation:

p(x) = 1 - \exp(-ax)    (9)
In the limit as a \to \infty, p goes to 1, and if a = 0, then p = 0. An example of the function for the age reduction factor is given in Figure 2. Table 1 shows the result of the experiment performed to investigate the effect of the number of units. The input values of the parameters are set to \alpha = 2, \beta = 2, and a = 2. As might be expected, the estimation becomes more accurate as the number of units increases. In the second experiment, the effect of the shape and scale parameters is investigated, and Table 2 shows the results. It is noticeable that the estimation of a becomes more accurate as the convexity of the intensity function increases and the value of the scale parameter \alpha gets smaller.
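As an illustration, a failure history under Model I with the Weibull intensity and the age reduction function of Eq. (9) can be simulated by inverse-transform sampling from the conditional survival function R(v + s)/R(v). The sketch below is a minimal example; the function names and the use of a fixed explanatory variable x per maintenance are assumptions made only for illustration.

```python
import math, random

def age_reduction(x, a):
    # Eq. (9): the more maintenance effort x, the larger the effect p(x)
    return 1.0 - math.exp(-a * x)

def simulate_model_I(alpha, beta, a, x, t_end, rng=random):
    """Failure times on [0, t_end] under Model I with Weibull intensity."""
    t, v_plus, failures = 0.0, 0.0, []
    while True:
        u = rng.random()
        # next gap s solves R(v+s)/R(v) = u with R(t) = exp{-(t/alpha)^beta}
        s = alpha * ((v_plus / alpha) ** beta - math.log(u)) ** (1.0 / beta) - v_plus
        if t + s > t_end:
            return failures
        t += s
        failures.append(t)
        # Model I: maintenance reduces the elapsed time s by the factor p(x)
        v_plus += (1.0 - age_reduction(x, a)) * s
```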
[Figure 2. An example of the age reduction function (horizontal axis: x).]
Table 1. Mean, Std and MSE of the estimates as a function of the number of units.
Table 2. Mean, Std and CV of the estimates as a function of the scale and shape parameters.
5 Concluding Remarks
In this study, a feasible procedure to evaluate the maintenance effect is investigated when the failure process is completely unknown. In the existing articles the maintenance effect is assumed to be a constant. However, we consider a situation in which the more we invest in maintenance, the higher the maintenance effect is. In this article, the age reduction factor is assumed to be a function of explanatory variables such as maintenance time and cost. BMS's model and Malik's model are used and a Weibull process is considered for two cases: Case A - effective CM without PM, Case B - minimal CM and effective PM. From the simulation experiment, it is found that as the number of units increases, the estimation becomes more accurate. It is noticeable that the estimation of the parameter in the age reduction function becomes more accurate as the convexity of the intensity function increases and the value of the scale parameter gets smaller.
References
1. L.A. Baxter, M. Kijima and M. Tortorella, A Point Process Model for the Reliability of Maintained Systems Subject to General Repair. Commun. Statist. - Stochastic Models. 12, 37-65 (1996).
2. J.F. Brown, J.F. Mahoney and B.D. Sivazlian, Hysteresis Repair in Discounted Replacement Problems. IIE Transactions. 15, 156-165 (1983).
3. M. Brown and F. Proschan, Imperfect Repair. Journal of Applied Probability. 20, 851-859 (1983).
4. R. Calabria and G. Pulcini, Discontinuous Point Process for the Analysis of Repairable Units. International Journal of Reliability, Quality and Safety Engineering. 6, 361-382 (1999).
5. J.L. Coetzee, The Role of NHPP Models in the Practical Analysis of Maintenance Failure Data. Reliability Engineering & System Safety. 56, 161-168 (1997).
6. J.J. Higgins and C.P. Tsokos, A Quasi-Bayes Estimate of the Failure Intensity of a Reliability-Growth Model. IEEE Transactions on Reliability. R-30, 471-475 (1981).
7. N. Jack, Analyzing Event Data from a Repairable Machine Subject to Imperfect Preventive Maintenance. Quality and Reliability Engineering International. 13, 183-186 (1997).
8. T.J. Lim, Estimating System Reliability with Fully Masked Data under Brown-Proschan Imperfect Repair Model. Reliability Engineering & System Safety. 59, 217-289 (1998).
9. T.J. Lim and C.H. Lie, Analysis of System Reliability with Dependent Repair Modes. IEEE Transactions on Reliability. 49, 153-162 (2000).
10. M.A.K. Malik, Reliable Preventive Maintenance Scheduling. AIIE Transactions. 11, 221-228 (1979).
11. W.J. Park and E.H. Pickering, Statistical Analysis of a Power-Law Model for Repair Data. IEEE Transactions on Reliability. 46, 27-30 (1997).
12. H. Pham and H. Wang, Imperfect Maintenance. European Journal of Operational Research. 94, 425-438 (1996).
13. G. Pulcini, On the Overhaul Effect for Repairable Mechanical Units: a Bayes Approach. Reliability Engineering & System Safety. 70, 85-94 (2000).
14. I. Shin, T.J. Lim and C.H. Lie, Estimating Parameters of Intensity Function and Maintenance Effect for Repairable Unit. Reliability Engineering & System Safety. 54, 1-10 (1996).
15. C.P. Tsokos and A.N.V. Rao, Estimation of Failure Intensity for the Weibull Process. Reliability Engineering & System Safety. 45, 211-275 (1994).
16. L.R. Whitaker and F.J. Samaniego, Estimating the Reliability of Systems Subject to Imperfect Repair. Journal of the American Statistical Association. 84, 301-309 (1989).
ECONOMIC IMPACTS OF GUARD BANDING ON DESIGNING INSPECTION PROCEDURES
YOUNG JIN KIM†
Department of Systems Management and Engineering, Pukyong National University, Pusan 608-739, Republic of Korea
MYUNG SOO CHA
Department of Industrial Engineering, Kyungsung University, Pusan 608-736, Republic of Korea
One of the most important aspects in designing production systems is implementing an adequate inspection procedure to ensure product quality. Our literature study indicates that the design of inspection procedures based on economic considerations deserves more extensive investigation. Two approaches, the use of a guard band and the selection of the inspection precision level, are simultaneously examined for designing economic inspection procedures. By incorporating these two approaches, this paper proposes an optimization scheme for the economic design of inspection procedures.
1 INTRODUCTION
The implementation of a complete inspection scheme for removing non-conforming products has become a common practice, especially in high-tech manufacturing industries. Inspection procedures involve a measurement of the quality characteristic of interest, since decisions regarding conformance to specifications are usually made on the basis of the realized measurement of the quality characteristic. As more companies strive to improve product quality, enhancement of the measurement procedures associated with the complete inspection scheme has become an integral part of quality improvement. However, measurement errors are inevitable due to the variations in operators and/or devices, regardless of how well measurement procedures are designed or maintained. Since significant variability in measurement procedures may lead to a wrong interpretation of product quality, understanding the notion of measurement variability may be a steppingstone for quality improvement. From this perspective, the study of measurement variability has recently drawn particular attention from researchers in the context of the so-called gauge study. See, for example, Lin et al. (1997), Mader et al. (1999), McCarville and Montgomery (1996), Montgomery and Runger (1993a, b), Tsai (1988), and Vardeman and VanValkenburg (1999). Measurement errors due to variability in measurement procedures may create economic penalties. For example, suppose that a nonconforming product is falsely accepted and shipped to the customer.
† To whom correspondence should be addressed. (Tel) +82.51.620.1555; (Fax) +82.51.620.1546; (e-mail) [email protected]
A monetary loss may then be incurred to replace the defects. On the other hand, rejection costs, such as scrap and rework costs, may also be incurred by the manufacturer for falsely rejected conforming products. In this regard, Mader et al. (1999) recently evaluated the economic impacts of measurement errors for the complete inspection plan. There has also been a great deal of research effort to reduce the impacts of measurement errors. The most immediate approach to control measurement errors may be the selection of the measurement precision level (or measurement variability), since economic penalties associated with measurement errors may be avoided by improving the measurement precision (or reducing the measurement variability). As more precise measurement devices and/or better-trained operators are required, an increasing inspection cost is incurred to reduce economic penalties by improving the measurement precision. Thus, there is a need for a tradeoff among the cost factors associated with measurement errors and the selection of the measurement precision level. Readers are referred to Chandra and Schall (1988), Chen and Chung (1996), and Tang and Schneider (1988). The second approach to reducing the impacts of measurement errors is the use of a guard band. Since pioneered by Eagle (1954) and Grubbs and Coon (1954), various economic aspects of guard banding have been further examined by several researchers. Deaver (1995) provided a comparative study of several strategies when using a guard band for reducing the impacts of measurement errors, whereas McCarville and Montgomery (1996) used an experimental design approach for finding the optimal guard bands for serial gauges. This paper differs from the previous studies in two ways. First, our careful examination of the gauge study literature reveals that most of the previous studies are mainly concerned with evaluating and analyzing the adequacy of measurement procedures during the manufacturing stage. Although the issue of the adequacy of measurement procedures is clearly important, many manufacturing processes may benefit more from designing enhanced measurement procedures during an early design stage. Second, the common objective of the two different approaches, the selection of the measurement precision level and the use of a guard band, is to minimize the economic penalties associated with measurement errors. Although both approaches pursue the same goal, they have been studied as separate issues in different contexts. To address these issues, this paper proposes an optimization scheme for the design of the most economical measurement procedures, simultaneously determining both the optimum precision level and the guard band to reduce the impacts of measurement errors. The remainder of this paper is organized as follows: The effects of the guard band and the precision level on measurement errors are examined and the resultant economic impacts of measurement errors are discussed. An optimization model for the economic design of measurement procedures in terms of the selection of the guard band and the precision level is then proposed and demonstrated through a numerical example along with a sensitivity analysis. Conclusions are drawn in the last section.
2 MEASUREMENT PRECISION
The performance of measurement procedures is generally judged on the basis of precision and accuracy. Measurement accuracy describes the difference between the actual value of a quality characteristic and the mean of the measured values. On the other hand, measurement precision is directly related to the variability of the measurement procedure. To demonstrate the concepts of measurement accuracy and precision, the probability distributions of two measurement procedures are depicted in Figure 1, where \mu_i and \sigma_i represent the mean and standard deviation of process distribution i, i = A, B, and a denotes the actual value of the quality characteristic to be measured. It can be seen that process A is superior to process B in terms of accuracy since |\mu_A - a| < |\mu_B - a|, whereas process B is more precise than process A since \sigma_B < \sigma_A. The accuracy of measurement procedures can be improved to some extent through calibration at little or no additional cost, while variability reduction usually causes an increase in inspection cost since more precise measuring devices and/or better-trained operators are required. Thus, the inspection cost may be expressed as a function of the measurement precision. The relationship between the inspection cost and the measurement precision can thus be established by examining the measurement variability.
[Figure 1. Measurement Accuracy and Precision: probability distributions of two measurement procedures, A and B, relative to the actual value a.]
Measurement precision may be improved by reducing measurement variability. Chandra and Schall (1988) proposed the use of repeated measurements to reduce measurement variability as a means of minimizing the economic penalties caused by measurement errors. The average of these repeated measurements is used to determine the conformance of a product to the specifications. Let X be the actual value of the quality characteristic of interest, which is normally distributed with mean \mu and variance \sigma_X^2. Denoting the measured value from a single measurement by Y, further assume that the conditional distribution of Y, given that X = x, is a normal distribution with mean x and variance \sigma_{Y|x}^2. Suppose that n measurements are repeatedly taken and each measurement has the same variability. Letting \bar{Y} be the average of the n measurements, it is apparent that the conditional distribution of \bar{Y}, given that X = x, is a normal distribution with mean x and variance \sigma_{\bar{Y}|x}^2, where \sigma_{\bar{Y}|x}^2 = \sigma_{Y|x}^2 / n. The total inspection cost associated with repeated measurements, denoted by C_I(n), can be expressed as

C_I(n) = a + bn,    (1)

where a and b represent the fixed set-up cost and the operating cost for individual measurements, respectively. Thus, the variability of the measurement procedure may be reduced by taking more repeated measurements at an increased inspection cost.
3 ECONOMIC EVALUATION OF GUARD BANDING
Regardless of how precise the measurement procedures are, measurement errors are inevitable due to the measurement variability, which frequently leads to the misclassification of outgoing products, such as the false acceptance of defects and the false rejection of conforming products. The false rejection of conforming products is often referred to as a type I error (or producer's risk), while the false acceptance of defects is a type II error (or consumer's risk). These measurement errors due to variability may in turn result in economic penalties, which can be reduced by implementing a higher-precision measurement procedure as discussed in the previous section. Thus, the economic impacts of measurement errors need to be incorporated in the economic design of measurement procedures. As a means of reducing the impacts of measurement errors, the use of a guard band has been widely implemented since pioneered by Eagle (1954). In many practical situations, a false acceptance of defects incurs much larger economic penalties than a false rejection of conforming products. Falsely accepted defects shipped to the customer may impart a substantial cost to the manufacturer in the form of warranty cost. On the other hand, falsely rejected items incur a rejection cost which is incomparably lower than the warranty cost. From this perspective, many manufacturers impose a guard band to help minimize the penalty associated with false acceptance at the cost of an increased risk of false rejection. The effects of a guard band are depicted in Figure 2, where L and U represent the lower and upper specification limits, respectively. It should also be noted that the big curve represents the density of the actual value of the quality characteristic (X), while the small curve represents the density of the average measurement given the actual value (\bar{Y} | X). It can be observed that the probability of false acceptance decreases by imposing the guard band (see Figure 2a), while the risk associated with false rejection increases (see Figure 2b). It is current practice to set the guard band based on engineering experience or on a trial-and-error basis. Although this ad hoc approach may be easy to implement, it lacks a systematic evaluation to determine the most economical width of the guard band. Thus, there is a need to incorporate the economic aspects of measurement errors on which the determination of the guard band is based.
[Figure 2. Measurement errors with and without guard band: (a) false acceptance error; (b) false rejection error. Horizontal axes show the quality characteristic relative to the specification limits L and U.]
3.1. Acceptance cost

Let \epsilon_L and \epsilon_U denote the widths of the guard band associated with the lower and upper specification limits, respectively. For simplicity of notation, let v = L + \epsilon_L and w = U - \epsilon_U. Hereafter, v and w are referred to as the lower and upper inspection limits, respectively. Since the conformance of a product is determined on the basis of repeated measurements, a product passes the inspection and is shipped to the customer if \bar{Y} \in [v, w]. Let C_A represent the monetary loss associated with a false acceptance error, i.e., \bar{Y} \in [v, w] and X \notin [L, U]. The expected cost of falsely accepting a defect, denoted by E[AC], is then given by
E[AC] = C_A \left[ \int_{v}^{w} \int_{-\infty}^{L} h(x, y) \, dx \, dy + \int_{v}^{w} \int_{U}^{\infty} h(x, y) \, dx \, dy \right]    (2)

where h(x, y) represents the joint density function of X and \bar{Y}. It can easily be shown that X and \bar{Y} jointly follow a bivariate normal distribution with mean vector (\mu, \mu) and variance-covariance matrix

\Sigma = \begin{pmatrix} \sigma_X^2 & \sigma_X^2 \\ \sigma_X^2 & \sigma_{\bar{Y}}^2 \end{pmatrix}

where \sigma_{\bar{Y}}^2 represents the variance of the marginal distribution of \bar{Y}, and \sigma_{\bar{Y}}^2 = \sigma_X^2 + \sigma_{\bar{Y}|x}^2. Note that the correlation coefficient of X and \bar{Y}, denoted by \gamma, is defined as

\gamma = \frac{\mathrm{cov}(X, \bar{Y})}{\sigma_X \sigma_{\bar{Y}}} = \frac{\sigma_X}{\sigma_{\bar{Y}}}.

After some algebra, the expected acceptance cost E[AC] can be written as

E[AC] = C_A \left\{ BVN(\tilde{w}, \tilde{L}; \gamma) - BVN(\tilde{v}, \tilde{L}; \gamma) + BVN(-\tilde{v}, -\tilde{U}; \gamma) - BVN(-\tilde{w}, -\tilde{U}; \gamma) \right\}    (3)

where \tilde{v} = (v - \mu)/\sigma_{\bar{Y}}, \tilde{w} = (w - \mu)/\sigma_{\bar{Y}}, \tilde{L} = (L - \mu)/\sigma_X, \tilde{U} = (U - \mu)/\sigma_X, and BVN(a, b; \rho) denotes the standard bivariate normal distribution function with correlation coefficient \rho. To evaluate the expected acceptance cost E[AC], we first need to calculate the bivariate normal probabilities. This paper uses the well-known numerical method developed by Drezner and Wesolowsky (1990) for evaluating the bivariate normal integrals. Readers are also referred to Drezner (1976) and Mee and Owen (1983).
3.2. Rejection cost

If the realization of the measurement procedure turns out to be outside the inspection limits, a rejection cost is incurred by the manufacturer. Let C_{RL} and C_{RU} denote the rejection costs associated with a product falling below the lower inspection limit (i.e., \bar{Y} \le v) and above the upper inspection limit (i.e., \bar{Y} \ge w), respectively. The expected rejection cost, denoted by E[RC], can then be written as

E[RC] = C_{RL} \int_{-\infty}^{v} \int_{-\infty}^{\infty} h(x, y) \, dx \, dy + C_{RU} \int_{w}^{\infty} \int_{-\infty}^{\infty} h(x, y) \, dx \, dy    (4)

Noting that the marginal distribution of \bar{Y} is normal with mean \mu and variance \sigma_{\bar{Y}}^2, equation (4) becomes

E[RC] = C_{RL} \Phi\!\left(\frac{v - \mu}{\sigma_{\bar{Y}}}\right) + C_{RU} \left[ 1 - \Phi\!\left(\frac{w - \mu}{\sigma_{\bar{Y}}}\right) \right]    (5)

where \Phi(\cdot) is the standard normal distribution function.
4 THE MODEL
The objective is to determine both the optimum precision level and the widths of the guard band so that the expected total cost is minimized. Since the precision level of the inspection procedure is determined by taking repeated measurements, the decision variables of the model are the number of repeated measurements and the widths of the guard band. The expected total cost can be expressed as the sum of the total inspection cost, the expected acceptance cost, and the expected rejection cost, which are given in equations (1), (3), and (5), respectively. That is,

E[TC] = C_I(n) + E[RC] + E[AC]    (6)

where \sigma_{\bar{Y}} = \sqrt{\sigma_X^2 + \sigma_{Y|x}^2 / n}. Since it is very difficult to obtain a closed-form solution to equation (6), one may use popular mathematical software such as Matlab and MathCAD. This paper uses Matlab to find the optimal solution numerically. Let v* and w* be the optimal lower and upper inspection limits, respectively, and n* the optimal number of repeated measurements. Denoting the optimum widths of the guard band associated with the lower and upper specification limits by \epsilon_L^* and \epsilon_U^*, respectively, we have \epsilon_L^* = v* - L and \epsilon_U^* = U - w*, and the resultant measurement variability is determined by \sigma_{\bar{Y}|x}^2 = \sigma_{Y|x}^2 / n*.
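Since the paper reports using Matlab, the following Python grid-search sketch is only an illustrative stand-in for the numerical optimization of Eq. (6); it reuses the expected_acceptance_cost sketch above, and the search ranges and step sizes are assumptions.

```python
def expected_total_cost(n, v, w, a, b, C_A, C_RL, C_RU,
                        mu, sig_X, sig_Y_given_x, L, U):
    sig_Yb = np.sqrt(sig_X**2 + sig_Y_given_x**2 / n)
    E_RC = (C_RL * norm.cdf((v - mu) / sig_Yb)
            + C_RU * norm.sf((w - mu) / sig_Yb))          # Eq. (5)
    E_AC = expected_acceptance_cost(C_A, mu, sig_X, sig_Y_given_x,
                                    n, v, w, L, U)        # Eq. (3)
    return (a + b * n) + E_RC + E_AC                      # Eq. (6)

def optimize(a, b, C_A, C_RL, C_RU, mu, sig_X, sig_Y_given_x, L, U,
             n_max=20, steps=40):
    best = None
    for n in range(1, n_max + 1):
        for eps_L in np.linspace(0.0, (U - L) / 4, steps):
            for eps_U in np.linspace(0.0, (U - L) / 4, steps):
                v, w = L + eps_L, U - eps_U
                tc = expected_total_cost(n, v, w, a, b, C_A, C_RL, C_RU,
                                         mu, sig_X, sig_Y_given_x, L, U)
                if best is None or tc < best[0]:
                    best = (tc, n, v, w)
    return best   # (E[TC], n*, v*, w*)
```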
5 CONCLUSION
In many industrial settings, manufacturing processes often involve inspection procedures which require measurement of a quality characteristic. Various economic aspects of measurement errors have long been studied. Further, there has been a great deal of research effort to reduce the impacts of measurement errors. This paper proposes an optimization model for the economic design of measurement systems by incorporating the concepts of measurement precision and guard banding. The proposed model incorporates various economic aspects of measurement errors when determining the optimum level of measurement precision and the optimum widths of the guard band to minimize the impacts of measurement errors. A numerical example demonstrates the proposed optimization model, and a sensitivity analysis is performed to examine the effects of process parameters, such as the rejection costs and the process variability, on the optimum solutions and the expected total cost.
REFERENCES
1. Chandra, J. and Schall, S. 1988, "The Use of Repeated Measurements to Reduce the Effect of Measurement Errors", IIE Transactions, Vol. 20, No. 1, pp. 83-87.
2. Chen, S.-L. and Chung, K.-J. 1996, "Selection of the Optimal Precision Level and Target Value for a Production Process: the Lower-Specification-Limit Case", IIE Transactions, Vol. 28, pp. 979-985.
3. Deaver, D. 1995, "Using Guardbands to Justify TURs Less Than 4:1", Proceedings of ASQC 49th Annual Quality Congress, pp. 136-141.
4. Drezner, Z. 1976, "Computation of the Bivariate Normal Integral", Mathematical Computation, Vol. 32, pp. 277-279.
5. Drezner, Z. and Wesolowsky, G.O. 1990, "On the Computation of the Bivariate Normal Integral", Journal of Statistical Computation and Simulation, Vol. 35, pp. 101-107.
6. Eagle, A.R. 1954, "A Method for Handling Errors in Testing and Measuring", Industrial Quality Control, Vol. 10, No. 3, pp. 10-15.
7. Grubbs, F.A. and Coon, H.J. 1954, "On Setting Test Limits Relative to Specification Limits", Industrial Quality Control, Vol. 10, No. 3, pp. 15-20.
8. Lin, C.Y., Hong, C.L., and Lai, Y.J. 1997, "Improvement of a Dimensional Measurement Process Using Taguchi Robust Designs", Quality Engineering, Vol. 9, pp. 561-573.
9. Mader, D.P., Prins, J., and Lampe, R.E. 1999, "The Economic Impact of Measurement Error", Quality Engineering, Vol. 11, No. 4, pp. 563-574.
10. McCarville, D.R. and Montgomery, D.C. 1996, "Optimal Guard Bands for Gauges in Series", Quality Engineering, Vol. 9, No. 2, pp. 167-177.
11. Mee, R.W. and Owen, D.B. 1983, "A Simple Approximation for Bivariate Normal Probabilities", Journal of Quality Technology, Vol. 15, No. 2, pp. 72-75.
12. Montgomery, D.C. and Runger, G.C. 1993a, "Gauge Capability and Design Experiments, Part I: Basic Methods", Quality Engineering, Vol. 6, pp. 115-135.
13. Montgomery, D.C. and Runger, G.C. 1993b, "Gauge Capability and Design Experiments, Part II: Experimental Design Models and Variance Component Estimation", Quality Engineering, Vol. 6, pp. 289-305.
14. Tang, K. and Schneider, H. 1988, "Selection of the Optimal Inspection Precision Level for a Complete Inspection Plan", Journal of Quality Technology, Vol. 20, No. 3, pp. 153-156.
15. Tsai, P. 1988, "Variable Gauge Repeatability and Reproducibility Study Using the Analysis of Variance Method", Quality Engineering, Vol. 1, pp. 107-115.
16. Vardeman, S.B. and VanValkenburg, E.S. 1999, "Two-way Random-effects Analyses and Gauge R&R Studies", Technometrics, Vol. 41, No. 3, pp. 202-211.
THE SYSTEM RELIABILITY OPTIMIZATION PROBLEMS BY USING AN IMPROVED SURROGATE CONSTRAINT METHOD
SAKUO KIMURA Kansai University, 2-1-1 Ryouzenji-Cho, Takatsuki-City, Osaka, Japan E-mail: [email protected]
ROSS J.W. JAMES
University of Canterbury, Private Bag 4800, Christchurch, New Zealand
E-mail: [email protected]
JUNICHI OHNISHI
Kansai University Graduate School, 2-1-1 Ryouzenji-Cho, Takatsuki-City, Osaka, Japan
E-mail: [email protected]
YUJI NAKAGAWA
Kansai University, 2-1-1 Ryouzenji-Cho, Takatsuki-City, Osaka, Japan
E-mail: [email protected]
It is often difficult to obtain optimal or high quality solutions to multidimensional nonlinear integer programming problems when they have many decision variables or a large number of dimensions. Surrogate constraint techniques are known to be very effective in solving multidimensional problems. These methods translate the multidimensional problem into a problem with a single dimension by using a surrogate multiplier. When the optimal solution to the surrogate problem is not the optimal solution to the original problem, it is said that there exists a surrogate duality gap between the translated one-dimensional problem and the original multidimensional problem. Nakagawa has recently proposed an improved surrogate constraint (ISC) method that can close the surrogate duality gap and hence provide optimal solutions to problems which previously could not be solved due to the size of their surrogate duality gap. By applying this ISC method to multidimensional nonlinear knapsack problems we obtain an optimal solution to the coherent-system reliability optimization problem of Fyffe-Hines-Lee that previously could not be solved due to the existence of a surrogate duality gap. We also found that we could efficiently find the optimal solution to the system reliability optimization problem of Prasad-Kuo. Furthermore, it is clear that this method can also be used to solve large-scale problems, as problems with 250 variables can be solved using this method.
1. Introduction

It is often difficult to obtain optimal or high quality solutions to multidimensional nonlinear integer programming problems within a reasonable period of time when either the number of decision variables or the number of dimensions is large. Surrogate constraint techniques have proved to be an effective method of solving this problem by reducing the multidimensional problem to just one dimension by using a surrogate multiplier. However, these techniques fail when there exists a surrogate duality gap, which is caused by the optimal solution to the surrogate dual problem not being feasible and therefore not a strict solution to the original problem. Nakagawa [10] has recently proposed an improved surrogate constraint (ISC) method that can close the surrogate duality gaps and provide a strict solution to the original problem. In this paper we use the ISC method to solve problems that previously could not be treated exactly due to the presence of surrogate duality gaps.
2. The Application to Reliability Design Problems

In this section we show how the ISC method can be applied to two different system reliability optimization problems (one with two and the other with four constraints) which previously have not been solved exactly due to the existence of surrogate duality gaps. System reliability can be improved either by using more reliable components or by providing redundant components. If at each stage of the system several components of different reliabilities and costs are available, or redundancy is allowed, the system optimization problem can be formulated as a nonlinear integer programming problem [5][11]. System reliability can be written as a function f(x_1, x_2, ..., x_n) of decision variables x_1, x_2, ..., x_n (nonnegative integers). Therefore, the problem of maximizing the reliability of a series system over component choices and redundancy options is represented as:
[P0]: Maximize R = f(x_1, x_2, ..., x_n)

subject to \sum_{i=1}^{n} g_{ji}(x_i) \le b_j for j = 1, 2, ..., m

l_i \le x_i \le u_i for i = 1, 2, ..., n

where f(x_1, x_2, ..., x_n) is the system reliability when stage i has x_i redundant components, g_{ji}(x_i) is the amount of resource j required to allocate x_i components at stage i, n is the number of stages in the reliability system, m is the number of resources under consideration, and l_i and u_i are the lower and upper bounds of x_i, respectively. The constraint functions in [P0] are separable. In the redundant allocation problems of a series system, the objective function can also be written in a separable form using logarithms as z = \sum_{i=1}^{n} f_i(x_i).
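To illustrate the surrogate idea on a separable problem of this form: a surrogate multiplier u = (u_1, ..., u_m) collapses the m constraints into the single constraint \sum_i \sum_j u_j g_{ji}(x_i) \le \sum_j u_j b_j, and the resulting one-dimensional nonlinear knapsack can be solved by dynamic programming over the stages. The Python sketch below shows only this single-constraint step, not the full ISC method; the tabular data layout is an assumption for illustration, and it presumes a feasible instance of modest size.

```python
def solve_surrogate(f, g, u, b):
    """Maximize sum_i f[i][x] subject to the surrogate constraint
    sum_i sum_j u[j]*g[j][i][x] <= sum_j u[j]*b[j], by DP over stages.

    f[i][x]    : separable objective value of level x at stage i
    g[j][i][x] : consumption of resource j at stage i and level x
    u, b       : surrogate multipliers and resource limits
    """
    cap = sum(uj * bj for uj, bj in zip(u, b))
    # surrogate weight of each (stage, level) pair
    w = [[sum(u[j] * g[j][i][x] for j in range(len(u)))
          for x in range(len(f[i]))] for i in range(len(f))]
    states = {0.0: (0.0, [])}            # used weight -> (best value, choices)
    for i in range(len(f)):
        nxt = {}
        for used, (val, ch) in states.items():
            for x in range(len(f[i])):
                used2 = used + w[i][x]
                if used2 <= cap:
                    cand = (val + f[i][x], ch + [x])
                    if used2 not in nxt or cand[0] > nxt[used2][0]:
                        nxt[used2] = cand
        states = nxt
    return max(states.values())          # (best value, chosen levels)
```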
In this paper, the value of the Percent Gap Closure (PGC) is used to measure the difficulty of the original problem [P0] [4]:

PGC = \frac{f^{Real} - v^{OPT}[P^{SD}]}{f^{Real} - v^{OPT}[P^0]} \times 100

where f^{Real} is the optimal value of the linear programming relaxation of the linear integer programming problem corresponding to the original problem [P0], and v^{OPT}[Q] means the optimal objective function value of problem [Q]. A small PGC value means that the surrogate duality gap is wide. A PGC value of 100% means that the optimal function value of [P^{SD}] is equivalent to the optimal value of [P0].
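A direct transcription of the PGC measure, convenient when tabulating problem difficulty (the function and argument names are ours):

```python
def percent_gap_closure(f_real, v_sd, v_opt):
    # 100% means no surrogate duality gap; small values mean a wide gap
    return 100.0 * (f_real - v_sd) / (f_real - v_opt)
```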
2.1. System Reliability Optimization Problem with Two Constraints

We consider the problem of determining the number of redundant units at each stage in order to maximize reliability while keeping the total system cost and weight within allowable limits [3][8]:

[P1]: Maximize R = \prod_{i=1}^{14} \left[ 1 - (1 - r_i(k_i))^{d_i} \right]    (3)

subject to \sum_{i=1}^{14} c_i(k_i) d_i \le C, \quad \sum_{i=1}^{14} w_i(k_i) d_i \le W,

1 \le k_i \le 4, \; 1 \le d_i; \; k_i, d_i integer
where the variable k_i is the design alternative chosen at stage i and the variable d_i is the number of units of that alternative. The coefficients r_i(k_i), c_i(k_i), w_i(k_i) are the reliability, cost, and weight of alternative k_i, respectively. C and W are the allowable system cost and weight, respectively. The total allowable cost C in this problem is 130. The values of r_i(k_i), c_i(k_i), w_i(k_i) are shown in Table 1 [3]. The symbol * in the table means that a design alternative is not available. When we assume 1 \le k_i \le 4 and 1 \le d_i \le 5, the combinations of the variables k_i and d_i (i = 1, ..., 14) can be transformed into a new variable x_i whose value indicates the pair (k_i, d_i), hence 1 \le x_i \le 20 [3]. The objective function can be transformed into a separable multidimensional nonlinear knapsack problem by taking logarithms. In Nakagawa and Miyazaki [8], 33 test problems, which differ only in the values of W = 191, 190, ..., 159, are solved by the surrogate constraint (N&M) method proposed by Nakagawa and Miyazaki [8]. Three problems, W = 190, 189, 187, could not be solved in the sense of strict optimality even with the N&M method due to surrogate duality gaps present in the problem. For these three problems, the ISC method is able to cope with the surrogate duality gaps and hence exact solutions can now be obtained.
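Evaluating a candidate design for [P1] is straightforward; the following minimal sketch (names are ours) computes the system reliability and checks the two resource constraints for given choices (k_i, d_i).

```python
import math

def evaluate_P1(k, d, r, c, w, C, W):
    """k[i], d[i]: chosen alternative and number of units at stage i.
    r[i][k], c[i][k], w[i][k]: reliability, cost, weight of alternative k.
    Returns (system reliability, feasible?) for problem [P1]."""
    R = math.prod(1.0 - (1.0 - r[i][k[i]]) ** d[i] for i in range(len(k)))
    cost = sum(c[i][k[i]] * d[i] for i in range(len(k)))
    weight = sum(w[i][k[i]] * d[i] for i in range(len(k)))
    return R, (cost <= C and weight <= W)
```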
Table 1. Data of r_i(k_i), c_i(k_i), and w_i(k_i) (Fyffe, Hines and Lee [3]). The symbol * means that the design alternative is not available. [The table values are not recoverable from the source scan.]

Table 2. Strict solutions of the three problems W = 190, 189, 187: the numbers of units d_i and design alternatives k_i, the total system cost, total weight, and total reliability, with the solutions obtained by the N&M algorithm in parentheses. [Most entries are not recoverable from the source scan; legible fragments include a total weight of 187 (186) and a reliability of 0.983568 (0.9840*), where * marks the upper-bound value obtained by the N&M algorithm.]
Table 2 shows the strict solutions (the numbers of units d_i and the design alternatives k_i obtained) and the total system cost, total weight, and total reliability. Further, the values of PGC are shown to indicate the difficulty of the original problem [P0]. The numbers in parentheses are the solutions obtained by the N&M method [8]. The strict solutions for these three problems can thus be obtained by using the ISC method. All computational times are less than 1 second.
2.2. System Reliability Optimization Problem with Four Constraints

The test problem of reliability optimization which Prasad and Kuo [11] considered for parallel redundancy allocation is as follows:
[P2]: Maximize R = \prod_{i=1}^{n} \left[ 1 - (1 - r_i)^{x_i} \right]    (4)

subject to four resource constraints \sum_{i=1}^{n} g_{ji}(x_i) \le b_j (j = 1, ..., 4), defined through the coefficients \alpha_i, \beta_i, \gamma_i, \delta_i, and

1 \le x_i \le 10 for i = 1, 2, ..., n

where x_i is the number of redundant components at stage i (nonnegative integer) and t_i is the minimum x_i value allowed. The coefficients \alpha_i, \beta_i, \gamma_i, \delta_i are generated from uniform distributions in the ranges [6, 10], [1, 5], [11, 20], [21, 40], respectively. Table 3 shows the value of the unreliability (1 - r_i) of each unit at stage i, and the values of \alpha_i, \beta_i, \gamma_i, \delta_i [11]. The component reliabilities are generated from the uniform distribution in the range [0.95, 1.0] [11]. \theta is the percentage for calculating the minimum requirement of each resource based on t_i; \theta = 33 implies that 33% of the minimum requirement of each resource (based on t_i) is available for optimization [11]. In this paper, we use \theta = 33, t_i = 1, as in the reference [11], and also consider the case of \theta = 3, t_i = 4. The minimum requirement of resource j is the total over the lower bounds, \sum_{i=1}^{n} g_{j,i}(t_i), and the resource limits are

b_j = \left(1 + \frac{\theta}{100}\right) \sum_{i=1}^{n} g_{j,i}(t_i) for n = 50    (5)

as shown in Table 4.
are in Table 4. By taking logarithms of the objective functions of problem [P'], it is transferred to the separable multidimensional nonlinear knapsack problem. In the case of 6' = 33 and ti = 1, we solve the problem [P2]by using ISC method. The optimum number of redundant components at each stage, the optimum reliability (0.4053895) as the value of objective function and the values of left side of each constraint, are shown in the upper column of Table 5. These values are same as the results of Prasad and Kuo [ l l ] . As the value of PGC is 100 %, there is no surrogate duality gap.
Table 3. Numerical data for [P2] with n = 50 (Prasad and Kuo [11]). [The per-stage values of 1 - r_i, \alpha_i, \beta_i, \gamma_i, \delta_i are not recoverable from the source scan.]

Table 4. Average minimum resource requirements and corresponding b_j for n = 50.

Case \theta = 33, t_i = 1:
j                        1        2        3        4
\sum_i g_{j,i}(t_i)      408      265.44   782      1540
b_j                      542.64   353.04   1040.06  2048.2

Case \theta = 3, t_i = 4:
j                        1        2        3        4
\sum_i g_{j,i}(t_i)      6528     1189.64  3128     3080
b_j                      6723.84  1225.33  3221.84  3172.4

Table 5. The exact optimal solutions and the objective value. [The optimal x_i values are not recoverable from the source scan; for the case \theta = 3, t_i = 4 the legible entries give a maximum reliability of 0.999985, constraint left-hand-side values 6712, 3141, 3078.831 and 1217.4595, and a PGC of 10.8%.]
We now consider the reliability optimization problem in the case of \theta = 3 and t_i = 4; all other coefficients are identical. As this problem has a wide surrogate duality gap, the existing methods fail to obtain the optimal solution. The optimal solution, the value of the objective function (maximum reliability), and the values of the left side of each constraint are shown in the lower part of Table 5. The PGC value of this problem equals 10.8%, which means that the problem has a wide surrogate duality gap. All of the computational times are less than 1 second. From these examples we can see that the ISC method can acquire exact solutions to some problems whose strict solutions cannot be obtained by the existing methods.
2.3. Large-Scale System Reliability Optimization Problem

In order to confirm the validity of the ISC method, we solve a large-scale reliability optimization problem with a large number of variables. The coefficients of the problem [P3] considered are generated on the basis of Table 3, which was used in Section 2.2. The problem has 250 variables and its coefficients are shown in Table 6. Most of the values in this table are omitted on account of limited space. We solved the problem [P3] using the ISC method. The optimum number of redundant components at each stage, the maximum reliability (0.912227) and the values on the left side of each constraint are shown in Table 7. The PGC value of this problem equals 56.6%, which means that the problem has a wide surrogate duality gap. The computational time involved was 338 seconds.
3. Conclusions

In this paper two kinds of system reliability optimization problems were solved which previously could not be solved exactly by the existing solution methods due to the presence of surrogate duality gaps. These problems can now be solved exactly using the ISC method, proving that the ISC method is able to solve some reliability problems that have surrogate duality gaps. Furthermore, we solved large-scale reliability optimization problems with a large number of variables using the ISC method, which proved to be effective. All test problems were solved on a DOS/V computer (Pentium 4, 1.7 GHz) running Windows XP.

References
1. R.E. Barlow and F. Proschan, Mathematical Theory of Reliability, John Wiley & Sons, New York, 1965.
2. M.E. Dyer, "Calculating surrogate constraints," Math. Programming, vol. 19, pp. 255-278, 1980.
3. D.E. Fyffe, W.W. Hines, and N.K. Lee, "System reliability allocation and a computational algorithm," IEEE Trans. Reliab., R-17, pp. 64-69, 1968.
4. M.H. Karwan, R.L. Rardin, and S. Sarin, "A new surrogate dual multiplier search procedure," Naval Res. Logist., 34, pp. 431-450, 1987.
5. W. Kuo and V.R. Prasad, "An annotated overview of system-reliability optimization," IEEE Trans. Reliab., vol. 49, no. 2, pp. 176-187, 2000.
6. Y. Nakagawa and K. Nakashima, "A heuristic method for determining optimal reliability allocation," IEEE Trans. Reliab., R-26, pp. 156-161, 1977.
7. Y. Nakagawa and Y. Hattori, "Reliability optimization with multiple properties and integer variables," IEEE Trans. Reliab., R-28, pp. 73-78, 1979.
8. Y. Nakagawa and S. Miyazaki, "Surrogate constraints algorithm for reliability optimization problems with two constraints," IEEE Trans. Reliab., R-30, pp. 175-180, 1981.
9. Y. Nakagawa, "A new method for discrete optimization problems," Electronics and Communications in Japan, Part 3, vol. 73, pp. 99-106, 1990.
10. Y. Nakagawa, "An improved surrogate constraints method for separable nonlinear integer programming," Journal of the Operations Research Society of Japan, vol. 46, no. 2, pp. 145-163, 2003.
Table 6. Numerical data for [P3] with n = 250 (generated from Prasad and Kuo [11]). [The per-stage values of 1 - r_i, \alpha_i, \beta_i, \gamma_i, \delta_i are not recoverable from the source scan.]

Table 7. The optimal solutions and the objective value of the problem [P3] with n = 250: maximum reliability R = 0.912227, constraint left-hand-side values 10135, 2456.254773, 8538 and 11341.09116, and PGC = 56.6%. [The optimal x_i values are not recoverable from the source scan.]
11. V.R. Prasad and W. Kuo, "Reliability optimization of coherent systems," IEEE Trans. Reliab., vol. 49, no. 3, pp. 323-330, Sept. 2000.
EFFICIENT COMPUTATION OF MARGINAL RELIABILITY IMPORTANCE IN A NETWORK SYSTEM WITH K-TERMINAL RELIABILITY

T. KOIDE
Faculty of Service Industries, University of Marketing & Distribution Sciences, Gakuen-nishi-machi 3-1, Nishi, Kobe 651-2188, JAPAN
E-mail: [email protected]
S. SHINMORI
Faculty of Science, Kagoshima University, Korimoto 1-21-30, Kagoshima 890-0065, JAPAN
E-mail: [email protected]
H. ISHII
Graduate School of Information Science and Technology, Osaka University, Yamadaoka 1-1, Suita, Osaka 565-0871, JAPAN
E-mail: [email protected]

Marginal reliability importance is an appropriate quantitative measure of a system component's contribution to system reliability, and it contributes to the design of reliable systems. Computing marginal reliability importance in network systems is time-consuming due to its NP-hardness. This paper proposes a procedure to compute the importance efficiently, where k-terminal reliability is employed as the network reliability. The procedure is based on relational formulas for marginal reliability importance between original networks and transformed networks. The formulas improve the computational complexity of the procedure compared to a traditional method.
1. Introduction

Reliability importance is an appropriate quantitative measure of a system component's contribution to system reliability. In particular, the reliability importance of a system component is referred to as marginal reliability importance (MRI). MRI measures the rate at which the system reliability changes with respect to a change in the reliability of the component. This paper focuses on MRI in network systems. Reliability importance contributes to designing reliable network systems efficiently. Network design problems considering network reliability are hard, since it is NP-hard to compute network reliabilities such as all-terminal reliability, two-terminal
reliability and k-terminal reliability. Therefore, many approximate algorithms have been proposed for the network design problems. Reliability importance helps these algorithms to evaluate the variation of system reliability efficiently. For instance, Lin and Kuo have employed reliability importance in their heuristics for an optimal reliability allocation problem [11]. In our previous study, we proposed an algorithm to compute MRI with respect to all-terminal reliability in a network system in order to rank network components (edges) according to their importance [10]. Improving on a traditional method, we developed an algorithm in which some network transformations, namely network reductions and a network decomposition, were applied to reduce the computational task. In this paper, we propose a procedure to compute MRI with respect to k-terminal reliability, which is a dominant concept over all-terminal reliability. The procedure is built on several theorems that show the relationship between MRI in an original network and that in transformed networks. Section 2 introduces the basic concepts for this study and our previous results. Section 3 proves some theorems and proposes a procedure for efficient MRI computation. All proofs are omitted due to space limitations.

2. Preliminaries

2.1. Computation of Network Reliability

Consider a probabilistic network G = (V, E) where nodes are always working but edges fail. Edge reliabilities are assumed to be mutually s-independent and the reliability of edge e_i \in E is denoted by p_i. Let R_K(G) be the k-terminal reliability of network G, where K \subseteq V and k = |K|, which means the probability that all k nodes in K are connected by working edges. K is called the set of target nodes. It is proved that computing network reliability is NP-hard [2]. In this paper, G is assumed to be a connected and undirected network without irrelevant edges. Moreover, e_i \in G denotes e_i \in E for G = (V, E). In order to compute the exact value of network reliability, the following factoring theorem is often employed [12]:

R_K(G) = p_i R_{K'}(G * e_i) + q_i R_K(G - e_i)    (1)
for an edge e_i \in G, where q_i = 1 - p_i. G * e_i and G - e_i represent the networks G with e_i contracted and deleted, respectively. If neither of the two end-vertices u and v of e_i is in K, then K' = K; otherwise K' = K - \{u, v\} \cup \{w\}, where w is the vertex constructed by the contraction of e_i. The factoring theorem is applied recursively until the resulting networks are simple enough to compute their reliability easily. Some reliability-preserving reductions and decompositions are integrated with the factoring theorem to reduce the computational task. In this paper, reductions and decompositions are collectively called transformations. Basic transformations
[Figure 1. Basic network transformations: reductions and a decomposition. Legend (node shadings): target, not target, target or not target.]
for k-terminal reliability, three reductions and a decomposition, are shown in Figure 1 [2]. The transformations shown in Figure 1 are executed in polynomial time. For a reduction in general form it holds that

R_K(G) = \Omega R_{\tilde{K}}(\tilde{G})    (2)

where \Omega is the multiplicative factor of the reduction, which is obtained from the reliabilities of the reduced edges, and \tilde{G} and \tilde{K} denote the reduced network and its target nodes.

2.2. Reliability Importance in Network Systems
The MRI of edge e_i in network G with respect to k-terminal reliability, denoted by I_K(e_i, G), is defined as \partial R_K(G) / \partial p_i [4]. From equation (1), we obtain

I_K(e_i, G) = R_{K'}(G * e_i) - R_K(G - e_i).    (3)

Note that I_K(e_i, G) is independent of p_i. It is hard to compute MRI since computing network reliability is NP-hard. MRI is a useful measurement in reliable network design problems. If G is modified to G' by changing an edge reliability p_i to p_i + \delta_i, the next equation holds [10]:

R_K(G') = R_K(G) + \delta_i I_K(e_i, G).    (4)

Therefore, the reliability of a network with a modified edge reliability can be computed from that of the original network and the MRI of the reliability-modified edge. This fact helps to develop efficient algorithms for reliable network design problems.
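For small networks, Eq. (3) can be checked by brute force: since R_K(G) is multilinear in the edge reliabilities, R_{K'}(G * e_i) and R_K(G - e_i) equal R_K(G) evaluated with p_i set to 1 and 0, respectively. The following Python sketch (function names are ours) enumerates edge states to compute R_K(G) exactly and derives the MRI.

```python
def k_terminal_reliability(nodes, edges, p, K):
    """Exact R_K(G) by enumerating the 2^m edge states (small networks only)."""
    m = len(edges)
    total = 0.0
    for mask in range(1 << m):
        prob = 1.0
        for i in range(m):
            prob *= p[i] if (mask >> i) & 1 else 1.0 - p[i]
        parent = {v: v for v in nodes}           # union-find over working edges
        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v
        for i in range(m):
            if (mask >> i) & 1:
                u, v = edges[i]
                parent[find(u)] = find(v)
        root = find(next(iter(K)))
        if all(find(t) == root for t in K):      # all target nodes connected
            total += prob
    return total

def mri(nodes, edges, p, K, i):
    """I_K(e_i, G) = R_{K'}(G * e_i) - R_K(G - e_i), via p_i = 1 and p_i = 0."""
    hi, lo = list(p), list(p)
    hi[i], lo[i] = 1.0, 0.0
    return (k_terminal_reliability(nodes, edges, hi, K)
            - k_terminal_reliability(nodes, edges, lo, K))
```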
2.3. The Results in Our Past Study

This paper deals with the problem, named CCMRI (Complete Computation of MRI), of computing MRI for all edges in E, namely I_K(G) = \{ I_K(e_i, G) \mid e_i \in E \}. Our aim in this study is to propose an efficient computation method for CCMRI. An algorithm for CCMRI considering all-terminal reliability has been proposed [10] by improving a traditional method [1]. The traditional method derives I_K(e_i, G) by Eq. (3) after computing R_{K'}(G * e_i) and R_K(G - e_i) for each edge e_i in G. Hence, the computational complexity of the method is O(|E| r(G)), where r(G) is the computational complexity of computing R_K(G). Our proposed algorithm consists of three phases: (i) apply reliability-preserving reductions and a decomposition to a target network recursively as often as possible; (ii) solve CCMRI for the resulting network; (iii) solve CCMRI backward, against the order of application of the transformations, by using some lemmas that compute MRI in the original networks from that in the transformed networks. Numerical experiments revealed that the proposed algorithm reduced CPU time drastically compared to the traditional method, although the theoretical computational complexity was not improved.
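The traditional method amounts to one MRI evaluation (two reliability computations) per edge; with the brute-force routines from the previous sketch it can be written in two lines (again our illustration, not the authors' implementation):

```python
def ccmri_traditional(nodes, edges, p, K):
    # O(|E| r(G)): one MRI evaluation per edge, each of cost r(G)
    return [mri(nodes, edges, p, K, i) for i in range(len(edges))]
```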
3. Main Results

This section proposes a method for CCMRI with k-terminal reliability, developed by extending our past results.
3.1. MRI Computation Using Network Reductions

Here we show a method to compute MRI by using the network reductions shown in Figure 1. We focus on the three network reductions: parallel reduction, series reduction and degree-2 reduction. The next lemma computes MRI using the MRI and reliability of the reduced network.

Lemma 3.1. Let \tilde{G} be the network reduced from G by a reduction which replaces edges e_x, e_y \in G with e_z. For e_i \in G, it holds that

I_K(e_i, G) = \begin{cases} \dfrac{\partial \Omega}{\partial p_i} R_{\tilde{K}}(\tilde{G}) + \Omega \dfrac{\partial p_z}{\partial p_i} I_{\tilde{K}}(e_z, \tilde{G}) & \text{if } i = x, y \\ \Omega I_{\tilde{K}}(e_i, \tilde{G}) & \text{otherwise} \end{cases}    (5)

where \tilde{K} is the set of target nodes in \tilde{G}. We assume that network reductions are applied to a target network G recursively T times (T > 0). Let G_t be the network obtained by the t-th reduction from G_{t-1}, where G_0 = G. Then Eq. (2) shows that R_{\tilde{K}_{t-1}}(G_{t-1}) = \Omega_t R_{\tilde{K}_t}(G_t), where \Omega_t is the multiplicative factor of the t-th reduction and \tilde{K}_t is the set of target nodes in G_t. The following lemma concerns the MRI of edges that remain in the reduced networks.
Lemma 3.2. If an edge e_i \in G is not reduced by the T recursive reductions, namely e_i remains in G_1, G_2, ..., G_T, it holds that

I_K(e_i, G) = \hat{\Omega}_T I_{\tilde{K}_T}(e_i, G_T).    (6)

We define \tau_i and \hat{\Omega}_t as the smallest t such that edge e_i belongs to G_t and \prod_{h=1}^{t} \Omega_h, respectively. The above two lemmas prove the following theorem.
Theorem 3.1. If the T-th reduction reduces G_{T-1} to G_T by replacing e_x, e_y \in G_{T-1} with e_z, it holds that

I_K(e_i, G) = \begin{cases} \hat{\Omega}_{T-1} \dfrac{\partial \Omega_T}{\partial p_i} R_{\tilde{K}_T}(G_T) + \hat{\Omega}_T \dfrac{\partial p_z}{\partial p_i} I_{\tilde{K}_T}(e_z, G_T) & \text{if } i = x, y \\ \hat{\Omega}_T I_{\tilde{K}_T}(e_i, G_T) & \text{otherwise.} \end{cases}    (7)

Theorem 3.1 derives the following corollaries with respect to the three reductions shown in Figure 1.

Corollary 3.1. In case the T-th reduction is a parallel reduction, it holds that

I_K(e_i, G) = \begin{cases} \hat{\Omega}_{T-1} q_y I_{\tilde{K}_T}(e_z, G_T) & \text{if } i = x \\ \hat{\Omega}_{T-1} q_x I_{\tilde{K}_T}(e_z, G_T) & \text{if } i = y \\ \hat{\Omega}_T I_{\tilde{K}_T}(e_i, G_T) & \text{otherwise.} \end{cases}    (8)
Corollary 3.2. In case the T-th reduction is a series reduction, it holds that

I_K(e_i, G) = \begin{cases} \hat{\Omega}_{T-1} p_y I_{\tilde{K}_T}(e_z, G_T) & \text{if } i = x \\ \hat{\Omega}_{T-1} p_x I_{\tilde{K}_T}(e_z, G_T) & \text{if } i = y \\ \hat{\Omega}_T I_{\tilde{K}_T}(e_i, G_T) & \text{otherwise.} \end{cases}    (9)
3.2. Non-Separable Decomposition

The non-separable decomposition decomposes a network into its non-separable components. A non-separable component of a network is a maximal subnetwork without cut-nodes; a cut-node is a node whose removal makes the network disconnected. The non-separable decomposition is executed in O(|V| + |E|) [15]. If the target network G is decomposed into L non-separable components G_1, G_2, ..., G_L, then

R_K(G) = \prod_{l=1}^{L} R_{\tilde{K}_l}(G_l)    (10)

where \tilde{K}_l is the set of target nodes together with the cut-nodes in G_l. The following theorem has been proved with respect to MRI under the non-separable decomposition.
Theorem 3.2. Let network G have L non-separable components G_1, G_2, ..., G_L. For e_i \in G_{l^*}, it holds that

I_K(e_i, G) = I_{\tilde{K}_{l^*}}(e_i, G_{l^*}) \prod_{l \ne l^*} R_{\tilde{K}_l}(G_l)    (11)

where \tilde{K}_l is the set of target nodes together with the cut-nodes in G_l. Theorem 3.2 enables us to compute MRI in an original network using MRI in its non-separable components.
3.3. Extended Factoring Theorem for MRI Computations

The factoring theorem effectively reduces the task of computing network reliability. The next theorem shows an extended factoring theorem for MRI computations.

Theorem 3.3. Let e_i, e_j be distinct edges in network G. Then, it holds that

I_K(e_j, G) = p_i I_{K'}(e_j, G * e_i) + q_i I_K(e_j, G - e_i).    (12)

Theorem 3.3 factors G into G * e_i and G - e_i in the computation of MRI. Even if we cannot apply any network transformations to G, the theorem decomposes G into two smaller networks, which creates new possibilities for applying network transformations.
3.4. Procedure for CCMRI with k-Terminal Reliability

Figure 2 shows a procedure to solve CCMRI with k-terminal reliability using the proved theorems. In the first phase, steps 1 through 3, the target network G is decomposed into its L non-separable components. Next, steps 4 through 7 recursively
procedure CCMRI(G, K)
input: network G = (V, E) and target nodes K \subseteq V
output: I_K(G) = \{I_G(1), I_G(2), ..., I_G(m)\} and R(G)
begin
1.  for i = 0, 1, ..., m do \pi_i := 1 end for;
2.  decompose G into L non-separable components G_1, G_2, ..., G_L;
3.  for l = 1, 2, ..., L do
4.    G_0 := G_l, \tilde{K}_0 := \tilde{K}_l, t := 0, \hat{\Omega}_0 := 1;
5.    while a network reduction can be applied to G_t do
6.      G_{t+1} := G_t reduced by a network reduction whose multiplicative factor is \Omega;
7.      \hat{\Omega}_{t+1} := \Omega \hat{\Omega}_t, t := t + 1 end while;
8.    if I_{\tilde{K}_t}(\cdot, G_t) can be computed obviously then compute I_{\tilde{K}_t}(\cdot, G_t);
9.    else
10.     call CCMRI(G_t * e_i, K') and CCMRI(G_t - e_i, K) for an edge e_i \in G_t;
11.     for all e in G_t do compute I_{\tilde{K}_t}(e, G_t) using Theorem 3.3 end for;
12.     compute R(G_t) using Eq. (1) end if;
13.   while t > 0 do
14.     for all e_i reduced by the t-th reduction do
15.       compute I_{\tilde{K}_{t-1}}(e_i, G_{t-1}) using Theorem 3.1 end for;
16.     compute R(G_{t-1}) by Eq. (2), t := t - 1 end while;
17.   \pi_0 := \pi_0 R(G_0);
18.   for all e_i in G_0 do \pi_i := I_{\tilde{K}_0}(e_i, G_0) / R(G_0) end for;
19. end for;
20. return I_K(G) = \{\pi_0 \pi_1, \pi_0 \pi_2, ..., \pi_0 \pi_m\} and R(G) = \pi_0;
end procedure.

Figure 2. Procedure for CCMRI with k-terminal reliability
applies network reductions to each non-separable component and computes Ω̂_t. If G_t is simple enough, such as a single edge, I_{K_t}(·, G_t) and R_{K_t}(G_t) are computed in step 8. Otherwise, G_t is factored into two smaller networks to compute them. Then, R_{K_l}(G_l) and π_i = I_{K_l}(e_i, G_l)/R_{K_l}(G_l) for e_i ∈ G_l are computed from step 13 to step 16. Step 17 computes R_K(G). Finally, for e_i ∈ G_{l'}, I_K(e_i, G) is computed from Eqs. (11) and (12) as follows:
I_K(e_i, G) = R_K(G) I_{K_{l'}}(e_i, G_{l'}) / R_{K_{l'}}(G_{l'}) = π_0 π_i.
The following theorems have been proved with respect to the computational complexity of the proposed procedure.
Theorem 3.4. The time complexity of procedure CCMRI(G, K) is O(r_T(G)), where r_T(G) is the time complexity to compute R_K(G).

Theorem 3.5. The space complexity of procedure CCMRI(G, K) is O(m r_S(G)), where r_S(G) is the space complexity to compute R_K(G).

In particular, the proposed procedure executes CCMRI in polynomial time if the network reliability of a target network is computable in polynomial time by using the network transformations shown in Fig. 1. Practical computing time is expected to improve further, as in the results of our previous research [10].
References
1. K. K. Aggarwal, Y. C. Chopra, and J. S. Bajwa, Microelectronics and Reliability, 22, 347 (1982).
2. M. O. Ball, C. J. Colbourn, and J. S. Provan, Handbook of Operations Research and Management Science: Network Models, Elsevier, Amsterdam, 673 (1995).
3. R. E. Barlow and F. Proschan, Statistical Theory of Reliability and Life Testing, Holt, Rinehart, Winston (1975).
4. Z. W. Birnbaum, Multivariate Analysis II, Academic Press, 581 (1969).
5. P. J. Boland and E. El-Neweihi, Computers and Operations Research, 22, 455 (1995).
6. C. J. Colbourn, Combinatorics of Network Reliability, Oxford University Press (1987).
7. B. Dengiz, F. Altiparmak, and A. E. Smith, IEEE Trans. Rel., 42, 17 (1993).
8. S. J. Hsu and M. C. Yuang, IEEE Trans. Rel., 50, 98 (2001).
9. S. Kiu and D. F. McAllister, IEEE Trans. Rel., 37, 433 (1988).
10. T. Koide, S. Shinmori, and H. Ishii, IEICE Trans. Fundamentals, E87-A, 454 (2004).
11. F. H. Lin and W. Kuo, Journal of Heuristics, 8, 155 (2002).
12. F. Moskowitz, AIEE Trans. Commun. Electron., 39, 627 (1958).
13. D. R. Shier, Network Reliability and Algebraic Structures, Oxford University Press (1991).
14. I. M. Soi and K. K. Aggarwal, IEEE Trans. Rel., 30, 438 (1981).
15. R. E. Tarjan, SIAM J. Comput., 1, 146 (1972).
16. A. N. Venetsanopoulos and I. Singh, Problems of Control and Information Theory, 15, 63 (1986).
RELIABILITY AND RISK EVALUATION OF LARGE SYSTEMS
KRZYSZTOF KOLOWROCKI
Department of Mathematics, Maritime University, Morska 81-87, Gdynia 81-225, Poland
The paper reviews the state of the art on the application of limit reliability functions to the reliability evaluation of large systems. The results developed by the author and his research group are especially highlighted. Two-state and multi-state large systems composed of independent components are considered. The main emphasis is laid on multi-state systems with degrading components, due to the importance of such an approach in safety analysis, assessment and prediction, and in the operation processes effectiveness analysis of real technical systems.
1. Introduction
A lot of technical systems belong to the class of complex systems. This is due to the large number of components they are built of and to their complicated operating processes. This complexity very often causes the system reliability and safety evaluation to become difficult. As a rule these are series systems composed of a large number of components. Sometimes the series systems have either components or subsystems reserved, and then they become parallel-series or series-parallel reliability structures. We meet large series systems, for instance, in piping transportation of water, gas, oil and various chemical substances. Large systems of those kinds are also used in electrical energy distribution. A city bus transportation system composed of a number of communication lines, each serviced by one bus, may be a model series system, if we treat it as not failed when all its lines are able to transport passengers. If the communication lines have several buses at their disposal, we may consider it either as a parallel-series system or an "m out of n" system. The simplest example of a parallel system or an "m out of n" system may be an electrical cable composed of a number of wires, which are its basic components, whereas the transmitting electrical network may be either a parallel-series system or an "m out of n"-series system. Large systems of those types are also used in telecommunication, in rope transportation and in transport using belt conveyers and elevators. Rope transportation systems like port elevators and ship-rope elevators used in shipyards during ship docking and undocking are model examples of series-parallel and parallel-series systems. Taking into account the importance of the systems' safety and operating process effectiveness, it seems reasonable to expand the two-state approach to the multi-state approach in their reliability analysis. The
assumption that the systems are composed of multi-state components with reliability states degrading in time without repair makes possible a more precise analysis of their reliability, safety and operational processes effectiveness. This assumption allows us to distinguish a system reliability critical state, the exceeding of which is either dangerous for the environment or does not assure the necessary level of the system's operational process effectiveness. Then, an important system reliability characteristic is the time to the moment of exceeding the system's reliability critical state, and its distribution is called the system risk function. This distribution is strictly related to the system multi-state reliability function, which is a basic characteristic of the multi-state system. In the case of large systems, the determination of the exact reliability functions of the systems and the system risk functions leads to very complicated formulae that are often useless for reliability practitioners. One of the important techniques in this situation is the asymptotic approach to system reliability evaluation. In this approach, instead of the preliminary complex formula for the system reliability function, after assuming that the number of system components tends to infinity and finding the limit reliability of the system, we obtain its simplified form.

2. The State of the Art
The mathematical methods used in the asymptotic approach to the system reliability analysis of large systems are based on limit theorems on order statistics distributions considered in a very wide literature. These theorems have generated the investigation concerned with limit reliability functions of the systems composed of two-state components. The main and fundamental results on this subject, which determine the three-element classes of limit reliability functions for homogeneous series systems and for homogeneous parallel systems, have been established by Gniedenko. These results are also presented, sometimes with different proofs, for instance in [1]-[11]. The generalisations of those results for homogeneous "m out of n" systems have been formulated and proved by Smirnow, who has fixed the seven-element class of possible limit reliability functions for these systems. Some partial results obtained by Smirnow, additionally with the solution of the speed of convergence problem, may be found in other publications. The same classes of limit reliability functions as for homogeneous series and parallel systems have been fixed for homogeneous series-parallel and parallel-series systems by Chernoff and Teicher. Their results were concerned with so-called "quadratic" systems only. They have fixed limit reliability functions for the homogeneous series-parallel systems with the number of series subsystems equal to the numbers of components in these subsystems, and for the homogeneous parallel-series systems with the number of parallel subsystems equal to the numbers of components in these subsystems. These results may also be found, for instance, in [4]-[5].
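As one concrete instance of such limit theorems, for a homogeneous parallel system of n components with iid exponential lifetimes, the reliability at the normalized time t_n = (log n + x)/λ tends to the double-exponential limit R(x) = 1 - exp(-e^{-x}). The following sketch is our own numerical illustration of this classical fact, not an example from the paper.

```python
import math

def exact_parallel_reliability(n, x, lam=1.0):
    # Exact reliability of an n-component parallel system with iid
    # exponential(lam) lifetimes at the normalized time (log n + x)/lam.
    t = (math.log(n) + x) / lam
    return 1.0 - (1.0 - math.exp(-lam * t)) ** n

def limit_reliability(x):
    # Double-exponential (Gumbel-type) limit reliability function
    return 1.0 - math.exp(-math.exp(-x))

for n in (10, 100, 10000):
    print(n, exact_parallel_reliability(n, 0.5), limit_reliability(0.5))
```

As n grows, the exact values converge to the limit value, which is what makes the asymptotic approach practical for large systems.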
Generalisations of the results on limit reliability functions of two-state homogeneous series and parallel systems to the case when these systems are non-homogeneous are mostly considered in the author's works. A more general problem is concerned with fixing the classes of possible limit reliability functions for so-called "rectangular" series-parallel and parallel-series systems. This problem, for the homogeneous series-parallel and parallel-series systems of any shapes, with different numbers of subsystems and numbers of components in these subsystems, has been progressively solved in the author's works. The main and new result of these works was the determination of seven new limit reliability functions for homogeneous series-parallel systems as well as for parallel-series systems. This way, new ten-element classes of all possible limit reliability functions for these systems have been fixed. Moreover, in these works it has been pointed out that the kind of the system limit reliability function strongly depends on the system shape. These results allow us to evaluate reliability characteristics of homogeneous series-parallel and parallel-series systems with regular reliability structures, i.e. systems composed of subsystems having the same numbers of components. The extensions of these results to non-homogeneous series-parallel and parallel-series systems have been formulated and proved successively in the author's works. These generalisations additionally allow us to evaluate reliability characteristics of series-parallel and parallel-series systems with non-regular structures, i.e. systems with subsystems having different numbers of components. In some of the mentioned works, theoretical considerations and solutions as well as numerous practical applications of the asymptotic approach to real technical system reliability evaluation may also be found. More general and practically important complex systems composed of multi-state and ageing-in-time components are considered, among others, in [4]-[5]. An especially important role in the evaluation of technical systems reliability and safety and their operating process effectiveness is played by the large multi-state systems with degrading components defined in these works. The most important results, being the generalisations of the results on limit reliability functions of two-state systems and consisting in transferring them to series, parallel, "m out of n", series-parallel and parallel-series multi-state systems with degrading components, are given in the newest author's works. Some of these publications also contain practical applications of the asymptotic approach to the reliability evaluation of various technical systems [4]-[5]. The results concerned with the asymptotic approach to system reliability analysis have become the basis for the investigation concerned with domains of attraction for the limit reliability functions of the considered systems [1],[6]-[7]. In a natural way they have caused the investigation of the speed of convergence of the system reliability function sequences to their limit reliability functions. These results have also initiated the investigation on limit reliability functions of "m out of n"-series and series-"m out of n" systems [10]-[11], systems with hierarchical reliability structures [1]-[3], and the investigations on the
problems of the system reliability improvement and optimisation [8]-[9] as well. All these problems are completely presented in [12].

3. The Book "Reliability of Large Systems"

The aim of this book [12] is to deliver the complete elaboration of the state of the art on the method of the asymptotic approach to reliability evaluation for as wide as possible a range of large systems. Pointing out the possibility of extensive practical application of this method in the operating processes of these systems is also an important reason for this book. The book contains complete current theoretical results of the asymptotic approach to reliability evaluation of large two-state and multi-state series, parallel, "m out of n", series-parallel and parallel-series systems, together with their practical applications to reliability evaluation of a wide range of technical systems. Additionally, some recent partial results on the asymptotic approach to reliability evaluation of "m out of n"-series, series-"m out of n" and hierarchical systems, the application of the results to large systems reliability improvement and to large systems reliability analysis in their operation processes are presented in the book.

The following construction of the book has been assumed. In chapters concerned with two-state systems the results and theorems are presented without the proofs but with exact references to the literature where their proofs may be found. Moreover, the procedures of the practical application of the results are described and applied to the reliability evaluation of model two-state systems. In chapters concerned with multi-state systems the recent theorems about their multi-state limit reliability functions are formulated and shortly justified. Next, the procedures of the result applications are presented and applied to real technical systems reliability and risk evaluation. Moreover, the possibility of the computer-aided reliability evaluation of these systems is suggested and its use is presented. The book contains complete actual solutions of the formulated problems for the considered large systems reliability evaluation in the case of any reliability functions of the system components.

The book consists of the Introduction, 8 chapters, the Summary and the Bibliography. In Chapter 1, which follows the Introduction, some basic notions necessary for further considerations are introduced. The asymptotic approach to the system reliability investigation and the system limit reliability function are defined. In Chapter 2, two-state homogeneous and non-homogeneous series, parallel, "m out of n", series-parallel and parallel-series systems are defined. Their exact reliability functions are also determined. Basic notions of the system multi-state reliability analysis are introduced in Chapter 3. Further, the multi-state homogeneous and non-homogeneous series, parallel, "m out of n", series-parallel and parallel-series systems with degrading components are defined and their exact reliability functions are determined. Moreover, the notions of the multi-state limit reliability function of the system,
its risk function and other multi-state system reliability characteristics are introduced.

Chapter 4 is concerned with limit reliability functions of two-state systems. Three-element classes of limit reliability functions for homogeneous and non-homogeneous series systems are fixed. Some auxiliary theorems that allow us to justify facts on the methods of those systems' reliability evaluation are formulated and proved. The chapter also contains the application of one of the proven facts to the reliability evaluation of a non-homogeneous gas pipeline that is composed of components with Weibull reliability functions. The accuracy of this evaluation is also illustrated. Three-element classes of possible limit reliability functions for homogeneous and non-homogeneous parallel systems are fixed as well. Some auxiliary theorems that allow us to justify facts on the methods of those systems' reliability evaluation are formulated and proved. The chapter also contains the application of one proved fact to the reliability evaluation of a homogeneous energetic cable used in the overhead electrical energy distribution that is composed of components with Weibull reliability functions. The accuracy of this evaluation is illustrated in a table and a figure. The class of limit reliability functions for a homogeneous "m out of n" system is fixed, and the "16 out of 35" lighting system reliability is evaluated in this chapter. The chapter also contains the results of investigations on limit reliability functions of two-state homogeneous and non-homogeneous series-parallel systems. Apart from formulated and proved auxiliary theorems that allow us to justify facts on the methods of those systems' reliability evaluation, their ten-element classes of possible limit reliability functions are fixed. In this chapter, in the part concerned with applications, two facts are formulated and proved that determine limit reliability functions of series-parallel systems in the cases when they are composed of components having the same and different Weibull reliability functions. On the basis of those facts, the reliability characteristics of the homogeneous gas pipeline composed of two lines of pipe segments and of the non-homogeneous water supply system composed of three lines of pipe segments are evaluated. The results of investigations on limit reliability functions of two-state homogeneous and non-homogeneous parallel-series systems are given in this chapter as well. Theorems which determine ten-element classes of possible limit reliability functions for those systems, in the cases when they are composed of identical and different components, are formulated and justified. Moreover, some auxiliary theorems that are necessary in the practical reliability evaluation of real technical systems are formulated and proved. In the part concerned with applications, one fact is formulated, proved and then applied to the evaluation of the reliability of a model homogeneous parallel-series system.

The generalisations of the results of Chapter 4 on limit reliability functions of two-state systems, consisting in their transfer to multi-state series, parallel, "m out of n", series-parallel and parallel-series systems, are done in Chapter 5. The classes of all possible limit reliability functions for these systems, in the cases when they are composed of components identical and different in the reliability sense, are fixed. The newest theorems that allow us to evaluate the
reliability of large technical systems of those kinds are formulated and proved in this chapter as well. Apart from the main theorems fixing the classes of multi-state limit reliability functions of the considered systems, some auxiliary theorems and corollaries allowing their direct application to the reliability evaluation of real technical objects are also formulated and proved. Moreover, this chapter contains wide application parts in which the results are applied to the evaluation of reliability characteristics and risk functions of different multi-state transportation systems. The results concerned with multi-state series systems are applied to the reliability evaluation and risk function determination of homogeneous and non-homogeneous pipeline transportation systems, the homogeneous model telecommunication network and the homogeneous bus transportation system. The results concerned with multi-state parallel systems are applied to the reliability evaluation and risk function determination of an energetic cable used in the overhead electrical energy distribution network and to the reliability and durability evaluation of the three-level steel rope used in rope transport. Results on limit reliability functions of a homogeneous multi-state "m out of n" system are applied to the durability evaluation of a steel rope. A model homogeneous series-parallel system and homogeneous and non-homogeneous series-parallel pipeline systems composed of several lines of pipe segments are estimated as well. Moreover, the reliability evaluation of the model homogeneous parallel-series electrical energy distribution system is performed.

Chapter 6 is devoted to the multi-state asymptotic reliability analysis of port and shipyard transportation systems. Theoretical results of this chapter and Chapter 5 are applied to the reliability evaluation and the risk function determination of some selected port transportation systems. The results of the asymptotic approach to reliability evaluation of non-homogeneous multi-state series-parallel systems are applied to the transportation system used in the Baltic Grain Terminal of the Port of Gdynia for transporting grain from its elevator to the rail carriages. The results of the asymptotic approach to the reliability evaluation of non-homogeneous multi-state series-parallel systems are also applied to the piping transportation system used in the Oil Terminal in Debogorze. This transportation system is destined for taking the oil from the tankers that deliver it to the unloading pier located at the breakwater of the Port of Gdynia. The results of the asymptotic approach to reliability evaluation of non-homogeneous multi-state series-parallel and series systems are applied to the transportation system used in the Baltic Bulk Terminal of the Port of Gdynia for loading bulk cargo on ships. The results of this chapter and Chapter 5 are also applied to the reliability evaluation and risk function determination of a shipyard transportation system. Namely, the results of the asymptotic approach to reliability evaluation of homogeneous multi-state parallel-series systems are applied to the ship-rope transportation system used in the Naval Shipyard of Gdynia for docking ships coming for repair. The reliability analysis of the systems considered in this chapter is based on data concerning the
operation processes and reliability of their components coming from experts, from component technical norms and from their producers' certificates.

In Chapter 7 the classes of possible limit reliability functions are fixed for the considered systems in the case when their components have exponential reliability functions. Theoretical results are presented in the form of a very useful guide containing algorithms placed in tables and giving the sequential steps of proceeding in the reliability evaluation in each of the possible cases of the considered system shapes. The application of these algorithms to the reliability evaluation of the multi-state non-homogeneous series transportation system, the multi-state model homogeneous series-parallel system, the multi-state non-homogeneous series-parallel pipeline transportation system and the multi-state non-homogeneous parallel-series bus transportation system is illustrated. The evaluation of reliability functions, risk functions, mean values of sojourn times in subsets of states and mean values of sojourn times in particular states for these systems is done. The calculations are performed using a computer programme based on the algorithms, allowing automatic reliability evaluation of large real technical systems.

In Chapter 8 the open problems related to the topics considered in the book are presented. The domains of attraction for the previously fixed limit reliability functions of the series, parallel, "m out of n", series-parallel and parallel-series systems are introduced. More exactly, theorems are formulated giving conditions which the reliability functions of the components of the system have to satisfy in order that the system limit reliability function is one of the functions from the system's class of all limit reliability functions. Some examples of the result application for series systems are also illustrated. The practically very important problem of the speed of convergence of system reliability function sequences to their limit reliability functions is investigated as well. An exemplary theorem is presented which allows estimating the differences between the system limit reliability functions and the members of their reliability function sequences. Next, an example of the speed of convergence evaluation of reliability function sequences for a homogeneous series-parallel system is given. Partial results of the investigation on the asymptotic approach to reliability evaluation of "m out of n"-series, series-"m out of n" and hierarchical systems and on system reliability improvement are presented. These result applications are illustrated graphically as well. The analysis of large systems reliability in their operation processes is given at the end of this chapter.

The book is completed by the Summary, which contains the evaluation of the presented results, the formulation of open problems concerned with large systems reliability and the perspective of further investigations on the considered problems. All problems described generally in this paper will be presented in detail during the AIWARM Workshop.
References
1. A. Cichocki, D. Kurowicka and B. Milczek, Statistical and Probabilistic Models in Reliability, Ionescu D. C. and Limnios N. Eds., Birkhauser, Boston, 184 (1998).
2. A. Cichocki, Applied Mathematics and Computation, 120, 55 (2001).
3. A. Cichocki, PhD Thesis, Gdynia Maritime University-Systems Research Institute, Warsaw (2003).
4. K. Kolowrocki, International Journal of Pressure Vessels and Piping, 80, 59 (2003).
5. K. Kolowrocki, International Journal of Reliability, Quality and Safety Engineering, 10, No 3, 249 (2003).
6. D. Kurowicka, Applied Mathematics and Computation, 98, 61 (1998).
7. D. Kurowicka, PhD Thesis, Gdynia Maritime University-Delft University (2001).
8. B. Kwiatuszewska-Sarnecka, Applied Mathematics and Computation, 123, 155 (2001).
9. B. Kwiatuszewska-Sarnecka, PhD Thesis, Gdynia Maritime University-Systems Research Institute, Warsaw (2003).
10. B. Milczek, Applied Mathematics and Computation, 137, 161 (2002).
11. B. Milczek, PhD Thesis, Gdynia Maritime University-Systems Research Institute, Warsaw (2004).
12. K. Kolowrocki, Reliability of Large Systems, Elsevier (2004).
AN OPTIMAL POLICY TO MINIMIZE EXPECTED TARDINESS COST DUE TO WAITING TIME IN THE QUEUE
JUNJI KOYANAGI AND HAJIME KAWAI
Tottori University, Koyama Minami 4-101, Tottori City, Tottori, Japan

We consider a discrete time queueing system. The service time and the interarrival time have geometric distributions. We assume that there is a single decision maker (DM) who has a task (Task A (TA)) which needs to be processed by the server of the queueing system and another task (Task B (TB)) which is processed outside the queue and needs a constant time. While processing TB, DM observes the queue at every unit time and can interrupt TB to join the queue, but DM cannot process TB while in the queueing system. After TA is finished, DM resumes TB. There is a deadline for processing both tasks, and if both tasks are finished beyond the deadline, a tardiness cost is incurred. We aim at minimizing the tardiness cost and show the properties of the optimal policy.
1. Introduction
In queueing theory, it is usually assumed that customers arrive at the queue without making decisions. In a paper dealing with a decision problem by a customer, Naor proposed a system in which the customer decides whether to join the queue [5]. When the customer arrives, he/she observes the queue length and compares the waiting cost with the service merit. If the service merit is larger than the waiting cost, the customer joins the queue. When the arrival rate is high (but smaller than the service rate), the queue length becomes long, but the arrivals will stop when the service merit becomes smaller than the expected waiting cost. Then the queueing system behaves as a queueing system with finite capacity. Though the arrivals stop when the queue length reaches this threshold, Naor showed that, from the viewpoint of socially optimal control, it is better to stop the arrivals before the queue length reaches the threshold. One way to stop the arrivals before the threshold is to charge an admission fee to the customers, which decreases the service merit. Naor showed that the socially optimal control is attained by the individually optimal control by charging a suitable admission fee. Mandelbaum and Yechiali deal with a model where one customer, called the smart customer, can decide whether to enter the queue or leave the system [1]. As an additional action, the smart customer can defer the decision outside the queue, observing the queue length and paying a waiting cost smaller than the waiting cost in the queue. The authors showed that the optimal policy has the following structure. When the queue length is short, the smart customer should enter the system. When the queue length is middle, he should defer the decision, and when it is long, he
should leave. We can find many other models dealing with decision problems in queueing systems in Hassin and Haviv [4]. In the usual decision problems, the customer's decision depends on the queue length only. We have studied some models where the number of decisions is finite or stochastic; therefore, the decision is affected by the queue length and the number of decisions. In our models the number of decisions is considered as the number of steps of the task (TB) which is processed outside the queue. We consider a decision maker (DM) with a task (TA) served in the queueing system who wants to minimize the waiting time in the queue. At every step of TB, DM decides whether to interrupt TB and enter the queue. If DM chooses to enter the queue, he resumes TB after TA is finished. If DM finishes TB before entering the queue, DM must enter the queue and wait for the service. One example of this problem is a repair problem for a machine with two functions. Consider a job which needs a hardcopy of thumbnail images of many photos. There is a printer which can be used as a scanner and we can use only this printer for the job. However, the print function is found to be broken, but the scan function can be used. If we bring the printer to the repair factory, with a queue of other devices which need to be repaired, we must wait until the repair is completed, and we cannot use the scan function while the printer is in the factory. Since the print function is needed after all photos are scanned, we can postpone the repair and scan the photos before the repair. Then the problem is when to bring the printer to the factory. We consider a cost which depends on the time needed to process both TA and TB. In previous papers, we considered the expected time for TA and TB as the cost in a continuous time system [2], and we also considered the probability of not finishing TA and TB by the deadline in a discrete time system [3]. In this paper, we consider the tardiness cost in a discrete time system and show the switch curve structure of the optimal policy.

2. Model

We consider one decision maker (DM) who has two tasks, TA and TB. TA is processed in a discrete time queueing system. The arrival and the end of the service happen at every unit time with constant probabilities: the end of the service happens with probability q and the arrival does with probability p (p < q). The end of the service happens just before the arrival. The other task TB needs b units of time, and DM observes the queueing system at every unit time while processing TB and decides whether to join the queue. The two tasks should be finished within b + l time, and if the two tasks are finished after b + l time, a tardiness cost is incurred. The cost is incurred as follows (Fig. 1).
(1) Suppose that DM processes TB for m (< b) units of time.
(2) After m units of time are spent on TB, DM observes queue length i and decides to join the queue.
(3) If X time is needed for the i + 1 customers (including DM), DM processes the rest of TB, which needs b - m time, after TA is processed in the queue.
(4) In the above situation, the tardiness cost becomes max{0, X - l}. Therefore, if the time in the queue is less than (or equal to) l, no cost is incurred.
Figure 1. The tardiness cost
To minimize the tardiness cost, DM chooses between two actions at each time epoch: action A is to join the queue; action B is to process TB for one unit of time. If DM chooses action B, DM makes a decision again after one unit of time, until TB is finished. We define (i, m, l) as the system state, where i is the queue length, m units of time have been spent on TB, and l is the maximum time in the queue without the tardiness cost. For the optimality equations, we define the following functions.
(1) A(i, m, l) is the expected cost when DM chooses action A while in (i, m, l).
(2) B(i, m, l) is the expected cost when DM chooses action B while in (i, m, l) and behaves optimally thereafter.
(3) V(i, m, l) is the optimal expected cost for (i, m, l).
The queueing system is a discrete time system and at most one customer arrives or leaves in one unit of time. Therefore, if action B is taken, the state transition from (i, m, l) is restricted to (i - 1, m + 1, l) (i > 0), (i, m + 1, l) or (i + 1, m + 1, l). With these state transitions, we have the following optimality equations.
A(i, m, l) = Σ_{k=0}^{i} C(l, k) q^{k-1} (1 - q)^{l-k} (i + 1 - k),   (1)

where C(l, k) denotes the binomial coefficient,

B(i, m, l) = q(1 - p) V(i - 1, m + 1, l) + (qp + (1 - q)(1 - p)) V(i, m + 1, l) + (1 - q)p V(i + 1, m + 1, l)   (i > 0),   (2)

V(i, m, l) = min{A(i, m, l), B(i, m, l)},   V(i, b, l) = A(i, b, l).   (3)

The expression of A(i, m, l) is obtained as follows.
(1) DM chooses to join the queue in (i, m, l); then DM becomes the (i+1)-st customer in the queue.
(2) During l units of time, k customers are served; the distribution of k is binomial with success probability q.
(3) If k ≥ i + 1, no cost is incurred because TA is finished within l; otherwise (i + 1 - k)/q is the expected cost, since each remaining service takes 1/q expected time.
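A direct transcription of (1), assuming the reconstruction above; the function name and sample values are our own illustration.

```python
from math import comb

def A(i, l, q):
    # Expected tardiness cost of joining behind i customers when l units
    # of time remain cost-free (eq. (1)); note the value does not depend
    # on m.  Each of the i + 1 - k unfinished services costs 1/q on average.
    return sum(comb(l, k) * q ** (k - 1) * (1 - q) ** (l - k) * (i + 1 - k)
               for k in range(i + 1))

print(A(2, 10, 0.5))
```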
Note that A(i, m, l) is independent of m.

3. Analysis
Let us define the policy under which DM takes action B in (i, m, l) and then takes action A after one unit of time, irrespective of the queue length. If DM takes this policy, the expected cost C(i, m, l) becomes
C(i, m, l) = q(1 - p) A(i - 1, m + 1, l) + (qp + (1 - q)(1 - p)) A(i, m + 1, l) + (1 - q)p A(i + 1, m + 1, l).   (4)
Note that C(i, m, l) ≥ B(i, m, l) by definition. Since DM must take action A while in (i, b, l), we first examine the optimal policy in (i, b - 1, l). In (i, b - 1, l), B(i, b - 1, l) = C(i, b - 1, l) holds, because DM must take action A after one unit of time if he takes action B. From (1), we have
A(i + 1, m, l) = A(i, m, l) + Σ_{k=0}^{i+1} C(l, k) q^{k-1} (1 - q)^{l-k}.   (5)
By using (5),

C(i, b - 1, l) = A(i, b - 1, l) + (p - q) Σ_{k=0}^{i} C(l, k) q^{k-1} (1 - q)^{l-k} + p C(l, i + 1) q^{i} (1 - q)^{l-i}.   (6)
Let us define S(i) as the sum of the last two terms of (6), that is,

S(i) ≡ (p - q) Σ_{k=0}^{i} C(l, k) q^{k-1} (1 - q)^{l-k} + p C(l, i + 1) q^{i} (1 - q)^{l-i}.   (7)
If S(i) is positive, then C(i, b - 1, l) ≥ A(i, b - 1, l) and action A should be taken in (i, b - 1, l). We have the following lemma.
Lemma 3.1. The following properties hold for S(i).
(1) For i + 2 ≥ p(l + 1), S(i) is decreasing in i.
(2) For i + 2 < p(l + 1), S(i) is positive.

Proof.

S(i) - S(i + 1) = (q - p) C(l, i + 1) q^{i} (1 - q)^{l-i-1} + p C(l, i + 1) q^{i} (1 - q)^{l-i} - p C(l, i + 2) q^{i+1} (1 - q)^{l-i-1}.   (8)

From this, S(i) ≥ S(i + 1) holds if

(q - p) C(l, i + 1) + p(1 - q) C(l, i + 1) ≥ pq C(l, i + 2),   (9)

which, using C(l, i + 2)/C(l, i + 1) = (l - i - 1)/(i + 2), is equivalent to

i + 2 ≥ p(l + 1).   (10)

This completes the proof of Lemma 3.1 (1). Next, we prove Lemma 3.1 (2). Assume that p > (i + 2)/(l + 1). Then, from (7),

S(i) > ((i + 2)/(l + 1)) [Σ_{k=0}^{i} C(l, k) q^{k-1} (1 - q)^{l-k} + C(l, i + 1) q^{i} (1 - q)^{l-i}] - Σ_{k=0}^{i} C(l, k) q^{k} (1 - q)^{l-k}.

We prove that

(i + 2) [Σ_{k=0}^{i} C(l, k) q^{k-1} (1 - q)^{l-k} + C(l, i + 1) q^{i} (1 - q)^{l-i}] ≥ (l + 1) Σ_{k=0}^{i} C(l, k) q^{k} (1 - q)^{l-k}   (11)

by induction. For i = 0, it holds because

2 [(1 - q)^{l}/q + l (1 - q)^{l-1}] - (l + 1)(1 - q)^{l} = (1 - q)^{l-1} [2(1 - q)/q + (l - 1) + (l + 1)q] > 0.

Now, assume that (11) holds for i - 1 and prove (11). Multiplying the induction hypothesis by suitable factors and substituting it into the left side of (11), the required inequality follows after rearrangement. This completes the proof of Lemma 3.1 (2). □

Lemma 3.1 indicates that if B(i, b - 1, l) ≤ A(i, b - 1, l), then i + 2 ≥ p(l + 1) and B(j, b - 1, l) ≤ A(j, b - 1, l) for j ≥ i. Therefore, as queue length i increases in (i, b - 1, l), the optimal action changes from A to B if it changes, and never changes from B to A. Thus Lemma 3.1 shows that a threshold policy is optimal at the last decision epoch (when TB has been processed for b - 1 units of time). For (i, m, l), we have the following lemma.

Lemma 3.2. For B(i, m, l) and V(i, m, l), it holds that B(i, m - 1, l) ≤ B(i, m, l) and V(i, m - 1, l) ≤ V(i, m, l).

Proof. By definition, V(i, b, l) = A(i, b, l) and V(i, b - 1, l) ≤ A(i, b - 1, l). It also holds that A(i, m, l) = A(i, m - 1, l) by (1). Thus

V(i, b - 1, l) ≤ A(i, b - 1, l) = A(i, b, l) = V(i, b, l).   (16)

It is easy to show that if V(i, m - 1, l) ≤ V(i, m, l), then B(i, m - 2, l) ≤ B(i, m - 1, l) and V(i, m - 2, l) ≤ V(i, m - 1, l). Thus, by induction, Lemma 3.2 holds. This completes the proof of Lemma 3.2. □

Lemma 3.2 shows that if B(i, m, l) ≤ A(i, m, l), then B(i, k, l) ≤ A(i, k, l) for k ≤ m. Thus, as the time spent on TB increases, the optimal action changes from B to A if it changes, and never changes from A to B.

Lemma 3.3. If A(i, m, l) ≤ B(i, m, l) for 0 ≤ i ≤ I_m, then A(i, m - 1, l) ≤ B(i, m - 1, l) holds for 0 ≤ i ≤ I_m - 1.

Proof. Since the queue length i changes by at most one, for 0 ≤ i ≤ I_m - 1, B(i, m - 1, l) = C(i, m - 1, l). By definition, C(i, m - 1, l) = C(i, m, l) and C(i, m, l) ≥ B(i, m, l) hold. Thus A(i, m - 1, l) = A(i, m, l) ≤ B(i, m, l) ≤ C(i, m, l) = C(i, m - 1, l) = B(i, m - 1, l) holds for 0 ≤ i ≤ I_m - 1. This completes the proof of Lemma 3.3. □

With the help of these lemmas we have the following theorem about the properties of the optimal policy.

Theorem 3.1. Let I_m = max{i | A(i, m, l) ≤ B(i, m, l)}.
(1) For 0 ≤ i ≤ I_m, the optimal action is action A.
(2) I_m is increasing and increases by at most one as m increases, i.e., I_{m-1} ≤ I_m ≤ I_{m-1} + 1.
By this theorem the optimal policy has the structure shown in Fig. 2.

References
1. A. Mandelbaum and U. Yechiali, Optimal entering rules for a customer with wait option at an M/GI/1 queue, Management Science, 29-2, 174-187 (1983).
2. J. Koyanagi and H. Kawai, An optimal join policy to the queue in processing two kinds of jobs, Proc. of the Int. Conf. on Applied Stochastic System Modeling, 140-147 (2000).
3. J. Koyanagi and H. Kawai, A maximization of the finishing probability of two jobs processed in a queue, Proc. of the 32nd ISCIE International Symposium on Stochastic Systems Theory and Its Applications, 171-176 (2001).
4. R. Hassin and M. Haviv, To Queue or Not to Queue, Kluwer Academic Publishers, Boston (2003).
5. P. Naor, The regulation of queue size by levying tolls, Econometrica, 37, 15-24 (1969).
Figure 2. The structure of the optimal policy
RELIABILITY OF A k-OUT-OF-n SYSTEM WITH REPAIR BY A SERVICE STATION ATTENDING A QUEUE WITH POSTPONED WORK
A. KRISHNAMOORTHY* AND VISWANATH C. NARAYANAN†
Department of Mathematics, Cochin University of Science and Technology, Kochi 682 022, Kerala, India
E-mail: [email protected]

T. G. DEEPAK*
Regional Centre, M. G. University, Kochi 682 024, India
In this paper the reliability of a repairable k-out-of-n system is studied. Repair times of components follow a phase type distribution. In addition, the service facility offers service to external customers, who on arrival are directed to a pool of postponed work if the service station is busy; otherwise the external customer is taken immediately for service. Service times of components of the system and those of the external customers have independent phase type distributions. At a service completion epoch, if the buffer has fewer than L customers, a pooled customer is taken for service with probability p, 0 < p < 1. If at a service completion epoch no component of the system is waiting for repair, a pooled customer, if any is waiting, is immediately taken for service. We obtain the system state distribution under the condition of stability. A number of performance characteristics are derived. A cost function involving L, M, γ and p is constructed and its behaviour investigated numerically.
1. Introduction
In this paper we consider the reliability of a k-out-of-n system with repair by a single repairman facility which also provides service to external customers when the components of the system are all functional. We assume that the k-out-of-n system is COLD. A k-out-of-n system is characterized by the fact that the system operates as long as there are at least k operational components. The system is COLD in the sense that operational components do not fail while the system is in the down state (the number of failed components at that instant is n - k + 1). Using the same analysis as employed in this paper, one can study the WARM and HOT systems also (a k-out-of-n system is called a HOT system if operational components continue

*Research supported by NBHM (DAE, Govt. of India)
†CSIR Research Fellow
to deteriorate at the same rate while the system is down as when it is up. The system is WARM if the deterioration rate while the system is up differs from that when it is down). A repair facility consisting of a single server repairs the failed components one at a time. The lifetimes of components are independent and exponentially distributed random variables with parameter λ/i when i components are operational. Thus on average λ failures take place per unit time when the system operates with i components. The failed components are sent to the repair facility and are repaired one at a time. The waiting space has capacity to accommodate a maximum of n - k + 1 units in addition to the unit undergoing service. Service times of main customers (components of the k-out-of-n system) follow a phase type distribution and are independent and identical for all components. In addition to repairing failed components of the system, the repair facility provides service to external customers. However, these customers are entertained only when the server is idle (no component of the main system is in repair or even waiting). These customers are not allowed to use the waiting space at the repair facility. So when external customers arrive for service (the arrival process forms a Poisson process) while the server is busy serving a component of the system or an external customer, they are directed to a pool of infinite capacity. We stress the fact that if a component of the system fails at an instant when an external customer is undergoing service, the component's repair starts only on completion of service of the external customer. That is, external customers are given non-preemptive service. The service times of external customers are iid random variables following a phase-type distribution. Postponement of work is a common phenomenon. It may occur in order to attend to a more important job than the one being processed at present, or for a break, or due to lack of quorum (in the case of bulk service), and so on. Queueing systems with postponed work are investigated in Deepak, Joshua and Krishnamoorthy [1]. For details regarding queues with postponed work refer to the above and the references therein. The k-out-of-n system is investigated extensively (see Krishnamoorthy et al [2] and references therein). The objective of this paper is to maximize the system reliability. This paper is arranged as follows. In Section 2 the problem under investigation is mathematically formulated. Section 3 deals with the stability of the system. The stationary distribution of the system is studied in Section 4 and some system performance measures are given. In Section 5, a cost function is constructed and some numerical illustrations provided.
2. Mathematical modelling

We consider a k-out-of-n cold system in which the components have exponentially distributed lifetimes with parameter λ/i when there are i operational components. There is a single server repair facility which gives service to failed components (main customers) and also to external customers. The external customers arrive according
to a Poisson process of rate δ. Repair times of main and external customers follow PH-distributions with representations (β_1, S_1) of order m_1 and (β_2, S_2) of order m_2, respectively. S_1^0 and S_2^0 are column matrices such that

S_i e + S_i^0 = 0,   i = 1, 2.

Let Y_1(t) be the number of external customers in the system, including the one getting service, if any, and Y_2(t) be the number of main customers in the system, including the one getting service, if any, at time t. If an external customer, on arrival, finds a busy server and Y_2(t) < M (M ≤ n - k + 1), it joins a pool of infinite capacity with probability 1; on the other hand, if Y_2(t) ≥ M, then with probability γ it joins the pool and otherwise leaves the system forever.
If 0 < Y_2(t) ≤ L - 1 (L ≤ M) at a service completion epoch, then with probability p a pooled customer, if there is any, is given service. If Y_2(t) = 0 at a service completion epoch, then with probability 1 a pooled customer, if any, gets service. If Y_2(t) > L - 1 at a service completion epoch, then with probability 1 a main customer gets service. If Y_1(t) = Y_2(t) = 0, then an external customer arriving at time t is taken for service. Define

Y_3(t) = 0 if a main customer is getting service at time t, and Y_3(t) = 1 if an external customer is getting service at time t.
Let Y_4(t) denote the phase of the service process at time t. Now X(t) = (Y_1(t), Y_2(t), Y_3(t), Y_4(t)) forms a continuous time Markov chain, which turns out to be a level independent quasi birth and death process with state space ∪_{i≥0} l(i), where the levels l(i) are defined as

l(0) = {(0, j_1, 0, j_2) : 1 ≤ j_1 ≤ n - k + 1, 1 ≤ j_2 ≤ m_1} ∪ {(0)}
l(i) = {(i, j_1, 0, j_2) : 1 ≤ j_1 ≤ n - k + 1, 1 ≤ j_2 ≤ m_1} ∪ {(i, j_1, 1, j_2) : 0 ≤ j_1 ≤ n - k + 1, 1 ≤ j_2 ≤ m_2},   i ≥ 1,

where (0) denotes the state corresponding to Y_1(t) = Y_2(t) = 0. Arranging the states lexicographically, we obtain the block-structured infinitesimal generator Q of the process X(t).
3. Stability condition

Theorem 1. The steady state probability vector π of the generator matrix A = A_0 + A_1 + A_2, which is partitioned as π = (π(0), π(1), ..., π(n - k + 1)), is given by

π(0) = (π(0)e) (λ/(1 - h(λ))) β_2(λI - S_2)^{-1},
π(i) = (π(0)e) (λ/(1 - h(λ))) β_2(λI - S_2)^{-1} B_{01} R_1^{i-1},   1 ≤ i ≤ L - 1,
π(i) = (π(0)e) (λ/(1 - h(λ))) β_2(λI - S_2)^{-1} B_{01} R_1^{L-2} R_2^{i-L+1},   L ≤ i ≤ n - k,
π(n - k + 1) = (π(0)e) (λ/(1 - h(λ))) β_2(λI - S_2)^{-1} B_{01} R_1^{L-2} R_2^{n-k-L+1} R_3,

where h(λ) = β_2(λI - S_2)^{-1} S_2^0, B_{01} = [0, λI_{m_2}], C_1 = e(qβ_1, pβ_2), C_2 = e(β_1, 0), and e is a column vector of 1's of order (m_1 + m_2) × 1. The quantity π(0)e is obtained from the normalizing condition πe = 1. The system is stable if and only if πA_0e < πA_2e.
Proof. The equation πA = 0 is equivalent to the system

π(0)(S_2 + S_2^0 β_2 - λI) + π(1)B_{10} = 0,   (1)
π(0)B_{01} + π(1)B_{11} + π(2)B_{21} = 0,   (2)
λπ(i - 1) + π(i)B_{11} + π(i + 1)B_{21} = 0,   2 ≤ i ≤ L - 1,   (3)
λπ(i - 1) + π(i)B_{12} + π(i + 1)B_{22} = 0,   L ≤ i ≤ n - k,   (4)

and

λπ(n - k) + π(n - k + 1)B_{13} = 0.   (5)
Post-multiplying each of these equations by e, we get, for 1 ≤ i ≤ n - k + 1,

λπ(i - 1)e = π(i)B_{10}e.   (6)
Now (1) can be written as

π(0)(S_2 - λI) + π(0)S_2^0 β_2 + π(1)B_{10}e β_2 = 0,
i.e., π(0)(S_2 - λI) + π(0)S_2^0 β_2 + λ(π(0)e) β_2 = 0,
i.e., π(0) = π(0)[S_2^0 + λe] β_2 (λI - S_2)^{-1}.

Post-multiplying by S_2^0 gives

π(0)S_2^0 = [π(0)S_2^0 + λπ(0)e] h(λ).

Substituting in (6), we get π(0) = (π(0)e) (λ/(1 - h(λ))) β_2(λI - S_2)^{-1}. After some computation we also get π(1) = (π(0)e) (λ/(1 - h(λ))) β_2(λI - S_2)^{-1} B_{01} R_1, where B_{10}, B_{21} and B_{22} are the corresponding blocks of the generator Q. Proceeding similarly, the theorem follows.
01 0
Note: The invertibility of the matrix XC1 + B11 follows from its property of being irreducibly diagonally dominant. 4. Stationary distribution Since the model is studied as a QBD Markov Process, its stationery distribution (if it exist) has a matrix geometric solution. Under the assumption of the existence of the stationary distribution, let the stationary vector x of Q be partitioned by the levels into subvectors x, for i 2 0. Then x,'s are given by x, = zlRZ-'
for i
22
where R is the minimal non-negative solution to the matrix quadratic equation
R2A2 The vectors
50,x1
+ RA1 + A.
= 0.
are obtained by solving the equations
zoBo +
= 0,
+
+
~ o B i X I [ A ~ RAz] = 0
subject to the normalizing condition xoe + x1(1 - R)-'e = 1. To complete the R matrix numerically we used the logarithmic reduction algorithm (see Latouche and Ramaswami 3 , Neuts 4).
4.1. S y s t e m p e r f o r m a n c e m e a s u r e s (1) System reliability which is defined as the probability that there is at least k operational components is given by 81 = r ( 0 ) e o
+ ~ ( 1 ) ( 1R)-'el -
299 where eo is a column vector whose last ml entries are 0’s and all other entries are 1’s and el is a column vector whose last ml mz entries are 0’s and all other entries are 1’s. (2) Probability that system is down 82 = 1 - 81. (3) Expected number of pooled customers
+
n-ktl
03
i=l
cc n - k t l
ml
jl=O
i=l
j,=1 j 2 = 1
m2
j2=1
(4) Expected loss rate of external customers
+
where ez is a column vector whose first 1 ( M - l)ml entries are 0’s and all other entries are 1’s and e3 is a column vector whose first mz + (hi1 l ) ( m l mz) entries are 0’s and all other entries 1’s. (5) Expected number of transfers from the pool when there is at least 1 mai customer present, per unit time
+
cc
L
cc L - l
ml
m2
where q ( k ) is the kth entry of the column matrix S!, i = 1,2. 5. A cost function and numerical illustrations Let C1 be the cost per unit time incured if the system is down, C2 be the holding cost per customer per unit time, C3 be the cost due t o loss of 1 customer and C, profit obtained by serving an external unit when there is at least one main customer present. We construct a cost function as
C = 82C1
+ 83Cz + 84C3
- 05C4
By varying over parameters that are crucial and fixing the rest we plot different graphs.The graphs support what is intuitively believed. For the following graphical representations the following parameters are common: n = 30, k = 10, X = 1, 6 = 1.1 -4.0 0.2 -3.38 0.5 Pz = [0.45 0.551. S1 = 0,5 -4,4]1 PI = [0.4 0.61, SZ = 0.2
[
[
Concluding remarks

In this paper we analysed the reliability of a k-out-of-n system. The condition for system stability is established and various performance measures of the system are obtained. A cost function is studied numerically. Numerical results show that the cost function is convex in L (and in M) when the rest of the parameters are kept fixed.
Figure 1. γ = 0.5, M = 18, p = 0.7, C_1 = 10000, C_2 = 1, C_3 = 2, C_4 = 3. (a) shows that as L increases the system reliability decreases. (b) shows a profitable value of L for the cost function.

Figure 2. γ = 0.5, L = 6, p = 0.8, C_1 = 1000, C_2 = 20, C_3 = 20, C_4 = 40. (a) shows that as the level M increases the system reliability decreases at first, but it soon reaches a stage after which the decrease in reliability is negligibly small. (b) suggests that, looking at the cost function, we can find a profitable value of M.
This suggests that a global optimal value of L exists which minimizes the expected total system running cost. As expected, system reliability decreases with increasing values of L, M and p. By the same procedure we can study the warm and hot systems.

References
1. T. G. Deepak, V. C. Joshua and A. Krishnamoorthy, Queues with postponed work, to appear in TOP.
2. A. Krishnamoorthy, P. V. Ushakumari and B. Lakshmy, k-out-of-n system with repair: the N-policy, Asia Pacific Journal of Operational Research, 19, 47-61 (2002).
3. G. Latouche and V. Ramaswami, Introduction to Matrix Analytic Methods in Stochastic Modelling, SIAM (1999).
4. M. F. Neuts, Matrix-Geometric Solutions in Stochastic Models: An Algorithmic Approach, Johns Hopkins University Press (1981).
RELIABILITY EVALUATION OF A FLOW NETWORK WITH MULTIPLE-CAPACITY LINK-STATES
SEUNG MIN LEE
Department of Statistics, Hallym University, Chunchon 200-702, Korea
E-mail: [email protected]

CHONG HYUNG LEE
Department of Computer, Konyang University, Nonsan 320-711, Korea
E-mail: [email protected]

DONG HO PARK
Department of Statistics, Hallym University, Chunchon 200-702, Korea
E-mail: [email protected]

Many real-world complex systems such as computer communication networks, transport systems of a large town and hydraulic systems which carry gas or fluid through pipeline networks can be regarded as flow networks. The reliability of such a flow network is defined as the probability of transmitting the required amount of flow successfully from the source node to the terminal node, and can be computed in terms of composite paths, each of which is a union of simple paths of the network. This paper proposes a method to evaluate the reliability of an undirected flow network with multiple-capacity link-states. The proposed method is based on the expanded minimal paths defined in the text, which are generated from the given set of minimal paths of the network; the composite paths are then generated in terms of those paths.
1. Introduction
In practice, flow networks with multiple-capacity link-states are more practical and reasonable models than flow networks with binary-capacity link-states. Generally, a flow network is modeled as a graph G(V, E), in which V and E represent the node set and the link set, respectively. In a flow network with multiple-capacity link-states, links have multiple states, and different capacities are assigned to each state of a link. Therefore, a flow network with multiple-capacity link-states is a network model considering both connectivity and the amount of flow transmitted from source to terminal. Also, the maximum capacity flow is considered when a flow is transmitted. Many researchers have considered the performance index or the reliability as measures for evaluating the performance of flow networks with multiple-capacity
link-states when minimal paths or minimal cuts are known. The performance index is the expected value of the source-to-terminal capacity divided by the maximum source-to-terminal capacity. Ref. [10] suggests a method to evaluate the performance index of a flow network with multiple-capacity link-states and uses the expanded minimal paths (emp) representing all permutations of link states with non-zero capacity in each minimal path. However, [9] presents a counterexample showing that the method of [10] is incorrect in some cases. Refs. [3], [4], [5], [6], [7], [12] and [13] use minimal paths to evaluate network reliability, and refs. [2], [3], [4], [8] and [13] use minimal cuts. Among these papers, [7], [8] and [12] consider multiple-capacity link-states as well as node failures. Ref. [13] suggests algorithms which find all minimal path vectors and all minimal cut vectors transmitting the required flow d, referred to as d-MPs and d-MCs, but [2] and [6] point out that the algorithm of [13] has many superfluous steps in finding all d-MCs and d-MPs, respectively, because the algorithm has to transform the original network into a series-parallel network when the original network is not a series-parallel network. Ref. [6] uses the flow conservation law to present a more efficient algorithm which can be applied to a directed flow network with multiple-capacity link-states. Papers such as [2], [6], [7], [8], [11], [12] and [13] basically follow the multi-state structure model. Therefore, in these papers, link states and the values of the system structure function of the multi-state structure model are treated as link capacities and the maximum flow transmitted from source to terminal, respectively. That is, link capacities take only non-negative integer values with an upper limit, and minimal path vectors transmitting a required flow are obtained for evaluating network reliability. In this paper, we consider an undirected flow network with multiple-capacity link-states consisting of undirected links, and the flow network does not follow the multi-state structure model. Thus we do not use minimal path vectors, but the unions of minimal path sets, for evaluating network reliability. For finding unions of minimal paths, we basically follow the method given in [5], which considers an undirected flow network with binary-capacity link-states. For considering multiple-capacity link-states, expanded minimal paths representing all permutations of link states are used. Section 2 gives acronyms, notations and assumptions, and an efficient algorithm is described in Section 3. Section 4 gives a numerical example to illustrate the method.
2. Acronyms, notations and assumptions

Acronyms and notations
mp: minimal path
cp: composite path, which is a union of paths
ep, emp: expanded path and expanded minimal path, respectively
FEMP: set of failure emp
AFEMP: set of additive failure emp
NAFEMP: set of non-additive failure emp
ecp, secp: expanded cp and success expanded cp, respectively
P: mp
C: cp
x, y, z: link state vectors of the corresponding paths
P_x: expanded mp with x
C_z: current expanded cp with z
W(C): maximum capacity flow of the (sub)network induced by C
W_ALL: W({all links with their maximum states in the network})
W_min: a required flow transmitted from the source node to the terminal node
|·|: number of elements of ·
u = v: u_i = v_i for all i and |u| = |v|
Assumptions
1. The nodes are perfect and each has no capacity limit.
2. The links are s-independent and have multiple states with known state probabilities.
3. All links are undirected and each link flow is bounded by the link capacity.
4. The network is good if a specified amount of flow can be transmitted from the source node to the terminal node.
5. The mp of the network, considering connectivity only, are known.

3. Algorithm
In a network with multiple-capacity link-states, we need the information of which links are functioning in which states. To obtain this information, at initialization, the proposed method generates expanded minimal paths (emp) representing all permutations of link states with non-zero capacity in each minimal path. For example, let (A, B) be a mp, and let links A and B have two states and three states containing state 0, respectively. Then, the emp of (A, B) are obtained as (A1, B1) and (A1, B2). Our algorithm focuses on how to efficiently find the expanded composite paths, unions of expanded paths (ep) consisting of emp and subpaths of emp (semp), transmitting a required flow. To do this, we present methods which make a comparison of emp or semp, given in Sec. 3.1, and which check and remove redundancy, given in Secs. 3.2 and 3.3.
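As a concrete illustration, this expansion step can be coded in a few lines. The sketch below is ours, not the paper's: the link names and state counts follow the (A, B) example just given, and encoding an expanded link as a name-state string is an assumption made for readability.

```python
from itertools import product

# Enumerate the expanded minimal paths (emp) of a minimal path: all
# permutations of non-zero link states along the path.
states = {"A": [1], "B": [1, 2]}   # non-zero states of each link (illustrative)

def expand(mp):
    """Return the emp of a minimal path as tuples such as ('A1', 'B2')."""
    per_link = [[f"{link}{s}" for s in states[link]] for link in mp]
    return [tuple(combo) for combo in product(*per_link)]

print(expand(("A", "B")))   # [('A1', 'B1'), ('A1', 'B2')]
```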
3.1. Comparison of expanded paths

Let G_x and G'_y be emp or semp. G_x and G'_y are equal when G = G' and x = y. The union of G_x and G'_y, G_x ∪ G'_y, is obtained as G ∪ G' with the link state vector consisting of the link states of uncommon links and the larger state of common links in G and G'. Also, the difference of G_x and G'_y, G_x − G'_y,
is a semp of G_x on G'_y, and consists of the expanded links of G_x except the expanded links appearing in both G_x and G'_y. For example, (A1,B2) ∪ (A2,C2) and (A1,B2) − (A1,C2) are (A2,B2,C2) and (B2), respectively. Also, (A1,B2) − (A1,C2) and (A1,B2) − (A1,D2) are equal because the same semp, (B2), is obtained. Let all links in G be in G'. Then, G_x is said to be a subset of G'_y if all elements of y − x for common links are non-negative, denoted by G_x ⊂ G'_y. Also, G_x is said to be a proper subset of G'_y if all elements of y − x are non-negative for common links and G ≠ G' when G ⊂ G', or at least one element is positive and the rest are 0's when G = G'. For example, both (A1) and (A1,B1) are subsets of (A1,B1); (A1) is a proper subset of (A1,B1) but (A1,B1) is not. Also, (A1,B1) is a proper subset of (A1,B2), (A2,B1) and (A2,B2). But (A2,B1) is not a proper subset of (A1,B2). These operations can be coded directly, as sketched below.
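The following is a minimal sketch of the three operations under our own encoding (a dict mapping link name to state); it is an illustration, not code from the paper.

```python
# Expanded paths are represented as dicts mapping link -> state,
# e.g. {'A': 1, 'B': 2} stands for (A1, B2).
def union(g, h):
    """Keep uncommon links and the larger state of common links."""
    out = dict(g)
    for link, s in h.items():
        out[link] = max(out.get(link, 0), s)
    return out

def difference(g, h):
    """G_x - G'_y: drop the expanded links of g that also appear in h."""
    return {link: s for link, s in g.items() if h.get(link) != s}

def is_subset(g, h):
    """G_x is a subset of G'_y if h covers every link of g with y - x >= 0."""
    return all(link in h and h[link] - s >= 0 for link, s in g.items())

print(union({'A': 1, 'B': 2}, {'A': 2, 'C': 2}))       # {'A': 2, 'B': 2, 'C': 2}
print(difference({'A': 1, 'B': 2}, {'A': 1, 'C': 2}))  # {'B': 2}
print(is_subset({'A': 1}, {'A': 1, 'B': 1}))           # True
```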
3.2. Algorithm

Basically, the proposed algorithm adds an emp or a semp, one by one, to the current expanded cp until a success expanded cp transmitting a required flow from source node to terminal node is obtained. To determine the emp to add to the current expanded cp, the emp's having lower states among the same mp's are considered as candidates. Among all candidates, we select the one giving the maximal increase of the maximum capacity flow. Let FEMP be the set of failure emp's which cannot transmit a required flow, and AFEMP be the set of additive failure emp's which are candidates to be added to the current ecp, a union of emp. Also, let NAFEMP be the set of non-additive failure emp. Therefore, FEMP consists of AFEMP and NAFEMP. At initialization, we expand all mp so that each link in a mp obtains all permutations of link states with non-zero capacity, and the emp's in which all elements of the link state vector are 1's are considered as candidates to be added to the current ecp. Set these emp's in the set of additive failure emp (AFEMP) and the others in the set of non-additive failure emp (NAFEMP), and set FEMP by {emp's in AFEMP : emp's in NAFEMP}. An ecp transmitting a required flow from source node to terminal node is referred to as a success ecp (secp). Let P_x be the ep which gives the maximal increase of the maximum capacity flow among AFEMP, C_z be the current ecp, and ELGEMP be a set of ep preventing the generation of a secp containing an already obtained secp. That is, by the use of ELGEMP we can obtain minimal secp efficiently. Set C_z = ∅ and ELGEMP = ∅. If W_ALL < W_min, STOP. Otherwise, find P_x in AFEMP.

Case 1. W(C_z ∪ P_x) ≥ W_min : Record C_z ∪ P_x as a secp, and search for the next secp with C_z and FEMP = FEMP − {P_x}. Remove P_i in NAFEMP if P_x ⊂ P_i and set ELGEMP = ELGEMP ∪ P_x.

Case 2. W(C_z ∪ P_x) < W_min : Update C_z = C_z ∪ P_x and FEMP = FEMP − {P_x}, and apply M1-M3, given in subsection 3.3, to FEMP for efficient searching of secp.
Case 3. There is no choice : Retreat to the step where the last ep was added to generate C_z, at which time C_z = C'_z ∪ (last ep) for some C'_z. Remove P_i in NAFEMP if P_x ⊂ P_i and set ELGEMP = ELGEMP ∪ P_x.

Remark. At the end of Cases 1-3, decide P_x from the remaining ep's in AFEMP and compute W(C_z ∪ P_x) for searching another new secp. According to the computation, select Case 1 or 2. A compact sketch of this case dispatch follows.
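The skeletal rendering below reuses union and is_subset from the sketch in Sec. 3.1; the bookkeeping is simplified (the M1-M3 reductions of Sec. 3.3 and the backtracking stack are omitted), so it illustrates the control flow only, not the paper's full procedure.

```python
def step(Cz, AFEMP, NAFEMP, ELGEMP, secps, W, W_min):
    """One search step; W(ecp) must return the maximum capacity flow."""
    if not AFEMP:                                    # Case 3: no choice, retreat
        return "backtrack"
    Px = max(AFEMP, key=lambda p: W(union(Cz, p)))   # maximal increase of MCF
    AFEMP.remove(Px)
    if W(union(Cz, Px)) >= W_min:                    # Case 1: record a secp
        secps.append(union(Cz, Px))
        NAFEMP[:] = [p for p in NAFEMP if not is_subset(Px, p)]
        ELGEMP.append(Px)
        return Cz                                    # keep searching from Cz
    return union(Cz, Px)                             # Case 2: grow the ecp
```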
3.3. Some methods raising computational efficiency

This subsection suggests some methods which raise the computational efficiency by removing possible redundancy, given in the following.

M1. Among the ep in FEMP, the ep which are equal are removed except one.
M2. The proper subsets among the ep in FEMP are candidates to be added to the current ecp, C_z.
M3. Let P_x and P_i be in FEMP and ELGEMP, respectively. Remove P_x satisfying P_i − C_z ⊂ P_x − C_z.

In our algorithm, M1 and M2 reduce the number of remaining ep in FEMP and of candidates to add to the current ecp, C_z, respectively. Through M3, we can prevent the generation of a new secp, C_z ∪ (ep in FEMP), containing an already obtained secp before making a new secp.

4. An example
We consider the multi-state flow network with undirected links shown in Figure 1. All links have three states, and their capacities and state probabilities are given in Figure 1. Let W_min = 8. In this network, we have 4 minimal paths: (A,B), (A,E,D), (C,D) and (C,E,B). As the minimal paths are expanded, the numbers of ep corresponding to (A,B), (A,E,D), (C,D) and (C,E,B) are 4, 8, 4 and 8, respectively. We present one part of the whole procedure. Let the current ecp, C_z, be (A1,B2), and let FEMP corresponding to this ecp consist of the remaining emp's, among them (C1,E2), (C2,E1) and (C2,E2), with ELGEMP = ∅. Since W((A1,B2)) < 8, Case 2 is considered.
Figure 1. Bridge Network (source s, terminal t; each of links A-E has three states, with the link capacities and state probabilities listed per link)
Case 2. We update the current ecp, C_z, with (C1,D1), which gives the maximal increase of MCF among all ep in AFEMP. Then, C_z = (A1,B2,C1,D1) (= (A1,B2) ∪ (C1,D1)) and W(C_z) is 7. Also, FEMP = FEMP − {(C1,D1)}. Checking M1 for equal ep in FEMP, one (E1) and one (E2) are removed from AFEMP and NAFEMP, respectively. Also, updating FEMP by the use of M2 and M3, AFEMP = {(A2), (E1), (D2), (C2)} and NAFEMP = NAFEMP − {(D2), (C2)}.

Case 1. The MCF of the union of C_z and the ep (A2) is larger than W_min. Thus, record the union, (A2,B2,C1,D1), as a secp. Delete the ep which contain (A2) from NAFEMP and update ELGEMP = ELGEMP ∪ {(A2)} = {(A2)}.

Case 3. Search for the next secp with the current C_z and the ep in AFEMP. As the MCF of the union of the current ecp and any one ep in AFEMP is 7, which is less than W_min, we update the current ecp with (E1) to find the next secp. Then, the current ecp and FEMP are (A1,B2,C1,D1,E1) and FEMP − {(E1)}, respectively. Applying M1-M3 to FEMP, we obtain AFEMP = {(C2), (E2), (D2)} and NAFEMP = {(C2,D2), (E2,D2), (C2,E2)}.

Case 1. The MCF of the union of C_z and the ep (C2) is larger than W_min. Thus, record the union, (A1,B2,C2,D1,E1), as a secp. Delete the ep which contain (C2) from NAFEMP and update ELGEMP = ELGEMP ∪ {(C2)} = {(A2), (C2)}.
We omit the remaining procedure. In the following steps we obtain two more secp, (A1,B2,C2,D2) and (A2,B2,D1,E1). All 4 secp are also minimal secp. By using the reliability evaluation method of [1] with all minimal secp, the network reliability, R, is obtained in terms of the probabilities p_lj, where p_lj means P{state of link l ≥ state j of link l} for l = A, B, C, D, E and j = 1, 2. Then, the reliability is 0.46647 according to the probabilities in Figure 1.
5. Conclusion

This paper proposes a method to evaluate the reliability of an undirected flow network with multiple-capacity link-states. The proposed method is based on the expanded minimal paths which are generated from the given minimal paths of the network. Throughout the proposed method, efficient reliability evaluation is possible because redundancy in the procedure of obtaining expanded composite paths can be reduced.

References
1. T. Aven, Reliability evaluation of multistate systems with multimode components, IEEE Transactions on Reliability, 34, 473-479 (1985).
2. C. C. Jane, J. S. Lin and J. Yuan, Reliability evaluation of a limited-flow network in terms of minimal cuts, IEEE Transactions on Reliability, 42, 354-361 (1993).
3. J. C. Hudson and K. C. Kapur, Reliability analysis for multistate systems with multistate components, IIE Transactions, 15, 127-135 (1983).
4. J. C. Hudson and K. C. Kapur, Reliability bounds for multistate systems with multistate components, Operations Research, 33, 153-160 (1985).
5. S. M. Lee and D. H. Park, An efficient method for evaluation of network reliability with variable link-capacities, IEEE Transactions on Reliability, 50, 374-379 (2001).
6. J. S. Lin, C. C. Jane and J. Yuan, On reliability evaluation of a capacitated-flow network in terms of minimal pathsets, Networks, 25, 131-138 (1995).
7. Y. K. Lin, A simple algorithm for reliability evaluation of a stochastic-flow network with node failure, Computers & Operations Research, 28, 1277-1285 (2001).
8. Y. K. Lin, Using minimal cuts to evaluate the system reliability of a stochastic-flow network with failures at nodes and arcs, Reliability Engineering & System Safety, 75, 41-46 (2002).
9. R. Schanzer, Comment on: Reliability modeling and performance of variable link-capacity networks, IEEE Transactions on Reliability, 44, 620-621 (1995).
10. P. K. Varshney, A. R. Joshi and P. L. Chang, Reliability modeling and performance evaluation of variable link capacity networks, IEEE Transactions on Reliability, 43, 378-382 (1994).
11. W. C. Yeh, A simple algorithm to search for all d-MCs of a limited-flow network, Reliability Engineering & System Safety, 71, 15-19 (2001).
12. W. C. Yeh, A simple algorithm to search for all d-MPs with unreliable nodes, Reliability Engineering & System Safety, 73, 49-54 (2001).
13. J. Xue, On multistate system analysis, IEEE Transactions on Reliability, 34, 329-337 (1985).
A RANDOM SHOCK MODEL FOR A CONTINUOUSLY DETERIORATING SYSTEM
KYUNG EUN LIM, JEE SEON BAEK AND EUI YONG LEE
Department of Statistics, Sookmyung Women's University,
Seoul, 140-742, Korea
E-mail: [email protected]
A random shock model for a system whose state deteriorates continuously is introduced. It is assumed that the state of the system is modeled by a Brownian motion with negative drift and is also subject to random shocks. A repairman arrives according to a Poisson process and repairs the system if the state has been below a threshold since the last repair. An explicit expression is deduced for the stationary distribution of the state of the system. An optimization is also studied.

Keywords: Brownian motion, random shock, first passage time, stationary distribution.
1. Introduction
We consider a random shock model for a system whose state deteriorates continuously. It is assumed that the state of the system is initially β > 0 and, thereafter, follows a Brownian motion with drift μ < 0, variance σ² > 0 and a reflecting barrier at β. β is assumed to be the perfect state of the system. It is also assumed that shocks arrive at the system according to a Poisson process of rate ν > 0. Each shock instantaneously decreases the state of the system by a random amount Y, where Y is a non-negative random variable with distribution function G. It is further assumed that the system is checked by a repairman who arrives at the system according to another Poisson process of rate λ > 0. If the state of the system has been below a threshold α (0 ≤ α ≤ β) since the last repair, he instantaneously increases the state of the system up to β; otherwise, he does nothing. We, in this paper, obtain the stationary distribution of the state of the system by establishing the Kolmogorov forward differential equations and by making use of a renewal argument. A diffusion model for a system subject to continuous wear was introduced by Baxter and Lee (1987, 1988). They obtain the distribution of the state of the system and study an optimal control of the system. Lee and Lee (1993, 1994) extend the earlier analysis to a system whose state decreases by random shocks. The present model is for a system subject to both continuous wear and random shocks. Let {X(t), t ≥ 0} be the state of the system at time t in our model. To obtain the stationary distribution of {X(t), t ≥ 0}, we divide the process {X(t), t ≥ 0} into the following two processes: Process {X1(t), t ≥ 0} is formed by separating from
the original process the periods in which the state of the system moves from β to α and by connecting them together. Process {X2(t), t ≥ 0} is formed by connecting the rest of the original process together. In section 2, we derive the stationary distribution of {X1(t), t ≥ 0}, in section 3, the stationary distribution of {X2(t), t ≥ 0}, and finally, in section 4, the stationary distribution of {X(t), t ≥ 0} by combining the results obtained in sections 2 and 3. In section 5, after assigning several costs to the system, we show that there exists a unique λ which minimizes the long-run average cost per unit time.

2. Stationary distribution of X1(t)
Let F1(x,t) = P{X1(t) ≤ x} denote the distribution of X1(t). Note that {X1(t), t ≥ 0} is a regenerative process. Let T1(x0,α) = inf{t > 0 | X1(t) ≤ α} be the first passage time to a state less than or equal to α with X1(0) = x0 (α ≤ x0 ≤ β), and define

w(x, x0) = E[∫ from 0 to T1(x0,α) of h(X1(t)) dt], with h(u) = 1 if u ≤ x, and 0 otherwise.

Then, w(x, x0) is the expected period during which X1(t) is less than or equal to x during T1(x0,α). Since {X1(t), t ≥ 0} is a regenerative process, the stationary distribution of X1(t) is given by

F1(x) = w(x, β) / E[T1(β, α)].
We, for the convenience of calculation, consider X1(t) − α instead of X1(t), since w(x, x0) = w(x − α, x0 − α) and E[T1(x0, α)] = E[T1(x0 − α, 0)]. With this consideration, we obtain the formulas of w(x, x0) for 0 ≤ x, x0 ≤ β − α and of E[T1(x0, 0)] for 0 ≤ x0 ≤ β − α. We first derive the formula of E[T1(x0, 0)]. Notice that until the state of the system reaches 0, X1(t) can be expressed as
X1(t) = Z(t) − S(t),

where {Z(t), t ≥ 0} is a Brownian motion starting at x0 with parameters μ < 0 and σ² > 0, and {S(t), t ≥ 0}, with S(t) = Σ of Y_i for i = 1 to N(t), is a compound Poisson process with {N(t), t ≥ 0} being a Poisson process of rate ν and the Y_i's being i.i.d. random variables having the distribution function G. Again, for the convenience of calculation, let {Z'(t), t ≥ 0} be a Brownian motion starting at 0 with drift −μ > 0 and variance σ² > 0, and define a new process

X1'(t) = Z'(t) + S(t).
Then, by symmetry, it can be easily seen that the first passage time of X1(t) to state 0 is equal in distribution to that of X1'(t) to state x0, say T1'(0,x0):

T1'(0,x0) = inf{t : X1'(t) ≥ x0}, if X1'(t) ≥ x0 for some t ≥ 0; and T1'(0,x0) = ∞, if X1'(t) < x0 for all t ≥ 0.

Since T1'(0,x0) is a Markov time, an argument similar to that of Karlin and Taylor (1975, pp. 361-362) shows that the Laplace transform of T1'(0,x0) is given by

E[exp(−η T1'(0,x0))] = exp(−u x0),     (1)

where u is related to η by the equation η = −uμ + σ²u²/2 − ν(1 − m_G(u)), with m_G(u) = E[exp(uY)] the moment generating function of Y. By differentiation, we can show that E[T1(x0, 0)] = x0/(νm − μ), where m = E(Y).
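The mean first passage time can be checked numerically. The sketch below simulates X1(t) = Z(t) − S(t) on a small time grid; the parameter values and the exponential shock distribution are our illustrative assumptions, and time discretization and jump overshoot make the simulated mean only approximately equal to the formula.

```python
import math, random

mu, sigma = -1.5, 2.0      # drift (mu < 0) and diffusion scale of Z(t)
nu, m = 0.8, 1.0           # shock rate and mean shock size E[Y], Y ~ Exp(1/m)
x0, dt, runs = 5.0, 1e-3, 1000

def first_passage_time():
    """Run X1(t) = Z(t) - S(t) from x0 until it first drops to 0."""
    x, t = x0, 0.0
    while x > 0:
        x += mu * dt + sigma * math.sqrt(dt) * random.gauss(0, 1)
        if random.random() < nu * dt:           # a shock arrives in (t, t + dt]
            x -= random.expovariate(1 / m)      # shock size with mean m
        t += dt
    return t

est = sum(first_passage_time() for _ in range(runs)) / runs
print("simulated E[T1]:", round(est, 3))
print("x0 / (nu*m - mu):", round(x0 / (nu * m - mu), 3))   # = 5 / 2.3
```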
Now, we derive w(x, x0) by establishing a backward differential equation. Suppose that X1(0) = x0, 0 ≤ x0 ≤ β − α. Conditioning on whether a shock occurs or not during [0, Δt] gives that

w(x, x0) = E[∫ from 0 to Δt of h(Z(t)) dt + w(x, x0 + Δ)], if no shock occurs; E[∫ from 0 to Δt of h(Z(t)) dt + w(x, x0 + Δ − Y)], if a shock occurs and x0 + Δ − Y > 0; and O(Δt), if a shock occurs and x0 + Δ − Y ≤ 0,

where Y is the amount of a shock and Δ = Z(Δt) − Z(0). Hence, we have, for 0 ≤ x0 ≤ β − α,

w(x, x0) = (1 − νΔt) E[∫ from 0 to Δt of h(Z(t)) dt + w(x, x0 + Δ)] + νΔt E[w(x, x0 + Δ − Y); x0 + Δ − Y > 0] + νΔt O(Δt) + o(Δt).

Taking the Taylor series expansion of w(x, x0 + Δ) with respect to Δ, rearranging the equation and letting Δt → 0 yield the backward differential equation (2).
Then, w(x, x0) satisfies the following renewal type equation:

Lemma 2.1. w(x, x0) satisfies a renewal type equation in the kernel K, with boundary conditions w(x, 0) = 0 and ∂w(x, x0)/∂x0 = 0 at x0 = β − α, where H(x0) = ∫ from 0 to x0 of h(t) dt, K(x0) = ∫ from 0 to x0 of (2ν/σ²) G_e(t) dt, and ρ = νm with G_e being the equilibrium distribution of G.

Proof. Integrating both sides of equation (2) with respect to x0 with boundary condition w(x, 0) = 0, we have equation (4). If we integrate equation (4) again with respect to x0, then we obtain the given renewal type equation. ∎

It is well known [see, for example, Asmussen (1987, p. 113)] that the unique solution of the renewal type equation in Lemma 2.1 is given by equation (5), where M(x0) = Σ over n ≥ 0 of K^(n)(x0). Here, K^(n) denotes the n-fold Stieltjes convolution of K, with K^(0) being the Heaviside function. To get ∂w(x, x0)/∂x0 at x0 = 0, we differentiate equation (5) with respect to x0 and put x0 = β − α with boundary condition ∂w(x, x0)/∂x0 = 0 at x0 = β − α; then

∂w(x, x0)/∂x0 at x0 = 0 equals 2[∫ from 0 to β−α of M(β − α − t) H(t) dt + H(β − α)] / (σ² M(β − α)).
3. Stationary distribution of X2(t)
Note that in our model the state of the system can cross α down either through a continuous path or by a shock. Hence, we first obtain the distribution of L(x0, α) = α − X1(T1(x0,α)), given that X1(0) = x0, α ≤ x0 ≤ β. We, for the convenience of calculation, consider X1(t) − α instead of X1(t), since L(x0 − α, 0) = L(x0, α). With this consideration, we obtain the formula for the distribution of L(x0, 0) for 0 ≤ x0 ≤ β − α. Let P_l(x0, 0) = Pr{L(x0, 0) > l}, l ≥ 0. Then, P_l(x0, 0) satisfies the following renewal type equation:

Lemma 3.1. P_l(x0, 0) satisfies a renewal type equation, with boundary conditions P_l(0, 0) = 0 and ∂P_l(x0, 0)/∂x0 = 0 at x0 = β − α, where G_l(x) = ρ[G_e(x + l) − G_e(l)].

Proof. Conditioning on whether a shock occurs or not during the time interval [0, Δt] gives that
P_l(x0, 0) = E[P_l(x0 + Δ, 0)], if no shock occurs; E[P_l(x0 + Δ − Y, 0)], if a shock occurs and x0 + Δ − Y > 0; and Pr(x0 + Δ − Y ≤ −l), if a shock occurs and x0 + Δ − Y ≤ 0.

Hence, we have, for 0 ≤ x0 ≤ β − α,

P_l(x0, 0) = (1 − νΔt) E[P_l(x0 + Δ, 0)] + νΔt Pr(x0 + Δ − Y ≤ −l) + νΔt E[P_l(x0 + Δ − Y, 0) | x0 + Δ − Y > 0] Pr{x0 + Δ − Y > 0} + o(Δt).
20
S ( ~-O Y,O)dG(y).
fv
Integrating the above equation twice with respect to identity on the way: Y
l z o ( l - G(t
+I))&
=v
20,while
using the following
lzo+l(l -
G(Y))dY
= p[Ge(zo +I) - Ge(l)]= Gi(zo),
we can derive the given renewal type equation for Pl(xo,0). W The renewal type equation in Lemma 3.1 has the unique solution as follows:
Differentiating the above equation with respect t o xu and using the boundary condition &Pl(xo, O)lzo=fi-a = 0, we have
M’(P - a
3 8x0
- ~ ~ ( ~ o , ~ ) l z o = o=
-
+
t ) G ~ ( t ) d t Gl(/3 - a ) ]
M(B - a)
Now, let F2(x,t)= P ( X , ( t ) 5 x} denote the distribution function of X 2 ( t ) . Notice that until the repairman arrives, X,(t) = Z ( t )- S ( t ) . Hence, we can deduce an expression for F2(2, t ) when -m < x 5 ,L? by an renewal argument. Conditioning on whether a repair during (0, t] gives that t
F2(x,t)= E [ V ( x , t ) P r { E X> t } + X i
s:-z
+
V(x,t-u)Pr{EX > t - u } d u ] ,
where V ( x , t ) = B ( x v,t)dC(y,t), B ( z , t ) = P r { Z ( t ) 5 x}, C ( x , t ) = P r { S ( t ) 5 x}, and the renewal function of the exponential distribution with rate X > 0 is At. The distribution of the compound Poisson process is well known [see, Tijms(1986, pp.37-38)]. Moreover, an argument similar to that of Cox and Miller
314 (1965, pp.223-224) shows that
+
(20
-pt
- p ) e z ~9 { 1 - 2pt so + p t + p
42opt - ( Z 4exP{ -
20
- pt)2
2u2t
where @(z)is the standard normal integral. Therefore, by making use of the key renewal theorem, the stationary distribution of X 2 ( t ) is given by F2(z) = X
4.
Lrn
V ( x ,u)e-’”du.
A formula for F(x)
We know that the points where the actual repair occurs form an embedded renewal process in {X(t), t ≥ 0}. Let T* be the generic random variable denoting the time between successive renewals. Then

T* = T1(β, α) + E_λ,

where E_λ is an exponential random variable with rate λ.
Proposition 4.1. F(x) is given by the following weighted average of F1(x) and F2(x):

F(x) = [λ(β − α) / (λ(β − α) + νm − μ)] F1(x) + [(νm − μ) / (λ(β − α) + νm − μ)] F2(x).
Proof. Suppose that we earn a reward at a rate of one per unit time while each of the processes {X(t), t ≥ 0}, {X1(t), t ≥ 0} and {X2(t), t ≥ 0} is less than or equal to x ≤ β. We see by the renewal reward theorem [Ross (1996, p. 133)] that

F(x) = E(reward during a cycle T*) / E(T*)
= E(reward during a cycle T1(β,α)) / E(T*) + E(reward during a cycle E_λ) / E(T*)
= (E[T1(β,α)] / E(T*)) (E(reward during T1(β,α)) / E[T1(β,α)]) + (E(E_λ) / E(T*)) (E(reward during E_λ) / E(E_λ))
= [λ(β − α) / (λ(β − α) + νm − μ)] F1(x) + [(νm − μ) / (λ(β − α) + νm − μ)] F2(x). ∎
5. Optimization
Let c1 denote the cost per visit of the repairman, let c2 denote the cost to increase the state of the system by a unit amount, and let c3 denote the cost per unit time of the system being in states below the threshold α. We calculate C(λ), the expected long-run average cost per unit time, for a given arrival rate λ and a given threshold α. To do this, we define a cycle, denoted by T*, as the interval between two successive repairs. Then, by the renewal reward theorem [see Ross (1996, p. 133)], C(λ) is given by

C(λ) = E[total cost during a cycle] / E[length of a cycle] = (E[N] c1 + E[X'(T*)] c2 + (1/λ) c3) / E[T*],
where E(N) is the expected number of visits of the repairman during a cycle and E[X'(T*)] is the expected amount of repair. Note that T* can be expressed as T* = T1(β, α) + E_λ, where T1(β, α) is the first passage time from β to α and E_λ is an exponential random variable with rate λ. Then, we can show that

E[T*] = (β − α)/(νm − μ) + 1/λ = [λ(β − α) + νm − μ] / (λ(νm − μ))

and

E(N) = λ E[T*] = [λ(β − α) + νm − μ] / (νm − μ).
E[X’(T*)] can be calculated by using the argument in section 2 as follows:
E[X’(T*)= ] E [ Z ’ ( T *+ ) S(T*)]
+
= E [ E [ Z ’ ( T * ) S(T*)IT*= t ] ] = (vm - p ) E ( T * )=
+
X(D - a ) vm - /I X
Therefore,

C(λ) = λ c1 + (νm − μ) c2 + [(νm − μ) / (λ(β − α) + νm − μ)] c3.
Differentiating the above equation with respect to λ gives

∂C(λ)/∂λ = {[λ(β − α) + νm − μ]² c1 + (α − β)(νm − μ) c3} / [λ(β − α) + νm − μ]²,

where we denote the numerator by A(λ) = [λ(β − α) + νm − μ]² c1 + (α − β)(νm − μ) c3.
Lemma 5.1. If c1 ≥ [(β − α)/(νm − μ)] c3, then C(λ) achieves its minimum value (νm − μ) c2 + c3 at λ = 0; otherwise there exists a unique λ* (0 < λ* < ∞) which minimizes C(λ).
Proof. Suppose that c1 ≥ [(β − α)/(νm − μ)] c3; then the limit of A(λ) as λ → 0 is non-negative. Further,

A'(λ) = 2(β − α)[λ(β − α) + νm − μ] c1 ≥ 0.

Hence, A(λ) ≥ 0 for all λ ≥ 0 and C(λ) is minimized at λ = 0. Suppose, now, that c1 < [(β − α)/(νm − μ)] c3; then the limit of A(λ) as λ → 0 is negative. Since A(λ) is an increasing function with A(λ) → ∞ as λ → ∞, there exists a unique λ* (0 < λ* < ∞) such that A(λ*) = 0. ∎

Figure 1 illustrates an example of C(λ) when c1 < [(β − α)/(νm − μ)] c3.
Figure 1. C(λ) (c1 = 2, c2 = 1, c3 = 3, ν = 0.8, μ = −1.5, σ = 2, β = 10, α = 3)
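For the parameters of Figure 1, λ* can be computed in closed form from A(λ*) = 0. The sketch below additionally assumes a mean shock size m = E(Y) = 1, which is not listed in the caption.

```python
import math

c1, c2, c3 = 2.0, 1.0, 3.0
nu, mu, beta, alpha, m = 0.8, -1.5, 10.0, 3.0, 1.0
d = nu * m - mu                                  # nu*m - mu = 2.3

def C(lam):
    """C(lambda) = lambda c1 + d c2 + d c3 / (lambda (beta - alpha) + d)."""
    return lam * c1 + d * c2 + d * c3 / (lam * (beta - alpha) + d)

# A(lambda*) = 0  <=>  [lambda*(beta - alpha) + d]^2 = (beta - alpha) d c3 / c1
lam_star = (math.sqrt((beta - alpha) * d * c3 / c1) - d) / (beta - alpha)
print("lambda* =", round(lam_star, 4))           # interior optimum exists since
print("C(lambda*) =", round(C(lam_star), 4))     # c1 < (beta - alpha) c3 / d
print("C(0) =", round(C(0.0), 4))                # boundary value d*c2 + c3
```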
References
1. S. Asmussen, Applied Probability and Queues, Wiley (1987).
2. D. R. Cox and H. D. Miller, The Theory of Stochastic Processes, London: Methuen (1965).
3. L. A. Baxter and E. Y. Lee, A Diffusion Model for a System Subject to Continuous Wear, Prob. Eng. Inf. Sci., 1, 405-416 (1987).
4. L. A. Baxter and E. Y. Lee, Optimal Control of a Model for a System Subject to Continuous Wear, Prob. Eng. Inf. Sci., 2, 321-328 (1988).
5. S. Karlin and H. M. Taylor, A First Course in Stochastic Processes, 2nd ed., Academic Press (1975).
6. E. Y. Lee and J. Lee, A Model for a System Subject to Random Shocks, J. Appl. Prob., 30, 979-984 (1993).
7. E. Y. Lee and J. Lee, Optimal Control of a Model for a System Subject to Random Shocks, Oper. Res. Lett., 15, 237-239 (1994).
8. S. M. Ross, Stochastic Processes, 2nd ed., Wiley (1996).
9. H. C. Tijms, Stochastic Modelling and Analysis, Wiley (1986).
IMPROVEMENT IN BIAS AND MSE OF WEIBULL PARAMETER ESTIMATORS FROM RIGHT-CENSORED LARGE SAMPLES BY USING TWO KINDS OF QUANTITIES
CHENGGANG LIU
Department of Industrial and Systems Engineering, College of Science and Engineering, AOYAMA GAKUIN University,
Futinobe 5-10-1, Sagamihara, Kanagawa 229-8558, JAPAN
E-Mail: [email protected]
SHUN-ICHI ABE
Professor Emeritus, AOYAMA GAKUIN University

The purpose of the paper is to search for the most desirable estimators of the Weibull parameters, which possibly have smaller |BIAS| and MSE than the conventional ones. We introduce, just as in our recent paper [1], two kinds of quantities: (i) errors on both axes, and (ii) certain predicted values of unknown data in right-censored large samples. Using these and performing Monte Carlo simulation, we can propose certain estimators for the shape parameter, perhaps achieving our primary purpose adequately.
1. Introduction and Summary
Let the failure time distribution be the Weibull:

F(t) = 1 − exp[−(t/η)^ξ], t ≥ 0,     (1)

where ξ and η are unknown shape and scale parameters, respectively, of the function. Assume that only the smallest r data

T1 < T2 < ... < Tr     (2)

have been observed, while the (n − r) ones

Tj (> Tr; j = r+1, r+2, ..., n)     (3)
among the sample of size n have not yet been obtained. Moreover, putting

X = ln T, Xi = ln Ti (i = 1, 2, ..., n),     (4)

X distributes obviously as

G(x) = 1 − exp[−exp{ξ(x − θ)}] (θ = ln η),     (5)
to generate the smallest r data

X1 < X2 < ... < Xr     (6)

and the remaining (n − r) ones

Xj (> Xr; j = r+1, r+2, ..., n).     (7)
As for the parameter estimation problems from the right-censored large samples given above, we should remark the following three points: (i) the LSE in Section 2.2 and the MLE in Section 2.3 are obviously applicable to our data; however, in cases of large n and smaller r, |BIAS| and MSE of the estimators may frequently become very large; (ii) the BLUE for 1/ξ and the 2-point estimator discussed in Refs. [1, 3] for small samples cannot be applied now, since for larger samples the numerical tables needed to construct the optimal estimators are not yet known; (iii) the complete sample methods, MME in Section 2.5 as well as MPM in Section 2.4, can approximately be utilized by supplying the certain predicted values given in Section 2.1 for the unknown data. Finally, we may conclude that the modified parametric moment estimator (MPM) proposed by us in Ref. [3] is almost unbiased and most desirable for a wide range of large samples with size n ≥ 30 and r/(n + 50) ≥ 0.25. Moreover, the LSE's defined in Section 2.2 may be better for samples with n ≥ 30 and r/(n + 50) < 0.25 because of the small MSE and the simplicity of the methods, although the biases are not necessarily small, as seen in Table 1. We will show which of the estimators and the data-types among the several ones is more desirable than the others through the discussions in the sequel.
2. Various Estimators for Weibull Parameters
Let us estimate the shape parameter ξ of the Weibull distribution by using Monte Carlo simulation under the setting ξ = 1.0, η = 10.0. The simulation is performed through N = 20000 samples iteratively for each of the sample sizes n = 30, 50, 100; r = 5, 10, ..., n. We compare the simulation results of |BIAS| and MSE with each other among (i) the known conventional methods of analyzing the right-censored data given in Formula (2) or (6), and (ii) the proposed methods of treating the hypothetical complete samples:

T1 < T2 < ... < Tr < T^_{r+1} < ... < T^_n, or     (8)
X1 < X2 < ... < Xr < X^_{r+1} < ... < X^_n,     (9)

where T^_j = exp(X^_j) and X^_j (j = r+1, r+2, ..., n) are the predicted values of the unknown data Tj or Xj (j = r+1, r+2, ..., n) given in Section 2.1. We will perform lots of computations to estimate the parameter ξ. The whole methods applied in the paper are as follows:
2.1. Three Types of Predictions {X^_j} for Unknown Data {X_j}

Define y_i = ln[−ln{1 − i/(n + 1)}] (i = 1, 2, ..., n) and put

X^_j = (y_j − y_r)(X_r − X̄_r)/(y_r − ȳ_r) + X_r,     (10)
X^_j = (y_j − y_r)/ξ^_0 + X_r,     (11)
X^_j = (1/ξ^_0) ln[−ln{((n − j + 1)/(n − r + 1)) R̄(X_r; ξ^_0, θ^_0)}] + ln η^_0, (R̄ = 1 − G),     (12)

for j = r+1, r+2, ..., n, where ξ^_0 and θ^_0 are defined by Eq. (13), and ȳ_r and X̄_r by Eq. (17).

2.2. Least Square Estimators (LSE) and Extensions

Each of the LSE for ξ and η is defined by the respective formula: ξ^_0 = S_xy,r / S_xx,r, with η^_0 = exp(X̄_r − ȳ_r/ξ^_0); ξ^_1, a corrected version of ξ^_0, with η^_1 = exp(X̄_r − ȳ_r/ξ^_1); ξ^_2 = S_xy,n / S_xx,n, computed from the completed sample (9), with η^_2 = exp(X̄_n − ȳ_n/ξ^_2); and ξ^_3, the corresponding corrected version, with η^_3 = exp(X̄_n − ȳ_n/ξ^_3) (Eqs. (13)-(18)).
;
,
2.3. Maximum Likelihood Estimators(MLE) and Extensions The well known conventional MLE are given by the solutions of the equations
where let us put
InT,= f x:=lI n z , (&,
60) = the solution of the likelihood Eq. (19)and Eq. (20).
(21) (22)
Presuming that the quantities in Formula (9) were ordered from n independent random observations, let us apply the usual conventional likelihood equations for
320 complete samples approximately:
6=
$[c;=, T,F + c,”=,+, T, -i ] lli .
Let us denote as ((1
,GI)= the solution of the simultaneous Eq. (23) and (24) .
Furthermore, we improve
[1
and
61,
respectively, as
= & / ( l+ 1.65/n), r:
62 =
>I
cy=,+ Tj1 €2 lli2
’
2.4. Estimators by Modified Parametric Moments(MPM) Let us solve the following Eq. (28) introduced in Ref. [4] to get
[f :
# ‘=, lnTaOEa - 1InTPoEf = 0.689846 an
1.46,
(a0 =
PO -
Now we improve the solution
[r
TaE =
= 0.08,
$[c;==, T,”‘ + C,”=,+, T’
-a5
I).
[f of by
= [f(1- 1.398/n),
[,# = [ f ( l
-
1.398/n).
The quantity [,# is MPM of [ introduced and defined in Ref. [3] by us. The or [f. estimator of r j is obtained from Eq. (27) by replacing (2 with
[f
2.5. Estimators by Modified Method of Moments(MME) The mean E ( T ) and the variance V ( T )of the Weibull distribution of Eq. (1) are known t o be, respectively,
E ( T ) = Vr(i
+ I/[),
+
+
V ( T ) = rj2{r(i 2/[) - r2(i I/[)}.
Presuming n data of Formula (9) to be a complete sample, its mean unbiased variance V,, are evaluated, approximately, by
(30)
Tn, and
MME [+ of unknown [ is given by the solution of the following equation:
Jr(i
+
+
+a/[+) - r2(i i/[+)/r(1 I/[+)
MME q+ of the unknown
rj
=
G/TnT.
(32)
is estimated by
rj+ =
Tn r/ r(i+ I/[+).
(33)
321 3. Monte Car10 Simulation and Discussions 3.1. Simulation Models and Data Types
In the Weibull distribution given by Eq. ( l ) ,setting [ = 1.0,q = 10.0, we generate the smallest T random numbers in Formula (2) or (6). Predicting unknown data Xj bywayofzjinEq. (lO),or(11),or(12)forj=r+1,r+2,.~.,n,letusdefine the data-types D as follows D = 0 : the smallest r data of Formula (2) or (6)-without predicted ones; D = 1 , 3 , 5 : data of Formula (8) or (9) with { X j } given by Eq. (lo), (ll), (la), respectively, for j = T 1,r 2 , . . . ,n. The LSE and MLE are calculated for each of D = 0 , 1 , 3 , 5 , while MPM and MME are obtained only for D = 1 , 3 , 5 except for the cases of n = T . The simulation is executed for N = 20000 samples iteratively t o compute EIAS and Z S E :
+
+
<
where $ ( i ) is the ith estimate of (i = 1 , 2 , . . . , N ) for each combination of methods and data-types mentioned above. Note that in the cases of n = r , only the datatype D = 0 can occur except for the case of MLE yielding &. At the other points, the simulation is done just as in Ref. [l]. 3.2. Simulation Results and Their Comparisons
Our simulation results are shown in Table 1, where the double underlines r e p resent that the methods and data-types are most desirable for the estimation of [, while estimation of q is omitted here. In Table 2, choosing the cases of (n,T ) = (30,20), (50,30), (100,60) with D = 1, we give the results of the statistical tests for the hypotheses Ho : M S E ( k )= M S E ( ' ) ,Hi : M S E ( k )# M S E ( ' ) , and for Ho : B I A S ( k )= B I A S ( ' ) ,H I : B I A S ( k )# BIAS('), t o compare results of two methods k and 1 to each other. The test statistics t k l have been constructed by the method in Appendix 2 of Ref. [2]. For almost all ( k , 1 ) in Table 2, the null hypotheses Ho are rejected with significance level 0.01, for such ( k , I ) that l t k l l > 2.576. 4. Conclusive Remarks
Let us remark the following five points : (i) Even if n is large, the conventional statistics & defined by Eq. (22), for example, is not good for small T . See & for the cases of ( n , r ) = (30,5), (50,5), (100,5) in Table 1. Therefore, we should note which of the methods and of the data-types are to be used in the parameter estimation. (ii) Define p = r / ( n 50) and we may conclude in Table 1 that
+
Table 1. Evaluation results of BIAS and MSE by Monte Carlo simulation (N = 20000, ξ = 1.0, η = 10.0). For each sample size n = 30, 50, 100, each censoring number r, and each data-type D = 0, 1, 3, 5, the table lists BIAS/ξ and MSE/ξ² of LSE(ξ^_K), LSE(ξ^_J), MLE(ξ^_J), MPM(ξ^_#) and MME(ξ+); single or double underlines mark the smallest MSE, i.e., the most desirable method and data-type, for each (n, r). Note: K = 1 (D = 0), K = 3 (D > 0); J = 0 (D = 0), J = 2 (D > 0).
Table 2. Statistical test statistics t_kl on the differences of BIAS/ξ and MSE/ξ² between methods k and l ((n, r) = (30, 20), (50, 30), (100, 60); D = 1).

(n, r) | quantity | (LSE_K, LSE_J) | (LSE_K, MLE) | (LSE_K, MPM) | (LSE_K, MME) | (LSE_J, MLE) | (LSE_J, MPM) | (LSE_J, MME) | (MLE, MPM) | (MLE, MME) | (MPM, MME)
(30, 20) | BIAS/ξ | 1.55E+02 | -1.41E+02 | -7.42E+01 | -1.97E+02 | -1.51E+02 | -9.13E+01 | -2.02E+02 | 6.20E+02 | -3.03E+02 | -4.42E+02
(30, 20) | MSE/ξ² | -2.43E+01 | -6.15E+00 | 6.94E+00 | -2.03E+01 | -1.60E+00 | 1.13E+01 | -1.65E+01 | 2.92E+01 | -3.83E+01 | -3.67E+01
(50, 30) | BIAS/ξ | 1.49E+02 | -1.43E+02 | -9.23E+01 | -2.02E+02 | -1.52E+02 | -1.06E+02 | -2.06E+02 | 7.32E+02 | -3.35E+02 | -4.68E+02
(50, 30) | MSE/ξ² | -2.80E+01 | -1.94E+00 | 7.34E+00 | -1.31E+01 | 2.91E+00 | 1.20E+01 | -9.10E+00 | 2.73E+01 | -2.96E+01 | -3.00E+01
(100, 60) | BIAS/ξ | 1.42E+02 | -1.32E+02 | -9.95E+01 | -1.82E+02 | -1.41E+02 | -1.11E+02 | -1.87E+02 | 8.52E+02 | -3.04E+02 | -3.92E+02
(100, 60) | MSE/ξ² | -3.42E+01 | 8.33E+00 | 1.44E+01 | -5.14E+00 | 1.35E+01 | 1.93E+01 | -4.50E-01 | 2.31E+01 | -2.71E+01 | -2.69E+01
LSE(ξ^_0) (D = 1 or 3) is preferable for p ≤ 0.15; LSE(ξ^_2) (D = 1 or 3) is most desirable for 0.15 < p < 0.25; MPM(ξ^_#) (D = 1) is most desirable for p ≥ 0.25; and MLE(ξ^_2) is most desirable for complete samples with n = r. (iii) Except for the cases of n = r, the data-type D = 0 is never recommended in Table 1. (iv) It is important to utilize the most desirable method and the most appropriate data-type, which correspond to the figures with (single or) double underlines of the smallest MSE in Table 1. For example, in the case of (n, r) = (50, 20), LSE(ξ^_2) (D = 1) is most desirable, realizing BIAS(ξ^_2) = -0.0385 and MSE(ξ^_2) = 0.0503. On the other hand, even if (n, r) = (100, 30), by using the non-desirable conventional estimator LSE(ξ^_2) (D = 0) we get BIAS(ξ^_2) = -0.0882 and MSE(ξ^_2) = 0.0555, which is inferior to the former case given above. Thus, the merit of the larger number of observations of (n, r) = (100, 30) is lost by the inappropriate choices of the method ξ^_2 and the data-type D = 0. In other words, by using our proposal in Table 1 one can perhaps reduce the "costs" of data-gathering very much. (v) For comparison of our results in Table 1 with other previous studies, numerical tables for BIAS and MSE of estimators of the shape parameter ξ from right-censored large samples have rarely been published in the literature. Ref. [6] has shown some graphical representations, which is inconvenient for numerical comparison.

References
1. C. Liu and S. Abe, Jl. Rel. Eng. Ass. Japan, Vol. 26, No. 1, 77-99 (2004).
2. C. Liu and S. Abe, Proc. ICRMS'2001, Vol. 1, 115-122 (2001).
3. C. Liu and S. Abe, in preparation (to be submitted to Jl. Rel. Eng. Ass. Japan).
4. S. Abe, Abstracts of 1991 Autumn Meeting of ORS Japan, 94-95 (1991).
5. A. Clifford Cohen, Truncated and Censored Samples, Marcel Dekker, Inc., 121-123 (1991).
6. T. Ichida and K. Suzuki, ISBN 4-8171-3103-9, Japan, 197-241 (1984).
SOFTWARE SYSTEM RELIABILITY DESIGN CONSIDERING HYBRID FAULT TOLERANT SOFTWARE ARCHITECTURES

DENVIT METHANAVYN
Faculty of Computer Engineering, King Mongkut's University of Technology Thonburi, Bangkok, 10140, Thailand

NARUEMON WATTANAPONGSAKORN
Faculty of Computer Engineering, King Mongkut's University of Technology Thonburi, Bangkok, 10140, Thailand
Fault-tolerant design is an effective approach to achieve high system reliability. This paper proposes six hierarchical fault-tolerant software reliability models with multi-level Recovery Block (RB) modules, multi-level N-Version Programming (NVP) modules, and combinations of RB and NVP modules, called hybrid fault-tolerant architectures. Their system reliabilities and costs are evaluated at various degrees of failure dependency and then compared with those of the classical RB and NVP models. Reliability results under the s-independent failure assumption are compared with those under the failure dependency assumption. The system cost of each model is evaluated as well.
1 Introduction
Many critical systems, such as power plant control, flight control, transportation and military systems, need high reliability. Fault-tolerant techniques are commonly applied to achieve the system design objectives of reliability and safety in operation. Several techniques have been proposed for structuring hardware and software systems and providing fault tolerance. Software fault tolerance usually requires design diversity and a decision algorithm. Therefore, software variants and an adjudicator are the main components of the redundant computations used to gain high reliability in a software system. The first software fault-tolerant technique is the Recovery Block (RB), followed by N-Version Programming (NVP). These classical techniques differ in how results are judged to become the final output. For RB, the adjudicator is called an acceptance tester, which acts as a computation module and checks the results of the software variants, so the tester needs a complex design and iterative work. For NVP, the adjudicator is called a voter, which compares the results of all software variants and chooses the majority result as output, so the voter can guarantee to pass the correct output using majority voting. The two classical models were combined to generate hybrid fault-tolerant techniques, such as the Consensus Recovery Block (CRB) and N-Self-Checking Programming (NSCP), to enhance the system reliability of the original RB and NVP. In previous work, J.B. Dugan, et al. [1] proposed a quantitative comparison of RB and NVP schemes in 1993 considering related faults, such as the probability of failure between two software variants and among all software variants. In 1996 Wu, et al. [2]
proposed hybrid software fault-tolerant models which nested RB with NVP and embedded RB within NVP. They provided a system reliability comparison for these architectures without considering related faults. In 1997 F.D. Giandomenico, et al. [3] evaluated schemes for handling dependability and efficiency of the Self-Configuring Optimal Programming (SCOP) scheme, which accepts consistent results, NVP with tie-break, and NVP schemes. In 2002, S.P. LeBlance and P.A. Roman [4] proposed a simple approach to estimate the reliability of a software system composed of a hierarchy of modules under the s-independent software failure assumption. There are many literatures on new fault-tolerant software architectures as well as on software system reliability optimization [5, 7, 8]. However, none of them provide reliability and cost evaluation of hierarchical or hybrid software systems considering failure dependencies or related faults in the software variants. In this work, we extend the work of Wu, et al. [2] by considering failure dependencies in software system reliability analysis using sum-of-disjoint products. We consider hierarchical fault-tolerant schemes of multi-level RBs, multi-level NVPs and hybrid RB-NVP. Their system reliabilities and costs are evaluated at various degrees of related failures and then compared with those of the traditional RB and NVP models. Assumptions and notations used throughout this paper are as follows.

Assumptions:
1. Each software variant has 2 states: functional or failed. There is no failure repair for each variant or the system.
2. The reliability of each software variant is known.
3. Related faults between software variant(s) and the adjudicator do not exist.

Notations:
P_V: probability of failure of each software variant; Q_V = 1 − P_V is the reliability of each software variant.
P_RV: probability of failure from a related fault between two software variants; Q_RV = 1 − P_RV.
P_RALL: probability of failure from a related fault among all software variants; Q_RALL = 1 − P_RALL.
P_D: probability of failure of an adjudicator (tester or voter); Q_D = 1 − P_D.
P_DEP(X): probability of failure of the system considering related faults; Q_DEP(X) = 1 − P_DEP(X).
2 Research Background

2.1 Fault Tolerant Techniques
Software fault tolerance usually requires design diversity. For design diversity, two or more software variants are designed to meet a common service specification and provided for redundant computations. The variants are aimed at delivering the same service, but implemented in different ways. Since at least two variants are involved, tolerance to design faults necessitates an adjudicator that determines a single error-free result based on the results produced by multiple variants. Several techniques have been proposed for structuring a software system and providing software fault tolerance. The classical techniques are the Recovery Block and N-Version Programming, which are discussed below, and the hybrid architecture techniques include N-Self-Checking Programming, Consensus Recovery Block and Acceptance Voting.

Recovery Block (RB) [5] is the first scheme developed for achieving software fault tolerance. Variants are organized in a manner similar to standby sparing used in hardware. RB performs run-time fault detection by augmenting any conventional hardware/software error detection mechanism with an acceptance test applied to the results of execution of one variant. If the test fails, an alternate variant is invoked after backward error recovery is performed. N-Version Programming (NVP) [6] directly applies hardware N-Modular Redundancy to software. N versions (variants) of a program are executed in parallel and their results compared by an adjudicator. By incorporating a majority vote, the system can eliminate erroneous results and pass on the presumed-correct results.

2.2 Reliability Analysis
In a software system, faults can be divided into two modes: one is the s-independent fault and the other is the related fault. These faults directly affect the system reliability. Related faults result from a design specification fault which is common to all software variants, or from dependencies in the separate designs and implementations. S-independent faults are simply those that are not related.

2.2.1 Considering S-Independent Faults
The probability of failure of the RB scheme (P_RB) with s-independent faults is as follows [7]:

P_RB = 1 − R_N; R_N = Σ from i=1 to N of [Π from j=1 to i−1 of P(X_j)] P(Y_i),     (1)
P(Y_i) = (1 − P_Vi)(1 − P_D), P(X_j) = P_Vj(1 − P_D) + (1 − P_Vj)P_D.

P(Y_i) is the probability that the tester accepts a correct result, while P(X_j) is the probability that the tester rejects a correct output or accepts an incorrect result. N is the number of software variants. The probability of failure of the NVP scheme (P_NVP) is as follows [2]:

P_NVP = Π from i=1 to N of P_Vi + Σ from i=1 to N of (1 − P_Vi) Π over j ≠ i of P_Vj.     (2)

The first term is the probability of failure in case all software variants produce incorrect results. The second term is for the case when only one variant produces correct output. Again, N is the number of software variants.
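Under the s-independence assumption, Eqs. (1) and (2) reduce to a few lines of code. The sketch below uses placeholder failure probabilities, not the paper's input datasets, assumes equal P_V across variants, and codes Eq. (2) with only the two stated terms (voter failure omitted).

```python
def p_rb(pv, pd, n):
    """Eq. (1): the RB fails unless some variant's result is accepted."""
    py = (1 - pv) * (1 - pd)                 # tester accepts a correct result
    px = pv * (1 - pd) + (1 - pv) * pd       # correct rejected / incorrect passed
    return 1 - sum(px ** (i - 1) * py for i in range(1, n + 1))

def p_nvp(pv, n):
    """Eq. (2): NVP fails if all variants fail or only one is correct."""
    return pv ** n + n * (1 - pv) * pv ** (n - 1)

print(round(p_rb(0.05, 0.01, 3), 6), round(p_nvp(0.05, 3), 6))
```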
2.2.2 Considering Related Faults
The probability of failure of the RB and NVP schemes considering related faults [1, 8] can be represented by sum-of-disjoint products [9]. The probability of failure of the RB scheme with two software variants is as follows:

P_DEP(RB2) = P_RALL + P_D Q_RALL + P_RV Q_RALL Q_D + P_V² Q_RALL Q_D Q_RV.     (3)

The probability of failure of the RB scheme with three software variants (Eq. (4)) is obtained analogously. The probability of failure of the NVP scheme with three software variants is as follows:

P_DEP(NVP3) = P_RALL + P_D Q_RALL + 3 P_RV Q_RALL Q_D + P_V³ Q_RALL Q_D Q_RV³ + 3 P_V² Q_V Q_RALL Q_D Q_RV³.     (5)
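The sum-of-disjoint-products expressions evaluate directly. The following sketch codes Eq. (3) and the NVP3 expression in the form reconstructed above (so the NVP3 function carries that caveat); the probability values are placeholders, not the paper's datasets.

```python
def p_rb2(pv, pd, prv, prall):
    """Eq. (3): RB with two variants under related faults."""
    qd, qrv, qrall = 1 - pd, 1 - prv, 1 - prall
    return prall + pd * qrall + prv * qrall * qd + pv ** 2 * qrall * qd * qrv

def p_nvp3(pv, pd, prv, prall):
    """Eq. (5) as reconstructed: NVP with three variants."""
    qd, qrv, qrall, qv = 1 - pd, 1 - prv, 1 - prall, 1 - pv
    ind_fail = pv ** 3 + 3 * pv ** 2 * qv        # no majority of correct results
    return (prall + pd * qrall + 3 * prv * qrall * qd
            + ind_fail * qrall * qd * qrv ** 3)

print(round(p_rb2(0.05, 0.01, 0.005, 0.001), 6))
print(round(p_nvp3(0.05, 0.01, 0.005, 0.001), 6))
```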
3 Reliability Models of Hierarchical Fault-Tolerant Software System
A hierarchical fault-tolerant software system consists of multiple levels of fault-tolerant modules. At the lower level, RB or NVP modules are used. Each output from the lower-level modules is sent to the upper-level module, which performs a similar process again and then releases the finalized output. The probability of failure of the hierarchical fault-tolerant system can be considered in two parts. The first part comes from the lower-level modules considering failure dependencies [1, 8]. The latter comes from the upper-level module, where failure dependencies across the lower-level fault-tolerant modules are assumed negligible. Hence, the s-independence assumption is applied at this upper level. The probability of failure of each lower-level module is applied as a software failure probability in the upper-level module.

3.1 Hierarchical Fault-Tolerant Model

3.1.1 RBiRBj
RBiRBj consists of i lower-level RB modules, each consisting of j software variants and a tester, and one upper-level RB module which uses the i outputs from the lower level to test for the final output. An example, RB2RB3, is shown in Figure 1.
Figure 1. RB2RB3

Figure 2. NVP3RB2
3.1.2 NVPiRBj
NVPiRBj consists of i lower-level RB modules, each consisting of j software variants and a tester, and one upper-level NVP module which uses the i outputs from the lower level to vote for the final output. An example, NVP3RB2, is shown in Figure 2.

3.1.3 RB2RB3NVP3
RB2RB3NVP3 consists of two lower-level modules, one RB (with three variants and a tester) and one NVP (with three variants and a voter), and one upper-level RB module which uses the two outputs from the lower level to test for the final output, as shown in Figure 3.

Figure 3. RB2RB3NVP3
3.1.4 NVPiNVPj
NVPiNVPj consists of i lower-level NVP modules, each having j software variants and a voter, and one upper-level NVP module which uses the i outputs from the lower level to vote for the final output, as shown in Figure 4.

Figure 4. NVP3NVP3
3.2 Proposed Reliability Analysis Models

The probability of failure of a hierarchical fault-tolerant system can be obtained by finding the probabilities of failure of the lower-level modules using Eqs. (3), (4) and (5). For the upper-level module, we use Eqs. (1) and (2) to analyze the system reliability.
The probabilities of failure of the RB2RB3, RB2RB3NVP3 and NVP3NVP3 schemes are obtained in this way, respectively.
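In code, the two-level evaluation is a composition: each lower-level module's failure probability (Eqs. (3)-(5)) becomes the "variant" failure probability of the upper-level scheme (Eqs. (1)-(2)). The sketch below shows an RB upper level over per-module probabilities; the numeric inputs are placeholders.

```python
def upper_rb(p_modules, pd):
    """Eq. (1) generalized to per-module failure probabilities."""
    r, carry = 0.0, 1.0
    for pm in p_modules:
        r += carry * (1 - pm) * (1 - pd)         # this module's output accepted
        carry *= pm * (1 - pd) + (1 - pm) * pd   # otherwise fall through
    return 1 - r

# RB2RB3: an upper-level RB over two lower-level RB3 modules, whose failure
# probability (0.004 here, a stand-in) would come from Eq. (4).
print(round(upper_rb([0.004, 0.004], pd=0.01), 6))
```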
4 Experimental Result
The following example illustrates our proposed reliability models in comparison with the classical reliability models and other hybrid models [2]. Table 1 presents the input datasets of failure probabilities. Dataset 1 has P_RALL and P_RV equal to zero, assuming that there is no error in the specification and no single error activates the failure of two software variants. The other datasets have varying values for P_RALL and P_RV, referenced from [1] and [3].
Table 2. Input date: cost values
Table 2 presents costs of a software variant, a tester and a voter used in the system cost analysis. It is assumed that the voter has less cost when compare with acceptance tester according to [2, 71. A software variant cost is considered equal to the cost of an acceptance tester. This is because the tester needs to do computation to check output of each variant. A voter has less cost than a tester’s cost because the voter doe not have to know the correct result but instead using an algorithm to find majority result. So, its cost can be less than the cost of a variant and the cost of a tester. Table 3 provides reliability and cost evaluations of FU3, NVP, hierarchical RB, hierarchical NVP and hybrid RB-NVP models assuming s-independent of software failures. The m o d k number 3 to 10 each consists of six software variants but different in the number and allocated positions of adjudicators. The models number 1, 2, and 11 each has 2, 3, and 9 software variants, respectively. From our analysis, the model which gives the highest reliability is FU32RI33 at the cost of 90.
331 The top 3 in the reliability rank each consists of RB modules in its hierarchical structure. With less number of software variants and at lower system costs, less number of faults can be tolerated and lower system reliabilities are obtained, as shown with RB2, RB3 and NVP3. Table 3. Reliability and cost evaluations with s-independent assumption
Table 4 presents reliability and cost evaluations of the proposed models together with traditional RB and NVP models considering related faults in software variants. Dataset 1 has P R A Land ~ P R equal ~ to zero. The results with dataset 1 of both Table 3 and Table4 are compared. With input dataset 2, the result shows that RB3RB2 gives the highest system reliability compared with other models. With input dataset 3, the result shows that RB2RB3 gives the highest system reliability while RB3RB2 ranks the second. Dataset 4 has lower value of PD then dataset 3; the results show RB3RB2 gives the highest system reliability. NVP3NVP3 and NVP3 models give lowest reliabilities compared to other models. The cost of each model is the same as analyzed with sindependent failure assumption, shown in Table 3. Table 4.Reliability and cost evaluations with related fault
332 5
Conclusion
In this research, we proposed six hierarchical fault-tolerant software models that consist of multi-level RBs, NVPs, or combinations of RB(s) and NVP(s) (called hybrid structures). We also provide reliability and cost evaluations with those of the classical RB and NVP models. Reliability results with s-independent failure assumption are compared with those considering failure dependency assumption. System cost for each model is evaluated as well. The proposed hierarchical fault-tolerant models of RB provide higher reliability than those of the classical models and the hybrid models. However, if we consider in cost, the proposed hierarchical models cost more than the others are as well. While the proposed NVP models give less reliability compare with those of other models. In summary, we can rank the models from the highest to the lowest in terms of reliability: Hierarchical RB Model > Hybrid RB-NVP Model > Classical RB Model > Classical NVP Model > Hierarchical NVP Model In terms of system cost, from the most to the least expensive: Hierarchical NVP Model ? Hierarchical RB Model ? Hybrid RB-NVP Model > Classical NVP Model > Classical RB Model
References
1. 2.
3.
4. 5. 6. 7. 8. 9.
J.B. Dugan, F.A. Patterson-Hine, Simple Models of Fault Tolerant Software, Proceedings Annual Reliability and Maintainability Symposium, 354 (1993). J. Wu, E.B. Fernandez, and M. Zhang, Design and Modeling of Hybrid FaultTolerant Software With Cost Constraints, J. System Software Vol. 35, 141 (1996). F.D. Giandomenico, A. Bondavalli, J. Xu and S. Chiaradonna, Hardware and Software Fault Tolerance: Definition and Evaluation of Adaptive Architectures in A Distributed Computing Environment, Proceedings of ESREL 97 Int. Conference on Safety and Reliability, 582 (1997). S.P. LeBlance, P.A. Roman, Reliability Estimation of Hierarchical of software system, 2002 Proceeding of Annual Reliability and Maintainability Symposium. 249, 368 (2002). R.K. Scott, J.W. Gault and D.F. McAllister, Fault-Tolerant Software Reliability Modeling, IEEE Transaction on Software Engineering Vol. 13,582 (1987). A. Avizienis, The N-version approach to fault-tolerant software, IEEE Transaction on S o f i a r e Engineering Vol. 12, 1491 (1985). 0. Berman, U.D. Kumar, Optimization models for recovery block schemes, European Journal ofOperationa1 Research Vol. 115,368 (1999). J.B. Dugan, M.R. Lyu, System Reliability Analysis of an N-version Programming Application, IEEE Transaction on Software Engineering Vol. SE3,103 (1993). M. Veeraraghavan, K.S. Trivedi, An Improved Algorithm for the Symbolic Reliability Analysis of Networks, IEEE Transaction Vol. 4, 34 (1 990).
SOFTWARE RELIABILITY PREDICTION USING NEURAL NETWORKS WITH LINEAR ACTIVATION FUNCTION R.B. MISRA Reliability Engineering Center, Indian Institute of Technology Kharagpur, Kharagpur, (w.B.) 721302 India P. V. SASATTE Reliability Engineering Center, Indian Institute of Technology Kharagpur, Kharagpur, (W.B.) 721302 India In the past the neural network models were used for software reliability prediction In various experiments it has been found that neural network prediction results are better than the conventional statistical models but main drawback of the neural network approach is that one cannot extract the knowledge stored in parameters (weights of connecting links) of the neural networks that are directly related to software reliability metrics like the number of faults remaining in software This paper presents a new neural network model to overcome drawback of the earlier neural network models The proposed neural network model uses a linear activation function and input to the neural network is transformed using an exponential function This transformation helps to express the neural network results in terms of software reliability metrics like the number of faults remaining in software Applicability of the proposed approach for software reliability prediction is shown with the help of real software project data
1.
Introduction
In this new era, software systems are widely utilized in various areas, which include home appliances, communication systems, safety critical systems, etc. Most of these applications require reliable sofiware. Many Software reliability models have been developed during last three decades [ 11. These models provide quantitative measure of software reliability. These are of help to investigate the reliability improvements during software testing phase. On the basis of reliability growth a software developer can allocate resources for software testing efficiently. Conventional software reliability models are based on statistical theory and show better accuracy for fitting software failure data. However, no single model is able to provide an accurate prediction for all software project data [ 2 , 3 ] . In an adopted standard practice, best software reliability model is identified among the available software reliability models. In the process, software failure data collected during testing phase is divided into two subsets. First subset is used for parameter estimation of the model and second subset is used to analyze predictive quality of the software reliability models. In an analysis, the software reliability model whose prediction results are better than rest of the models is selected for further software reliability analysis. The conventional statistical approach has two main drawbacks [4]: 1) There is uncertainty about the model’s (which is selected as a best after comparing with rest of the models) predictive capability. Further, at some point during testing, the model, which gives better prediction compared
333
334 to others, doesn’t necessarily provide the same throughout the testing phase. 2) In early stage of testing it is difficult to select best software reliability model, since one cannot obtain sufficient failure data for the purpose. In the past recent years, the neural networks have become a popular approach for solving nonlinear problems. It finds solution for curve fitting problem in terms of multiparameters instead of two or three parameters as used in the statistical models. Problem solution in terms of multi-parameters makes the neural network more flexible and efficient. Karunanidhi [4] introduced the neural network for modeling software failure process and he used the trained neural network for prediction of future software failure behavior. Karunanidhi found that the neural network model shows better accuracy for prediction of software failure behavior than the conventional statistical models. Further various neural network models were presented for software reliability prediction and most of the neural network models provide better prediction results than the convention statistical models [4-91. However, main drawback of all neural network proposed earlier for software reliability prediction [4-91 is that one cannot extract the knowledge stored in parameters (weights of connecting links) of the neural networks that are directly related to software reliability metrics like the number of faults remaining in software. Software reliability metrics are useful in quantitative software reliability analysis. In this paper a new neural network model is presented. It overcomes drawbacks of both conventional statistical models and existing neural network models used for software reliability prediction. It provides better prediction results and values of some software reliability metrics parameters like the number of faults remaining in software. The proposed neural network model is applied to real software project data and comparison of results is carried out with conventional statistical models Notations: i = Interval numbermata point number in data set = 1,2,3,. . ..j.. .k..l.. .n t = Execution time T = Total execution time m(t) = Actual cumulative failures by time t n = Number of data points in a software failure data set N(t) = Expected cumulative failures by time t a = Total number o f faults present in the program at the beginning of the testing q = Number of data points in an extracted subset used for neural network training 2.
Neural Network Model
2.1. Earlier work
In last decade Karunanidhi [4] introduced the neural networks for software reliability prediction. The neural network architecture used by Karunanidhi was input layer, output
335 layer and one hidden layer. The input and output to the neural network were execution time and the number of cumulative faults respectively as shown in Figure 1.
Neural Networks Execution time
Cumulative faults
u Figure 1. Neural network model
Karunanidhi used the sigmoid activation function for the neural network training. Output of the neural network with the sigmoid activation function is limited between the values 0.0 to 1.0. However in software reliability prediction problem output value (cumulative faults) always lies out of this range. Thus, it is necessary to scale output of the neural networks over the scale 0.0 to 1.0. This scaling needs information regarding the total number of cumulative faults detected at the end of testing phase. During testing phase this value is not available and therefore scaling of output is not possible. Although the prediction results of neural networks with sigmoid activation function are better than conventional statistical models the scaling problem has been not addressed yet [5]. Karunanidhi [ 5 ] suggested a new method to overcome scaling problem, in which a clipped linear function was used in output layer as an activation function instead of the sigmoid function. The advantage of this method is that, it can predict positive values in any unbounded region and therefore scaling of output is not required. However, results obtained with this approach were worse than some statistical models for some of the software project data. In Ref. [7] the GMDH (group method of data handing) network, which is adaptive machine learning, is applied for software reliability prediction during testing phase. The GMDH network is based on the principle of heuristic self-organization. Advantage of this approach is that it overcomes problem of determining a suitable network size in the use of multi layer perceptron neural network and its prediction results were found better than the conventional models. Recently, a modified Elman recurrent neural network [9] is presented for software reliability prediction. The predictive performances of the Elman recurrent neural network were found better than conventional nonhomogeneous Poisson process (NHPP) models. Above discussion provides evidence about neural network’s ability in identifying trends in software failure process and prediction of future software failure behavior. Most of the neural network approaches described above shows better prediction results than the conventional statistical models but the main drawback of the neural network approach is that the results obtained with neural networks can’t be expressed in terms of software
336 reliability metrics like the number of faults remaining in software. These software reliability metrics are useful in quantitative analysis of software reliability data. 2.2. Proposed Neural Network Model A multi-layer feed forward neural network is being considered in this study and back propagation algorithm is used for neural network training. The neural network architecture used is input layer, output layer, and one hidden layer. The proposed model uses the linear activation function and therefore its results are less subjected to the architecture of neural networks, training mode, number of hidden layer, and number of connecting links in each layer. The neural network is trained with execution time as an input and the number of cumulative failures observed as the target output. In the proposed model it is assumed that if testing is continued for infinite time then all faults in the software will be detected. On the basis of assumption input to the neural network is transformed using an exponential function. The transformed input is given by Transformed Input
= t’ =
1- exp (-t/T)
(1)
This transformation helps to reduce the neural network training error. Another important benefit of this transformation is that user can relate the knowledge stored in parameters of the neural network to the software reliability metrics like number of faults remained in software. Behavior of the proposed model with transformed input is shown in Table 1 and explained as: At the beginning of the testing, execution time is zero i.e. transformed input is also zero. According to the assumption, at infinite time of testing, faults detected in testing must be equal to total faults in software. If user uses input to the neural network without transformation then at infinite time, ‘infinity’ value becomes input to the neural network and neural network cannot produce output for input ‘a’.However, if input is transformed then it becomes ‘ 1 ’ instead of ‘CO’and the neural network can provide output for ‘ 1 ’ . The output of the neural network for transformed input ‘l’is ‘total number of faults present in software at the beginning of the testing’.
2.2.1. Assumptions: 1.
At infinite time, all faults in software are detected. Table 1. Neural Network Model Behavioi
Input = Execution Time or CPU time
Transformed Input
Cumulative faults
t’ = I-exp(-tR)
‘t’
At beginning of testing t=O At infinite time t =cc
0
m(t = 0)
1
m(t = -) = Total faults in software
=0
337 2.2.2. Selection of training data Set In the proposed model, subsets are extracted from the data set reserved for the neural network training rather than training the neural network with all data reserved for it. The neural network is trained for all extracted subsets. The neural network data fitting performance for each subset is evaluated using the measure ‘Data Fitting Error’. The subset, for which the neural network gives minimum ‘Data Fitting Error’, is selected. The neural network trained with this subset is used for prediction of software failure behavior. The ‘Data Fitting Error’ is given as Data Fitting Error =
‘
4
The subsets are obtained as Let, p = data set number of the subsets extracted form training data q = data points in a data set used for training do while
p = l = d a t a s e t = ( { i , t & m ( t ) } f o r i = l , 2 ...n) for successive subset, remove first data point {i ,t & m(t)} from earlier dataset q2 3
First subset comprises ‘n’ data points, while every successive subset comprises one data point less than earlier one and last subset comprises last 3 data points of training data set. The minimum data points in a dataset used for the neural network training are restricted to three. This restriction is imposed to avoid the neural networks from random guessing because the neural network prediction without proper training is random guess [4]. 3.
Prediction Results
A Software failure data collected from TROPICO system software is used to test applicability of the proposed neural network approach [lo]. In this study failure data obtained during testing phase (failure data of 300 days) is reserved for neural network training and remaining failure data (failure data of last 510 days) is used to check adequacy of proposed model. A procedure to find out optimal number of data points in a training set is applied as described in earlier section and neural network is trained with it to predict for failures in last 510 days. Plot of TROPICO system failures with neural network prediction for last 510 days is shown in Figure 2. If the neural network’s prediction is examined in two sections then for first 300 days (from 300 days to 600 days) the prediction is almost same as the actual failures occurred, for next 210 days (from 600 to 810 days) the predicted value is lower than the actual value.
338
3.1 Estimation of the Remaining Software Faults
As described earlier, input to the neural network is assigned ‘1’ to estimate the ‘total number of faults present in the program at the beginning of the testing’. Predicted total number of faults present in the program at the beginning of the testing ( ‘ u ’ ) by the neural network is 457. The residual faults in software is estimated by subtracting faults found by time ‘t’ from ‘total number of faults present at the beginning of the testing. After 300 days of testing 297 faults were found and therefore residual faults in software after 300 days are 160. 4.
Model Comparisons
In this section neural network prediction results are compared with some commonly used statistical models [l I]. The models used for comparison are exponential NHPP, S-shaped NHPP model, Generalized Poisson model, Schick-Wolverton, and Schneidewind’s Model. The criteria used in this study to judge the performance of the model is Prediction Error [ 101, A Prediction error measures how close the predicted cumulative number of failures to the actual failures occurred.
Prediction Error =
(3)
I=/
k- j+l
The software failure data used in earlier section for neural network prediction is used in here for prediction of last 510 days failure with the help statistical models. First thirty data points are used for the conventional statistical model fitting and parameter estimation Figure 3 shows plot of predicted failures vs. testing time for the proposed neural network model and some of the statistical models used in this study whose prediction results are better than rest of the statistical models. Among compared models,
339 exponential NHPP and Generalized Poisson model Predicts higher value of cumulative failures as compared to actual failures occurred in that interval. On the other hand, Sshaped NHPP model and proposed neural network model Predicts lower value of cumulative failures as compared to actual failures occurred in that interval. The proposed model’s prediction curve is more close to the actual one as compare to rest of the models. 750 +Neural 650
Network
+Exponential *Generalized
NHPP Poisson
*
.d
c d 7
450
E
J
350
250
1 31
41
51
61
71
81
Failure Interval
Figure 3 Software failure predictions by models
The performance of the model used in the study is judged using the measure ‘Prediction Error’. The comparison results are shown in Table 2. It is found that, Sshaped NHPP model turns out to be the best among the statistical models. From Table 2, it is clear that the proposed neural network model provides significantly better prediction results as compared to the conventional models in terms of low prediction error. Table 2. Comparison of prediction
Model S-shaped NHPP model Generalized Poisson model Exponential NHPP model Neural Network
5.
Prediction Error
105.9 16.54
Conclusion
In this study, new neural network model is presented for software reliability prediction. The proposed model uses the linear activation function and therefore its results dependency on the architecture of neural networks, training mode, number of hidden
340
layer, and number of connecting links in each layer is less. This feature makes the proposed model robust and simple for implementation. Applicability of the proposed neural network model for quantitative software reliability analysis is shown with real software project data and results are compared with conventional approaches used for software reliability prediction. It is noted that the neural network model turned out to be the best in terms of low prediction error as compared to existing models and therefore provides a better alternative for software reliability analysis. Acknowledgments
This work is supported by MHRD, New Delhi (India) under the Hybrid System Safety Analysis [HYB] Project References 1. M. Lyu, (Editor), Handbook of Software Reliability Engineering, McGraw-Hill, (1 996). 2. Y. K. Malaiya, N. Karunanidhi and P. Verma, IEEE International Computer Software Applications Conference. 7 (1 990). 3. A.A Abdel-Ghaly, P.Y. Chan, B. Littlewood and J. Snell, IEEE Transactions on Software Engineering. 16 (4), 458 (1 990). 4. N. Karunanithi, D. Whitley, and. Y. K. Malaiya, IEEE Transactions on Software Engineering. 18 (7), 563 (1992). 5. N. Karunanithi, and Y. K Malaiya, International Symposium on Software Reliability Engineering. 76( 1992). 6. R. Sitte, IEEE Transactions on Reliability. (48) 3, 285 (1999). 7. T. Dohi, S. Osaki and K.S Trivedi, International Symposium on Software Reliability Engineering. 40 (2000). 8. K. Y. Cai, L. Cai, W. D. Wang, Z. Y. Yu and D. Zhang, Journal of Systems and Software. (58) I , 47(2001). 9. S. L. Ho, M. Xie and T. N Goh, Computers d Mathematics with Applications. (46) 7,1037 (2003). 10. M. R. B. Martini, K. Kanoun and J. .M. De Souza, IEEE Transactions on Reliability. 39(3), 369 (1990). 1 1. A. P. Nikora, Computer Aided Software Reliability Estimation (CASRE) User's Guide Version 3.0, (1 999).
BLOCK BURN-IN WITH MINIMAL REPAIR MYUNG HWAN NA Department of Statistics, Chonnam National University, 300 Yongbong-dong, Gwangju, 500-757, Korea SANGYEOL LEE Department of Statistics, Seoul National University, San 56-1 Sillim-dong, Seoul, 151-742, Korea YOUNG NAM SON Department of Computer Science and Statistics, Chosun University, 375 Seoseok-dong, Gwangju, 501 -759, Korea Burn-in is a method to eliminate early failures. Preventive maintenance policy such as block replacement with minimal repair a t failure is often used in field operation. In this paper, the burn-in and maintenance policy are taken into consideration a t the same time. The optimal burn-in time is obtained in an explicit form. 1. Introduction
Let F ( t ) be a distribution function of a lifetime X. If X has density f ( t )on = f ( t ) / F ( t ) where , F ( t ) = 1- F ( t ) is the survival function of X . Based on the behavior of failure rate, various nonparametric classes of life distributions have been defined in the literature. The following is one definition of a bathtub-shaped failure rate function which we shall use in this article.
[o, co),then its failure rate function h(t) is defined as h ( t )
Definition 1. A real-valued failure rate function h(t) is said to be bathtubshaped failure rate with change points tl and t 2 , if there exist change points 0 5 tl 5 t 2 < co,such that h ( t ) is strictly decreasing on [ O , t l ) , constant on [tl,t 2 ) and then strictly increasing on [ t z , co). The time interval [0,t l ] is called the infant mortality period; the interval [ t l ,t z ] , where h(t) is constant, is called the useful life; the interval [ t z ,co) is called the wear-out period. Burn-in is a method used to eliminate early failures of components before they are put into field operation. The burn-in procedure is stopped when a preassigned reliability goal is achieved. Since burn-in is usually costly, one of the major problem is t o decide how long the procedure should continue. The best time t o stop the burn-in process for a given criterion is called the optimal burn-in time. An introduction to this context can be found in Jensen and
341
342 Petersen (1982). In the literature, certain cost structures have been proposed and the corresponding problem of finding the optimal burn-in time has been considered. See, for example, Clarotti and Spizzichino (1991), Mi (1994), Cha (2000), and Block and Savits (1997) for a review of the burn-in procedure. In this paper, we propose block burn-in with minimal repair. Our strategy is a modified version of Cha’s (2000) model since we consider additively the time for burn-in when we obtain the cost function. Cha (2000) showed that the optimal burn-in time b* must occur before the change point tl of h(t) under the assumption of a bathtub-shaped failure rate function. But in our model, the optimal burn-in time is given by an explicit form. Also, it is shown that optimal burn-in time decreases as the cost for burn-in increases, as the minimal repair cost during burn-in increases, or as the minimal repair cost in field operation decreases. 2. Block Burn-in Strategy
We begin to burn-in a new component. If the component fails before a fixed burn-in time b, we repair it minimally, and continue the burn-in procedure for the repaired component. After the burn-in time b, the component is to be put into field operation. For a burned-in component, block replacement policy with minimal repair at failure will be adopted. Under this replacement policy, the component is replaced by a burned-in component at planned time TI where T is a fixed positive number, and is minimally repaired at failure before a planned replacement. The cost for burn-in is assumed to be proportional to the total burn-in time with proportionality constant cg. Let c1 and c2 denote the costs of a minimal repair during burn-in period and in field operation, respectively. We assume that c1 < c2; this means that the cost of a minimal repair during a burn-in procedure is lower than that of a minimal repair in field operation. Let c, be the cost of a planned replacement. The total expected cost in the interval [0, b T ] is the sum of the expected burn-in cost, cob c1 h(t)dt, and the expected replacement cost, c, cz h(t)dt. The expected length of the interval is b T . Thus, the total expected cost per unit time C(b,T) is given by
+
+ s,””
C(b,T)=
cob
+
si
+
+ c1 s,” h(t)dt + c, + cz s,”” h(t)dt b+T
(1)
We will consider the problem of finding the optimal burn-in time b* and the optimal age T * such that
C(b*,T*)= min C(b,T). b/O,T>O
343 Throughout this paper, we assume that the failure rate function h ( t ) is differentiable and bathtub-shaped with change points tl and t2.
3. Optimal Burn-in Time For a given b
2 0, first, we can determine Tt
as a function of b satisfying
C(b,T z ) = min C(b,T ) . TZO
Note that d -C(b,T) dT
=
(b
cob
+T)2
-
+ c, + s,” h(t)dt c1
c2
where
+
qb(T) = ( b f T ) h ( b T ) -
rT
h(t)dt.
Hence, dC(b,T ) / d T = 0 if and only if the following equation (2) holds.
(b
+ T ) h ( b+ T )
Since qL(T)= ( b
1
b+T
-
h(t)dt =
cob
+ c, + c1 s,” h(t)dt c2
b
+ T)h’(b+ T ) , decreasing on [0,tl - b ) , constant on [tl - b,t2 - b ) , increasing on [t2 - b, m).
Now we need a following partition of interval [0,m)
Note that for any b E A2, the equation (2) has a unique solution which we denote by T; and C ( b , T ) has minimum at T;. For b E A3, C(b,T) has minimum at T; = 0 or the solution of equation (2). Also note that for any given b E A l , d C ( b ,T ) / d T < 0 for all T 2 0 and C(b,T ) is strictly decreasing. Thus C(b,T ) has minimum at Tz = 00.
344
Therefore, we can define
{ b L 0 : T l = m } ,B2 = { b 2 0 : 0 < Tl < co}, and
Bi
B3
= { b 2 0 : Tl = 0).
Then, min
b>O,T>O
C ( b ,T ) = min
{
min C ( b ,m), min C(b,T,*), min C ( b ,0 ) be B P
bEBi
bEB3
Before we state main theorem, we need the following lemma.
Lemma 1. There exists b: such that C ( b i , O ) = co
+ c l h ( b i ) 5 C ( b ,0 ) = b (cob + c
1 l h(t)dt
+ c r ) for all b L 0,
where bi is either the unique solution of (3) or equal to 00 according to whether (3) has a solution or not:
bh(b) -
Jlub
h(t)dt = -.C r C1
(3)
Define s1 and s2 be the solutions of the following equation if they exist
h ( t ) = -, CO c2 - c1
(4)
where 0 5 s1 5 tl 5 t2 5 s2. If h(0) < C O / C ~- c1 and h(00) > C O / C ~- c1, then let s1 = 0 and s2 = co,respectively. The following is the main result of this paper.
Theorem 1. If none of s1 and s2 exists or if sl+T:l > s2 and ~ 2 h ( s l + T : ~>) co clh(b;), the optimal burn-in time b* and the corresponding optimal age T* = Tl* is given by b* = b; and T" = T$ = 0,
+
otherwise
b* = s1 and T* = TS*,> t 2 - s1. Remark 1. None of s1 and s2 exists means that h ( t ) > C O / ( C ~ - c1) for all t 2 0. Remark 2. From (4) Theorem 1 indicates that optimal burn-in time gets smaller as C O , the cost for burn-in, or c1, the minimal repair cost during burnin, becomes higher, or as c2, the minimal repair cost in field operation becomes lower.
345
Acknowledgement This research was supported in part by KOSEF through Statistical Research Center for Complex Systems a t Seoul National University.
References
[l] Barlow, R. E. and Proschan, F. 1965, Mathematical Theory of Reliability. Wiley, New York. [2] Block, H. and Savits, T . 1997, Burn-In, Statistical Science, 12, 1, pp. 1-19.
[3]Cha, J . H. 2000, On a Better Burn-in Procedure, J. of Applied Probability, 37, 4, pp. 1099-1103. [4] Clarotti, C. A and Spizzichino, F. 1991, Bayes burn-in decision procedures,
Probability an the Engineering and Informational Science, 4, pp. 437-445. [ 5 ] Jensen, F. and Peterson, N. E. 1982, Burn-in, John Wiley, New York. [6] Mi, J. 1994, Burn-in and Maintenance Policies, Adv. Appl. Probability, 26,
pp. 207-221.
This page intentionally left blank
FIVE FURTHER STUDIES FOR RELIABILITY MODELS
T. N A K A G A W A
Department of Management and Information Systems Aichi Institute of Technology 1247 Yachigusa, Yagusa-cho, Toyota 470-0392, Japan E-mail: [email protected] This paper surveys my five recent studies; (1) reliability of complexity, (2) reliability of scheduling, (3) expected number of failures, (4) sequential maintenance policies, and (5) service reliability. Some results might not be useful to actual fields, however, these would certainly offer interesting topics to researchers and practicians.
1. Introduction
This paper surveys my five recent studies in which some results have been already obtained, and further studies are now continued and expected in near future. Section 2 defines the complexities of systems as the number of paths and the entropy. There exist many problems on complexity which have to be solved theoretically and practically. For example, we have t o show how to define the complexity of more complex systems. Section 3 defines the reliability for a job with scheduling time as the probability that it is accomplished successfully by a system. This would be modified and be extended, and could be applied to many fields of actual scheduling problems. Section 4 obtains the time that the expected number of failures is k and the distribution of H ( X ) when Pr{X t } = 1 - e--H(t).This would give theoretically interesting topics of expected number of events in stochastic processes. Section 5 proposes the sequential maintenance policies where the preventive maintenances are made successively. This is one of modified and extended models of periodic and age replacements. Finally, Section 6 defines service reliability on hypothetical assumptions. This would trigger t o begin theoretical studies of service reliability.
<
2. Reliability of Complexity
In modern information societies, both hardware and software become large-scale and complex with increasing requirements of high quality and performance. It is well known that the reliability of such systems becomes lower than our expectation, owing to the complex of communication networks and the increase of hardwares such as fault detections and switchover equipments [I,21. The science of complexity has been recently spread to many fields such as physics, economics and mathematics 347
348 [3]. It is important to make more discussions on system complexity in reliability theory. Nakagawa and Yasui [4, 51 have already defined a complexity of redundant systems. This section surveys these results briefly. 2.1. Definition 1 of complexity Suppose that we can count the number of paths of a system with two terminals, and define its complexity as the number of Pa of minimal paths. Example 2.1. The nuniber of paths of a series system with n components is 1. The complexity of this system is Pa = 1. The nutnber of paths of a parallel system with n components is n. The complexity is Pa = n. Next, suppose that each module is composed of several components and the complexity of module A& is P,(i)(i = I, 2 , . . . ). Example 2.2. The number of paths of a series system with m modules is P a ( l )x Pa(2) x . . . x P,(m), and hence, the complexity is Pa = P,(i). That is, the complexity of a series syst,em is given by the product of complexities of each module. The number of paths of a parallel system with m modules is P a ( l ) P,(2) . . . P,(m), and hence, the complexity is Pa = P,(i). That is, the complexity of a parallel system is given by the sum of complexities of each module. Further, we specify a reliability function of complexity. A discrete reliability function decreases from 1 t o 0 and its typical one is discrete Weibull IS]. Thus, we define a reliability function with complexity n as
nzl
c,"=,
RJn)
where
=
(n = 1 , 2 , . . '),
+
+ +
(2.1)
p > 0 and q = e-" (0 5 a < c0,O < q <_ 1).
2.2. Definition 2 of complexity Suppose that the number of paths of a system with two terminals is countable. When the number of minimal paths is Pa, we define the complexity of a system as P, = log, Pa, using the concept of entropy. Example 2.3. The number of paths of a series system with n components is Pa = 1, and hence, its complexity is P, = log, 1 = 0. The number of paths of a parallel system with n components is Pa = n, and hence, its complexity is P, = log, n. Next, suppose that each module is composed of several components and the complexity of module Mi is P,(i) (i = 1 , 2 , . . . ). Example 2.4. Since the number of paths of a series system with m modules is Pa = P,(i)from Example 2.2, the complexity is
n,"=,
349
That is, the complexity of a series system is given by the sum of those of each module. This fact corresponds to the result that the failure rate of a series system is the total sum of those of each module. Further, we define a reliability function of complexity: When P, = log, n,
~ , ( n=) e-OP< = exp(-alog,n) (n = 1 , 2 , . . . ) , (2.3) for a > 0. It is evident that Re(n)is decreasing from 1 to 0. The failure rate of the reliability is
which decreases strictly from 1- e P a to 0. Moreover, putting ,O = a/ log, 2 the reliability is R,(n) = n-0.
(p > 0),
3. Reliability of Scheduling
Most systems are usually performing their functions for a job with scheduling time. However, the general definition of reliability is given by the probability that a system is operating at time t or during the interval (0, t] [7]. Suppose that a job has a scheduling time S such as operating time, working time and processing time, and is achieved by a system. A job in a real world is done in random environments due to many sources of uncertainty IS]. So that, it would be reasonable to assume that a scheduling time is a random variable, and to define the reliability as the probability that the work of a job is accomplished successfully by a system. We define the reliability of a job with scheduling time S as the probability of comparing two independent distributions of scheduling and failure times, and investigate its properties. h r t h e r , introducing costs needed for scheduling for a job, we present an optimal scheduling time which minimizes the expected cost. We also consider a scheduling problem how many number of units for a parallel system is appropriate for a job with scheduling time S.
3.1. Definition of reliability Suppose that a positive random variable S is the scheduling time such as operating, working and processing times of a job or a task, and X is the failure time of a system. Further, two random variables S and X are independent with each other, and have the respective distributions W ( t )and F(t)with finite means, i.e., W ( t )= Pr{S 5 t } and F ( t ) = Pr{X 5 t } , where F = 1 - F . We define the reliability of a system with scheduling time S as 03
R ( W ) s Pr{S 5 X} =
1
03
W(t)dF(t)=
F(t)dW(t),
(3.1)
350 which represent,s the probability that the work of a job with scheduling time S is accomplished by a system without failure. Barlow and Proschan [7]defined R ( W ) as the expected gain with some weight function W ( t ) . We have the following quantities of R ( W ) : (1) When W ( t ) is the degenerate distribution placing unit mass a t time T, R ( W ) = F ( T ) which is the usual reliability function of a system. Further, when W ( t )is a discrete distribution
iu
C pi
~ ( t= )
for 0 5 t < T I , for
T~I t < T ~ +( j ~= 1 , 2 , . . . , N
- 11,
i=l
for t 2 T N ,
N
R(W) = E p , F ( T , ) . j=1
(2) When W ( t )= F ( t ) for all t 2 0, R ( W )= 1/2. (3) When W ( t ) = 1 - ePwt,R(W) = 1 - F * ( w ) ,where G * ( s )is the LaplaseePStdG(t).InStieltjes transform of any function G ( t ) ,i.e., G * ( s )f versely, when F ( t ) = 1 - e-xt, R ( W ) = W*(X). (4) When both S and X are normally distributed with mean p1 and p2, and and respectively, R ( W ) = @ [ ( p-~p 1 ) / & 5 7 ] , where variance a(.) is a standard normal distribution with mean 0 and variable 1. ( 5 ) When S is uniformly distributed during [O,T],R ( W )= s:F(t)dt/T, which represents the interval availability for a given interval [0,T I .
s,"
01
022,
3.2. Optimal scheduling time
Some works of a job need to be set up the scheduling time [8]. If the work is not accomplished up to the scheduling time, its time is prolonged, and this causes some losses to the scheduling. Conversely, if the work is accomplished too early before the scheduling time, this involves a waste of time. The problem is how t o determine the scheduling time in advance. , its schedulSuppose that a job has a working time S with distribution W ( t ) and ing time is L (0 5 L < m) whose cost is sL. If the work is accomplished up t o time L, it needs cost cl, and if it is not accomplished until L and is done during ( L ,m), it needs cost c f where cf > c1. Then, the expected cost until the completion of work is
C ( L )= c1 Pr{S 5 L } + cf Pr{S > L } + S L = ClW(L) C f [ l - W ( L ) ]+ S L .
+
(3.2)
Since C(0) = c f and C(m) = m, there exists a finite scheduling time L* (0 5 L* < m) which minimizes C ( L ) .
351 We seek an optimal time L* which minimizes C ( L ) . Differentiating C ( L ) with respect t o L and setting it equal to zero, we have w ( L ) = s / ( c f - c1) where w ( t ) is a density of W ( t ) . In particular, when W ( t )= 1 - e-wt,
Therefore, we have the following result: (i) If w > s / ( c p - c1) then there exists a finite and unique L* (0 < L* which satisfies (3.3). (ii) If w 5 s / ( c f - c1) then L* = 0.
< co)
3.3. Parallel system Consider the scheduling problem how many number of units for a parallel redundant system is appropriate for the work of a job with scheduling time S [8]. Suppose that an n-unit parallel system works for a job with scheduling time S with W ( t ) = Pr{S I t } where = 1 - W , whose operating cost is ns. It is assumed that n units are independent and have an identical failure distribution F ( t ) . If the work of a job is accomplished when at least one unit is operating, it needs cost c1, and if the work is accomplished after all n units have failed, it needs cost c f where c f > c1. Then, the expected cost of n-unit parallel system is
C ( n )= c1
+ (cp - c1)
IW-
+
W ( t ) d [ F ( t ) l n ns
( n = 0 , 1 , 2 , . . .).
Since C(0) = c f and C(w) = co,there exists a finite number n* (0 which minimizes C ( n ) . iFrom the inequality C ( n 1 ) - C ( n )2 0, we have
(3.4)
I n* < co)
+
whose left-hand side is strictly decreasing to 0. Therefore, we have the following result: (i) If J," F(t)dW(t) s / ( c f - c1) then n* = 0. (ii) JOwF(t)dW(t) > s / ( c f - c1) then there exists a unique minimum n* (1 I n* < co) which satisfies (3.5).
Example 3.1. When W ( t )= 1 - e-wt and F ( t ) = 1 - e P x t , equation (3.5) is
352 If w / ( w + A) > s / ( c f - c1) then there exists a positive number n* (1 5 n* < m). Table 3.1 gives the optimal number n* of units for s / ( c f - c1) and A/w. Table 3.1. Optimal number n* of units Sl(Cf
- Cl)
0.1 0.05 0.01
9
4 12
2 11
4. Expected Number of Failures
We are greatly interested in the expected number of failures during ( O , t ] . When failures occur a t a non-homogeneous Poisson process with an intensity function h ( t ) , the expected number of failures is H ( t ) = Jih(u)du which is called cumulative hazard function. When units are replaced immediately upon failures, i.e., failures occur at a renewal process with distribution F ( t ) , the expected number of failures is M ( t ) which is called renewal function. This section obtain (1) the time that the expected number of failures is k and ( 2 ) the distribution Y = H ( X ) when Pr{X _< t } = l - e - H ( t ) , and investigates their reliability quantities. Further, we apply these quantities to maintenance models.
4.1. Expected number (1) Poisson process Suppose that the failure times x k ( k = 1 , 2 , . . . ) of a unit occur a t a nonhomogeneous Poisson process with an intensity function h ( t )and N ( t ) is the number of failures during (0,t ] . Then, we have the following results [9, 101:
and the mean time t o the k-th failure is
Let z we have
k
be the time that the expected number of failures is k . Then, from (4.2),
353
JXk Xk-1
Xh
h(t)dt = 1 or
h(t)dt = k
(k
=
1 , 2 , . . .).
(4.5)
In particular, when F ( t ) is IFR,
we have that X k 5 % k , and we use it as a n estimator of the mean time to the next failure. For example, when a unit fails at time t , the mean time t o the next failure is about l / h ( t ) . Further, from Theorem 4.4 of [7, p.271, when F is IFR, F ( p ) 2 e-l. Thus,
H ( p ) I H(Zl), i.e., /I 5x11
(4.8)
where 2 1 is called characteristic lzfe and represents the life time that about 63.2 % of units has failed until time 2 1 . Example 4.1. When F ( t ) = 1 - exp(-Xtm), i.e., H ( t ) = A t m , Xk = ( k / X ) m ( k = I , 2, ’ . .) and p = (l/A)mr(ll / m ) . Therefore, we evidently have that (i) 5 1 < p for 0 < m < 1, (ii) z1 = p for m = 1 and (iii) x1 > p for m > 1. = x 1 , p from (4.7) and E ( X j ) from (4.3) Table 4.1 gives x k , :k when for m = 2 and X = 0.01.
+
c:=,
Table 4.1. Comparisons of times of expected number
j=1
1
2 3 4 5 6 7 8
9 10
10.00 14.14 17.32 20.00 22.36 24.49 26.46 28.28 30.28 31.62
10.00 15.00 18.33 21.06 23.43 25.56 27.52 29.34 31.04 32.65
8.86 14.50 17.95 20.74 23.15 25.31 27.29 29.12 30.84 32.46
8.86 13.29 16.62 19.39 21.81 23.99 25.99 27.85 29.59 31.23
354 (2) Renewal process Suppose that failures occur a t a renewal process with distribution F ( t ) . Then, we have the following results [7]: Pr{N(t) = k> = ~ ( ~ ) -( ~t () ~ + ‘ ) ((IC t )= 0 , 1 , 2 , . .
c
(4.9)
w
E { N ( t ) }=
F ( k ) ( t )= M ( t ) ,
(4.10)
k=l
E{X1 + X 2 + . . . + X k } The time
Xk
(4.11)
= kp.
that the expected number of failures is k is given by
M ( x k ) = k , i.e.,
LXk
m(t)dt = k
( k = 1 , 2 , . . . ).
Example 4.2. W h en F ( t ) = l-(l+Xt)e-Xt, M ( t ) = X t / 2 - ( 1 - ~ ” ~ ) / 4 , p and xk is a unique solution of the equation
and x1 => p
(4.12) = 2/X
= 2/X.
4.2. Distribution of H ( X )
The failure time X of a unit has a general distribution F ( t ) = 1 - e--H(t) where H ( t ) is continuous and strictly increasing from 0. Then, Y 3 H ( X ) , which is the expected number of failures until unit’s own life, has the following distribution Pr{Y 5 t } = Pr{H(t) 5 t } = Pr{X 5 H-’(t)} = 1 - exp[ - H ( ~ - ’ ( t ) ) ] = 1 - e-t,
i.e., Y has an exponential distribution with mean 1. Conversely, when Pr{Y 5 t } = 1 - ePt, the distribution of X
(4.13)
= N-’(Y)
Pr{X 5 t> = Pr{H-l(Y) 5 t } = Pr{Y 5 ~ ( t )=}1 - e-H(t).
is (4.14)
Example 4.3. An optimal checking time T* of periodic inspection policy when F ( t ) = 1- ecXt is given by a unique solution of the equation [lo]: XCl
eXT - (1 +AT) = -, c2 where c1 = cost of one check and c2 be the loss cost for the time elapsed between failure and its detection per unit of time. Further, since et N 1 t + t2/2 for small t , we have approximately
+
355
Thus, defining that n(t)= tion times are given by
d-
is the inspection rate, approximate inspec-
which agrees with asymptotic inspection intensity [lo, 111. Example 4.3. A optimal PM (preventive maintenance) time T* of periodic replacement with minimal repair a t failures is given by a solution of the equation [71: T h ( T ) - L'h(t)dt = -, c2 c1 where c1 = cost of minimal repair and c2 = cost of replacement a t time T In particular, when F ( t ) = 1 - exp(-Atm) ( m > I),
X(m - l)(T*)m=
s. c1
Thus, defining that n(t)= ( m - l)X(t)(c1/~2)is the PM rate of a Weibull distribution, optimum PM times are given by
In this case,
where xk is given in Example 4.1.
5. Sequential Maintenance Policies Suppose that a unit has t o operate for an infinite time span. Then, it may be wise t o maintain preventively a unit frequently with its age or the increase of its failure rate. Barlow and Proschan [7] computed the optimal planned times of sequential age replacement for a finite span, using dynamic programming. Further, they showed the optimal inspection policy where checking times are sequent,ially computed from one equation. Nakagawa [ 12, 131 introduced imperfect maintenance
356 and derived optimal sequential policies where the preventive maintenance (PM) is made a t successive times according to the age and failure rate of a unit. This section proposes the sequential maintenance policies where the PM times are successively determined one after another by knowing the previous PM time. We derive optimal sequential maintenance times of periodic and age replacements. 5.1. Periodic maintenance We consider two periodic maintenance policies for a unit in which (1) the PM is done a t planned times and only minimal repair a t failures are made between planned maintenances, (2) the PM is done at some number of failures and minimal repair is made between planned maintenances. (1) Periodic maintenance with minimal repair A unit begins to operate a t time t = 0. The PM of a unit is done a t planned times T k ( k = 1 , 2 , . . . ). When a unit fails, only minimal repair is made between planned maintenances. This is called periodic replacement with minimal repair at failures. Then, the expected cost rate, which is the total expected cost per planned time T I , is [7]
(5.1) where c1 = cost of minimal repair and c2 = cost of maintenance a t time T k . An optimal time T: which.minimizes C(T1),when the failure rate is strictly increasing, is given by a unique solution of the equation:
Tlh(T1)-
1"
C2
h(t)dt = c1
+
Next, the PM is done at time TI T2, given that the P M was done a t time TI. Then, the expected cost rate during the interval (TI,TI T2) is
+
s,
Ti +Tz
C(T21Tl)=
c1
+
h(t)dt c2 (5.3)
T2
Differentiating C(T21T1)with respect to T2 and setting it equal to zero implies
+ + + T k , given that it was
Generally, when the PM is done a t time TI T2 . . . done a t time T2 . . . T k - 1 , the expected cost rate is
+ + +
357 Ti+Tz+. +TI
h(t)dt
+ cz
( k = 1 , 2 , ' . .), (5.5) Tk where To = 0. Differentiating C(Tk lT1, T2,. . . ,Tk-1) with respect to Tk and setting it equal to zero,
C(Tk IT1 TZ, . . . Tk-1 ) 1
Tkh(T1
1
=
+ T2 + . . . + T k ) -
J
Ti+T2+ +Th
c2
h(t)dt = - ( k = 1 , 2 , . . .).
Ti+Tz+
+TALI
c1
(5.6)
Example 5.1. Suppose that F ( t ) = 1 - exp(-Xtm)(m > 1), i.e., h ( t ) = Xmtm-l. Then, an optimal T;*is from (5.2),
(5.7) and from (5.4),
(Ti
+ Tz)m-l[(m 1)Tz -
-
TI]
+ T;" = -.Xc2C l
(5.8)
Letting Q(Tz(T1)be the left-hand side of (5.8), it can be easily seen that Q(OITl) = 0, Q(colT1)= co and
(5.9) Thus, there exists a finite and unique T,'(0 < T,' < co) which satisfies (5.8). Further, we easily have that T,' 2 T; for 1 < m 5 2 and T,' < T; for m > 2. Generally, from (5.6),
+ + Tk)"-l
(TI . . .
[mTk -
(TI + . . . + T k ) ] + (TI+ . . .
+T
c2
~ - I=) ~ .
(5.10)
XCl
There exists a finite TL (0 < TL < co) which satisfies (5.2), and T;*= T,* = . . . = TL for m = 2. (2) Maintenance at N-th failure The P M is done a t Nl-th failure (N1 = 1 , 2 , . . . ) and only minimal repair is made between planned maintenances. Then, the expected cost rate per the mean time to Nl-th failure is from [lo, 141,
358 where p j ( t ) C(N1) 2 0,
= ( [ H ( t ) ] J / j ! } e (~j ~=( 0~ ,)1 , 2 , . . .). 1
From inequality C(N1 + 1) -
N1-1
- 1
It has been shown that there exists a unique minimum which satisfies (5.12) if it exists, when h(t) is strictly increasing. Next, when the PM is done at N1 N2 . . . Nk-th failure, given that it was done at N1 NZ . . . Nk-l-th fa,ilure, the expected cost rate per this interval is
+ + +
+ + +
6. Service Reliability
The theory of software reliability [I51 has been highly developed apart from hardware reliability, as computers have widely spread on many sides. From similar points of view, the theory of service reliability begins to study slowly: Calabria et al. [16] presented the case study of service dependability for transit systems. Masuda [I71 proposed some interesting methodologies of dealing with service from the point view of engineering, and of defining service reliability by investigating its qualities. However, a reliability function of service reliability has not been established yet theoretically. This section tries to make a theoretical approach to define service reliability and to derive its reliability function on one’s way of thinking. 6.1. Service reliability 1
It is assumed that service reliability is defined as the following two independent events: Event 1: Service has N ( N = 0, I, 2 , . . . ) faults at the beginning, which will occur successively, and its reliability improves gradually by removing them. Event 2: Service goes down with time due to faults which occur randomly. Firstly, we derive the reliability function of Event 1: Suppose that N faults occur independently according t o an exponential distribution (1 - e-’It) and are removed. We define the reliability as e - ( N - k ) p l t when k (Ic = 0, 1,2, . . . , N ) faults have occurred at time t . Then, the reliability of Event 1 is given by
359
=
(1 - e-Xlt + e - ( ' l + P I ) t ) N
(N= 0 , 1 , 2 , . . .),
(6.1)
It is evidently seen that Rl(0) = Rl(oo)= 1. Differentiating R l ( t ) with respect to t and setting it equal to zero, we have
Thus, R l ( t )starts from 1 and takes a minimum at tl = (l/p1) ln[(XI + p l ) / X 1 ] , and after this, increases to 1. In general, service has a preparatory time to detect faults, like a test time in software reliability. Thus, it is supposed that service starts initially after a preparatory time t l . Putting t = t tl in (6.1), the reliability of Event 1 is
+
) e-(Xl+Pl)(t+tl) Rl(t) = [l - e- X l ( t + t ~ +
IN .
(6.3)
If the number N of faults is a random variable which has a Poisson distribution with mean 0, Rl(t) is
Next, suppose that faults of Event 2 occur according to a Poisson distribution with mean X 2 , and its reliability is defined by e - ' P z t . Then, the reliability of Event 2 is given by
which is strictly decreasing from 1 t o 0. Therefore, when both Events 1 and 2 consist independently in series, we give service reliability as
R(t) = R l ( t ) R 2 ( t )
= exp{-Oe- X l ( t + t l ) [ l
- e-Pl(t+tl)
]
-
A t ( l - e-"")}.
(6.6)
360 6.2. Service reliability 2 Even if we give service reliability in (6.6), it would be actually meaningless because it has five parameters. Thus, simplifying R ( t ) in (6.6), we define service reliability as
k(t)= (1 - (ye-pl')e--/lz' (0 < a < 1). (6.7) It is evident that k(0)= 1 - a , k(w)= 0, and differentiating k(t)with respect to t and setting it equal to zero, e -'llt
-
1
P2
a P1 + P2
Therefore, we have the following results:
+
(i) I_fa > p2/(p1 p2) then R(t) starts from 1 - a and takes a maximum R(t1) = [~i/(Pi +~2)]e--/l~ at~ti' = (-l/pi) 1n{p2/[a(pl + p 2 ) ] } , and after this, decreases to 0. (ii) If a I p2/(p1 p2) then k(t)decreases strictly from 1 - a t o 0.
+
Figure 6.1 shows k(t)for a
> p2/(pl + p 2 ) .
Figure 6.1.
R(t) for 01
> p2/(p1 + p 2 )
6.3. R e m a r k s
We have derived theoretically two reliability functions of service reliability under hypothetical assumptions. The reliability function R(t)in (6) has many parameters to estimate and is not effective to actual models. The reliability function generally increases firstly and becomes constant for some interval, and after that, decreases
361 gra.dually, i.e., it draws a n upside-down b a t h t u b curve [18]. T h e reliability function i n Figure 6.1 draws generally such a curve. A reliability function of service would be investigated from many kinds of service ways a n d b e verified to have a general curve such as a b a t h t u b one i n reliability theory.
References 1. J. Pukite and P. Pukite, Modeling for Reliability Analyszs, Institute of Electrical and Electronics Engineering, New York (1998). 2. P. K. Lala, Self-checking and Fault-Tolerant Digital Design, Morgan Kaufmann, San Francisco (2001). 3. M. M. Waldrop, Complexity, Sterling Lord Literistic Inc., New York (1992). 4. T. Nakagawa and K. Yasui, Quality in Maintenance Eng., 9, 83 (2003). 5. T. Nakagawa and K. Yasui, Math. and Comp. Modelling, 38, 1365 (2003). 6. T . Nakagawa and S. Osaki, IEEE Trans. Reliability, R-24, 300 (1975). 7. R. E. Barlow and F. Proschan, Mathematical Theory of Reliability, John Wiley & Sons, New York (1965). 8. M. Pinedo, Scheduling Theory, Prentice Hall, New Jersey (2002). 9. T. Nakagawa and M. Kowada, Eur. J . Oper. Res., 12, 176 (1983). 10. T . Nakagawa, Maintenance and Optimum Policy, Handbook of Reliability Engineering (ed. H. Pham), Springer-Verlag, London, 367 (2003). 11. N. Kaio and S. Osaki, d . Oper. Res. SOC.,40, 499 (1989). 12. T. Nakagawa, J . of Appl. Prob., 23, 563 (1986). 13. T. Nakagawa, Imperfect Preventive Maintenance Models, stochastic Models in Reliability and Maintenance (ed. S . Osaki), Springer-Verlag, Berlin, 125 (2002). 14. T.Nakagawa, J . Oper. Res. SOC.Japan, 24, 325 (1981). 15. H. Pham, Software Reliability, Springer-Verlag, Singapore (2000). 16. R.Calabria, L. D. Ragione, G. Pulcini and M. Rapone, Pro. Ann. Relib. and Maintainability Symp. 366 (1993). 17. A. Masuda, J . Relzab. Eng. Assoc. Japan, 237 (2002). 18. J. Mie, IEEE Trans. Reliability, 44, 388 (1995).
This page intentionally left blank
NOTE ON AN INSPECTION DENSITY
T.NAKAGAWA Department of Management and Information Systems, Aichi Institute of Technology, 1247 Yachigwa Yagwa-cho, Toyota 470-0392, Japan E-mail: [email protected]
N. KAIO Department of Economic Informatics, Hiroshima Shudo University, 1-1 Ozukahigashi, Asaminami-ku, Hiroshima 731-3195, Japan E-mail: [email protected]
It has been generally well-known t h a t it would b e practically sufficient to calculate t h e approximate checking times for a n optimal inspection policy. Using an inspection density proposed by Keller, this paper suggests two approximate inspection policies and compare them numerically with t h e optimal one. Further, such a n inspection density is applied to t h e finite interval case, the inspection model with imperfect preventive maintenance and the checkpoint scheme for a database recovery.
1. Introduction
It would be greatly important forever since early times to inspect and detect any failures of units such as standby generators, computers, airplanes, plants and defense systems, and so on. Barlow and Proschan (1965) summarized the schedules of inspections which minimize the expected costs. After that, many papers have been published and were surveyed extensively in Kaio and Osaki (1984), Valdez-Flores and Feldman (1989), and Hariga and Al-Fawzan (2000). On the other hand, Keller (1974) defined n ( t )to be the density of the number of checks at time t , and showed a n easy method of computing approximate checking times. Kaio and Osaki (1984, 1988, 1989) investigated and compared in detail some nearly optimal inspection policies, and concluded that there are not significant differences among them and Keller’s computational method is the simplest. This paper summarizes the known properties of an inspection density and adds some new results. Using such properties for a sequential inspection policy, we get two approximate checking times and compare them numerically. Further, we show that an inspection density can be easily applied to the finite interval, the imperfect preventive maintenance and the checkpoint scheme.
363
364 2. Inspection Intensity
Define that a smooth density n ( t )is the number of checks per unit of time, which is called inspection density (Keller, 1974). Then, we investigate the properties of a function n(t): (1)iFrom the definition of n ( t ) , n(7-)d7represents the number of checks during the interval (0, t]. If a unit is checked at successive times 0 = x o < x1 < . . . < X k < . . . then we have the relation
Ji
JOXk
( k = O , 1 , 2 , . . . ).
n(7)d7=k
(2) iFrom the property ( I ) , we easily have
I:*"
n(.r)d7.=1
( k = O , 1 , 2 ) . . . ).
Thus, if n ( t )is a n increasing function o f t then
and
which gives the upper bound of xk+l for given x k . Further, if x1 is given then the next checking times are approximately computed by the recursive equation
Conversely, if n(t)is a decreasing function o f t then
+
(3) Suppose that the next checking time from any time t is t x. Then, since t is arbitrary, we put formally that the number of checks during ( t ,t x ] is l / 2 , i.e.,
+
Using the Taylor expansion in (7),
So that, we have approximately
x=-
1 2n(t)'
i.e., the next checking time from any time t is about 1/2n(t). Similarly, using the Taylor expansion in (2), it is shown that the next checking time when a unit was checked at time x k is about l / n ( x k ) , which is also derived from ( 5 ) .
3. Approximate Inspection Time
An operating unit fails according to a failure density f(t) and distribution F(t). Then, the failure rate is λ(t) = f(t)/F̄(t) and the cumulative hazard is Λ(t) = ∫_0^t λ(τ) dτ, where F̄(t) ≡ 1 − F(t), i.e., F̄(t) = e^{−Λ(t)}. It is assumed that a unit is checked at successive times 0 = x_0 < x_1 < ... < x_k < ... and its failure is detected only through checks. Let c1 be the cost of one check and c2 be the cost per unit of time elapsed between a failure and its detection at the next check. Then, using the above properties, we derive approximate checking times. Supposing that the next checking time is t + x when a unit fails at time t, the total expected cost is, from (8),

C(n(t)) = ∫_0^∞ [c1 n(t)F̄(t) + c2 f(t)/{2n(t)}] dt.    (9)
Denoting the integrand in (9) by

Q(n(t)) = c1 n(t)F̄(t) + c2 f(t)/{2n(t)},

and differentiating Q(n(t)) with respect to n(t) and setting it equal to zero, we have

n1(t) = √[c2 λ(t)/(2c1)].    (10)
It is easily shown that if λ(t) is increasing or decreasing, then n1(t) is increasing or decreasing, respectively. Therefore, the approximate checking times are, from (1) and (10),

∫_0^{x_k} √[c2 λ(t)/(2c1)] dt = k    (k = 1, 2, ...).    (11)
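As a numerical illustration, the following sketch solves (11) for the example used later in this paper (F(t) = 1 − e^{−(t/400)²}, c1 = 20, c2 = 1) by quadrature and root bracketing; scipy and the bracket width are implementation choices, not part of the method.

```python
# A minimal sketch of Keller's method in (11): the k-th checking time x_k solves
# int_0^{x_k} sqrt(c2*lam(t)/(2*c1)) dt = k.  The failure distribution
# F(t) = 1 - exp(-(t/400)^2) with c1 = 20, c2 = 1 follows the example of Table 1;
# the solver tolerances and bracket are illustrative choices.
from scipy.integrate import quad
from scipy.optimize import brentq

c1, c2 = 20.0, 1.0
lam = lambda t: 2.0 * t / 400.0**2               # failure rate of the Weibull-type F(t)
n1 = lambda t: (c2 * lam(t) / (2.0 * c1))**0.5   # inspection density in (10)

def checking_time(k, upper=5000.0):
    """Solve int_0^x n1(t) dt = k for x by bracketing the root."""
    g = lambda x: quad(n1, 0.0, x)[0] - k
    return brentq(g, 1e-6, upper)

for k in range(1, 6):
    print(k, round(checking_time(k), 1))   # 193.1, 306.5, 401.7, ... (Keller's column)
```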
(1) Suppose that a unit is checked at periodic times kx (k = 1, 2, ...) and the failure time is exponential, i.e., F(t) = 1 − e^{−λt}. Then, the expected cost is, from Barlow and Proschan (1965),

C(x) = (c1 + c2 x)/(1 − e^{−λx}) − c2/λ.    (12)

Differentiating with respect to x and setting it equal to zero, we have

e^{λx} − (1 + λx) = λc1/c2.    (13)

Hence, an optimal time x*, which minimizes C(x) in (12), is given by a finite and unique solution of equation (13). Further, since e^a ≈ 1 + a + a²/2 for small a, a solution to (13) is approximately

x = √[2c1/(λc2)].    (14)

Thus, replacing formally λ by λ(t) and √[λc2/(2c1)] by n1(t), (14) agrees with (11).
(2) We can extend the above results as follows. We have approximately e^{λx} = 1 + λx + (λx)²/2 + ε, where ε ≈ (λx)³/6. Substituting this into (13) and solving the equation for x imply

x = √[2c1/(λc2)] (1 − c2 ε/(2λc1)).    (15)

Further, putting ε = (λx)³/6 with x ≈ √[2c1/(λc2)] in (15), we have

x = √[2c1/(λc2)] − c1/(3c2).    (16)

From the above discussions, denoting an inspection density by

n2(t) = [√{2c1/(c2 λ(t))} − c1/(3c2)]^{−1},

the approximate checking times are given by

∫_0^{x_k} n2(t) dt = k    (k = 1, 2, ...).    (17)
Table 1 gives three checking times: x_k of Barlow's method, x̂_k of Keller's method in (11), and x̃_k of the approximate method in (17), together with the resulting expected costs C for k = 1, 2, ..., 15, when F(t) = 1 − e^{−(t/400)²}, c1 = 20 and c2 = 1. Unfortunately, the expected cost of the approximate method in this case is not less than that of Keller's method; however, its checking times approach those of Barlow's method as the checking number becomes large. It would be necessary to verify the usefulness of the approximate methods from several practical points of view.

4. Inspection Time for a Finite Interval
A unit has to be operating for a finite interval (0, S], i.e., the processing time of a unit is given by some specified value S (0 < S < ∞). Then, the total expected cost during the interval (0, S] is, from (9),

C(n(t)) = ∫_0^S [c1 n(t)F̄(t) + c2 f(t)/{2n(t)}] dt.    (18)
Differentiating C(n(t)) with respect to n(t) and setting it equal to zero, we have (10).
Table 1. Checking times x_k of Barlow's method, x̂_k of Keller's method, and x̃_k of the approximate method, when F(t) = 1 − e^{−(t/400)²}, c1 = 20 and c2 = 1.

k    Barlow's method x_k   Keller's method x̂_k   Approximate method x̃_k
1         220.2                 193.1                  188.31
2         328.8                 306.5                  297.06
3         418.6                 401.7                  387.57
4         498.2                 486.6                  467.90
5         571.1                 564.6                  541.40
6         638.9                 637.6                  609.85
7         702.9                 706.6                  674.36
8         763.6                 772.4                  735.66
9         821.7                 835.5                  794.30
10        877.5                 896.3                  859.66
11        931.1                 955.1                  905.03
12        982.7                 1012.1                 957.66
13        1032.2                1067.6                 1008.74
14        1079.2                1121.7                 1058.43
15        1123.0                1174.5                 1106.86

We compute approximate checking times x_k (k = 1, 2, ..., N) and the checking number N, where x_N = S. At first, we put

X = ∫_0^S n1(t) dt

and [X] = N, where [X] denotes the greatest integer contained in X. Further, we put A_N = N/X, i.e.,

A_N = N / ∫_0^S n1(t) dt,
and we define an inspection density as

n(t) = A_N n1(t).    (19)

Using (19), we compute the checking time x_k which satisfies

∫_0^{x_k} n(t) dt = k    (k = 1, 2, ..., N).    (20)
Then, the total expected cost C(N) is evaluated by (18) with n(t) in (19). Next, we replace N by N + 1 and do a similar computation. At last, we compare C(N) and C(N + 1), and choose the smaller one as the total expected cost.
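A minimal sketch of this finite-interval procedure follows: compute X = ∫_0^S n1(t) dt, set N = [X], rescale the density by A_N = N/X so that exactly N checks fit in (0, S], and solve (20) for the checking times; n1, S and the solver brackets are the assumptions of the running example.

```python
# Finite-interval checking times of Sec. 4 for the example with S = 500:
# N = [X] checks, density rescaled by A_N = N/X, then (20) is solved for x_k.
from scipy.integrate import quad
from scipy.optimize import brentq

c1, c2, S = 20.0, 1.0, 500.0
n1 = lambda t: (c2 * (2.0 * t / 400.0**2) / (2.0 * c1))**0.5   # density (10)

def checking_times(N, X):
    A = N / X                                    # A_N = N/X in (19)
    g = lambda x, k: A * quad(n1, 0.0, x)[0] - k
    return [round(brentq(g, 1e-6, S + 1.0, args=(k,)), 1) for k in range(1, N + 1)]

X = quad(n1, 0.0, S)[0]          # X = 4.1667 for S = 500
N = int(X)                       # N = [X] = 4
print(checking_times(N, X))      # 198.4, 315.0, 412.7, 500.0 (Table 2)
print(checking_times(N + 1, X))  # N + 1 = 5 with A_{N+1} = 6/5
# one would then evaluate the cost (18) for N and N+1 and keep the smaller one.
```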
Table 2. Checking times x_k, expected costs C and approximate checking times x̃_k.

S = 500:
k     N = 4     N = 5     x̃_k
1     198.4     171.0     175.9
2     315.0     271.4     310.6
3     412.7     355.7     411.9
4     500.0     430.9     500.0
5               500.0
C     98.4      101.0     100.0

S = 1000:
k     N = 12    N = 11    x̃_k
1     190.8     202.2     242.2
2     302.9     320.9     357.1
3     396.9     420.6     451.8
4     480.7     509.5     536.0
5     557.9     591.2     613.2
6     630.0     667.6     685.5
7     698.1     739.8     753.8
8     763.1     808.7     819.0
9     825.5     874.8     881.5
10    885.5     938.4     941.7
11    943.6     1000.0    1000.0
12    1000.0
C     116.4     116.2     116.4
Consider a numerical example with S = 500, 1000 and the other parameters the same as those of Sec. 3. Then, since n1(t) = √t/(800√5), we have N = 4 and A_N = 24/25 for S = 500. In this case, the checking times are given by

(24/25) ∫_0^{x_k} [√t/(800√5)] dt = k    (k = 1, 2, 3, 4).

Further, when N = 5, A_N = 6/5 and the checking times are given by

(6/5) ∫_0^{x_k} [√t/(800√5)] dt = k    (k = 1, 2, 3, 4, 5).
Table 2 shows the checking times and the resulting costs for S = 500, 1000. From this table, the optimal checking numbers are N* = 4, 11 for S = 500, 1000, respectively. It is noted that the checking times for S = 1000 are nearly those in Table 1. Further, the approximate checking times x̃_k are computed recursively from x_{k+1} = x_k + 1/n(x_k) in (5). The resulting expected cost is a little greater than that of the cases N = 4 and 11; however, these checking times are almost the same as the optimal ones. Thus, if S is large, it would be sufficient to compute the checking times x_1, x_2, ..., x_k recursively from equation (5), by setting that x_{k+1} = S when the recursion reaches the end of the interval.
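The x̃_k column of Table 2 can be reproduced by the recursion (5). A minimal sketch, under the assumption that the starting point x_1 is tuned by bisection so that the sequence ends exactly at S:

```python
# Recursive checking times via (5): x_{k+1} = x_k + 1/n1(x_k), with x_1 chosen
# so that x_N = S.  n1 is the density (10) of the running example; the number
# of checks (4 for S = 500) and the bisection bounds are assumptions.
n1 = lambda t: t ** 0.5 / (800.0 * 5.0 ** 0.5)      # sqrt(t)/(800*sqrt(5))

def times_from(x1, n_checks):
    xs = [x1]
    for _ in range(n_checks - 1):
        xs.append(xs[-1] + 1.0 / n1(xs[-1]))        # recursion (5)
    return xs

def solve_x1(n_checks, S=500.0, lo=1.0, hi=500.0):
    for _ in range(80):                             # x_N is increasing in x_1
        mid = (lo + hi) / 2.0
        lo, hi = (lo, mid) if times_from(mid, n_checks)[-1] > S else (mid, hi)
    return (lo + hi) / 2.0

print([round(x, 1) for x in times_from(solve_x1(4), 4)])
# approximately 175.9, 310.6, 411.9, 500.0 as in Table 2
```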
5. Imperfect Preventive Maintenance
Nakagawa (1980, 1984, 1988) considered the maintenance policy in which an operating unit is checked and maintained preventively at times x_k (k = 1, 2, ...), and the failure rate after maintenance reduces to aλ(t) (0 < a ≤ 1) when it was λ(t)
before maintenance. We apply an inspection density to the inspection model with preventive maintenance. Since the failure rate is a^{k−1}λ(t) for x_{k−1} < t ≤ x_k, the approximate checking times are, from (11),

∫_{x_{k−1}}^{x_k} √[c2 a^{k−1}λ(t)/(2c1)] dt = 1    (k = 1, 2, ...).

Next, consider a system with two types of units, where unit 1 is like new after every check; however, unit 2 does not become like new and is degraded with time (Ito and Nakagawa, 1995). That is, the system has the failure rate λ(t) = λ1(t − x_{k−1}) + λ2(t) for x_{k−1} < t ≤ x_k. In this model, the approximate checking times are

∫_{x_{k−1}}^{x_k} √[c2 {λ1(t − x_{k−1}) + λ2(t)}/(2c1)] dt = 1    (k = 1, 2, ...).
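A minimal sketch of the first of these rules, computing successive checking times interval by interval; the parameter values, the bracket width, and the displayed form of the integral equation follow the reconstruction above and are illustrative assumptions.

```python
# Imperfect-PM checking times: on (x_{k-1}, x_k] the failure rate is
# a^{k-1}*lam(t), so each interval contains one unit of the rescaled density,
# in the spirit of (2).  lam, a, c1, c2 and the bracket width are assumptions.
from scipy.integrate import quad
from scipy.optimize import brentq

c1, c2, a = 20.0, 1.0, 0.9
lam = lambda t: 2.0 * t / 400.0**2

def next_check(x_prev, k, width=2000.0):
    dens = lambda t: (c2 * a**(k - 1) * lam(t) / (2.0 * c1))**0.5
    g = lambda x: quad(dens, x_prev, x)[0] - 1.0
    return brentq(g, x_prev + 1e-6, x_prev + width)

xs, x = [], 0.0
for k in range(1, 6):
    x = next_check(x, k)
    xs.append(round(x, 1))
print(xs)   # intervals stretch since a < 1 lowers the post-PM failure rate
```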
6. Checkpoint Scheme

In a database system, the constitution and recovery techniques of files play an important role. Fukumoto et al. (1992) discussed checkpointing policies for rollback-recovery, which is one of the most general file recovery techniques. When files on a main memory are lost in failure, we reprocess transactions from the latest checkpoint instead of from the starting point of the system operation. Checkpoints are prespecified time points at which the information of the files is collected in a stable secondary storage. It is important to decide an effective checkpointing policy. Fukumoto et al. (1992) developed the checkpointing policy as the extension of the inspection one with the inspection density. In the following, checkpointing corresponds to inspection and a checkpoint to a checking time. The model is presented as follows: The k-th checkpointing is instantaneously executed at the checkpoint x_k (k = 1, 2, 3, ...; x_0 = 0). The rollback-recovery is executed instantaneously, the recovery action is always complete, and the system is restarted immediately. One cycle is defined as the interval from the start of the system operation to the restart on the recovery completion, and the cycle repeats itself continually. Furthermore, each parameter is defined as follows: λ_a is the arrival rate of an update transaction which is reprocessed in rollback-recovery, μ_s is the processing rate for transactions, where ρ = λ_a/μ_s ≤ 1, and a_s is the ratio of the overhead for checkpointing to the overhead for reprocessing of the update transactions. The cost attendant on checkpointing is c_c, the cost for checkpointing per unit of time is k_c, the cost attendant on rollback-recovery is c_r, and the cost for rollback-recovery per unit of time is k_r. The expected cost of one cycle is then obtained in terms of the checkpointing density n(t) in (1), where K_c ≡ k_c a_s ρ and K_r ≡ k_r ρ. Similarly, the approximate checkpoints can be derived in the same way as the checking times in Sec. 3.
7. Conclusions

We have derived two approximate sequences of checking times using the inspection density proposed by Keller, and compared them numerically with the optimal one. Further, we have shown that this method is easily applied to the finite interval, the imperfect preventive maintenance, and the checkpoint scheme. The method is very simple and could easily be brought into practice for any inspected system. It is of great interest that Keller introduced a smooth function for checking times, which are of discrete number. As further studies, it would be of interest in reliability theory to define replacement and preventive maintenance rates using the notion of an inspection density.
References
1. R. E. Barlow and F. Proschan, Mathematical Theory of Reliability, John Wiley & Sons, New York (1965).
2. S. Fukumoto, N. Kaio and S. Osaki, Optimal checkpointing policies using the checkpointing density, Journal of Information Processing, 15, 87-92 (1992).
3. M. Hariga and M. A. Al-Fawzan, Discounted models for the single machine inspection problem, in M. Ben-Daya, S. O. Duffuaa and A. Raouf (eds.), Maintenance, Modeling and Optimization, Kluwer Academic Publishers, Massachusetts, 215-243 (2000).
4. K. Ito and T. Nakagawa, An optimal inspection policy for a storage system with three types of hazard rate functions, Journal of the Operations Research Society of Japan, 38, 423-431 (1995).
5. N. Kaio and S. Osaki, Analytical considerations of inspection policies, in S. Osaki and Y. Hatoyama (eds.), Stochastic Models in Reliability Theory, Springer-Verlag, Berlin, 53-71 (1984a).
6. N. Kaio and S. Osaki, Some remarks on optimum inspection policies, IEEE Transactions on Reliability, R-33, 277-279 (1984b).
7. N. Kaio and S. Osaki, Inspection policies: Comparisons and modifications, R.A.I.R.O. Operations Research, 22, 387-400 (1988).
8. N. Kaio and S. Osaki, Comparison of inspection policies, Journal of the Operational Research Society, 40, 499-503 (1989).
9. J. B. Keller, Optimum checking schedules for systems subject to random failure, Management Science, 21, 256-260 (1974).
10. T. Nakagawa, Replacement models with inspection and preventive maintenance, Microelectronics and Reliability, 20, 427-433 (1980).
11. T. Nakagawa, Periodic inspection policy with preventive maintenance, Naval Research Logistics Quarterly, 31, 33-40 (1984).
12. T. Nakagawa, Sequential imperfect preventive maintenance policies, IEEE Transactions on Reliability, R-37, 295-298 (1988).
13. T. Nakagawa, S. Mizutani and N. Igaki, Optimal inspection policies for a finite interval, in S. Osaki and N. Limnios (eds.), The Second Euro-Japanese Workshop on Stochastic Risk Modelling for Finance, Insurance, Production and Reliability, Chamonix, 334-339 (2002).
14. C. Valdez-Flores and R. M. Feldman, A survey of preventive maintenance models for stochastically deteriorating single-unit systems, Naval Research Logistics, 36, 419-446 (1989).
AN IMPROVED INTRUSION-DETECTION MODEL BY PROFILING CORRELATED ACCESS DATA
H. OKAMURA, T. FUKUDA AND T. DOHI Graduate School of Engineering, Hiroshima University, 1-4-1 Kagamiyama, Higashi-Hiroshima 739-8527, JAPAN E-mail: {okamu, dohi}@rel.hiroshima-u.ac.jp

In this paper, we develop an intrusion-detection (ID) model which statistically detects malicious activities by profiling access data. In the proposed model, we consider two kinds of statistics, the long-term profile and the short-term profile, which are constructed from audit records of network accesses. In particular, by taking account of the correlation of metrics in the profiles, we improve the accuracy of detecting malicious activities such as DoS attacks. In numerical experiments, our model is evaluated against an ID model in which the correlation of profiles is not considered, through actual audit records.
1. Introduction

This paper describes statistical intrusion-detection (ID) models¹ to detect malicious activities by profiling access data. Statistical ID models are constructed with the aim of statistically detecting anomalies in audit records. Many malicious activities, such as denial-of-service (DoS) attacks and port scanning, always leave traces on the network traffic. Even unknown malicious activities leave traces on the network traffic without exception, because they are usually performed through the Internet. Thus ID models which detect anomalies statistically from audit records will be useful against unknown malicious activities. In the statistical ID models, the detection of anomaly is performed by comparing the stationary behavior of audit records with recent sampling records. That is, we test whether the recent records are sampled from the stationary probability distribution of audit records. The stationary distribution is usually unknown and is substituted by the empirical distribution, i.e., the frequency table of received audit records. This frequency table is called the long-term profile. On the other hand, the recent records also generate a frequency table, which is called the short-term profile. In the statistical ID models, the detection of anomaly is, in short, a comparison of the long-term and short-term profiles generated from audit records. Ultimately, ID models can be classified by the ways of generating and comparing profiles. In this paper, we first introduce an existing method of generating profiles and point out a theoretical problem with it. More precisely, by taking account of the correlation of several profiles, we improve the accuracy of detecting malicious activities such as DoS attacks. Furthermore, the comparison method of profiles is also improved from the viewpoint of statistical analysis. In numerical experiments, we evaluate the detection accuracy of the proposed models by comparison through actual audit records.

2. Statistical ID Model

2.1. Generating profiles
Profiles are historical records on network ports of targeted hosts and can be regarded as frequency tables in traditional statistical analysis. That is, during a prespecified time interval, when an audit record is received, we count the frequency in the corresponding class of the audit range. For instance, consider the profile on the number of inbound packets with three classes of range [0, 100), [100, 200) and [200, ∞). When an audit record on the number of inbound packets exhibits 55, the frequency in the first class, which has the range [0, 100), increases. However, in the case of usual frequency tables, the frequency table will overflow, because the frequency never decreases. Iguchi and Goto² thus use a method of generating profiles which is based on the NIDES statistical component developed by SRI International³. The method is efficient in terms of memory use. More specifically, let p_i denote the profile generated from the first i audit records. Given the frequency vector f_{i+1} on the (i+1)-st record, the profile is updated in the form

p_{i+1} = α p_i + f_{i+1},    (1)
where 0 < α < 1 is the decay rate for the profile. This formula provides a weighted mean of the audit records, and α is often called the aging rate. Essentially, in the algorithm, the memory required is only the current profile p_i and, in addition, the profile will not overflow if the sequence of audit records is stationary. Iguchi and Goto² consider the profiles on network ports, and select the following metrics to detect malicious activities from the audit records:

• Total duration time of a connection
• Total numbers of inbound/outbound packets received/sent through a connection
• Total numbers of inbound/outbound bytes received/sent through a connection
• Overall bytes-per-packet rate inbound/outbound through a connection

The terms inbound/outbound mean the respective directions from the targeted host. Of course, these are representative metrics to characterize the activities on a network. Iguchi and Goto² generate the profiles under the condition that all the metrics are independent. However, in fact, we observe several strong correlations, e.g., between the total number of inbound packets and the total number of inbound bytes. In this respect, the existing method does not function well, and should be improved from the statistical
points of view. As a result, we expect to guarantee high accuracy in detecting malicious activities. Consider the profiles with correlation of metrics. Suppose that there are m kinds of metrics on an audit record. The problem in Iguchi and Goto's² method is caused by using the frequency vector f_{i+1}. That is, the assumption that the frequency is given as a vector ignores the correlation of metrics in the profiles. Therefore, we assume that the frequency of metrics is given by a multidimensional matrix with m dimensions (an m-dimensional tensor). By using the improved frequency matrix, we can generate the profiles with the correlation. Let F_i and P_i denote the frequency and profile matrices, which are multidimensional matrices with m dimensions, respectively. Similar to the frequency vector, we count the frequency in the corresponding class of the audit range. In the existing argument, profiles are generated for the respective metrics, and an audit record is classified into a class of the corresponding profile by each metric. To the contrary, in our model, the audit records are classified by taking account of all the metrics. For example, suppose that an audit record arrives with 50 inbound packets and 2000 inbound bytes, and that the profile is divided into the classes with ranges [0, 100), [100, 200) and [200, ∞) on the number of packets, and ranges [0, 1000), [1000, 5000) and [5000, ∞) on the number of bytes. Then the corresponding class for 50 inbound packets and 2000 inbound bytes is the class of [0, 100) packets and [1000, 5000) bytes. On the update of profiles, we propose the following two formulas:

• Time-independent method: Similar to Eq. (1), update the profile with the decay rate 0 < α < 1:

P_{i+1} = α P_i + F_{i+1}.    (2)

• Time-dependent method: The profile is updated by the decay rate e^{−βt}. That is,

P_{i+1} = e^{−βt} P_i + F_{i+1},    (3)

where β (> 0) is the continuous decay rate and t is the elapsed time from the last update.

These formulas also provide weighted means of the audit records, so that the memory required is small and the profile does not overflow. Although the two formulas are quite similar, the difference becomes remarkable when no audit record is received for a long time. Lastly, we describe the long-term and short-term profiles. In the generating methods which use the decay rate, the difference between the long-term and short-term profiles appears in the decay rate. If the decay rate takes a small value, the recent audit records are highly weighted; this indicates that a profile generated with a low decay rate is regarded as the short-term profile. Thus, to generate both long-term and short-term profiles, we assume that α_l > α_s and β_l < β_s, where α_l and β_l correspond to the long-term profile, and α_s and β_s to the short-term profile.
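As a concrete illustration of (2) and (3), the following sketch maps an audit record to its joint class and applies both update rules; numpy, the function names, and the class boundaries and decay rates (taken from the examples in this paper) are our own implementation choices.

```python
# A minimal sketch of the two profile-update rules (2) and (3): an audit record
# is mapped to a joint class over all m metrics (here m = 2, packets and bytes),
# then the m-dimensional frequency/profile tensors are updated with decay.
import numpy as np
from bisect import bisect_right

packet_edges = [100, 200]      # classes [0,100), [100,200), [200,inf)
byte_edges = [1000, 5000]      # classes [0,1000), [1000,5000), [5000,inf)

def joint_class(packets, nbytes):
    return bisect_right(packet_edges, packets), bisect_right(byte_edges, nbytes)

def update_time_independent(P, record, alpha=0.9999):
    F = np.zeros_like(P)
    F[joint_class(*record)] = 1.0
    return alpha * P + F                      # Eq. (2)

def update_time_dependent(P, record, dt, beta=3.0e-5):
    F = np.zeros_like(P)
    F[joint_class(*record)] = 1.0
    return np.exp(-beta * dt) * P + F         # Eq. (3)

P = np.zeros((3, 3))
P = update_time_independent(P, (50, 2000))    # lands in class ([0,100), [1000,5000))
P = update_time_dependent(P, (150, 800), dt=10.0)
print(P)
```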
2.2. Statistical detection of anomaly
To compare the long-term and short-term profiles, a traditional statistical test is performed. For example, Iguchi and Goto² adopt the well-known chi-square test, which can be summarized as follows. Let l = (l_1, ..., l_n) and s = (s_1, ..., s_n) be the long-term and short-term profiles for a metric with n classes, respectively. Then the chi-square statistic is given by

χ_s² = Σ_{i=1}^{n} (s_i − e_i)²/e_i,    (4)

where

e_i = [l_i / Σ_{j=1}^{n} l_j] Σ_{j=1}^{n} s_j    (i = 1, ..., n).    (5)

This represents the squared error between the observed sample and its expected value. In the traditional way, the hypothesis test is performed. First, we set the null hypothesis:
• H0: The long-term and short-term profiles are frequency tables generated from the same probability distribution.
Under this hypothesis, it is well known that χ_s² given by Eq. (4) follows the chi-square distribution with n − 1 degrees of freedom. When the test is performed at the 0.05 level of significance, we compare χ_s² with χ²(n−1, 0.05), where χ²(n−1, 0.05) is the 0.95 percentile of the chi-square distribution with n − 1 degrees of freedom. If χ_s² < χ²(n−1, 0.05), then H0 is accepted; otherwise, H0 is rejected, i.e., the anomaly is detected. However, it is difficult to select the appropriate significance level for the detection of anomaly. Thus we use the so-called p-value instead of the level of significance. The p-value is defined as the probability that the null hypothesis is rejected, and therefore provides the probability of anomaly. Furthermore, since we take account of the correlation of metrics in the model, the chi-square statistic is just a little bit different from Eq. (4). Let L and S be the long-term and short-term profiles, which are multidimensional matrices with m dimensions, respectively. The i-th metric is divided into n_i classes, and the elements of L and S are given by l_{i_1,...,i_m} and s_{i_1,...,i_m}, respectively. In our modeling framework, the chi-square statistic is given by

χ_s² = Σ_{i_1=1}^{n_1} ... Σ_{i_m=1}^{n_m} (s_{i_1,...,i_m} − e_{i_1,...,i_m})²/e_{i_1,...,i_m},    (6)

where

e_{i_1,...,i_m} = [l_{i_1,...,i_m} / Σ_{j_1,...,j_m} l_{j_1,...,j_m}] Σ_{j_1,...,j_m} s_{j_1,...,j_m}.    (7)
Finally, the p-value, which is the probability of anomaly, is given by

p = [1/Γ(f/2)] ∫_0^{χ_s²/2} u^{f/2−1} e^{−u} du,    (8)
where Γ(·) is the standard gamma function. In Eq. (8), the parameter f denotes the degrees of freedom and is given by f = (n_1 × ... × n_m) − 1.
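A minimal sketch of the test in (6)-(8); scipy's chi-square distribution replaces the incomplete gamma integral, the small floor eps is an implementation safeguard against empty classes, and the toy profiles are illustrative assumptions.

```python
# Anomaly probability from the long-term (L) and short-term (S) profile tensors:
# chi-square statistic over the joint classes (6)-(7), then the rejection
# probability (8) with f = n1*...*nm - 1 degrees of freedom.
import numpy as np
from scipy.stats import chi2

def anomaly_probability(L, S, eps=1e-12):
    e = S.sum() * L / max(L.sum(), eps)               # expected frequencies (7)
    stat = np.sum((S - e) ** 2 / np.maximum(e, eps))  # chi-square statistic (6)
    f = L.size - 1                                    # (n1 * ... * nm) - 1
    return chi2.cdf(stat, f)                          # rejection probability (8)

rng = np.random.default_rng(1)
L = rng.poisson(50.0, size=(3, 3)).astype(float)   # toy long-term profile tensor
S_normal = 30.0 * L / L.sum()                      # short-term matching long-term
S_attack = np.zeros((3, 3)); S_attack[2, 2] = 30.0 # mass piled in one class
print(anomaly_probability(L, S_normal))   # close to 0: no anomaly
print(anomaly_probability(L, S_attack))   # close to 1: anomaly detected
```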
3. Evaluation Test

In this section, we evaluate the detection ability for malicious activities of the statistical ID model through actual access data. The access data are collected by an IP transaction audit tool called Argusᵃ. The targeted host is installed in the Department of Information Engineering, Hiroshima University, Japan. The audit records are collected from 9/2/2003 to 10/1/2003, and the total number of records is 27291. We select the following four metrics:

• The numbers of inbound/outbound packets received/sent through a connection
• The numbers of inbound/outbound bytes received/sent through a connection

In this experiment, we make the following scenario:
• Scenario: A malicious user instantaneously sends a lot of misused packets to the mail server. That is, the host undergoes a DoS attack.
Figure 1 depicts the records observed during the audit period; in particular, it shows the numbers of inbound and outbound packets for each audit record. In the figure, a pattern of DoS attack appears in the period from the 16303-rd record to the 20685-th record. To detect the DoS attack by the statistical ID model, we first define the ranges of classes for the metrics. Both the numbers of packets and of bytes are divided into three classes. On the numbers of inbound/outbound packets, we have [0, 10), [10, 20) and [20, ∞). On the numbers of inbound/outbound bytes, the classes are given by [0, 1000), [1000, 3000) and [3000, ∞). Thus, in our model, there are 3⁴ classes due to the combination of them. Profiles are generated by the two methods: the time-independent method (Method I) and the time-dependent method (Method II). To evaluate the improvement in accuracy, we compare our methods with the existing

ᵃ Argus Open Project - the network audit record generation and utilization system, http://www.qosient.com/argus/
Figure 1. The numbers of inbound and outbound packets.

Figure 2. The probability of malicious activities in Method I.
detection method of Iguchi and Goto² (Method III). In Method III, we use Eq. (1) to update the profiles. Also, in Methods I, II and III, the decay rates are given as follows:

• Method I: α_l = 0.9999 and α_s = 0.9998
• Method II: β_l = 3.0 × 10⁻⁵ and β_s = 1.2 × 10⁻⁴
• Method III: α_l = 0.99990 and α_s = 0.99988
Figure 3. The probability of malicious activities in Method II.

Figure 4. The probability of malicious activities in Method III.

Figures 2-4 demonstrate the probabilities of anomaly in Methods I, II and III, respectively. In these figures, the x-axis and the y-axis denote the number of an audit record and the probability of anomaly, respectively. When the probability of anomaly is higher, the probability that the host undergoes malicious activities is also higher. In the experiments, if the probabilities of anomaly were 1 during the period of the DoS attack and 0 in the other periods, the best detection would be achieved. In this respect, we find that Method II executes the most accurate detection. In order to evaluate the accuracy quantitatively, we introduce the false-positive and false-negative probabilities for the three methods. These probabilities are defined as follows:
• False-positive probability: the probability of misleadingly deciding that the host undergoes malicious activities when the host actually does not.
• False-negative probability: the probability of misleadingly deciding that the host does not undergo malicious activities when the host actually does.
In our experiments, the false-positive probability is estimated by the time average of the probabilities of anomaly during the period in which the host does not undergo the DoS attack. On the other hand, the false-negative probability cannot be directly computed from the probabilities of anomaly. Thus we consider the true-positive probability, which is defined as the probability of the correct decision when the host actually undergoes malicious activities. The true-positive probability is given by the time average of the probabilities of anomaly during the DoS attack. Finally, the false-negative probability is calculated by 1 − (true-positive probability).

Table 5. The false-positive and false-negative probabilities for three methods.

             false-positive prob.   false-negative prob.
Method I     0.1215                 0.0613
Method II    0.0222                 0.0442
Method III   0.1275                 0.3475

Table 5
presents the false-positive and false-negative probabilities for the three methods. From this table, we also find that Method II gives the most accurate results in terms of the detection ability for the anomaly. Furthermore, Method III provides low detection accuracy. Therefore, we conclude that the detection of anomaly can be improved by considering the correlation of metrics.
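Under the stated estimation scheme, the false-positive and false-negative probabilities can be computed from the sequence of anomaly probabilities as sketched below; the synthetic probability array stands in for the model output and is purely illustrative.

```python
# Evaluation in the spirit of Table 5: given per-record anomaly probabilities
# and the known attack window (records 16303-20685), the false-positive
# probability is the time average outside the attack and the false-negative
# probability is 1 minus the average inside it.
import numpy as np

def fp_fn(prob, attack_start, attack_end):
    attack = np.zeros(len(prob), dtype=bool)
    attack[attack_start:attack_end + 1] = True
    fp = prob[~attack].mean()            # average anomaly prob. without attack
    fn = 1.0 - prob[attack].mean()       # 1 - (true-positive probability)
    return fp, fn

rng = np.random.default_rng(0)
prob = rng.uniform(0.0, 0.1, 27291)                        # quiet background
prob[16303:20686] = rng.uniform(0.9, 1.0, 20686 - 16303)   # DoS attack window
print(fp_fn(prob, 16303, 20685))     # small fp and fn for a good detector
```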
4. Conclusions

In this paper, we have considered statistical ID models to detect malicious activities based on the profiles of network ports. In particular, by taking account of the correlation of several metrics in the profiles, we have improved the accuracy of detecting anomalies. In the numerical experiments, our methods were compared with the existing method. As a result, the profiles generated with the correlation of metrics can provide more accurate detection of anomalies, so it can be concluded that our model is effective for detecting malicious activities on a network. In the future, we will develop a method to generate profiles which include the information of accessed ports. Moreover, Bayesian networks will be applied to the detection methods by considering many factors which are caused by malicious activities.
Acknowledgments

This research was partially supported by the Ministry of Education, Science, Sports and Culture: Grant-in-Aid for Young Scientists (B), Grant No. 15700060 (2003-2004) and Exploratory Research, Grant No. 15651076 (2003-2005).
References
1. D. E. Denning, An intrusion-detection model, IEEE Transactions on Software Engineering, SE-13, pp. 222-232, 1987.
2. M. Iguchi and S. Goto, Detecting malicious activities through port profiling, IEICE Transactions on Information and Systems, E82-D, pp. 784-792, 1999.
3. H. S. Javitz and A. Valdes, The NIDES statistical component: description and justification, Technical Report, SRI International, California, 1994.
DEPENDENCE OF COMPUTER VIRUS PREVALENCE ON NETWORK STRUCTURE - STOCHASTIC MODELING APPROACH
H. OKAMURA, H. KOBAYASHI AND T. DOHI Department of Information Engineering, Graduate School of Engineering, Hiroshima University, 1-4-1 Kagamiyama, Higashi-Hiroshima 739-8527, JAPAN E-mail: {okamu, dohi}@rel.hiroshima-u.ac.jp

Computer virus prevalence is a severe problem in the Internet. Recently, several researchers have devoted themselves to analyzing the phenomenon of computer virus prevalence. Kephart and White (1991, 1993) propose the concept of the Kill Signal, which is a warning signal of the influence of computer viruses, and analyze the temporal behavior of computer virus prevalence by using ordinary differential equations. However, a deterministic model based on differential equations cannot distinguish differences of computer virus prevalence depending on the network structure of terminals. In this paper, we develop a stochastic model to evaluate computer virus prevalence. The proposed model focuses on the infection of each terminal, and can evaluate the dependence of viral prevalence on network structures. We reveal quantitative characteristics of network structures on computer virus prevalence.
1. Introduction
The Internet plays an important role in information technology. Because of the growth of the Internet, we can communicate easily with a large number of users on the Internet. On the other hand, the Internet gives us some social problems. Among them, computer virus prevalence is the most severe problem. The damage caused by computer viruses is growing day by day, and their activities are becoming more and more malicious, such as Denial of Service attacks, etc. In the research area on computer virus prevalence, there are two main topics: security issues and assessment. The objectives of security issues are, in short, to develop security systems which can prevent the influence of computer viruses and to reduce the damage caused by them. Okamoto and Ishida¹ develop an anti-virus system which can remove computer viruses autonomously. Badhusha et al.² discuss the effectiveness of updating virus pattern files in an anti-virus system. On the other hand, in the area of assessment of computer viruses, several researchers have devoted themselves to analyzing the phenomenon of computer virus prevalence. Thimbleby et al.³ develop a computer virus model based on the Turing machine, and describe the characteristics of the computer virus qualitatively. Kephart and White⁴,⁵ propose the concept of the Kill Signal, which means a warning signal of the influence of computer viruses. They analyze the temporal behavior of computer virus prevalence by using a deterministic model based on ordinary differential equations. Similar analyses based on differential equations are made by Kephart⁶ and Toyoizumi and Kara⁷, who also propose various types of computer virus models. For instance, Toyoizumi and Kara⁷ introduce the idea of predators which can combat malicious computer viruses. The analysis based on differential equations is essentially deterministic. In fact, since computer virus prevalence involves some uncertain factors arising in prevalence and removal, the deterministic models have a limitation in describing the behavior of computer virus prevalence. Computer virus models taking account of the probabilistic behavior are developed by Wang et al.⁸, Billings et al.⁹, Wierman et al.¹⁰ and Kobayashi et al.¹¹. Wang et al.⁸ simulate the computer virus prevalence and evaluate the security policy based on the simulation. Also, Kobayashi et al.¹¹ propose a stochastic model based on the continuous-time Markov chain to represent the computer virus prevalence, and characterize the quantitative properties of computer viruses by measures derived from the model. However, in both the deterministic and stochastic models mentioned above, the difference of viral prevalence depending on the network structure of terminals cannot be represented, due to the assumption that any terminal connects to all the terminals. In this paper, we develop a stochastic model to evaluate computer virus prevalence. The proposed model focuses on the infection of each terminal, and can investigate the dependence of viral prevalence on network structures. Based on the proposed computer virus model, we reveal quantitative characteristics of network structures on computer virus prevalence.

2. Computer virus model
2.1. Kephart and White (KW) model

Kephart and White⁴,⁵ develop a computer virus model based on ordinary differential equations, and introduce the concept of the Kill Signal (KS). KS can be regarded as a warning signal for the influence of computer virus. For example, consider the situation where a terminal is infected with the computer virus. When the terminal cleans and removes the computer virus, the terminal is in general never infected with the same virus again; that is, it has immunity to the computer virus. Also, when the computer virus is removed, the infected terminal sends warning signals to its neighbors. If the neighbors have already been infected with the computer virus, the computer viruses can be removed after receiving the warning signal. Kephart and White⁵ define KS as the immunity to computer virus and, at the same time, the warning signal on influence. Let n(t) and m(t) denote the number of infected terminals and the number of terminals with KS at time t, respectively. Kephart and White⁵ describe the temporal behavior of viral prevalence by the following differential equations:
dn(t)/dt = β n(t){K − n(t) − m(t)} − δ n(t) − β_r n(t) m(t),    (1)

dm(t)/dt = β_r m(t){K − m(t)} + δ n(t) − δ_r m(t),    (2)
where K is the total number of terminals, β is the infection rate of computer virus, δ is the removal rate of computer virus, β_r is the spread rate of KS and δ_r is the disappearance rate of KS. This virus model by Kephart and White⁵ can, of course, represent the temporal behavior of viral prevalence, but it gives just an average trend of computer virus prevalence. Note that the viral prevalence is in fact random and uncertain. In addition, since their model implies the rather strong assumption that any terminal connects to all the terminals, we cannot analyze the actual computer virus prevalence depending on network structure by using it. Taking account of the adjacency of terminals, we develop a stochastic model to represent the computer virus prevalence.

2.2. Stochastic model with adjacency
To represent the spread of computer virus in detail, we develop a stochastic model of computer virus with KS. First of all, we consider a simple example consisting of three terminals, connected as in Figure 1. Let π_A(t), π_B(t) and π_C(t) be the probabilities of infection for the respective terminals at time t. To simplify the analysis, it is assumed that KSs are not sent to the terminals at all. Under this assumption, we derive the probability of infection in Terminal A at time t + Δt, namely, π_A(t + Δt). Since no terminal sends KSs to the others, Terminal A remains infected at t + Δt whenever it is infected with computer virus at time t. On the other hand, if Terminal A is not infected at time t, the probability of infection in Terminal A at time t + Δt depends on the infection in Terminals B and C. For example, in the situation where both Terminals B and C have already been infected, let p denote the probability that an infected terminal infects another terminal. Then the conditional probability of infection in Terminal A is given by 1 − (1 − p)². Similarly, by taking account of all the cases of infection in Terminals A, B and C, we can derive
π_A(t + Δt) = (1 − q)π_A(t) + (1 − π_A(t)){p π_B(t)(1 − π_C(t)) + p(1 − π_B(t))π_C(t) + (1 − (1 − p)²)π_B(t)π_C(t)},    (3)
where q is the removal probability of computer virus. Hence it is immediate to obtain nA(t
+ At) - n A ( t )
= -qnA(t)
+ (1 - T A ( t ) ) { p r B ( t )+ p r C ( t )
-p2xC(t)nB(t)}.
(4)
Suppose that the probabilities of infection and removal of computer virus are proportional to the time difference. That is, define p = /?At and q = b a t . By taking
382 Connection
Terminal B
Connection
Terminal A
Terminal C
Figure 1. Simple network configuration.
At
--t
0, we get the differential equation which governs the infection:
d
-7TA(t) dt
= -bTA(t)
+ p { T B ( t ) + zC(t)}(l - r A ( t ) ) .
(5)
We can derive the differential equations gvering the other probabilities as follows:
Roughly speaking, the differential equations for infection probabilities are composed with the sum of infection probabilities in directly connected terminals. Next, we expand the above result to a general case. Let v(t)be the column vector of probabilities, with the i-th component representing the probability that the i-th terminal is infected with computer virus at time t . Similarly, let p ( t ) be the column vector of probabilities, with the i-th component representing the probability that the i-th terminal keep KS at time t . To represent connectivity structure of the terminals, define the adjacency matrix C . When the i-th terminal connects to j - t h terminal, the (i,j)-element in C is given by 1. Then we derive the following differentia3 equations: d v( t ) - P{1 - v ( t )- p ( t ) } T C v ( t )- 6v(t)- P r v ( t ) T C p ( t ) ,
dt
where 1 is the column vector of 1s and T denotes the transpose of vector. Solving these differential equations under initial conditions v(0) = Y O and p(0) = po, we obtain the expected number of infected terminals and the expected number of terminals receiving KS as v(t)'l and ~ ( t ) ~respectively. l , Figure 2 shows the behavior of number of infected terminals in KW and the proposed model, where the adjacency matrix is given by 0 1 .'. 1 1 1 0
C= 1 0 1 1 1 ..' 1 0
383
Figure 2. Comparison between the KW model and the proposed model (high connectivity case; K = 5, p = 0.2, 6 = 0.2, 9,.= 0.0, 6, = 0.0).
Figure 3. Comparison between the KW model and the proposed model (low connectivity case; = 0.0). K = 5, p = 0.2, 6 = 0.2, flT = 0.0,
This means that all the terminals are connected t o each other. In this case, we can find just a little difference in the number of infected terminals, by comparing both models. Figure 3 also illustrates the time-dependent property of number of infected terminals. In this figure, the adjacency matrix is given by
c=
[ I: :
I).
0 1 0 0 ... 0 0
It can be found from Figure 3 that the expected number of infected terminals in our model is less than the number in the KW model. This is because the network connectivity in the proposed model becomes lower by changing the adjacency matrix. Consequently, the network structure, namely, the adjacency of terminals, strongly affects the computer virus prevalence, but the KW model cannot represent
384
Figure 4. Configuration of tree structure.
Figure 5.
Configuration of lattice structure.
the difference depending on network structures at all. 3. Computer virus prevalence on different network structures
We investigate the dependence of computer virus prevalence on network structures, by comparing two differently configurated network structures. Here we focus on two specific structures: tree structure and lattice structure. Figures 4 and 5 depict the configuration of these two network structures under consideration. In both figures, the circles (nodes) denote the terminals and, in particular, the filled circles represent the terminals infected with computer virus. In the tree structure, the root node is connected to 4 child nodes. Since each child node has 5 grandchild nodes, there are totally 25 nodes which correspond to terminals. In the lattice structure, one node has usually 4 adjacent nodes. For instance, for 5 x 5 lattice, there are the same number of terminals as in Figure 4. Furthermore, we assume the following parameters: the infection rate of computer virus p = 0.2, the removal rate of computer virus 6 = 0.2 or 0.8, the spread rate of KS p, = 0 and the disappearance rate of KS 6, = 0. In this example, the parameters related to KS are assumed to be pT = 0 and 6,. = 0, so that the activity of KS is simply limited to the immunity from computer virus because our main concern is the computer viral prevalence on different network structures. Figures 6 and 7 show the numbers of infected terminals in both tree and lattice structures. More precisely, the expected number of infected terminals in Figure 6 is calculated by solving the differential equations (8) and (9) numerically. The cumulative number of infected terminals in Figure 7 is numerically calculated as the sum of the number of infected terminals in Figure 6, say, v ( t ) T l d t . From these results in both cases with 6 = 0.2 and 6 = 0.8, the number of infected terminals in the tree structure is less than that in the lattice one. In the case where the removal rate is relatively low, namely, the computer viruses widely spread, the difference between tree and lattice structures becomes remarkable. This result is due to the
sow
385
Figure 6.
The number of infected terminals on both tree and lattice structures.
I
Figure 7.
I
The cumulative number of infected terminals on both tree and lattice structures.
fact that the viral prevalence on the tree network is slower than that on the lattice network. 4. Conclusion
In this paper, we have developed a stochastic model to investigate the dependence of viral prevalence on network structures of terminals, and have compared two different network structures. In the proposed model, we have introduced the probability of infection on each terminal, so that we have described the computer virus prevalence by applying the adjacency matrix which represents the connectivity structure for all terminals. In numerical examples, we have compared with different two network structures; tree structure and lattice structure, as typical network topologies. It is concluded that, for the tree network structure, the viral prevalence is restrained. This is because the connectivity of tree network is less than that of lattice network. In other words, it takes long time for a terminal infected by computer virus t o influence all terminals in the network. However, this result also implies that it takes long time even for KS spreading.
386 In future, we will perform the sensitivity analysis of computer virus prevalence on various network structures. In particular, the effective security policy for computer virus prevalence will be developed by estimating the influence of one node in terms of network safety.
Acknowledgments This research was partially supported by the Ministry of Education, Science, Sports and Culture: Grant-in-Aid for Young Scientists (B), Grant No. 15700060 (20032004) and Exploratory Research, Grant No. 15651076 (2003-2005).
References 1. Okamoto, T. and Ishida, Y.: A distributed approach to computer virus detection and neutralization by autonomous and heterogeneous agents, Proceedings of the 4th International Symposium on Autonomous Decentralized Systems, pp. 328-331 (1999). 2. Badhusha, A., Buhari, S., Junaidu, S. and Saleem, M.: Automatic signature files update in antivirus software using active packets, Proceedings of the ACS/IEEE International Conference on Computer Systems and Applications, pp. 457-460 (2001). 3. Thimbleby, H., Anderson, S. and Cairns, P.: A framework for modelling Trojans and computer virus infection, The Computer Journal, Vol. 41, No. 7, pp. 445-458 (1998). 4. Kephart, J. 0. and White, S. R.: Directed-graph epidemiological models of computer viruses, Proceedings of the 1991 IEEE Computer Society Symposium on Research in Security and Privacy, pp. 343-359 (1991). 5 . Kephart, J. 0. and White, S. R.: Measuring and modeling computer virus prevalence, Proceedings of the 1993 IEEE Computer Society Symposium on Research in Security and Privacy, pp. 2-15 (1993). 6. Kephart, J. 0.: A biologically inspired immune system for computers, Proceedings of International Joint Conference on Artificial Intelligence, pp. 20-25 (1995). 7. Toyoizumi, H. and Kara, A.: Predators: good will codes combat against computer viruses, presented at ACM SIGSAC New Security Paradigms Workshop (2002). 8. Wang, C.: Knight, J. C. and Elder, M. C.: On computer viral infection and the effect of immunization, Proceedings of 16th Annual Computer Security Applications Conference, pp. 246-256 (2000). 9. Billings, L.; Spears, W. M .and Schwartz: I.B.: A unified prediction of computer virus spread in connected networks, Physics Letters A , Vo1.297, pp. 261-266 (2002). 10. Wierman, J . C. and Marchette, D. J.: Modeling computer virus prevalence with a susceptible-infected-susceptible model with reintroduction, Computational Statistics and Data Analysis, Vo1.45, pp. 3-23 (2004). 11. Kobayashi, H.: Okamura, H. and Dohi, T.: Characteristic analysis of computer viruses by stochastic models (in Japanese), Journal of IPSJ: Vo1.45, No.5 (2004).
OPTIMAL INSPECTION POLICIES WITH AN EQUALITY CONSTRAINT BASED ON THE VARIATIONAL CALCULUS*
T. O Z A K I ~ T. , DOHI+ AND N. KAIO* Department of Information Engineering, Hiroshima University 1-4-1 Kagamiyama, Higashi-Hiroshima 739-8527, Japan I Department of Economic Informatics, Hiroshima Shudo University 1717 Ohtsuka, Numata-Cho, Asa-Minami-Ku, Hiroshima 731-3195, Japan E-mail: [email protected]. ac.jp / kaio @shudo-u.ac.jp
In this paper we consider inspection problems with an equality constraint over infinite/finite time horizons, and develop approximate algorithms to determine the optimal inspection sequence based on the variational principle. More precisely, the inspection problems with an equality constraint are transformed t o non-constraint problems with the Lagrange multiplier, and are solved by the familiar variational calculus method. Numerical examples are devoted to derive the optimal inspection sequence.
1. Introduction
The inspection policy can be applied t o detect correctively a system failure occurred during the system operation. Barlow e t al. give the mathematical framework to determine the optimal inspection sequence which minimizes the expected operation cost as the sum of inspection cost and system down (penalty) cost. Since their inspection algorithm is difficult for use and is not always stable for computing the inspection sequence numerically, several approximate methods are developed in the past literature. Among them, Keller proposes an approximate method in which the mean number of inspections per unit time is described by a continuous function called the inspection density. More specifically, by the methods of calculus of variations, he finds the optimal inspection density minimizing the expected operation cost and derives the optimal inspection sequence based on it. Kaio and Osaki point out the problem for Keller model and reformulate the same problem. This model is quite simple but can be used potentially to some real examples. For instance, Fukumoto e t al. 4 , Ling e t al. apply the inspection model to place the optimal checkpoint for a file system. In this paper, we consider the different inspection problems with an equality constraint over infinitelfinite time horizons, e.g. like a case where personnel costs *This work is supported by the Grant 15651076 (2003-2005) of Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Exploratory Research, and the Research Program 2004 under the Institute for Advanced Studies of the Hiroshima Shudo University, Japan.
387
388 for inspectors are needed. And we develop approximate algorithms t o determine the optimal inspection sequence based on the variational principle. More precisely, the inspection problems with an equality constraint are transformed t o non-constraint problems with the Lagrange multiplier, and are solved by the familiar variational calculus method. The infinite-time horizon problem with a n equality constraint is first formulated by Yamada and Osaki '. However, their formulation involves the same problem as Keller 2 , it has t o be improved along the similar line t o Kaio and Osaki '. Also, we extend the infinite-time horizon model t o the finite-time one. Viscolani develops the variational calculus approach for the optimal inspection problem with finite-time horizon. We also use Viscolani's idea t o formulate the inspection problems with a n equality constraint. Numerical examples are devoted to derive the optimal inspection sequence. Here, we perform the sensitivity analysis of model parameters on the optimal inspection policy and its associated expected cost. 2. Basic Optimal Inspection Policy 2.1. Barlow, Hunter and Proschan Model
Consider a single unit system with sequential inspection over an infinite time horizon. The system operation is started at time t = 0, and the inspection is sequentially executed at time {tl, t 2 , . . . ,t,, . . . }. At each inspection, t j ( j = 1 , 2 , .. .), the condition of the system is inspected immediately, where the cost co (> 0) is needed for each inspection. Failure may occur according t o an absolutely continuous and non-decreasing probability distribution function F ( t ) having density function f ( t ) and finite mean 1 / p (> 0). Upon a failure, it can be detected at only the inspection time performed after it occurred. Since the time period from the failure t o the inspection time is system down time, the penalty cost L ( . ) , which depends on the length of system down time, is incurred. It is assumed with no loss of generality that the function L ( . ) is differentiable and increasing. Then, the problem is t o derive the optimal inspection sequence t, = { t l ,t 2 , t 3 . . . } minimizing the expected cost function:
where t o = 0. Barlow et al. show that the optimal inspection sequence t, is a non-increasing sequence, i e . , tl < t 2 < t 3 < . . . , if the failure time distribution F ( t ) is PF2 (P6lya frequency function of order 2). If there exists the optimal inspection sequence t;, then it must satisfy the following first order condition of optimality:
Since the problem in Eq.(l) is regarded as a nonlinear programming problem, the quasi-Newton method can be applied t o calculate tT, = { t ; ,t;, ' . . } numerically.
389 Nevertheless, it is not so easy to calculate the optimal inspection sequence with higher accuracy, because the solution based on the quasi-Newton method strongly depends on the initial value t l , and the computation algorithm is quite unstable. This fact motivates to develop approximate methods to calculate the optimal inspection sequence. 2.2. Variational Calculus Approach Keller proposes an approximate method t o derive the optimal inspection sequence, based on the variational calculus approach. Let D ( t ) be an absolutely continuous function of time t and denote the number of inspections placed per unit time, where l / D ( t ) means the mean time interval between successive inspections. We call D ( t ) the inspection density in this paper. The expected cost associated with the inspection and the system down are approximately given by cg t D ( z ) d z and
so
r D (~, t)-’
respectively. Define
X ( t )= Keller
L(z)dz % L ( { 2 D ( t ) } - l ) ,
(3)
Ju
(4)
t
D(z)dz, t
2 0.
considers the variational problem to seek the optimal inspection density
with respect to X ( t ) and further derives the optimal inspection density D ( t ) = X’(t) = d X ( t ) / d t . On the other hand, Kaio and Osaki revisit the Keller’s method and give the easier variational problem:
Once the optimal inspection density D * ( t )is obtained, then the suboptimal inspection sequence t& = {tT,tf,. . . } can be calculated, so as t o satisfy the following equation:
3. Optimal Inspection Policies with an Equality Constraint 3.1. Yamada and Osaki Model Yamada and Osaki consider a just a little bit different inspection problem. In the traditional Barlow, Hunter and Proschan problem in Eq. (l),the tradeoff relationship between the inspection cost and the system down cost is taken into account. In actual maintenance practice, however, we are not always aware such a tradeoff
390 relationship. For example, when the number of inspectors is fixed, the allowable level of inspection cost should be fixed. In such a case, the problem should be formulated as the minimization problem of expected system down cost subject t o the constant expected inspection cost. Conversely, we can consider the minimization problem of expected inspection cost subject to the constant expected system down cost. Yamada and Osaki formulate the former problem based on the Keller’s method as
L ( 1/ 2 x / (t)) dF( t ) coX(t)dF(t) = 6,
s.t.
where 6 is a constant and denotes the allowable inspection cost level. In the literature [6], the above problem is solved with the Keller’s original method. For better understanding the problem, we reformulate it in the following:
By introducing the Lagrange multiplier y, the minimization problem with an equality is rewritten as
Then the Euler equation of this problem is given by
Solving the Euler equation with respect t o D(t) yields
where r ( t ) = f ( t ) / F ( t )is the grange multiplier y in Eq.(12) straint in Eq.(9). As a special case, consider the following result is same as
failure rate and L’(t) = d L ( t ) / d t . Finally, the Lacan be determined so as to satisfy the equality conthe linear cost structure L ( t ) = aot Yamada and Osaki ‘.
(a0
> 0). Then
Proposition 3.1: Suppose that L ( t ) = aot. Then the suboptimal inspection sequence is given by t& = { t ; ,t;, . . . }, where rt,
391 and
(14)
4. Further Results Next, consider the minimization problem of expected inspection cost subject to the constant expected system down cost:
where 0 (> 0) is a constant. With the Lagrange multiplier, it is seen that the problem in Eq.(15) is rewritten as
In a fashion similar to Eq.(12), we obtain
where y is determined from Eq.(15).
Theorem 4.1: For the problem in Eq.(15), suppose that L ( t ) = aot. Then the inspection density of the optimal inspection density is given by
Lemma 4.1: Let D6(t) and De(t) be the optimal inspection densitys in Eq.(14) and (17), respectively. Then the solutions for two problems in Eqs. (8) and (15) are same if
Lemma 4.2: Let K6(D6(t))and Ko(Dg(t)) be the minimum expected costs for ) Ke(Do(t))if two problems in Eqs. (8) and (15), respectively. Then, K s ( D & ( t )=
6
= 0.
Theorem 4.2: Two minimization problems in Eqs. (8) and (15) are equivalent if
392 so that
+
is the solution of the problem minoct){Kb(D(t)) K e ( D ( t ) ) } .
5. Finite-Time Horizon Problem
Next, we consider the finite-time horizon problem which is a natural extension of the infinite-time horizon problem given in Eq.(8) or (15). Suppose that the time horizon of operation for the system is finite, say T (> 0). For a finite sequence tN = { t l ,t2,. . . , t ~ }the , optimal inspection problem is formulated as
where N is the integer satisfying N = min{n : tn+l > T } . To simplify the notation, we define t N + l = T in this paper. From Eq.(22), the approximate problem based on the variational calculus approach is given by
1
rT
min
1
X(t)
s.t.
L(l/2X'(t))dF(t) c o X ( t ) d F ( t )= 6.
Applying the Lagrange multiplier method, we have
Similar to Eq.(12), the optimal inspection density D*(t) is given by
as the solution of the Euler equation:
where ,B is a constant satisfying X ( t ) = N .
Theorem 5.1: Suppose that L ( t ) = aot. Then the optimal inspection density with finite-time horizon for the problem in Eq.(22) is given by
393 Table 1. Dependence of shape parameter on the expected cost with infinite-time horizon: ao = 1, co = 1, 7 = 30.
m 1.o
1.1
6=2 expected cost 7.5000 7.2105 6.9636 6.7462 6.5503 6.3708
6=3 expected cost 5.0000 4.8070 4.6424 4.4975 4.3669 4.2472
6=4 expected cost 3.7500 3.6053 3.4818 3.3731 3.2752 3.1854
where ,Ll is determined so as t o satisfy X ( t ) = N .
so T
In Theorem 5.1, for an arbitrary N , we seek P so as to satisfy N = D*(z)dz. For all possible combinations of N , we calculate all ,Ll satisfying p > F ( T ) ,the optimal number of inspections N * , and the corresponding optimal inspection density D *( t ) . On the other hand, the minimization problem with constant system down cost level 0 is formulated as min
J
T
coX(t)dF(t)
iT
X(t) 0
s.t.
L(1/2X'(t))dF(t) = 8.
(28)
Theorem 5.2: Suppose that L ( t ) = sot. Then the optimal inspection density with finite-time horizon for the problem in Eq.(28) is given by
6. Numerical Examples We calculate numerically the optimal inspection sequence and the corresponding expected operating cost. Suppose that the failure time distribution obeys the Weibull distribution:
F ( t ) = 1 - e-(+)7''
(30)
with shape parameter m ( 2 1) and scale parameter 77 (> 0). Then, it can be seen l / m ) , where I?(.) that the MTTF (mean time to faulue) is given by l/p = $'(l denotes the standard gamma function. When rn 2 1 then the failure time distribution is IFR and the optimal inspection policy for Barlow, Hunter and Proschan model in Eq.(l) is given by a non-increasing sequence. Table 1 presents the dependence of shape parameter on the expected operating cost with infinite-time horizon. When the shape parameter increases, the minimum expected operating cost for the problem in Eq.(9) monotonically decreases.
+
394 Table 2. Dependence of shape parameter and CP cost restriction on the minimum expected recovery cost with finite-time horizon (T = 30): ao = l , y , = l , q = 30.
Table 3. Dependence of shape parameter on the expected cost with finite-time horizon (T=60): a o = l , c o = l , q = 3 0 . s=3
6=2
rn 1.o
6=4
expected cost
p
expected cost
p
expected cost
p
5.6865 5.7669
0.8990 0.9301
3.7390 3.9210
0.9816 0.8957
2.8834 2.8834
0.9301
1.1
On t h e other h a n d , Tables 2 a n d 3 show t h e minimum expected operating cost a n d i t s associated parameter p i n finite-time horizon cases with T = 30 a n d T = 60. It c a n be observed that the optimal inspection policy with an equality constraint does not always exist, where NG implies that no inspection policy with the equality constraint exists. As a remarkable difference from the infinite-time horizon case, t h e expected operating cost does not always increase as m becomes larger (see Table
3). References 1. R. E. Barlow, L. C. Hunter and F. Proschan, Optimum checking procedures, Journal of Society for Industrial and Applied Mathematics 11, 1078-1095 (1963). 2 . J. B. Keller, Optimum checking schedules for systems subject t o random failure, Management Science 21, 256-260 (1974). 3. N. Kaio and S. Osaki, Some remarks on optimum inspection policies, ZEEE Pansactions on Reliability R-33,277-279 (1984). 4. S. Fukumoto, N. Kaio and S. Osaki, Optimal checkpointing strategies using the checkpointing density, Journal of Information Processing 15,87-92 (1992) 5. Y . Ling, J. Mi and X. Lin, A variational calculus approach to optimal checkpoint placement, IEEE Transactions on Computers 50, 699-707 (2001). 6. S. Yamada and S. Osaki, Optimum number of checks in checking policy, Microelectronics and Reliability 16,589-591 (1977). 7. B. Viscolani, A note on checking schedules with finite horizon, R.A.I.R.0.-Operations Research 25, 203-208 (1991).
OPTIMAL IMPERFECT PREVENTIVE MAINTENANCE POLICIES FOR A SHOCK MODEL
C. H. QIAN
College of Management Science and Engineering, Nanjing University of Technology, 200 Zhongshan Road North, Nanjing, 210009, China E-mail: qch643l7@njut,edu.cn
K. I T 0 Nagoya Guidance & Propulsion Systems Works, Mitsubishi Heavy Industries, Ltd., Komaki, 485-8561 Japan
T. NAKAGAWA Department of Marketing and Information System, Aichi Institute of Technology, 1247 Yachigusa Yagusa, Toyota, Aichi,470-0392 Japan E-mail: [email protected] This paper applies a preventive maintenance(PM) policy for a used system to a cumulative damage shock model where the P M is imperfect. Shocks occur according to a nonhomogeneous Poisson process. The system is assumed to fail only by degradation, as only when the total amount of damage exceeds a prespecified level K , still the system undergoes only PM where it take the place of replacement and its cost is C K . The system undergoes P M where its cost is co at operating time T , or when the total amount of damage exceeds a level k ( k 5 K ) , whichever occurs first. The expected cost rate is obtained and optimal T' and k* to minimize the expected cost are analytically discussed when shocks occur at a homogeneous Poisson process. Several examples are presented.
1. Introduction Many replacement policies for a new system have been studied by many authors, ie., a new system begin t o operate at time 0, and any systems which operate successively are as good as new after In many real situations, it may be more economical t o use a used system than to do a new one in the case where the cost of PM (overhaul etc.) is much less than the one of replacement. However, each PM seems only imperfect in the sense that it does not make a system like new but younger. The replacement policies for a used nit,^,^ and many types of imperfect PM have been considered.5p13 The first imperfect P M model where PM is imperfect with probability p was con-
396 sidered by Chan and Downs.5 Refs. [9-12] introduced improvement factors in hazard rate or age after PM. Further, Kijima and Nakagawa introduced improvement factors in damage after PM to a cumulative damage m0de1.l~ Cumulative damage models, where a system suffers damage due to shocks and fails when the total amount of damage exceeds a failure level K , generate a cumulative process.14 Some aspects of damage models from reliability viewpoints were discussed by Esary, Marshall and P r 0 ~ c h a n . It l ~is of great interest that a system is replaced before failure as preventive maintenance. The replacement policies where a system is replaced before failure at time T,16at shock N,l7,lSor at damage were considered. Nakagawa and Kijima 21 applied the periodic replacement with minimal repair’ a t failure t o a cumulative damage model and obtained optimal values T * , N* and Z* which minimize the expected cost. Satow et al. applied the cumulative damage model t o garbage collection or incremental backup policies for a databa~e.”?~~ In this paper, we apply a P M policy for a used system to a cumulative damage shock model where the P M is imperfect. The used system as begins to operate at time 0 after PM, the amount of damage at time 0 is ZO. Shocks occur at a nonhomogeneous Poisson process. A system undergoes PM at operating time T , or when the total amount of damage exceeds a level k , whichever occurs first. we obtain the expected cost rate. Further, we obtain optimal T’ and lc* to minimize the expected cost when shocks occur at a homogeneous Poisson process. Several numerical examples are given.
Z19320
2. Problem Formulation
Suppose that shocks occur at a nonhomogeneous Poisson process with an intensity function X(t) and a mean-value function R(t), ie., R(t) = s,”X(u)du. Then, the probability that shocks occur exactly j times during (0, t]is 24
where R(0) = 0 and R ( m ) = 00. Let F j ( t ) denotes the probability that the j-th shock occurs during (0, 11, is given by
F,(t)=XH,(t)
(j=o,1,2,.”) .
(2)
2=3
Then, Fo(t) = 1 and F ( t ) = Fl(t) = 1 - e-R(t). Further, an amount Y, of damage due to the j-th shock has an identical distribution G3(z) G PT{Y, 5 x} (3 = 1 , 2 , . . . ) with finite mean. It is assumed that the Y , to the j-th shock where damage is additive. Then, the total damage 2, 3 ZO= 0 has a distribution
c:=,
Pr{Z,
I z> = G(J)(z)
( j= 1 , 2 , . . . ) ,
(3)
397 and G(’)(z) = 1 for z 2 0, 0 for x < 0, where G ( J ) ( x () j = 1 , 2 , . . . ) denote the j-fold Stieltjes convolution of G(z) with itself. Then, the probability that the total damage exceeds exactly a prespecified level K at j - t h shock is G ( j - l ) ( K )- G ( j ) ( K ) . Let Z ( t ) be the total amount of damage at time t. Then, the distribution of Z ( t ) is given by15 00
i=O
Consider the system which should operate for an infinite time span and assume: When the total damage has reached a prespecified level K by the maker of system, the system must undergoes preventive maintenance (PM) at once, and the PM cost C K is a kind of penalty cost, i e . , it would be greater than scheduled PM cost co because C K includes all its cost resulting from the PM. The PM of system during actual operation is costly and dangerous for the total amount of damage to exceed a level K , then it is important t o determine when to undergoes P M before damage level K . It is better to use the level and operating time as a PM indicator. We make a P M of the system when the total amount of damage exceeds a level k (0 < k 5 K ) , or at operating time T after its installation or previous PM, whichever occurs first. It is also assumed that we discuss optimal P M policies with respect to a used system, i e . , the used system begins to operate at time 0, the amount of damage at time 0 is ZO. Because each P M is imperfect and it may be better t o operate a used system than to do a new one in t h e case where the cost of P M is much less than the one of replacement. Further, a n amount of damage after the PM becomes ZO when it was ZT, Zk or ZK before the P M (ZO < Z T , Z ~5 Z K ) . Then, the the probability PT that the system undergoes P M at time T is
and the probability Pk that the system undergoes PM when the total amount of damage exceeds the level k is
Using the relation rT
Fj+1(T) =
H j ( t ) X ( t ) d t , ( j = 0 , 1 , 2 ,’ ‘ ‘ )
+
t 7)
we have Pk = Cj”=, H j ( T )[ 1 - G(j)( k - Z O ) ] and PT Pk = 1. Let E [ U ]denotes the mean time to the PM. Then, from Eqs. ( 5 ) and (6), we
398 have
It is evident that
J;-"[l - G ( K - zo - u)]dG(j-l)(u)Fj(T)and J G(z where PK = C,"=, u)dG(')(u) = G ( z ) . PR- denotes the probability that the total amount of damage exceeds a level K during (0,T ]at the first shock or at the j -t h shock when the amount of damage was u zo (ZO 5 u zo < k ) at the ( j - 1)-th shock, and as the probability that the P M cost is C K . Thus, the total expected cost E[C] t o P M is
+
+
E [ C ]= CO
+
(CK - C 0 ) P K .
(10)
Therefore, from Eqs. (8) and ( l o ) ,the expected cost rate is
3. Optimal Policy Suppose that shocks occur at a Poisson process with rate A, i.e., A ( t ) = A, = 0,1,2,.. . ) . We discuss optimal values k* and T* which minimize the expected cost C ( T , k ) in Eq.(ll). We have that C ( 0 , k ) = limT,oC(T,k) = 00 for any k (ZO 5 k 5 K ) and C ( c o , K ) = limT.+m,k+K C(T,k ) = X c ~ / [ 1 M ( K - ZO)] where M ( x ) = C,"=, G j ( x ) . Thus, there exists a positive pair (T*,k * ) (0 < T" 5 co,ZO 5 k 5 K ) which minimizes C(T,k ) when M ( K - Z O ) < co. Differentiating C(T,k ) with respect to T and setting it equal to zero, we have
R(t) = At and H j ( t ) = [(At)j/j!]e-" ( j
+
where
Similarly, differentiating C ( T ,k ) with respect to k and setting it equal to zero, we have
399 Noting that G ( K - k)C,"O=,G(j)(lc - zo)Hj(T) < ~ ; " = o ~ ~ - z "-Gt ( o Ku)dG(j)(u)Hj(T), since the function G ( z )is continuous and strictly increasing. We have that there does not exist a positive pair ( T * , k * )(0 < T* < 03,zo < k < K ) which satisfies Eqs. (12) and (14), simultaneously. Thus, if such a positive pair ( T k, * ) exists which minimizes C ( T ,k ) in Eq.(ll), then ( T * ,k*) must be (03, k * ) , (T*, K ) or (T*,to).
3.1. Model 1 First, consider an optimal policy for the model 1, i.e., the system undergoes PM a t operating time T , or first shock occurs, whichever occurs first. Since we put k = to in Eq.(ll), the expected cost rate is
It can be easity see that CI(T) is a strictly decreasing in T , and hence, T; = and C1(00)= XcoG(K - 20) k K [ 1 - G ( K - Z o ) ] .
+
03
3.2. Model 2 Secondly, consider an optimal policy for .the model 2, i.e., the system undergoes PM when the total amount of damage exceeds a level K , or at operating time T , whichever occurs first. Putting that k = K in Eq.(ll), the expected cost rate is
Then, Eq.(12) is simplified as
where
Note that G j f ' ( x ) / G j ( ~ is)strictly decreasing in j . Thus, Q ( T )is strictly increasing in T from Ref.[22] and Q(m) = limT-,, Q ( T ) = 1. Letting U ( T ) be the leftU ( T )= hand side of Eq.(17), we have U ( 0 ) = limT-0 U ( T )= 0, U(M) 3 limT,, & ( ~ o ) [ l + M ( K - ~ o ) ] -=l M ( K - t o ) , U'(T)= Q'(T)Cj"=o G(j)(K-Zo)F>+i(T) > 0. Thus, U ( T )is a strictly increasing function from 0 t o M ( K - Z O ) .
Theorem 3.1. If M ( K - to) > C O / ( C K - cg) then there exists a finite and unique T; (0 < T; < m) which minimizes C 2 ( T ) ,and it satisfies Eq.(17). I n this case, the resulting cost is Cz(T;) = X(CK - co)&(T;). I f M ( K - Z O ) I CO/(CK - C O ) then T; = 00 and c z ( o ~ ) / X= CK/[1+ M ( K - ZO)].
400 Example 3.1. Suppose that G(z) = 1 - e-w”, i.e., G(j)(z) = c,”=j[(pz)i/i!] and M ( z ) = p. Table 1 gives the optimal PM times AT,* and the resulting costs C~(T;)/(XCO) for zo = 100,200, 1/11 = 100,150,200 and C K / C O = 1.1,1.5,2.0 when K = 2000. This indicates that the optimal values of T,* are decreasing with C K / C O , and the costs C2(T$)are increasing with both C K / C O and 1/11. However, they are almost unchanged for zo. For example, when the mean time of shock occurs is 1/X = 1 day, C K / C ~ = 1.5, 1/11 = 100 and zo = 100, the optimal P M time T,* is about 21 days. In this case, ( K - z o ) / ( X / p ) = 19 days, and note that it represents the mean time unit the total amount of damage exceed a level K .
100
200
100
90.9297
0.0550
20.7103
0.0722
16.1504
0.0844
150
328.8639
0.0805
16.5211
0.1080
11.7853
0.1308
200
00
0.1048
15.0183
0.1419
9.8295
0.1767
100
98.6480
0.0579
19.9930
0.0762
15.4359
0.0895
150
502.1780
0.0846
16.1489
0.1138
11.3545
0.1385
200
00
0.1100
14.8692
0.1492
9.5509
0.1867
3.3. Model 3 Next, consider an optimal policy for the model 3, i e . , the system undergoes PM only when the total amount of damage exceeds a level k . putting that T = 00 in Eq. (1l), the expected cost rate is
C3(k) C(T,k) - lim ~
T-cc
-
co
+ (CK
-
co){l
-
G ( K - ZO) 1
+
Jgk-’O[l-
+M(k
-
20)
G ( K - 20 - u ) ] ~ M ( ~ L ) } ’
(19)
Then, Eq.(14) is simplified as (20) Letting V ( k )be the left-hand side of Eq.(20), we have V ( z 0 )= limk--ttoV ( k ) = 0, V ( K ) = M ( K - Z O ) , V ’ ( k ) = [l M ( k - zo)]g(K - k ) > 0, where g(z) = d G ( z ) / d z > 0, since the function G(z) is strictly increasing. Thus, V ( k ) is a strictly increasing function from 0 t o M ( K - zo).
+
40 1
Theorem 3.2. If M ( K - zo) > C O / ( C K - CO) then there exists a finite and unique k* (zo < k* < K ) which minimizes C 3 ( k ) , and it satisfies Eq.(20). In this case, the resulting cost is C3(k*)= X(CK - co)[l - G ( K - k * ) ] .I f M ( K - 20) 5 C O / ( C K - CO) then k* = K and C3(K)/X = c ~ / [ 1 +M ( K - zo)]. In particular, if zo = 0 as the P M is perfect, then this result agrees with the result of Nakagawa.20
Example 3.2. Suppose that G(z) = 1 - e-p”. Then, if p ( K - zo) > C O / ( C K - CO) then there exists a finite and unique k* (20 < k* < K ) which minimizes C s ( k ) ,and it satisfies p ( k - zo)e-p(K-k) = C O / ( C K - CO). Table 2 gives the optimal PM levels k” and the resulting costs C3(lc*)/(Xco) for zo = 100,200, l / p = 100,150,200 and cK/co = 1.1,1.5,2.0 when K = 2000. This indicates that the optimal values of k* are decreasing with cK/co, and the costs C3(k*)are increasing with both C K / C O and l / p . However, they are almost unchanged for 20.
Table 2.
Optimal P M levels k’ and the resulting costs C3(k’)/(Xco) when K = 2000. c K / c o = 1.5
CK/@ = 1.1
CK/CO
= 2.0
to
1Ip
k*
C3(k*)/(Xco)
k*
C3(k*)/(Xco)
k*
C3 (k*)/(Xco)
100
100
1939.0738
0.0544
1786.7753
0.0593
1721.4152
0.0617
150
1967.1609
0.0803
1744.8070
0.0912
1649.6915
0.0968
200
2000.0000
0.1048
1720.2311
0.1234
1597.3833
0.1336
200
100
1944.3650
0.0573
1729.5250
0.0628
1727.3891
0.0655
150
1974.7719
0.0845
1753.3843
0.0966
1658.8677
0.1028
200
2000.0000
0.1100
1731.4950
0.1306
1609.4824
0.1419
Note that C1(T;) = C3(z0) < C3(k*) since lirnk+,,,dCs(k)/dk < 0 and from Theorems 3.1 and 3.2, we have Remark 3.1.
Remark 3.1. If M ( K - 20) > C O / ( C K - CO) then ( T * ,k * ) = (m,k * ) or(T*, k * ) = (T;, K ) . If M ( K - ZO) 5 c o / ( c ~- CO) then(T*, k * ) = ( m , K ) . Compaerd with Tables 1 and 2 , it is found that C3(k*) 5 Cz(T,”);that is, k * ) is better than (T,”,K ) . However, it would be generally easier to check the operating time than the total amount of damage. From this point of view, the time policy would be better than the level policy. Therefore, how t o select among two policies would depend on actual mechanism of a system. (00,
References 1. R. E. Barlow and F. Proschan, Mathematical Theory of Reliability, John Wiley & Sons, New York (1965)
402 2. T. Nakagawa, Optimal preventive maintenance policies for repairable system. IEEE Trans. Reliability, R-26, 168-173 (1977). 3. E.J.Muth, An optimal decision rule for repair vs replacement. IEEE Trans. Reliability, R-26, 179-181 (1977). 4. T. Nakagawa, Optimum replacement policies for a used unit. Journal of the Operations Research Society of Japan, 22, 338-347 (1979). 5. P.K.W.Chan and T.Downs, Two criteria preventive maintenance. IEEE Trans. Reliability, R-27, 272-273 (1978). 6. M.Brown and F. Proschan, Imperfect repair. Journal of Applied Probability, 20, 851859 (1983). 7. D.N.P.Murthy and D.G.Nguyen, Optimal age-policy with imperfect preventive maintenance. IEEE Trans. Reliability, R-30, 80-81 (1981). 8. T. Nakagawa, Optimal policies when preventive maintenance is imperfect. IEEE Trans. Reliability, R-28, 331-332 (1979). 9. C.H.Lie and Y.H.Chun, An algorithm for preventive maintenance policy. IEEE Trans. Reliability, R-35, 71-75 (1986). 10. T . Nakagawa, A summary of imperfect preventive maintenance policies with minimal repair. R A I R O Operations Research, 14, 249-255 (1980). 11. D.G.Nguyen and D.N.P.Murthy, Optimal preventive maintenance policies for repairable systems. Operational Research, 29, 1181-1194 (1981). 12. T . Nakagawa, Sequential imperfect preventive maintenance policies, IEEE Trans. Reliability, 37, 581-584 (1989). 13. M. Kijima and T . Nakagawa, Replacement policies for a shock model with imperfect preventive maintenance, European Journal of Operational Research, 57, 100-110 (1992). 14. D. R.Cox, Renewal Theory, Methuen, London (1962). 15. J. D. Esary, A. W. Marshall and F. Proschan, Shock models and wear processes, Annals of Probability, 1, 627-649 (1973). 16. H. M. Taylor, Optimal replacement under additive damage and other failure models., Naval Res. Logist. Quart, 22, 1-18 (1975). 17. T. Nakagawa, A summary of discrete replacement policies, European J. of Operational Research, 17, 382-392(1984). 18. C.Qian, S.Nakamura and T. Nakagawa, Replacement and minimal repair policies for a cumulative damage model with maintenance, Computers and Mathematics with Applications, 46, 1111-1118 (2003). 19. R. M. Feldman, Optimal replacement with semi-Markov shock models, Journal of Applied Probability, 13, 108-117 (1976). 20. T. Nakagawa, On a replacement problem of a cumulative damage model, Operational Research Quarterly, 27 895-900 (1976). 21. T. Nakagawa and M. Kijima, Replacement policies for a cumulative damage model with minimal repair a t failure, IEEE Trans. Reliability, 38, 581-584 (1989). 22. T.Satow, K.Yasui and T. Nakagawa, Optimal garbage collection policies for a database in a computer system, RAIRO Operations Research, 30, 359-372 (1996). 23. C.Qian, Y.Pan and T. Nakagawa, Optimal policies for a database system with two backup schemes, R A IRO Operations Research, 36, 227-235 (2002). 24. S. Osaki, Applied Stochastic Systems Modeling, Springer Verlag, Berlin (1992).
DETERMINATION OF OPTIMAL WARRANTY PERIOD IN A SOFTWARE DEVELOPMENT PROJECT
K. RINSAKA AND T. DOH1 Department of Inforrnataon Enganeerang, Haroshama Unaversaty, 1-4-1 Kagamayama, Hagasha-Haroshama 739-8527, JAPAN E-mad: ( n n s a k a , dohz) &el. hzroshzma-u. ac.3p
This paper presents a stochastic model to determine the optimal warranty period for a computer software, considering the difference between the debugging environment in the testing phase and the executing environment in the operational phase. The software reliability models based on non-homogeneous Poisson processes are assumed to describe the debugging phenomena for both the environments. We model the operational profile of the software based on the idea of accelerated life testing for hardware products. We formulate the total expected software cost incurred in both testing and operational phases, and derive the optimal software warranty period which minimizes it. Numerical examples are provided to investigate the dependence of model parameters on the optimal warranty policy.
1. Introduction
Product warranty plays an increasingly significant role in both consumer and commercial transactions. The problem for designing the product warranty service has been recognized to be important in terms of customer service, even if the product is rather reliable. From such a background, the stochastic models called warranty models [ 1,2] have been developed in the reliability/maintenance research area. For software developers, it is important to determine the optimal time when software testing should be stopped and when the system should be delivered t o a user or a market. This problem, called optimal software release problem, plays a central role for the success or failure of a software development project. Many authors formulated the optimal software release problems based on various different assumptions and/or several software reliability growth models [3-71. Okumoto et d 3assumed that the number of software faults detected in the testing phase is described by an exponential software reliability model based on non-homogeneous Poisson processes (NHPPs) [8], and derived an optimal software release time which minimizes the total expected software cost. Koch et d4considered the similar problem for the other software reliability model. Bai et aL5 discussed the optimal number of faults detected before the release. It is difficult t o detect and remove all faults remaining in a software during the testing phase, because exhaustive testing of all executable paths in a general pro-
403
404 gram is impossible. Once the software is released t o users, however the software failures may occur even in the operational phase. It is common for software developers to provide the warranty period when they are still responsible for fixing software faults causing failures. In order t o carry out the maintenance during the software warranty period, the software developer has t o continue keeping a software maintenance team. At the same time, the management cost in the operational phase has t o be reduced as much as possible, but human resources should be utilized effectively. Although the problem which determines the software warranty period is important only a very few authors paid their attention t o this problem. Kimura et aLg considered the optimal software release problem in the case where the software warranty period is a random variable. Pham et a1.l' developed a software cost model with warranty and risk costs. They focused on the problem for determining when t o stop the software testing under the warranty contract. However, note that the software developer has t o design the warranty contract itself and often provides the posterior service for users after software failures. Dohi et al." formulated the problem for determining the optimal software warranty period which minimizes the total expected software cost under the assumption that the debugging process in the testing phase is described by a n NHPP. Since the user's operational environment is not always same as the software testing phase, however, the above papers did not take account of the difference between these phases. Some reliability assessment methods during the operational phase have been proposed by some authors [12,13]. Okamura et d i 3represented the operational profile by introducing an accelerated life testing for hardware products. In this paper, we develop a stochastic model t o determine the software warranty period under the assumption that the testing phase is different from the operational phase in terms of debugging phenomenon. First, we formulate the total expected software cost based on the NHPP type of software reliability models. In the special case with the exponential fault-detection time distribution, we derive analytically the optimal warranty period which minimizes the total expected software cost under a milder condition. In numerical examples with real data, we compare three debugging scenarios, say three NHPP models and examine the dependence of model parameters on the optimal warranty policy.
2. Model Description
First, the following assumptions on the software fault-detection process are made: (a) In each time when a system failure occurs, the software fault causing the system failure is detected and removed immediately. (b) The number No of initial faults contained in the software program follows the Poisson distribution with mean w (> 0). (c) Time t o detect each software fault is independent and identically distributed nonnegative random variable with the probability distribution function F ( t )
405 and density function f ( t ) Let { N ( t ) ,t 2 O} be the cumulative number of software faults detected up to time t. From above assumptions, the probability mass function of N ( t ) is given by Pr{N(t) = m }
=
[wF(t)]VWF@)
,
m!
m=0,1,2,....
Hence, the stochastic process { N ( t ) t, 2 0} is equivalent to the NHPP with mean value function w F ( t ) , where the fault-detection rate (debugging rate) per unit of time is given by
r(t)=
f( t ) ~
1- F ( t ) '
Suppose that a software developer releases a software system a t time t o (> 0) to the user or market after completing software testing. The length of the life cycle t L (> 0) of the software is known in advance and is assumed to be sufficiently larger ) the software warranty period. More precisely, than t o . Let tw (0 5 t w 5 t ~denote the warranty period is measured from time t o and expires at time t o tw.The software developer covers to the maintenance cost for the software failures within the warranty period. After the warranty expires, even if an additional system failure caused by a software fault occurs, the software developer does not detect and remove the fault in the out-of-control state. We assume that the penalty cost is incurred for the software developer when the software user encounters the software failure after the warranty expires. Further, we define the following costs:
+
cost to remove each fault in the testing phase cost to remove each fault during the warranty period C L : penalty cost per failure after the warranty period ko: testing cost per unit of time kw:warranty cost per unit of time CO:
CW:
3. Total Expected Software Cost
In this section we formulate the total expected software cost which can occur in both testing and operational phases. In the operational phase, we consider two cost factors; the maintenance cost for the warranty period and the penalty cost caused by the software failure after the warranty expires. From Eq.(l), the probability math function of the number of software faults detected during the testing phase is given by
It should be noted that the operational environment after the release may differ from the debugging environment in the testing phase. This difference is similar to that between the accelerated life testing environment and operating environment for
406 hardware products. We suppose that the elapsed time in the operational phase is proportional t o the time in the testing phase, and introduce the environment factor a (> 0) which expresses the relative severity in the operational environment after the release. Okamura et al.I3 apply the similar technique to model the operational phase of software, and estimate the software reliability through an example of the actual software development project. Under this assumption, note that a = 1 means the equivalence between the testing and operational environments. On the other hand, a > 1 ( a < 1) means that the operational environment is severe (looser) than the testing environment. Then, the probability math function of the number of software faults detected during the warranty period is given by
Since the software developer does not eliminate software faults which appear after the warranty expires, the software reliability growth phenomenon can not be observed in this case. It is assumed that the debugging rate r ( t ) becomes uniform after the software warranty period. Since the debugging rate at time to t w is r ( t o a t w ) , the fault-detection process of the software is expressed by
+
+
4. Determination of the Optimal Software Warranty Period Suppose that the time to detect each software fault obeys the exponential distribution [8] with mean 1/X (> 0). In this case, the total expected software cost in Eq. (6) becomes
We make the following assumption:
(A-I)
CL
> cw > co.
Then the following result provides the optimal software warranty policy which minimizes the total expected software cost.
407
Theorem 4.1. When the software fault-detection time distribution follows the exponential distribution with mean 1/X, under the assumption (A-I), the optimal software warranty period which minimizes the total expected software cost is given as follows: (1) If k w
2 (CL
-
cW)wXae-At", then the optimal policy is tb = 0 with
< ( C L - cw)wXae-Ato and k w > ( c -~cw)wXae-X(tn+atL),then there t, (0 < & t, < exists a finite and unique optimal software warranty period & t L ) , and its associated expected cost is given by
(2) If k w
C(t&) = koto
+ QW
+ cww
(3) If kw 5
(CL -
(1 - e p X t o )k w t >
I
e-A(to+atb)
-
+ c w w [e-Atn
-
I
e-A(to+atb)
e-A(to+atL)l.
(9)
cW)wXaepX(to+atL),then we have tb = t L with
5. Numerical Examples Based on 86 software fault data observed in the real software testing process [14], we calculate numerically the optimal software warranty period which minimizes the total expected software cost. For the software fault-detection time distribution, we apply three distributions: exponential [ 8 ] ,gamma of order 2 [15] and Rayleigh [16] distributions. The probability distribution functions for the gamma and Rayleigh distributions are given by
F ( t ) = 1 - (1+ Xt)e-xt
(11)
and
respectively. Suppose that the software is released t o the user or market at the time point when 70 fault data are observed, namely to = 67.374. Then we estimate the unknown parameters in the software reliability models by the method of maximum likelihood. Then, we have the estimates (&,A) = (98.5188, 1.84e-02) for the exponential distribution, ,;( A) = (75.1746, 6.46224e-02) for the gamma model and (G, 8) = (71.6386, 2.45108e+01) for the Rayleigh model. Figure 1 shows the actual software fault data and the behavior of estimated mean value functions. For the other model parameters, we assume: ko = 0.02, kw = 0.01, co = 1.0, cw = 2.0, C L = 20.0 and t L = 1000.
408 90
80
70 60
30 20 10 0
0
20
40
60
80
100
120
f
Figure 1. The actual software fault data and the behavior of estimated mean value functions.
Table 1 presents the dependence of the environment factor a on the optimal software warranty period tk . As the environment factor monotonically increases, i e . , the operational circumstance tends to be severe, it is observed that the optimal software warranty period t& and its associated minimum total expected software cost C(t&,) decrease for both exponential and gamma models. For the Rayleigh distribution, the optimal software warranty period is always 0, that is, it is optimal t o carry out no warranty program. This is because the goodness-of-fit of the Rayleigh model is quite low. In this situation, it can be expected that a lot of software faults will be detected after the release of product. In Table 2, the dependence of the software testing period t o on the optimal software warranty period tk is examined. For the exponential and gamma distributions, we assume a = 2. On the other hand, for the Rayleigh distribution, a = 0.75. As the testing period t o monotonically increases, it is found that the optimal software warranty period t& decreases drastically, and the associated minimum expected software cost C(t>) first decreases and then increases. This observation is quite natural because with the increase in the testing time, it is always possible t o reduce the total expected software cost. But after the certain testing time period, the software fault can hardly be detected. As a result, the total expected software cost increases. Table 3 shows the dependence of the parameters X and 0 on the optimal software warranty period. As X increases or 0 decreases, it is seen that the optimal software warranty period C& decreases.
409 Table 1. Optimal software warranty period for varying environment factor. Exponential
a
Gamma
Rayleigh
t’w
C(t&)
0.50
669.3
136.1
141.3
83.4
0.0
72.1
0.75
475.6
133.9
102.2
82.9
0.0
72.1
1.00
372.3
132.7
80.9
82.7
0.0
72.1
1.25
307.6
131.9
67.4
82.5
0.0
72.1
1.50
262.9
131.4
57.9
82.4
0.0
72.1
2.00
205.0
130.7
45.6
82.2
0.0
72.1
3.00
144.0
130.0
32.4
82.1
0.0
72.1
Table 2. period.
Optimal software warranty period for varying testing
Exponential
Gamma
Rayleigh
to
t’w
C(t’w)
40
218.7
149.0
59.2
97.0
26.4
60
208.7
134.7
49.2
84.5
0.0
72.8
80
198.7
125.0
39.2
79.9
0.0
72.9 73.6
88.0
100
188.7
118.3
29.2
78.4
0.0
200
138.7
106.7
0.0
79.2
0.0
75.6
300
88.7
106.1
0.0
81.2
0.0
77.6
400
38.7
107.2
0.0
83.2
0.0
79.6
Acknowledgments T h i s work is partially based on t h e financial support by t h e Ministry of Education, Science, Sports a n d Culture: Grant-in-Aid for Exploratory Research, G r a n t
NO. 15651076 (2003-2005). References W.R. Blischke and D.N.P. Murthy, Warranty cost analysis, Marcel Dekker, New York (1994). W.R. Blischke and D.N.P. Murthy, (eds), Product warranty handbook, Marcel Dekker, New York (1996). K. Okumoto and L. Goel, “Optimum release time for software systems based on reliability and cost criteria,” J . Sys. Software, 1315-318 (1980). H.S. Koch and P. Kubat, “Optimal release time of computer software,” IEEE Trans. Software Eng., SE-9 323-327 (1983).
410 Table 3.
Optimal software warranty period for varying model parameters X and
e. Gamma
Exponential
x
t&
C(t&)
Rayleigh
X
ttv
C(t&)
e
tG
qtty)
0.005
714.4
178.3
0.03
124.8
108.0
22.0
0.0
72.4
0.010
375.0
154.3
0.04
88.6
96.3
24.0
0.0
72.1
0.015
252.3
138.6
0.05
66.3
88.6
26.0
0.0
72.2
0.020
188.0
127.6
0.06
51.1
83.8
28.0
1.2
73.4
0.025
148.1
119.8
0.07
40.0
80.8
30.0
7.7
75.2
0.030
120.9
114.3
0.08
31.6
79.1
32.0
14.2
77.4
0.035
101.0
110.3
0.09
25.0
78.2
34.0
20.7
79.7
5. D.S. Bai and W.Y. Yun, “Optimum number of errors corrected before releasing a software system,” IEEE Trans. Reliab., R-3741-44 (1988). 6. W.Y. Yun and D.S. Bai, “Optimum software release policy with random life cycle,” IEEE Trans. Reliab., R-39 167-170 (1990). 7. T. Dohi, N. Kaio and S. Osaki, “Optimal software release policies with debugging time lag,” Int. J. Reliab., Quality and Safety Eng., 4 241-255 (1997). 8. A.L. Goel and K. Okumoto, “Time-dependent error-detection rate model for software reliability and other performance measures,” IEEE Trans. Reliab., R-28206-211 (1979). 9. M. Kimura, T . Toyota, and S. Yamada, “Economic analysis of software release problems with warranty cost and reliability requirement,” Reliab. Eng. 63 Sys. Safe., 66 49-55 (1999). 10. H. Pham and X. Zhang, “A software cost model with warranty and risk costs,” IEEE Trans. Comput., 48 71-75 (1999). 11. T. Dohi, H. Okamura, N. Kaio and S. Osaki, “The age-dependent optimal warranty policy and its application to software maintenance contract,” Proc. 5th Int’l Conf. on Probab. Safe. Assess. and Mgmt. ( S . Kondo and K. Furuta, eds.), 4 2547-2552, University Academy Press Inc. (2000). 12. J . Musa, G. Fuoco, N. Irving, D. Kropfl and B. Juhlin, “Chapter 5: The operational profile,” in Handbook of Software Reliability Engineering (M.R. Lyu ed.), McGraw-Hill, New York (1996). 13. H. Okamura, T. Dohi and S. Osaki, “A reliability assessment method for software products in operational phase - proposal of an accelerated life testing model -,” Electronics and Communication in Japan, Part 3 , 84 25-33 (2001). 14. A.A. Abde-Ghaly, P.Y. Chan and B. Littlewood, “Evaluation of competing software reliability predictions,” IEEE Trans. Software Eng., SE-12 950-967 (1986). 15. S. Yamada and S. Osaki, “Software reliability growth modeling: models and applications,” IEEE Trans. Software Eng., SE-11 1431-1437 (1985). 16. A.L. Goel, “Software reliability models: assumptions, limitations, and applicability,” IEEE Trans. Software Eng., SE-11 1411-1423 (1985).
OPTIMAL INSPECTION-WARRANTY POLICY FOR WEIGHT-QUALITY BASED ON STACKELBERG GAME - FRACTION DEFECTIVE AND WARRANTY COST -
H. SANDOH Department of Business Administration, Kobe Gakuin University, 518, Arise, Ikawadani-cho, Nishi, Kobe, 651-2180, J A P A N E-mail: [email protected]. ac.jp
T. KOIDE Department of Healthcare i 3 Social Services, University of Marketing tY Distribution Sciences, 3-1, Gakuen-nishi-machi, Nisha, Kobe, 651-2188, J A P A N E-mail: [email protected] In the final stage of manufacturing some specific products, there is a weighing process where we weigh each product using a scale. However, the scale occasionally becomes uncalibrated and therefore, the product may be shipped out with a label or a mark showing incorrect weight. Such an uncalibrated state of the scale can be detected by inspection carried out to the scale. Further, we should introduce warranty to the products whose labels or marks show incorrect weights. This study considers two types of inspection and warranty policy (inspection-warranty policy in short) and make a comparison between them through a Stackelberg game formulation to discuss an optimal policy, taking into account the consumers’ viewpoint. Numerical illustrations are also presented.
1. Introduction
In the final stage of manufacturing for some specific products such as chemical products, there is a process in which we weigh each product using a scale with a view to obtaining its exact weight, and then marking each product with its weight. This weighing process is observed in the situation, e.g., where drums are filled with some specific chemical product so that each drum contains approximately 250 kilograms of the product, and in the final stage, individual drums are weighed to obtain the actual weight of each drum of product. Such a weighing process is not necessarily regarded as important and its associated cost is reduced as much as possible since it does not affect the product quality itself. However, the scale occasionally becomes uncalibrated particularly when the objective product is very heavy or we are very busy in weighing many products 41 1
412
within a restricted time. Once the scale becomes uncalibrated, it will produce inaccurate weights for individual products, and hence there is a risk that the products will be shipped out with marks or labels indicating incorrect weights. In this study, when a product with a mark or a label revealing incorrect weight is shipped out, it is referred to as defective regardless of its quality. Under real circumstances, such inaccuracy or uncalibrated state of a scale is detected by periodical inspection. In the cases where the products are expensive or exact weight is a critical factor, the scale will be inspected and found t o be normal prior to each shipment. In other cases, however, each lot of products may be shipped out immediately after they are weighed without the scale being inspected. This is because of cost reduction for this weighing process. Even in such a case, the volume of defective products to be shipped out with inaccurate marks of weights can be restrained in various ways. Sandoh and Igakilg have proposed both a continuous and a discrete model for an inspection policy for a scale when inspection activities involve adjustment operations. Sandoh and Igaki” have also considered a case where inspection is executed only for detecting scale inaccuracy and the adjustment operations follow the inspection when the scale is found to be uncalibrated. Sandoh and Nakagawa” have dealt with a different situation where (1) the scale is inspected only twice a day, once in the morning and one more in the evening, (2) if the scale is detected to be uncalibrated in the evening, we weigh again a prespecified volume of the products in the order of tracing them back after the scale is adjusted, and (3) immediately after we finish reweighing products, we ship out all the products without waiting for the scale inspection next morning owing t o their due date. Under these conditions, Sandoh and Nakagawa” discussed optimal volume of products t o be reweighed. In this study, we consider two types of inspection-warranty policy t o make a comparison between them. The comparison is carried out through a Stackelberg game formulation to take into account both the consumer’s viewpoint and the manufacturer’s one. 2. Assumptions and Notations
We make the following assumptions: (1) We consider a monopoly. (2) The manufacturer weighs each product using a scale and ships out each product after he puts a label on each individual product to show its weight. (3) There are many products to be weighed and therefore we regard the volume of products t o be weighed as continuous. The unit of time is defined as the time required for weighing a unit of product. (4) We call the products which are weighed by an uncalibrated scale to be shipped out defective regardless of their quality. ( 5 ) The scale is inspected at iT(i = 1 , 2 , . . . ). (6) Inspection activities involve adjustment operation and hence the scale becomes
41 3
calibrated immediately after inspection. (7) Let co and c1 respectively express the cost per inspection activity and the cost for weighing a unit of product. (8) For i = 1,2,. . . , let us denote, by a random variable X i , the time for a scale to be uncalibrated on an interval ((2 - l)T,iT]. Let X l , X z , . . . be independent and identically distributed with distribution function F and density function f. In addition, we assume t,hat E [ X J = p << +oo. (9) The raw price of the product is given by a. (10) The consumer’s revenue by puFchasing the product is given by R. Based on the above assumptions and notations, we consider the following two types of inspection-warranty policy: [Policy 11 Each product is shipped out immediately after it is weighed. In this case, we have a risk to ship out defective products, and hence, we devote c2 to the warranty for the consumer who purchased a defective product. When the consumer purchased a defective product, he/she can receive W = a c z ( a > 0) through the warranty service. The products shipped out under this policy are called Type 1 products. Type 1 product is sold at price PI (< R). [Policy 21 Products are not shipped out until we assure that the scale is calibrated by inspection. In case the scale is found t o be uncalibrated by an inspection activity, all the products waiting for being shipped out are weighed again until the scale is inspected to be normal. The products shipped out under this policy are called Type 2 products. The price of Type 2 product is denoted by Pz(P1 < P2 5 R). Under Policy 2, we never ship out defective products, and therefore we provide the consumer with no warranty on weight-quality. It should, however, be noted that we need secure some space for the weighed products t o wait for being shipped out. Let c3 and c4, respectively, express the cost for a unit of weighed product t o occupy the space per unit of time and the cost for each weighed product to waste a unit of time without being shipped out. It is very difficult to analytically compare Policy 1 with Policy 2 based on the cost for inspection-warranty policies from the manufacturer’s point of view. In the following, we introduce a Stackelberg game formulation t o make a comparison between the two policies taking the consumer’s and the manufacturer’s viewpoint.
3. Consumer’s Optimal Reaction 3.1. Fraction Defective of Type 1 Product The assumptions described above indicate that the process behavior generates a renewal reward p r o ~ e s s ~where ~,~~ , a renewal point corresponds to the time when the inspection activity has been completed. Hence, under Policy 1, the volume of
41 4
defective products t o be shipped out per unit of time is given by
and D ( T ) expresses the fraction defective of Type 1 products. It should be noted that D ( T ) increases with T from 0 t o 1.
3.2. Optimal Reaction of Consumer 3.2.1. Expected profit of consumer If the consumer purchases a Type 1 product, his expected profit becomes K ( P , W ) = ( R - P1)(1- P )
+ (W
-
Pl)P,
(2)
where p = D ( T ) in Eq. (1). When he chooses a Type 2 product, his expected profit is given by
nz(pz)= R - pz, while his expected profit becomes
IT0 =
(3)
0 when he purchases no product
3.2.2. Optimal reaction of consumer By comparing IIl(p, W ) with nz(P2) or no, we can obtain the optimal reaction by the consumer as follows: (1) In the case of PI < Pz < R, the consumer purchases either a Type 1 or a Type 2 product, and this case can further be classified as i. If (p, W ) E Rl\Ra, the consumer purchases a Type 1 product. ii. If ( p , W ) E Rz\R1, the consumer chooses a Type 2 product,. iii. If ( p , W ) E R1 n Rz,Type 1 product is indifferent t o that of Type 2 for the consumer, where
(2) In the case of Pz
=
R , purchasing a Type 2 product becomes indifferent t o
purchasing no product for the consumer, and we have the following classification: i. If ( p , W ) E R1\Ro, the consumer purchases a Type 1 product. ii. If (p, W ) E Ro\Rl, the consumer chooses a Type 1 product or purchases no product. iii. If ( p , W ) E R1 n 0 0 ,purchasing a Type 1 product becomes indifferent to choosing Type 2 product or purchasing no product,
41 5
4. Manufacturer's Optimal Strategy This section first formulates the expected cost per unit of time under each inspection-warranty policy from the manufacturer's viewpoint, and second, discusses an optimal strategy for the manufacturer, considering the consumer's optimal reaction we have observed above.
4.1. Expected Profit From the renewal reward t h e ~ r y the ~ ~expected ~ ~ ~ ,profit per unit of time under Policy 1 is expressed by
where the manufacturer can control fraction defective p through inspection time interval T and the warranty W via c2 of Type 1 product. On the other hand, the expected cost per unit of time under Policy 2 becomes C~
Qz(T,Pz) = P2
-
a
-
+ clT +
C
~
+T C~4 T 2 [ 1 + F ( T ) ]
TF(T)
,
(9)
where the manufacturer can control inspection time interval T as well as the price Pz of Type 2 product. 4.2. Optimal P o l i c y
4.2.1. Policy 1 In the case of ( p , W ) E 0 1 \ 0 2 for PI < Pz < R as well as ( p , W ) E 0 1 \ 0 0 for PZ = R, the consumer purchases Type 1 product. In theses cases, the manufacturer can increase his expected profit per unit of time by reducing cz(= a W ) since T a Q l ( T , c ~ ) / a c z= - f,, F ( z ) d z / T < 0. Consequently, he can maximize his expected profit with ( p * ,W " ) E 01 locating on the indifference curve given by the following two equations:
416 Let us define I I ( T ) ,Fa and
5?b
by
then we have the following theorem:
Theorem 4.1. (1) I t we have
H ( F a ) > 0 for PI < P 2 < R H ( T b ) > 0 f o r Pz = R
I
the optimal policy (TT,c;) m u z m i z z n g Q1(T,c z ) under Policy I becomes
{ TT;T
+ +
ya, cf Tb,
2,W *
S O , p* + 1 cf + +o, p* + +, p2-p --f
+ +O f o r PI < PZ< R W * + +O f o r P 2 = R
( 2 ) If we have
H ( T a ) 5 0 for Pi < P 2 < R , H(Tb) 5 0 for p2 = R it can be discussed f o r the following subcases: i. If p > olco/R, there exists a unique optimal solution (T:,cf), ie., ( p * ,W * ) o n the c u m e given by Eq. (11). ii. Otherwise, we have -+
03, cf + ( R
+ PI - P2)/ol, p*
+ 03, c; + Pl/Cy,
p*
+
1, W* + R 1, W" + Pl ---f
+ PI
-
PZ f o r PI < Pz < R for Pz = R
4.1.2. Policy 2
If ( p , W ) E Rz\R1 for PI < P2 product. In addition, we have
< R, the consumer naturally purchases Type 2
lim Q2(T,Pz) = lim Q2(T,Pz) = -m,
T-tO
T++CC
(15)
These observations reveal there exists T = Tg maximizing Qz(T,P z ) for a fixed P 2 under Policy 2. It should also be noted that T = T,* is independent of Pz. In the case of Pz = R and ( p , W ) E Ro\f21, the consumer purchases no product, and the expected profit of the manufacturer becomes zero.
41 7 4.2.3. Optimal Strategy of Manufacturer From the above analyses, the optimal strategy for the manufacturer becomes: (1) Pi < Pz < R. For a fixed Pz(< R ) , if we have Q I ( T ; , c ; ) 2 &2(T;,Pz), the manufacturer should make the consumer purchase Type 1 product by letting (T,c2) = (TT,cf) under Policy 1 and setting T arbitrarily under Policy 2. Otherwise, his optimal strategy becomes to make the consumer purchase Type 2 product by letting T = T2 under Policy 2 and setting (T,c2) arbitrarily on the condition that ( p ,W ) E 0 2 \ 0 1 under Policy 1. (2) P2 = R. In this case, purchasing a Type 2 product is indifferent to buying no product for the consumer. It follows that if we have &~(T;,c;) > 0, the manufacturer can maximize his expected profit by setting ( T ,cp) = (TT,c f ) under Policy 1 and setting T arbitrarily under Policy 2. 5 . Numerical Illustrations Let us assume that the distribution function F of X i ( i = 1 , 2 , . . . ) is given by
F ( z ) = 1 - e-X",
3:
> 0.
(16)
Table 1 shows the cases considered in this section, while Table 2 reveals the optimal policies under Policy 1 along with those under Policy 2. In Table 2, Cases 1-(a), (b) and (c) represent the situation where purchasing a Type 2 product is indifferent t o buying no product for the consumer, while Cases 2 and 3 have 0 2 explicitly. If QF > Qf in Cases 2 and 3, the optimal policy under Policy 1 is the optimal strategy for the manufacturer. When Q; < &;, the manufacturer should let ( p , W ) E 0 2 \ C 2 1 under Policy 1 and use the optimal policy under Policy 2 to maximize his expected profit. Table 1. Cases. Case
R
PI
1-(a)
100
90
1-(b) 1-(c) 2-(a) 2-(b) 2-iCj
100 100 100 100 100 100 100 100
94 98
Pz 100 100 100
90 94 98 90 94 98
99 99 99 99 99 99
%(a)
%(b) 3-(c)
a 50
50 50 SO 50 50 50 SO 50
cy
co
c1
c3
1.0
200 200 200 200 200 200 200 200 200
0.01
0.01 0.01 0.01 0.01 0.01 0.01
1 1 1 0.1 0.1 0.1 0.01
0.01 0.01
0.01 0.01
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
c4 1 1 1 0.1 0.1 0.1 0.05 0.05 0.05
X 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
0.001
418 Table 2.
Optimal Strategy
Policy 1
z-icj 3-(a) P(b) 3-(c)
I I
64.6 191.7 103.5 64.6
0.03 0.09 0.05 0.03
68.4 0.0 0.0 68.4
Policy 2
42.7 38.9 42.1 42.7
1 50.0 1
67.7 67.7 67.7
36.8 43.2 43.2 43.2
References 1. R. E. Barlow and F. Proschan, Mathematical Theory of Reliability, Wiley, New York, (1965). 2. R. E. Barlow and L. C. Hunter, Operations Research, 8 , 90 (1960). 3. G. H. Weiss, Management Science, 8 , 266 (1962). 4. S. Zacks and W. J. Fenske, Naval Research Logistics Quarterly, 20, 377 (1973). 5. J. B. Keller, Management Science, 21, 256 (1974). 6. H. Luss and Z. Kander, Operational Research Quarterly, 25, 299 (1974). 7. N. Wattanapanom and L. Shaw, Operations Research, 27, 303 (1979). 8. T. Nakagawaand K. Yasui, Journal ofthe Operational Research Society, 31, 851 (1980). 9. A. Gupta and H. Gupta, IEEE Transactions on Reliability, R-30, 161 (1981). 10. T. Nakagawa, Naval Research Logistics Quarterly, 31, 33 (1984). 11. N. Kaio and S. Osaki, IEEE Trans. on Reliability, R-33, 277 (1984). 12. N. Kaio and S. Osaki, Journal of Mathematical Analysis and Applications, 119, 3 (1986). 13. N. Kaio and S. Osaki, RAZRO/ Recherche Op&ationnelle, 22, 387 (1988). 14. N. Kaio and S. Osaki, Journal of the Operational Research Society, 40, 499 (1989). 15. D. J. D. Wijnmalen and J. A. M. Hontelez, European Journal of Operational Research, 62, 96 (1992). 16. C. G. Gassandras and Y. Han, European Journal of Operational Research, 63, 35 (1992). 17. N. Kaio, T. Dohi and S. Osaki, Microelectronics and Reliability, 34, 599 (1994). 18. H. Sandoh and N. Igaki, Journal of Quality i n Maintenance Engineering, 7, 220 (2001). 19. H. Sandoh and N. Igaki, Computers and Mathematics with Applications, 46, 1119 (2003). 20. H. Sandoh and T. Nakagawa, Journal of the Operational Research Society, 54, 318 (2003). 21. D. Fundenburg and J. Tirole, Game Theory, The MIT Press, Massachusetts, (1991). 22. R. Gibbons, Game Theory for Applied Economics, Princeton Univ. Press, New Jersey, (1992). 23. M. J. Osborne and A. Rubinstein, A Course in Game Theory, The MIT Press, Massachusetts, (1994). 24. S. M. Ross, Applied Probability Models with Optimization Applications, Holden-Day, San Francisco, (1970). 25. S . M. Ross, Introduction to Probability Models: 5th edition, Academic Press, New York, (1993).
AN AUTOMATIC DEFECT DETECTION FOR C++ PROGRAMS
s. SARALA Research Scholar
S.VALLI Assistant Professor,
Department of Computer Science and Engineering, College of Engineering, Guindy. Anna University, Chennai-25, India. Email: va1lilii)lannauniv.edu Abstract
In this work a tool is developed to generate test cases automatically for C++ Programs.This approach analyses the prototypes and is focused to detect defects in the program. A defect results due to omission or commission made when developing the software. This work checks the correctness of the program when operator[] is overloaded. in the context of inheritance when virtual function is used, it has been observed that expected results are not achieved under certain circumstances. A test case has been developed to handle this situation. Test case has been prepared to ascertain the working of template hnctions in the context of character input. A test case has been developed to tackle dangling reference problem, In the context of exception handling, test cases have been developed to ascertain the working of the program.
1. Introduction
1.1 Testing Object-Oriented Sofhvare
Software testing is an important software quality assurance activity. The objective of software testing is to uncover as many errors as possible with a minimum cost. A successful test should show that a program contains bugs rather than showing that the program works. Defects of omission are those deviations from the specifications that lead to some intended hnction not being implemented. For example, the inability of a software product to display the result of a calculation or query due to the omission of a print function is a defect due to omission. Defects of commission are those deviations from the specification that although functionally implemented, fail to operate properly. They provide incorrect or unexpected results despite the provision of valid inputs; for instance, the print function printing the address of x rather than its value [4]. These defects are 41 9
420 termed as F-type defects. Removing F-type defects logically improve the functionality of the software [4]. 1.2 Developed Test Cases
This tool detects defects caused by omission or commission. When operator[] is overloaded, the C++ compiler does not check for subscript bounds, which results in defects. The tool detects such flaws and reports to the user. When virtual function is overridden and they differ in signature the C++ compiler ignores this situation. The tool fixes this bug. In the context of template function call if the actual parameter is of type character and the value passed exceeds the length of one, the C++ compiler is not intelligent enough to trap this mistake. When the derived class object is assigned to base class object and a member function is accessed, the result is not as expected under certain circumstances. The tool captures this defect caused by omission and reports the same. Dangling reference leads to execution error. The tool handles this defect. Test cases have been developed for defects encountered with exception handling. 1.3 An Overview
Section 2 discusses the existing work. Section 3 brings out the test cases generated. Section 4 describes the developed algorithm to automatically generate test case for C++ programs. Section 5 reports the working of the algorithm and the results achieved. Section 6 concludes the work. 2. Existing Work Several literatures exist [ l ] [7] 181 [15] [18] [20] in the area of Object-Oriented Programs. E. Sabbatini et.al [14] have concentrated on automatic testing of database confirmation commands. B.Korel et.al [ 121 [ 131 have worked on automatically generating test cases using data dependence analysis. Marcio E.Delamaro et.al, [8] present interface mutation, which is used for integration testing. Jean Hartmann et.al [ 1 I] developed a tool for interfacing to components based on COM/DCOM and CORBA Middleware. They derive the test cases from state charts. Y.G.Kim et.al [15] have proposed a set of coverage criteria based on control and data flow in UML state diagrams and they show how to generate test cases from them. Usha Santhanam [9] has applied the tool, Test Set Editor (TSE) for testing Ada programs. G.Antonio1 eta1 [7] seed a significant number of faults in an implementation of a specific container class,Ordset. They seeded faults that cannot be detected by the compiler and that can possibly be found by running test cases, which is similar to our work. Tse et.al [18] focus on classes with mutable objects, which is based on finite state machines. They analyse the class specifications. Amie L.Souter eta1 [101 have tackled structural testing of Object Oriented software systems with possible unknown clients and unknown information. Bor-Yuan Tsai et.al [ 191 [20] combine functional and structural testing techniques. The approach 1201 uses state charts and state transition trees to generate test
42 1
files and inspection trees (IT) automatically. R.K.Doong eta1 [ 171 describe a set of tools such as test case generation, test driver generation, test execution and test checking for performing testing on Object-Oriented programs. They have exploited testing of parameters and combinations of operations. The authors [ 191 combine functional with structural testing techniques of intra class testing. Using data flow technique the authors detect whether the data member or parameter in public methods have been defined prior to being used.
3. Implementation

Figure 1 depicts the test cases developed for C++ programs. The number of arguments is counted for virtual functions. Test cases have been developed for the following Object-Oriented programming concepts: function templates, inheritance, operator overloading, exception handling and dangling references. The tool is developed such that the array size S is determined first (see Figure 1).
4. Algorithm
1. [Generation of test case for operator[ ]]
/* The lines containing operator[] are extracted using grep and the array size is determined as size */
if (subscript < 0 || subscript > size) display (subscript out of range)
2. [Virtual function header is extracted by grepping virtual( ) as pattern]
/* The function name and the parameter types are determined */
All the corresponding function prototypes are validated for their signatures. A mismatch is displayed as 'parameter mismatch'.
3. [The actual parameters of template function are validated]
/* The presence of a template function is found using grep with the pattern 'template' */
All template function calls are extracted.
until (currentchar[i] != '(' ) i++;
i++;
until (currentchar[i] != ')' ) {
  if (currentchar[i] == '\'' && currentchar[i+2] != '\'')
    { display wrong character input in template function call; break; }
  else {
    if (currentchar[i] is 'a' thru 'z') {
      repeat until (currentchar[i] != ' ' || ')' || ',')
        { temp[j] = currentchar[i]; j++; i++; }
      If the type of temp is char, the value is validated
    }
  }
  i++;
}
4. [Check for object assignment in the context of inheritance]
The base and derived classes are determined using ':' as pattern for the grep command. The base and derived class objects are determined using the corresponding class name and ';' as pattern for the grep command. The base and derived objects are determined as b and d respectively.
b=d is grepped; if it is present, display (statement has no effect due to omission).
5. [Test case to report dangling reference problem]
'class obj = obj1' is grepped. If it is found, 'delete obj1' is used as pattern for grep. If a match occurs, display (dangling reference problem).
6. [Test case in the context of exception handling]
Throw statements are located using grep with pattern 'throw'.
until (currentchar[i] != 'w') i++;
until (currentchar[i] == ' ') i++;
{ /* catch(...) presence is checked; if present, break */ }
if (currentchar[i] == '"')
  { catch (char *) presence is checked; if not present, display (string exception not handled); }
if (currentchar[i] == '\'')
  { catch char presence is checked; if not present, display (char exception not handled); }
until (currentchar[i] != '.') i++;
if (currentchar[i] == '.')
  { catch float presence is checked; if not present, display (float exception not handled); }
if (currentchar[i] is int)
  { catch int presence is checked; if not present, display (int exception not handled); }
default: display (no exception handler)
5. Working of the Algorithm

5.1 Detection of Subscript out of bounds

When [ ] is overloaded in the context of array operations, the C++ compiler does not validate the subscript for out-of-bounds in an assignment operation. This may lead to an execution error; in other circumstances a junk value is returned. A test case has been developed that takes care of this situation, as given by step 1 of the algorithm. The code in Table 1 depicts a situation where an array subscript is out of bounds. There are three elements in the array. When the code displays the fourth element using the overloaded operator [ ], only junk is displayed, whereas if an assignment is attempted using an out-of-bounds subscript, it results in an execution error. These two instances are tackled by the test case meant for validating array subscripts. The developed test case locates the presence of the overloaded operator [ ] by executing grep with the pattern 'operator [ ] ( )'. Once it identifies that [ ] is overloaded, the test case identifies the maximum size of the array by grepping with the pattern 'data type [ ]'. The maximum size is determined as 'size'. The test case then extracts all occurrences of calls to the overloaded operator [ ]. The subscript is checked against 'size' to see whether it is within bounds. If it exceeds the bound or is negative, the test case displays 'subscript out of bound'. When the code in Table 1 was tested, the test case identified that 'cout << ob[4]' and 'ob[4]=44' are out of bounds. If the code in Table 1 is executed as such, ob[4] will display junk, and when an attempt to write into ob[4] is made it results in an execution error. These flaws are not trapped by the compiler; the developed test case detects them.
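For comparison, the following is a minimal sketch (ours, not part of the tool) of how the defect itself can be removed at the source level, assuming the three-element class atype of Table 1:

#include <iostream>
using namespace std;

// A bounds-checked variant of Table 1's atype (our sketch).
class atype {
    int a[3];
public:
    atype(int i, int j, int k) { a[0]=i; a[1]=j; a[2]=k; }
    int &operator[](int i) {
        static int bad = 0;          // fallback storage for bad accesses
        if (i < 0 || i >= 3) {       // 3 is the array size of Table 1
            cout << "subscript out of bound\n";
            return bad;
        }
        return a[i];
    }
};

int main() { atype ob(1,2,3); ob[4] = 44; return 0; }  // prints the message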
5.2 Test case to check the consistency of overridden virtual functions

The second test case checks the consistency of overridden virtual functions. It works by extracting the virtual function by grepping with the pattern 'virtual ( )'. The test case then extracts the number and types of the parameter(s), and extracts all versions of the overridden virtual function by grepping with the pattern 'function name ( )'. The test case extracts the types of the parameter(s) and compares them with the virtual function. If there is a deviation, the test case displays 'parameter mismatch'. When the code in Table 2 was subjected to the test case, it was able to detect the variation in the data type of the overridden function 'vfunc' in the derived class. On execution, the code in Table 2 displayed 'This is a base', which is not the intended result. The tool was successful in detecting this flaw.
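It is worth noting that in C++11 and later, which postdates this work, the override specifier lets the compiler itself trap exactly this mismatch; the following fragment (ours) is intentionally rejected at compile time:

class base {
public:
    virtual void vfunc() {}
};
class derived : public base {
public:
    void vfunc(int) override;  // compile error: does not override base::vfunc()
};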
5.3 Test case for exceptions

The last step of the algorithm tackles defects in the context of exception handling. The test case locates all throw statements by using 'throw' as the pattern for the grep command. The test case then greps all catch statements. It checks whether catch (...) is present by
executing grep. If that catch block is not present, the test case checks all the exceptions raised. The test case checks for string, float and int exceptions by looking for the patterns '"', '.', and (digit). It then greps for the corresponding exception handlers; if a handler is not present, the test case displays 'exception not handled'. For exceptions other than these, the test case displays 'no exception handler'. When the code in Table 3 was subjected to the test case, the test case found the float exception using '.' as pattern. It looked for a float exception handler by executing grep 'catch (float)'. Since it is not present, the test case displayed 'float exception not handled'. The test case identified an int exception and looked for an int handler by executing grep 'catch (int)'; since it is present, it moved to the next exception. The test case found a string exception when it encountered '"'. It located the corresponding exception handler by looking for catch (char *); since the handler is not present, the test case displayed 'string exception not handled'. Since an exception remained that was not captured by grep, the default executed and reported 'exception not handled' for the test() exception. When the code in Table 4 was tested, the test case identified the string exception. The test case was successful in detecting the absence of an exception handler because grep 'catch(char *)' returned 0 lines, so the test case displayed 'string exception not handled'. When the code in Table 5 was tested, function fun was found to be the template function. The parameter a was found to be of type char. The value was validated; since it does not begin with a quote, the test case displayed 'wrong character input in template function call'. When the code in Table 6 was subjected to the test case, the base and derived classes were detected as B and D respectively. Inheritance was identified by grepping with the pattern ':'. The base and derived objects were detected by grepping with the base and derived class names and ';'; b and d were detected as the base and derived objects. Then the pattern base object = derived object was grepped. Due to the presence of the pattern, the test case displayed 'statement has no effect due to omission'.

Table 1. Array Subscript out of Bounds

class atype {
  int a[3];
public:
  atype(int i, int j, int k) { a[0]=i; a[1]=j; a[2]=k; }
  int &operator[](int i) { return a[i]; }
};
main() {
  atype ob(1,2,3);
  cout << ob[4];   // junk
  ob[4] = 44;      // execution error
  return 0;
}

Table 2. Overridden virtual function differing in parameter

class base {
public:
  virtual void vfunc() { cout << "This is a base.\n"; }
};
class derived : public base {
  void vfunc(int) { cout << "This is a derived.\n"; }
};
main() {
  base *p, b;
  derived d1;
  p = &d1;
  p->vfunc();   // base
  return 0;
}

Table 3. An Abnormal Exception Thrown

class sample {
public:
  void fun1() { throw 99; }
  void fun2() { throw 3.14f; }
  void fun3() { throw "error"; }
  void fun4() { throw test(); }
};
main() {
  sample s;
  s.fun4();   // exe error
}
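For reference, a complete handler chain of the kind the test case checks for (our sketch, restricted to fun1-fun3 of Table 3 so that it compiles; the string handler uses const char * for the thrown literal):

#include <iostream>
using namespace std;

class sample {
public:
    void fun1() { throw 99; }
    void fun2() { throw 3.14f; }
    void fun3() { throw "error"; }
};

int main() {
    sample s;
    try {
        s.fun3();
    }
    catch (int e)         { cout << "int: " << e; }
    catch (float e)       { cout << "float: " << e; }
    catch (const char *e) { cout << "string: " << e; }
    catch (...)           { cout << "unknown exception"; }
    return 0;
}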
Table 4. Exception with Missing Catch Block

class sample {
public:
  sample(int i) {
    try { if (i == 0) throw "testing"; }
    catch (int i) { cout << "strange"; }
  }
};
void main() {
  sample s(0);   // execution error
}

Table 5. Template function called with a wrong input instead of a character
void fun(float a) { cout << "float " << a; }
template <class T>
void fun(T a) { cout << "all in one" << a << endl; }
void main() {
  char a = 10;   // prints nothing
  fun(a);
  float x = 25.09f;
  fun(x);
}

Table 6. Code with statement which has no effect
class B {
public:
  int i;
  B() { i=10; }
  void print() { cout << "in b" << i; }
};
class D : public B {
public:
  int i;
  D() { i=70; }
  void print() { cout << "in d" << i; }
};
6. Conclusion

C++ programs were tested for defects. A class is an abstraction of the commonalities among its instances; therefore, the testing process must ensure that representative samples of members are selected for testing. The ultimate aim of this tool is to reduce the defect rate for users of a class. In this work a tool has been developed to automatically generate test cases. The tool reports any deviation between the actual and formal parameters and detects defects that are not trapped by the compiler. Hence, the tool ascertains that the test cases bring out otherwise undetected bugs.
References
1. G. Rothermel, M. J. Harrold and Jeinay Dedhia, "Regression Test Selection for C++ Software", Journal of Software Testing and Reliability, Vol.10, No.2, pp. 135 (2000).
2. B. Beizer, Software Testing Techniques, Van Nostrand Reinhold, New York (1990).
3. G. Booch, Object Oriented Analysis and Design With Applications, Benjamin/Cummings, Redwood City, California (1994).
4. Houman Younessi, Object-Oriented Defect Management of Software, Prentice Hall, USA (2002).
5. Edward Kit, Software Testing in the Real World, Addison-Wesley (1995).
6. Glenford J. Myers, The Art of Software Testing, John Wiley & Sons (1979).
7. G. Antoniol, L. C. Briand, M. Di Penta and Y. Labiche, "A case study using the round trip strategy for state based class testing", Technical Report, Research Centre on Software Technology, pp. 1-10 (2002).
8. Marcio E. Delamaro, Jose C. Maldonado and Aditya P. Mathur, "Interface Mutation: An Approach for Integration Testing", IEEE Transactions on Software Engineering, Vol.27, No.3, pp. 228-248 (2001).
9. Usha Santhanam, "Automatic Software Module Testing for FAA Certification", ACM SIGAda, pp. 31-37 (2001).
10. Amie L. Souter and Lori L. Pollock, "OMEN: A Strategy for Testing Object Oriented Software", ACM International Symposium on Software Testing and Analysis, Portland, pp. 49-59 (2000).
11. Jean Hartmann, Claudio Imoberdorf and Michel Meisinger, "UML-Based Integration Testing", ACM International Symposium on Software Testing and Analysis, Portland, pp. 60-70 (2000).
12. B. Korel, "Automated software test data generation", IEEE Transactions on Software Engineering, Vol.16, No.8, pp. 870-879 (1990).
13. B. Korel and Roger Ferguson, "The Chaining Approach for Software Test Data Generation", ACM Transactions on Software Engineering and Methodology, Vol.5, No.1, pp. 62-86 (1996).
14. E. Sabbatini, M. Crubellati and S. Siciliano, "Automating test by adding formal specification: an experience for database bound applications", Advances in Engineering Software, 30, pp. 885-890 (1999).
15. Y. G. Kim, H. S. Hong, D. H. Bae and S. D. Cha, "Test Case Generation from UML State Diagrams", IEE Proceedings on Software, Vol.4, No.4 (1999).
16. Vincenzo Martena, Alessandro Orso and Mauro Pezze, "Interclass Testing of Object Oriented Software", In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA '00), pp. 1-10 (2000).
17. R. K. Doong and P. G. Frankl, "The ASTOOT approach to testing object oriented programs", ACM Transactions on Software Engineering and Methodology, Vol.3, No.2, pp. 101-130 (1994).
18. T. H. Tse and Zhinong Xu, "Test Case Generation For Class Level Object-Oriented Testing", Quality Process Convergence: Proceedings of 9th International Software Quality Week, San Francisco, California, pp. 1-12 (1996).
19. Bor-Yuan Tsai, Simon Stobart, Norman Parrington and Ian Mitchell, "A State Based Testing Approach providing Data-flow coverage in Object-Oriented Class Testing", 15th International Conference on Advanced Science and Technology, Argonne, Illinois, pp. 1-18 (1999).
20. Bor-Yuan Tsai, Simon Stobart and Norman Parrington, "A Method for Automatic Class Testing Object-Oriented Programs Using A State-based Testing Method", 5th European Conference on Software Testing Analysis & Review, Edinburgh, pp. 1-10 (1997).
APPROXIMATION METHOD FOR PROBABILITY DISTRIBUTION FUNCTIONS BY COXIAN DISTRIBUTION
YUKIE SASAKI, HIROEI IMAI, IKUO ISHII
Graduate School of Science and Technology, Niigata University, Niigata, 950-2181 Japan
E-mail: sasaki@cq.ie.niigata-u.ac.jp
MASAHIRO TSUNOYAMA
Niigata Institute of Technology, Kashiwazaki, 945-1195 Japan

This paper provides an extension of the approximation method proposed in the literature [1]-[2], using the Coxian distribution. The existing method can approximate only probability distribution functions with the monotone decreasing property. The method proposed in this paper uses a combination of hyper-exponential and Erlang distributions and makes it possible to approximate non-monotonic distributions. We also present some examples to show the usefulness of the approximation method.
1. Introduction

Conventional performance evaluation methods, such as the GSPN (Generalized Stochastic Petri Net) model, have used only exponential probability distribution functions for modeling computer systems. However, several probability distributions for multimedia traffic on the Internet cannot be approximated by the exponential distribution [3],[4]. In order to evaluate system performance, Coxian distribution functions are often used. This probability distribution can approximate arbitrary probability distribution functions, so several approximation methods using it have been proposed [5],[6]. The accuracy of the analysis and the number of states should then be fully investigated when these methods are applied to evaluation. An indirect approximation method has been proposed in the literature [1],[2]. In this method, a given probability distribution function is approximated by a linear combination of exponential distributions, and the combination obtained is converted to a Coxian distribution. However, that method can be applied only to probability distribution functions having the monotone decreasing property. In this paper we propose an alternative method for approximating non-monotonic probability distribution functions using a Coxian distribution. The proposed method extends the existing method by introducing new procedures. First, we give the definition of the Coxian distribution in Section 2. Section 3 provides an approximation method using the Coxian distribution. In Section 4,
we show some examples to illustrate the approximation performance. Finally, in Section 5 we state our conclusions and refer to future work.
2. Coxian distribution

The Coxian distribution function represents the total sojourn time in a queueing network [7]. When a task enters the network, it stays at node 1 for a time period following an exponential probability distribution function with parameter λ_1. After that, the task goes to the next node with probability a_1 or leaves the network with probability b_1. The probabilities a_l and b_l are called the arrival probability and departure probability, respectively. When the task reaches the final node L, it exits the network, with probability 1, after spending a time following an exponential probability distribution function with parameter λ_L. The Laplace transform of the density function of the Coxian distribution is represented by [7]

f^*(s) = \sum_{l=1}^{L} \Bigl( \prod_{i=1}^{l-1} a_i \Bigr) b_l \prod_{i=1}^{l} \frac{\lambda_i}{s + \lambda_i}.   (1)
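To make the phase interpretation concrete, the following is a minimal sketch (ours, not from the paper) of sampling a Coxian-distributed sojourn time by simulating the node-by-node journey described above; lambda holds the rates λ_l and a the arrival probabilities a_l:

#include <random>
#include <vector>

// Sample one Coxian sojourn time: stay at node l for an Exp(lambda[l])
// period, then continue with probability a[l] or depart with b[l] = 1 - a[l].
double coxian_sample(const std::vector<double>& lambda,
                     const std::vector<double>& a,   // size L-1: no arrival at node L
                     std::mt19937& gen) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    double t = 0.0;
    for (std::size_t l = 0; l < lambda.size(); ++l) {
        std::exponential_distribution<double> stage(lambda[l]);
        t += stage(gen);                          // sojourn at node l
        if (l + 1 == lambda.size() || u(gen) >= a[l])
            break;                                // departure from the network
    }
    return t;
}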
3. Approximation method

3.1. Overview of the method

The method proposed in this paper first approximates a target probability distribution by a linear combination of exponential and/or Erlang distribution functions, and then converts the combination into a Coxian distribution function. This paper describes mainly the first part of the approximation, i.e. the method for approximating a given distribution by a linear combination of distribution functions.

3.2. Preliminary

Let F(x) be the probability distribution function of the target distribution, H(x) be its approximation, and H'(x) be the estimate of H(x). The complementary probability distribution function is defined as follows.
[Definition 1] (Complementary probability distribution function)[2]
The complementary probability distribution function F^c(x) of F(x) is defined by

F^c(x) = 1 - F(x).   (2)
The mean relative error is defined as follows, where H^c(x) is given by 1 - H(x).
[Definition 2] (Mean relative error)[2]
RE_k(x_i, x_j) is the mean relative error between F^c(x) and H^c(x) in the interval [x_i, x_j] (i, j = 1, 2, ..., i < j):

RE_k(x_i, x_j) = \frac{1}{x_j - x_i} \int_{x_i}^{x_j} \bigl| F^c(x) - H^c(x) \bigr| \, dx.   (3)
3.3. Approximation procedure

We divide the target distribution function into two parts: a monotonic part and unimodal parts.

3.3.1. Division of target distribution
The probability density function f(x) around x = 0 is called the monotonic part P_0, and the rest of the function is divided into unimodal parts P_m (m = 1, ..., M). The parameters for them are denoted as follows: K is the number of samples of the target distribution; x_k is the value of the random variable x at the k-th sample; H_m(x) (m = 0, ..., M) is the approximated distribution function for P_m; N_m (m = 0, ..., M) is the maximum number of nodes for H_m(x); d_m (m = 1, ..., M) is the value of x at the border of P_m and P_{m-1}, that is, d_m is the m-th minimal value of f(x). Also, let t_m (m = 1, ..., M) be the value of x at the peak of the m-th unimodal part, E_m (m = 0, ..., M) be the target error for P_m, ε be a small real number, and x_ε be the value of x at which the target distribution F^c(x) satisfies F^c(x_ε) = ε.
3.3.2. Approximation of monotonic part

The length of the tail in the target distribution is evaluated by Procedure 1 below. If the tail of the distribution is evaluated to be long, the monotonic part is approximated by a hyper-exponential distribution; otherwise it is approximated by an exponential distribution. For example, the hyper-exponential distribution approximating the monotonic part is shown in Figure 1. The parameter of the exponential distribution H_0(x) for the monotonic part is then determined by the gradient of the target distribution at x = 0.
[Procedure 1]
(1) Determine the parameter of the estimated distribution H'_0(x) for the monotonic part by λ'_0 = f(0), and the parameter of the estimated distribution H'_M(x) by

\lambda'_M = (N_M - 1)/t_M,   (4)

where Eq.(4) is derived from the first derivative of the Erlang distribution function, the value of x at the peak of H'_M(x) being equal to t_M (a short check of Eq.(4) is sketched after this procedure).
Figure 1. Monotonic part.
Figure 2. Unimodal part.
Determine the least upper bound x_amax by

x_amax = max(x_{ε,0}, x_{ε,M}),   (5)

where x_{ε,0} for H_0(x) and x_{ε,M} for H_M(x) satisfy the conditions

H_0(x_{ε,0}) = ε,   H_M(x_{ε,M}) = ε.

(2) If x_amax < x_ε, then go to Procedure 2; otherwise go to Procedure 3.
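As a quick check of Eq.(4) (our derivation, assuming the standard Erlang density with N nodes and rate λ):

f(x) = \frac{\lambda^N x^{N-1} e^{-\lambda x}}{(N-1)!}, \qquad f'(x) = 0 \;\Rightarrow\; \lambda = \frac{N-1}{x},

so at the peak x = t_M the rate is λ = (N_M - 1)/t_M, which is exactly the relation used for λ'_M.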
Next we determine the number of nodes and the parameters by the following Procedure 2. In the procedure, n_0 denotes the number of nodes of the hyper-exponential distribution function.
[Procedure 2]
(1) Set n_0 = 2.
(2) Determine the range C_0 for the candidate values x_k of x. Determine the parameters λ_{n_0,k} and ρ_{n_0,k} for x_k by [8].
(3) Determine the parameters λ_0 and ρ_0, provided that both λ_{0,k} and ρ_{0,k} give the minimum RE_k(x_amax, x_K).
(4) If min_k(RE_k(x_amax, x_K)) ≤ E_0 or n_0 ≥ N_0, then replace the target distribution F^c(x) by F^c(x) - H^c_{n_0}(x), calculate the new x_ε from the new F^c(x), and go to Procedure 3. Otherwise, increase n_0, replace the target distribution F^c(x) by F^c(x) - H^c_{n_0}(x), calculate the new x_ε from the new F^c(x), and return to Step 2.
The parameters of the exponential distribution function H_{0,1}(x) are determined from the gradient of the target distribution in the interval [0, d_1]. The values of x used to determine the parameters are called datum points.
[Procedure 3]
(1) Determine the parameters λ'_{0,1} and ρ'_{0,1} of the exponential distribution H'_{0,1}(x) approximating the interval [0, d_1] by the least squares method.
(2) Determine the value x_{ε,0} where H'_{0,1}(x_{ε,0}) = ε. If x_{ε,0} < x_ε, then set the datum points at x = 0 and d_1 and go to Procedure 4-a, since the tail of the exponential distribution in the interval [0, d_1] is short. Otherwise set the datum points at x = 0 and a point on the tail, and go to Procedure 4-b, since the tail is long.
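Step (1) above amounts to a linear least-squares fit in log space; a minimal sketch (ours; the function name and sampling are illustrative), assuming the monotonic part is modeled as h(x) = ρ λ e^{-λx}:

#include <cmath>
#include <vector>

// Fit ln f(x) = ln(rho*lambda) - lambda*x by ordinary least squares
// over sample points (x[k], f[k]) taken from the interval [0, d1].
void fit_exponential(const std::vector<double>& x,
                     const std::vector<double>& f,
                     double& lambda, double& rho) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    const std::size_t n = x.size();
    for (std::size_t k = 0; k < n; ++k) {
        double y = std::log(f[k]);
        sx += x[k]; sy += y; sxx += x[k] * x[k]; sxy += x[k] * y;
    }
    double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double intercept = (sy - slope * sx) / n;
    lambda = -slope;                        // decay rate
    rho = std::exp(intercept) / lambda;     // weight, since rho*lambda = e^intercept
}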
[Procedure 4-a]
(1) Determine the parameter of the Erlang distribution h'_1(x), which is adjacent to the exponential distribution for the monotonic part, by

λ'_1 = (N_1 - 1)/t_1,

and set the value of f'(d_1) by

f'(d_1) = f(d_1) - h'_1(d_1).   (11)

(2) Calculate the parameters of the exponential distribution from the datum points f(0) and f'(d_1), where

λ_{0,1} = ln(f(0)/f'(d_1))/d_1,   ρ_{0,1} = f(0)/λ_{0,1}.   (12)

(3) Replace the target distribution by

F^c(x) ← F^c(x) - H^c_{0,1}(x).   (13)
In Step 1 of the above procedure, the value of f'(d_1) is set by Eq.(11) to prevent degradation of accuracy due to the effect of h_1(x).
[Procedure 4-b]
(1) Calculate the parameters λ_{0,k} and ρ_{0,k} using datum points at x = 0 and x_k (∈ C_0 = [x_M, x_ε]):

λ_{0,k} = ln(f(0)/f(x_k))/x_k,   ρ_{0,k} = f(0)/λ_{0,k}.   (14)
(2) Calculate the parameters of the estimated distribution H'_1(x) for the approximated distribution H_1(x) for the unimodal part. This part is adjacent to the exponential distribution approximating the monotonic part.
(3) Determine the parameters λ_{0,1} and ρ_{0,1}, provided that both λ_{0,k} and ρ_{0,k} give the minimum RE_k(0, d_1).
(4) Replace the target distribution, i.e.

F^c(x) ← F^c(x) - H^c_1(x).   (17)
3.3.3. Approximation of the unimodal parts

After approximating the monotonic part, the unimodal parts are approximated using the Erlang distributions H_1(x), ..., H_M(x) in Procedure 5. Figure 2 shows an example of approximation for the unimodal part f(x) - h_0(x). The approximations are made from the tail of the distribution and continue while the average error is larger than the target error or the number of nodes is less than the specified maximum number of nodes N_m.
[Procedure 5]
(1) Set m = M.
(2) Set n_m = 2.
(3) Calculate the parameters λ_{m,k} and ρ_{m,k} by using the candidate values x_k (∈ C_m = [d_m, d_{m+1}]).
(4) Determine the parameters λ_m and ρ_m by the λ_{m,k} and ρ_{m,k} which give the minimum RE_k(d_m, x_K) in the interval (d_m, x_K].
(5) If min_k(RE_k(d_m, x_K)) ≤ E_m or n_m ≥ N_m, then go to Step 6; otherwise increase the number of nodes n_m and return to Step 3.
(6) If m = 1, then the approximation procedure is complete; otherwise replace the target distribution, decrement m, and return to Step 2.
3.4. Parameters of the Coxian distribution function

We determine the parameters λ_l and b_l in Eq.(1) via [2]:

b_l = \frac{\rho_l - \sum_{t=l+1}^{L} b_t (\beta_{t,l} - \beta_{t-1,l})}{\beta_{l,l}}.   (20)

In the above equation, β_{t,l} is the coefficient of 1/(s + λ_l) when the product \prod_{i=1}^{t} λ_i/(s + λ_i) in each term of Eq.(1) is expanded. Parameter ρ_l is the model parameter, and n_l is the number of nodes having parameters equal to λ_l.
4. Examples

Figures 3 and 4 show the probability distributions of task arrival processes in NLANR (National Laboratory for Applied Network Research) traces [9]. Example 1 corresponds to the task arrival process for ftp in OSU-2000, and Example 2 to the task arrival process for telnet in APN-1999. Figure 5 shows the phase structure of the approximated distribution for Example 1, and Figure 7 shows the Coxian distribution function for it. First, we determine the parameters ρ_{0,2} = 0.029, λ_{0,2} = 0.002 of H_{0,2}(x), which approximates the tail of the distribution, by Procedure 2, and ρ_{0,1} = 0.207, λ_{0,1} = 0.083 of H_{0,1}(x), which approximates [0, d_1], by Procedure 4-a. Finally, we obtain ρ_2 = 0.319, λ_2 = 0.035 of H_2(x), which approximates the unimodal part with t_2 = 138, and ρ_1 = 0.445, λ_1 = 0.138 of H_1(x), which approximates the unimodal part with t_1 = 27, by Procedure 5. The relative error of the unimodal part in the tail was quite small; however, the error increases as the number of nodes decreases. In the examples the number of nodes for approximation is limited to less than 20. Figure 6 shows the phase structure of the approximated distribution for Example 2, and Figure 8 shows the Coxian distribution function for it. We determine the parameters ρ_{0,2} = 0.028, λ_{0,2} = 0.002 of H_{0,2}(x) by Procedure 2, the parameters ρ_{0,1} = 0.769, λ_{0,1} = 0.007 of H_{0,1}(x) by Procedure 4-b, and the parameters ρ_1 = 0.203, λ_1 = 0.054 of H_1(x) by Procedure 5. The relative error of Example 2 is quite small.

5. Conclusion

In this paper, we have shown an approximation procedure for probability distributions with the non-monotonic property, extending the existing method in [2]. We have also shown the usefulness of this approximation method through examples. We intend to improve the approximation accuracy and to apply the proposed approximation method to the performance evaluation of actual systems. We will also evaluate multimedia systems using the same method.
References
1. Y. Sasaki, H. Imai, M. Tsunoyama and I. Ishii, "Approximation method for probability distribution functions using Cox distribution to evaluate multimedia systems", Proc. 2001 Pacific Rim International Symp. on Dependable Computing, pp. 333-340, 2001.
2. Y. Sasaki, H. Imai, M. Tsunoyama and I. Ishii, "Approximation of probability distribution functions by Coxian distribution to evaluate multimedia systems", Trans. of IEICE (D-I), Vol. J85-D-I, No.9, pp. 887-895, 2002.
3. W. T. Marshall and S. P. Morgan, "Statistics of mixed data traffic on a local area network", Computer Networks and ISDN Systems, No.10, pp. 185-195, 1985.
4. J. Beran, R. Sherman, M. S. Taqqu and W. Willinger, "Long-range dependence in variable-bit-rate video traffic", IEEE Transactions on Communications, Vol.43, No.2/3/4, pp. 1566-1579, 1995.
5. S. Asmussen, O. Nerman and M. Olsson, "Fitting phase-type distributions via the EM algorithm", Scandinavian J. Statist., Vol.23, pp. 419-441, 1996.
6. M. C. Heijden, "On the three moment approximation of a general distribution by a Coxian distribution", Probability in the Engineering and Informational Sciences, Vol.2, pp. 257-261, 1988.
7. E. Gelenbe and I. Mitrani, "Analysis and Synthesis of Computer Systems", Academic Press, 1988.
8. A. Feldmann and W. Whitt, "Fitting mixtures of exponentials to long-tail distributions to analyze network performance models", Performance Evaluation, Vol.31, pp. 245-279, 1998.
9. S. Ata, M. Murata and H. Miyahara, "Analysis of network traffic for the design of high-speed layer 3 switches", Technical Report of IEICE, Q98-35, pp. 29-36, 1998.
Figure 3. Approximate probability distribution for Ex.1.
Figure 4. Approximate probability distribution for Ex.2.
Figure 5. Phase structure of approximated distribution in Ex.1.
Figure 6. Phase structure of approximated distribution in Ex.2.
Figure 7. Coxian distribution for Ex.1.
Figure 8. Coxian distribution for Ex.2.
TUMOR TREATMENT EFFICACY BY FRACTIONATED IRRADIATION WITH GENETIC RADIOTHERAPY *
T. SATOW AND H. KAWAI
Tottori University, 4-101 Koyama-Minami, Tottori 680-8552, JAPAN
E-mail: [email protected], kawai@sse.tottori-u.ac.jp
Genetic and radiologic therapies have recently been combined to improve the outcome of cancer treatment. From the standpoint of evaluating the influence of radiotherapy, some useful objective indicators have been proposed; TCP, NTCP and EUD are typical instances. Using these reliable indicators, we propose two objective functions to evaluate tumor treatment efficacy by fractionated irradiation with genetic radiotherapy. The transforming probability from a susceptible cell to a transfected cell and the transduced fraction, in which the number of viral vector injections is a variable, are used for the formulation. Some numerical illustrations are given to show the influence of these parameters.
1. Introduction

1.1. Basic Indices

As a result of remarkable technical improvement, radiation therapy has become a necessary and indispensable tool in cancer treatment. Along with rapid hardware improvement, the development of software for treatment planning and scheduling is also important for achieving efficient treatment. Effective software development requires the establishment of evaluation techniques based not only on empirical administration but also on a theoretical criterion. The hit-and-target model1 is widely known as a technique to derive the survival probability of a cell from the number of lesions caused by radiation. The concept of the hit-and-target model is that each cell contains a site (the target) that is indispensable for its survival, and that cell death occurs when a radiation particle hits the target. For cells in microorganisms and mammalian organs, survival probabilities calculated with the target theory may fit well; however, this is not proof of the validity of present target models. The linear-quadratic (LQ) model2 pays close attention to the double helical structure of the DNA strand. Cell survival is modeled by assuming that the critical lesion is not an easily repaired single-strand break but a difficult-to-repair double-strand break.

*This work is supported by Grant-in-Aid for Young Scientists (B) 15710115
From biological viewpoints, the tumor control probability (TCP) and the normal tissue complication probability (NTCP) are well-known and useful objective indicators. TCP is defined as the probability of local control given the planned dose distribution. Tome and Fowler4 use TCP to indicate the effectiveness of a modest boost dose to an arbitrary subvolume of the tumor. Waaijer et al.3 investigated the influence of the waiting time for radiotherapy on the consequences of volume increase; they also used TCP to evaluate consequences for outcome. However, the irradiation of the tumor influences normal cells in the region adjacent to the tumor. NTCP has been proposed as a probability which measures the influence of irradiation on normal cells. For instance, NTCP is used to describe the dose-volume tolerance for radiation-induced liver disease5. Several kinds of NTCP models have been proposed for suitable estimation, and a comparison among these NTCP models was studied for predicting the incidence of radiation pneumonitis6. From the instances above, TCP and NTCP are being used as effective indicators. From a physical viewpoint, the dose-volume histogram (DVH) is used as a constraint. Since the DVH can indicate the dose-volume relation in the normal tissue surrounding the tumor, it is generally used as an index which judges the quality of a treatment scheme7. Recently, the concept of equivalent uniform dose (EUD) was introduced by Niemierko8,9. EUD for tumors is defined as the biologically equivalent dose that, if given uniformly, will lead to the same cell kill in the tumor volume as the actual nonuniform dose distribution. The advantage of EUD is that it reduces the number of parameters which compose an objective function for radiation therapy; further, it allows exploration of a large solution space for radiation therapy10. As for these indices, the following relation is known: by DVH reduction techniques, the DVH of a complicated 3D distribution can be replaced by the DVH of a uniform dose distribution where the whole organ is irradiated with EUD11. Further, it is possible to convert it into NTCP by using a monotonically increasing function such as a sigmoid function. These indices are used for the optimization problem of radiation therapy scheduling. Mathematical optimization methods are still being actively researched by the operations research community12.
1.2. Therapies
As treatment methods using radiation, many techniques have been proposed. The use of drugs such as hypoxic cell sensitizers has been a typical means in clinical practice15. Fractionated irradiation is generally known as an effective therapeutic procedure in medical practice16,17; it is a technique exploiting the difference in radiosensitivity between normal and tumor tissues by time-dose fractionation. Meanwhile, genetic radiotherapy has begun to be researched as a method for technically improving the efficiency of radiation therapy. A basic concept of genetic radiotherapy is to improve the radiotherapeutic effect by controlling the radiosensitivity of the cell by gene therapy. Wheldon et al.13 discussed modeling the enhancement of fractionated radiotherapy by gene transfer to sensitize tumor cells to radiation. Keall et al.14 investigated the influence of gene therapy via TCP. In addition, techniques such as brachytherapy (BT) and intensity-modulated radiotherapy (IMRT) are known to be effective.
1.3. The Aim of This Work

Research on criterion modeling for treatment schemes of genetic radiotherapy is scarce. At present there are no patient outcome data concerning genetic radiotherapy, as pointed out by Keall et al.14. However, discoveries of tumor suppressor genes keep being reported, and strong development of genetic radiotherapy is expected in the future. There are chiefly two purposes in this work. The first is to propose two objective functions to evaluate genetic radiotherapy by using the indices whose effectiveness is recognized in radiation therapy. An important point in the evaluation of genetic radiotherapy is to estimate the transduced fraction of the tissue after virus injections. The transforming probability from a susceptible cell to a transfected cell and the transduced fraction, in which the number of viral vector injections is a variable, are used for the formulation of the objective functions. The second is to examine the influence of the treatment scheme on the objective functions. Some numerical examples are given to understand the influence of the treatment parameters.
2. Notations and Model Description

The total dose of the scheduled treatment is D (≥ 0). The total dose D is delivered by n fractionated irradiations, i.e. D = D_1 + D_2 + ... + D_n. The notation D_j (j = 1, 2, ..., n) indicates the dose of the j-th irradiation. We define the set of sequences of irradiation doses accordingly; N is an integer value set. For the purpose of improving the radiosensitivity, genetic therapy is executed before irradiation. The following explanation of genetic radiotherapy quotes from the papers13,14. A tumor is injected in multiple positions with a viral vector. The therapeutic gene carried by the virus is released into the infected cell, and the radiosensitivity of the cells containing the transfected gene is increased. As a result of the transfection, tumor cells are classified into three populations. The first population consists of transfected cells, whose radiosensitivity is controlled. The second consists of susceptible cells, which are available for transfection. The third population consists of impervious cells, which are not available for transfection. A part of the susceptible cells transforms into transfected cells on each repetition of the viral vector injection. Now, we define the transduced fraction of cells: the transduced fraction is the ratio of the number of transfected cells to the number of tumor cells. Obviously, the transduced fraction depends on the number of viral vector injections. A function τ(j) (j = 0, 1, 2, ...) is the transduced fraction of cells after the j-th viral vector
injection. It is assumed that τ(0) = 0 and that the number of cells in the target tumor is m. The probability of transforming a susceptible cell into a transfected cell is p (constant). The initial fraction of impervious cells is a. The numbers of transfected, susceptible and impervious cells after j viral vector injections are defined as N_t(j), N_s(j) and N_i(j), respectively. These numbers are
N_t(j) = [1 - (1 - p)^j] N_s(0),   (2)
N_s(j) = (1 - p)^j N_s(0),   (3)
N_i(j) = a m,   (4)
where N_s(0) = m - am. Clearly, for all j (≥ 0), N_t(j) + N_s(j) + N_i(j) = m. Therefore, the transduced fraction of cells after the j-th viral vector injection is

τ(j) = N_t(j)/m = (1 - a)[1 - (1 - p)^j].   (5)
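As a small numerical sketch (ours; m and p below are assumed values, while a is the value used later in Section 4), Eqs.(2)-(5) can be tabulated as follows:

#include <cmath>
#include <cstdio>

int main() {
    const double m = 1.0e6;    // tumor cells (assumed)
    const double a = 0.01;     // initial impervious fraction (Section 4)
    const double p = 0.3;      // transforming probability (assumed)
    const double Ns0 = m - a * m;
    for (int j = 0; j <= 5; ++j) {
        double Nt = (1.0 - std::pow(1.0 - p, j)) * Ns0;  // Eq.(2)
        double Ns = std::pow(1.0 - p, j) * Ns0;          // Eq.(3)
        double Ni = a * m;                               // Eq.(4)
        double tau = Nt / m;                             // Eq.(5)
        std::printf("j=%d Nt=%.0f Ns=%.0f Ni=%.0f tau=%.3f\n",
                    j, Nt, Ns, Ni, tau);
    }
    return 0;
}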
2.1. TCP for Genetic Radiotherapy
TCP for genetic radiotherapy is derived under interpatient heterogeneity and nonuniform dose delivery. It is assumed that the viral vector injection is carried out at a fixed rate relative to the irradiations, and the first injection must be done before the first irradiation. A coefficient ρ denotes the density of clonogenic cells, and W_k is the volume of the tumor receiving fractional dose D_k. Based on Keall14, TCP is derived as

TCP = \prod_{k=1}^{n_k} \exp\bigl(-\rho W_k M(n, k, D)\bigr).   (6)
A function M(i, k, D) is the number of surviving cells for subpopulation k at the i-th fractionated irradiation, and n_k is the number of subpopulations. For deriving the function M(i, k, D), we use the concept of the LQ model; the resulting expression is Eq.(7), in which a coefficient ε, called the enhancement factor, scales the radiosensitivity of transfected cells, and the brackets [ ] denote the Gauss sign: for instance, [X] means the integer part of X. The enhancement factor ε is the fractional increase in radiosensitivity. It is defined as follows13:

ε = (log kill with gene targeting) / (log kill without gene targeting).   (8)

The log kill is the parameter of the surviving fraction; e.g., a surviving fraction of 10^-3 is a 3 log kill, so if gene targeting improves a 3 log kill to a 3.9 log kill, then ε = 1.3. The range of the enhancement factor was investigated by Lammering18.
2.2. NTCP

Several kinds of NTCP models have been proposed6. NTCP is derived from a sigmoid relation5,6 between the complication and EUD,

NTCP(n, D) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t(n,D)} e^{-u^2/2} \, du,   (9)

with

t(n, D) = \frac{EUD(n, D) - TD_{50}}{s \cdot TD_{50}}.

The notation TD50 represents the dose for a complication rate of 50%, and the coefficient s is a slope parameter. EUD is a map of the DVH to the dose space; some conversion methods have been proposed, and we use the LKB model6. The coefficient w of the LKB model is a power-law exponent, which depends on the organ and the endpoint, and the coefficient v_tot is the total volume which is irradiated over the n fractions.
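A minimal sketch (ours, assuming the probit form of the LKB sigmoid given above) of evaluating NTCP from a given EUD:

#include <cmath>

// NTCP as the standard normal CDF of t = (EUD - TD50) / (s * TD50).
double ntcp(double eud, double td50, double s) {
    double t = (eud - td50) / (s * td50);
    return 0.5 * (1.0 + std::erf(t / std::sqrt(2.0)));
}
// Example: eud = 50 Gy, td50 = 50 Gy, s = 0.12 gives NTCP = 0.5.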
3. Objective Functions

To evaluate a treatment scheme, many objective functions keep being proposed and improved. It is very difficult to decide which objective function is appropriate for genetic radiotherapy; objective superiority or inferiority cannot be judged due to the lack of clinical data. We propose two kinds of objective functions. Let the general efficacy criterion be Φ(n, D). The final purpose is to maximize the objective function, defined as the problem

\max_{(n, D) \in \Lambda} \Phi(n, D),

where D_max is the upper bound of the dose for irradiation and the set Λ is the set of all treatment schemes. It should be noted that the purpose of this work is not the optimization problem of the function Φ itself. We propose the following two models for Φ(n, D).
3.1. Multiplication Model

The objective function of the multiplication model is defined as TCP(n, D) times 1 - NTCP(n, D):

\Phi_{Multi}(n, D) = TCP(n, D) \cdot \bigl(1 - NTCP(n, D)\bigr).

The conception of the multiplication model is extremely simple. The responses of TCP and NTCP to the dose are direct opposites; therefore, a trade-off relation is established between TCP and NTCP. If TCP and NTCP are 0.86 and 0.50 under a scheme δ_1, then the objective value is Φ_Multi(δ_1) = 0.43. To obtain a good score, it is important to keep NTCP low.
3.2. Difference Model

The objective function of the difference model is defined as follows:

\Phi_{Diff}(n, D) = TCP(n, D) - NTCP(n, D),   (14)

where

(n, D) \in \Lambda(x, y),   (15)
\Lambda(x, y) = \Lambda_{TCP}(x) \cap \Lambda_{NTCP}(y),   (16)
\Lambda_{TCP}(x) = \{(n, D) \mid TCP(n, D) \ge x\},   (17)
\Lambda_{NTCP}(y) = \{(n, D) \mid NTCP(n, D) \le y\}.   (18)
The concept of this model is that there are criteria which TCP and NTCP should satisfy. In Eqs.(14)-(18), the lower bound of TCP is x (> 0) and the upper bound of NTCP is y (≥ 0). When (n, D) satisfies Λ(x, y), the difference between TCP(n, D) and NTCP(n, D) should be maximized. Unlike the multiplication model, a focus of this model is whether schemes satisfying the criteria exist or not.
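The two objective functions are cheap to evaluate once TCP and NTCP are known; a small sketch (ours), using the hypothetical scheme with TCP = 0.86 and NTCP = 0.50 from Section 3.1:

#include <cstdio>

// Multiplication model: TCP * (1 - NTCP).
double phi_multi(double tcp, double ntcp) { return tcp * (1.0 - ntcp); }

// Difference model: TCP - NTCP, defined only on the feasible set
// { TCP >= x, NTCP <= y } of Eqs.(15)-(18).
double phi_diff(double tcp, double ntcp, double x, double y, bool& ok) {
    ok = (tcp >= x) && (ntcp <= y);
    return ok ? (tcp - ntcp) : 0.0;
}

int main() {
    bool ok;
    std::printf("multi = %.2f\n", phi_multi(0.86, 0.50));       // 0.43
    std::printf("diff  = %.2f (feasible=%d)\n",
                phi_diff(0.86, 0.50, 0.0, 1.0, ok), (int)ok);   // 0.36
    return 0;
}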
4. Numerical Illustrations

The parameters used throughout the numerical experiments are set as follows. It is assumed that the tumor being treated is homogeneous and that the influence of each irradiation on each part of the tumor is stochastically equivalent. For simplification, n_k = 1. The total volume of the tumor v_tot is 100 cm^3. The clonogen density ρ is 1/2.43 x 10^7. The LQ parameter for single-strand breaks α is 0.305, and α/β is 10 Gy (early tissue). The enhancement factor ε is 1.3. The initial fraction of impervious cells a is 0.01. The rate of viral vector injection to radiotherapy is 1, meaning that a viral vector injection is executed at each irradiation. TD50 is 50 Gy, the power-law exponent w is 0.88, and the slope parameter s is 0.12. The dose of each irradiation is taken to be the total dose divided equally by the number of fractions. Figures 1 and 2 give TCP as a function of dose and the number of fractions (abbreviation: NF) for transforming probabilities (p) 0.2 and 0.8. TCP increases with the transforming probability p. This shows that, as a result of genetic radiotherapy, the tumor can be controlled with low-dose radiation. It corresponds to the clinical study result that NTCP increases with the dose per fraction.
Figure 1. TCP for p=0.2 (Dose - NF)
Figure 2. TCP for p=0.8 (Dose - NF)
Figure 3. Multiplication Model Score for p=0.3 (Dose - NF)
Figure 4. Multiplication Model Score for Dose 50Gy (NF - p)
Figure 3 gives the score of the multiplication model as a function of dose and NF for transforming probability p = 0.3. An appropriate total dose and NF have the possibility of leading to a good treatment outcome. Figure 4 gives the score of the multiplication model as a function of the transforming probability p and NF for a dose of 50 Gy. It can be confirmed that NF and the transforming probability strongly influence the score around TD50.
Figure 5. Difference Model Score for p=0.2 (Dose - NF)
Figure 6. Difference Model Score for p=0.8 (Dose - NF)
Figures 5 and 6 give the score of the difference model as a function of dose and NF for transforming probabilities p = 0.2, 0.8. Here we assume that the lower bound of TCP and the upper bound of NTCP are 0 and 1, respectively; in other words, these are numerical experiment results in the widest solution space. The wave of the score moves toward the low-dose direction as the transforming probability p increases.
5. Conclusion

In this study, we proposed two objective functions to evaluate tumor treatment efficacy by fractionated irradiation with genetic radiotherapy. We formulated the behavior of the radiosensitivity under viral vector injection through the transforming probability, derived a modified TCP, and proposed two treatment-efficacy functions combining it with NTCP. From the results of the numerical experiments, we confirmed that the proposed models yield quite sensible values. However, many problems remain to be solved in this model: for instance, the relation between the tolerance dose and the accumulation rate of dose, the difference among fractionation methods, and the verification of the validity of the objective functions. The control of the time dimension will be one of the most important factors for improving a treatment score. Further, the shape of the tumor is a weighty consideration for recent therapy. We plan to work on these problems in the future.
References
1. M. L. Turner, Jr., Math. Biosciences, 23, 219-235 (1977).
2. W. C. Dewey and J. S. Bedford, W. B. Saunders Company, Philadelphia, 3-25 (1998).
3. A. Waaijer, C. H. J. Terhaard, et al., Radiother. Oncol., 66, 271-276 (2003).
4. W. A. Tome and J. F. Fowler, Int. J. Radiat. Oncol. Biol. Phys., 48, 593-599 (2000).
5. L. A. Dawson, D. Normolle, et al., Int. J. Radiat. Oncol. Biol. Phys., 53, 810-821 (2002).
6. Y. Seppenwoolde, J. V. Lebesque, et al., Int. J. Radiat. Oncol. Biol. Phys., 55, 724-735 (2003).
7. K. Morita, Jpn. J. Radiol. Technol., 56, 693-699 (2000).
8. A. Niemierko, Med. Phys., 24, 103-110 (1997).
9. A. Niemierko, Med. Phys., 26, 1100 (1999).
10. Q. Wu, R. Mohan, et al., Int. J. Radiat. Oncol. Biol. Phys., 52, 224-235 (2002).
11. S. L. S. Kwa, J. C. M. Theuws, et al., Int. J. Radiat. Oncol. Biol. Phys., 48, 61-69 (1998).
12. M. Langer, E. K. Lee, et al., Int. J. Radiat. Oncol. Biol. Phys., 57, 762-768 (2003).
13. T. E. Wheldon, R. J. Mairs, et al., Radiother. Oncol., 48, 5-13 (1998).
14. P. J. Keall, G. Lammering, et al., Int. J. Radiat. Oncol. Biol. Phys., 57, 255-263 (2003).
15. R. C. Urtasun, M. Palmer, et al., Int. J. Radiat. Oncol. Biol. Phys., 40, 337-342 (1998).
16. K. K. Fu, T. F. Pajak, et al., Int. J. Radiat. Oncol. Biol. Phys., 48, 7-16 (2000).
17. C. Philips, M. Guiney, et al., Radiother. Oncol., 68, 23-26 (2003).
18. G. Lammering, P-S. Lin, et al., Int. J. Radiat. Oncol. Biol. Phys., 51, 775-784 (2001).
COMPUTATION TECHNOLOGY FOR SAFETY AND RISK ASSESSMENT OF GAS PIPELINE SYSTEMS

VADIM SELEZNEV
Computation Mechanics Technology Center of SPE VNIIEF-VOLGOGAZ Ltd., Zhelesnodorozhnaya 4/1, Sarov, 607180, Russia

VLADIMIR ALESHIN
Computation Mechanics Technology Center of SPE VNIIEF-VOLGOGAZ Ltd., Zhelesnodorozhnaya 4/1, Sarov, 607180, Russia
A computation technology for investigating failures at gas pipelines is presented. The technology is based on the principle of simultaneously creating and numerically analyzing high-accuracy mathematical models that describe failures at gas pipelines from failure initiation up to the localization of their consequences. High-accuracy modeling is achieved by minimizing the simplifications assumed in the mathematical models and by implementing up-to-date mesh methods for nonlinear numerical analysis and hybrid methods for mathematical optimization. The computation technology makes it possible to execute multi-objective risk assessment for operating pipelines.
1 Introduction
Gas pipelines are high-energy systems. Failures at gas pipelines can entail very serious consequences for the population, attending personnel and the surrounding environment (Fig. 1).
Figure 1. After failure at a main gas pipeline.
The current world-wide methods used for risk assessment of pipeline systems are the methods of probability theory and mathematical statistics [1,2]. Algorithms for the practical use of these methods for risk assessment of failures at industrial objects are
represented as techniques, management directives, and standards approved by supervisory state institutions. The attractiveness of stochastic methods for risk assessment is mostly conditioned by their simple mathematical formalization and the savings in required computational resources. The main drawbacks of these methods are the absence of reliable a priori values for the probabilistic characteristics of failure events and the necessity of using subjective expert estimations. For complex objects and single events, the usage of these methods does not allow accurate estimations to be obtained when forecasting and analyzing failures. To avoid the above-mentioned insufficiency, a computation technology for investigating failures is used. This technology was developed at the Computation Mechanics Technology Center (CMTC) of SPE VNIIEF-VOLGOGAZ Ltd. The technology is based on the approaches of mathematical physics, up-to-date numerical methods of continuum mechanics, and mathematical optimization. It is grounded on the principle of simultaneously creating and numerically analyzing high-accuracy mathematical models that describe failures at gas pipelines from failure initiation up to the localization of their consequences. The mathematical models of pipelines used in this technology are systems of differential equations (partial or ordinary differential equations) with relevant boundary conditions and/or formalized notations of mathematical optimization problems. High-accuracy modeling is achieved by minimizing the simplifications assumed in the mathematical models and implementing up-to-date mesh methods for nonlinear numerical analysis and hybrid methods for mathematical optimization. The computation technology for high-accuracy mathematical modeling of failures at gas pipelines includes the following main stages:
- initial data collection and processing;
- mathematical formalization of the problem of failure investigation;
- computational fluid dynamics (CFD) analysis of pipeline systems during pre-failure and failure;
- numerical nonlinear structural analysis and simulation of pipeline rupture during failures;
- simulation of the harmful impact on the population and surrounding environment caused by failures;
- development of computation scenarios for failures and scientifically validated recommendations to prevent similar failures.
The computation technology can be implemented for actual failure investigations, forecasting of feasible failures, and the development of failure preventive measures.

2 CFD analysis of pipeline systems

Failures at pipelines are often caused by hydraulic mode violations while transporting gas mixtures and multiphase flows through pipeline systems. For high-accuracy simulation of nominal and failure modes of pipeline system operation, one can implement the CorNet
software (a development of CMTC) or the AMADEUS software (a CorNet version specially purposed for gas pipelines, a joint development of CMTC, the SPP-DSTG company (Slovakia), and the Mathematical Institute of the Slovak Academy of Sciences). We would like to illustrate the software capabilities by the example of CorNet. Mathematical models for multicomponent (gas mixture) and multiphase (gas-liquid) fluid transmission through pipeline systems have been realized in CorNet [3,4]. These models describe transient non-isothermal turbulent flow of viscous, chemically inert, compressible, heat-conductive multicomponent and multiphase fluids in pipes. Along with the transient modes, CorNet is capable of simulating steady-state modes of gas and liquid transmission through pipeline systems. CorNet also allows simulating multicomponent gas mixture compression at compressor stations in nominal and failure modes, including surge. All the mathematical models for multiphase and multicomponent fluid transmission through pipelines are created based on a complete system of the integral Navier-Stokes equations, assuming there are no shock waves in the flow. When making the transformation from the general mathematical model to models for specific fluid dynamic processes in the pipelines, one must implement the principle of minimizing supplementary simplifications and assumptions. This makes it possible to retain, as much as possible, the initial experimentally and theoretically validated approximation of the real processes by the basic model. To illustrate the above, we provide the final form of the mathematical model for transient non-isothermal turbulent flow (without shock waves) of a viscous, chemically inert, compressible, multicomponent, heat-conductive gas mixture in a pipe with a circular variable cross-section and rigid rough heat-conductive walls, derived by transformation from the spatial model to the appropriate one-dimensional model [3]:
where p is pressure; ρ is density; Y_m is the local relative mass fraction of the m-component of the gas mixture; D_m is the local diffusion coefficient of the m-component; f is the area of the pipeline cross-section; w is the projection of the mixture velocity vector onto the geometrical axis of symmetry of the pipeline; g is the acceleration of gravity; z_1 is the pipeline altitude above sea level; {S_m} is a set of parameters for the described value; λ is the friction factor; R is the internal radius of the pipeline; ε_m is the specific internal energy of the m-component; T is temperature; t is time (the march variable); and x is the spatial coordinate along the geometrical axis of the pipeline. The function Φ(T, T_S) is determined by the law of heat transfer from the pipe to the surrounding environment, T_S being the temperature of the surrounding environment. The equations (1) use physical values averaged across the pipeline cross-section. The system of equations (1) is completed by boundary and conjugation conditions. As the conjugation conditions, one can preset boundary conditions simulating complete pipe rupture and/or pipe shutdown (operation of a valve). At present, one-dimensional CFD is a well-investigated and well-developed field of numerical methods of continuum mechanics; the development of difference-scheme classes for the initial systems of gas dynamic equations implemented in the CFD simulators is, in this case, routine work performed with popular, scientifically substantiated methods and techniques.
Figure 2. Simulation of the rupture of the third line of a multiline main gas pipeline.
Figure 2 shows a CorNet implementation used in the development of measures for localizing the consequences of a hypothetical failure at a multiline main gas pipeline.

3 Nonlinear structural analysis and simulation of pipeline rupture
Gas pipeline systems are topologically complex spatial structures with many branches, intersections, tees, elbows, etc. They are affected by a variety of loads (internal pressure, nonlinear temperature fields, resistance of the soil, etc.). Rupture of gas pipelines is often
caused by environmental impact, for example, displacement of the pipeline from its design position as a result of soil shearing, or by local thinning of the pipe walls (corrosion, erosion, mechanical damage, etc.). A method for nonlinear structural analysis of gas pipelines at failure has been developed at CMTC [3,5]. The proposed method is based on applying numerical methods to the solution of 3-D nonlinear problems of deformable solid mechanics. One of the currently most popular numerical methods of continuum mechanics is the finite element method (FEM). Among the variety of commercial software realizing FEM as a means to simulate the stress state of pipelines, ANSYS, which holds the ISO 9001 quality certificate, has been chosen. At the first stage, numerical structural analysis of the pipelines is performed in the beam approximation: the entire pipeline structure is simulated by straight and curved beams of annular cross-section. While simulating and analyzing the structure at this stage, all the loads influencing the stress state of the pipeline are taken into account: internal pressure, thermal expansion, initial stresses and strains, nonlinear interaction of the soil and underground pipeline segments, the weight of pipelines and soil, supports' reactions, etc. Some technical diagnostics data are also considered, for example, displacement of the pipeline axis from the design position. The analysis with the beam models aims at determining the stress state of the pipeline structure taken as a whole and at exposing the most loaded segments and the forces at the boundaries of these segments. At the second stage, more detailed analyses of the most loaded pipeline segments are performed with shell and solid FE-models. When assigning boundary conditions, the results of the previous stage of analysis are used. Interpolation of the necessary data on the boundary conditions from the beam models to the shell and solid models is carried out automatically. Interpolation of the boundary conditions from the shell to the solid models is performed using the Submodeling procedure realized in ANSYS, and from the beam to the shell models by using programs developed at CMTC; these programs are subjoined to ANSYS as add-in macros. The second-stage analysis makes it possible to obtain the actual stress state of the pipeline segments considering all the loads and the detailed pipeline geometry (Fig. 3, a).
Figure 3. Results of structural analysis of the gas compressor station piping: a) underground collectors under operating loads; b) fracture of a pipe tee with a defective weld joint.
The analysis of the results obtained within the second stage enables an objective conclusion on the strength of each pipeline segment. The strength estimation is based on normative criteria (admissible loads and ultimate states) as well as on simulation of the pipeline rupture using criteria of fracture mechanics (brittle, elastic-plastic, etc.). Figure 3b presents the elastic-plastic fracture simulation of a pipe tee with a defective weld joint under internal pressure.
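As a simple illustration of a normative admissible-load check of the kind mentioned above, the sketch below compares the hoop (circumferential) stress of a thin-walled pipe segment with an admissible stress derived from the material yield strength and a design factor. The formula σ_hoop = pD/(2t), the yield strength, and the design factor are textbook assumptions for the example, not values or criteria taken from the CMTC methodology.

```python
def hoop_stress_check(p_mpa: float, d_outer_mm: float, wall_mm: float,
                      yield_mpa: float = 450.0, design_factor: float = 0.72) -> bool:
    """Thin-wall hoop stress check: sigma = p*D/(2t) vs. admissible stress.

    Assumed values: X65-like yield strength 450 MPa and a 0.72 design factor.
    Returns True if the segment passes the admissible-load criterion.
    """
    sigma_hoop = p_mpa * d_outer_mm / (2.0 * wall_mm)   # MPa
    sigma_adm = design_factor * yield_mpa               # admissible stress, MPa
    return sigma_hoop <= sigma_adm

# Example: 7.4 MPa line pressure, 1220 mm pipe, wall locally thinned by corrosion.
for wall in (15.2, 9.0):
    ok = hoop_stress_check(7.4, 1220.0, wall)
    print(f"wall {wall} mm -> {'passes' if ok else 'FAILS'} admissible-load check")
```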
4 Simulation of harmful impact caused by failures
Harmful environmental and societal impact is caused by the main hazardous factors of pipeline failures. These factors can be conventionally divided into two groups:
- debris affection (an object is affected by primary or secondary debris);
- gas hazard (toxic affection, caused by natural gas dispersion in the atmosphere; heat affection, in which an object is affected by the combusting methane-air mixture).
Debris affection is typical of failures at high-pressure gas pipelines. At failure, the gas expansion energy is spent on deformation and rupture of the pipes, compression of the surrounding soil and/or atmosphere, acceleration of debris, etc. To determine the fraction of the gas expansion energy transformed into kinetic energy and to estimate the initial velocities of the debris, the technology uses both experimental data and the results of numerical simulation of pipeline rupture. For this purpose, the problem of dynamic analysis of the structures is reduced to the solution of the differential equations of deformable body motion in a 3-D nonlinear statement under prescribed boundary and initial conditions. These equations are solved by FEM, which makes it possible to obtain the detailed time behavior of the stress state of the structure, allowing for all active dynamic loads and the elastic-plastic properties of the material.
Figure 4. Numerical simulation of underground main gas pipeline rupture.
LS-DYNA software is used at CMTC to simulate pipeline rupture. An example of the simulation of a high-pressure gas pipeline rupture and of the calculation of the debris parameters is presented in Fig. 4. The areas that are likely to be affected by the debris are determined from numerical simulation of the debris flight.
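The debris-flight estimate referred to above can be illustrated with a simple point-mass ballistic integration including aerodynamic drag; given an initial velocity from a rupture simulation, it returns the ground range of a fragment. The drag coefficient, fragment mass and area, and launch conditions are illustrative assumptions, not data from the LS-DYNA analyses.

```python
import math

def debris_range(v0: float, angle_deg: float, mass: float = 40.0,
                 area: float = 0.05, cd: float = 1.0, rho_air: float = 1.2,
                 dt: float = 0.001) -> float:
    """Ground range (m) of a fragment launched at speed v0 (m/s) and a given
    elevation angle, integrated with explicit Euler steps including drag."""
    g = 9.81
    vx = v0 * math.cos(math.radians(angle_deg))
    vy = v0 * math.sin(math.radians(angle_deg))
    x = y = 0.0
    while y >= 0.0:
        v = math.hypot(vx, vy)
        k = 0.5 * rho_air * cd * area * v / mass  # drag acceleration per unit speed
        vx -= k * vx * dt
        vy -= (g + k * vy) * dt
        x += vx * dt
        y += vy * dt
    return x

# Sweep launch angles for an assumed 150 m/s initial fragment velocity.
worst = max(debris_range(150.0, a) for a in range(5, 86, 5))
print(f"max estimated debris range ~ {worst:.0f} m")
```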
Upon pipeline failure, compressed natural gas escapes into the surrounding environment and mixes intensively with the ambient air. The natural gas transmitted and stored contains more than 98% methane, so a methane-air mixture is formed. The methane-air mixture is highly flammable and toxic. The applied mathematical methods for gas hazard simulation are based on the numerical analysis of a complete system of Reynolds equations by mesh methods. The methane-air mixture is treated as a homogeneous two-component mixture of viscous, heat-conducting, chemically inert ideal gases. The model of the two-component gas flow is considered in the diffusion approximation. To account for the turbulence of the gas emissions and outflows, the well-known k-ε model is implemented. ANSYS/FLOTRAN and STAR-CD software are used at CMTC for the numerical analysis of the gas hazard. Figure 5 shows numerical simulation results for methane dispersion in the atmosphere upon gas pipeline rupture.
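The diffusion approximation for the methane mass fraction can be illustrated in one dimension by a scalar advection-diffusion equation ∂Y/∂t + u ∂Y/∂x = D ∂²Y/∂x², solved below with an explicit upwind scheme. The wind speed, effective (turbulent) diffusivity, and source profile are assumptions chosen for the example; a real k-ε computation would supply a spatially varying eddy diffusivity.

```python
import numpy as np

# 1-D advection-diffusion of methane mass fraction Y (explicit upwind sketch).
u = 5.0        # wind speed, m/s (assumed)
D = 8.0        # effective turbulent diffusivity, m^2/s (assumed)
L, nx = 2000.0, 400
dx = L / nx
dt = 0.4 * min(dx / u, dx**2 / (2 * D))   # stability-limited step

Y = np.zeros(nx)
Y[10:20] = 1.0                             # assumed release zone near the rupture

for _ in range(int(60.0 / dt)):            # one minute of dispersion
    adv = -u * (Y - np.roll(Y, 1)) / dx    # first-order upwind (u > 0)
    dif = D * (np.roll(Y, -1) - 2 * Y + np.roll(Y, 1)) / dx**2
    Y = Y + dt * (adv + dif)
    Y[0] = Y[-1] = 0.0                     # far-field boundaries

x_lfl = np.nonzero(Y > 0.05)[0]            # ~5% vol: lower flammability level
print(f"flammable cloud extends to ~ {x_lfl.max() * dx:.0f} m downwind"
      if x_lfl.size else "cloud diluted below LFL")
```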
Figure 5. Field of the relative mass fraction of methane upon main gas pipeline rupture: a) 1 s after the failure, wind speed 0 m/s (analysis in STAR-CD); b) 20 s after the failure, wind speed 15 m/s (analysis in ANSYS/FLOTRAN).
This approach allows high-accuracy estimation of the methane mass fraction field, taking into account the terrain and the atmospheric conditions. Risk assessment of toxic affection of the population reduces to an analysis of the probability of lethal consequences, which depends on the location of the population within a methane mass fraction field that varies in space and time. Heat affection intensity is assessed by numerical methods simulating ignition of the methane-air mixture based on the calculated mass fraction fields. The combustion of the methane-air mixture is then simulated as a diffusion plume (Fig. 6a) or a combusting ball. The intensity of heat radiation from the combusting methane-air mixture is analyzed over space and time (Fig. 6b). Risk assessment for objects affected by heat determines the probable ignition and combustion of the material of the object's surface; these are analyzed taking into account the prescribed distance from the fire and the radiation intensity. Heat transfer to the object by conduction and convection is much less intensive during pipeline fires.
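The probability-of-lethality analysis mentioned above is commonly carried out with a probit model relating the thermal radiation dose to lethality; the sketch below uses the widely cited Eisenberg-type probit Pr = a + b·ln(I^(4/3)·t/10⁴). Whether the CMTC technology uses this particular dose-response form is an assumption; the coefficients and exposure values are likewise only examples.

```python
import math
from statistics import NormalDist

def lethality_probability(intensity_w_m2: float, exposure_s: float,
                          a: float = -14.9, b: float = 2.56) -> float:
    """Probit-based probability of lethal injury from thermal radiation.

    Pr = a + b * ln(I**(4/3) * t / 1e4), with I in W/m^2 and t in seconds;
    probability = Phi(Pr - 5). Coefficients are illustrative assumptions.
    """
    dose = intensity_w_m2 ** (4.0 / 3.0) * exposure_s / 1.0e4
    probit = a + b * math.log(dose)
    return NormalDist().cdf(probit - 5.0)

# Example: 20 s exposure at several heat flux levels from a plume.
for flux in (5_000.0, 15_000.0, 35_000.0):
    p = lethality_probability(flux, 20.0)
    print(f"{flux / 1000:.0f} kW/m^2 for 20 s -> lethality probability ~ {p:.3f}")
```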
Figure 6. Analysis of heat affection intensity: a) FE-model of the adjoining terrain and the diffusion plume inclined by wind; b) distribution of the heat flux [W/m²] on the adjoining terrain from the inclined diffusion plume.
5 Conclusion
The computation technology for investigating failures at gas pipeline systems allows gas-industry specialists to accomplish the following tasks: scientifically validated investigation of actual failures; multi-objective risk assessment for operating and designed pipelines; and reliable identification of feasible failure causes together with the development of measures to prevent failures and/or localize their consequences. The integral character of the obtained results should be especially emphasized: such approaches have never been used before in the gas industry.
References
1. E. J. Henley and H. Kumamoto, Reliability Engineering and Risk Assessment, Prentice-Hall, Englewood Cliffs, NJ (1981).
2. W. Marshal, Main Hazards of Chemical Plants, translated from English, Mir, Moscow (1989).
3. V. E. Seleznev, V. V. Aleshin and G. S. Klishin, Methods and Technologies for Numerical Simulation of Gas Pipeline Systems, Editorial URSS, Moscow (2002).
4. Numerical Analysis and Optimization of Dynamic Modes of Natural Gas Transmission, edited by V. E. Seleznev, Editorial URSS, Moscow (2003).
5. Numerical Structural Analysis of Underground Pipelines, edited by V. V. Aleshin and V. E. Seleznev, Editorial URSS, Moscow (2003).
JOINT DETERMINATION OF THE IMPERFECT MAINTENANCE AND IMPERFECT PRODUCTION TO LOT-SIZING PROBLEM
SHEY-HUEI SHEU
Department of Industrial Management, National Taiwan University of Science and Technology, 43, Sec. 4, Keelung Rd., Taipei 106, Taiwan, ROC
JIH-AN CHEN
Department of Business Administration, Kao-Yuan Institute of Technology, 1821, Chung-Shan Rd., Lu-Chu Hsiang, Kaohsiung, Taiwan, ROC
YU-HUNG CHIEN
Department of Industrial Management, Hsiuping Institute of Technology, No. 11, Gungye Rd., Da-Li City, Taichung County, Taiwan, ROC
This paper deals with an integrated model for the joint determination of both the economic production quantity (EPQ) and the level of preventive maintenance (PM) for an imperfect production process. The process has a general deterioration distribution with increasing hazard rate. The effect of PM activities on the deterioration pattern of the process is modeled using the imperfect maintenance concept, in which it is assumed that after performing PM the aging of the system is reduced in proportion to the PM level. After a period of time in production, the process may shift to an out-of-control state of either type I or type II. A minimal repair removes a type I out-of-control state; if a type II out-of-control state occurs, the production process has to stop, followed by restoration work. Examples of Weibull shock models are given to illustrate that performing PM results in a lower cost than taking no PM action.
1 Introduction
The batch mode of production is widely used in most manufacturing industries, and the problem of determining the economic production quantity (EPQ) for manufacturing processes has been well studied in the literature [3]. In practice, a production system must be maintained through adequate maintenance programs. Despite the fact that there is a strong link between maintenance, production and quality, these fundamental aspects of any manufacturing system are traditionally modeled as separate problems, and few attempts have been made to integrate them in a single model that captures their underlying relationships. The classical economic production quantity model assumes that the output of the production system is defect-free. When developing EPQ models, the control of product quality has generally not been taken into account; in addition, the effect of the EPQ on the economic design of control charts has not been well studied either. Rosenblatt and Lee [5] found that, when the production process is subject to random process deterioration that shifts the system from an in-control state to an out-of-control state in which non-conforming items are produced, the resulting optimal EPQ is, as expected, smaller than that of the classical model. Porteus [4] has observed
similar results. For a further review of EPQ topics with imperfect production processes, the reader may refer to Ben-Daya [12, 14]. The basic purpose of preventive maintenance (PM) activities is to improve both the reliability of the production system and the conforming rate of items. A more realistic approach models the failure rate of the system after maintenance as lying somewhere between 'as good as new' and 'as bad as old'. This concept is called imperfect maintenance; the reader may refer to Nakagawa [2, 7]. It is noteworthy that, in realistic situations, the production cycle is rarely interrupted even when the system is in an out-of-control state. In this article, we consider an out-of-control state of two possible types. A minimal repair can remove a type I out-of-control state, and the production system is not interrupted by it, whereas a type II out-of-control state forces the production system to stop, after which restoration work is carried out. The reader may refer to Sheu [10, 13].
2 System operation, notation and assumptions
The production system is considered to produce a single item and begins in an in-control state; that is, the system produces items of acceptable quality. However, after a period of time in production, the process may shift to an out-of-control state. The process is inspected at times t_1, t_2, ..., t_m to assess whether the production system remains in the in-control state. At the same time, PM activities are carried out, unless the production system has to stop. The production cycle ceases either when the system shifts to a type II out-of-control state, or after the mth inspection, whichever occurs first. The process is then restored to the in-control, as-good-as-new condition by a complete repair, or by replacement if necessary. Next, we introduce the notation used to develop the integrated model:
D : demand rate in units per unit time
P : production rate in units per unit time
E(T) : expected actual production time for each cycle (production run)
K : setup cost
C_h : holding cost per unit time
s : cost incurred by producing a nonconforming item
v : inspection cost
θ : probability of a type II out-of-control state when the system is out of control
η : imperfectness factor
γ_k : imperfectness coefficient at the kth PM
α_I : nonconforming rate in the type I out-of-control state
α_II : nonconforming rate in the type II out-of-control state
C_pm : cost of preventive maintenance
C*_pm : maximum cost of preventive maintenance
C_mr : cost of a minimal repair of a type I out-of-control state
R(·) : restoration cost
m : number of inspections undertaken during each production run
h_j : length of the jth inspection interval
t_j : time of the jth inspection
y_j : actual age of the system right before the jth PM
w_j : actual age of the system right after the jth PM
Finally, the assumptions of the classical EPQ model apply to this integrated model, except for the following additional assumptions:
1. The time that elapses until the production process shifts to the out-of-control state is a random variable following a general distribution with increasing hazard rate.
2. The process is inspected at times t_1 = h_1, t_2 = h_1 + h_2, ... to assess its state. If the system is still in the in-control state, PM activities are carried out. The time needed for PM and inspection is assumed negligible.
3. If an inspection shows that the process is out of control, the out-of-control state may be of two types. A type I out-of-control state occurs with probability 1 − θ and can be removed by a minimal repair with cost C_mr, whereas a type II out-of-control state occurs with probability θ, in which case production has to stop and restoration work is carried out. Once a shift to the out-of-control state has occurred, the production process stays in that state until the next inspection, regardless of the type of out-of-control state.
4. The hazard rate of the production process is undisturbed by minimal repair.
5. While out of control, the process produces nonconforming items at rates α_I and α_II in the type I and type II out-of-control states, respectively.
6. Inspection intervals are determined such that the integrated hazard rate over each interval is constant.
7. Inspections are error-free, and shortages are not allowed.
8. The aging reduction of the production system depends only on the level of PM activities. The PM cost is a decision variable and is kept the same throughout the time horizon under consideration.
9. The process is restored to the as-good-as-new state whenever a type II out-of-control state occurs or after the mth inspection is performed, whichever comes first. Thus, a renewal process occurs at the end of each cycle.
3 Model development
The total expected cost per production cycle consists of the setup cost, the inventory holding cost, the PM cost, the inspection cost and the quality-related costs (i.e., the cost of nonconforming items and the restoration cost).
1.1. The setup cost and inventory holding cost
Before deriving the costs, let us determine the expected production cycle length. The expected inventory cycle length is given by

E(CT) = (P/D) E(T),   (1)

where E(T) is the expected production run length. Let p_j be the conditional probability that the process shifts to the out-of-control state during the time interval (t_{j−1}, t_j), given that the process was in the in-control state at time t_{j−1}. Then,

p_j = [F(y_j) − F(w_{j−1})] / [1 − F(w_{j−1})],   (2)

where F denotes the distribution of the time to shift and y_j, w_{j−1} are the actual ages defined above. Let E(T_j) be the expected residual time in the production cycle beyond t_j, given that the process was in the in-control state at time t_j, with E(T_0) = E(T). Consequently,

E(T_0) = h_1 + (1 − p_1)E(T_1) + p_1(1 − θ)E(T_1) = h_1 + (1 − θ p_1)E(T_1).

Similarly, for j = 1, 2, ..., (m−2), we have E(T_j) = h_{j+1} + (1 − θ p_{j+1})E(T_{j+1}). Note that E(T_{m−1}) = h_m. Therefore, the expected production cycle length is

E(T) = ∑_{j=1}^{m} h_j ∏_{i=1}^{j−1} (1 − θ p_i).

The various costs per production cycle are easily derived as follows.

Setup cost: K.

Inventory holding cost: E(HC) = [C_h (P − D) P / (2D)] (E(T))².   (3)
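A direct numerical reading of the expected-cycle-length formula above is given in the sketch below, which evaluates E(T) = Σ_j h_j Π_{i<j}(1 − θ p_i) for given inspection intervals and shift probabilities; the sample values of h_j, p_j and θ are arbitrary illustrative inputs.

```python
from math import prod

def expected_run_length(h: list[float], p: list[float], theta: float) -> float:
    """E(T) = sum_j h_j * prod_{i<j} (1 - theta * p_i) for a production cycle."""
    return sum(
        h[j] * prod(1.0 - theta * p[i] for i in range(j))
        for j in range(len(h))
    )

# Illustrative inputs: 4 inspection intervals, rising shift probabilities.
h = [0.8, 0.7, 0.6, 0.5]
p = [0.05, 0.10, 0.18, 0.30]
for theta in (0.1, 0.5, 1.0):
    print(f"theta={theta:.1f}: E(T) = {expected_run_length(h, p, theta):.3f}")
```

As expected, a larger θ (more production-stopping shifts) shortens the expected run length.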
1.2. The preventive maintenance cost and inspection cost
As mentioned earlier, we use the concept of imperfect maintenance: after performing PM, the aging of the system lies somewhere between as good as new and as bad as old, depending on the level of PM activities. The reduction in the aging of the production system is a function of the cost of preventive maintenance. Let

γ_k = η^k,   (4)

where 0 < η ≤ 1. The parameter η is an imperfectness factor, which implies that there is degradation in the effect of PM on the aging of the system, and γ_k represents the imperfectness coefficient at the kth PM. A full PM brings the production system closest to the as-good-as-new state when the cost of PM reaches C*_pm, the maximum cost of preventive maintenance. Let y_k (w_k) denote the aging of the production system right before (after) the kth PM. Linear and nonlinear relationships between the aging reduction and the PM cost may be considered [14]. Here we assume that this relationship is linear and is given by

w_k = (1 − (C_pm / C*_pm) γ_k) y_k.   (5)

Note that the aging of the production system at time t_j is given by

y_1 = h_1,   y_j = w_{j−1} + h_j,   j = 2, 3, ..., m.   (6)
Preventive maintenance cost: Since inspection is error-free and PM activities are carried out after each inspection unless the production system has to stop, the expected PM cost per production cycle is

E(PM) = C_pm [ ∑_{j=1}^{m−1} (1 − p_j) ∏_{i=1}^{j−1} (1 − θ p_i) + (1 − θ) ∑_{j=1}^{m−1} p_j ∏_{i=1}^{j−1} (1 − θ p_i) ] = C_pm ∑_{j=1}^{m−1} ∏_{i=1}^{j} (1 − θ p_i).   (7)

Remark 1. n_pm = ∑_{j=1}^{m−1} ∏_{i=1}^{j} (1 − θ p_i) can be interpreted as the expected number of PMs per production cycle.

Remark 2. n_mr = (1 − θ) ∑_{j=1}^{m−1} p_j ∏_{i=1}^{j−1} (1 − θ p_i) can be interpreted as the expected number of minimal repairs per production cycle.

Inspection cost: The expected number of inspections equals the number of PMs plus one inspection at the end of the production cycle. Hence E(IC) = (n_pm + 1) v.

1.3. The quality related costs
Cost of producing nonconforming items: Let E(N_j^{(I)}) and E(N_j^{(II)}) be the expected numbers of nonconforming items produced during the jth interval due to a type I and a type II out-of-control state, respectively. The total expected number of nonconforming items per production run is

E(N) = ∑_{j=1}^{m} [(1 − θ) E(N_j^{(I)}) + θ E(N_j^{(II)})] p_j ∏_{i=1}^{j−1} (1 − θ p_i).   (8)

The total cost of producing nonconforming items is given by

E(DC) = s E(N) = s ∑_{j=1}^{m} [(1 − θ) E(N_j^{(I)}) + θ E(N_j^{(II)})] p_j ∏_{i=1}^{j−1} (1 − θ p_i).   (9)

Restoration cost: The expected restoration cost E(RC) is obtained by assuming that the restoration cost increases linearly with the detection delay; that is, R(y_j − t) = r_0 + r_1 (y_j − t), where r_0 and r_1 are constants.   (10)

Remark 3. The quality-related cost is E(QC) = E(DC) + E(RC).

Remark 4. The integrated model contains some special cases. For example, when m = θ = 1, it reduces to the classical economic production quantity model. Another case, with θ = 1, was considered by Ben-Daya [15].

4 Solution procedure
The expected total cost is composed of the setup cost, inspection cost, inventory holding cost, quality-related costs and PM cost. For a renewal reward process [1], the expected total cost per expected cycle length is

ETC = [K + E(IC) + E(HC) + E(QC) + E(PM)] / E(CT),   (11)

where K, E(IC), E(HC), E(QC), E(PM) and E(CT) are the setup cost, inspection cost, inventory holding cost, quality-related costs, PM cost, and expected inventory cycle length, respectively. The problem now is to determine simultaneously the optimal lengths of the inspection intervals, namely h_1, h_2, ..., h_m, the optimal cost of PM, and the number of inspections. For a Markovian shock model, a uniform inspection scheme provides a constant integrated hazard rate over each interval. Rahim [8] extended this idea to non-Markovian shock models by choosing the inspection intervals so that the integrated hazard rate is the same over all intervals; that is,

∫_{t_{j−1}}^{t_j} r(t) dt = ∫_0^{t_1} r(t) dt,   for j = 2, 3, ..., m.   (12)
If the time during which the process stays in the in-control state follows a Weibull distribution, that is, its probability density function is given by f(t) = λβ t^{β−1} e^{−λt^β}, t > 0, β ≥ 1, λ > 0, then the lengths of the inspection intervals h_j, j = 2, 3, ..., m, can be determined recursively as follows [15]:

h_j = (w_{j−1}^β + h_1^β)^{1/β} − w_{j−1},   j = 2, 3, ..., m.   (13)
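The recursion (13), combined with an age-reduction rule after each PM, can be evaluated numerically; the sketch below computes an inspection schedule h_1, ..., h_m for a Weibull shock model, applying an assumed constant age-reduction fraction after each PM in place of the cost-dependent rule (5).

```python
def inspection_intervals(h1: float, beta: float, m: int,
                         age_reduction: float = 0.5) -> list[float]:
    """Intervals h_j with equal integrated Weibull hazard per interval.

    h_j = (w_{j-1}**beta + h1**beta)**(1/beta) - w_{j-1}, where w_j is the
    age right after the jth PM. `age_reduction` is an assumed stand-in for
    the PM-cost-dependent factor (C_pm/C*_pm)*gamma_k of Eq. (5).
    """
    hs, w = [h1], (1.0 - age_reduction) * h1   # age after the first PM
    for _ in range(2, m + 1):
        h = (w**beta + h1**beta) ** (1.0 / beta) - w
        hs.append(h)
        w = (1.0 - age_reduction) * (w + h)    # y_j = w_{j-1} + h_j, then PM
    return hs

print([round(h, 3) for h in inspection_intervals(h1=0.8, beta=2.5, m=6)])
```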
The solution procedure thus reduces to determining the values of the decision variables m, h_1 and C_pm. A stepwise partial enumeration procedure can be employed to minimize the cost function. However, owing to the characteristics of the cost function, some modifications of the standard method have to be made to account for the inherent integrality constraint on the number of inspections. The optimal value of m ≥ 2 can be determined from the two inequalities ETC(m − 1) ≥ ETC(m) and ETC(m + 1) ≥ ETC(m). Finally, we present numerical examples to illustrate important aspects of the developed integrated model. In the following numerical examples, we assume that the time during which the process remains in the in-control state follows a Weibull distribution with scale and shape parameters λ = 5 and β = 2.5, respectively. We consider three values of the probability θ of a type II out-of-control state (θ = 0.1, θ = 0.5, θ = 1.0) and two values of the setup cost K (K = 150, K = 300) in our computations. The following data are used for the other parameters:
α_I = 0.2, α_II = 0.4, D = 500, P = 1000, K = 150, C*_pm = 20, C_mr = 10, s = 20, v = 10, C_h = 0.5, r_0 = 50, r_1 = 0.5, η = 0.99.
Table 1. The effectiveness of the PM level on the expected total cost under different parameters (θ and K).
In Table 1, the relationships among the different values of θ (θ = 0.1, θ = 0.5, θ = 1.0) and K (K = 150, K = 300) are used to investigate the effectiveness of the PM level on the expected total cost. With no PM and setup cost K = 150, the expected total costs amount to 300.90, 297.09 and 318.86 for θ = 0.1, θ = 0.5 and θ = 1.0, respectively. With the optimal PM cost for each θ, the expected total costs become 262.63, 257.50 and 265.09 for θ = 0.1, θ = 0.5 and θ = 1.0, respectively. These results clearly illustrate the effectiveness of the PM level. The percentage cost saving is 12.72% when the model with full PM (PM level = 1.0) is used instead of the model with no PM (PM level = 0.0) for θ = 0.1 and K = 150. The effect of the PM level on the expected total cost shows the importance of PM action when the production system operates with a larger setup cost: if the setup cost is doubled (K = 300), the effectiveness of the PM level (the percentage cost saving) is 20.21%. Similar results hold for θ = 0.5 and θ = 1.0. The PM action also affects the economic production quantity: the production lot size rises from 472.42 to 768.03 when the model with PM level = 1.0 is used instead of the model with PM level = 0.0 for θ = 0.1 and K = 150. In fact, the higher the PM level, the longer the production run, because performing PM makes the production system younger, so that a longer production run remains feasible.
5 Conclusion
In this paper, we constructed an extended production-maintenance model for the joint determination of the EPQ and the PM cost for an imperfect process having a general deterioration distribution with increasing hazard rate. This model improves the practicality of the assumptions about the production system. We also formulated various scenarios for theoretical analysis and analyzed several experiments to illustrate the effectiveness of the PM level. As seen in the simulation experiments, the effectiveness of the PM level is well demonstrated when the production system is associated with a larger setup cost. It is also found that performing PM yields reductions in the expected total cost.
References
1. S. M. Ross, Applied Probability Models with Optimization Applications (1970).
2. T. Nakagawa, Operations Research, 14(3), 249-255 (1980).
3. E. A. Silver and R. Peterson, Decision Systems for Inventory Management and Production Planning (1985).
4. E. L. Porteus, Operations Research, 34, 137-144 (1986).
5. M. J. Rosenblatt and H. L. Lee, IIE Transactions, 18, 48-55 (1986).
6. H. L. Lee and M. J. Rosenblatt, Management Science, 33, 1125-1136 (1987).
7. T. Nakagawa, IEEE Transactions on Reliability, R-37, 295-298 (1988).
8. M. A. Rahim, IIE Transactions, 26(6), 2-11 (1994).
9. H. Pham and H. Wang, European Journal of Operational Research, 94, 425-438 (1996).
10. S. H. Sheu and W. S. Griffith, Naval Research Logistics, 43, 319-333 (1996).
11. M. A. Rahim and M. Ben-Daya, International Journal of Production Research, 36, 277-289 (1998).
12. M. Ben-Daya and M. A. Makhdoum, Journal of the Operational Research Society, 49, 840-853 (1998).
13. S. H. Sheu, European Journal of Operational Research, 108, 345-362 (1998).
14. M. Ben-Daya, IIE Transactions, 31, 491-501 (1999).
15. M. Ben-Daya, International Journal of Production Economics, 76, 257-264 (2002).
OPTIMUM POLICIES WITH IMPERFECT MAINTENANCE
SHEY-HUEI SHEU, YUH-BIN LIN AND GWO-LIANG LIAO
Department of Industrial Management, National Taiwan University of Science and Technology, 43 Keelung Road, Section 4, Taipei, Taiwan, ROC
This study considers a repairable system that undergoes periodic preventive maintenance. Mathematical formulas for the expected cost per unit time are obtained for three models, corresponding to repair, minimal repair, and leaving failures unrepaired. For each model, the existence of a unique and finite optimal T* for preventive maintenance under reasonable conditions is demonstrated. The probability that preventive maintenance is perfect depends on the number of times imperfect maintenance has been performed since the last renewal cycle, and the probability that preventive maintenance remains imperfect does not increase. Finally, various special cases are considered.
1 Introduction
A manufacturing company requires a cost-effective system for maintaining the peak operation of its production machinery and thus remaining competitive in the global marketplace. Preventive maintenance (PM) is critical for complex systems because it reduces operating costs and the risk of catastrophic breakdown. Maintenance policies have been examined extensively [2]. Certain aspects of the PM model minimize the mean cost rate [5]. Many investigations of PM analysis have attempted to apply the PM model to various real-world situations, one of which is the sequential PM process: when a system is maintained at unequal intervals, the PM policy is called sequential PM [3]. Another popular policy is the periodic PM policy, under which the system is maintained at fixed intervals, at times T, 2T, 3T, ..., and also at failure [4]. Liao and Chen [10] studied the single-machine scheduling problem with periodic maintenance and developed an efficient heuristic providing near-optimal solutions for large-sized problems. Most preventive maintenance models assume that the unit is "as good as new" following PM. However, this assumption is not always true; more realistically, the system does not always return to an "as good as new" state following PM. Such PM is known as imperfect PM [6]. Pham and Wang [8] summarized and discussed various methods and optimal policies for imperfect maintenance. Nakagawa's models [7] assume that PM achieves either imperfect maintenance with a probability p, or perfect "as good as new" maintenance with probability p̄ = 1 − p. However, this assumption is frequently false. After PM, maintenance workers sometimes perform periodic tests to check for abnormalities; these tests teach maintenance workers that they cannot be nonchalant and allow problems to recur. To reflect this testing experience, the probability that PM is perfect should depend on the number of times imperfect maintenance has been performed since the last renewal cycle, and the probability that PM remains imperfect should not increase. This study presents three models for the response to failure, similar to those proposed by Nakagawa [7]. Model 1 involves fully repairing the failure, such that the system is as
good as new following the repair. Model 2 involves minimal repair only. Finally, Model 3 involves allowing the failure to persist until the next perfect PM. The s-expected cost rate of the model with imperfect PM is derived, and the optimum PM policy is thus obtained.
2 General Model
This study considers a generalized PM model with the following scheme. A PM has two possible outcomes: the type I outcome is called imperfect PM, while the type II outcome is called perfect PM. The model permits the probability that a type II outcome occurs to be a function of the number of type I outcomes obtained since the last renewal cycle. Let M be the number of PMs until the first type II PM occurs, and let P̄_j = P(M > j); that is, P̄_j represents the probability that the first j outcomes are all of type I. Throughout, the domain of j is {0, 1, 2, ...} and 1 = P̄_0 ≥ P̄_1 ≥ P̄_2 ≥ ... [1]; the probability P̄_j does not increase with the number of PMs j. We use the abbreviation {P̄_j} to represent a sequence of such probabilities. Let p_j = P(M = j) = P̄_{j−1} − P̄_j = P̄_{j−1}(1 − P̄_j/P̄_{j−1}). Hence, when the jth PM occurs, it is either of type I with probability q̄_j = P̄_j/P̄_{j−1}, or of type II with probability θ_j = 1 − q̄_j. Let f(t), F(t), F̄(t) and r(t) denote the pdf, cdf, survival function and hazard rate of the time to failure of a unit. Additionally, define H(T) = [∑_j (P̄_{j−1} − P̄_j) f(jT)] / [∑_j (P̄_{j−1} − P̄_j) F̄(jT)], a ratio of a weighted sum of the pdf values f(jT) to a weighted sum of the survival values F̄(jT).
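The weighted ratio H(T) defined above is easy to evaluate numerically; the sketch below computes it for a Weibull failure time and a geometric sequence P̄_j = p^j (the special case treated later in Section 3.2). The Weibull parameters and the truncation of the infinite sums are assumptions made for the illustration.

```python
import math

def H(T: float, p: float, shape: float = 2.0, scale: float = 1.0,
      n_terms: int = 200) -> float:
    """H(T) = sum_j (Pbar_{j-1}-Pbar_j) f(jT) / sum_j (Pbar_{j-1}-Pbar_j) Fbar(jT)
    for Weibull failures and geometric Pbar_j = p**j (truncated sums)."""
    def f(t):   # Weibull pdf
        return (shape / scale) * (t / scale) ** (shape - 1) * math.exp(-(t / scale) ** shape)
    def sf(t):  # Weibull survival function
        return math.exp(-(t / scale) ** shape)
    num = den = 0.0
    for j in range(1, n_terms + 1):
        w = p ** (j - 1) - p ** j          # Pbar_{j-1} - Pbar_j
        num += w * f(j * T)
        den += w * sf(j * T)
    return num / den

print(f"H(0.5) with p=0.3: {H(0.5, 0.3):.4f}")
```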
2.1. Model 1
Consider the case in which the following assumptions are made:
1. PM occurs at jT, j = 1, 2, ..., at cost R_1.
2. If a failure occurs in ((j−1)T, jT), j = 1, 2, ..., the operating unit is repaired; otherwise it is maintained preventively at time jT, at cost R_2. Following the repair, the unit is as good as new.
3. All failures are instantly detected, and the repair and PM times are negligible.
Let Y_1, Y_2, ... be independent copies of Y. For the present policy, the s-expected cost rate is

J_1(T; {P̄_j}) = ∑_j E[R'_j] / ∑_j E[U_j],

where U_j represents the operating time during the renewal interval between the (j−1)th PM and the jth PM, and R'_j is the total cost incurred over that renewal interval.
Differentiating J_1(T; {P̄_j}) with respect to T yields the necessary condition for T to be optimal: T* must be a solution of Eqs. (7) and (8).

Theorem 1. If H(∞) > L, R_2 − R_1/ν(∞) > 0 and (R_2 − R_1/ν(T))H(T) is monotonically increasing in T, then a finite and unique optimal solution T* > 0 exists that minimizes the total expected cost per unit time J_1(T; {P̄_j}).

Proof. If H(∞) > L, R_2 − R_1/ν(∞) > 0 and (R_2 − R_1/ν(T))H(T) is strictly increasing in T, then the left-hand sides of Eqs. (7) and (8) are continuous and strictly increasing in T; they change sign exactly once from negative to positive as T increases from 0 to ∞, as does dJ_1(T; {P̄_j})/dT. The result follows. □

From Eq. (8), the s-expected cost rate at the optimum is J_1(T*; {P̄_j}) = (R_2 − R_1/ν(T*))H(T*).   (9)
2.2. Model 2
Let J_2(T; {P̄_j}) represent the total cost per unit time between two PMs when the planned PM is performed at time jT. The following assumptions are made:
1. As in Model 1.
2. If a failure occurs in ((j−1)T, jT), j = 1, 2, ..., a minimal repair is performed at cost R_3. A minimal repair merely restores the system to a functioning state after it has failed; that is, minimal repairs do not change the hazard rate r(t).
3. As in Model 1.
4. ∑_j (P̄_{j−1} − 2P̄_j + P̄_{j+1}) > 0.
The s-expected cost rate is J_2(T; {P̄_j}); the optimal time T* is found by differentiating J_2(T; {P̄_j}) with respect to T and minimizing.
Theorem 2. If r(t) is continuous and strictly increasing in t and ∫_0^∞ t dr(t) > R_1 / [R_3 ∑_j j(P̄_{j−1} − 2P̄_j + P̄_{j+1})], then a finite and unique optimal solution T* > 0 exists that minimizes the total expected cost per unit time J_2(T; {P̄_j}).

Proof. If ∫_0^∞ t dr(t) > R_1 / [R_3 ∑_j j(P̄_{j−1} − 2P̄_j + P̄_{j+1})] and r(t) is strictly increasing, then the left-hand sides of Eqs. (12) and (13) are continuous and strictly increasing in T; they change sign exactly once from negative to positive as T increases from 0 to ∞, as does dJ_2(T; {P̄_j})/dT. The result follows. □

From Eq. (12), J_2(T*; {P̄_j}) = R_3 ∑_j j(P̄_{j−1} − 2P̄_j + P̄_{j+1}) r(jT*).   (14)
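For a concrete reading of Model 2, the sketch below minimizes a periodic-PM-with-minimal-repair cost rate of the form J_2(T) = [R_1 + R_3 Σ_j c_j ∫_0^{jT} r(t)dt]/T with c_j = P̄_{j−1} − 2P̄_j + P̄_{j+1}, over a grid of T. This explicit form of J_2 is a reconstruction (it is consistent with the special case (22) below and with the optimality condition (14)); the Weibull hazard and all parameter values are assumptions for illustration.

```python
import numpy as np

def j2_cost_rate(T: float, R1: float, R3: float, pbar, shape=2.5, scale=1.0,
                 n_terms: int = 100) -> float:
    """J_2(T) = [R1 + R3 * sum_j c_j * H(jT)] / T, with
    c_j = Pbar_{j-1} - 2*Pbar_j + Pbar_{j+1} and cumulative Weibull hazard
    H(t) = (t/scale)**shape (so r = dH/dt)."""
    total = 0.0
    for j in range(1, n_terms + 1):
        c = pbar(j - 1) - 2.0 * pbar(j) + pbar(j + 1)
        total += c * (j * T / scale) ** shape
    return (R1 + R3 * total) / T

# Geometric imperfectness Pbar_j = p**j (Nakagawa-type special case, assumed).
p = 0.3
pbar = lambda j: p ** j
grid = np.linspace(0.05, 3.0, 600)
costs = [j2_cost_rate(T, R1=5.0, R3=1.0, pbar=pbar) for T in grid]
i = int(np.argmin(costs))
print(f"T* ~ {grid[i]:.3f}, J2(T*) ~ {costs[i]:.3f}")
```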
2.3. Model 3
Let J_3(T; {P̄_j}) represent the total cost per unit time between two PMs when the planned PM is performed at time T. The following assumptions are made:
1. As in Models 1 and 2.
2. If a failure occurs in ((j−1)T, jT), j = 1, 2, ..., the failure remains until the next perfect PM. Let R_4 be the cost rate of the time lapse between a failure and its detection.
3. As in Models 1 and 2.
4. As in Model 2.
The s-expected cost rate is

J_3(T; {P̄_j}) = [R_1 + R_4 ∑_j P̄_{j−1}(1 − P̄_j/P̄_{j−1}) (∫_{(j−1)T}^{jT} (jT − t) f(t) dt + T F((j−1)T))] / T
             = [R_1 + R_4 ∑_j (P̄_{j−1} − 2P̄_j + P̄_{j+1}) ∫_0^{jT} F(t) dt] / T.   (15)

To find the optimum time T* that minimizes J_3(T; {P̄_j}), differentiate J_3(T; {P̄_j}) with respect to T:

dJ_3(T; {P̄_j})/dT = [R_4 ∑_j j(P̄_{j−1} − 2P̄_j + P̄_{j+1}) F(jT)] / T − [R_1 + R_4 ∑_j (P̄_{j−1} − 2P̄_j + P̄_{j+1}) ∫_0^{jT} F(t) dt] / T².   (16)

Then set dJ_3(T; {P̄_j})/dT = 0, implying that T* satisfies Eqs. (17) and (18).

Theorem 3. If μ > R_1 / [R_4 ∑_j j(P̄_{j−1} − 2P̄_j + P̄_{j+1})], where μ denotes the mean time to failure, then a finite and unique optimal solution T* > 0 exists that minimizes the total expected cost per unit time J_3(T; {P̄_j}).
Proof. If μ > R_1 / [R_4 ∑_j j(P̄_{j−1} − 2P̄_j + P̄_{j+1})], then the left-hand sides of Eqs. (17) and (18) are continuous and strictly increasing in T; they change sign exactly once from negative to positive as T increases from 0 to ∞, as does dJ_3(T; {P̄_j})/dT. The result follows. □

From Eq. (17), J_3(T*; {P̄_j}) = R_4 ∑_j j(P̄_{j−1} − 2P̄_j + P̄_{j+1}) F(jT*).   (19)

3 Special Case
The learning curve concerns a repetitive job or task and represents the relationship between experience and productivity. This investigation applies the learning curve model to the PM model. Following the discussion of cases 3.2 and 3.3, a learning curve is developed, and the following assumptions are made:
1. P̄_{j−1} > P̄_j, j = 1, 2, .... Let r be the learning rate.
2. If each doubling of the number of PMs reduces the probability of imperfect PM by (1 − r), then P̄_j = P̄_1 j^b, j = 1, 2, ..., where b = log(r)/log(2).
3. If each PM can reduce the probability of imperfect PM by (1 − r), then P̄_j = P̄_1 r^{j−1}, j = 1, 2, ....

3.1. P̄_0 = 1; P̄_j = 0, j = 1, 2, ...
This case involves an operating system that must be as good as new following PM at jT (j = 1, 2, ...).
J_1(T; {1,0,0,...}) = (R_1 F̄(T) + R_2 F(T)) / ∫_0^T F̄(t) dt   (20)

and

J_1(T*; {1,0,0,...}) = (R_2 − R_1/ν(T*)) H(T*) = (R_2 − R_1) r(T*),   (21)

the case in Barlow and Proschan (1965) [9, p. 87].

J_2(T; {1,0,0,...}) = (R_1 + R_3 ∫_0^T r(t) dt) / T   (22)

and

J_2(T*; {1,0,0,...}) = R_3 r(T*),   (23)

the case in Barlow and Proschan (1965) [9, p. 97].

J_3(T; {1,0,0,...}) = (R_1 + R_4 ∫_0^T F(t) dt) / T   (24)

and

J_3(T*; {1,0,0,...}) = R_4 F(T*).   (25)

3.2. P̄_0 = 1; P̄_j = r^j, j = 1, 2, ..., 0 < r < 1, r̄ = 1 − r
Here the random number M of PMs until a type II PM is performed has a geometric distribution. This case is considered by Nakagawa (1979) [7].

J_1(T; {r^j}) = [R_1 ∑_j r^{j−1} F̄(jT) + R_2 ∑_j r^{j−1} F(jT)] / [∑_j r^{j−1} ∫_{(j−1)T}^{jT} F̄(t) dt].   (26)
In this case, learning curves provide their greatest advantage in performing the early PMs in response to new causes of failure. As the number of times PM has been performed becomes large, the learning effect is less noticeable. The following figure provides an example for P̄_1 = 0.9 and an 80% learning rate. Figure 1 reveals that the probability P̄_j decreases rapidly during the early PMs.
Fig. 1. Learning curve.
3.3. P̄_0 = 1; P̄_j = P̄_1 j^b, j = 1, 2, ...
The cost rate J_1(T; {P̄_1 j^b}) for this learning-curve case is obtained by substituting P̄_j = P̄_1 j^b into the general expressions of Section 2.

3.4. P̄_0 = 1; P̄_j = q^{j^a}, j = 1, 2, ..., 0 ≤ q < 1, a > 0
Here the random number M of PMs until a type II PM is performed has a discrete Weibull distribution, which is IFR for a ≥ 1 and DFR for 0 < a ≤ 1,
and J_1(T*; {q^{j^a}}) = (R_2 − R_1/ν(T*)) H(T*).   (39)

4 Concluding Remarks
This study presented a general PM model that incorporates two types of outcomes following PM. Three models for obtaining the optimum times T* have been investigated. The nature of a PM process and policy leads to the hypothesis that the probability that PM is perfect depends on the number of times imperfect maintenance has been performed since the previous renewal cycle. The investigation of the conditions for the optimal policy shows that such a policy is more general and more flexible than policies already reported in the literature. Special cases were examined in detail. At a given learning rate, an analyst can use the learning curves to project PM costs; this information can be used to estimate training requirements and to develop PM plans.
References
1. S.-H. Sheu, "Extended optimal replacement model for deteriorating systems," European Journal of Operational Research, vol. 112, pp. 503-516, 1999.
2. J.-H. Lim and D. H. Park, "Evaluation of Average Maintenance Cost for Imperfect-Repair Model," IEEE Trans. Reliability, vol. 48, no. 2, pp. 199-204, 1999.
3. T. Nakagawa, "Sequential Imperfect Preventive Maintenance Policies," IEEE Trans. Reliability, vol. 37, no. 3, pp. 295-298, Aug. 1988.
4. T. Nakagawa and K. Yasui, "Periodic-Replacement Models with Threshold Levels," IEEE Trans. Reliability, vol. 40, no. 3, pp. 395-397, Aug. 1991.
5. J.-H. Lim and D. H. Park, "Evaluation of Average Maintenance Cost for Imperfect-Repair Model," IEEE Trans. Reliability, vol. 48, no. 2, pp. 199-204, 1999.
6. T. Nakagawa and K. Yasui, "Optimal Policies for a System with Imperfect Maintenance," IEEE Trans. Reliability, vol. R-36, no. 5, pp. 631-633, Dec. 1987.
7. T. Nakagawa, "Optimum Policies when Preventive Maintenance is Imperfect," IEEE Trans. Reliability, vol. R-28, no. 4, pp. 331-332, Oct. 1979.
8. H. Pham and H. Wang, "Imperfect maintenance," European Journal of Operational Research, vol. 94, no. 3, pp. 425-438, Nov. 1996.
9. R. E. Barlow and F. Proschan, Mathematical Theory of Reliability. New York: Wiley, 1965.
10. C. J. Liao and W. J. Chen, "Single-machine scheduling with periodic maintenance and nonresumable jobs," Computers & Operations Research, 30, pp. 1335-1347, 2003.
OPTIMAL SCHEDULE FOR PERIODIC IMPERFECT PREVENTIVE MAINTENANCE
SANG-WOOK SHIN
Department of Statistics, Hallym University, Chunchon, 200-702, Korea
E-mail: [email protected]
DAE-KYUNG KIM
Department of Statistics, Chonbuk National University, Chonju, 561-756, Korea
E-mail: [email protected]
JAE-HAK LIM
Division of Business Administration, Hanbat National University, Taejon, 305-719, Korea
E-mail: [email protected]
In this paper, we consider a periodic imperfect preventive maintenance (PM) policy in which the system after each PM remains unchanged (i.e., has the same failure rate as just prior to the PM) with probability p, and is restored to an as-good-as-new state with probability p̄ = 1 − p. The system undergoes only minimal repairs at failures between PMs. The expected cost rate per unit time is obtained. The optimal number N of PMs and the optimal period T, which minimize the expected cost rate per unit time, are discussed. Explicit solutions for the optimal periodic PM are given for the Weibull distribution case.
1. Introduction
Preventive maintenance (PM) has played an important role in the effective operation and economic management of industrial systems. PM prevents unexpected catastrophic failures of a system and ultimately extends the system life. PM problems have been studied by many authors. Barlow and Hunter (1960) propose two types of PM policies; in one of them, PM is done periodically, with minimal repair at any intervening failure between periodic PMs. The imperfect PM policy, in which PM is imperfect with probability p, was first introduced by Chan and Down (1978). Nakagawa (1979) proposes three imperfect PM models, among which model B assumes that the system undergoes imperfect PM at periodic times kT, where k = 1, 2, ..., and is minimally repaired at any failure between PMs. In Nakagawa (1979), the system after imperfect PM has
the same failure rate as before the PM with probability p, and is as good as new with probability p̄, and the optimal period minimizing the cost rate per period is obtained. Murthy and Nguyen (1981) discuss an imperfect PM model in which the system undergoes PM at age T_1 if the most recent maintenance action was corrective maintenance (CM), or at T_2 if it was PM; they treat imperfect PM in such a way that the system after PM has a different (worse) failure time distribution than after CM. Brown and Proschan (1983) propose an imperfect repair model in which the failed unit is either minimally repaired with probability p or perfectly repaired with probability 1 − p, and they investigate the aging preservation properties of the life distribution of the unit after imperfect repair. Fontenot and Proschan (1984) and Wang and Pham (1996) obtain optimal imperfect maintenance policies for a one-component system. Nakagawa (1988) considers sequential imperfect PM policies in which the hazard rate after the kth PM becomes a_k h(t), where a_k is an improvement factor, when it was h(t) in period k of the PM. Pham and Wang (1996) give an excellent summary of imperfect maintenance. In this paper, we extend model B of Nakagawa (1979) by assuming additionally that the system is preventively maintained at periodic times kT, k = 1, 2, ..., N, and is replaced by a new system at the Nth PM. The expected cost rate per unit time is obtained. The optimal number N* of periodic PMs and the optimal period T*, which minimize the expected cost rate per unit time, are discussed, and the optimal schedules are computed explicitly when the failure time follows a Weibull distribution.
Notations
h(t) : hazard rate without PM
h_pm(t) : hazard rate with PM
T : period of PM
N : number of PMs, at which the system is replaced
p : probability that the failure rate of the system after PM remains unchanged
C_mr : cost of a minimal repair at failure
C_re : cost of replacement
C_pm : cost of PM
C(T, N) : expected cost rate per unit time
2. Model and Assumptions
The periodic imperfect PM model we consider in this paper makes the following assumptions:
(i) The system begins to operate at time t = 0.
(ii) PM is done at the periodic times kT (k = 1, 2, ..., N − 1), where T ≥ 0, and the system is replaced by a new one at the Nth PM.
(iii) The system after a PM has the same failure rate as before the PM with probability p, and is as good as new with probability p̄.
(iv) The system undergoes only minimal repair at failures between PMs.
(v) The repair and PM times are negligible.
(vi) h(t) is strictly increasing and convex.
3. Expected Cost Rate Per Unit Time
The PM model we consider in this paper is a periodic imperfect PM model for which the hazard rate after each PM remains unchanged with probability p and is reduced to zero with probability p̄ = 1 − p. More explicitly, the hazard rate h_pm(t) of the proposed imperfect PM model is given, for (k−1)T ≤ t < kT, by

h_pm(t) = p^{k−1} h(t) + p̄ ∑_{j=1}^{k−1} p^{j−1} h(t − (k−j)T),   (1)

where k = 1, 2, ..., N, h_pm(0) = h(0), and T is the time interval between PM interventions. The expected cost rate per unit time, C(T, N), is defined as

C(T, N) = [Cost for minimal repairs + Cost for PM + Cost for replacement] / (NT).   (2)

Since it is well known that the number of minimal repairs during period k of PM is a nonhomogeneous Poisson process (NHPP) with intensity function h_pm(t), the expected cost rate per unit time, for 0 ≤ p < 1, is easily given by

C(T, N) = (1/NT) [C_mr ∑_{k=1}^{N} ∫_{(k−1)T}^{kT} h_pm(t) dt + (N − 1)C_pm + C_re]
        = (1/NT) [C_mr ∑_{k=1}^{N} { p^{k−1} ∫_{(k−1)T}^{kT} h(t) dt + p̄ ∑_{j=1}^{k−1} p^{j−1} ∫_{(j−1)T}^{jT} h(t) dt } + (N − 1)C_pm + C_re].   (3)
CTe].
4. Optimal Schedules for the Periodic PM Policy
To design the optimal schedules for the periodic imperfect PM, we need to find an optimal PM period T' and an optimal number N * of PM needed before replacing the system by a new one. The decision criterion to adopt is to minimize the expected cost rate during the life cycle of the system. Applying the similar methods as in Park, Jung and Yum(2000), the optimal schedules are derived for the following two cases. We first find the optimal number of PM, when the PM period T is known. To find the optimal N" which minimizes C(T,N ) , we form the following inequalities.
C ( T ,N
+ 1) 2 C ( T ,N )
and
C ( T ,N ) < C ( T ,N
-
1).
470 For 0 5 p
< 1, it can be easily shown that C ( T ,N
Similarly, C ( T ,N ) < C(T,N
-
+ 1 ) 2 C(T,N ) implies
1) implies
Let
for N have
=
1 , 2, ... and L(T,N ) = 0 for N = 0. Then, from equations (4) and (5), we
and
Lemma 4.1. Suppose that h ( t ) is strictly increasing. T h e n L(T,N ) is increasing in N . Proof Let T
> 0 be given. We note that
L(T,N ) - L ( T ,N
-
(N+l)T 1) = N[JN, hpm(t) d t -
sE~)T hpm(t) dtl.
(8)
Evaluating the integrations in the equation ( 8 ) , we obtain
(N+l)T JNT
hpm(t) dt
= P"H((N
+ 1)T)
H(NT)I + P C , N _ l P [ H ( j T) H ( ( j - 1)T)I -
(9)
and
~F1p hpm(t) dt = p N - l [ ~ (-~H~( ( )N - 1 ) ~ ) 1 +PC;Y=;19-~"T) - H ( ( j - 1)T)1,
(10)
where H ( z ) = S," h(u)du. Substituting the equations (9) and (10) for the equation ( 8 ) ,we have
L(T,N ) - L(T,N
-
+
1 ) = p N [ H ( ( N l ) T )- H ( N T ) - ( H ( N T )- H ( ( N - l ) T ) ) ]> 0.
The last inequality holds since h ( t ) is strictly increasing
Theorem 4.1. Suppose that h ( t ) is a strictly increasing function. T h e n there exists a finite N* which satisfies (7) and it is unique f o r any T > 0 .
47 1
Proof We note that L(T,N ) = 0 when N = 0 and hpm(t)is also increasing in t 2 0 whenever h ( t ) is strictly increasing. Since hpm(kT y ) 5 h p m ( j T y ) for any k 5 j and y > 0,
+
+
sNT
Hence it is sufficient to show that ( N + l ) T hpm(t)dt + 03 as N hpm(t)is an increasing function, for N T < tl < ( N 1 ) T ,
SNT
-
k
+ 03.
Since
(N+l)T
h p m ( t )dt 2
(N+1)T
which goes to 03 as N 03. Hence, L ( T , N ) tends to the desired result holds.
+
03
as N
+
03.
hpm(t) dt
And, it follows from Lemma 4.1 that
Next we consider the case when the number of PMs, N, is fixed. To find the optimal period T* for a given N which minimizes C(T, N) in (3), we take the derivative of C(T, N) with respect to T and set it equal to zero. Then we have

∑_{k=1}^{N} p^{k−1} { ∫_{(k−1)T}^{kT} t dh(t) + p̄ ∑_{j=1}^{k−1} p^{j−1} ∫_{(j−1)T}^{jT} u dh(u) } = [(N − 1)C_pm + C_re] / C_mr.   (12)
(13)
Theorem 4.2. If h ( t ) is strictly increasing and convex function, then there exists a T* which satisfies (7) f o r a given integer N and it is unique.
Proof It is obvious that g ( T ) = 0 when T = 0. For ( k - l ) T < tl < t z < k T ,
this there exists a finite and unique t which satifies (7) for any given N
472 Table 1. Optimal number of pm N* and expected cost rate C(T,N * ) for given T = 0.8, C,, = 1 and Cpm = 1.5. Cre
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Table 2.
2.5
2.0
P
3.0
3.5
N*
C ( T ,N * )
N*
C ( T ,N * )
N*
C ( T ,N * 1
2
2.7411 3.1400 3.1400 3.1400 3.1400 3.1400 3.1400 3.1400 3.1400 3.1400
2 2 2 1 1 1 1 1 1 1
3.0536 3.2851 3.5144 3.7650 3.7650 3.7650 3.7650 3.7650 3.7650 3.7650
2 2 2 2 1 1 1 1 1 1
3.3661 3.5976 3.8269 4.0541 4.3900 4.3900 4.3900 4.3900 4.3900 4.3900
1 1 1 1 1 1 1 1 1
Optimal period T* and expected cost rate C ( T * ,N) with C,,
N* C ( T ,N * ) 2 2 2 2 2 2 1 1 1 1
3.6786 3.9100 4.1394 4.3666 4.5916 4.8146 5.0150 5.0150 5.0150 5.0150
= 1,Cpm = 1.5 and Cre = 3.0.
P N 1
2 3
0.1
T* 1.587 C ( T , N * ) 4.408 T* 1.321 C ( T , N * ) 3.212 T*
1.227
4
C ( T , N * ) 2.606 T* 1.181
5
C(T,N*) T*
6 7 8
2.281 1.154 C ( T , N * ) 2.092 T* 1.136 C ( T , N * ) 1.971 T* 1.136 C ( T , N * ) 1.887 T* 1.113 1.825 C(T,N*)
P
0.3
0.6
1.0
N
1.875 5.115 1.375 4.638 1.197 4.083 1.110 3.604 1.060 3.233 1.028 2.957 1.006 2.753 0.990 2.600
2.308 6.625 1.488 7.654 1.192 7.598 1.040 7.553 0.948 7.344 0.888 7.048 0.845 6.715 0.814 6.375
1.234 9.360 1.651 12.266 0.812 12.202 0.812 12.862 0.665 13.758 0.569 14.731 0.501 15.725 0.450 16.717
9
T* C(T,N*)
10
T' C(T,N*)
11
T* C(T,N*)
12
T* C(T,N*)
13
T* C(T,N*)
14
T*
15
C(T,N*) T*
C(T,N*) 16
T'
C(T,N*)
0.1
0.3
0.6
1.0
1.106 1.779 1.100 1.742 1.095 1.712 1.091 1.688 1.088 1.667 1.085 1.650 1.082 1.635 1.080 1.621
0.979 2.483 0.968 2.392 0.961 2.319 0.955 2.260 0.949 2.211 0.945 2.169 0.941 2.134 0.938 2.103
0.790 6.046 0.772 5.739 0.757 5.459 0.746 5.205 0.736 4.979 0.728 4.778 0.721 4.600 0.715 4.442
0.410 17.700 0.378 18.667 0.362 19.615 0.329 20.551 0.310 21.467 0.294 22.370 0.279 23.254 0.266 24.126
5. Numerical Example
Suppose that the failure time distribution F is Weibull distribution with a scale parameter X and a shape parameter 0,of which the hazard rate is h ( t ) = pX@-ltP-l for p > 0 and t 2 0. As a special case, we take /3 = 3 and X = 1 for t 2 0. Table 1 shows values of the optimal number of P M N * and its corresponding expected cost rate C(T,N * ) for a given T . For Table 1, we take T = 0.8, C,, = 1.0, C,, = 1.5 and C,, = 2.0, 2.5, 3.0 and 3.5. It is interesting to note that as the cost for replacement increases, the number of PMs needed to minimize the expected cost rate increases. Table 2 represents optimal period T* and its corresponding expected cost rate C ( T * , N )for N = 1 to 16 when C,, = 1.0, C,, = 1.5 and C,, = 3.0. Table 2 shows that the value of T* gets smaller and the expected cost
473 r a t e increases as N increases. Also, from it should b e noted Tables 1 and 2 that the optimal number of PM increases a n d t h e optimal period decreases as each P M tends to restore the system to t h e state as good as new one.
References 1. R. E. Barolw and L. C. Hunter, Preventive Maintenance Policies. 0perati:ons Research,
9:90-lOO(1960). 2. P. K. Chan and T. Down, Two Criteria for Preventive Maintenance. IEEE Trans. Reliability, 35:272-273(1978). 3. T. Nakagawa, Preventive Maintenance Policies. IEEE Trans. Reliability, 28:331332(1979). 4. D. N. P.Murthy and D. G. Nguyen, Optimum Age-Policy with Imperfect Preventive Maintenance. IEEE Trans. Reliability, 30230-81(1981). 5. M. Brown and F. Proschan, Imperfect Repair.J. of Applied Probability, 20:851859(1983). 6. R. A. Fontenot and F. Proschan, Some Imperfect Maintenance Model, in Reliability Theory and Models, AP, New York.(1984). 7. H. Pham and H. Wang, Imperfect Maintenance . European J . of Operational Research, 94:425-438(1996). 8. H. Wang and H. Pham, Optimal Age-Dependent Preventive Maintenance Policies with Imperfect Maintenance. International J . of Reliability, Quality and Safety Engineering, (1996). 9. T.Nakagawa, Sequential Imperfect Preventive Maintenance Policies. IEEE Trans. Reliability, 37:295-298(1988). 10. D.H.Park, G. M. Jung and J. K. Yum, Cost Minimization for Periodic Maintenance Policy of a System Subject t o Slow Degradation. Reliability Engineering and System Safety, 68:105-112(2000).
This page intentionally left blank
RELIABILITY ANALYSIS OF WARM STANDBY REDUNDANT STRUCTURES WITH MONITORING SYSTEM
SANG-WOOK SHIN Department of Statistics, Hallym University, Chunchon, 200-702, Korea E-mail: [email protected]
JAE-HAK LIM Division of Business Administration, Hanbat National University Taejon, 305-719, Korea E-mail: jlimohanbat. ac.kr DONG HO PARK Department of Statistics, Hallym University, Chunchon 200-702, Korea E-mail: dhparkosun. hallym. ac. kr
In this paper, we consider a standby redundant structure with a function of switchover processing which may not be not perfect. The switchover processing is governed by a control module whose failure may cause the failure of the whole system. The parameters measuring such an effect of failure of the control module is included in our reliability model. We compute several reliability measures such as reliability function, failure rate, MTBF, mean residual life function, and the steady state availability. We also compare a single unit structure and the redundant structure with regard to those reliability measures. An example is given to illustrate our results.
1. Introduction
The redundant structure is one of the most widely used technique in the reliability design in order to improve the reliability of the system. Depending on the readiness(or consequently, the failure rate) of standby unit, it is classified as hot, cold or warm standby unit. While the active unit is operating, the cold standby unit does not operate and the hot standby unit operates, while the warm standby does not operate but the preliminary electronic source is laid on during the operation of the active one. More details are given in Elsayed7. Kumar and Agarwa13 also present excellent summaries for the cold redundant structure. Various techniques for modeling the reliability of a system are discussed in Endrenyi'. The redundant systems having imperfect switchover device have been extensively studied by many authors 1,4,6. Recently, Lim and Koh' considers a redundant sys-
475
476 tem with a function of switchover processing and suggest a new method of modeling the reliability consideration in which the switchover processing causes an increase of the failure rate of the system. The redundant structuret considered in Lim and Koh' is a two-unit hot standby redundant structure. In this paper, we extend the results of Lim and Koh's' t o the case of a two-unit warm standby redundant structure(hereafter WSRS). We also obtain the steady state availability of a two-unit WSRS. Finally, in order to investigate the effect of additional components on redundancy, we compare the two-unit WSRS and a single unit structure(hereafter SUS) with respect t o several reliability measures and availability measure. 2. Reference Model of a Standby Redundant Structure
Figure 1 shows a reference model of a redundant system with a function of switchover processing which consists of three units: an active unit, a standby unit, and a switchover device. This model is also considered by Lim and Koh'. The control module charges the switchover processing in such a way that it monitors the state of the active unit and let the switchover device, which is not 100% perfect, exchange the active unit for the standby unit as soon as the active unit fails.
...
..
...
....
* ~ ~ - - *- - ~ L_1 +' Switch
Unit
Figure 1. A Reference Model of Redundant System with a Function of Switchover Processing.
In a standby redundant structure in Fig. 1, the failure of control module does affect the operation of system as far as the active unit is working. However, the control module affects the switchover processing if the active unit fails while the control module is in failure state. Hence, it is natural to assume that the switchover processing causes an increase of the failure rate of the system. We assume that the increment of the failure rate due t o the switchover processing is distributed t o each string of the system in such a way that the failure rate of each unit increases by A, = ax, where A, is relatively smaller than the failure rate of a unit, A, i.e. o
477 the standby unit has a failure rate of AP while the active unithas a failure rate of A, where 0 5 P 5 1. (ii) Repairs occur one at a time (sequential repair) and the repair time is exponentially distributed with a mean of 1 / p . (iii) The probability of successful switchover operation is given by p , 0 5 p 5 1 . (iv) The type of standby unit in the redundant structure is a warm standby unit. That is, the failure rate of standby unit is between 0 and the failure rate of active unit.
Not ations R s ( t ) ,R w ( t ) O s ( t ) ,O w ( t ) TS(t), T W ( t )
ms(t),mw(t) As, Aw
Reliability function of SUS and WSRS, respectively. Mean time to the failure (MTTF) of SUS and WSRS, respectively. Failure rate of SUS and WSRS, respectively. Mean residual life (MRL) of SUS and WSRS, respectively. Availability of SUS and WSRS, respectively.
3. Evaluation of Reliability
3.1. Reliability Measures for Nonrepairable S y s t e m In this section, we evaluate two-unit WSRS and SUS with respect to four reliability measures which are reliability fuction, failure rate, MTTF and MRL. For SUS, it is straightforward to compute those reliability measures since the life distribution of the unit is assumed to be exponential. For two-unit WSRS, we apply the state space method to calculate such reliability measures. The results are summarized as follows. (i) Reliability function
R s ( t )= e - x t .
(1)
0 s = 1/x
(3)
(ii) MTTF
ow = (P + 1 + P ) / ( l + P)(1 +
(4)
(iii) Failure Rate TS(t) =
A.
(5)
1 x.
(7)
(iv) MRL ms(t)=
478
3.2. Reliability Measure f o r Repairable S y s t e m Availability is one of the most important reliability measures for a repairable system. The unavailability of a system, defined as 1 - availability, is the probability that the system is in failure state when it is needed to operate. The annual down-time is widely used t o as a measure representing the reliability of telecommunication system, which is computed by the following formula. Annual down - time(min/year)
= unavailability
x 525600.
We obtain the steady state availability of the SUS and two-unit WSRS by using the state space method. More details on the state space method are discussed in Bellcore5. For the SUS, it is well known that the availability is given by
As
= p/(X
+p).
(9)
For the two-unit WSRS, we define four states of the system and draw the state transition diagram(STD) as shown in Fig. 2. The states 2 and 3 represent the failure of t,he system. The state 2, which represents uncoverage outage, is caused by the malfunction of the switchover device and the state 3 is due to the failure of both units. It is quiet straightforward to establish the flow rate equations from the state transition diagram. Solving this equation, we obtain the availability of the WSRS as follows.
4. Comparison of SUS and Two-unit WSRS
In this section, we compare the SUS and the two-unit WSRS in terms of the four reliability measures and the steady state availability. When the control device does not cause an increase in the failure rate of the system, i.e., α = 0, it is clear that the two-unit WSRS outperforms the SUS with respect to all the reliability measures considered. However, we have somewhat different results when the control device has an effect on the performance of the system. The following theorems summarize these results.
Theorem 1: There exists a p* ∈ [0, 1] such that Θ_S ≥ Θ_W for 0 ≤ p ≤ p* and Θ_S ≤ Θ_W for p* ≤ p ≤ 1, where p* = (1 + β)α.

Proof: From equations (3) and (4), it is easy to see that the results hold.
Figure 2. State Transition Diagram (STD) of WSRS. States: 0 = Duplex, 1 = Simplex, 2 = Uncoverage Outage, 3 = Failure.
The value of p* can be obtained by solving the following equation with respect to p:

(β + 1 + p) / ((1 + β)(1 + α)λ) = 1/λ.
We also compare the SUS and the two-unit WSRS in terms of failure rate and mean residual life function. The results are formally stated in the following theorem.
Theorem 2: (i) Suppose that p > α/(1 + α). Then there exists a point t* ∈ R+ such that r_W(t) ≤ r_S(t) for 0 ≤ t ≤ t* and r_W(t) ≥ r_S(t) for t ≥ t*.

Proof: (i) First, we note that the WSRS has a monotone increasing failure rate. It is easy to see that

r_W(t) → (1 + α)λ as t → ∞,   (12)

and, since p > α/(1 + α), we obtain the following inequality:

r_W(0) = λ(1 + α)(1 − p) < λ = r_S(0).   (13)

The result follows immediately from the monotonicity of the failure rate function of the WSRS and (11), (12), and (13). (ii) The proof can be done in a similar manner.
We note that the condition for the existence of a turning point in the MRL is that the value of p is greater than the turning point of the MTTF in Theorem 1. Since the actual probability of successful switchover, p, is close to 1.0, these conditions are satisfied in most real situations. Using the formulas given in (9) and (10), we can compare the SUS and the two-unit WSRS in terms of steady state availability.

Theorem 3: There exists a unique p* ∈ [0, 1] such that A_S ≥ A_W for 0 ≤ p ≤ p* and A_S ≤ A_W for p* ≤ p ≤ 1, provided that α exceeds a threshold value determined by λ, μ and β.

Proof: We note that A_W is non-decreasing in p and that A_S is a constant. Hence, it is sufficient to show that A_S ≥ A_W when p = 0 and A_S ≤ A_W when p = 1. It is somewhat tedious but straightforward to show that both inequalities hold under the stated condition on α. Thus, the existence and uniqueness of p* is established. The value of p* can be obtained by solving the corresponding quadratic equation in p.

Figure 3. The Modified Structure of the Optical Transportation System.
5. Example
For the purpose of illustrating our results, we modify the redundant structure considered by Lim and Koh [8] in such a way that a switchover device is added and the standby units are assumed to be warm standby. Fig. 3 shows the modified structure; we refer to this structure as structure R. In Lim and Koh [8], all units are assumed to operate independently and to have exponential life distributions with the failure rates shown in Table 1. Since Units A, B and C are connected in series, the lives of both the active units and the standby units are exponentially distributed with a failure rate equal to the sum of the failure rates of the three units, which results in 26,000 FITs. Here, 1 FIT (Failure In Time) represents one failure in 10^9 hours. Finally, since the increment of the failure rate should not be greater than the failure rate of the controller, we assume that the proportional increment of the failure rate, α, is given by 0.223.

Table 1. Failure Rate of PBAs (Unit: FIT).
PBA          Failure Rate
Unit A        9,000
Unit B        7,500
Unit C        9,500
Controller    5,800
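The series failure rate and the proportional increment α can be checked directly from the Table 1 values (a minimal sketch):

```python
# Failure rates from Table 1, in FITs (1 FIT = one failure per 1e9 hours).
unit_fits = {"Unit A": 9_000, "Unit B": 7_500, "Unit C": 9_500}
controller_fits = 5_800

# Units A, B and C in series: the string failure rate is the sum of rates.
string_fits = sum(unit_fits.values())
print(string_fits)                 # 26000 FITs

# Proportional increment of the failure rate caused by the controller.
alpha = controller_fits / string_fits
print(round(alpha, 3))             # 0.223

# Failure rate per hour, as used in the availability formulas.
lam = string_fits * 1e-9           # 2.6e-5 per hour
```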
We also consider an alternative structure consisting of Unit A, Unit B and Unit C connected in series. This structure is referred to as structure S. We evaluate the reliability measures of the two structures in terms of the reliability function, MTTF, failure rate and mean residual life; the results are summarized in Table 2.

Table 2. Reliability Measures of Simple Structure and Redundant Structure (Unit of time: 10^5 hours).
We also calculate the unavailability of each of structures S and R for various values of p and β. For both structures, the mean repair time is assumed to be equal to 2 hours. For the purpose of calculation, the values of β are taken as 0, 0.3, 0.6, 0.9 and 1.0, and the values of p are assumed to vary from 0.0 to 1.0 in steps of 0.2. For such values of p and β, we also compute the annual down-time in minutes. Table 3 presents the values of unavailability and annual down-time. The availability for each value of p and β is directly obtained by subtracting the corresponding unavailability from 1.0. In all cases, the annual down-time decreases quickly as the successful switchover probability increases. Table 3 shows that structure S outperforms structure R when the probability of successful switchover is small. These results agree with the results obtained in Theorem 3. It is also noted that the turning point (p*) increases as the value of β increases. The values of p* for various choices of β are listed in Table 4.
Table 3. Unavailability (U.A) and Annual Down-time (A.D) of the Redundant Structure. (Unavailability of Structure S = 2.56 × 10^−5; Annual Down-time of Structure S = 13.6653.)
Table 4. Turning Point (p*) for Given β.
β:   0.0   0.3   0.6   1.0
References
1. P. Das, Effect of Switch-over Devices on Reliability of a Standby Complex System, Naval Research Logistics Quarterly, 19, 517-623 (1978).
2. J. Endrenyi, Reliability Modeling in Electric Power Systems, John Wiley & Sons, New York (1978).
3. A. Kumar and M. Agarwal, A Review of Standby Redundant Systems, IEEE Transactions on Reliability, 29, 290-294 (1980).
4. J. Singh, Effect of Switch Failure on 2-redundant System, IEEE Transactions on Reliability, 29, 82-83 (1980).
5. Bellcore, Method and Procedure for System Reliability Analysis, TR-TSY-001171 (1989).
6. J. Singh and P. Goel, Availability Analysis of a Standby Complex System Having Imperfect Switch-over Device, Microelectronics & Reliability, 35, 285-288 (1995).
7. A. E. Elsayed, Reliability Engineering, Addison Wesley Longman Inc., New York (1996).
8. J. Lim and J. S. Koh, Reliability Analysis and Comparison of Several Structures, Microelectronics & Reliability, 37, 653-660 (1997).
USER RECEPTION ANALYSIS IN HUMAN RELIABILITY ANALYSIS

KIN WAI MICHAEL SIU
School of Design, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
Human reliability analysis (HRA) has been considered for years in engineering design. However, the emphasis in the analysis has always been more on whether or not the pre-determined goals and requirements have been met, and less on issues related to human beings. That is, the human factors directly related to the user of the design have received relatively little consideration. This paper first reviews the concept of "reception," and how this term originally used in literary studies can be applied to design and engineering analyses. Through a case study of the design of street furniture, this paper illustrates the importance of "user reception analysis" in HRA. This paper advocates that users should be considered active agents in the process of designing and developing products. Without a careful investigation and clear understanding of how a design will be received by users, a high degree of reliability cannot easily be obtained. Thus, designers need to conduct in-depth research and analyses to understand the needs, values, preferences, and aspirations of users, and also to find opportunities to prompt users to voice their ideas. (In this paper, the general meaning of user includes the operator of a system; the general meaning of design includes system; and the general meaning of designer includes system analyst and engineer.)
1 Introduction
Human reliability analysis (HRA) is the method by which it is determined how probable it is that a system-required human action, task, or job will be completed successfully within the required period of time, and that no extraneous human actions detrimental to the performance of the system will occur. The results of HRAs are often used as inputs in assessments of probable risk, in which the reliability of entire systems is analyzed by decomposing the system into its constituent components, including hardware, software, and human operators (see http://reliability.sandia.gov/Human-Factor-Engineerin~Human-Reliability-Analysis/human-reliability-analysis.html). In other words, HRA is a kind of analysis that focuses on identifying the likelihood and consequences of human errors (see http://www.concordassoc.com/main.aspx?PID=61). Its emphasis is on whether humans are performing as required. In other words, "error" means that some difference, divergence, or variation exists between the requirements as defined by the designers and the performance of the users/operators. While reviewing current practices in HRA, we noticed that the focus of analyses thus far has been on whether the requirements have been successfully completed or fulfilled. Human action is considered a factor to be measured by using the requirements as a reference. However, this kind of practice can easily cause designers to forget that
users are humans, with diverse needs, values, preferences, and aspirations that are always changing. By borrowing ideas of "reception" originally applied in literary studies, and by discussing some empirical cases of the acts of users, this paper points out that users have their own preferences and their own creative ways (or tactics) of dealing with the designs provided by designers. Thus, to have a high degree of reliability at the end, instead of trying to force users (for example, system operators) to follow pre-determined procedures to meet tasks and continuing to impose requirements on users only from the designers' point of view, designers should go back to the beginning and try to understand the practices (that is, the ways of using and operating) of users. Designers should also reconceptualize their role and see themselves as facilitators, allowing users more flexibility and opportunity to actualize designs and participate in the decision-making process.
2 Reception
The ideas of "reception" (also called "reader's response" at an earlier stage and in some circumstances) in literary studies, which were advocated in the late 1960s, give us a new perspective on the practices of users, and in turn allow us to rethink the role of designers. According to the theory of reception, a literary work is not an object that stands by itself and offers the same view to each reader in each period. Reading, like using and operating in the practices of design and engineering, is not an identical process for everyone [1-4]. On the contrary, reading is always historically situated within specific conditions, and a rereading will of necessity actualize a different work [5]. Unlike the traditional thinking, in which the reader is passive, in reception the reader is considered both an active participant in the text and a detached spectator of it. The reader has his or her subjectivity in the making of individual interpretations as motivated by personal psychic needs. Although the text is produced by the author, neither the author nor the text can fully control the actualization of the readers and the divergence of the responses. It is the reader who brings the text to life, and thus brings the work into existence. Or rather, it is in the act of reading that meaning is realized. In short, the ideas of reception bring out a shift from the formalist view of a text as a static, timeless piece of language to the epistemological emphasis on the dynamic, temporal, and subjective stance of the responding reader, who actualizes the text. Although the ideas of reception as well as reader's response were originally used for literary subjects, the arguments in fact provide designers with valuable insights into how users interact with designs (and systems) [6-7]. Similar to the idea of the incompleteness of a text or any other form of discourse, designers should consider a design or a system as being full of gaps or as having no real existence. It is incomplete until it is used, and it initiates the performance of meaning rather than actually formulating the meanings themselves. Thus, without the participation of the individual user, there can be no performance. In other words, a user should be seen as a true producer of a design or a
system, who actualizes the design or system by filling in gaps or indeterminacies in its meaning. This kind of user creation and participation can be called an "act of production." In brief, user reception means the way a user actively reacts to a design, instead of passively following it.
3 Case Study of User Reception Analysis: Street Furniture Design in Hong Kong
Since 2000, several studies on the design and management of street furniture have been conducted in Hong Kong. A major objective of the studies is to understand the ways in which users interact with the furniture, that is, user reception analysis. The findings indicate that neither the government nor the designers conducted a serious user reception analysis before installing such products and systems in urban areas [8]. The result is that the designs and systems of operation sometimes cannot fit the actual needs of the users, particularly with regard to cultural and social factors.
Figure 1. Users may not follow the intended purpose of a design and system. The picture illustrates that housewives in public housing estates use playstructures as sun-drying facilities.
One example relates to playstructures (systems) in Hong Kong. Originally, the government imported the facilities from foreign countries in order to promote a healthy life-style for Hong Kong people by encouraging them to engage in daily exercise. However, these kinds of facilities may not always be used as planned/designed. Eventually some of them were used as racks upon which to lay out quilts, winter clothes, and sometimes salt-fish to dry in the sun (see Figure 1). (Chinese people traditionally believe that drying quilts and winter clothes in the sun is the best way to kill germs. Even though many laundries nowadays offer quilt-washing and other such services, many housewives still like to dry their quilts and clothes in the sun, particularly in between the seasons.) Another example is the design of trash bins and the ways of collecting trash. Figures 2a-b illustrate how a cleaner (one kind of user/operator of the rubbish bin) collected trash from a public trash bin. According to the working procedure, the cleaner (like other cleaners) was required to put a plastic bag inside the trash bin. In each trash collection, she needed to pack the trash well, take the bag out and then replace it with a new bag. However, to reduce her workload and simplify the procedure, she put a bamboo basket inside the trash bin (actually, inside the plastic bag). The basket thus became the container that held the trash instead of the plastic bag. Each time she collected the trash, she just took out the basket and poured the trash into the collection trolley, then put the basket back into the trash bin. Therefore, she did not need to handle, pack, and replace the plastic bags. While she did her job, however, the trash, particularly the dust from the trash, tended to fly out into the street or be exposed to the air.
Figures 2a-b. Convenience is the major concern of cleaners. Sometimes they will not follow the assigned working procedure in handling the cleaning of the trash. Instead of packing the trash in a plastic bag, taking the bag out and replacing it with a new bag each time, they may prefer the easier method of placing a bamboo basket inside the trash container to hold the trash and pouring out the trash each time.
It should be noted that the object of the examples illustrated above is not to support illegal practices, that is, the misuse of a design or system. There is also no intention to devalue either the designs or the professional role and knowledge of designers. However, with
respect to HRA, the argument here is that designers should realize that the users' ways of operating are not simply issues of right-or-wrong or legal-or-illegal. Designers should know how and why users expect and act differently, or contradictorily, from the original expectations and decisions (for example, well-predetermined tasks and procedures of a system operation) of the designers. When we review current designs and systems, particularly those that are said to have been designed for the public interest, professionals, who most of the time own or are assigned the authority and right to speak, always expect to use requirements (that is, strategies) to put users into a predetermined mode of practice. However, perversely, users do not always follow exactly what professionals expect and decide. As with the above two examples, users use/operate the designs in their own ways. This kind of practice (that is, tactics) is seen all the time. In another example, some people prefer to use their foot to step on the handle of a toilet flushing mechanism when flushing a public toilet, instead of using their hand to do it as planned by the designer.
4 Designers' Roles
As mentioned above, in HRA, designers as well as systems analysts always take the requirements of the design/system as the point of departure of the analysis. In fact, we cannot deny that there seems to be nothing wrong with this kind of thinking, as the goal of HRA is to ensure the design/system runs in an effective and accurate way. However, designers who think this way are easily deluded. They tend to expect to have strict control over the design/system and prefer an increasing level of standardization. This makes it easy for designers to overlook one of the most important aspects of human reliability analysis, that is, that they are dealing with humans, whose needs and ways of operating are diverse and always changing. Thus, in HRA, besides a systematic analysis of the tasks and the sequence of operations, more focus should be placed on the user, the major involved party of a design and system, instead of on the designer and the design. This shift in attention is not intended to devalue the endeavor of design, as designers still need to play an important role. Nor does it simply mean that the diversity of the users' needs and wants should be recognized, which many designers nowadays do. What it does mean is that designers should not impose their own value judgments on users without engaging in a careful investigation and obtaining a good understanding of the users. For example, although some system analysts may hold the view that they only need operators to operate the system according to well-defined tasks and a schedule, they should also not forget that operators are not equivalent to "average people," nor are they robots. Operators of a system have their own needs, values, and preferences. Strictly controlling and forcing the operators to follow the intended requirements (steps, procedures) may work in some circumstances, but this may not be the most effective approach. As discussed above, users always have their own "tactics" and "creatively act" to fulfill their own needs and preferences [9]. Like the punching and piercing machine
designs in the late 1960s in Hong Kong: no matter how the designers/engineers changed the designs and increased the so-called safety devices in the machines, the workers still preferred to remove the safety guards and disable the safety precaution systems. Accidents therefore continually occurred. In fact, it was only when the designers/engineers realized how much importance the workers attached to production rates, especially in the 1960s when the standard of living was low, and considered both the speed of production and safety in re-designing the machines, that the accident rate started to drop. In gaining a better understanding of the needs, values, preferences, and aspirations of users, it is very important to conduct an in-depth user reception analysis. This kind of analysis is different from conventional studies of system configurations, time management, human-machine interfaces, machine efficiencies, and effectiveness. It is more related to cultural, social, psychological, and ideological factors [10-12]. Indeed, user reception analysis is more than these. It is an investigation of human behavior and the rationale behind such behavior. The methods of investigation include cultural and social studies, observations (and maybe, sometimes, participant observation), in-depth direct interviews, and so forth [13]. In fact, these kinds of user reception analyses are also quite passive, as the participation of the users still depends on what the designers decide and provide [14]. Thus, "participatory research" should be promoted. As the name suggests, users need to have the opportunity to engage in decision-making processes. That is, for example, operators of a system should have the opportunity to participate in the planning process, and to voice their concerns, worries and opinions. This opportunity to participate not only results in better user-fit solutions, but also in an increased sense of having influenced the decision-making process with regard to the design and an increased awareness of the consequences of the decisions made [15-16]. Last, but not least, allowing users to participate in the decision-making process does not mean that designers do not need to do anything or should be ignored. In fact, this misconception is also one of the reasons why so many designers still expect to retain the right to make decisions. On the contrary, in the process of setting up the requirements of a system, for example, system analysts/engineers should actively adopt two important roles. The first is the role of coordinator, gathering together different interested groups and professionals, and facilitator, helping operators to participate, modify, experience, create, produce, and actualize the system, thereby bringing the system to life [17]. The second role of designers is to explore the diverse backgrounds, beliefs, needs, wants, values, preferences, and ways in which people find satisfaction, since all kinds of findings can help them to better understand the users, and in turn benefit the decision-making process. In exploring and gaining a better understanding of the users, as well as their ways of operating, designers can no longer be like traditional scientists, hiding themselves in laboratories or workshops. As mentioned above, they need to conduct more empirical studies.
5 Conclusions
HRA has wide-ranging applications: from launching a missile to controlling the opening of a door; from clearing a whole city of residents as a result of a catastrophe to the daily routine cleaning of a toilet or clearing of a rubbish bin. No matter what the scale of the analysis is, designers are dealing not only with hardware like machines and system controllers, but with humans who have their own needs, value judgments, preferences, and aspirations. As mentioned above, this paper does not intend to devalue systematic analyses of the performance of systems. However, the key point emphasized here is that, for example, system analysts/engineers should not impose requirements on a system without considering human nature and factoring in the operators of the system. Instead, analysts/engineers should know and respect the ways in which the operators operate the system, and always keep in mind that it is the operators who bring the system to life, and thus into existence.

Acknowledgments

The author would like to acknowledge the research grant provided by The Hong Kong Polytechnic University to support this study. The author would also like to thank the Hong Kong Leisure and Cultural Services Department, the Food and Environmental Hygiene Department, the Architectural Services Department, the Housing Department, the Urban Renewal Authority, and the Wan Chai District Office for providing information.

References
1. E. Freund, The Return of the Reader: Reader-Response Criticism, New York, NY, Methuen (1987).
2. W. Iser, The Act of Reading: A Theory of Aesthetic Response, London, Routledge & Kegan Paul (1978).
3. H. R. Jauss, Aesthetic Experience and Literary Hermeneutics, Minneapolis, MN, University of Minnesota Press (1982).
4. W. J. Slatoff, With Respect to Readers: Dimensions of Literary Response, New York, NY, Cornell University Press (1970).
5. J. Storey, Cultural Consumption and Everyday Life, New York, NY, Arnold; co-published by Oxford University Press (1999).
6. D. A. Norman, The Design of Everyday Things, Cambridge, MA, The MIT Press (1998).
7. M. de Certeau, L. Giard, and P. Mayol, The Practice of Everyday Life: Volume 2: Living & Cooking, Minneapolis, MN, University of Minnesota Press (1998).
8. K. W. M. Siu, Product design and culture: A case study of Hong Kong public space rubbish bins, Conference Proceedings: Hawaii International Conference on Arts and Humanities [CD publication], Hawaii, HI, University of Hawaii, West Oahu (2003).
9. M. de Certeau, The Practice of Everyday Life, Berkeley, CA, University of California Press (1984).
10. P. W. Jordan, Putting the pleasure into products, IEE Review, 249-252 (1997).
11. P. W. Jordan and W. S. Green, Human Factors in Product Design: Current Practice and Future Trends, London, Taylor and Francis (1999).
12. See [8].
13. K. W. M. Siu, Users' creative responses and designers' roles, Design Issues, 19(2), 64-73 (2003).
14. See [13].
15. H. Sanoff, Integrating Programming, Evaluation and Participation in Design: A Theory Z Approach, Hants, Ashgate Publishing Limited (1992).
16. H. Sanoff, Community Participation Methods in Design and Planning, New York, NY, John Wiley & Sons (2000).
17. S. King, Co-design: A Process of Design Participation, New York, NY, Van Nostrand Reinhold (1989).
EVALUATION OF PARTIAL SAFETY FACTORS FOR ESTABLISHING ACCEPTABLE FLAWS FOR BRITTLE PIPING

A. SRIVIDYA
Associate Professor, Reliability Engineering, IIT Bombay, Mumbai, 400 076, India

ROHIT RASTOGI
Scientist E, Bhabha Atomic Research Centre, Anushaktinagar, Mumbai, 400 073, India

MILIND J. SAKHARDANDE
M. Tech Student, Reliability Engineering, IIT Bombay, Mumbai, 400 076, India

This paper presents a case study on the application of the Load and Resistance Factor Design (LRFD) approach to Section XI Appendix H of the ASME Boiler and Pressure Vessel Code for flaw evaluation. This study considers a case of brittle piping. Circumferential and longitudinal cracks are considered. Partial safety factors are generated for a maximum reliability index of 2.0 in the case of circumferential flaws, and for two levels of target failure probability (reliability index β = 2.0, 3.09) in the case of longitudinal flaws. The partial safety factors are generated for fracture toughness and applied stress values. The variability in fracture toughness is modeled using a Weibull distribution; a coefficient of variation of 10-30% is considered in fracture toughness. The stress is modeled with normal, lognormal and extremal distributions with a coefficient of variation of 10-30%. Since the effect of statistical correlation on the load and resistance factors is relatively insignificant for target reliability values of practical interest, the effect of correlated variables may be neglected.
1 Introduction

The worldwide interest in the probabilistic design of systems, and in the use of probabilistic approaches to take care of the variations present in the variables affecting design, has been influential in directing major industrial sectors toward this methodology. As seen in the literature, probability-based design has made its impact on the ASME code (Section XI) [1], in which deterministic design methods are used. A lot of work involving partial safety factors has been carried out in the American petroleum industry [2], and the authors suggest that this work may be useful for the piping systems of the nuclear power industry. Therefore, in this paper efforts have been made to develop partial safety factors for resistance and load for a piping system in a nuclear power plant. The problem formulation is done for the linear elastic fracture mechanics failure criterion, considering the cases of a circumferential flaw and an axial flaw. For safety checking, the LRFD format [3] is used.
2. Problem Formulation

2.1. Circumferential Flaw

The linear elastic fracture mechanics (LEFM) criterion is considered when ductile crack extension does not occur prior to fracture [4]. This criterion is used in reliability analysis when the K_r/S_r ratio in the screening criteria is greater than 1.8 [5]. Under the LEFM failure criterion, failure occurs when the stress intensity factor (K_I) is greater than the material fracture toughness (K_IC):

K_IC ≤ K_I.   (1)

The corresponding limit state equation is

K_IC − (K_Im + K_Ib) = 0.   1(a)

Here, the expansion stresses (P_e) are not considered, so the limit state equation reduces accordingly, with

σ_a = P/(2πRt) the axial stress and σ_b the bending stress,

and F_m and F_b the parameters for the circumferential flaw membrane stress intensity factor and the circumferential flaw bending stress intensity factor, respectively. K_r and S_r are the components of the screening criteria [5]. In the present case, the fracture toughness (K_IC), the axial stress (σ_a) and the bending stress (σ_b) are considered as basic variables. The flaw depth (a), the mean radius of the pipe (R) and the thickness of the pipe (t) are considered to be deterministic variables. Normalizing the variables with their respective mean values, the safety margin equation 1(a) reduces to equation 2(a); dividing equation 2(a) by the mean fracture toughness μ_KIC, it reduces to

X_k − X_a·A − X_b·B = 0,   3(a)

where the constants A and B are evaluated from the mean values of the basic variables and the deterministic parameters. As the basic variables are normalized with respect to their mean values, the normalized basic variables have a mean of unity and a standard deviation equal to the coefficient of variation (COV).

2.2. Axial Flaw

In the case of axial flaws, the stress intensity factor is given by
K_I = σ_0 √(πa/Q) F,

where P is the total axial load on the pipe including pressure, kips (kN), and

Q = 1 + 4.593 (a/l)^1.65.

The failure equation for axial flaws with the LEFM failure criterion is

K_IC − σ_0 √(πa/Q) F = 0, with the axial stress σ_0 = 0.5 PR/t.

Normalizing the basic variables with respect to their individual mean values and dividing by μ_KIC, the equation reduces to

X_k − X_a·A = 0,

where the constant A is evaluated from the mean values of the basic variables and the deterministic parameters.
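As a small numerical check of the flaw shape parameter and the resulting stress intensity factor (the flaw dimensions and the stress value below are illustrative assumptions, not values from the paper):

```python
import math

def shape_factor_Q(a: float, l: float) -> float:
    """Flaw shape parameter Q = 1 + 4.593 * (a/l)**1.65."""
    return 1.0 + 4.593 * (a / l) ** 1.65

def stress_intensity(sigma0: float, a: float, l: float, F: float = 1.0) -> float:
    """K_I = sigma0 * sqrt(pi * a / Q) * F for a surface flaw (LEFM)."""
    return sigma0 * math.sqrt(math.pi * a / shape_factor_Q(a, l)) * F

# Illustrative numbers: flaw depth 5 mm, flaw length 30 mm, stress 50 MPa.
print(shape_factor_Q(0.005, 0.030))          # ~1.24
print(stress_intensity(50.0, 0.005, 0.030))  # K_I in MPa*sqrt(m)
```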
3. Probabilistic Based Design Methodology

In general, the probability of failure p_f of a structural element is given as

p_f = P[G(R, S) ≤ 0],

where G(·) is the limit state function, and the probability of failure is identical with the probability of limit state violation [6].
3.1. Determination of Partial Safety Factors

Individual safety factors that are attached to the basic variables are called partial safety factors. The partial safety factors show the effect each variable has on the probability of failure. These factors are evaluated for the chosen target reliability index β [6]. The performance function is given by

G(x_1, x_2, ..., x_n) = 0.   (7)

If x_i* is the design value of the original variable x_i, the failure surface equation is

G(x_1*, x_2*, ..., x_n*) = 0.   7(a)

If the partial safety factors are attached to the nominal values of the variables, the above equation becomes

G(γ_1 x_n1, γ_2 x_n2, ..., γ_n x_nn) = 0.   7(b)

The design point should be the most probable point. In the normalized coordinate system, the most probable failure point is

z_i* = α_i* β,   7(c)

where

α_i* = −(∂G/∂z_i)* / [Σ_j ((∂G/∂z_j)*)²]^(1/2),

evaluated at the design point. The original variates are given by

x_i* = μ_i + σ_i z_i*.

Hence the partial safety factor required for the given β is γ_i = x_i* / x_ni.
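For the linear limit state 3(a), the design point, reliability index and partial safety factors can be written in closed form when all basic variables are treated as normal. The sketch below makes exactly that simplifying assumption; in the paper the fracture toughness is Weibull and the stresses may be lognormal or extremal (handled with the COMREL software), and the values of A, B and the COVs here are illustrative only. Note also that resistance-side factors are sometimes reported as the reciprocal of γ_i.

```python
import math

def psf_linear_normal(A, B, cov_k, cov_a, cov_b):
    """Partial safety factors for G = Xk - A*Xa - B*Xb, where Xk, Xa, Xb
    are Normal with mean 1 and standard deviation equal to the COV.
    Returns (reliability index beta, dict of PSFs gamma_i = x_i*/x_ni)."""
    mu_G = 1.0 - A - B
    sd = (cov_k, cov_a, cov_b)
    grad = (cov_k, -A * cov_a, -B * cov_b)   # dG/dz_i in standard space
    sigma_G = math.sqrt(sum(g * g for g in grad))
    beta = mu_G / sigma_G                    # reliability index
    # Direction cosines and design point z_i* = alpha_i* beta, eq. 7(c).
    alphas = [-g / sigma_G for g in grad]
    # Original variates x_i* = mu_i + sigma_i z_i*; nominal values are 1.
    x_star = [1.0 + s * a * beta for s, a in zip(sd, alphas)]
    return beta, dict(zip(("PSFk", "PSFa", "PSFb"), x_star))

beta, psf = psf_linear_normal(A=0.4, B=0.3, cov_k=0.2, cov_a=0.1, cov_b=0.1)
print(round(beta, 3), {k: round(v, 3) for k, v in psf.items()})
```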
4. Case Study

The authors present here two case studies on the application of the Load and Resistance Factor Design (LRFD) approach to Section XI Appendix H of the ASME Boiler and Pressure Vessel Code for flaw evaluation, as follows.

4.1. Case Study for Circumferential Flaw

The following analysis has been carried out for evaluating partial safety factors for a circumferential flaw with the LEFM criterion. Here the flaw depth is considered as a deterministic parameter. The variability in fracture toughness is modeled using a Weibull distribution; a coefficient of variation of 10-30% is considered in fracture toughness [1]. The axial stress is modeled as a normal distribution with a coefficient of variation of 10%, and the bending stress is modeled with normal, lognormal and extremal distributions with coefficients of variation of 10-30%. The COMREL software has been used for the determination of the partial safety factors. The different cases, i.e. the various distributions of fracture toughness, axial stress and bending stress, their means and coefficients of variation, along with the nature of the parameters A and B, are shown in Table 1.
Table 1. Cases considered for the circumferential flaw analysis (distributions, means and COVs of the basic variables, and the parameters A and B).
4.1.1. Data and Plots
Figure 1. Plots of B vs. PSF and B vs. reliability index.
4.2. Case Study for Axial Flaw

The cases shown in Table 3 are considered for the analysis, wherein the flaw depth 'a' is considered as deterministic. The variability in fracture toughness is modeled using a Weibull distribution with a coefficient of variation of 10-30% [1]. The axial stress is modeled as normal, lognormal and extremal with a COV of 10-30%. The target reliability index is chosen as β0 = 2.0 and β0 = 3.09. For different values of the parameter A, reliability analysis is performed until the reliability index is equal to the target reliability indices.
Table 3. Cases considered for the axial flaw analysis.
4.2.1. Plots

Figure 2. Plots showing the behaviour of the PSFs with increase in β for X_a following the normal distribution (panels: X_a ~ N(1, 0.1) and X_a ~ N(1, 0.2); β = 2.0 and 3.09).
5. Conclusions

In the case of circumferential flaws, the values of the partial safety factors on axial stresses (PSF_a) and bending stresses (PSF_b) are approximately 1.00, considering a deterministic flaw depth. In the case of axial flaws, the partial safety factor on fracture toughness (PSF_k) has a value of approximately 2.1-2.5 when the target β0 = 2.0 and approximately 4.9-5.8 when the target β0 = 3.09, with a deterministic flaw depth. Here, as PSF_a remains almost constant (1.0-1.2) for β0 = 2.0 and 3.09, this clearly indicates that we have to increase the resistance of the material, i.e. the PSF_k value has to be increased substantially to achieve higher reliability. The present flaw acceptance criteria of ASME [5] are deterministic in nature and apply a safety factor only on the resistance side, whereas PSF design attaches individual partial safety factors to the basic variables on the resistance as well as the stress sides of the design, yielding a safer design and also providing relief in safety margins.

References
1. Bloom, J.M., "Partial Safety Factors and Their Impact on ASME Section XI", Piping Conference, July 23-27, 2000, Seattle, Washington.
2. Osage, David A., Wirsching, Paul H., Mansour, Alaa E., "Application of Partial Safety Factors for Pressure Containing Equipment", PVP-Vol. 411, pp. 121-142.
3. Ravindra, Mayasandra K., Galambos, Theodore, "Load and Resistance Factor Design for Steel Structures", Journal of the Structural Division, Sept 1978, pp. 1337-1353.
4. Parker, A. P., "The Mechanics of Fracture and Fatigue: An Introduction", London, NY, E & FN SPON LTD, 1981.
5. Appendix H, "Evaluation of flaws in ferritic piping", ASME Section XI, Boiler & Pressure Vessel Code.
6. Ranganathan, R., "Structural Reliability Analysis & Design", Jaico Publishing House, 2000.
AUTOMATIC PATTERN CLASSIFICATION RELIABILITY OF THE DIGITIZED MAMMOGRAPHIC BREAST DENSITY T. SUMIMOTO Okayama University, Medical School, 2-5-1 Shikata-cho, Okayama, 700-8558, Japan
S. GOTO AND Y. AZUMA
Okayama University, Medical School, 2-5-1 Shikata-cho, Okayama, 700-8558, Japan
A computer-aided system for breast density pattern classification was built, based on research on objective quantification, which converts breast density into a glandular rate employing a phantom of synthetic breast-equivalent resin material, and on subjective quantification of radiologists' visual assessment with the method of analysis of paired comparisons employing the Thurstone-Mosteller model. The system consists of two processes. In the first process, the pixels of a digitized mammogram are converted into glandular rates using a neural network to which the exposure conditions and the breast thickness are input. In the second process, the pattern classification and the glandular rate computation are performed, taking visual assessment into consideration, by a neural network to which feature values of the histogram of the glandular rate image converted into gray levels are input. As a result of receiver operating characteristic (ROC) analysis estimating the pattern classification reliability against visual assessment in 93 samples, the area A_z under the ROC curve was 0.95 or more for each pattern. In the computed glandular rate against visual assessment, the maximum absolute error was 13% and the average absolute error was 3.4%.
1 Introduction

Information derived from x-ray mammographic breast densities (breast densities) provides one of the strongest indicators of the risk of breast cancer [1-7]. In clinical practice, radiologists routinely estimate the breast density of mammograms, as shown in Fig. 1, by using the BI-RADS lexicon [8] as recommended by the American College of Radiology, as follows: 1) The breast is almost entirely fat. 2) There are scattered fibroglandular densities. 3) The breast tissue is heterogeneously dense. 4) The breast tissue is extremely dense, which could obscure a lesion on mammography. Since this evaluation is performed visually, computer-aided diagnosis (CAD) aimed at automatic classification has been studied. We have built an objective quantification system of glandular density which converts the digitized mammogram into a glandular-rate (%) image using digitized images of a breast tissue equivalent phantom (hereafter, phantom). Moreover, in order to attempt a standardization of visual assessment, we proposed a quantification method of glandular density, as a glandular rate (%) by visual evaluation, using a paired
comparison method of the Thurstone-Mosteller model [9-12], which is one of the methods of psychological measurement. When the results for the same samples with both quantification methods were compared, we found that the visual assessment result was increasingly over-estimated as the glandular rate of the sample image increased. Accordingly, in this study we tried to build a CAD system aiming at automatic classification into the BI-RADS patterns, based on the objective quantification method and the visual assessment quantification method, using a neural network which takes the histogram characteristic values of a glandular-rate image as input and the results of visual assessment as teaching data.
Fig. 1 Fat is radiologically lucent and appears dark on a mammogram; connective and epithelial tissues are radiologically dense and appear light.
2 Materials and Methods

2.1. Clinical Image Data Acquisition

The clinical data set consisted of 100 mammograms of 50 patients. These were sampled from 185 clinically normal patients who were examined between December 1999 and December 2001. The information recorded for each mammogram included the following: (1) the patient's name, age and number of childbirths; (2) the patient's clinical data (compressed breast thickness and projection view); (3) the x-ray exposure conditions (kV, mAs, filter material and focal spot size). Manual exposures were eliminated from the data in order for appropriate density to be evaluated. Magnification views and exposures of augmented breasts with the implant displaced were also eliminated from the data. Only the craniocaudal (CC) view was employed in this study, because the mediolateral oblique (MLO) view image includes a pectoralis major muscle area, and Kalbhen et al. [13] reported that the CC view is the most accurate and reproducible projection for calculating breast volume. All images were obtained from an inverter-type mammography unit (frequency 2.5 kHz) with a molybdenum (Mo) anode and Mo filter, and acquired with a grid using the Kodak Min R 2000 screen-film system. The data set was digitized with 800 dpi and 16-bit gray levels by a film digitizer (VIDAR MAMMOGRAPHY PRO, VIDAR Systems Corporation, VA, USA). The necessary area in the digitized image was extracted in order to exclude nameplates, etc., as pre-processing.
Fig. 2 Breast-equivalent phantoms and their images (phantoms at 28 kV-80 mAs-4 cm).
2.2. Objective quantification system of glandular density

We devised an objective quantification system of breast density from digitized mammograms with the breast-equivalent phantoms that are used as automatic exposure control testing tools for x-ray equipment and are able to represent varying breast composition. The phantoms are slabs of breast-equivalent resin material of various known uniform adipose/gland mixes, as shown in Fig. 2. The resin materials of the phantom mimic the photon attenuation coefficients of a range of breast tissues. The attenuation coefficients are calculated with the "mixture rule" and the photon mass attenuation and energy absorption coefficients table of Hubbell [14]. The average elemental composition of the human breast being mimicked is based on the individual elemental compositions of adipose and glandular tissues as reported by Hammerstein et al. [15]. They are commercially available (Computerized Imaging Reference System, Inc.; Norfolk, VA, USA), and their configuration of 12.5 × 10 cm² and thicknesses of 0.5 to 2 cm (6 cm when all slabs are piled) are suitable for this purpose. As for the ratios of uniform adipose (%)/gland (%) mixing, 0/100, 20/80, 50/50, 80/20 and 100/0 were employed in our study. By employing digital processing and phantoms, the breast density of a digitized mammogram can be quantified with a conversion curve from pixel value to glandular rate. However, the conversion curve is changed by variations of the patient information (compressed breast thickness, breast composition, etc.) and the exposure information (tube voltage, exposure time, etc.) of the mammography unit. To address this problem, phantom images were obtained for variations of the exposure conditions (kV and mAs) and the phantom conditions (thickness and glandular rate). Since these data were abundant and involved, the conversion curve was obtained using a back propagation neural network (BPNN-1) applicable to function approximation (nonlinear regression). BPNN-1 has four input units (corresponding to kV, mAs, compressed breast thickness and glandular rate), 32 hidden units and one output unit (the average pixel value of the digitized phantom image), as shown in Table 1. The conversion curve obtained by BPNN-1 has individual characteristics for each mammogram. Accordingly, the pixel value of the digitized mammogram can be individually converted into the glandular rate for each pixel, i.e., a glandular-rate image can be obtained (see Fig. 3). Thus, glandular density was quantified by averaging the glandular rate over each pixel of the breast area.
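A minimal sketch of the BPNN-1 regression stage, substituting scikit-learn's MLPRegressor for the authors' implementation; the training arrays below are placeholders standing in for the measured phantom data (inputs: kV, mAs, thickness, glandular rate; target: average pixel value), and the grid-search inversion is one plausible way to turn the fitted conversion curve into a pixel-to-glandular-rate mapping:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Placeholder phantom measurements: columns are kV, mAs, thickness (cm),
# glandular rate (%); targets are the average pixel values (placeholders).
X_train = np.array([
    [28, 80, 4.0,   0],
    [28, 80, 4.0,  20],
    [28, 80, 4.0,  50],
    [28, 80, 4.0,  80],
    [28, 80, 4.0, 100],
])
y_train = np.array([210.0, 180.0, 140.0, 100.0, 70.0])

# 4 inputs -> 32 hidden units -> 1 output, as described for BPNN-1.
bpnn1 = MLPRegressor(hidden_layer_sizes=(32,), max_iter=5000, random_state=0)
bpnn1.fit(X_train, y_train)

def pixel_to_glandular_rate(pixel_value, kv, mas, thickness):
    """Invert the conversion curve: scan candidate glandular rates and pick
    the one whose predicted pixel value is closest to the observed value."""
    rates = np.arange(0.0, 100.5, 0.5)
    grid = np.column_stack([np.full_like(rates, kv),
                            np.full_like(rates, mas),
                            np.full_like(rates, thickness), rates])
    preds = bpnn1.predict(grid)
    return rates[np.argmin(np.abs(preds - pixel_value))]

print(pixel_to_glandular_rate(150.0, 28, 80, 4.0))
```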
2.3. Visual assessment quantification system of glandular density

A paired comparison method is one of the ranking methods and constitutes a psychological measure of sensuous desirability, such as "more desirable" and/or "better". Accordingly, the method can quantify (interval-scale) the grade of the gap between the ranked samples. Here, we set "the amount of gland" as the psychological measure instead of "desirability." In this case, the calculated interval-scale value is equivalent to the difference of the glandular rate between the ranked mammogram samples. Furthermore, if the "visually estimated glandular rate" is related to an interval-scale value, quantification by visual assessment becomes possible. In this study, we employed Thurstone's method [9-12], which is a typical paired comparison method. Thurstone's method calculates the rate of the judgment that i is better than j, P_ij, when stimuli i and j are compared. Practically, two samples are extracted randomly from k samples, as (i, j), and these are observed and compared as to which sample has more gland. All k samples are evaluated visually by N repetitions of the visual assessment for the kC2 combinations. The results of observation are calculated according to the situation of Thurstone's Case V, and all samples are ranked. The maximum and minimum glandular rates determined by the observers' "visually estimated glandular rate" are related to the interval-scale values of the ranked samples. Finally, each of the k samples is quantified as a visual assessment.
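A minimal sketch of the Thurstone Case V scaling step: the observed proportions P_ij from the paired comparisons are passed through the inverse normal transform, and the least-squares scale values are the row means. The proportion matrix below is a made-up example for four samples, and the visual anchors are placeholders:

```python
import numpy as np
from scipy.stats import norm

# p[i, j] = observed proportion of judgments "sample i has more gland
# than sample j" (made-up example; p[j, i] = 1 - p[i, j]).
p = np.array([
    [0.50, 0.80, 0.90, 0.95],
    [0.20, 0.50, 0.70, 0.85],
    [0.10, 0.30, 0.50, 0.70],
    [0.05, 0.15, 0.30, 0.50],
])

# Case V: z_ij = Phi^{-1}(p_ij) estimates s_i - s_j; the least-squares
# solution (with the scale values summing to 0) is the row mean of z.
z = norm.ppf(np.clip(p, 0.01, 0.99))   # clip to avoid +/- infinity
scale = z.mean(axis=1)
print(scale)                            # interval-scale values per sample

# Relate the scale linearly to visually estimated glandular rates by
# anchoring the extreme samples to the observers' max/min estimates.
g_min, g_max = 10.0, 90.0               # placeholder visual anchors (%)
rates = g_min + (scale - scale.min()) * (g_max - g_min) / (scale.max() - scale.min())
print(rates)
```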
2.4. Automatic classification system into a BI-RADS pattern

The system consists of two processes. In the first process, the pixels of a digitized mammogram are converted to a glandular rate image using the objective quantification system of glandular density. In this case, 0-100% of glandular rate was linearly related with the 201 gray-scale levels of 28-228 as the practical glandular rate image for analysis. Thereby, analysis can be performed above the gray level of 28, i.e., within the breast area. An adaptive dynamic range compression technique was applied to the glandular rate image to reduce the range of the glandular rates' level of distribution in the low frequency background and to enhance the differences in the characteristic features of the glandular rate histogram. The glandular rate histogram within the breast area was generated and normalized, and passed through an averaging window to smooth out random fluctuations. Then, the following characteristic values were extracted from the form of the histogram, as shown in Fig. 4:

1. The average gray level of the breast area, g_ave
2. The standard deviation of the histogram distribution
3. The minimum gray level, g_min
4. The zero-cross point of the first derivative, i.e., the gray level of the peak at the maximum frequency of the histogram, g_peak
5. The frequency of the peak, H(g_peak)
6. The energy ratio of the right (E_R) and left (E_L) sides of the peak point

Furthermore, a threshold is set on the histogram, and the ratio of the total number of pixels in the right and left histogram distributions separated by the threshold is calculated while changing the threshold. When the ratio is equal to a visual glandular rate, the threshold value is named the visual threshold. In the second process, BPNN-2 was trained for automatic classification into a BI-RADS pattern and an objective glandular rate based on visual assessment, using the histogram characteristic values and the glandular rate of visual assessment. BPNN-2 has six input units (corresponding to the six histogram characteristic values), 16 hidden units and two output units (BI-RADS pattern number and visual threshold), as shown in Table 2. The round-robin (i.e. leave-one-out) method was used to test the generalization ability of the total system for this data set.
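A minimal sketch of the histogram feature extraction that feeds BPNN-2 (the function and array names, and the width of the averaging window, are assumptions for illustration):

```python
import numpy as np

def histogram_features(glandular_image: np.ndarray, window: int = 5):
    """Extract the six histogram features from a glandular-rate image
    mapped to gray levels 28-228 (levels below 28 are background)."""
    breast = glandular_image[glandular_image >= 28]
    hist = np.bincount(breast.astype(int), minlength=229)[28:229]
    hist = hist / hist.sum()                          # normalize
    kernel = np.ones(window) / window
    hist = np.convolve(hist, kernel, mode="same")     # averaging window

    levels = np.arange(28, 229)
    g_ave = float((levels * hist).sum())              # 1. mean gray level
    g_std = float(np.sqrt(((levels - g_ave) ** 2 * hist).sum()))  # 2. std
    g_min = int(breast.min())                         # 3. minimum level
    peak_idx = int(hist.argmax())
    g_peak = int(levels[peak_idx])                    # 4. peak gray level
    h_peak = float(hist[peak_idx])                    # 5. peak frequency
    e_left = float(hist[:peak_idx].sum())             # 6. energy ratio
    e_right = float(hist[peak_idx + 1:].sum())
    ratio = e_right / e_left if e_left > 0 else np.inf
    return g_ave, g_std, g_min, g_peak, h_peak, ratio
```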
3 Results and Discussion

The white circles in Figure 5 show the comparison between the visual assessment results and the calculation results by BPNN-1 for the glandular rate of the data set. The difference between the visual estimation and the calculated result increased as the glandular rate became higher; the average and maximum were 12.9% and 23.9%, respectively. The black circles in Figure 5 show the comparison between the visual assessment results and the calculation results by BPNN-1 and BPNN-2 for the glandular rate of the 93 samples after the performance test by the round-robin method. The residual sum of squares shows that the accuracy of the calculation by BPNN-1 and BPNN-2 increased. The difference in the distributions shows that the calculated values improved as the glandular rate became higher. In order to estimate the pattern classification reliability of BPNN-2, ROC analysis with maximum-likelihood estimation for continuously-distributed test results [16] was performed, as shown in Fig. 6. As a result, the area A_z under the ROC curve was 0.95 or more for each pattern. Among the patterns, the discrimination capability declined slightly in the order 1, 4, 3 and 2. The combination of BPNN-1 and BPNN-2 reduced the maximum and average gaps to 13.3% and 3.4%, respectively.
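The classification reliability against visual assessment is summarized by the area under the ROC curve. The authors used maximum-likelihood ROC estimation for continuously-distributed data [16]; the sketch below computes only the empirical (non-parametric) area, with made-up scores for a one-vs-rest view of a single pattern:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up example: y_true marks whether a sample's visual pattern is "1",
# y_score is BPNN-2's continuous output used to detect pattern 1.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.35, 0.45, 0.5, 0.1])

print(roc_auc_score(y_true, y_score))  # empirical AUC, ~0.958 here
```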
4 Conclusion

For the classification reliability of the breast densities of mammograms into the four BI-RADS patterns, the glandular rate conversion of mammograms with the breast-equivalent phantom and a neural network was used as the objective method, together with the paired comparison method. The neural network was then tuned by radiologists' and expert mammographers' assessment ability. Our system is not only capable of classifying the breast density of mammograms but can also provide qualitative analysis.
References
1. Klein R, Aichinger H, Dierker J et al., "Determination of average glandular dose with modern mammography units for two large groups of patients", Phys. Med. Biol., Vol.42, 1997, pp.651-671.
2. Boyd N.F, Byng J.W, Jong R.A. et al., "Quantitative Classification of Mammographic Densities and Breast Cancer Risk: Results From the Canadian National Breast Screening Study", J Natl Cancer Inst, Vol.87, 1995, pp.670-675.
3. Powell K.A, Obuchowski N.A, Davors W.J et al., "Quantitative Analysis of Breast Parenchymal Density: Correlation with Women's Age", Acad Radiol, No.6, 1999, pp.742-747.
4. Yaffe MJ, Byng JW, Jong RA et al., "Breast cancer risk and measured mammographic density", European Journal of Cancer Prevention, No.7 (suppl 1), 1998, pp.47-55.
5. Brisson J, Verreault R, Morrison AS et al., "Diet, mammographic features of breast tissue, and breast cancer risk", Am. J. Epidemiol., Vol.130, 1989, pp.14-24.
6. Boyd NF, Greenberg C, Lockwood G et al., "Effects at two years of a low-fat, high-carbohydrate diet on radiologic features of the breast: Results from a randomized trial", J. Natl. Cancer Inst., Vol.89, 1997, pp.488-496.
7. Spicer DV, Ursin G, Parisky YR et al., "Changes in mammographic densities induced by a hormonal contraceptive designed to reduce breast cancer risk", J. Natl. Cancer Inst., Vol.86, 1994, pp.431-436.
8. American College of Radiology, "Breast Imaging Reporting and Data System (BI-RADS), 3rd Ed.", 1998.
9. Thurstone LL, "A law of comparative judgment", Psychol Rev., Vol.34, 1927, pp.273-286.
10. Mosteller F, "Remarks on the method of paired comparisons: I. The least squares solution assuming equal standard deviations and equal correlations", Psychometrika, Vol.16, 1951, pp.3-11.
11. Mosteller F, "Remarks on the method of paired comparisons: II. The effect of an aberrant standard deviation when equal standard deviations and equal correlations are assumed", Psychometrika, Vol.16, 1951, pp.203-206.
12. Mosteller F, "Remarks on the method of paired comparisons: III. A test of significance for paired comparisons when equal standard deviations and equal correlations are assumed", Psychometrika, Vol.16, 1951, pp.207-218.
13. Kalbhen C, McGill JJ, Fendley PM et al., "Mammographic Determination of Breast Volume: Comparing Different Methods", AJR, Vol.173, 1999, pp.1643-1649.
14. Hubbell JH, "Photon Mass Attenuation and Energy-Absorption Coefficients from 1 keV to 20 MeV", International Journal of Applied Radiation and Isotopes, Vol.33, 1982, pp.1269-1290.
15. Hammerstein GR, Miller WD, White RD et al., "Absorbed Radiation Dose in Mammography", Radiology, Vol.130, 1979, pp.485-491.
16. Metz CE, Herman BA, Shen J-H, "Maximum-likelihood estimation of ROC curves from continuously-distributed data", Stat Med, Vol.17, 1998, pp.1033-1053.
X-RAY IMAGE ANALYSIS OF DEFECTS AT BGA FOR MANUFACTURING SYSTEM RELIABILITY

T. SUMIMOTO
Okayama University, Medical School, 2-5-1 Shikata-cho, Okayama, 700-8558, JAPAN

T. MARUYAMA, Y. AZUMA AND S. GOTO
Okayama University, Medical School, 2-5-1 Shikata-cho, Okayama, 700-8558, JAPAN

M. MONDOU AND N. FURUKAWA
Eastern Hiroshima Prefecture Industrial Research Institute, 3-2-39 Higashi Fukatsu-cho, Fukuyama, 721-0974, JAPAN

S. OKADA
National Institute of Advanced Industrial Science and Technology, 1-1-1 Umezono, Tsukuba, 737-0197, JAPAN
This paper deals with the image analysis of defects at BGA for the reliability of PC boards using X-ray imaging. To assure manufacturing reliability, an inspection system for BGA is required in the surface mount process. As solder bridge defects are the ones most commonly found, we pay attention to detecting solder bridges in a production line. The problems of image analysis for the detection of defects at BGA solder joints are the detection accuracy and the image processing time according to the line speed. To obtain design data for the development of an inspection system which can be used easily in the surface mount process, we attempt to measure the shape of BGA based on X-ray imaging.
1 Introduction
With the spread of high density surface mount, Ball Grid Arrays (BGA) and Chip Scale Packages (CSP) are used in PC boards, because they are easily mounted to the surface of PC boards [1], [2]. In a conventional IC package, the lead pins of the IC are set at the outside of the IC package, and the inspection of defects of the solder joints of the lead pins to the PC board has been done visually [3]. However, we cannot inspect the solder joints of BGA directly, because they are hidden under the IC package. In a production line, many companies that produce PC boards with BGA have done the inspection of BGA in the function test of the electric circuits in the final process. The problems of image analysis for the detection of defects of BGA are summarized in the following. One is the detection accuracy; that is, BGA is very small and we must inspect many BGA according to the production line speed. The solder ball
diameter is 0.76 mm and one IC package has three hundred solder balls. The other is the processing speed; that is, huge image data must be analyzed in a real-time manner. To assure reliability in manufacturing IC packages, it is required to detect defects at BGA solder joints in the surface mount process. It is important to develop image analysis techniques for an inspection system in a production line. As the first step of our study, to develop image analysis techniques for the detection of defects at BGA solder joints, we attempt to detect BGA bridges based on X-ray imaging. The types of defects at BGA solder joints are solder bridges (shorts between two balls), missing connections, solder voids, open connections and mis-registration of parts. In the actual production line, we mostly find solder bridges. In order to prevent a bad package from being sent to the next process, it is required to detect solder bridges in the surface mount process. We pay attention to detecting solder bridges in a production line. In this paper, we propose to develop image analysis techniques for the detection of defects at BGA solder joints by X-ray imaging, in order to assure the manufacturing reliability of PC boards.
2 IMAGE DATA ACQUISITION

2.1 Ball Grid Array

BGA is an important technology for utilizing higher pin counts without the attendant handling and processing problems of peripheral leaded packages. They are used in manufacturing PC boards because of their larger ball pitch (1.27 mm), better lead rigidity, and self-alignment characteristics during re-flow processing. In a production line, the PC board comes into the surface mount process. At the first step, solder paste is printed on the circuit; at the next step, BGA with fine pitch are mounted, and the solder joints between the IC package and the surface of the printed circuit are made by the re-flow process. BGA solder joints cannot be inspected and reworked using conventional methods. For Chip Size Packages (CSP), Mondou et al. have proposed to measure the surface structure precisely by using confocal optics before re-flowing [4, 5]. For BGA, the ability to inspect the solder joints visually is desired in a production line to provide confidence in solder joint reliability. In most cases of defects at BGA solder joints, solder bridges between two balls are found in a production line. This defect results from excess solder or misplaced solder, since dirty solder paste stencils are often found in a production line. In manufacturing PC boards, the IC package used with BGA is the CPU for the main function in an electronic circuit. In the actual production line, we can find test IC packages based on the final electrical circuit test. Fig. 1 shows a photograph of one example of a test IC board. The thickness of the PC board is 2 mm and it has six layers. The IC package is mounted with BGA to the surface of the PC board. The solder ball diameter is 0.76 mm, the ball pitch is 1.27 mm and the number of BGA is two hundred and fifty six. The size of the IC package is 27 × 27 mm. This test package did not pass the electrical function test; we consider that this package has defects at BGA solder joints.
Fig. 1 Photograph of one example of test board.
Fig. 2 Apparatus for capturing X-ray images.
2.2 Capture of X-ray Image Data of BGA
We capture X-ray image data by using an X-ray computed tomography (CT) apparatus [6]. This apparatus was made to obtain computed tomography of mechanical parts such as a ball bearing, a cylinder and a battery in order to detect their inner defects; in these parts, the object to be measured is one unit. In this apparatus, the X-ray focus is 4 μm and the resolution is 68 line pairs/cm. The X-ray source and the image detector are fixed, and a test sample is set on a stage; image data are obtained by rotating the stage. We can adjust the image size of the test sample by changing the distance between the X-ray focus and the test sample. X-rays radiated from the focus penetrate the test sample on the stage and reach the detector. The X-ray detection system consists of an image intensifier of 23 cm diameter and a CCD camera of four hundred thousand pixels. The X-ray image is converted to visible light by the image intensifier, and image data are captured by the 2/3 inch CCD camera as an 8-bit gray-level image. It is difficult to capture the X-ray image in one scene, because a solder ball is small and the number of balls amounts to two or three hundred. We tried to change the image size of a solder ball to analyze the characteristics of an abnormal solder ball, but it is impossible to get computed tomography data of each solder ball, because there are many solder balls in one IC package. Therefore, we captured a projection X-ray image of an IC package. We set the test package vertically on the stage and rotated the test table from 0 degrees to ±50 degrees in steps of 10 degrees, as shown in Fig. 2. By rotating the test package, we can take X-ray images with inclined penetration and attempt to detect BGA bridges from different directions. When the angle of inclination is over 50 degrees, we cannot distinguish each solder ball because of overlapping of the images. The conditions for capturing image data are as follows. X-ray tube voltage: 185 kV, X-ray tube current: 160 mA, exposure time: 30 seconds.
3. ANALYSIS OF X-RAY IMAGE DATA
In the actual X-ray image data of PC boards, the image data of each solder ball is very small. Thus, we must process a huge amount of data, and it is very difficult to process directly the
image data of BGA. Therefore, as the first step, we perform image analysis of BGA to obtain preliminary data for the development of an inspection system which can be used easily in a production line. We propose the following image processing techniques. Fig. 3 shows a flowchart of the image processing for the X-ray image data obtained by the above apparatus. Image data are sent to a personal computer for analysis. The image is converted to binary data to detect the contour of a solder ball accurately. The threshold level is determined based on the signal profile along a horizontal line; we selected a gray level of 54 counts as the threshold level and converted the image to black-and-white data to measure the following factors of BGA accurately. After labeling, we first measure the area of each solder ball and its center on the X axis and Y axis. Next we measure the perimeter and the radius ratio of each solder ball. The normal pattern of a solder ball is a circle. If a solder ball has a defect such as a bridge, the shape of the object departs from a true circle. In the case of a solder bridge, two solder balls are shorted by a narrow path and we can observe a different pattern, such as two balls connected by the bridge.
Fig. 3 Flowchart of image analysis (input of image data → conversion to binary data → measurement of characteristics of BGA: area, center, perimeter and radius ratio → calculation of roundness → judgment: R = 1, good BGA; otherwise, bad BGA).
Fig. 4 Original image data.
Fig. 5 Binary image data.
In order to judge whether the solder joints are connected normally to the base pad in the surface mount process or not, we pay attention to the radius ratio and the roundness of a solder ball. The roundness R is calculated by the following equation:

R = L² / (4πS),   (1)

where L is the perimeter of a solder ball and S is the area of a solder ball. If the object is a true circle, the radius ratio and the roundness are equal to 1. As the shape of the object departs from a true circle, the radius ratio and the roundness become larger than 1. The judgment whether a BGA is good or not is determined by the radius ratio and the roundness: if the values of these terms are equal to 1, we judge the BGA to be normal; if they exceed 1, we judge the BGA to be abnormal.
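As an illustration of Eq. (1) and the judgment rule, the following Python sketch labels the solder balls in a binarized X-ray image and computes the two measures per ball. The threshold (54 gray levels) and the decision limits (1.5 and 1.1) come from the text; the polarity of the binarization and the radius-ratio definition (maximum centroid-to-pixel distance over the equivalent-circle radius) are our assumptions, since the paper does not define them explicitly.

```python
import numpy as np
from skimage import measure

def inspect_bga(gray_image, threshold=54):
    """Label solder balls in a binarized 8-bit X-ray image (2-D uint8 array)
    and judge each ball by its radius ratio and roundness R = L^2/(4*pi*S)."""
    binary = gray_image < threshold          # assumes balls image darker than the board
    labels = measure.label(binary)
    results = []
    for region in measure.regionprops(labels):
        S = region.area                      # area S (pixels)
        L = region.perimeter                 # perimeter L (pixels)
        roundness = L ** 2 / (4.0 * np.pi * S)
        # Radius ratio: assumed here as the maximum centroid-to-pixel distance
        # divided by the equivalent-circle radius sqrt(S/pi); equals 1 for a circle.
        cy, cx = region.centroid
        d = np.hypot(region.coords[:, 0] - cy, region.coords[:, 1] - cx)
        radius_ratio = d.max() / np.sqrt(S / np.pi)
        ok = (radius_ratio < 1.5) and (roundness < 1.1)
        results.append((region.label, radius_ratio, roundness,
                        "normal" if ok else "abnormal"))
    return results
```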
4. RESULTS AND DISCUSSION
Fig. 4 shows an example of the original image data captured by the above apparatus with inclined penetration of the X-rays (inclination angle: -10 degrees). In this picture, we can observe one abnormal BGA. We analyzed this image data by the above method. Fig. 5 shows the binary image data after labeling, Fig. 6 shows the radius ratio of each solder ball, and Fig. 7 shows the roundness. In order to analyze the radius ratio and the roundness of a solder ball accurately, we checked the image size of the solder ball and set the diameter of a solder ball to 20 pixels. The roundness is one for a true circle by equation (1); the actual radius ratio and roundness of a solder ball are a little over one, as shown in Table 1. This table shows an example of the result of the image analysis. When the radius ratio is below 1.5 and the roundness is below 1.1, we judge the BGA to be normal; if the two terms exceed these values, we judge the BGA to be abnormal. In this table, we can find one abnormal solder ball, shown as data number 46: the radius ratio is over 1.5 and the roundness is over 1.2, namely 2.01436 and 1.20937 respectively. Therefore, we can warn that this solder ball is abnormal. This abnormal image
data is shown as number 46 in Fig. 5. This test package was inspected in the functional test of the electrical circuit and determined to be an abnormal board. Except for one or two solder balls, we cannot find another abnormal point on this test board. In the X-ray image, we could not clearly find a short between two balls, but we can find a ball having a tail.
Fig. 6 Radius ratio (plotted against object number). Fig. 7 Roundness (plotted against object number).
By rotating this test board, we can find another solder bridge at an inclination angle of minus 43 degrees, as shown in Fig. 8. Fig. 9 shows the binary image data after labeling; the BGA bridge is shown as data number 36. We can observe that the bridge connects two balls. This bridge is observed at inclination angles between minus 43 degrees and minus 50 degrees. In this case, the two balls are labeled as one pattern; therefore, the radius ratio and the roundness take large values, namely 20.71083 and 1.98948 respectively. If this bridge connected the two balls completely, we could observe it at any inclination angle. In this case, it is considered that the two balls each have excess solder under the ball, and we can observe the bridge at penetration angles between minus 43 degrees and minus 50 degrees. It is reasonable to conclude that the radius ratio and the roundness of a solder ball are effective for detecting a solder ball bridge based on X-ray image data. Besides, we can detect defects under the solder ball by changing the inclination angle of the X-rays. In the actual production line, we found some abnormal PC boards based on the functional test of the electrical circuit. Each board has only one or two solder bridges. We
wonder if every joint on every board needs inspection. We would like to inspect everything to provide higher confidence in the reliability of PC boards. However, members of a company that produces PC boards said that they need to inspect every BGA only when the conditions of production are changed; once a process runs well, a manufacturer could inspect just a sample of PC boards.

Table 1 Example of result of image analysis.
No.  Area  Center-X   Center-Y   Perimeter  Radius Ratio  Roundness
41   413   54.92252   231.95157  70.73751   1.25916       1.08375
42   467   474.60599  256.61884  75.19987   1.25633       1.08495
43   418   395.57895  257.65311  70.90079   1.24091       1.07599
44   404   356.74011  258.58911  70.08881   1.22175       1.07048
45   404   317.68069  259.08417  68.86804   1.19597       1.06312
47   396   280.11365  260.53787  68.94898   1.22823       1.07461
48   392   205.96173  262.19644  68.18864   1.15823       1.05730

Fig. 8 Original image data (minus 43 degrees).
Fig. 9 Binary image data (minus 43 degrees).
5. Conclusion
To assure reliability in the manufacturing of PC boards, we have proposed image analysis techniques for the inspection of IC packages having BGA. As the first step of our study, we dealt with an image analysis of the test package, and significant results were obtained as follows. 1) To find a BGA bridge, the radius ratio and the roundness of a solder ball are effective: for a normal solder ball these values are nearly equal to 1, while for an abnormal solder ball it is clear that the radius ratio and the roundness exceed 1. 2) To analyze the radius ratio and the roundness of a solder ball accurately, it is enough to obtain image data with a diameter of 20 pixels for each solder ball.
3) To improve the reliability of detection of defects under the solder ball, it is effective to change the penetration angle of the X-rays. It is concluded that the image analysis based on X-ray image data proposed in this study is an effective method for the detection of BGA bridge defects. To realize an inspection system for BGA in a production line, further studies are needed, such as the construction of a control system of the X-ray focus for covering all BGA in one IC package and an image analysis algorithm matched to the line speed of production.

Acknowledgment
The authors wish to thank Interface Corporation for providing IC test packages and the Western Hiroshima Prefecture Industrial Research Institute for technical support in capturing X-ray image data using the X-ray computed tomography apparatus.

References
1. "X-rays Expose Hidden Connections", Test and Measurement EUROPE, Vol. 8, No. 4, August-September 2001, pp. 8-13. 2. Yasuhiko HARA, "Non-Destructive Inspection Technologies Used for Electronic Components and Packages", Journal of Japan Institute of Electronics Packaging, Vol. 4, No. 6, 2001, pp. 470-474. 3. Toshimitsu HAMADA, Kozo NAKAHATA, Satoru FUSHIMI, Yoshifumi MORIOKA and Takehiko NISHIDA, "Automatic Solder Joint Inspection System by X-ray Imaging", Journal of the Society of Precise Engineering, Vol. 59, No. 1, 1993, pp. 65-71. 4. Munehiro MONDOU et al., "3-D Visual Inspection for Solder Side of Printed Circuit Board (IV): Development of Inspection System for Printed Circuit", Technical report of Eastern Hiroshima Prefecture Industrial Research Institute, No. 9, 1996, pp. 29-32. 5. Munehiro MONDOU, Tomomitsu KUSINO, Katsuhisa HIROKAWA and Noboru FURUKAWA, "Three-Dimensional Measurement for LSI Package Surface III", Technical report of Eastern Hiroshima Prefecture Industrial Research Institute, No. 14, 2001, pp. 13-16. 6. X-ray CT inspection apparatus,
http://www.seisan-ac.kure.hiroshima.jp/
ANALYSIS OF MARGINAL COUNT FAILURE DATA WITH DISCARDING INFORMATION BASED ON LFP MODEL
KAZUYUKI SUZUKI AND LIANHUA WANG Department of Systems Engineering, University of Electro-Communications, Chofugaoka 1-5-1, Chofu-city, Tokyo 182-8585, Japan E-mail: [email protected], [email protected] This paper discusses the problem of parametric estimation of the failure time distribution from marginal count failure data for product populations where failures are observed only from the units with defects before they have been discarded and no failure occurs in the nondefective units. Assuming that failure times follow a Weibull distribution, we propose a likelihood-based method for estimating the parameters of the failure time distribution and p, the proportion of defective products in the population. The estimation algorithm is described through an application of this method to an actual data set.
1. Introduction
Field failure data is one of the most important data sources for evaluating and predicting product reliability. The analysis of field failure data has been dealt with in many studies. Compared with experimental data obtained from the laboratory, field failure data is often incomplete. Manufacturers collect the field performance data of products from various different sources. It is usually difficult to have detailed information on each product unit in the field, and data are available only in some aggregated forms. Marginal count failure data is one typical type of such incomplete data, discussed in Karim, Yamamoto, and Suzuki (2001). For convenience, we quote the monthly counted failure data from that paper here. Let N_s be the number of products sold in the s-th month, s = 1, ..., a; let r_{st} be the number of products sold in the s-th month which failed after t months, t = 1, ..., b - s + 1, where b is the number of months in the observation period; and let r_j be the count of failures observed in the j-th month, r_j = Σ_{s=1}^{min{a,j}} r_{s,j-s+1}. We note that a ≤ b. Table 1 illustrates the structure of the data. Here, r_j is called the marginal count data and r_{st} the complete data. Besides, as pointed out by Meeker (1987), some product populations contain a mixture of manufacturing-defective and nondefective units. Supposing that failures are observed only from the units with defects, and no failure occurs in the nondefective units, the limited failure population (LFP) model is suggested for such populations. Let p denote the proportion of the defective units in the population.
Table 1. Marginal count failure data structure.

s    N_s    1        2        3        ...   a        a+1        ...   b
1    N_1    r_{1,1}  r_{1,2}  r_{1,3}  ...   r_{1,a}  r_{1,a+1}  ...   r_{1,b}
2    N_2             r_{2,1}  r_{2,2}  ...                       ...   r_{2,b-1}
...
a    N_a                                     r_{a,1}  r_{a,2}    ...   r_{a,b-a+1}

(The marginal count r_j is the sum of the entries in column j.)
On the other hand, the failure observation may be censored when a unit is discarded before it fails. Suzuki et al. (2002) indicate that discard information needs to be considered when analyzing field failure data in many situations, because a product unit may be discarded simply because it has become outmoded. Obviously, the number of discarded units (irrespective of whether a unit has experienced a failure or not) in each month is hardly ever reported to the manufacturer. But many manufacturers can provide the curve of the discard time distribution of their products, which is ascertained or estimated from other data sources, e.g., data obtained from questionnaire surveys done by the manufacturer. Let G_d(t) denote the discard time distribution, and let Ḡ_d(t) = 1 - G_d(t), which is assumed to be known. In this paper, assuming that the failure time follows a Weibull distribution, we propose a parametric method to estimate the failure time distribution from marginal count failure data for product populations where failures are observed only from the units with defects before they have been discarded and no failure occurs in the nondefective units.

2. Model Description
This research is mainly motivated by an actual data set (see Tables 2 and 3), which was provided by a consumer electronics company. According to the failure reports, it seemed that some defective product units had entered into service. We want to estimate the proportion of defective product units among all the units sold, and the number of failures occurring in the next period. For a product unit which has experienced a failure, we assume that no failure in the same failure mode occurs after the defective part which caused the failure is removed from the product unit; that is, only the first failure is considered in this paper. Actually, when a failure is known to be caused by some defect, the manufacturer usually provides a free replacement of the failed unit by a nondefective one. In such a case, we suppose that a product is discarded not because of a failure but because it has become outmoded or for other reasons. Let X be the failure time of a product unit with defects, and Y the discard time, which is assumed to be independent of the failure time. The distribution functions of the failure time and discard time are denoted by F(t) and G_d(t), respectively. We describe the failure observations based on a multinomial model. The failure probability at different time points is assumed to be determined by a
Weibull distribution with shape parameter m and scale parameter η. Without loss of generality, the observation time points are denoted by 1, 2, 3, .... Let f(t; m, η) = F(t; m, η) - F(t-1; m, η) be the failure probability at age t for a product unit with defects, where F(t; m, η) = 1 - exp(-(t/η)^m) is the Weibull distribution function. Noting that p is the proportion of defective products in the population, the failure probability at age t for a product unit in the population is p f(t; m, η) based on the LFP model. Failures are observed at age t only from the defective units that have not yet been discarded before age t. Since the probability that a unit has not been discarded before age t is Ḡ_d(t-1), the probability that a failure is observed at age t from the population can be represented by p Ḡ_d(t-1) f(t; m, η) based on the competing risks model. As shown in Table 1, the s-th population size is N_s, s = 1, ..., a. Thus, we see that r_{s1}, ..., r_{s,b-s+1} follow the multinomial distribution

Pr{r_{s1}, ..., r_{s,b-s+1}} = [ N_s! / ( (N_s - n_s)! Π_{t=1}^{b-s+1} r_{st}! ) ] Π_{t=1}^{b-s+1} [ p Ḡ_d(t-1) f(t; m, η) ]^{r_{st}} [ 1 - Σ_{t=1}^{b-s+1} p Ḡ_d(t-1) f(t; m, η) ]^{N_s - n_s},   (1)

s = 1, ..., a, where n_s = Σ_{t=1}^{b-s+1} r_{st}. The distribution function of the marginal failure counts r_j, however, cannot be expressed in a concise form (see Johnson, Kotz, and Balakrishnan, 1997). Based on the marginal observations shown in Table 1, it is difficult to construct the likelihood function directly. We give the conditional log-likelihood function of the complete data given the marginal count data in the following:

l = l(m, η, p; {r_{st}} | {r_j}) = Σ_{s=1}^{a} Σ_{t=1}^{b-s+1} r_{st} log( p Ḡ_d(t-1) f(t; m, η) ),

with constraints Σ_{s=1}^{min{a,j}} r_{s,j-s+1} = r_j, j = 1, 2, ..., b. Note that some constant terms are omitted.
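For reference, the observation probability p Ḡ_d(t-1) f(t; m, η) used above is simple to compute; a minimal sketch follows, in which the function names are ours and Gd_bar is assumed to be a callable survival function (e.g., interpolated from Table 3).

```python
import numpy as np

def weibull_cdf(t, m, eta):
    """F(t; m, eta) = 1 - exp(-(t/eta)^m), with F(t) = 0 for t <= 0."""
    t = np.asarray(t, dtype=float)
    return np.where(t > 0, 1.0 - np.exp(-(np.maximum(t, 0.0) / eta) ** m), 0.0)

def failure_obs_prob(t, m, eta, p, Gd_bar):
    """p * Gd_bar(t-1) * f(t; m, eta): probability that a failure is observed
    at age t from the population (LFP model plus competing risks)."""
    f_t = weibull_cdf(t, m, eta) - weibull_cdf(t - 1, m, eta)
    return p * Gd_bar(t - 1) * f_t
```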
3. Estimation Algorithm
The standard procedure for a likelihood-based method cannot be applied to the above conditional log-likelihood function. In this section, the EM algorithm (Dempster, Laird and Rubin, 1977) is used to find the MLE of the three parameters of interest: m, η, and p. For the model proposed in this paper, the EM algorithm is defined as follows. Suppose that θ^(k) = {m^(k), η^(k), p^(k)}, the current estimates of the three parameters, have been obtained at the k-th iteration. The next estimates are calculated through the E-step and the M-step at the (k+1)-th iteration. In the E-step, first, the expected values of the r_{st} are calculated using the following equation, given the
marginal data r_j, j = 1, ..., b, and the current estimates θ^(k):

r̂_{s,j-s+1} = r_j [ N_s Ḡ_d(j-s) f(j-s+1; m^(k), η^(k)) ] / [ Σ_{s'=1}^{min{a,j}} N_{s'} Ḡ_d(j-s') f(j-s'+1; m^(k), η^(k)) ],   (2)

s = 1, ..., min{a, j}; then the conditional expected log-likelihood can be constructed based on the above expected values of the r_{st}. In the M-step, the new estimates are found by maximizing the expected log-likelihood given in the E-step. The standard procedure for finding the maximum likelihood estimator (MLE) from a likelihood function can be used for this step. When the r_{st} are given, we have the first and second derivatives of the likelihood with respect to the three parameters:
where

n_s = Σ_{t=1}^{b-s+1} r_{st},  s = 1, ..., a;

f(t) = f(t; m, η) = F(t; m, η) - F(t-1; m, η);

F(t; m, η) = 1 - exp(-(t/η)^m);

F_m(t; m, η) = ∂F(t; m, η)/∂m = exp(-(t/η)^m) (t/η)^m log(t/η);

F_η(t; m, η) = ∂F(t; m, η)/∂η = -exp(-(t/η)^m) (m/η) (t/η)^m;

F_mm(t; m, η) = ∂²F(t; m, η)/∂m² = exp(-(t/η)^m) (t/η)^m (log(t/η))² (1 - (t/η)^m);

F_ηη(t; m, η) = ∂²F(t; m, η)/∂η² = exp(-(t/η)^m) (m/η²) (t/η)^m ( (m+1) - m(t/η)^m );

F_mη(t; m, η) = ∂²F(t; m, η)/∂m∂η = -exp(-(t/η)^m) (1/η) (t/η)^m ( 1 + m log(t/η) (1 - (t/η)^m) ).

For calculating the MLE, the following transformations are used to remove the restrictions on the parameter space: θ_1 = log m; θ_2 = log η; θ_3 = log(p/(1-p)). We use the following score equations:
∂l/∂θ_1 = (∂l/∂m)(m̂, η̂, p̂) × m̂ = 0,

∂l/∂θ_2 = (∂l/∂η)(m̂, η̂, p̂) × η̂ = 0,   (3)

∂l/∂θ_3 = (∂l/∂p)(m̂, η̂, p̂) × p̂(1 - p̂) = 0,

where θ̂_i is the MLE of θ_i, i = 1, 2, 3, and then m̂ = exp(θ̂_1), η̂ = exp(θ̂_2), and p̂ = exp(θ̂_3)/(1 + exp(θ̂_3)) are the MLE of m, η, and p, respectively, because of the invariance of the MLE. From the above equations, we can calculate the MLE m̂, η̂, and p̂ based on
the Newton-Raphson method. The asymptotic covariance matrix of m̂, η̂, and p̂ can be calculated from the following information matrix (Louis, 1982), evaluated at the MLE:

I(θ̂) = E[ -∂²l/∂θ∂θᵀ | {r_j} ] - E[ (∂l/∂θ)(∂l/∂θ)ᵀ | {r_j} ],   (4)

where θ = (θ_1, θ_2, θ_3)ᵀ.
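The following Python sketch outlines the structure of the algorithm: the E-step allocation of Eq. (2) followed by an M-step. As a simplification we maximize the expected log-likelihood with a Nelder-Mead search from SciPy instead of the Newton-Raphson scheme of Eqs. (3)-(4), so this is a minimal illustration under stated assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy.optimize import minimize

def weibull_F(t, m, eta):
    t = np.asarray(t, dtype=float)
    return np.where(t > 0, 1.0 - np.exp(-(np.maximum(t, 0.0) / eta) ** m), 0.0)

def em_lfp(r, N, Gd_bar, theta0=(1.0, 5.0, 0.5), n_iter=100):
    """EM sketch for the LFP model with discard information.
    r[j-1]: marginal failure count in month j (j = 1..b); N[s-1]: sales in
    month s (s = 1..a); Gd_bar(t): survival function of the discard time."""
    a, b = len(N), len(r)
    m, eta, p = theta0
    for _ in range(n_iter):
        t = np.arange(1, b + 1)
        f = weibull_F(t, m, eta) - weibull_F(t - 1, m, eta)
        # E-step: allocate r_j over cohorts s = 1..min(a, j) in proportion
        # to N_s * Gd_bar(j-s) * f(j-s+1), cf. Eq. (2).
        r_hat = np.zeros((a, b))                      # r_hat[s-1, age-1]
        for j in range(1, b + 1):
            ss = np.arange(1, min(a, j) + 1)
            w = np.array([N[s - 1] * Gd_bar(j - s) * f[j - s] for s in ss])
            if w.sum() > 0:
                r_hat[ss - 1, j - ss] = r[j - 1] * w / w.sum()
        # M-step: maximize the expected complete-data log-likelihood, Eq. (1).
        def nloglik(th):
            m_, eta_, p_ = th
            if m_ <= 0 or eta_ <= 0 or not (0.0 < p_ < 1.0):
                return np.inf
            ll = 0.0
            for s in range(1, a + 1):
                ages = np.arange(1, b - s + 2)
                fi = weibull_F(ages, m_, eta_) - weibull_F(ages - 1, m_, eta_)
                pi = p_ * np.array([Gd_bar(tt - 1) for tt in ages]) * fi
                rs = r_hat[s - 1, :len(ages)]
                ll += np.sum(rs * np.log(np.maximum(pi, 1e-300)))
                ll += (N[s - 1] - rs.sum()) * np.log(max(1.0 - pi.sum(), 1e-300))
            return -ll
        m, eta, p = minimize(nloglik, [m, eta, p], method="Nelder-Mead").x
    return m, eta, p
```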
4. Analysis of an Actual Data Set

Table 2 shows an example of the marginal count failure data; it is an actual data set provided by a manufacturer.
Table 2. Marginal count failure data (actual data set).

Calendar year        1        2        3       4      5      6       7       8
Number of sales      512,120  211,790  12,020  5,000
Number of failures   605      990      536     647    7,406  36,880  65,275  60,213
The percentage of product units remaining in service at age t (shown in Table 3), that is, the survival function of the discard time, Ḡ_d(t), was estimated by the manufacturer based on survey data for product replacement.

Table 3. Percentage of product units remaining in service by age in years.
t        0    1     2     3     4     5     6     7     8     9     10    11    12    13    14
Ḡ_d(t)   100  99.9  99.4  98.2  95.7  91.1  84.0  74.5  63.2  51.0  39.1  28.4  19.6  12.8  8.0
The results of the parameter estimation, obtained according to the algorithm described in the above section, are shown in Table 4: the MLE was calculated using (2) and (3), and the asymptotic variance using (4).
Table 4. Maximum likelihood estimates and asymptotic variances.

      MLE       Avar
m̂     6.8405    3.262
η̂     7.1668    0.2874
p̂     0.35564   0.01144
Using the values of m̂, η̂, and p̂, the estimates of the number of failures were calculated and are given in Table 5. Figure 1 shows the estimated number of failures in each year in contrast to the number of failures that actually occurred. We see from the results that the proposed method is an effective approach to the analysis of such failure data sets.
Table 5. Estimated number of failures by sales cohort s and calendar year.

s   N_s               1     2      3      4       5        6        7        8
1   512,120           0.3   29.1   438.6  2818.9  11038.2  29027.5  48419.9  41668.6
2   211,790                 0.1    12.0   181.4   1165.8   4564.9   12004.5  20024.3
3   12,020                         0.0    0.7     10.3     66.2     259.1    681.3
4   5,000                                 0.0     0.3      4.3      27.5     107.8
Expected number       0.3   29.2   450.7  3001.0  12214.6  33662.9  60710.9  62481.9
Actual number         605   990    536    647     7,406    36,880   65,274   60,213

s   N_s               9        10      11      12     13   14
1   512,120           12792.2  798.6   4.1     0.0    0.0  0.0
2   211,790           17232.3  5290.3  330.3   1.7    0.0  0.0
3   12,020            1136.5   978.0   300.2   18.7   0.1  0.0
4   5,000             283.4    472.7   406.8   124.9  7.8  0.0
Expected number       31444.3  7539.6  1041.4  145.3  7.9  0.0
Figure 1. Number of failures: estimated vs. actually occurred, plotted against calendar year.
5. Conclusions
This paper was mainly motivated by an actual marginal failure data set provided by a consumer electronics company. The population of the products is supposed to be a mixture of defective and nondefective units, and no failure occurs among the nondefective product units. Assuming that failure times follow a Weibull distribution, we obtained the maximum likelihood estimates of the shape and scale parameters of the distribution and of the proportion of defective units in the population. Furthermore, the estimates of the number of failures were calculated by using the estimates of the parameters. We used them to compare with the number of failures that actually occurred and to give a prediction of the number of failures in the future. The results show that the proposed method is useful and applicable to more complicated data sets with more realistic limitations which we could not deal with before. For the proposed model, we will further investigate its properties by conducting simulation experiments and clarify its performance in different cases by applying it to other actual data sets.
References
1. A. P. Dempster, N. M. Laird and D. B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion), Journal of the Royal Statistical Society Ser. B, 39, 1-38 (1977).
2. N. L. Johnson, S. Kotz, and N. Balakrishnan, Discrete Multivariate Distributions, John Wiley & Sons, New York (1997).
3. M. R. Karim, W. Yamamoto and K. Suzuki, Statistical Analysis of Marginal Count Failure Data, Lifetime Data Analysis 7, 173-186 (2001).
4. T. A. Louis, Finding the Observed Information Matrix When Using the EM Algorithm, Journal of the Royal Statistical Society Ser. B, 44, 226-233 (1982).
5. W. Q. Meeker, Limited Failure Population Life Tests: Application to Integrated Circuit Reliability, Technometrics 29, 51-65 (1987).
6. K. Suzuki, L. Wang, W. Yamamoto and K. Kaneko, Field Failure Data Analysis with Discard Rate, MMR2002: Third International Conference on Mathematical Methods in Reliability Methodology and Practice, 619-626, Trondheim, Norway (2002).
ON A MARKOVIAN DETERIORATING SYSTEM WITH UNCERTAIN REPAIR AND REPLACEMENT
N. TAMURA Department of Industrial and Systems Engineering, Faculty of Science and Engineering, Chuo University, 1-13-27 Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan E-mail: [email protected] This paper considers a system whose deterioration is described as a discrete-time Markov chain. After each inspection, one of three actions can be taken: operation, repair or replacement. We assume that the result of repair is uncertain. If repair is taken, we decide whether to inspect the system or not. When inspection is performed, we select an optimal action. We study an optimal maintenance policy which minimizes the expected total discounted cost over an unbounded horizon. It is shown that, under reasonable conditions on the system's transition and repair laws and the cost structures, a control limit policy is optimal. Furthermore, we derive properties useful for finding the optimal maintenance policy numerically.
1. Introduction
Various maintenance policies for stochastically failing systems have been widely investigated in the literature. The papers by Pierskalla and Voelker1, Sherif and Smith2, Valdez-Flores and Feldman3 and Cho4 are excellent reviews of the area. For multi-state systems, most works model the deterioration process of a stochastically failing system by a Markov process in order to derive an optimal maintenance policy, because of the tractability of the resulting mathematical problems. Ohnishi et al.5 and Lam and Yeh6 studied optimal maintenance policies for continuous-time Markovian deteriorating systems and showed that a control limit rule is optimal under reasonable conditions. These studies considered replacement as the only maintenance action. In real situations, however, replacement is not the only maintenance action possible. So, various models for systems with imperfect repair have been suggested and studied by Lam7, Kijima8, and Kijima and Nakagawa9,10. Pham and Wang11 provided a survey of recent studies. Chiang and Yuan12 studied a continuous-time Markovian deteriorating system with uncertain repair and replacement. Because of the complexity of the model, however, it is not shown that a control limit policy holds. For the discrete-time case, Derman13 considered a Markovian deteriorating system
where replacement is the only maintenance action possible, and established sufficient conditions on the transition probabilities and the cost functions under which the optimal maintenance policy has a control limit rule. Douer and Yechiali14 introduced the idea of a general degree of repair, which is an action from any state to any better state at any time of inspection, and showed that, under reasonable conditions, a control limit policy holds. They also proposed a model where the result of repair is uncertain. However, Douer and Yechiali14 assume that one always operates the system until the next inspection time after repair is completed. If repair is not sufficient, i.e., the system is repaired to a worse state than the current state, then it is not adequate to operate it. In this paper, we consider a discrete-time Markovian deteriorating system with uncertain repair and replacement. After each inspection, one of three actions can be taken: operation, repair or replacement. If repair is taken, then we decide whether to inspect the system or not. When inspection is performed, we select an optimal action. We formulate the model as a Markov decision process and examine the properties of an optimal maintenance policy which minimizes the expected total discounted cost over an unbounded horizon. The structure of the paper is as follows. In the next section, the maintenance model is described in detail. In section 3, the mathematical formulation of the problem is given. In section 4, we investigate properties of the optimal maintenance policy. Finally, some conclusions are drawn in section 5.

2. Model Description

Consider a system (a unit, a component of a system, a piece of operating equipment, etc.) which is inspected at equally spaced points in time. After each inspection, the system can be classified into one of N + 1 states, 0, ..., N. Then inspection cost d_1 is incurred. State 0 represents the process before any deterioration takes place; that is, it is the initial new state of the system, whereas state N represents a failure state of the system. The intermediate states 1, ..., N - 1 are ordered to reflect their relative degree of deterioration (in ascending order). Through inspection, the true state is identified with certainty. Let the times of inspection be t = 0, 1, ..., and let X_t be the observed state of the system at time t. We assume that {X_t; t = 0, 1, ...} is a finite-state Markov chain with stationary transition probabilities

p_ij = P{X_{t+1} = j | X_t = i},   (1)

for all i, j and t. Denote by p_ij^(n) the n-step transition probability from state i to state j. We suppose that, for each i = 0, ..., N, p_iN^(n) > 0 for some n. This condition assures that the system eventually reaches the failure state regardless of its initial state. When the system state is identified through inspection, one of the following actions can be taken.
(1) Action 1: We continue to operate the system until the next time.
(2) Action 2: We repair the system and select one of the following actions.
(a) Action 2a: We continue to operate the system until the next time without inspection.
(b) Action 2b: We identify the state of the system by inspection and select the optimal action k (k = 1, 2 or 3) at the next time.
(3) Action 3: We replace the system with a new and identical one and operate it until the next time.

It is assumed that the result of repair is uncertain. So, we cannot know the true state of the system immediately after repair without inspection. Let q_ij be the probability that the system in state i is repaired to state j. We call q_ij the repair probability. For the transition and repair probabilities, we impose the following conditions. In this paper, the term "increasing" means "nondecreasing."

Condition 2.1. For any h, the function F_h(i) = Σ_{j=h}^{N} p_ij is increasing in i.

Condition 2.2. For any h, the function Σ_{j=h}^{N} q_ij is increasing in i.

Condition 2.1 means that as the system deteriorates, it is more likely to make a transition to worse states. This condition is also called the condition of increasing failure rate (IFR) of the system. Condition 2.2 implies that as the system deteriorates, it is less likely to be repaired to better states. From Derman13, conditions 2.1 and 2.2 are equivalent to the following conditions, respectively.

Condition 2.3. For any increasing function u(i), i = 0, 1, ..., N, the function Σ_{j=0}^{N} p_ij u(j) is also increasing in i.

Condition 2.4. For any increasing function u(i), i = 0, 1, ..., N, the function Σ_{j=0}^{N} q_ij u(j) is also increasing in i.
Furthermore, we impose the following conditions on p_ij and q_ij.
Condition 2.5. For any h, Σ_{j=h}^{N} ( p_ij - Σ_{k=0}^{N} q_ik p_kj ) is increasing in i.

Condition 2.6. For any h, Σ_{j=h}^{N} ( p_ij - q_ij ) is increasing in i.

Condition 2.5 indicates that as the system deteriorates, the system which is operated until the next time is more likely to move to worse states in comparison with the system which is repaired and then operated until the next time. Condition 2.6 indicates that as the system deteriorates, the system which is operated until the next time is more likely to move to worse states in comparison with the system which is repaired. Since the result of repair is uncertain, it is necessary to perform inspection in order to identify the state of the system. So, when repair is performed, we decide whether to operate the system without inspection or to select an optimal action with inspection. Then inspection cost d_2 (≠ d_1) is incurred. For example, in the case of a production process which produces items, the state of the process may be determined by sampling the items produced. On the other hand, when repair is performed, we need to grasp the process state by other methods, since no item is produced. Hence, we consider that the inspection cost after operation is not equal to that immediately after repair. When we select action 1 for the system in state i, the system moves to state j with probability p_ij at the next time and operating cost u_i is incurred. When we select action 2 for the system in state i, the system is repaired to state j with probability q_ij and repair cost r_i is incurred. Thereafter, we decide whether to inspect the system or not: if inspection is performed, the system state is identified and we select an optimal action at the next time; otherwise, we operate the system until the next time. When we select action 3 for the system in state i, we replace the system with a new and identical one and operate it until the next time. Then replacement cost c_i is incurred. For these costs, we introduce the following conditions.

Condition 2.7. u_i, r_i, c_i are increasing in i.

Condition 2.7 implies that as the system deteriorates, it is more costly to operate, repair or replace the system.

Condition 2.8. u_i - c_i and the analogous cost differences appearing in Eqs. (2)-(5) are increasing in i.

Condition 2.8 means that as the system deteriorates, in Eqs. (2), (3) and (4), the merit of replacement or repair becomes bigger than that of operation, and, in Eq. (5), the merit of replacement becomes bigger than that of repair.

3. Mathematical Formulation
Our objective here is to derive an optimal maintenance policy that minimizes the total expected discounted cost over an unbounded horizon. Let V(i) be the total expected discounted cost over an unbounded horizon when the system starts in state i and an optimal maintenance policy is employed. We denote the discount factor by β (0 < β < 1). Furthermore, we let H_k(i) denote the total discounted cost when the system starts in state i and action k is selected. Then H_1(i) is given by

H_1(i) = u_i + β Σ_{j=0}^{N} p_ij V(j).

Therefore, we have

V(i) = min{ H_1(i), H_2(i), H_3(i) }.
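This optimality equation lends itself to standard value iteration. The sketch below is a minimal illustration under our reading of the verbal description in Sec. 2: the forms of H_2 (with the inspect-or-not sub-decision) and H_3, the treatment of d_1, and all function names are our assumptions, not the paper's exact equations.

```python
import numpy as np

def value_iteration(P, Q, u, r, c, d2, beta=0.9, tol=1e-9):
    """Value iteration for the Sec. 3 model (sketch).  P[i, j] = p_ij,
    Q[i, j] = q_ij; u, r, c are the operating / repair / replacement cost
    vectors.  The inspection cost d1, incurred under every action, is
    omitted since it shifts all H_k equally."""
    V = np.zeros(len(u))
    while True:
        H1 = u + beta * (P @ V)                 # action 1: operate
        H2a = Q @ (u + beta * (P @ V))          # 2a: operate after repair, no inspection
        H2b = d2 + beta * (Q @ V)               # 2b: inspect after repair, then act optimally
        H2 = r + np.minimum(H2a, H2b)
        H3 = c + u[0] + beta * (P[0] @ V)       # action 3: replace by a new system (state 0)
        V_new = np.minimum(np.minimum(H1, H2), H3)
        if np.max(np.abs(V_new - V)) < tol:
            policy = 1 + np.argmin(np.vstack([H1, H2, H3]), axis=0)
            return V_new, policy                # policy[i] in {1, 2, 3}
        V = V_new
```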
4. Properties of Optimal Maintenance Policy

In this section, we examine some structural properties of an optimal maintenance policy. When the operation horizon is finite (T periods, say), denote by V^T(i) the minimal total expected discounted cost when the system starts in state i. First, the following lemma is derived.
Lemma 4.1. For any T, V^T(i) is increasing in i.

Since lim_{T→∞} V^T(i) = V(i), we obtain Theorem 4.1.
Theorem 4.1. V(i) is increasing in i.

Theorem 4.1 implies that the expected total discounted cost is smaller if the system begins in a better state; the result is intuitively true. We denote by D(i) an optimal action when the system is in state i. Using the above theorem, we can derive structural properties of an optimal maintenance policy.
Theorem 4.2. There exist states k and k̄ such that

D(i) = 1 for 0 ≤ i < k,  2 for k ≤ i < k̄,  3 for k̄ ≤ i ≤ N,

where 0 ≤ k ≤ k̄ ≤ N + 1.
Furthermore, we impose the following condition.
Condition 4.1. For any h, Σ_{j=h}^{N} ( Σ_{k=0}^{N} q_ik p_kj - q_ij ) is increasing in i.

This condition indicates that as the system deteriorates, the system which is repaired and operated until the next time is more likely to move to worse states in comparison with the system which is repaired. Then we obtain Theorem 4.3.
Theorem 4.3. There exists an optimal maintenance policy of the form

D(i) = 1 for 0 ≤ i < k_a,  2a for k_a ≤ i < k_b,  2b for k_b ≤ i < k̄,  3 for k̄ ≤ i ≤ N,

where 0 ≤ k_a ≤ k_b ≤ k̄ ≤ N + 1.
This theorem states that since the system is less likely to be repaired to better states as it deteriorates, we should inspect the system and select an optimal action while the system stays in some worse states. Intuitively, when d_2 is not so large, it is natural to expect that we should inspect the system immediately after repair regardless of its state. The following theorem states that this interpretation is true.
Theorem 4.4. If for any i,

Σ_{l=0}^{i} q_il = 1,   (13)

and d_2 is sufficiently small, then there exists an optimal maintenance policy of the form

D(i) = 1 for 0 ≤ i < k_b,  2b for k_b ≤ i < k̄,  3 for k̄ ≤ i ≤ N,

where 0 ≤ k_b ≤ k̄ ≤ N + 1.

Eq. (13) indicates that the system is repaired to a state better than the current one without fail.
5. Conclusion

We have considered a discrete-time Markovian deteriorating system with uncertain repair. If repair is performed, then we decide whether to inspect the system or not. When inspection is taken, the system state is identified and an optimal action is selected at the next time. We examined properties of an optimal maintenance policy minimizing the total expected discounted cost. We derived sufficient conditions under which a control limit policy holds and the optimal maintenance policy is characterized by four regions. It was also shown that, under some conditions, the optimal maintenance policy may be characterized by three regions.
References
1. W.P. Pierskalla and J.A. Voelker, A survey of maintenance models: the control and surveillance of deteriorating systems, Naval Research Logistics Quarterly, 23, 353-388 (1976).
2. Y.S. Sherif and M.L. Smith, Optimal maintenance models for systems subject to failure - a review, Naval Research Logistics Quarterly, 28, 47-74 (1981).
3. C. Valdez-Flores and R.M. Feldman, A survey of preventive maintenance models for stochastically deteriorating single-unit systems, Naval Research Logistics, 36, 419-446 (1989).
4. D.I. Cho, A survey of maintenance models for multi-unit systems, European Journal of Operational Research, 51, 1-23 (1991).
5. M. Ohnishi, H. Kawai and H. Mine, An optimal inspection and replacement policy for a deteriorating system, Journal of Applied Probability, 23, 973-988 (1986).
6. C.T. Lam and R.H. Yeh, Optimal maintenance policies for deteriorating systems under various maintenance strategies, IEEE Transactions on Reliability, 43, 423-430 (1994).
7. Y. Lam, Geometric processes and replacement problem, Acta Mathematicae Applicatae Sinica, 4, 366-377 (1988).
8. M. Kijima, Some results for repairable systems with general repair, Journal of Applied Probability, 26, 89-102 (1989).
9. M. Kijima and T. Nakagawa, A cumulative damage shock model with imperfect preventive maintenance, Naval Research Logistics, 38, 145-156 (1991).
10. M. Kijima and T. Nakagawa, Replacement policies of a shock model with imperfect preventive maintenance, European Journal of Operational Research, 57, 100-110 (1992).
11. H. Pham and H. Wang, Imperfect repair, European Journal of Operational Research, 94, 425-438 (1996).
12. J.H. Chiang and J. Yuan, Optimal maintenance policy for a Markovian system under periodic inspection, Reliability Engineering and System Safety, 71, 165-172 (2001).
13. C. Derman, On optimal replacement rules when changes of states are Markovian, in: Mathematical Optimization Techniques (R. Bellman, Ed.), The RAND Corporation, 201-210 (1963).
14. N. Douer and U. Yechiali, Optimal repair and replacement in Markovian systems, Communications in Statistics - Stochastic Models, 10, 253-270 (1994).
SOFTWARE RELIABILITY MODELING FOR INTEGRATION TESTING IN DISTRIBUTED DEVELOPMENT ENVIRONMENT
YOSHINOBU TAMURA Department of Information Systems, Faculty of Environmental and Information Studies, Tottori University of Environmental Studies, Kita 1-1-1, Wakabadai, Tottori-shi 689-1111, Japan E-mail: [email protected]
SHIGERU YAMADA Department of Social Systems Engineering, Faculty of Engineering, Tottori University, Minami 4-101, Koyama, Tottori-shi 680-8552, Japan E-mail: [email protected]
MITSUHIRO KIMURA Department of Industrial and Systems Engineering, Faculty of Engineering, Hosei University, 3-7-2, Kajino, Koganei-shi, Tokyo 184-8584, Japan E-mail: [email protected] In new software development paradigms such as client/server systems and distributed development using network computing technologies, it has become difficult to assess software reliability in recent years, because the complexity of software systems has been increasing as a result of distributed system development. In this paper, we propose a software reliability growth model based on stochastic differential equations for the integration testing phase in a distributed development environment.
1. Introduction
A computer software system is developed by human work; therefore many software faults may be introduced into the system during the development process, and these software faults often cause complicated breakdowns of computer systems. Recently, it has become more difficult for developers to produce highly reliable software systems efficiently because of the diversified and complicated software requirements. Therefore, it is necessary to control the software development process in terms of quality and reliability. Many software systems have been produced under a host-concentrated development environment. In such a host-concentrated environment, even the progress of software
development tools has caused several issues. For instance, one issue is that all software development management has to be suspended when the host computer is down. From the late 1980s, personal computers have spread into our daily life in place of conventional mainframe machines, because the price and performance of personal computers have been extremely improved. Hence, the computer systems which aid software development have also been changing to UNIX workstations or personal computers to reduce the cost of development. The Client/Server System (CSS), which is a new development method, has come into existence as a result of the progress of networking technology based on UNIX systems. On the other hand, only a few effective testing methods for the distributed development environment have been presented1,2. Basically, software reliability can be evaluated by the number of detected faults or the software failure-occurrence time in the testing phase, which is the last phase of the development process, and it can also be estimated for the operational phase. A software failure is defined as an unacceptable departure of program operation caused by a software fault remaining in the software system. In particular, software reliability models which describe the software fault-detection or software failure-occurrence phenomena in the testing phase are called software reliability growth models (SRGM's). The SRGM's are very useful for assessing reliability for the quality control and testing-process control of software development.
2. Testing in Distributed Development Environment

We discuss characteristics of the integration testing and the system testing phases in a distributed development environment.

2.1. Characteristics of the Integration Testing

The main characteristics of the integration testing in a distributed development environment are as follows:

- The confirmation of link connection for interface, file, and database based on the defined specifications is performed.
- The integration testing is executed selectively for software functions.
- The interlock processing of software between server and client is confirmed.
- It is generally located between the module testing and the system testing.
- The validity, operationality, performance, and capability of software functions are confirmed.
2.2. Characteristics of the System Testing

The main characteristics of the system testing in a distributed development environment are as follows:

- The implementation of the whole set of functions in the software system is confirmed.
- It is the final stage to verify whether the reliability requirement of a software system is satisfied.
- The defined specifications in the software design are verified.
- It is selectively tested for the effects of actual operations.
3. Software Reliability Modeling for Distributed Development Environment

3.1. Modeling for Module Testing

Many SRGM's have been used as conventional methods to assess reliability for the quality control and testing-process management of software development. Among others, nonhomogeneous Poisson process (NHPP) models have been discussed in much of the literature, since NHPP models can be easily applied in actual software development. In this section, we describe an NHPP model for analyzing software fault-detection count data. Considering the stochastic characteristics associated with the fault-detection procedures in the testing phase, we treat {N(t), t ≥ 0} as a nonnegative counting process where the random variable N(t) means the cumulative number of faults detected up to testing time t. The fault-detection process {N(t), t ≥ 0} is formulated as follows3,4:
Pr{N(t) = n} = ( {H(t)}^n / n! ) e^{-H(t)}   (n = 0, 1, 2, ...).   (1)

In Eq. (1), Pr{A} means the probability of event A, and H(t) is called the mean value function, which represents the expected cumulative number of faults detected in the testing-time interval (0, t]. According to the growth curve of the cumulative number of detected faults, we assume that the software reliability for each component is assessed by applying the following SRGM's5 based on NHPP's:

- Exponential SRGM
- Delayed S-shaped SRGM
- Inflection S-shaped SRGM
- Testing-effort dependent SRGM

We assume that the following fault-detection rate per remaining fault, derived from each NHPP model, has the equivalent characteristics for each component (closed forms for three of the models are sketched after Eq. (2)):
b_i(t) = ( dH_i(t)/dt ) / ( a_i - H_i(t) ),   (2)

where b_i(t) is the fault-detection rate per remaining fault for the i-th component, H_i(t) is the mean value function for the i-th software component, and a_i is the expected number of initial inherent faults in the i-th component.
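For three of the NHPP models listed above, Eq. (2) reduces to a simple closed form (these are standard mean value functions; the function names below are ours):

```python
import numpy as np

def rate_exponential(t, b):
    """b_i(t) for the exponential SRGM, H(t) = a(1 - exp(-b t)): constant b."""
    return np.full_like(np.asarray(t, dtype=float), b)

def rate_delayed_s(t, b):
    """b_i(t) for the delayed S-shaped SRGM, H(t) = a(1 - (1 + b t) exp(-b t))."""
    t = np.asarray(t, dtype=float)
    return b * b * t / (1.0 + b * t)

def rate_inflection_s(t, b, c):
    """b_i(t) for the inflection S-shaped SRGM,
    H(t) = a(1 - exp(-b t)) / (1 + c exp(-b t))."""
    t = np.asarray(t, dtype=float)
    return b / (1.0 + c * np.exp(-b * t))
```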
3.2. Modeling for Integration Testing

We have proposed several SRGM's for the distributed development environment. However, these models are intended for the system testing phase, which is the final stage to verify whether the reliability requirement of a software system is satisfied. The testing process in a distributed environment can be simply described as follows:

(Phase 1.) Module testing that is managed as a unit of client/server.
(Phase 2.) Subsystem testing that is managed as a software component after the combination of several modules.
(Phase 3.) System testing that is the final stage to verify whether the reliability requirement of a software system is satisfied.

It is especially difficult to proceed with testing phases 2-3, because the architecture of each component may follow a different development style. This causes new faults to be introduced by the combination of several software components. Also, the whole architecture of the software system needs to be modified if a contradiction in software development is found in the system testing phase1,2. From the above, it is necessary to verify software reliability sufficiently in the integration testing phase of the distributed development environment, because new faults are introduced by combining several components. Therefore, we propose a software reliability growth model for the integration testing of a distributed environment.

3.2.1. Model Description
Let M(t) be the number of faults remaining in the software system at testing time t (t ≥ 0). Suppose that M(t) takes on continuous real values. Since latent faults in the software system are detected and eliminated during the testing phase, M(t) gradually decreases as the testing procedures go on. Thus, under common assumptions for software reliability growth modeling, we consider the following linear differential equation:

dM(t)/dt = -b(t)M(t),   (3)

where b(t) is the fault-detection rate per unit time per fault at testing time t and is a non-negative function. Next, we assume that b(t) approximately reflects the characteristics of each software component in the integration testing phase of the distributed environment. In particular, it is necessary to verify software reliability sufficiently in the integration testing phase, because new faults are introduced by combining several software components. Therefore, we suppose that b(t) in Eq. (3) has an irregular fluctuation; that is, we extend Eq. (3) to the following stochastic differential equation6,7:
dM(t)/dt = -{ B(t) + ξ(t) } M(t),   (4)

where ξ(t) is a noise representing an irregular fluctuation and B(t) is the total fault-detection rate. In this paper, we consider that B(t) is given by the following equation:

B(t) = Σ_{i=1}^{n} b_i(t),   (5)

where b_i(s) (i = 1, 2, ..., n) is the software failure-occurrence rate per inherent fault for the i-th component in Eq. (2), and we assume that the software system consists of n software components. Further, to make its solution a Markov process, we assume that ξ(t) can be expressed as follows:

ξ(t) = σγ(t),   (6)

where σ is a positive constant representing the magnitude of the irregular fluctuation and γ(t) is a standardized Gaussian white noise. Substituting Eq. (6) into Eq. (4), we obtain the following solution process under the initial condition M(0) = m_0:

M(t) = m_0 exp( -∫_0^t B(s)ds - σW(t) ),   (7)

where W(·) is a one-dimensional Wiener process, which is formally defined as an integration of the white noise γ(t) with respect to time t. A Wiener process is a Gaussian process, and it has the following properties:
Pr{W(0) = 0} = 1,   (8)

E[W(t)] = 0,   (9)

E[W(t)W(t')] = min[t, t'],   (10)

where E[X] means the expected value of X.
3.2.2. Software Reliability Assessment Measures

Information on the current number of remaining faults in the system is very important for estimating the degree of progress of the software testing process. Since it is a random variable in our model, its expected value can be a useful measure6. We can calculate it from Eq. (7) as follows:

E[M(t)] = m_0 · exp( -∫_0^t B(s)ds + σ²t/2 ).   (11)
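A numerical sketch of Eqs. (7) and (11) is shown below. Here B(t) is taken as the sum of the component rates b_i(t), following Eq. (5) as reconstructed above, and m_0, σ and the rate callables are user-supplied assumptions.

```python
import numpy as np

def expected_remaining_faults(t, m0, sigma, rates, dt=0.01):
    """E[M(t)] = m0 * exp(-int_0^t B(s)ds + sigma^2 t / 2), Eq. (11).
    `rates` is a list of callables b_i(s) accepting an array argument."""
    s = np.arange(0.0, t + dt, dt)
    B = sum(rate(s) for rate in rates)           # B(s) = sum_i b_i(s), Eq. (5)
    return m0 * np.exp(-np.trapz(B, s) + 0.5 * sigma ** 2 * t)

def sample_path(T, m0, sigma, rates, dt=0.01, seed=0):
    """One sample path of Eq. (7): M(t) = m0 exp(-int_0^t B(s)ds - sigma W(t))."""
    rng = np.random.default_rng(seed)
    s = np.arange(0.0, T, dt)
    B = sum(rate(s) for rate in rates)
    # Discretized Wiener process: independent N(0, dt) increments.
    W = np.concatenate(([0.0], np.cumsum(rng.normal(0.0, np.sqrt(dt), len(s) - 1))))
    M = m0 * np.exp(-np.cumsum(B) * dt - sigma * W)
    return s, M
```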
4. Numerical Examples
We analyze actual software fault data to show several numerical examples of the application of our SRGM. The set of fault-detection count data used in this section was obtained from an actual software project that developed a software system
consisting of seven components. The testing data were recorded on the basis of testing days. In this paper, we estimate the model parameters by using the conventional models shown in Sec. 3.1 for the seven software components during the module testing phase. However, we have verified that the unknown parameters in these models diverge for two of the seven software components. Therefore, we consider that these two components have no effect on the whole system, because their sizes and numbers of detected faults are small compared with the other five components. We show the testing period for each software component in Figure 1.
Figure 1. The testing period for each component in the actual data (components No. 1 to No. 5 undergo module testing over different intervals of the testing period, followed by the integration test and the system test).
4.1. Reliability assessment results for each component

According to the growth curve of the cumulative number of detected faults, we assume that the software reliability of each software component is assessed by applying the SRGM's based on NHPP's. The models in Sec. 3.1 are selected by using the mean square error (MSE)8. First, Table 1 shows the result of the goodness-of-fit comparison in terms of the MSE for each component.

4.2. Reliability assessment results for integration testing
Next, the sample path of the estimated number of remaining faults in Eq. (7), M̂(t), is plotted in Figure 2 along with the actual data.
Table 1. Comparison of goodness-of-fit in terms of the MSE for each component (columns: exponential, delayed S-shaped, inflection S-shaped, and testing-effort dependent SRGM's; rows: components No. 1 to No. 5; the minimum-MSE model, marked *, is selected for each component, with entries including 0.9702*, 1.7087*, 4.0448*, 2.5230, 3.0427, 11.600 and 1.8898).
Figure 2. The sample path of the estimated number of remaining faults, M̂(t), plotted together with the actual data.
The estimated expected number of remaining faults in Eq. (11), E[M(t)], is plotted in Figure 3.

5. Concluding Remarks
In this paper, we have proposed a software reliability growth model for the integration testing phase of a distributed development environment. In particular, we have discussed a method of software reliability assessment considering the interaction among software components in a distributed environment. Additionally, we have presented several numerical examples for the actual data. Conventional SRGM's for the system testing phase in a distributed environment have included many unknown parameters7,9. In particular, no effective estimation method has been presented for the weight parameters p_i (i = 1, 2, ..., n) in Refs. 7 and 9, which represent the proportion of the total testing load on each software component. Our SRGM can be easily applied in distributed software development, because our model has
a simple structure, i.e., the number of unknown parameters included in our model is only two, namely m_0 and σ. Therefore, we consider that our model is very useful for software developers in terms of practical reliability assessment in the actual distributed development environment.

Figure 3. The estimated expected number of remaining faults, E[M(t)] (plotted against TIME (DAYS), 0 to 50).
References
1. A. Umar, Distributed Computing and Client-Server Systems, Prentice Hall, New Jersey (1993).
2. L. T. Vaughn, Client/Server System Design and Implementation, McGraw-Hill, New York (1994).
3. M. R. Lyu, ed., Handbook of Software Reliability Engineering, IEEE Computer Society Press, Los Alamitos, CA (1996).
4. P. N. Misra, Software reliability analysis, IBM Systems J. 22, 3, 262-270 (1983).
5. S. Yamada, Software Reliability Assessment Technology (in Japanese), HBJ Japan, Tokyo (1989).
6. S. Yamada, M. Kimura, H. Tanaka, and S. Osaki, Software reliability measurement and assessment with stochastic differential equations, IEICE Trans. Fundamentals E77-A, 1, 109-116, Jan. (1994).
7. Y. Tamura, M. Kimura, and S. Yamada, Software reliability growth model for a distributed development environment: Stochastic differential equation approach and its numerical estimation (in Japanese), Trans. Japan SIAM 11, 3, 121-132, Sept. (2001).
8. A. Iannino, J. D. Musa, K. Okumoto, and B. Littlewood, Criteria for software reliability model comparisons, IEEE Trans. Software Engineering SE-10, 6, 687-691, Nov. (1984).
9. S. Yamada, Y. Tamura, and M. Kimura, A software reliability growth model for a distributed development environment, Electronics and Communications in Japan, Part 3, 83, 12, 1-8, Dec. (2000).
PERFORMANCE EVALUATION FOR MULTI-TASK PROCESSING SYSTEM WITH SOFTWARE AVAILABILITY MODEL
KOICHI TOKUNO AND SHIGERU YAMADA Department of Social Systems Engineering, Faculty of Engineering, Tottori University, 4-101, Koyama, Tottori-shi, 680-8552 Japan E-mail: {toku, yamada}@sse.tottori-u.ac.jp We propose a performance evaluation method for a multi-task system with a software reliability growth process. The time-dependent behavior of the system itself, alternating between up and down states, is described by a Markovian software availability model. We assume that the cumulative number of tasks arriving at the system and the processing time for a task follow a homogeneous Poisson process and an exponential distribution, respectively. Then we can formulate the distribution of the number of tasks whose processes can be completed with the infinite-server queueing model. From the model, several quantities for software performance measurement related to the task processing can be derived. Finally, we present several numerical examples of the quantities to analyze the relationship between the software reliability characteristics and the system performance measurement.
1. Introduction
For the last few decades, stochastic modeling for software reliability/availability measurement and assessment in dynamic environments, such as the testing phase of software development or the user operation phase, has been much discussed1,2,3. On the other hand, performance evaluation methods for fault-tolerant computing systems have been proposed; these have often been discussed from the viewpoint of the hardware configuration. Beaudry4 has proposed performance-related measures such as the computation availability and the mean computation between failures. Meyer5 has proposed the performability, taking account of accomplishment levels from the customer's viewpoint. Nakamura and Osaki6 have classified the lost jobs caused by processor failure and by cancellation. Sols7 has introduced the concept of degraded availability. However, the above studies have not included the characteristics peculiar to software systems, such as the software reliability growth process. In this paper, we propose a software performance evaluation method based on the number of tasks. Most of the existing techniques for software performance/quality evaluation related to reliability have paid attention only to the states of the systems themselves, such as the software failure-occurrence phenomenon, and have had no consideration for external factors, for example, the frequency of the occurrence
of usage demands for the system and the stochastic characteristics of the customers' usage time. Here we attempt to discuss the software performance evaluation from the viewpoint of the task processing. We consider what we call a multi-task software system, which can process plural tasks simultaneously. We assume that the cumulative number of tasks arriving at the system and the processing time of a task follow a homogeneous Poisson process and an exponential distribution, respectively. The software failure-occurrence phenomenon and the restoration characteristic in the dynamic environment are described by the Markovian software availability model [8]. The stochastic behavior of the number of tasks whose processes can be completed is modeled with the infinite-server queueing model [9]. From the model, we derive several quantities for software performance measurement related to the task processing. The organization of the paper is as follows. Section 2 states the software availability model used in the paper. Section 3 describes the stochastic processes of the numbers of tasks whose processes are completed and canceled out of the tasks arriving up to a given time point. Section 4 derives several software performance measures based on the number of tasks from the model. Section 5 presents numerical examples of the measures and examines the software performance analysis. In Section 6, we state the conclusion of the paper.
2. Software Availability Model
The following assumptions are made for software availability modeling:

AI-1. The software system is unavailable and starts to be restored as soon as a software failure occurs, and the system cannot operate until the restoration action is complete.

AI-2. The restoration action implies the debugging activity; this is performed perfectly with the perfect debugging rate a (0 < a ≤ 1) and imperfectly with probability b (= 1 − a). One fault is corrected and removed from the software system when the debugging activity is perfect.
Figure 1. Sample state transition diagram of X(t).
AI-3. The time intervals of software failures, X_n, and restorations, V_n, when n faults have already been corrected from the system, follow exponential distributions with means 1/λ_n and 1/μ_n, respectively.
The state space of the stochastic process {X(t), t ≥ 0} representing the state of the software system at the time point t is defined as follows:
W_n: the system is operating; R_n: the system is inoperable and being debugged, where n = 0, 1, 2, ... denotes the cumulative number of corrected faults. Figure 1 illustrates a sample state transition diagram of X(t). The state occupancy probabilities that the system is in the states W_n and R_n at the time point t are defined as
P_{W_n}(t) ≡ Pr{X(t) = W_n},   (1)
P_{R_n}(t) ≡ Pr{X(t) = R_n},   (2)

respectively, where g_n(t) is the probability density function of the random variable S_n representing the first passage time to the state W_n, i.e., g_n(t) = dG_n(t)/dt. The distribution function G_n(t) ≡ ∫_0^t g_n(x) dx is given by

G_n(t) = Pr{S_n ≤ t} = 1 − Σ_{i=0}^{n−1} [A¹_{i,n} e^{−x_i t} + A²_{i,n} e^{−y_i t}]   (n = 1, 2, ...; G_0(t) = 1(t) (the step function)),   (3)

where

x_i, y_i = (1/2)[(λ_i + μ_i) ± √((λ_i + μ_i)² − 4aλ_iμ_i)]   (double signs in same order),

A¹_{i,n} = Π_{j=0}^{n−1} λ_jμ_j / [x_i Π_{j=0}^{n−1} (y_j − x_i) Π_{j=0, j≠i}^{n−1} (x_j − x_i)],

A²_{i,n} = Π_{j=0}^{n−1} λ_jμ_j / [y_i Π_{j=0}^{n−1} (x_j − y_i) Π_{j=0, j≠i}^{n−1} (y_j − y_i)]   (i = 0, 1, 2, ..., n − 1).
3. Model Description

We make the following assumptions for the system's task processing:
AII-1. The number of tasks the system can process simultaneously is sufficiently large.
AII-2. The process {N(t), t ≥ 0} representing the number of tasks arriving at the system up to the time t follows a homogeneous Poisson process with the arrival rate θ.

AII-3. The processing time of a task, Y, follows an exponential distribution with mean 1/α, and the processing times are mutually independent.

AII-4. When the system causes a software failure before the processes of tasks are complete, those tasks are canceled. Figure 2 illustrates the configuration of the system's task processing.
Figure 2. Configuration of task processing.
Let {Z(t), t ≥ 0} be the random variable representing the cumulative number of tasks whose processes can be completed out of the tasks arriving up to the time t. By conditioning on {N(t) = k}, we obtain the probability mass function of Z(t) as

Pr{Z(t) = j} = Σ_{k=j}^{∞} Pr{Z(t) = j | N(t) = k} e^{−θt} (θt)^k / k!.   (4)
Given that {X(t) = W_n}, the probability that the process of an arbitrary task is completed is given by

Pr{X_n > Y | X(t) = W_n} = α / (λ_n + α).   (5)
Furthermore, the arrival time of an arbitrary task out of those arriving up to the time t is distributed uniformly over the time interval (0, t] [9]. Therefore, the probability that the process of a task having arrived up to the time t is completed is obtained as

p(t) = (1/t) ∫_0^t Σ_{n=0}^{∞} [α / (λ_n + α)] P_{W_n}(x) dx.   (6)
Then, from assumption AII-3,

Pr{Z(t) = j | N(t) = k} = C(k, j) p(t)^j [1 − p(t)]^{k−j}.   (7)

That is, given that {N(t) = k}, the number of tasks whose processes can be completed follows a binomial distribution with mean kp(t). Accordingly, from (4) the distribution of Z(t) is given by

Pr{Z(t) = j} = [(θtp(t))^j / j!] e^{−θtp(t)}.   (8)
Equation (8) means that Z(t) follows a nonhomogeneous Poisson process with the mean value function θtp(t). Let {W(t), t ≥ 0} be the random variable representing the cumulative number of tasks whose processes are interrupted out of the tasks arriving up to the time t. Then we can apply the same argument as above to the derivation of the distribution of W(t), i.e., we can obtain

Pr{W(t) = j} = [(θtq(t))^j / j!] e^{−θtq(t)},   (9)

where

q(t) = (1/t) ∫_0^t Σ_{n=0}^{∞} [λ_n / (λ_n + α)] P_{W_n}(x) dx.   (10)
4. Software Performance Measures
The expected numbers of tasks completable and incompletable out of the tasks arriving up to the time t are given by

E[Z(t)] = θtp(t),   (11)
E[W(t)] = θtq(t),   (12)
respectively. Furthermore, the instantaneous task completion and incompletion ratios are given by

h(t) = (1/θ) dE[Z(t)]/dt = Σ_{n=0}^{∞} [α / (λ_n + α)] P_{W_n}(t),   (13)
h̄(t) = (1/θ) dE[W(t)]/dt = Σ_{n=0}^{∞} [λ_n / (λ_n + α)] P_{W_n}(t),   (14)
respectively. These represent the ratios of the numbers of tasks completed and canceled to the number of tasks arriving at the system per unit time at the time point t. As to p(t) in Eq. (6) and q(t) in Eq. (10), we can give the following interpretations:

p(t) = E[Z(t)] / E[N(t)],   (15)
q(t) = E[W(t)] / E[N(t)].   (16)
That is, p(t) and q(t) are the task completion and incompletion probabilities per task arriving up to the time t, respectively. We note that p(t) and q(t) have no bearing on the arrival rate of the tasks, θ.
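To make these measures concrete, the following sketch numerically evaluates p(t) in Eq. (6) and E[Z(t)] = θtp(t) in Eq. (11) for the Moranda-type rates used later in Section 5. It is a minimal illustration under stated assumptions, not the authors' implementation: the state probabilities P_{W_n}(t) are obtained by integrating the Kolmogorov forward equations of the Markov model of Section 2 (truncated at K corrected faults), and the values of θ and α are placeholders.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical parameter values (D, c, E, r, a follow the ranges of Section 5).
D, c, E, r, a = 0.2, 0.9, 1.0, 0.95, 0.9   # hazard/restoration rates, perfect-debug prob.
alpha, theta = 1.0, 2.0                     # task processing rate and arrival rate (assumed)
K = 60                                      # truncation level for the fault count n

lam = D * c ** np.arange(K + 1)             # lambda_n = D c^n
mu = E * r ** np.arange(K + 1)              # mu_n = E r^n

def rhs(t, P):
    """Kolmogorov forward equations; P = (P_W0..P_WK, P_R0..P_RK)."""
    W, R = P[:K + 1], P[K + 1:]
    dW = -lam * W
    dR = -mu * R + lam * W                  # a failure in W_n moves the system to R_n
    dW[0] += (1 - a) * mu[0] * R[0]         # imperfect debugging returns to W_n
    dW[1:] += a * mu[:-1] * R[:-1] + (1 - a) * mu[1:] * R[1:]
    return np.concatenate([dW, dR])

P0 = np.zeros(2 * (K + 1)); P0[0] = 1.0     # start operating with no fault corrected
ts = np.linspace(0.0, 300.0, 601)
sol = solve_ivp(rhs, (0.0, 300.0), P0, t_eval=ts, rtol=1e-8, atol=1e-10)

Pw = sol.y[:K + 1]                                          # P_{W_n}(t)
integrand = ((alpha / (alpha + lam))[:, None] * Pw).sum(axis=0)  # this is h(t) of Eq. (13)
cum = np.concatenate([[0.0], np.cumsum((integrand[1:] + integrand[:-1]) / 2 * np.diff(ts))])
p_t = np.where(ts > 0, cum / np.where(ts > 0, ts, 1.0), integrand[0])  # p(t) of Eq. (6)
EZ = theta * ts * p_t                                       # E[Z(t)] of Eq. (11)
print(f"p(100) ~ {p_t[200]:.4f},  E[Z(100)] ~ {EZ[200]:.2f}")
```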
5. Numerical Examples

We show several numerical examples of software performance analysis. Here we apply the model of Moranda [10] to the hazard rate, λ_n = D c^n (D > 0, 0 < c < 1), and the restoration rate, μ_n = E r^n (E > 0, 0 < r ≤ 1), respectively. Figure 3 shows the time-dependent behaviors of the instantaneous task completion ratio, h(t), in Eq. (13) and the instantaneous task incompletion ratio, h̄(t), in Eq. (14), along with the instantaneous software availability, A(t) = Σ_{n=0}^{∞} P_{W_n}(t). This figure tells us that h(t) and h̄(t) converge to 1 and zero, respectively, and that h(t) gives a more pessimistic evaluation than the conventional performance measure A(t), since this model considers that it takes a time duration to finish a task. If we specify the objective of h(t), say h_0, then we can calculate the testing time t = t_h satisfying h(t) = h_0. Figure 4 shows h(t) for various values of the perfect
Figure 3. Behaviors of h(t), h̄(t), and A(t) (α = 1.0, a = 0.9, D = 0.2, c = 0.9, E = 1.0, r = 0.95).
Figure 4. Dependence of h(t) on a (α = 1.0, D = 0.2, c = 0.9, E = 1.0, r = 0.95).
debugging rate, a, where the solid horizontal line designates an example of the objective, h_0 = 0.85. We can see that it takes a longer time to satisfy the objective of h(t) as the debugging ability becomes lower.

6. Concluding Remarks
In this paper, we have discussed software performance measurement based on the number of tasks. The stochastic behaviors peculiar to the software system, such as the software reliability growth process, the upward tendency of difficulty in debugging, and the imperfect debugging environment, have been described by the Markovian
availability model. Assuming that the cumulative number of tasks arriving at the system up to a given time point follows a homogeneous Poisson process, we have analyzed the distribution of the number of tasks whose processes can be completed with the concept of the infinite-server queueing model. From the model, we have derived several software performance measures such as the expected numbers of completable and incompletable tasks, the instantaneous task completion and incompletion ratios, and the task completion and incompletion probabilities per task. We have also illustrated several numerical examples of these measures. It has been meaningful to correlate the software reliability characteristics with software performance measurement.
Acknowledgments

This work was supported in part by the Saneyoshi Scholarship Foundation, Japan, and a Grant-in-Aid for Young Scientists (B) of the Ministry of Education, Culture, Sports, Science and Technology of Japan under Grant No. 16710114.
References
1. M. R. Lyu (ed.), Handbook of Software Reliability Engineering, IEEE Computer Society Press, Los Alamitos, CA (1996).
2. S. Yamada, Software reliability models, in Stochastic Models in Reliability and Maintenance, Springer-Verlag, Berlin, 253 (2002).
3. K. Tokuno and S. Yamada, Software availability theory and its applications, in Handbook of Reliability Engineering, Springer-Verlag, Berlin, 235 (2003).
4. M. D. Beaudry, Performance-related reliability measures for computing systems, IEEE Trans. Comput. C-27, 540 (1978).
5. J. F. Meyer, On evaluating the performability of degradable computing systems, IEEE Trans. Comput. C-29, 720 (1980).
6. M. Nakamura and S. Osaki, Performance/reliability evaluation of a multi-processor system with computational demands, Int. J. Sys. Sci. 15, 95 (1984).
7. A. Sols, System degraded availability, Reliab. Eng. Sys. Safety 56, 91 (1997).
8. K. Tokuno and S. Yamada, Markovian software availability measurement based on the number of restoration actions, IEICE Trans. Fundamentals E83-A, 835 (2000).
9. S. M. Ross, Applied Probability Models with Optimization Applications, Holden-Day, San Francisco (1970).
10. P. B. Moranda, Event-altered rate models for general reliability analysis, IEEE Trans. Reliab. R-28, 376 (1979).
QUALITY ENGINEERING ANALYSIS FOR HUMAN FACTORS AFFECTING SOFTWARE RELIABILITY IN THE DESIGN REVIEW PROCESS WITH CLASSIFICATION OF DETECTED FAULTS*
KOUSUKE TOMITAKA, SHIGERU YAMADA, AND RYOTARO MATSUDA
Department of Social Systems Engineering, Faculty of Engineering, Tottori University, Minami 4-101, Koyama-cho, Tottori 680-8552, Japan
E-mail: {99t7036, yamada}@sse.tottori-u.ac.jp
Software faults introduced by human development work have great influence on the quality and reliability of a final software product. The design-review work can improve the final quality of a software product by reviewing the design-specifications, and by detecting and correcting a lot of design faults. In this paper, we conduct an experiment to clarify human factors and their interactions affecting software reliability by assuming a model of human factors which consists of inhibitors and inducers. Finally, extracting the significant human factors by using the quality engineering approach based on the orthogonal array L18(2^1 × 3^7) and the signal-to-noise ratio, we discuss the relationships among them and the classification of detected faults, i.e., descriptive-design and symbolic-design ones, in the design-review process.
1. Introduction
Software faults introduced by human errors in the development activities of complicated and diversified software systems have caused a lot of system failures in modern computer systems. Since these faults are related to the mutual relations among human factors in such software development projects, it is difficult to prevent software failures beforehand in software production control. Additionally, most of these faults are detected and corrected only after software failure occurrences during the testing phase. If we can make the mutual relations among human factors [1,2] clear, then the problem of software reliability improvement is expected to be solved. So far, several studies have been carried out to investigate the relationships among software reliability and human factors by performing software development experiments and providing fundamental frameworks for understanding the mutual relations among various human factors (for example, see [3,4]).

*This work is partially supported by the Grant-in-Aid for Scientific Research (C)(2) from the Ministry of Education, Culture, Sports, Science and Technology of Japan under Grant No. 15510129.
In this paper, we focus on the software design-review process, which is more effective than the other processes for the elimination and prevention of software faults. We adopt a quality engineering approach for analyzing the relationships among the quality of the design-review activities, i.e., software reliability, and human factors to clarify the fault-introduction process in the design-review process. We conduct a design-review experiment with graduate and undergraduate students as subjects. First, we discuss human factors categorized into inhibitors and inducers in the design-review process, and set up controllable human factors in the design-review experiment. Especially, we lay out the human factors on an orthogonal array based on the method of design of experiments [5]. Second, in order to select the human factors which affect the quality of the design-review, we perform a software design-review experiment reflecting an actual design process based on the method of design of experiments. For analyzing the experimental results, we adopt a quality engineering approach, i.e., the Taguchi method. That is, applying the orthogonal array L18(2^1 × 3^7) with inside and outside factors to the human factor experiment and classifying the faults detected in design-review work into descriptive-design and symbolic-design ones, we carry out the analysis of variance by using the data of the signal-to-noise ratio (defined as SNR) [6], which can evaluate the stability of quality characteristics, discuss effective human factors, and obtain the optimal levels for the selected inhibitors and inducers.

2. DESIGN-REVIEW AND HUMAN FACTORS
2.1. Design-review

The design-review process is located between the design and coding phases, and has software requirement-specifications as inputs and software design-specifications as outputs. In this process, software reliability is improved by detecting software faults effectively [7].

2.2. Human factors
The attributes of the software designers and the design-process environment are mutually related in the design-review process. The influential human factors for the design-specifications as outputs are classified into the following two kinds of attributes [8,9,10]:

(i) Attributes of the design reviewers (Inhibitors)
Attributes of the design reviewers are those of the software engineers who are responsible for design-review work, for example, the degree of understanding of requirement-specifications and design-methods, the aptitude of programmers, the experience and capability of software design, the volition of achievement of software design, etc. Most of them are psychological human factors which are considered to contribute directly to the quality of the software design-specification.
(ii) Attributes of the environment for the design-review (Inducers)
In terms of design-review work, many kinds of influential factors are considered, such as the education of design-methods, the kind of design methodologies, and the physical environmental factors in software design work, e.g., temperature, humidity, noise, etc. All of these influential factors may affect the quality of the software design-specification indirectly.

3. DESIGN-REVIEW EXPERIMENT

3.1. Human factors in the experiment

In order to find out the relationships among the reliability of the software design-specification and its influential human factors, we have performed a design-review experiment by selecting the five human factors shown in Table 1 as control factors which are concerned in the review work.
Table 1. Controllable factors in the design-review experiment.

  Control factor                                                                          Level 1       Level 2      Level 3
  A  BGM of classical music in the review work environment [Inducer]                      A1: yes       A2: no       -
  B  Time duration of software design-review work (minute) [Inducer]                      B1: 20 min    B2: 30 min   B3: 40 min
  C  Degree of understanding of the design-method (R-Net Technique) [Inhibitor]           C1: high      C2: common   C3: low
  D  Degree of understanding of requirement-specification [Inhibitor]                     D1: high      D2: common   D3: low
  E  Check list (indicating the matters that require attention in review work) [Inducer]  E1: detailed  E2: common   E3: nothing
3.2. Summary of experiment
In this experiment, we clarify the relationships among the human factors affecting software reliability and the reliability of design-review work by assuming a human factor model [8,9,10] consisting of the inhibitors and inducers. The actual experiment has been performed by 18 subjects, based on the same design-specification of a triangle program, which receives three integers representing the sides of a triangle and classifies the kind of triangle such sides form [11]. We measured the 18 subjects' capability of both the degree of understanding of the design-method and that of the requirement-specification by preliminary tests before the design of experiments. Further, we seeded some faults in the design-specification intentionally. Then, we executed the design-review experiment in which the 18 subjects detect the seeded faults.
We have performed the experiment by using the five control factors with three levels as shown in Table 1, which are assigned to the orthogonal array L18(2^1 × 3^7) of the design of experiments as shown in Table 3.

3.3. Classification of detected faults
We classify the design parts pointed out as detected faults in the design-review into descriptive-design and symbolic-design parts.

- Descriptive-design faults: The descriptive-design parts consist of words or technical terminologies which are described in the design-specification to realize the required functions. In this experiment, the descriptive-design faults are algorithmic ones, and we can improve the quality of the design-specification by detecting and correcting them.
- Symbolic-design faults: The symbolic-design parts consist of marks or symbols which are described in the design-specification. In this experiment, the symbolic-design faults are notation mistakes, and the quality of the design-specification cannot be improved by detecting and correcting them.
3.4. Data analysis with classification of detected faults
For the orthogonal array L18(2^1 × 3^7), setting the classification of detected faults as the outside factor R and the control factors A, B, C, D, and E as inside factors, we perform the design-review experiment. Here, the outside factor R has two levels: descriptive-design parts (R1) and symbolic-design parts (R2).

4. ANALYSIS OF EXPERIMENTAL RESULTS

4.1. Definition of SNR

We define the efficiency of the design-review, i.e., the reliability, as the degree to which the design reviewers can accurately detect correct and incorrect design parts in the design-specification containing seeded faults. There exists the following relationship among the total number of design parts, n, the number of correct design parts, n0, and the number of incorrect design parts containing seeded faults, n1:
n = n0 + n1.   (1)
Therefore, the design parts are classified as shown in Table 2 by using the following notations:

n00 = the number of correct design parts detected accurately as correct design parts,
n01 = the number of correct design parts detected by mistake as incorrect design parts,
n10 = the number of incorrect design parts detected by mistake as correct design parts,
n11 = the number of incorrect design parts detected accurately as incorrect design parts,

where the two kinds of error rate are defined by

p = n01 / n0,   (2)
q = n10 / n1.   (3)

Considering the two kinds of error rate, p and q, we can derive the standard error rate, p0, [6] as

p0 = 1 / (1 + √((1/p − 1)(1/q − 1))).   (4)
Then, the signal-to-noise ratio based on Eq. (4) is defined by

η = −10 log10 [1 / (1 − 2p0)² − 1]   (dB).   (5)
The standard error rate, p0, can be obtained by transforming Eq. (5) by using the signal-to-noise ratio of each control factor as

p0 = (1/2)[1 − 1/√(1 + 10^{−η/10})].   (6)
Table 2. Two kinds of inputs and outputs in the design-review experiment: (i) observed values, (ii) error rates.
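As a worked illustration of Eqs. (2)-(5), the following sketch computes the two error rates, the standard error rate, and the SNR from the review counts of a single run. The counts are hypothetical, and the SNR formula is the standard Taguchi two-class form as reconstructed above, not copied verbatim from the paper.

```python
import math

def snr_db(n0, n01, n1, n10):
    """Standard error rate p0 and SNR (dB) from review counts, following Eqs. (2)-(5).

    n0: correct design parts, n01: correct parts flagged as incorrect,
    n1: seeded incorrect parts, n10: incorrect parts missed.
    Assumes 0 < n01 < n0 and 0 < n10 < n1 (both error rates strictly between 0 and 1).
    """
    p = n01 / n0                       # Eq. (2): first kind of error rate
    q = n10 / n1                       # Eq. (3): second kind of error rate
    p0 = 1.0 / (1.0 + math.sqrt((1.0 / p - 1.0) * (1.0 / q - 1.0)))   # Eq. (4)
    eta = -10.0 * math.log10(1.0 / (1.0 - 2.0 * p0) ** 2 - 1.0)       # Eq. (5)
    return p0, eta

# Hypothetical counts for one experimental run (n = n0 + n1 = 60 design parts).
p0, eta = snr_db(n0=52, n01=3, n1=8, n10=2)
print(f"standard error rate p0 = {p0:.4f}, SNR = {eta:.3f} dB")
```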
4.2. Orthogonal array L18(2^1 × 3^7)
The method of experimental design based on an orthogonal array is a special one that requires only a small number of experimental trials to help us discover main factor effects. In traditional researches [4,8], the design of experiments has been conducted by using the orthogonal array L12(2^11). However, since the orthogonal array L12(2^11) has only two levels for grasping the factorial effects in the human factors experiment,
Table 3. The orthogonal array L18(2^1 × 3^7) with assigned human factors and experimental data.

  No.  A  B  C  D  E   SNR R1 (dB)  SNR R2 (dB)
   1   1  1  1  1  1      7.578       6.580
   2   1  1  2  2  2     -3.502       3.478
   3   1  1  3  3  3     -8.769      -2.342
   4   1  2  1  1  2      7.578       8.237
   5   1  2  2  2  3      1.784       4.841
   6   1  2  3  3  1     -7.883       0.419
   7   1  3  1  2  1      7.578       3.478
   8   1  3  2  3  2     -3.413       3.478
   9   1  3  3  1  3      0.583       4.497
  10   2  1  1  3  3      0.583       4.497
  11   2  1  2  1  1      3.591       0.419
  12   2  1  3  2  2     -6.909      -2.342
  13   2  2  1  2  3    -10.939       8.237
  14   2  2  2  3  1     -8.354      -2.342
  15   2  2  3  1  2    -10.939       8.237
  16   2  3  1  3  2      4.120       8.237
  17   2  3  2  1  3      1.784       4.841
  18   2  3  3  2  1     -5.697       0.419

(R1: descriptive-design parts, R2: symbolic-design parts)
the middle effect between two levels cannot be measured. Thus, in order to measure it, we adopt the orthogonal array L18(2^1 × 3^7), which can lay out one factor with 2 levels (1, 2) and 7 factors with 3 levels (1, 2, 3) as shown in Table 3, and dispense with 2^1 × 3^7 trials by executing 18 mutually independent experimental trials. Considering such circumstances, we can obtain the optimal levels for the selected inhibitors and inducers efficiently by using the orthogonal array L18(2^1 × 3^7).

5. DATA ANALYSIS WITH CORRELATION AMONG INSIDE AND OUTSIDE FACTORS
5.1. Analysis of experimental results

We analyze the simultaneous effects of the outside factor R and the inside control factors A, B, C, D, and E. As a result of the analysis of variance taking account of the correlation among inside and outside factors discussed in 3.4, we can obtain Table 4. There are two kinds of errors in the analysis of variance: e1 is the error among experiments of the inside factors, and e2 is the mutual correlation error between e1 and the outside factor. In this analysis, since there was no significant effect by performing F-testing for e1
with e2, F-testing for all factors was performed with e2. As a result, the significant control factors, namely the degree of understanding of the design-method (Factor C), the degree of understanding of the requirement-specification (Factor D), and the classification of detected faults (Factor R), were recognized. Fig. 1 shows the factor effect for each level of the significant factors which affect design-review work.
Table 4. The result of analysis of variance taking account of correlation among inside and outside factors.

  Factor   f    S         V        F0         ρ (%)
  A        1    37.530    37.530   2.497      3.157
  B        2    47.500    23.750   1.580      3.995
  C        2    313.631   156.816  10.435**   26.380
  D        2    137.727   68.864   4.582*     11.584
  E        2    4.684     2.342    0.156      0.394
  A×B      2    44.311    22.155   1.474      3.727
  e1       6    38.094    6.349    0.422      3.204
  R        1    245.941   245.941  16.366**   20.686
  A×R      1    28.145    28.145   1.873      2.367
  B×R      2    78.447    39.224   2.610      6.598
  C×R      2    36.710    18.355   1.221      3.088
  D×R      2    9.525     4.763    0.317      0.801
  E×R      2    46.441    23.221   1.545      3.906
  e2       8    120.222   15.028              10.112
  T        35   1188.909                      100.0

*: significant at the 5% level   **: significant at the 1% level
5.2. Discussion and concluding remarks
As a result of the analysis, among the inside factors, only Factors C and D are significant, and the inside and outside factors do not interact mutually. That is, it turns out that reviewers with a high degree of understanding of the design-method and of the requirement-specification can review the design-specification efficiently regardless of the classification of detected faults. Moreover, the result that the outside factor R is highly significant, and that the descriptive-design faults are detected less often than the symbolic-design faults, can be obtained. That is, although it is a natural result, it is more difficult to detect and correct the algorithmic faults, which lead to improvement in quality, than the notation mistakes. However, it is important to detect and correct the algorithmic faults as an essential problem of the quality improvement for design-review work. Therefore, in order to increase the rate of detection and correction of the algorithmic faults which lead to the improvement of quality, it is required before design-review work to make reviewers
Fig. 1. The estimation of significant factors with correlation among inside and outside factors (levels C1, C2, C3; D1, D2, D3; R1, R2).
fully understand the design technique used for describing the design-specification and the contents of the requirement-specifications.
References
1. V. R. Basili and R. W. Reiter, Jr.: "An investigation of human factors in software development", IEEE Computer Magazine, vol. 12, no. 12, pp. 21-38 (1979).
2. T. Nakajo and H. Kume: "A case history analysis of software error cause-effect relationships", IEEE Trans. Software Engineering, vol. 17, no. 8, pp. 830-838 (1991).
3. K. Esaki and M. Takahashi: "Adaptation of quality engineering to analyzing human factors in software design" (in Japanese), J. Quality Engineering Forum, vol. 4, no. 5, pp. 47-54 (1996).
4. K. Esaki and M. Takahashi: "A software design review on the relationship between human factors and software errors classified by seriousness" (in Japanese), J. Quality Engineering Forum, vol. 5, no. 4, pp. 30-37 (1997).
5. G. Taguchi: A Method of Design of Experiment (first volume, 2nd ed.) (in Japanese), Maruzen, Tokyo (1976).
6. G. Taguchi (ed.): Signal-to-Noise Ratio for Quality Evaluation (in Japanese), Japanese Standards Association, Tokyo (1998).
7. S. Yamada: Software Reliability Models: Fundamentals and Applications (in Japanese), JUSE Press, Tokyo (1994).
8. K. Esaki, S. Yamada, and M. Takahashi: "A quality engineering analysis of human factors affecting software reliability in software design review process" (in Japanese), Trans. IEICE Japan, vol. J84-A, no. 2, pp. 218-228 (2001).
9. R. Matsuda and S. Yamada: "A human factor analysis for software reliability improvement based on a quality engineering approach in design-review process", Proc. 9th ISSAT Intern. Conf. Reliability and Quality in Design, Honolulu, Hawaii, U.S.A., pp. 75-79 (2003).
10. S. Yamada and R. Matsuda: "A quality engineering evaluation for human factors affecting software reliability in design review process" (in Japanese), J. Japan Industrial Management Association, vol. 54, no. 1, pp. 71-79 (2003).
11. I. Miyamoto: Software Engineering -Current Status and Perspectives- (in Japanese), TBS Publishing, Tokyo (1982).
CONSTRUCTION OF POSSIBILITY DISTRIBUTIONS FOR RELIABILITY ANALYSIS BASED ON POSSIBILITY THEORY*

XIN TONG, HONG-ZHONG HUANG
Department of Mechanical Engineering, Heilongjiang Institute of Science and Technology, Harbin 150027, China
School of Mechanical Engineering, Dalian University of Technology, Dalian 116023, China
MING J. ZUO
Department of Mechanical Engineering, University of Alberta, Edmonton, Alberta T6G 2G8, Canada
The construction of possibility distributions is a crucial step in the application of possibilistic reliability theory. In this paper, a concise overview of the development of reliability theory based on possibility theory is provided. Then, it is pointed out that, in principle, all methods for generating membership functions can be used to construct the relevant possibility distributions. Some methods used to construct possibility distributions are discussed in detail, and a method used to generate L-R type possibility distributions is provided with the possibilistic reliability analysis of the fatigue strength of mechanical parts. Finally, an example of generating possibility distributions of the fatigue lifetime of gears is provided.
1. Introduction
Since Zadeh [1] introduced the mathematical framework of possibility theory in 1978, many important theoretical as well as practical advances have been achieved in this field. Possibility theory has been applied to artificial intelligence, knowledge engineering, fuzzy logic, automatic control, and other fields. Some researchers have also attempted to apply possibility theory to reliability analysis and safety assessment. Then, how can we apply these models to real-life systems or structures under the various frameworks of possibilistic reliability theory? Eliciting possibility distributions from data is one of the fundamental issues associated with the application of possibilistic reliability theory. On the one hand, in the theory of possibilistic reliability, the concept of possibility distribution plays a role that is analogous, though not completely, to that of probability distribution in the theory of probabilistic reliability. On the other hand, developing possibility distributions is of fundamental importance because the success and/or simplicity of an algorithm depends on the possibility distribution used in the model of possibilistic reliability analysis. Furthermore, it might be difficult, if not impossible, to come up with a general method for developing possibility distributions which will work for all applications. Because the concept of membership functions bears a close relation to the concept of possibility distributions [1], in the present paper we believe that all the methods for generating membership functions can be used to construct the relevant possibility distributions in principle. Some methods for constructing possibility distributions and their suitability are discussed in detail. Then, a method used to generate L-R type possibility distributions is provided with the possibilistic reliability analysis of the fatigue strength of mechanical parts. Finally, an example of generating possibility distributions of the fatigue lifetime of gears using the proposed method is given.

*This work is partially supported by the National Natural Science Foundation of China under the contract number 50175010, the Excellent Young Teachers Program of the Ministry of Education of China under the contract number 1766, the National Excellent Doctoral Dissertation Special Foundation of China under the contract number 200232, and the Natural Sciences and Engineering Research Council of Canada.
2. The Methods for Constructing Possibility Distributions
2.1. Possibility Distributions Based on Membership Functions
As Zadeh [1] pointed out, a possibility distribution can be viewed as a fuzzy set which serves as an elastic constraint on the values that may be assigned to a variable. Therefore, the possibility distribution numerically equals the corresponding membership function, i.e.,

π_X(u) = μ_Ã(u),   (1)

where X is a fuzzy variable and Ã is the fuzzy set induced by X. According to the above-mentioned viewpoint, we can use the methods for constructing membership functions to generate the corresponding possibility distributions. In the following, we present a few commonly used methods for generating membership functions.
2.1.1. Fuzzy statistics [2]

Fuzzy statistics are analogous to probability statistics in form, and they all use certainty approaches to deal with uncertainty problems in real-life systems or structures. When fuzzy statistics are used, a definite judgment must be made on whether a fixed element u0 in the universe of discourse belongs to an alterable crisp set A* or not. In other words, based on n observations, we have

grade of membership of u0 in Ã = (the number of times of "u0 ∈ A*") / n.

The following principles must be observed in the evaluation of fuzzy statistics:
1. The user should be familiar with the concepts of fuzzy sets and capable of quantifying the entity being observed. In other words, the user should be an expert in the field of application.
2. A preliminary analysis of the raw data should be conducted so that abnormal data may be removed.
For further details and examples of fuzzy statistics, readers are referred to [2].

2.1.2. Transformation of probability distributions to possibility distributions
According to Kosko's argument that "fuzziness contains probability as a special case" [3], if we have obtained estimates of the probability density function or other statistical properties of an entity being measured, we can construct its corresponding membership function following the approach outlined in [4]. Based on the technique in [4], we summarize the following simple method for constructing the membership function from the probability density function of a Gaussian random variable, i.e.,

μ(x) = p(x) / p*,   (2)
p* = max_x p(x),   (3)

where p(x) is the pdf of a Gaussian random variable and μ(x) is the corresponding membership function based on p(x).
2.1.3. Heuristic methods [5]

With heuristic methods, we first select a predefined shape for the membership function to be developed. The specific parameters of the membership function with the selected shape are determined from the data collected. In most real-life problems, the universe of discourse of the membership functions is the real number line. The commonly used membership function shapes are the piecewise linear function and the piecewise monotonic function. Linear and piecewise linear membership functions have the advantages of reasonably smooth transitions and easy manipulation through fuzzy operations. However, the shapes of many heuristic membership functions are not flexible enough to model all kinds of data. Moreover, the parameters of the membership functions must be provided by experts. In many applications, the parameters need to be adjusted extensively to achieve a certain performance level. In practical applications, we often combine fuzzy statistics with heuristic methods. First, the shape of the membership function is suggested by statistical data. Then, the suggested shape is compared with the predefined shapes and the more appropriate ones are selected. Finally, the most suitable membership function is determined through practical tests.
2.2. Transformation of Probability Distributions to Possibility Distributions

The methods for transforming probability distributions to possibility distributions are based on the possibility/probability consistency principle, which states:
If a variable x can take the values u1, ..., un with respective possibilities π = (π(u1), ..., π(un)) and probabilities p = (p(u1), ..., p(un)), then the degree of consistency of the probability distribution p with the possibility distribution π is expressed by

γ = Σ_{i=1}^{n} π(u_i) p(u_i).

For more details on this principle, readers are referred to [1].

2.2.1. The bijective transformation method [7]
Let X = {x_i | i = 1, 2, ..., n} be the universe of discourse. If the histogram (or the probability density function) of the variable X has a decreasing trend, that is,

p(x_1) ≥ p(x_2) ≥ ... ≥ p(x_n),   (4)

then the corresponding possibility distribution can be constructed as follows:

π(x_i) = i·p(x_i) + Σ_{j=i+1}^{n} p(x_j).   (5)

Generally, the histograms can be normalized by setting the maximal value to 1, i.e.,

π(x_i) = p(x_i) / p(x_1).   (6)

2.2.2. The conservation of uncertainty method
Klir [8] presented a method for constructing possibility distributions based on the principle of uncertainty conservation. When uncertainty is transformed from one theory T1 to another T2, the following requirements must be met: (1) the amount of inherent uncertainty should be preserved, and (2) all relevant numerical values in T1 must be converted to their counterparts in T2 by an appropriate scale. The probabilistic measure of uncertainty is the well-known Shannon entropy, given by

H(p) = −Σ_{i=1}^{n} p_i log2 p_i.

In the possibility theory, there are two types of uncertainty, nonspecificity N(π) and discord D(π) (see [8] for their expressions), and the conservation requirement is

N(π) + D(π) = H(p).   (7)

Klir [8] contends that the log-interval scale transformation is the only one that exists for all distributions and is unique. Its form is

π_i = (p_i / p_1)^α,

where α is a positive constant determined by solving Eq. (7); Klir conjectures that α lies in the interval [0, 1].
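The following sketch illustrates both transformations on a hypothetical six-point histogram: the bijective transformation of Eq. (5) and Klir's log-interval transformation with a placeholder value of α (in Klir's method α would be obtained by solving the conservation equation (7), which is omitted here).

```python
import numpy as np

p = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])   # histogram, already decreasing

# Dubois-Prade bijective transformation, Eq. (5): pi_i = i*p_i + sum_{j>i} p_j
n = len(p)
pi_dp = np.array([(i + 1) * p[i] + p[i + 1:].sum() for i in range(n)])

# Klir's log-interval scale transformation: pi_i = (p_i / p_1)^alpha
alpha = 0.7          # placeholder; should be solved from Eq. (7) in Klir's method
pi_klir = (p / p[0]) ** alpha

print("bijective   :", np.round(pi_dp, 3))    # pi_1 = 1 by construction
print("log-interval:", np.round(pi_klir, 3))
```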
2.3. Subjective Manipulations of Fatigue Data

Assume that we have obtained fatigue life data of a device, denoted by {n_j^i}, 1 ≤ j ≤ N, 1 ≤ i ≤ M, where M is the number of stress levels and N is the number of data points at each stress level. Then the mean fatigue life at stress level i can be expressed as

m_{n_i} = (1/N) Σ_{j=1}^{N} n_j^i.   (8)

The lifetime data at each stress level can be divided into two groups, that is,

G_1 = {n_j^i, j = 1, 2, ..., N | n_j^i < m_{n_i}},   (9)
G_2 = {n_j^i, j = 1, 2, ..., N | n_j^i > m_{n_i}}.   (10)

The mean value m_{n_i} is assigned a possibility degree of 1, and a possibility degree of 0.5 is assigned to the means of the lifetime data in the two groups G_1 and G_2, that is,

m'_{n_i} = (1/#(G_1)) Σ_{n_j^i ∈ G_1} n_j^i,   π_{n_i}(m'_{n_i}) = 0.5,   i = 1, 2, ..., M,   (11)
m''_{n_i} = (1/#(G_2)) Σ_{n_j^i ∈ G_2} n_j^i,   π_{n_i}(m''_{n_i}) = 0.5,   i = 1, 2, ..., M,   (12)

where #(·) denotes the number of data points in a set. By use of the above-mentioned analysis, we can express the L-R type possibility distribution of fatigue lifetime as follows:
π_{n_i}(n) = { L((m_{n_i} − n)/α_{n_i}),   n ≤ m_{n_i}
            { R((n − m_{n_i})/β_{n_i}),   n > m_{n_i},   (13)

where

α_{n_i} = (m_{n_i} − m'_{n_i}) / L^{-1}(0.5)   and   β_{n_i} = (m''_{n_i} − m_{n_i}) / R^{-1}(0.5).
Considering the various types of L-R possibility distributions mentioned earlier in this paper, we can use Eq. (13) to get specific possibility distributions to represent fatigue lifetime data. For example, the following triangular possibility distribution may be used to represent fatigue lifetime data:

π_{n_i}(n) = { 0,                              n ≤ m_{n_i} − α_{n_i}
            { 1 − (m_{n_i} − n)/α_{n_i},      m_{n_i} − α_{n_i} < n ≤ m_{n_i}
            { 1 − (n − m_{n_i})/β_{n_i},      m_{n_i} < n ≤ m_{n_i} + β_{n_i}
            { 0,                              m_{n_i} + β_{n_i} < n,   (14)

where α_{n_i} = 2(m_{n_i} − m'_{n_i}) and β_{n_i} = 2(m''_{n_i} − m_{n_i}). Similarly, we may use the following Gaussian possibility distribution to represent fatigue lifetime data:

π_{n_i}(n) = { exp[−((m_{n_i} − n)/α_{n_i})²],   n ≤ m_{n_i}
            { exp[−((n − m_{n_i})/β_{n_i})²],   n > m_{n_i},   (15)

with α_{n_i} and β_{n_i} obtained from Eq. (13) using L^{-1}(0.5) = R^{-1}(0.5) = √(ln 2).
3. Example

We illustrate the method presented in Section 2 by constructing the possibility distribution of the fatigue lifetime data given in [9]. The collected data of bending fatigue lifetime are shown in Table 1. Only four data points at each stress level are given in Table 1, and they are sufficient for subjective estimation of the possibility distribution of fatigue lifetime.

Table 1. The bending fatigue lifetime data of gear-teeth made of hardened and tempered steel 40Cr (in units of 10^6 loading cycles).

  S1 = 467.2 MPa: 0.1404, 0.1508, 0.1572, 0.1738
  S2 = 423.4 MPa: 0.1573, 0.1723, 0.1857, 0.1872
  S3 = 381.6 MPa: 0.2919, 0.3024, 0.3250, 0.3343
  S4 = 339.0 MPa: 0.3879, 0.4890, 0.5657, 0.5738

Using Eq. (8) and Table 1, we have
m_{n_1} = (1/N) Σ_{j=1}^{4} n_j^1 = (1/4)(0.1404 + 0.1508 + 0.1572 + 0.1738) = 0.15555,   π_{n_1}(0.15555) = 1.

The data points at the first stress level are divided into two groups separated by the calculated mean value m_{n_1}, i.e.,

G_1 = {0.1404, 0.1508 | n_j^1 < m_{n_1}},   G_2 = {0.1572, 0.1738 | n_j^1 > m_{n_1}}.

Further, from Eqs. (11) and (12), we have

m'_{n_1} = (1/#(G_1)) Σ_{n_j^1 ∈ G_1} n_j^1 = (1/2)(0.1404 + 0.1508) = 0.1456,   π_{n_1}(0.1456) = 0.5,
m''_{n_1} = (1/#(G_2)) Σ_{n_j^1 ∈ G_2} n_j^1 = (1/2)(0.1572 + 0.1738) = 0.1655,   π_{n_1}(0.1655) = 0.5.

Finally, with these calculated results and Eq. (14), we can construct the triangular possibility distribution of the bending fatigue lifetime of gear-teeth made of hardened and tempered steel 40Cr under the stress level of 467.2 MPa as follows:

α_{n_1} = 2(m_{n_1} − m'_{n_1}) = 2(0.15555 − 0.1456) = 0.0199,
β_{n_1} = 2(m''_{n_1} − m_{n_1}) = 2(0.1655 − 0.15555) = 0.0199.
Note that the procedure for constructing the possibility distributions at other stress levels is the same. After obtaining the possibility distribution of fatigue lifetime of the gear, we can derive the possibilistic reliability of bending fatigue strength of the gear at any time according to posbist reliability theory, e.g., under the stress level of 467.2MPa, we can figure out the possibilistic reliability of bending fatigue strength of the gear as follows:
R(t) = { 1,                           t ≤ 0.15555
       { 1 − (t − 0.15555)/0.0199,    0.15555 < t ≤ 0.17545
       { 0,                           t > 0.17545.
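The following sketch reproduces the worked example: it rebuilds the triangular possibility distribution of Eq. (14) and the possibilistic reliability above from the S1 = 467.2 MPa column of Table 1. The reliability function follows the piecewise form just given.

```python
import numpy as np

life = np.array([0.1404, 0.1508, 0.1572, 0.1738])   # S1 = 467.2 MPa column of Table 1

m = life.mean()                                     # Eq. (8):  0.15555
m1 = life[life < m].mean()                          # Eq. (11): 0.1456, possibility 0.5
m2 = life[life > m].mean()                          # Eq. (12): 0.1655, possibility 0.5
a = 2 * (m - m1)                                    # alpha_{n_1} = 0.0199
b = 2 * (m2 - m)                                    # beta_{n_1}  = 0.0199

def poss(t):
    """Triangular possibility distribution of Eq. (14)."""
    if t <= m - a or t >= m + b:
        return 0.0
    return 1 - (m - t) / a if t <= m else 1 - (t - m) / b

def rel(t):
    """Possibilistic reliability of bending fatigue strength (piecewise form above)."""
    if t <= m:
        return 1.0
    return max(0.0, 1 - (t - m) / b)

for t in (0.1456, 0.15555, 0.1655, 0.17545):
    print(f"t = {t:.5f}: poss = {poss(t):.3f}, reliability = {rel(t):.3f}")
```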
4. Conclusions

1. In this paper, we addressed the critical problem in possibilistic reliability theory, which is the construction of the possibility distribution, and pointed out that all methods for generating membership functions can be used to construct the corresponding possibility distributions. We also presented a new method for constructing the possibility distribution with the possibilistic reliability analysis of the fatigue lifetime of mechanical parts.
2. The methods for constructing possibility distributions are not as mature as those for constructing probability distributions. The present paper has provided a concise overview of the methods for constructing the possibility distributions in possibilistic reliability analysis. Further research is needed to develop a more general method for constructing possibility distributions.
References
1. Zadeh L A. Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1978; 1(1): 3-28.
2. Wang P Z. Fuzzy Sets and Their Applications. Shanghai: Shanghai Scientific & Technical Publishers, 1983.
3. McNeill D, Freiberger P. Fuzzy Logic. New York: Simon and Schuster, 1993.
4. Civanlar M R, Trussell H J. Constructing membership functions using statistical data. Fuzzy Sets and Systems, 1986; 18(1): 1-13.
5. Medasani S, Kim J, Krishnapuram R. An overview of membership function generation techniques for pattern recognition. International Journal of Approximate Reasoning, 1998; 19: 391-417.
6. Medaglia A L, Fang S C, Nuttle H L W, Wilson J R. An efficient and flexible mechanism for constructing membership functions. European Journal of Operational Research, 2002; 139: 84-95.
7. Dubois D, Prade H. Unfair coins and necessity measures: towards a possibilistic interpretation of histograms. Fuzzy Sets and Systems, 1983; 10: 15-20.
8. Klir G. A principle of uncertainty and information invariance. International Journal of General Systems, 1990; 17(2/3): 249-275.
9. Tao J, Wang X Q, Tan J Z. Reliability of gear-tooth bending fatigue strength for through hardened and tempered steel 40Cr. Journal of University of Science and Technology Beijing, 1997; 19(5): 482-484.
A SEQUENTIAL DESIGN FOR BINARY LIFETIME TESTING ON WEIBULL DISTRIBUTION WITH UNKNOWN SCALE PARAMETER
W. YAMAMOTO, K. SUZUKI, AND H. YASUDA
The University of Electro-Communications, 1-5-1 Chofugaoka, Chofu, Tokyo 182-8585, Japan
E-mail: [email protected]

We develop a sequential experimental plan for items which need destructive lifetime testing. This plan repeatedly estimates the unknown parameters by MLE and tries to set the observation times so as to improve the precision of the final estimates. We conduct a simulation study to compare our procedure with a plan in the literature and find certain advantages over it.
1. Introduction
Many products allow us to monitor the conditions of each item or to diagnose their conditions repeatedly, during their reliability experiments or through their lives. Typical examples include LCDs, computers, and cars, which we can use repeatedly until they fail. We can plan the experiments with limited resources for such items and estimate the reliability properties parametrically or nonparametrically. There are many products of other types which restrict our ways of observation and limit the number of inspection records per item. Typical examples of such products include extinguishers, films, and air bags. They can be used and diagnosed at most once through their lives. The outcomes of experiments are often binary in that each item is reported as successfully used or as defective. We call the lifetime experiments on such products binary lifetime testing. Many authors investigate the design problem for binary testing, which seeks sets of appropriate inspection times and allocations of items to them. Abdelbasit and Plackett (1983), Atkinson and Donev (1992), and Silvey (1980) proved that the number of distinct observation times is at most q(q+1)/2, where q is the number of unknown parameters of the underlying lifetime distribution. Iwamoto and Suzuki (2002) investigate the properties of the optimal designs for various distributions, assuming all parameters known. Yamamoto, Iwamoto, Yasuda, and Suzuki (2003) apply their result to the Weibull distributions with unknown scale parameters and develop a multi-stage procedure based on an approximate D-optimal design, which is intended to be less sensitive to the unknown parameter than the true optimal design.
Figure 1. Difference between lifetime testings: ordinary lifetime data versus outcomes from binary testing.
However, the total time on test grows several times as long as the MTTF of each item with their procedure. Among the literature on the design problems for binary testing, Bergman and Turnbull (1983) developed a sequential procedure which is designed so as not to spend too many items at each observation time. This procedure works pretty well, but we find there is still some room to improve it. In this paper, we develop a sequential procedure of a different type in that the observation times are the sequence of estimated optimal observation times. We restrict our attention to one-parameter Weibull distributions, F(t; η, m) = 1 − exp(−(t/η)^m), with the scale parameter η unknown while the shape parameter m is known. We note that our discussion can be extended to other parametric distributions without any difficulty.

2. D-Optimal Design for Binary Life Testing

Assume that the times to failure of the items are independently distributed with a distribution function F(t; θ) = Pr(T ≤ t). Let N be the total number of items available for testing, t = (t_1, ..., t_M) be the pre-assigned inspection times, and n = (n_1, ..., n_M) be the numbers of items allocated to each time. Further, let X_1, ..., X_M be the numbers of failed items at each time. Then the observed likelihood is proportional to

L(θ) ∝ Π_{j=1}^{M} F(t_j; θ)^{X_j} [1 − F(t_j; θ)]^{n_j − X_j},   (1)
and the Fisher information is

I(θ; t) = Σ_{j=1}^{M} n_j [∂F(t_j; θ)/∂θ]² / {F(t_j; θ)[1 − F(t_j; θ)]}   (2)
for the one-parameter case. The D-optimal design is given by n* = {n*_1, ..., n*_M} and t* = (t*_1, ..., t*_M) that maximize |I(θ; t)|. For the Weibull distribution with the unknown scale parameter η, the Fisher information is given as

I(η; t) = (m/η)² Σ_{j=1}^{M} n_j u_j² e^{−u_j} / (1 − e^{−u_j}),   u_j = (t_j/η)^m.   (3)
The maximum is given by M = 1, n_1 = N, and t_1 = 1.263η, where 1.263 = √(−log(1 − p*)) with p* = 0.797 is derived analytically. We refer to this design as P_I, though it is of no immediate use, for it depends on the unknown parameter η itself. We use this result to modify the procedure developed by Bergman and Turnbull (1983) into a multi-stage procedure with varying steps on observation times.
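The constants p* = 0.797 and 1.263 can be checked numerically: the sketch below maximizes the per-item Fisher-information factor g(u) = u² e^{−u}/(1 − e^{−u}) appearing in Eq. (3), where u = (t/η)^m. It is an illustration of the derivation under the reconstruction above, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_info(u):
    """Negative of the per-item Fisher information factor g(u) = u^2 e^{-u}/(1-e^{-u})."""
    return -(u ** 2) * np.exp(-u) / (1.0 - np.exp(-u))

res = minimize_scalar(neg_info, bounds=(0.01, 10.0), method="bounded")
u_star = res.x
p_star = 1.0 - np.exp(-u_star)          # optimal failure probability at the inspection time
print(f"u* = {u_star:.3f}, p* = {p_star:.3f}, sqrt(u*) = {np.sqrt(u_star):.3f}")
# -> p* ~ 0.797 and sqrt(u*) ~ 1.263, matching t_1 = 1.263*eta in the text.
```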
3. A Sequential Procedure by Bergman and Turnbull

Bergman and Turnbull (1983) proposed a multi-stage sequential procedure with the following instructions:

(1) Fix a sequence of inspection times, t_1 < ... < t_M, one for each stage, at which inspections are made on one or more items. Usually this sequence is an arithmetic progression.
(2) At each stage, items are selected randomly and diagnosed until a stopping rule is satisfied or no item is left.
(3) If there are still items under test, proceed to the next stage, i.e., keep the lifetime test going until the next inspection time.
(4) Repeat 2 and 3 until the test reaches the final stage M.
(5) At the final stage, if reached, all remaining items are inspected.

Bergman and Turnbull (1983) proposed three stopping rules for each stage. Among them, they recommend the one called the ratio rule, where we stop a stage when the outcome, namely the number of failed items r_(j) and the number of functional items r̄_(j), satisfies
b* r̄_(j) − r_(j) > z,   (4)
for prespecified b* and z. The other two rules are called the uniform design and the negative binomial design. We refer to the procedure with the ratio rule as P_BT.

4. A Sequential Procedure with Estimated Optimal Observation Times

We propose a sequential procedure with a different sequence of inspection times. At each stage, the optimal inspection time 1.263η̂ is estimated with the results obtained so far. If the estimated optimal time has already passed, we inspect all remaining items.
Figure 2. Observation times of the proposed procedure: the estimated optimal times t̂(1), t̂(2), t̂(3), t̂(4), ... are used instead of a fixed sequence of stage times t_1 < ... < t_M; at each time the ratio rule R(j): 4r̄(j) − r(j) > 8 is applied.
The MLE of 7 is obtained by solving the following score equation. After inspecting items at t j , we have
This can be solved by Newton-Raphson methods. For cases where the first a few outcomes are zeros, we can modify the procedure t o estimate 7 with the lower confidence bound with confidence level p. It is given as
where Tj. is the total of times on test of all items powered by m,
567
Figure 3. Efficiency ( N = Figure 4. Efficiency ( N = Figure 5. Efficiency ( N = 24, m = 0.7) 24, m = 1.0) 24, m = 2.0)
Figure 6. Efficiency ( N = Figure 7. Efficiency ( N = Figure 8. Efficiency ( N 60, m = 0.7) 60, m = 1.0) 60, m = 2.0)
Figure 9. Efficiency ( N = Figure 10. Efficiency (N = Figure 11. Efficiency ( N = 120, m = 0.7) 120,m = 1.0) 120, m = 2.0)
568 This procedure is referred to as PSD
5. Simulation Study To compare our procedure with Bergman and Turnbull’s procedure, we conduct Monte Carlo studies. The total number of items N is set as 24, 60, and 120. The shape parameter m of Weibull distribution is assumed known and we investigate the cmes with 0.7, 1.0, and 2.0. The first observation time t, is set as 30, M = 12, and t M = 360. Figures 3-12 show that P I could work well if we set f j ( 0 ) near the true value q for three cases, m = 0.7,1.0,2.0. But regarding the facts that PI have one observation time and also that the time is set based on f j ( o ) , it is very difficult and uncertain t o guess the initial estimate 7j(o) as 0.817 < 7j(o) < 1.2577 without any prior information. PI is meant to be a benchmark in comparing Psu with PBT. All three procedures suffer from the loss of efficiency when 7j(o) 2 1 . 6 7 ~If. we fail to set t(l) as t(l) < 1.67t* with Pso and PET,where t* is the optimal observation time for one stage experiments, we had better rather diagnose all items at t(l), i.e. follow PI procedure. This is the common difficulty for all binary lifetime testing. The differences among three lie in cases with 7j(o) < 1.677. Psn gives better efficiency with 7 j ( ~values around 7j(o)/v= 1 than PBT for cases with m is 1.0 and 2.0. This observation can be explained with the contribution of each item to the Fisher information. The relative contributions as functions of m and 7 j ( o ) / ~ are shown in Figure 1. The curves have sharper maximums around fj(o)/q = 1.0 for larger values of m. This is an advantage of introducing two steps, the jump to t’ and one more stopping rule, against PBT. Other results will be presented at the conference. References 1. Iwamoto, D. and K. Suzuki (2002) : “Optimal Design based on Binary Data in Reliability Lifetime Experiment”, J. Rel. Eng. Assoc. Japan, 24, pp.183-191. 2. Yamamoto, Y., D. Iwamoto, H. Yasuda, and K. Suzuki (2003) : “Sequential D-Optimal Design for Binary Lifetime Testing on Weibull Distribution”, J. Rel. Eng. Assoc. Japan, 2 5 , pp.75-87. 3. Abdelbasit, K.M. and Plackett, R.L.( 1983): “Experimental Design for Binary Data,” Journal of the American Statistical Association, 7 8 , pp.90-98. 4. Atkinson, A.C. and Donev, A.N.(1992): “Optimum Experimental Designs,” Wiley, pp.91-117. 5. Bergman, S.W. and Turnbull, B.W.( 1983): “Efficient sequential designs for destructive life testing with application to animal serial sacrifice experiments,” Biometrika, 70, pp.305-314. 6. Salomon, M.(1987) :“Optimal Designs for Binary Data,” Journal of the American Statistical Association, 82, pp.1098-1103. 7. Silvey, S.D. (1980): “Optimal Design,” Chapman and Hall. (pp.1-16, pp.72-73)
THE GENERALLY WEIGHTED MOVING AVERAGE CONTROL CHART FOR DETECTING SMALL SHIFTS IN THE PROCESS MEDIAN LING YANG' Department of Industrial Engineering and Management St. John s and St. Mary s Institute of Technology Taipei, Taiwan, ROC
SHEY-HUE1 SHEU Department ofhdustrial Management National Taiwan University of Science and Technology Taipei, Taiwan, ROC
This study proposes the generally weighted moving average (GWMA) control chart for monitoring the process sample median The GWMA control chart is a generalization of the EWMA control chart The properties and design strategies of the GWMA median control chart are investigated We use simulation to evaluate the average run length properties of the EWMA median control chart and the GWMA median control chart After an extensive comparison, it reveals that the GWMA median control chart performs much better than the EWMA median control chart for detecting small shifts in the process sample median An example is given to illustrate this study
1
Introduction
Effective quality control can be instrumental in increasing productivity and reducing cost. A control chart is a graphical display of a quality characteristic to monitor process performance, and Shewhart charts are often employed for this purpose. Under the assumption of normal distribution, the process mean is equivalent to the process median ( f ). The control charts are easier to do on the shop floor because no arithmetic operations are needed. The person doing the charting can simply order the data and pick the center element. Therefore, many users use the 2 control charts for convenience. It is well known that Shewhart control charts are relatively inefficient in detecting small shifts of the process mean. Alternative control charts, such as the CUSUM chart and the EWMA chart, have been developed to compensate for the inefficiency of Shewhart control charts. Roberts"] first applied the EWMA, denoted as geometric moving average (GMA) control chart, controlled the process mean. The properties and design strategies of the EWMA chart for the mean and for the variance have been well investigated by Sweet"'], Crowder['], Ng and CaseL6],Lucas and Sacc~cci[~], Crowder and
x
*Correspondence: Ling Yang, Department of Industrial Engineering and Management, St. John's and St. Mary's Institute of Technology, 499, Sec. 4, Tam King Road, Tamsui, Taipei, Taiwan, 251, ROC. Fax: (886) 2-2801-3 143. E-mail: [email protected]
569
570
Hamilton[31,and MacGregor and Harris[']. In contrast to the process mean and variance, using the EWMA control chart as a tool for monitoring the process sample median ( i ) still has received very little attention in literature. So far, Castagliola"] has showed that the EWMA-based 8 control chart (EWMA- 8 , for short) is more efficient than the Shewhart control chart in detecting small shifts of the process median. The generally weighted moving average (GWMA) control chart proposed by Sheu and Lin['] is a generalization of the EWMA control chart. Due to the added adjustment parameter a, the GWMA control chart has been shown to perform much better than Shewhart and EWMA control chart in monitoring small shifts of the process mean. In this paper, we assume that the process characteristic follows the fiormal distribution. We use the GWMA control chart to monitor the process median 8 (denoted as GWMA- 8 ) and use the distribution of sample median f , derived by Castagliola"] to compute the control limits of the GWMA- 8 control chart. Simulation is used to evaluate the average run length (ARL). The remainder of this paper is organized as follows: In Section 2, we describe the model of the GWMA control chart. In Section 3, the numerical simulation is used to evaluate the ARL of various process meadmedian shifts under various adjusted parameters. We compare the shift detecting performance between the GWMA- 2 control chart and the EWMA- 8 control chart. In Section 4, we give an example for illustration. Finally, some conclusion remarks are included in the last section. 2
Description of the GWMA Control Chart
2.1. The GWMA- 8 Control Chartfor the Process Median Suppose that the quality characteristic is a variable and the samples have been collected at each point in time (size of rational subgroups n). Let 8, be the sample median of subgroup j which is composed of n independent normal(p,,d)random variables X,,,..., X,,where ,uoand o' are the nominal process mean (also the nominal process median) and the process variance respectively. The distribution of sample median 8, is very close to the (,ucl,Ez)normal distribution, where 5' is the variance of y , . If E o I is the standard deviation of normal (0, 1) sample median, then we have 5 = c x Z(,, . For the values of & , ,refer to Castagliola"] for details. Now we apply the GWMA control chart to control the process median. From Sheu and Linl'], the GWMA- k control statistic, Y, , can be represented as
where (qoa-4'' ),(ql" -q2a),...,(q(l-')e- q l m ) are the weights ofthe most updated sample, the 2"d updated sample, (ql'
- q2u)
..., the most out-of-date sample, respectively. If
(qoa- q ' " ) >
> .. . > (q(l-I)= - qlo) , then the weights decrease with the age of the samples.
The expected value of Eq. (1) can then be computed by
571 E ( y , ) = E[(q"" - q l " ) X , +(q1" -qz")Y,-l +...+(q''-l'" - q , " ) X , +q'"po] = [(q[IC- qlR) + (q'=- q1" ) + ...+ (q(l-1'" - q,"
-
)]E( X)+ q,"pl,
(2)
= pi!
Since X,, j = 1,2,3,... , are independent random variables with variance E 2 = ( a x 6I ) z . The variance of Eq. ( 1 ) is Vur(Y,) = [((
- qIa) 2 + (q'- - q2e -J)2
)* + ... +
+(q'"-q2')2
(
p
)
O
-q j n
)'I
x
5 2
+...+(p)= -qJm)2](ax~o,i)2
(3)
=Q,(ax%J)2,
where Q =(qO' -qi')*+(qiY - q 2 ~ ) z + . . . + ( q ( ' - i ) u - q ' Then, ~ ) ) 2 .the time-varying control limits of the GWMA- 8 control chart can be written as Po + LJgoEo,,
.
(4)
2.2. The EWMA Control Chart as a Special Case of the GWMA Control Chart
In the following (from Eq. (5) to Eq. (S)), we will show that the GWMA control chart turns out to be the EWMA control chart, which was introduced by Roberts"]. When a = 1 and q = 1 - A , Eq. (1) will reduce to Y, =
Ax, + A(1 - A)Z,-,+ ...+ A(l - A)'-'
2 1
+ (1 - A)' p".
(5)
The variance of 4 (from Eq. (5)) will be
The time-varying control limits will become to
The design parameter q is constant, 0 < 4 I 1 , then 0 < (I - A) I 1. When j increases to
d,in Eq. ( 6 ) will increase and approach to a limiting value, 0;- = ( A / ( 2 fixed-width control limits will be
-
00,
A ) ) g 2 . The
i"..
po+L 2-2
That is, the EWMA- 2 chart is a special case in the GWMA-?!, chart when a = 1
572 3
Performance Measurement and Comparison
The design parameters of the GWMA- 2 control chart are the multiple of sigma used in the control limits (L), the value of q, and a. The performance of a control chart is generally measured by the ARL, which is defined as the average number of points plotted before an out-of-control signal is given. When the process is under control, the ARL (named ARLO)of the control chart should be sufficiently large to avoid false alarms; however, when the process is out of control, the ARL (named ARL,) should be sufficiently small to rapidly detect shifts. Because the control limits of the GWMA- 2 control chart are varying with time, finding the exact ARLs for given control limits is not straightforward. Monte Carlo Simulation[*]is used to estimate the ARL of the GWMA2 control chart. Without loss of generality, we assume that in the absence of a special cause of variation, Xlk, j = 1,2,3, ... , k = 1,2, ...,n (n is the size of the rational subgroup), are independent and have a common normal distribution with mean p0 = O and variance CT*= 1 . For simplicity, in the simulation of this paper, we assume n = 5 for the performance comparison; however, any other values of n will conclude similar results. Then, we can get So,,= 0.536 (from Castagliola"], when n = 5) for Eq. (4) to compute the control limits. Let 6 denote the magnitude of the process meadmedian shift (multiple of D ). Each simulation runs 20,000 iterations. The computed GWMA- p control statistics, 5, must be bounded within the GWMA-Y control limits, and each trial ends when either of the control limits is exceeded. In order to realize the performance of the GWMA- control chart, with various design parameters q and different adjustment parameters a, the in-control ( 6= 0) ARL (ARLO)is maintained at approximately 500 by changing the width of the control limits (L). That is, type I errors are set to 0.002 for various GWMA- control schemes herein, while out-of-control ( 6 > 0) ARLl's are used for comparison. The ARL performance for several GWMA- 2 control schemes is shown in Table 1. In Table 1, when a =1.00, qla = q' = (1 -A)' , the GWMA- 2 control chart with time-varying control limits reduces to the EWMA- control chart with time-varying control limits. Based on Table 1, the adjustment parameter a of the GWMA- 3 control chart is more sensitive to small shifts in the process meadmedian than to that of the EWMA- 2 control chart. The boldface numbers in Table 1, especially when q is smaller, make the properties more obvious. When q = 0.50 and a = 0.75, within 0.750, the A m l is smaller than the ARL, of the E W M A - 2 control chart. But when q is larger, the enhancement of detection ability is less apparent. For instance, when q = 0.90 and a = 0.75, the ARLl is only smaller than the ARLl of the EWMA-2 control chart within 0 . 2 5 ~ .Fig. 1 displays the out-of-control run length distributions of the GWMA- 2 control chart with q = 0.90, L = 2.80, with variance of adjustment parameter a, and the initial meadmedian shift of 0.150. Fig. 1 shows that the GWMA- 2 control chart performs better in detecting small shifts in the process meadmedian when a is smaller.
x
x
573 Table 1. ARLs of the GWMA-
-
x control chart with time-varying control limits (ARLOE 500, n
= 5)
6
0.00 0.15 0.25 0.50 0.75 1.oo -
6
0.00 0.15 0.25 0.50 0.75 1.00 -
50000 500000 98.11 106.18 42.94 43.71 11.89 1393 646 721 438 481
50000 107.1 42.19 11.82 632 429
50000 71.20 30.67 1035 5 84 460
50000 500000 500000 76.99 118.52 12869 4889 45.49 3561 11.74 12 17 1230 621 678 6.19 421 421 454
0.012
+
.go010
cr
B 0.008
50000 72.56 31.14 10.17 5.69 402
50000 78.84 32.32 10.17 5.68 399
50000 8693 3367 1030 575 391
albha=1.00( alnha=l 75
P
2 0.006
9
5 0.004 a
0.002
0.000
I
0
10
20
30
40
50
60
70
80
Ruii Length
Figure 1 Out-of-control run length distributions of the GWMA-
x
control chart with q =O 90, L = 2 80, with
variance of adjustment parameter a,and initial meadmedian shift of 0 1%.
4
Application and Illustration
4.1. Design of the GWMA-
2 Control Chart
A better design of the GWMA- 2 control chart is the choice of the parameter (q, a , L ) that meets certain ARLO,the magnitude of shift in the process that wants to be detected quickly, and the minimum ARLl. For example, if a GWMA- 2 control chart is designed for controlling a process median such that the chart will yield an ARLOof 500, and will detect the shift of the process median about 0.25 0 , on average, in fifty samples. Let the controlled process mean = 0 , standard deviation 0 = 1, and subgroup size n = 5. The following steps are recommended: (i) From Table 2, we can apply linear interpolation method to find the (q, a , L ) = (0.8, 0.85, 3.0), which satisfies the ARLOand ARLl
574
x
criteria. (ii) Using Eq. (1) to determine the GWMA- control statistics, 5. Recall that Eq. (4) has an approximate control limit of t0.4674. The GWMAcontrol statistics, q, are then plotted on a control chart with the above UCL and LCL. If any point exceeds the control limits, the process is assumed to be out of control. Table 2 ARLs of the GWMA-
x
x control chart with control limit width L = 3.0 (n= 5)
+
q = 0.50 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 1.25 346.33 344.65 348.78 347.76 350.50 357.47 365.95 365.42 370.89 371.37 380.48 211.46 182.17 172.46 166.29 171.30 173.99 176.81 182.13 187.67 191.36 209.09 121.96 96.36 84.44 78.55 79.11 80.66 82.68 86.35 89.87 93.20 101.46 34.62 26.39 21.77 19.74 18.52 17.99 18.12 18.31 18.78 19.35 21.24 972 8.95 8.36 13.99 11.33 7.99 7.91 7.89 7.77 7.80 8.12 7.17 5.37 5.10 6.21 5.68 4.97 4.83 4.72 4.74 4.70 4.71 q = 0.75
0.10
0.00 0.15 0.25 0.50
0.75
1 .oo
0.00 0.15 0.25 0.50 0.75 1.00
0.00 0.15 0.25 0.50 0.75
1 .oo
000 0 10 025 050 075 1 00
0.10 020 0.30 0.40 050 0.60 345.60 351.03 347.99 359.21 374.74 397.19 193.46 158.48 127.46 113.62 114.02 119.36 110.32 76.89 59.64 50.87 46.40 46.80 31.39 22.08 17.44 15.02 13.60 12.79 7.62 7.11 674 8.55 13.02 10.05 4.88 4.68 4.48 6.80 5.86 5.25 q = 0.80 0 10 0.20 0.30 0.40 0.50 0.60 355.85 355.50 366.22 386.29 408.54 444.24 194.14 149.43 119.48 106.96 105.28 108.67 108.61 75.09 56.17 47.33 43.42 42.44 30 83 21 41 16.88 14.41 13.08 12.39 7.51 6.98 6.60 12.82 9.89 8.50 4.86 4.68 4.47 6.84 5.78 5.22 a = 0.90
010 35501 257 17 19561 2953 1245 668
1.00 1.25 070 0.80 0.90 411.30 438.96 461.56 489.01 517.78 129.07 142.11 150.99 166.49 196.26 49.08 53.11 57.43 63.51 77.45 12.92 14.86 12.61 12.41 12.57 6.54 6.38 6.44 6.62 6.39 4.46 4.33 4.32 4.31 4 32
0.70 0.80 0.90 1.00 1.25 471.28 491.94 519.96 542.88 566.73 116.70 130.44 145.34 158.37 194.99 44.03 46.89 52.54 57.39 72.95 11.84 11.78 11.89 12.22 13.66 6.41 6.35 6.26 6 30 6.49 4.37 4.32 4.28 4.26 4.29
020 030 040 050 060 070 080 090 100 125 36302 36665 42567 48991 56782 65843 72054 769 13 81207 86288 194.42 155.31 134.16 130.22 138.05 148.67 173.31 201.51 233 64 30978 142.31 110.95 94.32 89.83 93.10 99.10 112.56 129.77 149 10 19868 2020 1597 1351 1228 1164 11.20 11.07 11.23 1145 1244 814 724 673 6.14 624 632 963 640 626 6.16 476 451 440 429 572 510 4.23 426 430 426
4.2. An Example
A set of simulation data is used herein to illustrate a G W M A - x control scheme. The values of the process characteristic XJh where j = I, 2, ..., 20 and k = 1, 2, ..., 5 are independent and have a common normal distribution with meadmedian po = 0, variance o2= 1 . Let the target value be po, and the process be under control for the first ten samples. Then, the process median level shifts upward about 0.25 5 during the last ten samples. These twenty simulation data, along with their corresponding EWMA- 2 control statistics, 5,and GWMA- R control statistics, 5, are listed in Table 3.
575 Within this table, we set the parameters /z = 0.25 and L = 3.003 for the EWMA- 2 control chart with time-varying control limits. For a fair comparison, we set the parameters q = 0.90 and L = 2.884 for the GWMA- 2 control chart with time-varying control limits, with the in-control ARL being 500. The EWMA- 2 control statistics, 4, display the out-of-control signal at the 18" sample. The GWMA- control statistics, Y,, display the out-of-control signal at the 16'h sample. Under the assigned parameters as described above, it takes only 30.67 samples in average for the GWMA- 2 control scheme to detect an out-of-control signal, while 56.67 samples are needed for the EWMA- B control scheme. Figs. 2(a) and 2(b) display the plots of the control statistics. Table3 Exam1
L-
x,,
45
0921 -2056 -0898 2587 -0 378 1227 -0 565 0505 0524 -0982 -0 826 1037 -0 473 -0426 1358 2 124 0447 2047 2506 -0092
0617 0445 1405 0202 -0 195 0768 -0494 -0018 -0 159 0086 -0 693 -0091 1 165 -0 294 077C 0522 1046 0292 2016 0804
(a) The EWh4A-
x control scheme and a GWMA- x c( itrol scheme
2 control chart
(b) The GWMA-
0.155 0182 0199 0210 0219 0225 0231 0235 0239 0242 0245 0247 0250 0251 0253
-0.155 -0182 -0 199 -0210 -0219 -0225 -0231 -0235 -0239 -0242 -0245 -0247 -0250 -0251 -0253
0.254 0.256 0.257 0.258 0.259
-0.254 -0.256 -0.257 -0.258 -0.259
x control chart
y,
0 10 0 00
1-1 020 0 30
J
Figure 2. The EWMA- 1 control scheme and the GWMA-
2 control scheme
576 5
Conclusion Remarks
This paper uses the GWMA control chart to monitor the sample median of a process. The ARL of the GWMA- 2 control chart is obtained through a simulation approach. Table 1 has shown the comparison results between the GWMA- j control chart and the EWMA2 control chart. Castagliola"] has showed that the E W M A - j control chart is more efficient than the Shewhart 2 control chart in detecting small shifts of the process median. In this paper, we show that the GWMA- 2 control chart is superior to the EWMA-2 control chart in detecting the small shift of the process median. When the process median is out of control ( 6 > 0), and when the parameter a < 1, the GWMA- j control chart will reduce the type I1 errors. Therefore, if the user prefers the 1control control chart in detecting the small shift of the process median, then the chart to GWMA- 2 control chart is the best among these three median control charts.
x
References
P. Castagliola, Int'l J. Relia., Quali. andsafe. Engin. 8, 123 (2001). S. V. Crowder, J. Quail. Tech. 21, 155 (1989). S. V. Crowder and M. D. Hamilton, J. Quali.Tech. 24, 12 (1992). 4. J. M. Lucas and M. S. Saccucci, Technomefrics.32, 1 (1990). 5. J. F. MacGregor and T. J. Harris, J. Quail. Tech. 25, 106 (1 993). 6. C. H. Ng and K. E. Case, J. Quail. Tech. 21,242 (1989). 7. S . W. Roberts, Technometrics. 1,239 (1959). 8. S . M. Ross, A Course in Simulations. Macmilan Pub. Co. (1990) 9. S. H. Sheu, and T. C. Lin, Quail. Efigin. 16,209 (2003). 10. A. L Sweet, IIE Trans. 18,26 (1986). 1. 2. 3.
SAFETY-INTEGRITY LEVEL MODEL FOR SAFETY-RELATED SYSTEMS IN DYNAMIC DEMAND STATE
I.YOSHIMURA, Y.SAT0 AND K.SUYAMA Tokyo University of Marine Science and Technology, 2-1-6, Etchujima, Koto-Ku, Tokyo, 135-8533, JAPAN E-mail: [email protected]. ac.jp ,[email protected]. ac.jp Recently computer systems have been widely applied to safety-related systems for achievement of safety functions. This general trend forced IEC to compile IEC 61508 as a standard related to functional safety of electrical/electronic/programmable electronic safety-related systems, i,e., E/E/PE SRS (SRS). In accordance with the standard, an SRS is specified with its safety function(s) and safety integrity level(s) (SIL) and the SILs to be allocated to the SRS are specified with four levels of safety integrity. The standard requires assessing the risk reduction achieved by SRS using appropriate probabilistic techniques for allocation of SILs to SRS. However, the relationships among SILs, operation modes and hazardous event rate are not always cleared up yet. This paper presents a new Markov Model to describe causation of hazardous events in the overall system composed of equipment under control (EUC), EUC control system (BCS) and SRS. The SRS is assumed to implement a safety function in a dynamic demand state and assumed to have no automatic self-diagnosis functions. Then, the relationship among a dangerous undetected failure of SRS, demands for actuation of SRS and hazardous events brought about in the overall system is formulated based on the model. Moreover, new demand modes of operation and estimations of hazardous event rate are proposed for allocation of SILs t o SRS.
1. Introduction
Recently computer systems have been widely applied to safety-related systems for achievement of safety functions. This general trend forced IEC to compile IEC 61508 as a standard related to functional safety of electrical / electronic / programmable electronic safety-related systems (SRS) Currently Japanese Industrial Standard (JIS) includes the translated standard of IEC 61508, JIS C 0508'. These standards are applied to various field^^,^. In accordance with the standard, an SRS is specified with its safety function(s) and safety integrity level(s) (SIL(s)). SILs are currently defined in terms of either the probability of failure to perform its safety function for low demand mode of operation or the probability of dangerous failure per unit time for high-demand or continuous mode of operation. Moreover, the SIL(s) to be allocated to the SRS are to be specified with four levels of safety integrity. However, the relationships among SILs, operation modes and hazardous event rate are not always cleared up yet.
'.
577
578
I
Uverall system 1-
1-
r
E/E/PE SRS Other technology safety-related system demand ERRF
1
I
Implementation of safety function Figure 1. Total system and safety-related system
In the present paper a new Markov model is introduced and quantified in order to describe causation of hazardous events in the overall system and to estimate hazardous event rate in the dynamic demand state of SRS. 2. The overall system
In many fields such as manufacturing, transportation and process industries, the overall system is typically composed of an equipment under control (EUC), EUC control system (i.e., basic control system: BCS), SRS(s), other technology safetyrelated system(s) and external risk reduction facility (hereinafter ERRF') as shown in Figure 1. The BCS controls EUC in order to prevent hazardous events or other undesirable events from arising. The SRS, other technology safety-related system and ERRF are redundancies of the safety function(s) of BCS. Here the followings are postulated: (i) the overall system is composed of an EUC, BCS and SRS only, (ii) the SRS implements one safety function and has no automatic self-diagnosis functions, (iii) while a proof test (PT) is carried out for the SRS, the operation of EUC is stopped in order to keep it in a safe state in which no demand arises. Thus, Postulate (iii) makes the stochastic process of the demands dynamic. In order to analyze the causation of hazardous events, the following logics are specified: (1) The SW fails at first and the resultant failed-state (i.e., a dangerous undetected fault) continues until a demand arises. This finally leads to a hazardous event. (2) A demand arises at first and the demand state continues until the SRS fails. This brings about a hazardous event. Nomenclature
579
1-m Figure 2.
State transition model for hazardous events between proof tests (PTs)
[l/hour] probability that a demand occurs per unit time at time t , given the system is not in a demand state at time t (demand rate) pd [l/hour] probability that the demand state recovers per unit time at time t , given the system is in the demand state at time t (completion rate) A, [l/hour] dangerous failure rate of SRS (hereinafter, failure rate) T [h] time between proof tests m [l/hour] probability that state D (see Figure 2) recovers per unit time at time t , given the system is in the state D at time t (recovery rate) P*(t) probability that the system is in state * at time t (where * implies state A, B, C or D in Figure 2 ) P * ( s ) Laplace transformation of P * ( t ) w ( t ) [l/hour] statistically expected number of occurrences of hazardous event per unit time at time t (i.e., hazardous event rate) w* [llhour] average hazardous event rate between two PTs, i.e. average of w ( t ) by T Ad
3. Stochastic model of the system
This is derived under the following assumptions (see Postulate (i)): (1) The demands on and failures of SRS are mutually statistically-independent. (2) The occurrences and completions of demand can be modelled by exponential distributions with demand rate A d and completion rate P d , respectively. (3) Failures of SRS can be modelled by an exponential distribution with failure rate A,, and the fault resulting from the failure continues until the next P T is carried out or until a hazardous event occurs. Namely, any failure brings
580 about a dangerous un-detected fault (hereinafter, a DU fault: see Postulate (ii)).
If a hazardous event happens, then the overall system is recovered according to an exponential distribution with recovery rate m (here, m + 00 implies immediate recovery and m = 0 does no recovery.)
w* is sufficiently smaller than unity, w* << 1 [l/hour]. Any DU fault is found out and repaired by the next PT. The overall system stops its operation and is maintained at PT. No demand occurs during proof testing (see Postulate (iii)). The overall system starts its operation after the completion of proof testing. The harm resulting from hazardous events is fixed. Based on the assumptions, the stochastic process of a hazardous event is modelled by Markov model as shown in Figure 2 . In the figure, states A, B, C and D mean the followings: State A : (non-demand-state, normal SRS) State B : (demand-state, normal SRS) State C : (non-demand-state, SRS in a DU fault) State D : (the state of maintenance after a hazardous event) The operation of overall system is repeated during the end of proof testing and the start of the next P T according to assumptions 1, 2 , 3, 5 and 9 (see Figure 2). If the overall system is in state A, B or C before the next PT, there is no hazardous event during the PTs. If, however, a demand arises in state C or SRS fails in state B, a hazardous event occurs and the overall system falls into state D. From state D, the overall system recovers to state A by recovery rate m and starts its operation again (see assumption 4). 4. Hazardous event rate between PTs 4.1. Basic formulation
Now, let t = 0 be the end of a P T as shown by Figure 3, then, the following simultaneous state-equations are set up:
dPA ( t )-
dt
(Ad
+
As)
P A (t) f p d P B (t) +mPD (t)
+A d P C ( t ) ,
dPD ( t )- mPo ( t )+As PB ( t )
dt
j
(1)
581
non-demand state
Figure 3.
Proof test (PT) and time between PTs
Through two causation logics in section 2, w ( t ) A t is given as
w ( t ) A t= Pr{ 1) when the overall system is in state B at time t SRS fails during (t,t+At],or 2) when in state C at time t a demand occurs during (t,t+At],given the overall system is in state A at t = O}.
(6)
Let above statements be replaced by following symbols:
Uo :overall system is in state A at time 0 , U1 :overall system is in demand state B at time t , U2 : SRS is in a DU fault, i.e. state C, at time t , U3 : a
demand occurs during ( t ,t +At], and
U4 : SRS
fails during ( t ,t +At].
Then,
u1nuznu4)u (U, nu, nU2nU 3 )} .
w ( t )At = Pr { (Uon
(7)
nuz
Where (UOn U1 nU4) and (UOnu, nU2 n U 3 ) are mutually exclusive. Moreover, U1 and U4, U2 and Us are mutually statistically-independent, and Pr{Uo} = l according to assumptions 7 and 8. Therefore, w(t)
xsPB(t)+XdPC(t).
And w* is
4.2. Immediate recovery from state
D
W ( t )is given as ,.t
W ( t )=
j0w(t)dt
(8)
582 Laplace transformation is applied to equations (1)-(5), (9) and (10) with the following initial conditions:
{ Moreover, m to A. Therefore,
-+ DO
t s o
w(t)=O,
PA ( 0 )= 1, Ps (0)= Pc ( 0 )= Po ( 0 )=0.
(11)
when the overall system recovers immediately from state D
Where
The average hazardous event rate between two PTs, w * , is
5. Demand modes and average hazardous event rate
If parameters such as Ad, Pd are known, w* can be estimated by equation (13). However, comparison of parameters, i.e., T , Ad, pd, makes it easier to estimate w * . This approach is shown in Table 1. In this table, the estimation of w * is categorized into 6 formulas and the demand mode is classified into 9 specific modes. The formulas for estimation of “low demand rate and short duration”, “high demand rate and short duration”, “medium demand rate and long duration” and “high demand rate and long duration” are consistent with those obtained by Kato, Sat0 and Horigome for the steady demand state The procedure in allocating SILs for SRS using Table 1 is as follows:
’.
(1) Calculate hazardous event rate of the overall system without SRS (this rate is equal to demand rate Ad), (2) Decide a tolerable hazardous event rate, w:, in accordance with, for example, ALARP, (3) Compare the hazardous event rate Ad of w:, and if the former is larger than w:, decide safety function(s) and SIL(s) to be allocated SRS(s) in order to achieve the necessary risk reduction, and
583 Table 1. New modes of operation for allocation of SILs for dynamic demand-state SRSs without self-diagnosis Modes of operation Low demand rate and short duration (Ad << 1/T and >> 1 / T ) Low demand rate and medium duration (Ad << 1/T and /&jN 1 / T ) Low demand rate and long duration (Xd<> 1 / T ) Medium demand rate and medium duration (Ad 'VWd 1 / T ) Medium demand rate and long duration (Ad N 1/T and p d z o ) High demand rate and short duration (Ad >> 1/T and p d >> 1 / T ) High demand rate and medium duration ( h > > l / T a n d PdNl/T) High demand rate and long duration (&>>l/Tand FdNO)
w* [ l / h @ u T ]
0.5AsAdT 0.87AsAdT
AsAdT 0.37A,
0.57As
As
Note: T h e condition of As, 0 < AsT<
(4) Then, the failure rate of SRS, A,, is calculated from Table 1 using other parameters. This A, should satisfy the SIL allocated to the SRS for risk reduction of the overall system.
6. Conclusions In order to carry out IEC 61508, it is necessary to formulate the relationships between the rate of both demands on SRS and hazardous events. In the paper, at first, the causation of hazardous events in a typical system, where SRS has no automatic diagnostic function and a hazardous event occurs between PTs only, is described by a Markov model. In this model, the stochastic process of demands becomes dynamic, since EUC is stopped during its proof-testing. Then, the relationships among the failure of SRS, time between PTs, demand rate, completion rate and W * are formulated based on the model. In this formulation, the assumption of recovery rate, m + m, is used: this assumption makes w * maximum in all cases of m. However, this w* is enough to implement the standard because of the standard's requirement for A,, i.e. 0 < A, < l o p 5 [l/ho'u.~],and of considering of general proof test interval of T = lo4 [h].Thus, the present paper concludes that equation (13) makes reasonable estimates of hazardous event rate between proof tests. Moreover, an approach for easy estimation of w* is yielded by using the relation-ships among A d , p d and T .
584
References 1. IEC 61508, 1998, IEC, Geneva. 2. JIS C 0508:1999, Japanese Standards Association (in Japanese). 3. M.Yamashita, Y.Tanabe, K.Ohrui, Y.Sato, Safety Integrity of SRSs with Common Components, Procs. the 5th International Conference on PSAM 2000, vol.1, pp.473479, Osaka, Japan, Nov. 2000. 4. T.Kawahara, T.Kushibiki, K.Tsuboi, Y.Sato, Safety-Integrity of Safety-Related System with Human Beings, Procs. the 5th International Conference on PSAM 2000, vo1.4, pp.2411-2417, Osaka, Japan, Nov. 2000. 5. E.Kato, Y.Sato, M.Horigome, Safety-Zntegrity Levels Model for Draft IEC 61508Functzonal Safety, The transaction of the institute of electronics, information and communication engineers, R99-18, (1999) (in Japanese).
WARRANTY STRATEGY ACCOUNTS FOR PRODUCTS WITH BATHTUB FAILURE RATE
SHIUEH-LING YU* AND SHEY- HUE1 SHEU Holistic Education Center, St. John’s and St. Mary’s Institute of Technology, Tamsui, Taipei, Taiwan, R. 0. C . * and Department of Industrial Management, Nafional Taiwan University of Science and Techno[ogy, Taipei, Taiwan, R. 0. C. r1t.u~~mciil.rismir. edu.IW* and shsheii~G2im. niii.vi. d i i . ttc Abstract-This investigation presents a novel repair-replacement warranty strategy. The strategy involves splitting the warranty period into two intervals in which only minimal repairs can be undertaken, separated by a middle interval in which no more than one replacement is allowed. The cost of the i-th minimal repair at age y depends on the random part C ( y ) and the deterministic part ~(y). Distribution functions of failure rates, with a bathtub shape over the lifetime of the product, are also considered. The form of the optimal repair-replacement strategy that minimizes the expected cost of servicing the warranty over the warranty period is discussed as well.
1. INTRODUCTION A warranty is a contractual agreement between a manufacturer and a consumer, and requires the manufacturer to rectify all failures that occur within the warranty period. Under a free replacement warranty, no charge is made to the consumer for these rectifying actions, which can be either repairs or replacements with new products. The choice between repair and replacement is made by the manufacturer and depends on the related costs, the lifetimes of repaired and new products, and the time until the end of the warranty period when the failure occurs. A manufacturer must devise a maintenance strategy that minimizes the cost of meeting obligations under servicing the warranty. Blischke and Murthy [ 1,2] summarized optimal warranty servicing strategies that minimize the expected warranty cost. As in Biedenweg [3], the warranty period is split into a replacement interval followed by a repair interval. This strategy is based on the idea that replacements close to the end of the warranty are not in the interest of the manufacturer. Nguyen and Murthy [4,5]continued with the idea of splitting the warranty period into distinct intervals for repair and replacement. The first warranty servicing model, involving minimal repair and assuming constant repair and replacement costs, was that of Nguyen [6]. Jack and Murthy [7] recently presented a warranty servicing strategy involving minimal repair and replacement. The strategy splits the period into two intervals during which only repairs are performed, separated by a third interval in which at most one replacement is made. For intermediate values of cost ratio of replacement versus repair, the expected cost of the new static strategy compares favorably with that of the optimal dynamic strategy, as determined by Jack and Van der Duyn Aschouten [8].
585
586 Most of the studies cited here assume that the failure rate function of products increases with the item’s age. Item failure mechanisms and failure phenomena have been studied [9]. The products life cycle includes generally three phases of failure rate. The three phrases are represented as a pattern of bathtub curve. Hence, this study applies a bathtub-shaped failure rate to describe the new warranty servicing strategy. The cost of the i-th minimal repair at age y is g ( c ( y ) , c , ( y ) ) ,where C ( y ) represents an agedependent random part of the cost; c,( y ) is a deterministic part of the cost, which depends on the age and the number of minimal repairs; C,( y ) is non-decreasing in i , and g is a positive non-decreasing and continuous function. The general warranty servicing strategy proposed herein again involves splitting the warranty period [o, w ] into three distinct intervals for performing repairs and making a replacement. Setting g(c(y),c,( y ) )= C, , and t, = 0 in our Theorem 1 and 2, this was the case considered by [7]. Section 2 describes the model. Section 3 analyzes the model. Section 4 presents the form of the optimal strategies. 2. MODEL FORMULATION Let F ( t ) denote a distribution function of a new product lifetime X , with density
f ( t ) on
[o, a). The failure rate function of F( t ) , ~ ( t is) ,defincd by r ( t ) = fo F(t)
where F(t)= 1- F ( t ) is the survival function of X , and ~ ( t=) 0
f(U)
du is the
cumulative failure rate function of X . This study concentrates on distribution functions with a bathtub-shaped failure rate function r ( t ) ,defined as follows. DEFINTION. A function r ( t ) defined on R+ = [O,m) is said to have a bathtub shape if there exist 0 5 t, 5 t, < m such that r ( t ) is strictly decreases on [ O , t , ) , is constant on
[t,,t,] and strictly increases on (t2,w),where t, and t, are called the change points of r(t>. A repairable product sold with a non-renewing free replacement warranty of period , which requires the manufacturer either to repair or to replace the product when it fails, is considered. The maintenance strategy is characterized by the two parameters K and L where 0 5 K I L I W , and is defined in the following way. 1. Any product failure occurs in the interval (0, K ) is rectified by minimal repairs. 2. During the first failure in the interval [ K ,L ] the failed product is replaced with a
w
new one and any subsequent failures in this interval are minimally repaired. 3. Any failure during the period [ L ,W ] is always minimally repaired. 4. The cost of the i-th minimal repair at age y is g(C(y),c, ( y ) ), where C ( y ) is an age-dependent random part; c,(y) is a deterministic part, and depends on the age
587
and the number of minimal repairs;
C, (
y ) is non-decreasing in i , and g is a
positive non-decreasing and continuous function. 5. c, is the cost to replace a product. This ( K ,L ) strategy thus divides the warranty period into two repair intervals, separated by amiddle interval in which no more than one replacement is carried out. The parameters L, K are unknown parameters, which determine the repair-replacement warranty strategy. The following hypotheses are required: (1) all product failures are detected immediately and result in immediate claims by the consumer, (2) all claims are valid and must be rectified by the manufacturer either by minimal repair or replacement, (3) repair and replacement times are small relative to the mean time between item failure and therefore can be ignored, (4)the distribution of the random part C ( y ) of the minimal repair of the system at age y is supposed to be known with finite mean
E[C(Y)l. We also require the following extended result of a lemma in Block et al. [lo]. Lemma 1 is shown by mimicking the proof of the corresponding lemma in Block et al. [ 101. LEMMA 1. Let { N(t),t 2 0 ) be a non-homogeneous Poisson process with intensity r ( t ) ,t 2 0 and ~ ( t=)E [ N ( ~ )=]
j‘r(y) d y . Denote the successive arrival times by 0
S,,s,;.. . Assume that at time si a cost of g(C(S,),c,(S,))is incurred. Supposed that C ( y ) at age y is a random variable with finite mean and g is a positive, nondecreasing and continuous function. If A(t) is the total cost incurred over [o, t ) , then E [ A ( t ) ]=
I,’ h ( z )
Y(Z)
dz ’ where h ( z )= E ~ ( [zE)c ( z [) g ( c ( z ) , c ~ ( , ) + , ( z )is ) ]the ]
expectation with respect to the random variables C ( z ) and N ( z ) . PROOF. See Ref. [ll]. ( ~ 1 3 2 ) . 3. MODEL ANALYSIS The expected total warranty servicing cost per item is represented as a function of L and K , J ( K ,L ) . The objective is to determine the optimal K and L that minimize J ( K , L ) , subject to the constraints 0 I K I L I . An expression for J ( K , L ) is devised by subdividing the expected total warranty servicing cost per item J ( K ,L )
w
over [0, W ] into three parts. Consider a non-homogeneous Poisson process { ~ ( t ) ,2t0) with intensity r ( t ). The random minimal repair assumption and Lemma 1 yield the expected repair cost over the interval [O,K),
jK h ( z ) r ( z )d z . The expected cost over 0
the remaining interval [ K ,W ] depends on whether the first failure occurs after K . Let
Y be the time at which the first failure occurs after K ,and let the successive arrival Sl;.. . If Y lies within the interval [ K , L ), then the expected times after Y be
s,‘,
5aa
($)I
warranty servicing cost during the remainder of the warranty period, conditional on NIW-Y)
g(c(,y,'), c,
since the failure results in replacement by a
new item. If Y lies beyondL , then the age of the product at L is L and the conditional expected repair cost during the remainder of the warranty period is,
1
E [ Y)g(C(S,'),c,(S,')) . (In here, we sets,' = Y ) .Removing the condition and the ,=I
three cost terms yield J ( K ,L )
Define R h ( t )= J i h ( z )r ( z )dz and G(t)= h(t)-c, +Rh(W)-Rh(t)-R,(W-t).
(2)
Thus, the following equation is obtained
However
4. MAIN RESULTS
The following results are required to prove that the warranty strategy is optimal. LEMMA 2. Suppose that the failure rate function r ( t ) has a bathtub shape with change points t, and t, . Furthermore t,, t, and r ( t ) satisfy,
(c,)2 t , I t l + t , I W 1 2 t , 5 2 W
and
(C,) r ( t ) - r ( ~ - t ) son~ [ O , t , ) , h ' ( t ) - h ( t ) r ( t ) + h ( W - t ) r ( W - t ) ~ Oon [ W / 2 , W ] ,h ( t ) is non-decreasing on [O,W].
589 Then, the function G ( t ) on
[o, w]is non-decreasing on [0,W / 2 ] ,and non-increasing
on [W/2,W ]. Moreover,
G(t)reaches its maximum value h( W / 2 )- c, + Rh( W )- 2 4 ( W / 2 ) at the point t = W/2 . PROOF. Differentiating G(t) yields G’(t)= h’(t)- h(t)r(t)+ h(W - t ) r ( W - t ) for all t E [0, W ].
The proof is separated into the following cases. Case (1). 0 I t < t, . The function h(t) is non-decreasing on [0,W], so
(6)
h‘(t) 2 0.
Notably, W - t 2 t , h ( W - t ) ~ h ( t and ) r ( t ) - r ( W - t ) ~ Oon [O,t,) byassumption, implying G’(t)2 0 . Case(2). t, I t < W / 2 . The function h(t) is non-decreasing on [0, W] and W - t 2 t , so
h(W - t ) 2 h ( t ). Notably, W -t 2 w/2 2 t, and r ( t ) is non-decreasing in t for t 2 t, , implying r(W - t ) 2 r(t,)= r ( t ). Hence G‘(t)L 0 . Case(3). W / 2 I t I W . Follows (
c,), G’(t)5 0 on [ W / 2 ,W ]. Accordingly, the desired
results are obtained. LEMMA 3. Suppose that the failure rate function r ( t ) has a bathtub shape with change points t, and t, . Moreover, t,,t, and r ( t ) satisfy,
(c,) 2t, I t , + t, I W 5 2t2 I 2W
and
( C , ) r ( t ) - r ( W - t ) S O on [o,t,),r ( t ) - r ( W - t ) > O on [t2,W],and h(t) isnondecreasing on [0, W ] such that h(t) is constant on [W/2, W ]. Then, the function G ( t ) is non-decreasing on [o, W / 2 ] ,and non-increasing on [ W / 2 ,W ]. Moreover, G( t )reaches its maximum value h(W/2)-cr +Rh(W)-2R,,(W/2) at the point t = W / 2 . PROOF. The proof is separated into the following cases. Case (1). 0 I t I W/2. As the same as the proof in Lemma 2, it is clearly. Case(2). W/2 I t < t, . The function h(t) is constant on [ ~ / 2W, ] by assumption, so
h’(t)= 0 and W --t
I W - W/2
=W / 22
t , implying h(W
t, < W - t, I W - t I W - W / 2 I t I t, and
-
t ) I h ( t ). Notably,
r(t) is constant on t E [t,,t 2 ], implying
r ( t ) = r(W - t ) . Hence, G’(t)I 0. Case(3). t, I t I W . The function h(t) is constant on [W/2, W ] by assumption, so
h’(t) = 0 . Notably, 0 I W - t < W - t, I W - W/2 = W/2 I t, I t and h(t) is nondecreasing in t , implying h(W - t ) I h(t) . Also r ( t ) r(W t ) 2 0 on [t,,W ] by ~
assumption, r(W - t ) I r ( t ) hence G’(t)2 0. Accordingly, the desired results are obtained.
~
590 Now, Theorem I , one of the main results in this study, can be proven. - 2Rh( W / 2 )< c,. and the failure rate THEOREM 1 . Suppose that h(W/2)+ Rh(W)
function r ( t ) satisfies (
c,)-(c,) or ( c,)-(c,) is a bathtub-shaped function with
change points t, and t, . Then, L* = K’ , and the optimal strategy is “always to repair” with J ( K * , K ‘ ) = R h ( W ) . PROOF. First the optimal
K
K is fixed and L*( K ), the optimal L as a fimction of K , is found. Then, is obtained by minimizing J ( K , L * ( K ) ) .From Lemma 2 or 3,
G(t)reaches its maximum h(W/2)-cr + Rh(W)-2Rh(W/2) at t = w/2. For each K E [O, W ], L*( K ), the value of L E [K , W ]which minimizes J ( K ,L ) , is found. Differentiating (5) with respect to L yields
and
If h(W/2)+ Rh( W )- 2Rh( W / 2 )< C, then the maximum of C(X) is non-positive,
G(L)IO and dJ(K,L)/dL20,VL~[K,~].Hence,foreach K E [ O , W ] J, ( K , L ) is non-decreasing in L , and so L’ ( K ) = K . Therefore, K’ can be any value in the interval (0,W ] ,L* = K’ , and the optimal service strategy is “always to repair” with
J ( K ’ , K’) = R,(W) . The theorem is proven. LEMMA 4.Suppose that h(W/2)+ Rh(W)-2R,(w/2) >c,. > h ( W ) and the failure rate function r ( t ) is a bathtub-shaped function with change points
t, and t, . Moreover,
r ( t ) satisfies (c,)-(c,) or (c,)-(c,). Then, the equation G(t) = 0 has two roots a E (0, W / 2 ) and b E (W/2,W ). Additionally, the function H(K)=-j:G’(t)F(t)dt, K E[O,b]
(9)
satisfies the following ( R, ) H ( K ) is non-decreasing on [0, W / 2 ] ,non-increasing on [ W / 2 ,b]and reaches its maximum at K = W / 2 , ( R, ) H ( a ) 5 0, H(W/2)> 0, H(b)= 0 .Therefore, the equation H ( K ) = 0 has two
[o,
roots in b],one at b and the other at c E [a,W / 2 ]. PROOF. Suppose that h(W/2)+ R,(W)-2Rh(W/2) > C, > h ( W ) ;from Lemma 2 or 3, the maximum value of
G(t) is positive, G(0) = -c, < 0, and G(W)= h(W)-c,. < 0 .
591 Therefore, the equation G ( t )= 0 has two roots, a E (0,W / 2 ) and b E ( W / 2 ,W ). Differentiating (9) yields H ’ ( K ) = G ’ ( K ) F ( K ) on [O,b]. ff’ has the same sign as G‘. Lemma 2 or 3 thus yields the desired result ( R ,). By definition of H ( K ) , and from the
lbG’(t)F(t)dt 0 , H ( W / 2 ) 1’’ G’(t)F(t)dt> 0 , and,H(a) =-I G‘(t)F(t)dt -5’ F(t)dG(t) 1 G(t)d F ( t ) I 0 . sign of G’(t), H ( b )= -
=
=-
b
h
Wl2
=
=
h
Therefore, the desired result (R,) is again obtained. Accordingly, the following theorem is inferred. THEOREM 2. Suppose that ~ ( w / ~ ) + R , ( w ) - ~ R , ( w>/ ~C, ) > h ( ~and ) the failure rate function ~ ( tis)a bathtub-shaped function with change points t, and t, . Moreover,
c,
r ( t ) satisfies ( c,) - ( C, ) or ( c,) - ( ). Then, K’ E (a, w/2),L* E (W/2, W ),and the optimal strategy is to use the new ( K ,L ) strategy with J(K’,L*) < R h ( W ) . PROOF. Let K be fixed and the optimal L as a fimction of K , i.e., L* ( K ), is found. Then, the optimal K is obtained by minimizing J ( K , L*( K ) ). By Lemma 4, the equation G ( t ) = 0 has two roots, a E (0,W / 2 ) and b E (W/2,W) . Now, the proof is separated into the following cases. Case(1) 0 I K I b . From (7) and (8),
dJO=0 and 8 Z J ( K 9 b>) 0 , so L*(K)= b . 8L2 [ K , w ] , so L*(K)= K .
8L
Case(2) b < K 5 W . G ( L )< 0 and aJ(K,>0, v~ 8L In case ( 1 ) to , the value of K E [0, b] which minimizes J ( K ,b) , is now found, and then J(t,,b) is compared to J ( K , K ) = Rh(W).In order to complete the proof, we have a J ( K b, and to calculate L 8K
H(K ) =-
d2J(K7b)
. First, note that
8K2
1: G‘(t)F(t) dt = h ( K ) F ( K ) h(b)F(b) 1; h( W -
-
- t ) r (W - t ) F ( t )dt. (1 0)
From (5) yields I J ( K , b) = R h ( W ) +--{F(K) F(K)
[ h ( K ) - G ( K ) ]-F(b)[h(b)- G(b)]- j:F(y) h(W - y ) r(W - y ) dy
Differentiating the above with respect to
K yields,
aJ(K,b) - r(K -) H ( K ) and aK F(K)
d 2 J ( K , b )- r ( K ) H ’ ( K ) + [-r ’ ( K ) + r Z ( K ) ] H ( K )
(1 1)
F(K) From Lemma 4-( R, ), H ( K ) = 0 has two roots: b E (W/2,W) and c E [a,W / 2 ] , whereH’(b) < 0 and H’(c) > 0 . By (1 1) and (12) we have
8K2
1.
592
Therefore, to = c is allowed to be the minimizing value of J ( K , b ) for K
E [O,b]
.
From (5) and ( 1 0) 1 J(t0,b)= R h ( W )+ --{F(t,)h(t,)
F(t0)
= R, ( W )+
-{ 1
F(t0)
- K t o ) G ( t o )- F ( b ) h @ )-
j , h -Y P ( W -Y)F(Y)dY)
H(t,) - F(t,)G(t,)) = R, ( W )- G(t,).
(1 2)
The fact that to = c E [a, W / 2 ] ,G is non-decreasing in (0, W / 2 ) and G ( a )= 0 imply
G(to)2 0 . Therefore, J(t,,b) = Rh(W)- G(to)i R h ( W )= J ( K ,K ) . Thus J(t,,b) is the minimizing value of J ( K ,L ) for 0 5 K I L I W , where
K' = to E (a, W / 2 ) and L'
= b E (W/2,
W ),completing the proof of the theorem.
REFERENCES 1. W. R. Blischke and D. N. P. Murthy, Warranty Cost Analysis, Marcel Dekker: New York, (1994). 2 . W. R. Blischke and D. N. P. Murthy, Product Warranty Handbook, Marcel Dekker: New York, (1 996). 3. F. M. Biedenweg, Warranty analysis: consumer value vs manufacturers cost, Unpublished PhD thesis, Stanford University, USA, (1 981). 4. D. G. Nguyen and D. N. P. Murthy, An optimal policy for servicing warranty, Journal of the Operational Research Society 37, 1081-1088, (1986). 5. D. G. Nguyen and D. N. P. Murthy, Optimal replace-repair strategy for servicing items sold with warranty, European Journal of Operational Research 39, 206-212, (1989). 6. D. G. Nguyen, Studies in warranty policies and product reliability, Unpublished PhD thesis, The University of Queensland, Australia,( 1984). 7. N . Jack and D. N. P. Murthy, A servicing strategy for items sold under warranty, Journal of the Operational Research Society 52, 1284-1288, (2001). 8. N. Jack and F. A.Van der Duyn Schouten , Optimal repair-replace strategies for a warranted product, International Journal of Production Economics 6 7 , 9 5 100, (2000). 9. W. Kuo and Y. Kuo, Facing the headaches of early failures: a stateeof-the-art reviews of bum-in decisions, Proceeding of the IEEE 71, 1257-1266, (1983). 10. H. W. Block, W. S. Borges and T. H. Savits, A general age replacement model with minimal repair, Naval Research Logistics. 35, 365-372, (1988). 11. Shey-Huei Sheu, Optimal block replacement policies with multiple choice at failure, Journal Applied Probability 29, 129-14I , (1 992).
CALCULATING EXACT TOP EVENT PROBABILITY OF A FAULT TREE
T. YUGE , K . TAGAMI AND S. YANAGI Dept. of Electrical and Electronic Engineering, National Defense Academy, 1-10-20 Hashirimizv, Yokosuka, 239-8686, JAPAN E-mail: [email protected] An efficient calculation method to obtain an exact top event probability of a fault tree is proposed when the minimal cut sets of the tree model are given. The method is based on the Inclusion-Exclusion method. Generally, the Inclusion-Exclusion method tends to get into computational difficulties for a large scale fault tree. We reduce the computation time by enumerating only non-canceling terms. This method enables us to calculate the top event probability of a large scale fault tree containing many repeated events.
1. Introduction Fault trees are used widely as system models in quantitative risk assessments. Although obtaining the exact top event probability is an important analysis in the assessments, it is a difficult problem for a reasonably large scale system with complex structure, such as a chemical plant, a nuclear reactor, an airplane and so on. The main factor in this difficulty is the existence of repeated events. If there are no repeated events, a bottom-up algorithm' can be used to obtain the top event probability. In this case, even if the scale of the fault tree becomes large, the analysis is simple. For the trees with repeated events, many researchers have proposed efficient algorithms to obtain exact or approximate top event probabilities'. The proposed methods are classified roughly into two groups. One approach for this problem is using a factoring method3y4 in order to decrease the number of repeated events. Dutuit et al. proposed an efficient algorithm, named linear time algorithm', to search modules and the module top events in a fault tree. A module is an independent subtree whose terminal events do not occur elsewhere in the tree. Finding the modules and their module top events reduces the computational cost. The factoring algorithm^^,^ are adopted in order to reduce the tree by application of Bayes' formula and to reduce the problem to the computations of fault trees containing less number of repeated events. In this case, it is an important problem to decide which events should be selected for factoring. The other is using the Boolean function. In this approach, the main effort is to find the structural representation of the top event in terms of the basic events. Find-
593
594
ing the minimal cut sets is one way of accomplishing this step. Several algorithm^^,^ to find minimal cut sets are proposed. After all minimal cut sets are found, the inclusion-exclusion method is used to calculate the exact top event probability or its upper and lower bounds. For a large scale fault tree, however, the Boolean approach tends to get into computational difficulties. Because large trees lead to a large number of Boolean indicators which must be examined, computational time becomes prohibitive when all combinations of Boolean indicators are being investigated. A truncation method7 is useful to obtain the upper and lower bounds of top event probabilities for a large scale system. The Boolean indicators with low probability are truncated in order to reduce the computation time. This method is effective when the basic event probabilities in a tree are small enough. However if the tree contains events with large probabilities, events characterizing human error and some natural phenomena are typical examples, this truncation method is inappropriate and leads to erroneous results. In this paper we present a method for calculating the exact top event probability for a large scale fault tree containing many repeated events when all the cut sets are given. In this case the top event is represented by using cut sets. And the exact top event probability is given by the inclusion-exclusion expression. The main idea in our method is expressing the top event probability by using only non-canceling terms. The conventional inclusion-exclusion expression contains many canceling terms which do not contribute to the top event probability. If we can enumerate only non-canceling terms, the computational effort to calculate top event probability is reduced substantially. This method is an application of Satyanarayana’s topological formula’ proposed for the analysis of network reliability. An efficient algorithm t o generate only non-canceling terms is presented. And some numerical examples are shown. 2. Notations
T,
: top event m : number of basic events in a fault tree n : number of minimal cut sets in a fault tree ei : i-th binary basic event, i = 1 , 2 , . . . ,m p , : probability of occurrence of ei Ci : i-th minimal cut set, i = 1 , 2 , . . . n Ci : event that all basic events in Ci occur, i = 1’2, . . . n 1, : nonempty subset of {Cl,( 2 2 , . . . ,C,}, i = 1 , 2 , . . . 2, - 1 Ji : set of basic events, Ji = { e k l e k E Cj,C, E 1i} Ei : distinct sub set of {el, e 2 , . . . , e m } , Ei is formed by basic events belonging to nonempty sub set of {Cl,C2,. . . ,C,}. (The definition of Ei is almost same to that of J,. Although it is possible that J , = J j for any i and j , Ei is quite distinct from the other.) ai(b,) : number of ways of forming the sub set E, by the union of an odd (even)
595
number of minimal cut sets
di : ai - bi 1x1 : cardinality of set IC Pr{z} : probability of occurrence of event x P ( x ) : joint probability that all the basic events in set x occur
3. Top Event Probability When a fault tree model of a system is given, its T, can be represented as the union of C?i of the system.
Using the inclusion-exclusion rule, the exact value of Pr{T,) can be calculated.
For a large scale tree, however, this method might not be computationally feasible because the number of terms in Eq.(2) increases exponentially with n. However, Eq.(2) is equivalent to Eq.(3) that is represented by using P ( E i ) .
here, di is the domination of sub set Ei. This is a coefficient of P ( E i ) considering the number of canceling terms.
Example Consider a fault tree with 4 cut sets:
The number of I s is 24 - 1(= 15) (see Table 1). However, the number of distinct E is 10 (see Table 2). Eqs.(l) to (3) are rewritten as Eqs.(4) to (6). Then Pr{T,} is given in Eq.(7). Table 2 shows all sub set and their corresponding I s and a,,b,, d, in the tree.
596 Table 1. Relation between I and J
1 2 3 4
5
6
7 8 9
10 11
12 13 14 15
Next theorem gives the relation between canceling terms.
Theorem 3.1. Consider two I s , I , and I p = I , U {C,}. If all basic events in C, belong t o J , too, P(1,) and P(1p) canceled each other.
597 Table 2. Ei and the coefficient of Pr{Ei}.
1 2
3 4
5
6 7 8
9
10
[Proof] Let JL be a set of basic events belonging to C,. R o m the conditions,
P ( J p ) = P ( J , U J:) = P(J,).
As P ( & )= P ( J i ) ,P(I,)
= P ( I p ) . Furthermore,
(-l)lI-l+lP(Ia) + (-l)lIfil+1P(Ip) = (-l)IIJ+l (P(I,) - P ( I p ) )= 0.
1Q.W The combination of I , and C, which satisfy Theorem 3.1 can be found by the following Lemma.
Lemma 3.1. Let D be a set of basic events whose elements belong t o more than one distinct minimal cut sets. And let JL = { e u l ,em*,. . . , e,, } be the set of basic events belonging to C,, and Kj = {CllCl 3 e W l ,C1 # C,}. If I , consists of cut sets, each of which is taken up f r o m K j , j = 1 , 2 , . . . , t , and i f Jk c D , then JL E J,. 4. Enumeration Algorithm
In this section, two algorithms are proposed in order to calculate top event probability. One algorithm generates sets Hih which express the relation of canceling. The other generates only non-canceling terms and calculates top event probability.
4.1. Inclusion Algorithm The combinations of I , and C, which satisfy Theorem 3.1 is enumerated by the following algorithm using Lemma3.1 and given as Hih. Here, Hih is a set of minimal cut sets and shows the relation of inclusion for the basic events belonging to Ci. Ci occurs automatically if all the minimal cut sets in Hih occur.
Step 1 Generate a set D whose elements belong to more than one distinct minimal cut sets. Set i = 1.
598 Step 2 Let J,' = { e t 1 ,e,,, . . . ,e,+} be a set of basic events in C,. If Ji C D , set j = 1 , l = 0. Else go to Step 6. Step 3 For j = 1 , 2 , . . . ,t , generate a set of minimal cut sets K,. The element ck satisfies ez, E CI, for any k except for i. Set r, as the number of selected minimal cut sets. Step 4 Generate sets of cut sets Hzhr( h = 1 , 2 , . . . , T I x T Z x . . . x rt) of which the j t h element is taken up from K,, j = 1 , 2 , . . . ,t , respectively. Step 5 Adjust the H,h. If there is some common elements in one Hzh or if H,, = H,,, ( u # v), delete the excess. Step 6 If i < n , set i = i + 1 and go to Step 2. 4.2. Enumeration Algorithm
In this algorithm, all non-canceling terms are enumerated by using Hih. Although I0 is not defined in section 2, it is used as an empty set in this algorithm. The policy of this algorithm is: From Ii, , generate all possible I which include Ii, and 1. If the generated I,, satisfies the condition of canceling, check the III = existence of the pairing 10 among the I with IIl = lIill, then delete both 12, and
+
10.
Step 1 Set i l = iz = 0 , j = O,I,, = 4,Pr{Te} = 0. Step 2 If i2 < il go to Step 5. Else set k = II,,I. Step 3 1f.k # 0, set j = max{s}, where C, E I,,. If k = 0 and il # 0 then il = il 1 go to Step 2. Step 4 If j < n, set z = j + 1 and do the following. Else i l = i l + 1 and go to Step 2.
+
+
Step 4.1 Set iz = iz 1, I,, = IzlU {Cz}. Step 4.2 If k 2 2 and there is an Ip that satisfies, /3 > i l , lIpl = k - 1 and then set I,, = I p = 4. I p for c,E HZh C I,,, H,h Step 4.3 If 2 < n, z = z + 1 and go to Step 4.1. Else i l = il + 1 and go t o Step 2. Step 5 Calculate the probabilities of nonempty I . If I, is an odd number, add the probability to Pr{T,}, else subtract it from Pr{T,}.
5. Numerical Examples The following two system models are shown in order to confirm the effectiveness of our method. The common assumptions in this section are m = n, pi = 0.01 O.O01i, i = 1 , 2 , . . . , m and all cut sets have k basic events.
+
(1) Consecutive model : A circular consecutive k-out-of-n:F system. Namely the cut sets are defined as C1 = {el,eZ,e3},Cz = {ez,e3, e4}, . . . ,c30=
599 Table 3. k
n
Computation time of consecutive models
Computation time (sec.) Proposed method
Number of terms
InclusionExclusion
Proposed method
InclusionExclusion’’
low3
2
30
7,508.5
1,988.1
8,574,957
109
3.58 x
3
30
83.7
1,976.0
545,279
109
3.95 x 10-5
4
30
6.5
2,442.1
88,501
109
4.34 x 1 0 - ~
5
30
1.5
2,660.0
24,031
109
4.77 x 1 0 - ~
4
35
408.9
21 hours*2
568,833
3.2 x 10”
5
35
49.4
24 hours*2
123,377
3.2 x lolo
5.57 x 1 0 - ~
5
40
729.3
1 month*2
634,495
1012
6.37 x 1 0 - ~
5.07 x 10-7
Note: *1: the number of terms for an Inclusion-Exclusion method is given as 2n - 1. +2: the expected computation time.
{eso, e l , e z } if k = 3, n = 30. (2) Random model : The basic events in every minimal cut sets are decided randomly by the random number. Naturally, all minimal cut sets are distinct and all basic events must appear at least once. This random model is simulated 50 times for one fault tree. Tables 3 and 4 show the enumeration results. In table 4, “ave” shows the average computation time and the average number of generated non-canceling terms for the tree model. And “min” (“max”) shows the shortest(1ongest) computaion time within 50 samples,and its non-canceling terms and top event probability. For any n, the larger the parameter k becomes, the more the tree becomes complex. As a result, the proposed method becomes effective because of the increase of canceling terms and the computation time becomes less. Table 4 shows that the computation time depends on the structure of a tree very much. The examples shown in this section are fairly complex because we add an assumption of n = m. However we think it is possible in real fault trees that the number of basic events is reduced as a result of a simplification technique which is done by merging two non-repeated inputs into a single one.
6. Conclusion
In this paper, we proposed a method to calculate exact top event probabilities of fault trees containing many repeated events. This method is effective especially for trees with complex structure. We showed the exact probabilities of fault trees with 30 to 60 repeated events in numerical examples. This enumeration method can be applied to obtain the approximated top event probability easily by truncating the terms with small probability. Ofcourse the approximated value is more accurate than that given by the usual Inclusion-Exclusion method. However, if the structure of a tree is not so complex the benefit is little. And if the number of repeated
Table 4. Computation time of random models

  k    n            Computation time (sec.)    Number of terms    Top event probability
  3   30   ave             18.7                    30,559          -
           min              0.001                      59          3.99 x 10^-5
           max            117.5                   110,597          3.97 x 10^-5
  3   40   ave*1        1,979.6                   217,714          -
           min*1            0.001                      79          5.3 x 10^-5
           max*1        7 hours                 1,543,055          5.3 x 10^-5
  4   30   ave              1.5                     2,445          -
           min              0.001                      59          4.39 x 10^-7
           max              3.7                     7,169          4.39 x 10^-7
  4   40   ave             36.1                    19,328          -
           min              0.001                      79          5.86 x 10^-7
           max            173.8                    67,429          5.85 x 10^-7
  5   40   ave             10.3                     2,870          -
           min              5.9                     1,569          6.44 x 10^-9
           max             18.1                     5,727          6.44 x 10^-9
  5   50   ave             90.4                    13,311          -
           min             13.5                     2,777          8.05 x 10^-9
           max            256.6                    45,857          8.05 x 10^-9
  5   60   ave            407.8                    31,999          -
           min              3.9                       203          9.66 x 10^-9
           max          1,406.3                    93,569          9.66 x 10^-9

Note: *1: the values are given from 20 samples.
For such a non-complex tree or a huge-scale tree, we believe that a combination of the factoring method and the method proposed in this paper is more effective.
A PERIODIC MAINTENANCE OF CONNECTED-(r,s)-OUT-OF-(m,n):F SYSTEM WITH FAILURE DEPENDENCE*
WON YOUNG YUN, CHEOL HUN JEONG, GUI RAE KIM
Pusan National University, San 30 Changjeon-Dong Kumjeong-Ku, Busan, 609-735, KOREA
HISASHI YAMAMOTO
Department of Production Information Systems, Tokyo Metropolitan Institute of Technology
This study considers a linear connected-(r,s)-out-of-(m,n):F lattice system whose components are ordered like the elements of an (m,n)-matrix. We assume that all components are identical and are in state 1 (operating) or 0 (failed), but that the failures of components are dependent. The system fails whenever at least one connected (r,s)-submatrix of failed components occurs. The purpose of this paper is to present an optimization scheme that minimizes the expected cost per unit time with respect to the maintenance period and the system parameters. To find the optimal maintenance period, we use a genetic algorithm in the cost optimization procedure, and the expected cost per unit time is obtained by Monte Carlo simulation. A sensitivity analysis with respect to the cost parameters is also performed.
1 Introduction
A linear (m,n)-lattice system consists of m·n elements arranged like the elements of an (m,n)-matrix, i.e., each of the m rows includes n elements and each of the n columns includes m elements. A circular (m,n)-lattice system consists of m circles (centered at the same point) and n rays; the intersections of the circles and the rays represent the elements, i.e., each of the circles includes n elements and each of the rays has m elements. The linear/circular connected-X-out-of-(m,n):F lattice system is defined by Boehme, Kossow and Preuss [1]. A special case of the connected-X-out-of-(m,n):F lattice system, and a generalization of the k²/n²:F system defined by Salvia & Lasher [3], is the linear/circular connected-(r,s)-out-of-(m,n):F lattice system. This is a linear/circular (m,n)-lattice system which fails if at least one connected (r,s)-submatrix of failed components occurs. The supervision system sketched in Figure 1a is a typical example of a connected-(r,s)-out-of-(m,n):F system. The knots represent (for example) TV cameras. Each TV camera can supervise a disk of radius c, the capacity of a TV camera, and the cameras in each row and column are of the same type and are a distance d from each other. The supervision system fails if an area inside the sketched square with sides (m−1)d and (n−1)d is out of observation. If the disk radius c equals the distance d, the system fails if (at least) two connected cameras in a row or a column fail, and we can express the rule as a connected-(1,2)-or-(2,1)-out-of-(3,3):F lattice system (see Figure 1b). If the disk radius c equals √2·d, the system fails if one of the operating stations fails, i.e., if a connected (2,2)-matrix of failed elements occurs (see Figure 1c).
* This work was supported by grant No. R05-2002-000-00995-0 from the Basic Research Program of the Korea Science & Engineering Foundation.
Figure 1. (3,3)-Lattice system: a) supervision system; b) connected-(2,1) system; c) connected-(2,2) system
Exact reliability formulas for connected-(r,s)-out-of-(m,n):F lattice systems are known only for special cases, given by Boehme, Kossow and Preuss [1]. Zuo [7] suggested that the SDP method could be applied to general cases, but the SDP method is very complex, and Yamamoto and Miyakawa [4] suggested the Yamamoto-Miyakawa (YM) algorithm, which requires O(s^(m-r) m² r n) computing time, i.e., polynomial in n but exponential in (m−r). Malinowski and Preuss [2] gave lower and upper bounds for the reliability of connected-(r,s)-out-of-(m,n):F lattice systems with s-independent components. Yun, Kim and Jeong [5] and Yun, Kim and Yamamoto [6] introduced the economic design problem for two-dimensional consecutive systems; they suggested a procedure to find the optimal configuration and maintenance policy of a system consisting of i.i.d. components. In this paper, we consider a maintenance problem for the two-dimensional system with failure dependence between components. We propose a procedure to predict the expected cost per unit time of this system and to design the optimal system structure and maintenance policy (periodic replacement) minimizing the predicted cost, using Monte Carlo simulation with a genetic algorithm.
2 Failure dependency
We consider failure dependency in two-dimensional consecutive systems. This means that when an arbitrary component (i,j) fails, the failure rates of some working components change. In this paper, we assume that the closer a working component is to the failed component, the more its failure rate is affected. Therefore, when component (i,j) fails at an arbitrary time t1, the failure rates of the four adjacent components become larger than those of the other components (see Figure 2). The distances between one component and the others form a series of the kind d, √2·d, 2d, √5·d, 2√2·d, 3d, ..., but it has no general form. Table 1 shows the values of the distances and the number of components at each distance. Now, we denote by d_m the maximum distance within which a failed component affects working components in the linear connected-(r,s)-out-of-(m,n):F system. If we assume that all adjacent components are the same distance d apart, then, whenever the distance of a working component from the failed component is smaller than d_m, the failure rate of that working component is changed (r ≤ s).

Figure 2. The distances between two components in a two-dimensional consecutive system
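The paper's exact rate-adjustment formula is not recoverable from the text, so the sketch below (ours) only illustrates the kind of proximity rule described: working components within distance d_m of a failed component receive an increased failure rate. The multiplicative factor is a placeholder assumption, not the authors' rule.

```python
import math

def component_distance(c1, c2, d=1.0):
    # Lattice distance between components (i1, j1) and (i2, j2):
    # adjacent components are d apart, diagonal ones sqrt(2)*d, etc.
    return d * math.hypot(c1[0] - c2[0], c1[1] - c2[1])

def raise_nearby_rates(rates, failed, d_m, factor=1.5, d=1.0):
    """Increase the failure rate of working components near `failed`.

    rates: dict mapping (i, j) -> current failure rate
    factor: hypothetical multiplier standing in for the paper's formula
    """
    for pos in rates:
        if pos != failed and component_distance(pos, failed, d) <= d_m:
            rates[pos] *= factor
    return rates
```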
3 Optimal maintenance design
Assumptions
1. Each component and the system either operate or fail.
2. Replacement time is negligible.

Notation
T_R: periodic replacement time
(m, n), (r, s): system with m·n components arranged in m rows and n columns; the failure of a connected (r,s)-submatrix of components (r adjacent rows, s adjacent columns) causes system failure
R(T_R), R(T_R, r, s): reliability of the system with periodic replacement
L(T_R), L(T_R, r, s): expected life of the system with periodic replacement
N(T_R), N(T_R, r, s): expected number of failed components in the system with periodic replacement at the end of a cycle
C_0: fixed cost at system failure
C_1: fixed cost for replacing components
C_2: fixed cost for planned replacement
C(T_R), C(T_R, r, s): expected cost per unit time
We consider a periodic replacement policy: if the system fails before the periodic replacement time T_R, the failed components are replaced with new ones. If the system does not fail until the replacement time T_R, a planned replacement is executed, in which the components that have failed in the system are replaced with new ones.
3.1 Cost model
The total cost per unit time as a function of T_R is C(T_R). If the system fails before the periodic replacement time T_R, the cost of system down C_0 is incurred; if the system does not fail before T_R, the cost of planned replacement C_2 is incurred (C_0 is larger than C_2). Therefore, the expected total cost during one cycle is the sum of the expected cost of replacing failed components with new ones, the expected cost of system down and the expected cost of planned replacement. The expected cost per unit time is

C(T_R) = (the expected cost in the interval) / (the expected length of the interval).

The expected cost in the interval equals C_1·N(T_R) + C_0·(1 − R(T_R)) + C_2·R(T_R), and the expected length of the interval equals L(T_R). Therefore, the expected cost per unit time is given by

C(T_R) = [C_1·N(T_R) + C_0·(1 − R(T_R)) + C_2·R(T_R)] / L(T_R).   (3)

If T_R, r, s are decision variables, the expected cost rate is given by

C(T_R, r, s) = [C_1·N(T_R, r, s) + C_0·(1 − R(T_R, r, s)) + C_2·R(T_R, r, s)] / L(T_R, r, s).   (4)
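Eqs. (3) and (4) translate directly into a one-line cost evaluator; a minimal sketch (ours):

```python
def expected_cost_rate(n_failed, reliability, life, c0, c1, c2):
    """Expected cost per unit time, Eq. (3)/(4).

    n_failed: N(T_R[, r, s]), expected failed components per cycle
    reliability: R(T_R[, r, s]), probability of surviving to T_R
    life: L(T_R[, r, s]), expected cycle length
    c0, c1, c2: costs of system failure, component replacement and
    planned replacement, respectively
    """
    return (c1 * n_failed + c0 * (1.0 - reliability)
            + c2 * reliability) / life
```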
3.2 Optimal design by simulation and GA
3.2.1 Monte Carlo simulation
Because of the complexity of the system under periodic maintenance, N(T_R, r, s), R(T_R, r, s) and L(T_R, r, s) are not easy to obtain analytically, so we use Monte Carlo simulation to obtain the expected cost per unit time. Figure 3 shows the simulation procedure for obtaining the expected cost per unit time, given the periodic replacement time T_R, the simulation replication number SRN, the cost parameters C_0, C_1 and C_2, and the failure distribution function F(·) of a component. The optimal T_R and system parameters minimizing the expected cost per unit time are then determined by the Genetic Algorithm.
Figure 3. Simulation procedure (outline: initialize the component states s(i,j) and lifetimes t(i,j) = F⁻¹(Random(0,1)); repeatedly pick the next failing component; if the cumulative clock CL reaches T_R, a periodic replacement occurs, and if a connected r×s submatrix of failed components appears, the system fails; after SRN replications, estimate L(T_R, r, s) = TCL/SRN, N(T_R, r, s) = TFN/SRN and R(T_R, r, s) = NSW/SRN)
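A minimal sketch of the Monte Carlo procedure of Figure 3, assuming i.i.d. exponential component lifetimes for brevity (the dependence rule of Section 2 would rescale the remaining lifetimes of neighbouring components after each failure); function and variable names are illustrative:

```python
import random

def lattice_failed(failed, m, n, r, s):
    # True if some connected r-by-s submatrix of failed components exists
    return any(all(failed[i + di][j + dj]
                   for di in range(r) for dj in range(s))
               for i in range(m - r + 1) for j in range(n - s + 1))

def simulate_L_N_R(m, n, r, s, rate, t_r, srn, seed=1):
    """Estimate L(T_R, r, s), N(T_R, r, s), R(T_R, r, s) by simulation."""
    rng = random.Random(seed)
    tot_len = tot_fail = tot_surv = 0.0
    for _ in range(srn):
        lifetimes = sorted((rng.expovariate(rate), (i, j))
                           for i in range(m) for j in range(n))
        failed = [[False] * n for _ in range(m)]
        clock, n_failed, survived = t_r, 0, 1.0
        for life, (i, j) in lifetimes:
            if life >= t_r:                   # planned replacement first
                break
            failed[i][j] = True
            n_failed += 1
            if lattice_failed(failed, m, n, r, s):
                clock, survived = life, 0.0   # system failure ends cycle
                break
        tot_len += clock
        tot_fail += n_failed
        tot_surv += survived
    return tot_len / srn, tot_fail / srn, tot_surv / srn
```

These estimates feed Eq. (3)/(4) through a cost evaluator such as expected_cost_rate above.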
3.2.2 Genetic algorithm
We propose an optimization scheme for minimizing the expected cost per unit time, using a Genetic algorithm to find near-optimal solutions. A chromosome encodes the real values of the periodic replacement time T_R and the system parameters as a 9-digit string in which each digit is an integer from zero to nine. The fitness value is the expected cost per unit time, obtained by Monte Carlo simulation. We use one-point crossover and one-point mutation for evolution. Figure 4 shows a simple outline of our Genetic algorithm.
Figure 4. Simple procedure of the Genetic algorithm (outline: initialize the population P(i); pick chromosomes and a position d at random for one-point crossover and one-point mutation; sort the enlarged population P''(i) by fitness and select q chromosomes; on the terminal condition, decode the best chromosome into T_R*, r*, s* and print C(T_R*, r*, s*), its fitness)
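A compact sketch of the Genetic algorithm of Figure 4 under the stated settings (9-digit chromosomes, one-point crossover, one-point mutation); the decoding of a chromosome into T_R, r, s is left abstract and the control flow is a simplification, not the authors' exact procedure:

```python
import random

def genetic_search(fitness, n_pop=50, n_gen=100, length=9,
                   p_cross=0.3, p_mut=0.05, seed=1):
    """fitness maps a digit list (decoded elsewhere into T_R, r, s)
    to the simulated expected cost per unit time; lower is better."""
    rng = random.Random(seed)
    pop = [[rng.randrange(10) for _ in range(length)]
           for _ in range(n_pop)]
    for _ in range(n_gen):
        children = []
        if rng.random() < p_cross:            # one-point crossover
            a, b = rng.sample(pop, 2)
            d = rng.randrange(1, length)
            children.append(a[:d] + b[d:])
        if rng.random() < p_mut:              # one-point mutation
            c = list(rng.choice(pop))
            c[rng.randrange(length)] = rng.randrange(10)
            children.append(c)
        pop = sorted(pop + children, key=fitness)[:n_pop]   # selection
    return pop[0]
```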
4 Numerical example
4.1 The case when r, s are fixed
We consider the linear connected-(3,3)-out-of-(5,5):F lattice system. Each component has an exponential failure distribution with parameter λ = 0.02. The replication number of the simulation is 100. For the Genetic algorithm we use 50 chromosomes in a generation (q = 50) and terminate the procedure after 100 generations, with the crossover rate and the mutation rate set to 0.3 and 0.05, respectively. We execute the Genetic algorithm with the cost parameter sets C_0 = {10, 50, 100} and C_2 = {0.1, 0.5, 1}, with C_1 fixed at 1. Figure 5 shows the relationship between C_0, C_2 and T_R: as C_0/C_2 increases, the optimal T_R decreases.
Figure 5. The optimal periodic replacement time according to C_0, C_2 in the connected-(3,3)-out-of-(5,5):F system
4.2 The case when r, s, T_R are determined simultaneously
In another experiment, we determine r, s and T_R simultaneously. Table 1 shows the results of the experiment: when C_0/C_2 increases, the optimal T_R decreases and the optimal (r,s) also decreases. Table 1 also shows that if C_0/C_2 is less than 100, preventive maintenance does not help to save cost.
Table 1. Periodic replacement time and expected cost per unit time for the connected-(r,s)-out-of-(5,5):F system
5 Conclusion
In this paper, a linear connected-(r,s)-out-of-(m,n):F system with failure dependence between components is considered. We propose a procedure to obtain the expected cost per unit
time by Monte Carlo simulation, and we determine the optimal maintenance interval and system parameters minimizing the expected cost per unit time using a Genetic algorithm that includes the simulation-based prediction of the objective function. The proposed procedure can be applied to other types of consecutive systems, and different dependencies between components can also be considered. The simulation programs are useful for calculating the periodic replacement time, system parameters and system reliability of two-dimensional engineering consecutive systems.
References
1. T.K. Boehme, A. Kossow and W. Preuss, A generalization of consecutive-k-out-of-n:F systems, IEEE Transactions on Reliability 41, 451-457 (1992).
2. J. Malinowski and W. Preuss, Lower & upper bounds for the reliability of connected-(r,s)-out-of-(m,n):F lattice systems, IEEE Transactions on Reliability 45, 156-160 (1996).
3. A.A. Salvia and W.C. Lasher, 2-Dimensional consecutive-k-out-of-n:F models, IEEE Transactions on Reliability 39, 382-385 (1990).
4. H. Yamamoto and M. Miyakawa, Reliability of a linear connected-(r,s)-out-of-(m,n):F lattice system, IEEE Transactions on Reliability 44, 333-336 (1995).
5. W.Y. Yun, G.R. Kim and C.H. Jeong, A maintenance design of connected-(r,s)-out-of-(m,n):F system using GA, Proceedings of the Quality Management and Organizational Development, 493-500 (2002).
6. W.Y. Yun, G.R. Kim and H. Yamamoto, Economic design of linear connected-(r,s)-out-of-(m,n):F lattice system, International Journal of Industrial Engineering 10, 591-599 (2002).
7. M.J. Zuo, Reliability & design of 2-dimensional consecutive-k-out-of-n:F systems, IEEE Transactions on Reliability 42, 488-490 (1993).
ESTIMATING PARAMETERS OF FAILURE MODEL FOR REPAIRABLE SYSTEMS WITH DIFFERENT MAINTENANCE EFFECTS*
WON YOUNG YUN, KYUNG KEUN LEE
Pusan National University, San 30 Changjeon-Dong Kumjeong-Ku, Busan, 609-735, KOREA
SEUNG HYUN CHO
Rotem Company, Yongin, Gyunggi-Do, KOREA
KYUNG H. NAM
Division of Economics, Kyonggi University, Suwon, KOREA

The article considers an estimation problem for repairable units with different maintenance effects. Two proportional age reduction models are utilized for imperfect maintenance: one with effective corrective maintenance (CM) and no preventive maintenance (PM), and the other with effective PM and minimal CM. The parameters of a Weibull distribution and the maintenance effects are estimated by the method of maximum likelihood. A genetic algorithm is used to find the set of values that maximizes the likelihood function, and simulation is used to illustrate the accuracy of the proposed method.
1 Introduction
An important problem in reliability is the treatment of repeated failures of a repairable system, that is, how to model the maintenance effects. Pham and Wang [16] classified maintenance actions according to the degree to which the operating condition of an item is restored, in the following ways:
- Perfect maintenance: a maintenance action which restores the system operating condition to as good as new. The complete overhaul of an engine with a broken connecting rod is an example of perfect maintenance.
- Minimal maintenance: a maintenance action which restores the system to the intensity function it had when it failed. Changing a flat tire on a car or changing a broken fan belt on an engine are examples of minimal maintenance.
- Imperfect maintenance: a maintenance action which does not make a system as good as new, but younger. Usually, it is assumed that imperfect maintenance restores the system operating state to somewhere between as good as new and as bad as old. An engine tune-up is an example of imperfect maintenance, because an engine
* This work was supported by the Brain Busan 21 Project in 2003.
tune-up may not make an engine as good as new, but its performance might be greatly improved.
- Worse maintenance: a maintenance action which makes the system intensity function or actual age increase, although the system does not break down.
- Worst maintenance: a maintenance action which undeliberately makes the system fail or break down.
Conventional statistical analysis of such failure times adopts one of two extreme assumptions: the state of the system after maintenance (repair) is either as good as new (GAN, the perfect maintenance model) or as bad as old (BAO, the minimal maintenance model). In many practical instances, however, maintenance (repair) activity may not result in such extreme situations [8]. There are some general models for imperfect (effective) maintenance (Brown and Proschan [5], Block, Borgers and Savits [3], Brown, Mahoney and Sivazlian (BMS) [4], Chan and Shaw [7], Kijima [11], Malik [14]). The two models studied in this paper can be expressed through the virtual age concept suggested by Kijima [11].
V+(x_i) = V+(x_{i-1}) + (1 - p)(x_i - x_{i-1})   ; Malik's model (Model 1)
V+(x_i) = (1 - p)[V+(x_{i-1}) + x_i - x_{i-1}]   ; BMS's model (Model 2)
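Both update rules translate directly into code; a small sketch (ours) of the virtual age right after each PM epoch:

```python
def virtual_ages(pm_times, p, model=2):
    """Virtual age right after each PM under the two models.

    Model 1 (Malik): V(x_k) = V(x_{k-1}) + (1 - p)(x_k - x_{k-1})
    Model 2 (BMS):   V(x_k) = (1 - p)(V(x_{k-1}) + x_k - x_{k-1})
    pm_times: increasing PM epochs; p: degree of maintenance effect.
    """
    v, prev, ages = 0.0, 0.0, []
    for x in pm_times:
        if model == 1:
            v = v + (1.0 - p) * (x - prev)
        else:
            v = (1.0 - p) * (v + x - prev)
        ages.append(v)
        prev = x
    return ages
```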
See Pham and Wang [16] for a more detailed review of maintenance models. Higgins and Tsokos [9] studied an estimation problem of the failure rate under the minimal repair model using a quasi-Bayes method. Tsokos and Rao [19] considered an estimation problem for the failure intensity under the power-law process. Coetzee [8] proposed a method for parameter estimation and cost models of the non-homogeneous Poisson process under minimal repair. Park and Pickering [15] studied the problem of estimating the parameters of the failure process from the failure data of multiple systems. Whitaker and Samaniego [20] estimated the lifetime distribution under the Brown-Proschan imperfect repair model; it is assumed that the data pairs (T_i, Z_i) are given, where T_i is a failure time and Z_i is a Bernoulli variable that records the mode of repair (perfect or imperfect). There are also studies of estimation problems with imperfect maintenance (Lim [12], Lim and Lie [13], Shin, Lim and Lie [18], Jack [10], Pulcini [17], Baxter, Kijima and Tortorella [2]). Calabria and Pulcini [6] dealt with some properties of the stochastic point process for the analysis of repairable units, and Baker [1] focused on fitting models to failure data. In the existing studies, an expected maintenance effect and the lifetime parameters have been estimated for a repairable system; however, when several identical units are repaired by different maintenance systems, the degrees of maintenance effect might differ. We estimate these maintenance effects and a lifetime distribution for an estimation problem in which several identical units are used in different maintenance situations. Likelihood functions are constructed, and a search algorithm is suggested to find the set of values maximizing the likelihood function.
Notation
m: the number of groups, which equals the number of repair systems
p_i: the degree of maintenance effect for the ith maintenance machine (0 ≤ p_i ≤ 1)
S_i: the number of units repaired by the ith maintenance machine
n_ij: the number of periods for the jth unit in group i (= the number of PMs + 1)
r_ijk: the number of failures in the kth period for the jth unit in group i
t_ijkl: the lth failure time in the kth period for the jth unit in group i
τ_ijk: the termination time of the kth period (= the kth preventive maintenance time) for the jth unit in group i
V−(t): the virtual age just before maintenance
V+(t): the virtual age right after maintenance

2 Model and Assumptions
We consider Model 2 (Model 1: Malik's model; Model 2: the BMS model) with scheduled PMs, and CM at failure is assumed to be minimal. The maintenance action that changes the state of a unit, PM in case B, is defined as the effective maintenance. It is assumed in the existing articles that the number of maintenance systems is just one or that the maintenance systems have equal degrees of improvement; here, however, we assume that each repairable unit is repaired (maintained) by a maintenance machine whose degree of improvement differs from the others, see Figure 1. In addition, the following is assumed: the times to perform maintenance actions are ignored, and each unit is repaired by a single maintenance machine.
Figure 1. Failure and maintenance process under a multi-maintenance system (case B)
3 Parameter Estimation
We consider a parametric distribution on (0, ∞) governing the lifetime of a new system. The most commonly used lifetime distribution in reliability studies is the 2-parameter Weibull distribution, whose pdf and survival function are given by

f(t) = (α/β)·(t/β)^(α−1)·exp[−(t/β)^α],   (1)
R(t) = exp[−(t/β)^α],   (2)

where α is the shape parameter and β is the scale parameter.
The power-law process model, in which the first failure time follows the Weibull distribution, was found to fit slightly better overall than the loglinear and linear processes in the study of Baker [1].

3.1 Likelihood function
The available data in this study consist of the failure times (corrective maintenance times) and the termination time of each period, which equals the preventive maintenance time. Since CM is minimal in this case, the intensity function is not changed after a CM; expressing the virtual ages through the failure and maintenance times, the likelihood function becomes

L(α, β, p_1, ..., p_m) = Π_{i=1..m} Π_{j=1..S_i} Π_{k=1..n_ij} { [Π_{l=1..r_ijk} f(v_ijkl)/R(v_ijkl)] · R(v−_ijk)/R(v+_ij,k−1) },   (5)

where v_ijkl = V+(τ_ij,k−1) + (t_ijkl − τ_ij,k−1) is the virtual age at the lth failure of the kth period, v−_ijk = V+(τ_ij,k−1) + (τ_ijk − τ_ij,k−1) is the virtual age just before the kth PM, and v+_ijk = (1 − p_i)·v−_ijk under Model 2, with τ_ij0 = 0 and V+(τ_ij0) = 0.
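A hedged sketch (ours) of evaluating the log of Eq. (5) for one candidate (α, β, p), as would be done inside the search of Section 4; the data layout is an assumption made only for illustration:

```python
import math

def log_hazard(t, alpha, beta):
    # log of the Weibull intensity (alpha/beta) * (t/beta)**(alpha - 1)
    return math.log(alpha / beta) + (alpha - 1.0) * math.log(t / beta)

def cum_hazard(t, alpha, beta):
    # cumulative hazard, since R(t) = exp[-(t/beta)**alpha]
    return (t / beta) ** alpha

def log_likelihood(units, alpha, beta):
    """units: list of (p, periods) pairs, one per unit, where p is the
    unit's maintenance effect and periods is a list of
    (failure_times, tau) tuples, one per PM period."""
    ll = 0.0
    for p, periods in units:
        v_plus, start = 0.0, 0.0
        for failure_times, tau in periods:
            for t in failure_times:            # minimal CM at failures
                ll += log_hazard(v_plus + (t - start), alpha, beta)
            v_minus = v_plus + (tau - start)    # virtual age just before PM
            ll -= (cum_hazard(v_minus, alpha, beta)
                   - cum_hazard(v_plus, alpha, beta))
            v_plus = (1.0 - p) * v_minus        # BMS age reduction at PM
            start = tau
    return ll
```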
4 A Numerical Example
Simulation experiments are carried out to investigate the accuracy of the estimation. The number of parameters in the log-likelihood function is m + 2. When many units are operated, that is, when m is large, it is difficult to find the set of values that maximizes Eq. (5) through classical search techniques such as the quasi-Newton method, so we have used the Genetic Algorithm, a meta-heuristic technique. 100 simulations are carried out for each experiment. In the first experiment, the effect of the number of units repaired by an identical machine is investigated. The input parameter values are set to p1 = 0.3, p2 = 0.5, p3 = 0.7, α = 2, β = 1.5. For s = 1, 3, 5, 10, 30, 100, the experimental results are shown in Table 1.
Table 1. The effect of the number of units repaired by an identical machine

        Estimate of p1        Estimate of p2        Estimate of p3        Estimate of α         Estimate of β
  s     Mean   SD    MSE      Mean   SD    MSE      Mean   SD    MSE      Mean   SD    MSE      Mean   SD    MSE
  1     0.33   0.21  0.047    0.43   0.22  0.054    0.56   0.21  0.065    3.03   1.14  2.382    1.69   0.30  0.126
  3     0.25   0.17  0.032    0.48   0.19  0.036    0.64   0.18  0.037    2.19   0.39  0.188    1.59   0.23  0.062
  5     0.27   0.15  0.024    0.48   0.16  0.026    0.69   0.14  0.019    1.98   0.22  0.048    1.51   0.18  0.031
 10     0.31   0.14  0.019    0.48   0.13  0.017    0.69   0.11  0.013    1.99   0.20  0.040    1.50   0.16  0.026
 30     0.28   0.10  0.01     0.49   0.10  0.01     0.70   0.10  0.013    1.95   0.14  0.023    1.51   0.16  0.021
100     0.26   0.07  0.006    0.48   0.08  0.007    0.69   0.08  0.006    1.95   0.12  0.019    1.51   0.13  0.017

(SD: standard deviation; MSE: mean squared error)
Figure 2 shows the trend of the mean squared error in this experiment; the average of the MSEs for p1, p2 and p3 is given in Figure 2. As might be expected, as the number of units repaired by an identical machine increases, each estimate becomes more accurate. The second experiment investigates the influence of p when two different maintenance machines are operated. The results given in Table 2 show that the difference between the two degrees of maintenance effect does not greatly affect the estimation accuracy, but the precision of the estimates improves when p is high. In the third experiment, the effect of the number of groups is investigated. The values of p are set from 0.3 to 0.8; for example, when m = 2, 0.3 and 0.4 are selected, and when m = 5, 0.3, ..., 0.7 are selected for the p's. As the number of groups increases, the number of parameters to be estimated also increases. The experimental results are shown in Table 3 and Figure 3. We calculate the MSD, the average over the m groups of the relative deviations of the estimates of p_i from their true values, to evaluate the accuracy of the estimates. Table 3 and Figure 3 show that the number of groups does not greatly affect the estimation accuracy.
Table 2. Experimental results when two different maintenance machines are operated
                       Estimate of p1        Estimate of p2        Estimate of α         Estimate of β
  p1    p2    s    Mean   SD    CV       Mean   SD    CV       Mean   SD    CV       Mean   SD    CV
  0.2   0.3   5    0.25   0.16  0.64     0.35   0.18  0.51     2.08   0.39  0.19     1.45   0.39  0.27
  0.2   0.3   10   0.20   0.13  0.65     0.31   0.13  0.42     1.98   0.27  0.14     1.47   0.27  0.18
  0.2   0.3   30   0.20   0.09  0.45     0.31   0.10  0.32     1.94   0.19  0.10     1.45   0.19  0.13
  0.2   0.8   5    0.16   0.14  0.88     0.73   0.15  0.20     2.11   0.39  0.18     1.58   0.39  0.25
  0.2   0.8   10   0.16   0.10  0.63     0.76   0.10  0.13     2.06   0.25  0.12     1.57   0.25  0.16
  0.2   0.8   30   0.18   0.10  0.56     0.79   0.10  0.13     1.96   0.18  0.09     1.50   0.12  0.08
  0.7   0.8   5    0.64   0.14  0.22     0.73   0.13  0.18     2.13   0.32  0.15     1.58   0.32  0.20
  0.7   0.8   10   0.65   0.14  0.21     0.73   0.13  0.18     2.03   0.28  0.14     1.57   0.28  0.18
  0.7   0.8   30   0.66   0.09  0.14     0.76   0.10  0.13     2.00   0.17  0.09     1.55   0.17  0.11

(SD: standard deviation; CV: coefficient of variation)
Figure 2. Effect of number of an identical type of units
Figure 3. Effect of number of groups
Table 3. Experimental results for different numbers of groups

  m     MSD     Estimate of α (Mean, SD)    Estimate of β (Mean, SD)
  2     0.29    1.90, 0.13                  1.45, 0.13
  3     0.19    1.89, 0.16                  1.46, 0.16
  4     0.32    1.87, 0.19                  1.46, 0.22
  5     0.25    1.93, 0.16                  1.51, 0.18
  6     0.28    1.86, 0.20                  1.50, 0.18

5 Conclusion
In this study, a feasible procedure to evaluate the degree of improvement is investigated when the failure process is completely unknown. It is assumed in the existing literature that the number of maintenance systems is just one or that the maintenance systems have equal degrees of improvement; here, however, we assume that each repairable unit is repaired by a maintenance system whose degree of improvement differs from the others. The BMS model is used for imperfect maintenance, and the Weibull distribution is considered for the failure process; the effective-PM with minimal-CM case is studied. The likelihood function is constructed, and the genetic algorithm is used to find the estimates. From the simulation experiments, it is found that the results are similar to those of the single-maintenance-machine case. The accuracy of the estimation increases as the number of identical units increases and when p is high; on the other hand, it is not greatly affected by the number of groups or by the difference between the values of the p's.
References
1. R.D. Baker, Data-based modeling of the failure rate of repairable equipment, Lifetime Data Analysis 7, 65-83 (2001).
2. L.A. Baxter, M. Kijima and M. Tortorella, A point process model for the reliability of maintained systems subject to general repair, Commun. Statist. - Stochastic Models 12, 37-65 (1996).
3. H.W. Block, W.S. Borgers and T.H. Savits, Age-dependent minimal repair, Journal of Applied Probability 22, 370-385 (1985).
4. J.F. Brown, J.F. Mahoney and B.D. Sivazlian, Hysteresis repair in discounted replacement problems, IIE Transactions 15, 156-165 (1983).
5. M. Brown and F. Proschan, Imperfect repair, Journal of Applied Probability 20, 851-859 (1983).
6. R. Calabria and G. Pulcini, Discontinuous point process for the analysis of repairable units, International Journal of Reliability, Quality and Safety Engineering 6, 361-382 (1999).
7. J.K. Chan and L. Shaw, Modeling repairable systems with failure rates that depend on age & maintenance, IEEE Transactions on Reliability 42(4), 566-571 (1993).
8. J.L. Coetzee, The role of NHPP models in the practical analysis of maintenance failure data, Reliability Engineering & System Safety 56, 161-168 (1997).
9. J.J. Higgins and C.P. Tsokos, A quasi-Bayes estimate of the failure intensity of a reliability-growth model, IEEE Transactions on Reliability R-30, 471-475 (1981).
10. N. Jack, Analyzing event data from a repairable machine subject to imperfect preventive maintenance, Quality and Reliability Engineering International 13, 183-186 (1997).
11. M. Kijima, Some results for repairable systems with general repair, Journal of Applied Probability 26, 89-102 (1989).
12. T.J. Lim, Estimating system reliability with fully masked data under Brown-Proschan imperfect repair model, Reliability Engineering & System Safety 59, 277-289 (1998).
13. T.J. Lim and C.H. Lie, Analysis of system reliability with dependent repair models, IEEE Transactions on Reliability 49, 153-162 (2000).
14. M.A.K. Malik, Reliable preventive maintenance scheduling, AIIE Transactions 11, 221-228 (1979).
15. W.J. Park and E.H. Pickering, Statistical analysis of a power-law model for repair data, IEEE Transactions on Reliability 46, 27-30 (1997).
16. H. Pham and H. Wang, Imperfect maintenance, European Journal of Operational Research 94, 425-438 (1996).
17. G. Pulcini, On the overhaul effect for repairable mechanical units: a Bayes approach, Reliability Engineering & System Safety 70, 85-94 (2000).
18. I. Shin, T.J. Lim and C.H. Lie, Estimating parameters of intensity function and maintenance effect for repairable unit, Reliability Engineering & System Safety 54, 1-10 (1996).
19. C.P. Tsokos and A.N.V. Rao, Estimation of failure intensity for the Weibull process, Reliability Engineering & System Safety 45, 271-275 (1994).
20. L.R. Whitaker and F.J. Samaniego, Estimating the reliability of systems subject to imperfect repair, Journal of the American Statistical Association 84, 301-309 (1989).
RELIABILITY AND MODELING OF SYSTEMS INTEGRATED WITH FIRMWARE AND HARDWARE
TIELING ZHANG, MIN XIE, LOON CHING TANG, SZU HUI NG
Department of Industrial & Systems Engineering, National University of Singapore, Kent Ridge Crescent, Singapore 119260

Firmware is embedded software in hardware devices, and it is important for the function of many critical systems. Most studies on software reliability deal with systems during development, and it is important to study the integrated system during operation. Complex systems usually have a bathtub-shaped failure rate over the lifecycle of the product. This paper discusses the parametric analysis of the distributions generated from the Pareto and Weibull distributions as well as their possible application to modeling firmware system failure. In addition, the Safety Integrity Levels (SIL) stipulated in IEC 61508 are taken into account in the modeling, since safety-critical systems are in general firmware-dominated.
1 Introduction
Firmware, which is embedded software in hardware devices, is usually required by many other operations. Furthermore, firmware plays an important role in system function, for example in the hard drives of large computers, spacecraft and high-performance aircraft control systems, advanced weapon systems, and safety-critical control systems used to monitor industrial processes in chemical and nuclear plants. For these systems, the firmware parts have to work with a very low failure occurrence rate. Firmware reliability modeling has to be based on the features of firmware failure. Although there are different modeling approaches associated with models in software reliability analysis, almost all of these models apply to the development phase of software, not to the embedded software-hardware system during operation. Details of software reliability modeling can be found in, for example, Xie [1], Lyu [2] and Pham [3]. These models are suited to modeling the reliability growth process. For many complex systems, the operational failure rate exhibits a bathtub shape, and it is essential to develop simple models for this type of failure distribution. Bathtub-shaped failure rate distributions have received considerable attention in research and engineering applications, and many different parametric families of these distributions were constructed in the past two decades. Lai et al. [4] summarize them into two categories: (a) lifetime distributions that have explicit expressions for failure rates, and (b) distributions whose failure rate functions are unwieldy or unknown. More recently, Weibull-extension models with three parameters have attracted the interest of researchers, one reason being that they have only one parameter more than the usual two-parameter Weibull and have a bathtub-shaped failure rate function. These models include the exponentiated Weibull given by Mudholkar and Srivastava [5, 6], the modified Weibull [7], the Weibull extension [8] and the extended Weibull [9].
Though the modified Weibull and the Weibull extension are said to have bathtub-shaped failure rates, they may not give a good bathtub shape. The failure rate function of the extended Weibull may have an interval where it exhibits a bathtub shape, but it can be said to have a bathtub shape only to some extent. There are fewer models whose failure rate curves show a very good similarity to the actual bathtub shape; these include the model proposed by Haupt and Schabe [10], the distributions generated from the Pareto and Weibull distributions [11], sectional Weibull models, etc. This paper addresses the models generated from the Pareto and Weibull distributions. A parametric analysis of the properties of the two models and their adaptation to modeling firmware failure are presented. The paper is organized as follows. Section 2 describes the features of failure of a firmware-hardware integrated system. Section 3 is concerned with the parametric analysis of the two statistical models. Section 4 presents the method of modeling firmware failure in association with the SIL (Safety Integrity Levels), and Section 5 concludes the paper.
2 Features of Firmware Failure
Firmware is embedded software in a hardware device that allows reading and executing, but does not allow modification, e.g., writing or deleting data, by an end user. An example of firmware is a computer program in a read-only memory (ROM) integrated circuit chip. Another example is a program embedded in an erasable programmable read-only memory (EPROM) chip, which may be modified by special external hardware, but not by an application program [12]. Firmware has to be reliable because it is trusted by the operating system and is directly written into main memory. A bug in firmware can corrupt the operating system and crash the entire system [13]. The cost arising from loss of data because of firmware failure is much higher than the cost of replacing or updating the firmware. On the other hand, with the fast development of modern technologies and automation, more and more novel products containing firmware in embedded computing devices have come into our lives. According to US Consumer Product Safety data reports [14], there are millions of accidents caused by consumer products each year, and an increasing number will be related to the firmware contained in embedded computing devices. These accidents are unfortunately not totally avoidable; however, a proactive reliability and safety analysis and failure reporting methodology will reduce the likelihood and severity of such accidents over the long run. This outcome benefits society as a whole [15]. In short, the enhancement of the reliability and safety of firmware-related devices is of much concern. Firmware reliability should be higher than that of the common application software in a system, since firmware usually cannot be repaired or replaced after installation. After rigorous testing, released firmware satisfies the reliability requirement well. After installation at the customers' sites, through testing and perhaps minor adjustments of the system, the firmware enters its initial period of execution. During this period of time, some faults
may be exposed; these faults mainly arise from the application environment. With the removal of these faults, the firmware enters a quite stable stage. That is, during the initial period of execution, the failure rate decreases to a low level and approaches a constant; this duration is commonly named burn-in. Then, the firmware or firmware-dominated system runs as a stable process with a low and approximately constant failure rate. After a long time, the system failure rate increases markedly because of factors such as the aging of the system, new requirements for the system, increased data transmission and enhanced speed. At this time, the system approaches either the end of its lifetime cycle or an update of the firmware (the cycle of software maintenance). If it is approaching the end of the lifetime cycle, the failure rate will obviously increase; this is generally called wear-out. See Figure 1 for the bathtub-shaped failure rate.

Figure 1. Bathtub-shaped failure rate
In order to describe the bathtub-shaped failure rate function, the following notations will be used.
T_b: length of burn-in
t_b: burn-in time, the time when burn-in ends and the system begins normal operation
T_u: useful lifetime
t_w: wear-out time, the time when the system enters the wear-out duration
t*: change point of the failure rate curve, i.e., the failure rate is decreasing in t for t < t* and increasing in t for t > t*
r_b: burn-in rate, the failure rate at t = t_b
r_w: the failure rate at t = t_w

3 Distribution Models Generated from Pareto and Weibull Distribution
This section is concerned with the two distribution models generated from the Pareto and Weibull distributions, based on Schabe's theorem for constructing bathtub-shaped failure rate functions [11]. The theorem is described as follows. Let F(t) be a twice differentiable distribution function with decreasing failure rate r(t), supported on [0, ∞), and let s be a truncation point with 0 < s < ∞. If

G(t) = F(t)/F(s) for 0 ≤ t ≤ s, and G(t) = 1 otherwise,   (1)

then G(t) has a bathtub-shaped failure rate if

f'(t)·{[1 − F(t)] − [1 − F(s)]} + (f(t))²   (2)

has one and only one zero on [0, s] and changes its sign from "−" to "+". Schabe [11] proved that G(t) has a bathtub-shaped failure rate function if F(t) is the usual two-parameter Weibull distribution with shape parameter β < 1, or the Pareto distribution.
3.1 Distribution Generated from the Two-parameter Weibull
This distribution is generated from the usual 2-parameter Weibull distribution as

G(t) = {1 − exp[−(t/η)^β]} / {1 − exp[−(s/η)^β]} for t ≤ s, and G(t) = 1 otherwise,   (3)

where η, s > 0 and 0 < β < 1. When s → ∞, it becomes the usual 2-parameter Weibull distribution, i.e., G(t) = F(t) = 1 − exp[−(t/η)^β]; hence it is one of the Weibull-extension distributions. The failure rate function is

h(t) = (β/η^β)·t^(β−1) / {1 − exp[(t/η)^β − (s/η)^β]},   0 ≤ t ≤ s.   (4)
h(t) has a bathtub shape when β < 1, and h(t) → ∞ as t → s. Typical plots of h(t) are shown in Figure 2. The figure shows that the shape of each curve depends on the shape parameter β. When β is close to 1 (e.g., with η/s ≤ 1), the bathtub-shaped curve has a quite long flat part in the useful lifetime; that is, as β approaches 1, the failure rate function attains a quite good bathtub shape.
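Since no closed form for t* exists, a small sketch (ours) of the numerical search behind Table 1; h_weibull_ext implements Eq. (4) directly:

```python
import math

def h_weibull_ext(t, beta, eta, s):
    # Failure rate of Eq. (4); bathtub-shaped for beta < 1, 0 < t < s
    return (beta / eta ** beta) * t ** (beta - 1.0) / (
        1.0 - math.exp((t / eta) ** beta - (s / eta) ** beta))

def minimise_h(beta, eta, s, grid=10000):
    # Crude grid search for the change point t* and the minimum h(t*)
    best_t, best_h = None, float("inf")
    for i in range(1, grid):
        t = s * i / grid
        h = h_weibull_ext(t, beta, eta, s)
        if h < best_h:
            best_t, best_h = t, h
    return best_t, best_h
```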
Figure 2. Typical h(t) curves with s = 10³, η = 10², 10³, 10⁴, 10⁵ and β = 0.0015, 0.1, 0.25, 0.5, 0.8, 0.98
As an analytical solution for t* does not exist, numerical solutions for t* and the minimum value of h(t) are needed; see Table 1 for some results for h(t*). It is verified through numerical analysis that 1/s < h(t*) < 1/η. For example, if s = 10³, the failure rate cannot be less than 10⁻³.
Table 1. The minimum values of h(t) for different values of β, s and η

   η          β = 0.98     0.8          0.5          0.25         0.1          0.0015
 s = 10⁴:
   10²        8.957E-3     3.316E-3     7.209E-4     3.293E-4     2.784E-4     2.718E-4
   10³        9.563E-4     6.222E-4     3.634E-4     2.893E-4     2.743E-4     2.718E-4
   10⁴        1.695E-4     2.142E-4     2.508E-4     2.665E-4     2.710E-4     2.718E-4
   10⁵        1.140E-4     1.592E-4     2.159E-4     2.645E-4     2.683E-4     2.718E-4
   10⁶        1.089E-4     1.510E-4     2.050E-4     2.464E-4     2.662E-4     2.718E-4
 s = 10⁵:
   10²        8.538E-3     2.026E-3     1.833E-4     3.994E-5     2.836E-5     2.718E-5
   10³        8.957E-4     3.316E-4     7.209E-5     2.893E-5     2.784E-5     2.718E-5
   10⁴        9.563E-5     6.222E-5     3.634E-5     2.892E-5     2.743E-5     2.718E-5
   10⁵        1.695E-5     2.142E-5     2.508E-5     2.666E-5     2.710E-5     2.718E-5
   10⁶        1.140E-5     1.592E-5     2.159E-5     2.537E-5     2.683E-5     2.718E-5
3.2 Distribution Generated from the Pareto Distribution
This distribution is generated from the Pareto distribution. Suppose F(t) is a Pareto distribution,

F(t) = 1 − 1/(1 + t/T)^a,   t > 0 and T, a > 0.   (5)

Clearly, it has a decreasing failure rate. For s > 0, let G(t) = F(t)/F(s) for t ≤ s. Furthermore, for simplicity, setting the shape parameter a in Eq. (5) to 1, G(t) becomes

G(t) = t(T + s)/[s(T + t)] for 0 ≤ t ≤ s, and G(t) = 1 otherwise,

or, writing γ = T/s,

G(t) = (1 + γ)t/(γs + t) for 0 ≤ t ≤ s, and G(t) = 1 otherwise.

The failure rate function is

h(t) = 1/[s(γ + t/s)] + 1/[s(1 − t/s)],   0 ≤ t ≤ s,   (6)

where s is a scale parameter and γ is a shape parameter; h(0) = (1 + γ)/(sγ) and h(t) → ∞ as t → s. Schabe [11] proved that G(t) has a bathtub-shaped failure rate provided γ = T/s < 1. Taking the first derivative of h(t) with respect to t and setting it equal to 0, we have

t* = (1 − γ)s/2   (7)

and

h(t*) = 4/[s(1 + γ)].   (8)

As 0 < γ < 1, the relation 2/s < h(t*) < 4/s holds. Typical plots of h(t) are shown in Figure 3. When γ approaches zero, t* = s/2 and the failure rate curve is almost symmetric. As γ increases, t* decreases and, at the same time, the burn-in time under the required burn-in rate decreases.

Figure 3. Typical curves of h(t) with s = 10⁵ and γ = 0.015, 0.10, 0.25, 0.50, 0.95
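Unlike the Weibull-extension model, Eqs. (7)-(8) give the change point in closed form; a two-line check (ours):

```python
def pareto_model_change_point(s, gamma):
    # Eqs. (7)-(8): t* = (1 - gamma)s/2, h(t*) = 4/[s(1 + gamma)],
    # so 2/s < h(t*) < 4/s for 0 < gamma < 1
    return (1.0 - gamma) * s / 2.0, 4.0 / (s * (1.0 + gamma))

# Example: s = 1e5, gamma = 0.015 puts t* near s/2 and h(t*) near 4/s
print(pareto_model_change_point(1e5, 0.015))
```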
4 Modeling Firmware Failure
As firmware and firmware-dominated devices play an important role in safety-critical control systems, firmware and related products should be designed to rigorous safety standards such as ANSI/ISA-SP-84.01 [16] and IEC 61508 [17]. In each of them, the system performance indices that correspond to the different safety levels are clearly stipulated. These indices are indispensable in carrying out firmware reliability analysis and modeling.

4.1 Safety Integrity Levels (SIL)
The Safety Integrity Levels stipulated in IEC 61508 are an important reference for safety-critical system design. There are 4 levels in total, as shown below.

Continuous mode of operation
  Safety Integrity Level (SIL)    Frequency of dangerous failures per hour
  4                               ≥ 10⁻⁹ to < 10⁻⁸
  3                               ≥ 10⁻⁸ to < 10⁻⁷
  2                               ≥ 10⁻⁷ to < 10⁻⁶
  1                               ≥ 10⁻⁶ to < 10⁻⁵
In IEC 61508-6, typical safety-critical systems are treated as composed of channels, where each channel includes two categories of failures: detected and undetected. The detected failures can be found by diagnostic devices, and online repair is started whenever a failure is detected. The undetected faults, however, cannot be found until the periodic check, also named the proof test. The undetected faults are dangerous precisely because they cannot be found immediately. The frequency of dangerous failures per hour is stipulated in IEC 61508 for each Safety Integrity Level, and this frequency must be kept very low. In general, however, a failure frequency of one channel on this order can be acceptable for commonly applied safety-critical systems. Commonly, firmware is dominant in such systems.

4.2 Modeling Firmware Failure
According to the features of firmware failure, the firmware failure rate has the bathtub-shaped property. Suppose, for example, that the failure rate of the firmware over its useful lifetime is required to stay below a given small level; the two distribution models given in Section 3 are then suitable for the modeling. By adjusting the values of the model parameters, the required burn-in rate at the given burn-in time and the wear-out rate at the required wear-out time can be satisfied. These models are applicable to modeling the detected failures in a channel of the typical structure discussed in Section 4.1. If the models are applied to modeling the undetected failures, for example when SIL 3 is required, we have the following analysis. Suppose that the burn-in time is 72 hours, the useful lifetime is 139,600 hours (15 years), and r_b = r_w = 10⁻⁷. Applying the two models of Section 3 to this case, model parameter values that satisfy the condition are shown in Table 2; note that the qualifying parameter values are not unique. The failure rate is so small that no bathtub shape of the failure rate can form within the designated time duration, for example 172,800 hours (20 years); in this case, the two models discussed in this paper may not be the best choice for modeling.

Table 2. Model parameter values satisfying the required condition

  Model                             Model parameters                          r_b         r_w
  Weibull-extension distribution    β = 0.997, s = 2.23E+7, η = 1.242E+7     9.987E-8    9.785E-8
  Model generated from Pareto       γ = 0.80, s = 2.25E+7                    1.000E-7    9.985E-8
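A small sketch (ours) of the kind of check behind Table 2: given candidate parameters, verify that the failure rate at the end of burn-in and at the wear-out time stays below the SIL-driven target; it reuses h_weibull_ext from the sketch in Section 3.1.

```python
def meets_requirement(h, t_b, t_w, r_req):
    # h is a failure rate function; the rate at burn-in end t_b and at
    # wear-out time t_w must not exceed the required level r_req
    return h(t_b) <= r_req and h(t_w) <= r_req

# Weibull-extension values of Table 2: t_b = 72 h, t_w about 1.396e5 h,
# target rate 1e-7 per hour (SIL 3)
ok = meets_requirement(
    lambda t: h_weibull_ext(t, beta=0.997, eta=1.242e7, s=2.23e7),
    t_b=72.0, t_w=1.396e5, r_req=1e-7)
```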
5 Concluding Remarks
In this paper, the features of firmware failure are analyzed. Firmware failure can be deemed to follow a bathtub-shaped failure rate life distribution, as the failure rate of firmware released after rigorous testing is quite low and the firmware should operate stably during its useful lifetime. Hence, bathtub-shaped failure rate life distribution models are applicable to the reliability modeling of firmware. A parametric analysis of the failure rate functions of the two distributions is presented. The failure rate functions of both models can exhibit a good bathtub shape, but the Weibull-extension model gives the best one: as the shape parameter β approaches 1, its failure rate function has a quite long flat part, so that the failure rate curve most resembles the real bathtub shape. The two models are well suited to modeling firmware failure when the failure rate is not too small and the useful lifetime is about 10 to 20 years in general. However, these models may not be the best choice when modeling dangerous failures with SIL 3 or SIL 4 in safety-critical control systems.

Acknowledgments
This research is supported by the National University of Singapore under research grant R-266-000-020-112, "Modelling and Analysis of Firmware Reliability".

References
1. M. Xie, Software Reliability Modeling, World Scientific, Singapore (1991).
2. M.R. Lyu, Handbook of Software Reliability Engineering, McGraw-Hill, New York (1996).
3. H. Pham, Software Reliability, Springer, Singapore (2000).
4. C.D. Lai, M. Xie and D.N.P. Murthy, Handbook of Statistics 20 - Advances in Reliability, edited by N. Balakrishnan and C.R. Rao, Elsevier, New York, 69 (2001).
5. G.S. Mudholkar and D.K. Srivastava, IEEE Trans. Reliab. 42, 299 (1993).
6. G.S. Mudholkar, D.K. Srivastava and M. Freimer, Technometrics 37, 436 (1995).
7. C.D. Lai, M. Xie and D.N.P. Murthy, IEEE Trans. Reliab. 52, 33 (2003).
8. M. Xie, Y. Tang and T.N. Goh, Reliab. Eng. Syst. Saf. 76, 279 (2002).
9. A.W. Marshall and I. Olkin, Biometrika 84, 641 (1997).
10. E. Haupt and H. Schabe, Microelectron. Reliab. 32, 633 (1992).
11. H. Schabe, Microelectron. Reliab. 34, 1501 (1994).
12. http://www.its.bldrdoc.gov/fs-1037/dir-015/_2236.htm (1996).
13. S. Kumar and K. Li, Workshop on Novel Uses of System Area Networks (SAN-1) (2002).
14. http://www.cpsc.gov/library/data.html
15. M. Hecht, Proc. Ann. Symp. Reliability and Maintainability, 153 (2003).
16. ANSI/ISA-SP-84.01, "Application of Safety Instrumented Systems for the Process Industries", Instrument Society of America Standards and Practices (1996).
17. IEC 61508, Parts 1-7, October 1998 - May 2000.
Author Index
Abe, S., 317 Agarwal, M., 1 Akiba, T., 9 Aleshin, V., 443 Ando, T., 17 Arai, M., 25 Azuma, Y., 499, 507 Bae, C.O., 237 Baek, J.S., 309 Bertolini, M., 33 Bevilacqua, M., 33 Cha, J.H., 41 Cha, M.S., 253 Chang, V., 49 Chatterjee, A., 57 Chatterjee, S., 57 Chaturvedi, S.K., 141 Chen, J.A., 73, 451 Chien, W.T.K., 49 Chien, Y.H., 65, 73, 451 Cho, S.H., 609 Chukova, S., 81 Chung, S.W., 89 Coit, D., 221 Cui, L., 97 Deepak, T.G., 293 Dohi, T., v, 17, 133, 197, 371, 379, 387, 403 Dong, Y., 101, 109
Fujiyoshi, T., 117 Fukuda, T., 371 Fukumoto, S., 25 Furukawa, N., 507 Ghodrati, B., 125 Giri, B.C., 133 Goto, S., 499, 507 Goyal, N.K., 141 Guo, H., 149 Guo, R., 157, 165 Gupta, R., 1 Han, S.C., 245 Hayakawa, Y., 81 Huang, C., 173 Huang, H.Z., 555 Imai, H., 427 Imaizumi, M., 181 Inoue, S., 189 Ishii, H., 269 Ishii, I., 427 Ito, K., 395 Iwamoto, K., 197 Iwasaki, K., 25 James, R.J.W., 261 Jang, J.S., 205 Jang, S.J., 205 Jeong, C.H., 601 Jin, L., 213
Jirakittayakorn, P., 221 Kaio, N., 133, 197, 363, 387 Karim, M.R., 229 Kawai, H., 285, 435 Kim, D.K., 467 Kim, G.R., 601 Kim, H.G., 237 Kim, J.W., 245 Kim, Y.J., 253 Kimura, M., 181, 531 Kimura, S., 261 Kobayashi, H., 379 Koide, T., 269, 411 Kolowrocki, K., 277 Koyanagi, J., 285 Krishnamoorthy, A., 293 Kumar, U., 125 Lee, C.H., 301 Lee, E.Y., 309 Lee, K.K., 609 Lee, S., 341 Lee, S.M., 301 Li, J., 97 Lim, G.L., 459 Lim, H.K., 205 Lim, J.H., 467, 475 Lim, K.E., 309 Lin, Y.B., 459 Liu, C., 317 Love, E., 165 Maruyama, T., 507 Mashita, T., 213 Mason, G., 33 Matsuda, R., 547 Methanavyn, D., 325 Mi, J., 41 Misra, R.B., 141, 333 Mondou, M., 507 Mukhopadhyay, I., 57 Na, M.H., 341 Nakagawa, T., 347, 363, 395 Nakagawa, Y., 261
Nam, K.H., 609 Narayanan, V.C., 293 Ng, S.H., 617 Ni, Z., 101, 109 Ohnishi, J., 261 Okada, S., 507 Okamura, H., 371, 379 Ozaki, T., 387 Park, B.H., 205 Park, D.H., 301, 475 Park, S.Y., 237 Qian, C.H., 395 Rastogi, R., 491 Rinsaka, K., 403 Sakhardande, M.J., 491 Sandoh, H., 411 Sarala, S., 419 Sasaki, Y., 427 Sasatte, P.V., 333 Sato, Y., 577 Satow, T., 435 Seleznev, V., 443 Seo, Y.S., 89 Sheu, S.H., 65, 73, 451, 459, 569, 585 Shin, S.W., 467, 475 Shinmori, S., 269 Siu, K.W.M., 483 Son, Y.N., 341 Srividya, A , , 491 Sumimoto, T., 499, 507 Suyama, K., 577 Suzuki, K., 213, 229, 515, 563 Tagami, K., 593 Tamura, N., 523 Tamura, Y., 531 Tang, L.C., 617 Tokuno, K., 117, 539 Tomitaka, K., 547 Tong, X., 555 Tsunoyama, M., 427
Valli, S., 419 Wang, C., 101, 109 Wang, L., 515 Wattanapongsakorn, N., 221, 325
Xie, M., 617 Xu, R.Z., 173
Yamada, S., 117, 189, 531, 539, 547 Yamamoto, H., 9, 601 Yamamoto, W., 563
Yanagi, S., 593 Yang, L., 569 Yasuda, H., 563 Yasui, K., 181 Yoshimura, I., 577 Yu, S.L., 585 Yuge, T., 593 Yun, W.Y., v, 89, 245, 601, 609
Zhang, L.P., 173 Zhang, T., 617 Zuo, M.J., 555